That was a nasty one, Debug build would never have any issue (even tried with 64 threads!), but Release build would deadlock nearly immediately, even with only 2 threads! What happened here (I think) is that gcc optimizer would generate a specific path endlessly looping when initial value of virtual_lock was FLT_MAX, by-passing re-assignment from v_no[0] and the atomic cas completely. Which would have been correct, should v_no[0] not have been shared (and modified) by multiple threads. ;) Idea of that (broken) for loop was to avoid completely calling the atomic cas as long as v_no[0] was locked by some other thread, but... Guess the avoided/missing memory barrier was the root of the issue here. Lesson of the evening: Remember kids, do not trust your compiler to understand all possible threading-related side effects, and be explicit rather than elegant when using atomic ops! Side-effect lesson: do check both release and debug builds when messing with said atomic ops...