Understanding Threads: Atomic Operations and Memory Ordering


A common question about threads is:

I need a read/write access to a variable from two or more threads. Should I protect an access to a variable by critical section?

A simple answer is: yes, if you want to be on the safe side always protect the variable by a critical section. A deeper answer requires an understanding of the two concepts: atomic operations and memory ordering.

The “atomic” means that the result of the operation is always valid. Suppose we have two threads: thread A writes to a variable one of the two possible values (0 or 1); thread B reads the variable. If both writes and reads are atomic, then the result of the read operation in thread B can be only 0 or 1 – it is impossible to read a different value from the variable, even if write and read operations are performed at the same time.

That is what an official documentation on Intel processors says about guaranteed atomic operations:

Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

    Reading or writing a byte
    Reading or writing a word aligned on a 16-bit boundary
    Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

    Reading or writing a quadword aligned on a 64-bit boundary

Does the above mean that if an access to a variable is carried out atomically by processor there is no need to protect an access to a variable by critical section? A general answer is NO. Sometimes atomic operations are not enough and we also should take a memory ordering into account.

Consider the following situation:

thread A executes the following code:

  K:= 0;
  L:= 0;
// [..] - a lot of code skipped
  K:= 5;
  L:= 1;

thread B executes the following code:

if L = 1 then N:= N + K;

Read and write access to K and L variables is atomic. What is wrong with the above code?
The result of N:= N + K; is predictable only if the processor enforces strong memory ordering, i.e. thread B reads the changed K and L values in the same order as thread A writes them, so that if thread B sees that L has changed (if L = 1) it assumes that K has also changed (and should be read as 5).
Modern processors does not enforce strong memory ordering. You can read about memory ordering (and atomic operations) in official Intel manual (currently available here), but a good practice is not to rely on a particular kind of memory ordering in a particular processor. We should not assume that when thread B reads L as 1 it will next read K as 5 – it is also possible that it will read K as 0, so the result of

if L = 1 then N:= N + K;

is unpredictable. (Of course the result is always predictable if the above code is executed in thread A).

A possible solution in the above situation is to protect read and write access to L variable by a critical section – a critical section creates a kind of memory barrier and enforces local strong memory ordering. A critical section also makes an access to a variable atomic, so if you protect multithread-accessed variables by critical sections you are on the safe side.

Thread synchronization in Windows XP and Vista


An implementation of the thread locks have changed between Windows XP and Vista. To make a long story short, XP locks were fair; starting from Vista (and also Windows 2003 SP1) thread locks are unfair. A lock fairness means that the threads that tries to hold a lock (for example, to enter a critical section) forms a FIFO queue – the first thread that is entered the queue is the first thread to get hold the lock. The above statement does not mean though that Vista threads get hold of the lock in random order. I have tried to write a simple demo code to understand the difference in thread synchronization between XP and Vista and failed. The difference is more subtle and not easy to demonstrate.
To be exact, the locks were never 100% fair – reason is kernel mode APC (Asynchronous Procedure Calls) that can shuffle FIFO order both in XP and Vista. This consideration is closely related to well-known PulseEvent function issue (PulseEvent function is deprecated now). Still if there are no APC calls to interfere, the FIFO order is maintained.
With Windows XP we can think that the system never releases a lock if there is a thread waiting for it – the system just passes a lock from the thread that released it to the thread already waiting for it.
The situation is different with Vista.
Imagine the following scenario: a thread A entered a critical section; a thread B is entering the same critical section (waiting for the thread A to leave it); a thread C is executing and is about to enter the critical section. With Windows XP the thread C have no chance to sneak before the thread B and get hold of the lock first. Starting from Vista the thread C have a chance to do it.
The reason to change the lock’s internals is to make the system more multiprocessor (or multicore) friendly. Consider the threads A, B and C are executed on separate processors. If the thread C get hold of the lock just released by thread A (before the thread B), it avoids entering a wait state and also avoids the overhead related (context switching). The thread B is already in wait state, now it will have to wait the thread C to release the lock, but since the thread C has avoided the wait state completely the total system performance is increased. We can think that locks are used more effectively now – the situation when some thread holds a critical section but not executes a code protected by the critical section (performing a context switching) is less probable.