TThread Facts

4

Delphi RTL TThread class has been an object of criticism since its introduction. Some criticism was deserved, some not. TThread implementation was gradually improving with every Delphi version. Here I am neither criticizing nor advocating the TThread class, just listing some details of TThread implementation in modern Delphi versions that Delphi programmer should know.

  1. TThread class does not allow setting a thread stack size – it is always equal to default value, usually 1 Mb. If you need a different value you can call BeginThread function directly, but that is a low level solution and not as handy as using TThread class;
  2. TThread constructor has the only parameter (CreateSuspended: Boolean). Irrespective of the parameter value the underlying Windows thread object is always created in suspended state:

        FCreateSuspended := CreateSuspended {...};
        FHandle := BeginThread(nil, 0, @ThreadProc, Pointer(Self), CREATE_SUSPENDED, FThreadID);
    

    A thread can start execution in AfterConstruction method (which is called after all inherited constructors):

      if not FCreateSuspended {...} then Resume;
    

    or later;

  3. Synchronization with the main thread does not work in console application. That means that you should not use Synchronize and Queue methods of TThreads class in console application. The same also applies to OnTerminate event which calls Synchronize method internally; you can override protected DoTerminate method instead;
  4. If you set FreeOnTerminate = True for a TThread instance, the instance is destroyed in the context of underlying background thread. Usually the destruction context does not matter, but in some cases it does. For example, create a TTimer instance in the constructor of TThread descendant, destroy timer in the destructor and see what happens (hint: an invisible window of TTimer instance is created in the main thread and destroyed in a different thread);
  5. TThread does not guarantee that Execute method will be called. I have bumped into this problem a little before.

Despite all issues we are better to use TThread class ‘as is’ in GUI applications. For my experiments I have written a simple alternative thread wrapper class that allows setting a thread stack size, always calls Execute method and does not contain synchronization methods; it is well suited for console applications.

Semaphore throttle

15

Suppose we have an algorithm that uses N parallel threads, and we have a system with M CPU cores, N > M. Running the algorithm on the system leads to performance loss because the threads are contend for available CPU cores and cause time-consuming thread context switching. A better approach is to execute threads sequentially on the available CPU cores, to avoid the unnecessary thread context switching. There is a simple technique called semaphore throttle which limits the number of active, contending threads.
It is interesting to estimate how efficient semaphore throttle is on real system. Or, in other words, how much is the performance loss caused by thread context switching.
As a test problem I have chosen the sum of arithmetic progression: S = 1 + 2 + 3 +… + Count * 64 . To find the sum I run 64 threads, each thread calculates a partial sum from 1 + I * Count to (I + 1) * Count, I = [0..63].
To obtain valid timings it is important that each thread is executed sufficiently long, so that the thread is preempted many times by other contending threads (when the scheduler’s time quantum ends). To increase the thread execution time I turned optimization off and chosen the Count value as much as possible for the resulting sum to fit into int64 range, but it appeared insufficient. Finally I decided to insert an additional loop into the thread function so that the thread execution time exceeded 1 second on my system:

unit SumThreads;

interface

uses
  Windows, Classes;

type
  TSumThread = class(TThread)
  private
    FSum: Int64;
    FBase: Integer;
    FCount: Integer;
    FSemaphore: THandle;
  protected
    procedure Execute; override;
  public
    constructor Create(ABase, ACount: Integer; ASemaphore: THandle);
    property Sum: Int64 read FSum;
  end;

implementation

{$O-}

constructor TSumThread.Create(ABase, ACount: Integer; ASemaphore: THandle);
begin
  FBase:= ABase;
  FCount:= ACount;
  FSemaphore:= ASemaphore;
  inherited Create(False);
end;

procedure TSumThread.Execute;
var
  Cnt, Value, J: Integer;
  S: Int64;

begin
  if FSemaphore <> 0 then
    WaitForSingleObject(FSemaphore, INFINITE);
  try
// to increase execution time the calculation is repeated
    J:= 20;
    repeat
      Value:= FBase;
      Cnt:= FCount;
      S:= 0;
      repeat
        S:= S + Value;
        Inc(Value);
        Dec(Cnt);
      until Cnt = 0;
      FSum:= S;
      Dec(J);
    until J = 0;
  finally
    if FSemaphore <> 0 then
      ReleaseSemaphore(FSemaphore, 1, nil);
  end;
end;

end.

To run the test I have written a simple console application that receives the number of concurrent threads as command line parameter, and outputs the number of concurrent threads and total execution time in seconds:

program throttle;

{$APPTYPE CONSOLE}

uses
  Windows,
  SysUtils,
  Classes,
  Diagnostics,
  SumThreads in 'SumThreads.pas';

const
  CHUNK = 30000000;

var
  Semaphore: THandle;
  Count, I: Integer;
  Sum: Int64;
  Threads: array[0..63] of TSumThread;
  Handles: array[0..63] of THandle;
  StopWatch: TStopWatch;

begin
  try
    if ParamCount <> 1 then
      raise Exception.Create('Number of concurrent threads not defined');
    Count:= StrToInt(ParamStr(1));
    if (Count < 0) or (Count > 64) then
      raise Exception.Create('Invalid number of concurrent threads');
    if Count <> 0 then begin
      Semaphore:= CreateSemaphore(nil, Count, Count, nil);
      Win32Check(Bool(Semaphore));
    end
    else
      Semaphore:= 0;
    try
      StopWatch:= TStopWatch.StartNew;
      for I:= 0 to 63 do begin
        Threads[I]:= TSumThread.Create(1 + I * CHUNK, CHUNK, Semaphore);
        Handles[I]:= Threads[I].Handle;
      end;
      if WaitForMultipleObjects(64, @Handles, True, INFINITE) = WAIT_FAILED
        then raise Exception.Create('WaitForMultipleObjects Failed');
      StopWatch.Stop;
      Sum:= 0;
      for I:= 0 to 63 do begin
        Sum:= Sum + Threads[I].Sum;
        Threads[I].Free;
      end;
    finally
      CloseHandle(Semaphore);
    end;
    Writeln(Count:5, ' -- ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Number of concurrent threads: ', Count);
//    Writeln('Time elapsed (seconds): ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Sum obtained: ', Sum);
//    Sum:= CHUNK * 64; // number of summands
//    Sum:= (Sum * (Sum + 1)) div 2;
//    Writeln('Sum expected: ', Sum);
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

I ran the tests on my notebook with 64-bit Windows 7 SP1 and CPU Intel Core i3 M380 using a simple batch file throttle.bat:

@echo off
time /t
throttle 1
throttle 2
throttle 3
throttle 4
throttle 5
throttle 6
throttle 8
throttle 16
throttle 32
throttle 64
throttle 0
time /t

The resulting log is

13:17
    1 -- 109.17
    2 -- 55.44
    3 -- 60.14
    4 -- 67.56
    5 -- 67.85
    6 -- 67.42
    8 -- 67.39
   16 -- 67.52
   32 -- 67.56
   64 -- 67.59
    0 -- 67.55
13:30

It can be clearly seen that my CPU has 2 physical cores (the number of logical cores for Core i3 M380 is 4); the fastest execution time is achieved by allowing two threads to be executed concurrently. Running 3 concurrent threads is slower, running 4 concurrent threads is even more slow. But there is no further performance degradation.
If you think of the results you can guess that if N concurrent threads are executed on M CPU cores the performance should degrade from N = M to N = 2M because of the increasing number of thread context switching (I have not tested it since I have no appropriate systems). For N > 2M the number of thread context switching should not grow anymore – now we have at least 2 concurrent threads per each core, each thread is preempted when every scheduler’s timeslice ends, that is the ‘worst case’ situation and it is reached at N = 2M.
A little arithmetic shows that properly tuned semaphore throttle can give about 15-20% performance gain on my notebook. Different systems can show different results.

——————-

You can download tests (both source code and executable) here

 

Sleep sort and TThread corner case

10

If you have not heard it yet – an anonymous genius from 4chan invented a sleep sort, brilliant esoteric sorting algorithm. I have written sleep sort implementation based on Delphi TThread class for rosettacode project, and started to experiment with the code. One of the working variants is:

program SleepSortDemo2;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, SyncObjs;

type
  TSleepThread = class(TThread)
  private
    FValue: Integer;
    FLock: TCriticalSection;
  protected
    constructor Create(AValue: Integer; ALock: TCriticalSection);
    procedure Execute; override;
  end;

const
  ArrLen = 16;

var
  A: array[0..ArrLen - 1] of Integer;
  Threads: array[0..ArrLen - 1] of TThread;
  Lock: TCriticalSection;
  I: Integer;

constructor TSleepThread.Create(AValue: Integer; ALock: TCriticalSection);
begin
  FValue:= AValue;
  FLock:= ALock;
  inherited Create(False);
end;

procedure TSleepThread.Execute;
begin
  Sleep(1000 * FValue);
  FLock.Acquire;
  Write(FValue:3);
  FLock.Release;
end;

begin
  for I:= 0 to ArrLen - 1 do begin
    A[I]:= Random(15);
    Write(A[I]:3);
  end;
  Writeln;

  Lock:= TCriticalSection.Create;
  for I:= 0 to ArrLen - 1 do
    Threads[I]:= TSleepThread.Create(A[I], Lock);
  for I:= 0 to ArrLen - 1 do begin
    Threads[I].WaitFor;
    Threads[I].Free;
  end;
  Lock.Free;

  Writeln;
  Readln;
end.

Now, if you look at TThread source code you can see that TThread.WaitFor is called from TThread.Destroy, so it seems that there is no need for a separate  Threads[I].WaitFor call and the line #53 can be commented. Try it, and the code does not work anymore (at least it does not work on the system I used for testing – Windows 7 SP1, Celeron 530 CPU, 2 Gb RAM).

After pondering at the problem for some time I understood that TThread.Destroy just does not wait for the thread to terminate. But why? Look at the code snippet from TThread.Destroy:

  if (FThreadID <> 0) and not FFinished and not FExternalThread then
  begin
    Terminate;
    if FCreateSuspended then
      Resume;
    WaitFor;
  end;

All conditions are satisfied. You can set a breakpoint on WaitFor line and see that WaitFor method is actually called…

The answer was found in the ThreadProc function. Here is a code snippet from it:

  ..
  try
    if not Thread.Terminated then
    try
      Thread.Execute;
    except
      Thread.FFatalException := AcquireExceptionObject;
    end;
  finally
  ..

See what happens? TThread.Destroy calls Terminate and sets TThread.Terminated flag. But the thread has not started yet. Now the thread starts, ThreadProc function checks TThread.Terminated flag, and the Execute method is never called!

Synchronization in Delphi TThread class

4

VCL sources is a very interesting reading, if you have a time for it. Many Delphi programmers (like me) use VCL TThread class as a simple wrapper for WinAPI kernel thread object, avoiding to use Synchronize and Queue methods by using a custom message processing instead. Still it is worthwhile to understand the inner workings of VCL thread synchronization.

Disclaimer: I am currently using Delphi 2009, so some details of the next study may be different for the other Delphi (VCL) versions.

TThread class affords two ways for a background (“worker”) thread to execute a code in context of the main (GUI) thread. You can use either one of the Synchronize method overloads which force the worker thread into a wait state until the code is executed, or else you can use one of the Queue method overloads that just put the code into the main thread’s synchronization queue and allows a worker thread to continue. The definitions of the Synchronize and Queue methods in TThread class are:

type
  TThreadMethod = procedure of object;
  TThreadProcedure = reference to procedure;

procedure Synchronize(AMethod: TThreadMethod); overload;
procedure Synchronize(AThreadProc: TThreadProcedure); overload;
class procedure Synchronize(AThread: TThread; AMethod: TThreadMethod); overload;
class procedure Synchronize(AThread: TThread; AThreadProc: TThreadProcedure); overload;

procedure Queue(AMethod: TThreadMethod); overload;
procedure Queue(AThreadProc: TThreadProcedure); overload;
class procedure Queue(AThread: TThread; AMethod: TThreadMethod); overload;
class procedure Queue(AThread: TThread; AThreadProc: TThreadProcedure); overload;

We see that Synchronize and Queue methods can be used either with ordinary methods (TThreadMethod) or anonymous methods (TThreadProcedure).

Both Synchronize and Queue methods internally call a private Synchronize method overload, which is examined in detail next.

class procedure TThread.Synchronize(ASyncRec: PSynchronizeRecord; QueueEvent: Boolean = False); overload;

The first argument is a pointer to TSynchronizeRecord structure (see definition below) which in turn contains a pointer to a code to be executed; a code pointer can be either ordinary method pointer (TThreadMethod) or anonymous method’s reference (TThreadProcedure)

type
  PSynchronizeRecord = ^TSynchronizeRecord;
  TSynchronizeRecord = record
    FThread: TObject;
    FMethod: TThreadMethod;
    FProcedure: TThreadProcedure;
    FSynchronizeException: TObject;
  end;

  TSyncProc = record
    SyncRec: PSynchronizeRecord;
    Queued: Boolean;
    Signal: THandle;
  end;

The second argument (QueueEvent) is False for Synchronize overloads and True for Queue overloads.

Now back to the TThread.Synchronize code:

var
  SyncProc: TSyncProc;
  SyncProcPtr: PSyncProc;
begin
  if GetCurrentThreadID = MainThreadID then begin
    if Assigned(ASyncRec.FMethod) then
      ASyncRec.FMethod()
    else if Assigned(ASyncRec.FProcedure) then
      ASyncRec.FProcedure();
  end

The above is a kind of “fool protection”. If Synchronize (or Queue) is called from the main thread, the correspondent code is just executed. The rest of the method’s code is dealing with the worker’s thread calls:

  else begin
     if QueueEvent then
      New(SyncProcPtr)
    else
      SyncProcPtr := @SyncProc;

For Synchronize calls we use stack variable to store TSyncProc structure; we can’t use stack for asynchronous Queue calls, so we allocate a new TSyncProc variable on the heap.

    if not QueueEvent then
      SyncProcPtr.Signal := CreateEvent(nil, True, False, nil)
    else
      SyncProcPtr.Signal := 0;

For Synchronize calls we need a signaling event to wake up the worker thread after the main thread have finished executing the synchronized code; for the Queue calls we does not interfere into thread scheduling, so we need not a signaling event.

      try
        EnterCriticalSection(ThreadLock);
      try
        SyncProcPtr.Queued := QueueEvent;
        if SyncList = nil then
          SyncList := TList.Create;
        SyncProcPtr.SyncRec := ASyncRec;
        SyncList.Add(SyncProcPtr);
        SetEvent(SyncEvent);
        if Assigned(WakeMainThread) then
          WakeMainThread(SyncProcPtr.SyncRec.FThread);

We are preparing an information for the main thread about the code to execute. I will discuss how (and where) the main thread finds out that it has some worker code to execute a little later.

        if not QueueEvent then begin
          LeaveCriticalSection(ThreadLock);
          try
            WaitForSingleObject(SyncProcPtr.Signal, INFINITE);
          finally
            EnterCriticalSection(ThreadLock);
          end;
        end;

A very interesting code fragment (executed only for Synchronize calls). We release the ThreadLock critical section (to enable other threads to execute Synchronize or Queue calls), wait the main thread to execute the code and enter the ThreadLock again.
The above code fragment is exactly the job for a condition variable. Condition variable is a synchronization primitive first introduced in Windows Vista (in Windows part of the world; it probably always existed in Unix/Linux). Have VCL supported only Vista and above Windows versions, the above code fragment could be replaced like this:

        if not QueueEvent then begin
          SleepConditionVariableCS(CondVar, ThreadLock, INFINITE);
        end;

(well, not so simple; we should initialize the CondVar condition variable before we can use it. See msdn for details).

The use of the conditional variables makes the code more effective (thread enters a wait state and releases the specified critical section as an atomic operation) and more readable.

      finally
        LeaveCriticalSection(ThreadLock);
      end;
    finally
      if not QueueEvent then
        CloseHandle(SyncProcPtr.Signal);
    end;
    if not QueueEvent and Assigned(ASyncRec.FSynchronizeException) then
      raise ASyncRec.FSynchronizeException;
  end;
end;

That is all. Both Synchronize and Queue calls prepare a TSyncProc structure containing a pointer to the code to be executed by the main thread, insert the structure into the global SyncList list, and with Synchronize call a worker thread also enters a wait state until the main thread sets a signaling event in the TSyncProc structure.

Now about how the main thread finds out that the worker threads have prepared TSyncProc structures for him. The answer is WakeMainThread global variable defined in Classes.pas:

var
  WakeMainThread: TNotifyEvent = nil;

in the above Synchronize procedure we have

      if Assigned(WakeMainThread) then
          WakeMainThread(SyncProcPtr.SyncRec.FThread);

The WakeMainThread is assigned during the application initialization by the following code:

procedure TApplication.WakeMainThread(Sender: TObject);
begin
  PostMessage(Handle, WM_NULL, 0, 0);
end;

and WM_NULL message is processed in application window procedure as follows:

procedure TApplication.WndProc(var Message: TMessage);
[..]
    with Message do
      case Msg of
[..]

        WM_NULL:
          CheckSynchronize;
[..]

We see that every Synchronize or Queue method call posts WM_NULL message to the hidden application window; the WM_NULL message is processed by calling CheckSynchronize procedure. CheckSynchronize procedure scans the global SyncList list of the TSyncProc structures, executes the code pointed by them and, for Synchronize calls only, sets a signaling event in the TSyncProc structures after executing the code to wake up a worker thread.

What is the resume of the above analysis? Both Synchronize and Queue methods internally use window message processing (in the hidden application window), and in fact there is a little difference with custom message processing here. Due to design reasons a custom message processing is preferable when designing multithreaded components (with their own window procedures, created for example by Classes.AllocateHWnd function). Usually there is no practical reason to use custom message processing when designing multithreaded applications – Synchronize and Queue methods of the TThread class are just as good.

Understanding Threads: Atomic Operations and Memory Ordering

3

A common question about threads is:

I need a read/write access to a variable from two or more threads. Should I protect an access to a variable by critical section?

A simple answer is: yes, if you want to be on the safe side always protect the variable by a critical section. A deeper answer requires an understanding of the two concepts: atomic operations and memory ordering.

The “atomic” means that the result of the operation is always valid. Suppose we have two threads: thread A writes to a variable one of the two possible values (0 or 1); thread B reads the variable. If both writes and reads are atomic, then the result of the read operation in thread B can be only 0 or 1 – it is impossible to read a different value from the variable, even if write and read operations are performed at the same time.

That is what an official documentation on Intel processors says about guaranteed atomic operations:

Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

    Reading or writing a byte
    Reading or writing a word aligned on a 16-bit boundary
    Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

    Reading or writing a quadword aligned on a 64-bit boundary

Does the above mean that if an access to a variable is carried out atomically by processor there is no need to protect an access to a variable by critical section? A general answer is NO. Sometimes atomic operations are not enough and we also should take a memory ordering into account.

Consider the following situation:

thread A executes the following code:

  K:= 0;
  L:= 0;
// [..] - a lot of code skipped
  K:= 5;
  L:= 1;

thread B executes the following code:

if L = 1 then N:= N + K;

Read and write access to K and L variables is atomic. What is wrong with the above code?
The result of N:= N + K; is predictable only if the processor enforces strong memory ordering, i.e. thread B reads the changed K and L values in the same order as thread A writes them, so that if thread B sees that L has changed (if L = 1) it assumes that K has also changed (and should be read as 5).
Modern processors does not enforce strong memory ordering. You can read about memory ordering (and atomic operations) in official Intel manual (currently available here), but a good practice is not to rely on a particular kind of memory ordering in a particular processor. We should not assume that when thread B reads L as 1 it will next read K as 5 – it is also possible that it will read K as 0, so the result of

if L = 1 then N:= N + K;

is unpredictable. (Of course the result is always predictable if the above code is executed in thread A).

A possible solution in the above situation is to protect read and write access to L variable by a critical section – a critical section creates a kind of memory barrier and enforces local strong memory ordering. A critical section also makes an access to a variable atomic, so if you protect multithread-accessed variables by critical sections you are on the safe side.

Thread synchronization in Windows XP and Vista

0

An implementation of the thread locks have changed between Windows XP and Vista. To make a long story short, XP locks were fair; starting from Vista (and also Windows 2003 SP1) thread locks are unfair. A lock fairness means that the threads that tries to hold a lock (for example, to enter a critical section) forms a FIFO queue – the first thread that is entered the queue is the first thread to get hold the lock. The above statement does not mean though that Vista threads get hold of the lock in random order. I have tried to write a simple demo code to understand the difference in thread synchronization between XP and Vista and failed. The difference is more subtle and not easy to demonstrate.
To be exact, the locks were never 100% fair – reason is kernel mode APC (Asynchronous Procedure Calls) that can shuffle FIFO order both in XP and Vista. This consideration is closely related to well-known PulseEvent function issue (PulseEvent function is deprecated now). Still if there are no APC calls to interfere, the FIFO order is maintained.
With Windows XP we can think that the system never releases a lock if there is a thread waiting for it – the system just passes a lock from the thread that released it to the thread already waiting for it.
The situation is different with Vista.
Imagine the following scenario: a thread A entered a critical section; a thread B is entering the same critical section (waiting for the thread A to leave it); a thread C is executing and is about to enter the critical section. With Windows XP the thread C have no chance to sneak before the thread B and get hold of the lock first. Starting from Vista the thread C have a chance to do it.
The reason to change the lock’s internals is to make the system more multiprocessor (or multicore) friendly. Consider the threads A, B and C are executed on separate processors. If the thread C get hold of the lock just released by thread A (before the thread B), it avoids entering a wait state and also avoids the overhead related (context switching). The thread B is already in wait state, now it will have to wait the thread C to release the lock, but since the thread C has avoided the wait state completely the total system performance is increased. We can think that locks are used more effectively now – the situation when some thread holds a critical section but not executes a code protected by the critical section (performing a context switching) is less probable.