Semaphore throttle

Suppose we have an algorithm that uses N parallel threads, and a system with M CPU cores, N > M. Running the algorithm on such a system leads to a performance loss because the threads contend for the available CPU cores and cause time-consuming thread context switches. A better approach is to execute the threads sequentially on the available CPU cores, avoiding the unnecessary context switching. There is a simple technique called a semaphore throttle which limits the number of active, contending threads.
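The pattern itself is tiny: every worker acquires a counting semaphore before doing its work and releases it afterwards, so at most M workers run at once. Here is a minimal sketch of the same idea in Python rather than Delphi, purely for illustration (the worker body is a hypothetical stand-in for real work):

```python
import threading

MAX_ACTIVE = 2                       # M: allow at most this many active workers
throttle = threading.Semaphore(MAX_ACTIVE)

def worker(results, i):
    with throttle:                   # at most MAX_ACTIVE threads pass this point
        results[i] = sum(range(i * 10, (i + 1) * 10))  # stand-in for real work

results = [0] * 8
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))                  # same total whatever MAX_ACTIVE is
```

The throttle changes only how many threads make progress at a time, never the result; that is what makes it safe to bolt onto an existing N-thread algorithm.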
It is interesting to estimate how efficient a semaphore throttle is on a real system, or, in other words, how large the performance loss caused by thread context switching actually is.
As a test problem I have chosen the sum of an arithmetic progression: S = 1 + 2 + 3 + … + Count * 64. To find the sum I run 64 threads; thread I calculates the partial sum from 1 + I * Count to (I + 1) * Count, I = 0..63.
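The partitioning is easy to sanity-check: the 64 partial sums must recombine into the closed form N(N + 1)/2 with N = Count * 64. A quick check in Python with a small stand-in Count (the real test uses Count = 30000000 in Delphi):

```python
Count = 1000                         # small stand-in for the real 30000000
# thread I sums the range 1 + I*Count .. (I+1)*Count inclusive
partials = [sum(range(1 + i * Count, (i + 1) * Count + 1)) for i in range(64)]
N = Count * 64
print(sum(partials) == N * (N + 1) // 2)   # the partials cover 1..N exactly once
```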
To obtain valid timings it is important that each thread runs long enough to be preempted many times by the other contending threads (whenever the scheduler's time quantum ends). To increase the thread execution time I turned optimization off and chose the Count value as large as possible while still keeping the resulting sum within the Int64 range, but that proved insufficient. Finally I decided to insert an additional loop into the thread function so that the thread execution time exceeded 1 second on my system:

unit SumThreads;

interface

uses
  Windows, Classes;

type
  TSumThread = class(TThread)
  private
    FSum: Int64;
    FBase: Integer;
    FCount: Integer;
    FSemaphore: THandle;
  protected
    procedure Execute; override;
  public
    constructor Create(ABase, ACount: Integer; ASemaphore: THandle);
    property Sum: Int64 read FSum;
  end;

implementation

constructor TSumThread.Create(ABase, ACount: Integer; ASemaphore: THandle);
begin
  FBase:= ABase;
  FCount:= ACount;
  FSemaphore:= ASemaphore;
  inherited Create(False);
end;

procedure TSumThread.Execute;
var
  Cnt, Value, J: Integer;
  S: Int64;
begin
  if FSemaphore <> 0 then
    WaitForSingleObject(FSemaphore, INFINITE);
// to increase execution time the calculation is repeated
  J:= 20;
  repeat
    Value:= FBase;
    Cnt:= FCount;
    S:= 0;
    repeat
      S:= S + Value;
      Inc(Value);
      Dec(Cnt);
    until Cnt = 0;
    FSum:= S;
    Dec(J);
  until J = 0;
  if FSemaphore <> 0 then
    ReleaseSemaphore(FSemaphore, 1, nil);
end;

end.

To run the test I have written a simple console application that receives the number of concurrent threads as a command-line parameter (0 means no throttling) and outputs the number of concurrent threads and the total execution time in seconds:

program throttle;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Diagnostics,
  SumThreads in 'SumThreads.pas';

const
  CHUNK = 30000000;

var
  Semaphore: THandle;
  Count, I: Integer;
  Sum: Int64;
  Threads: array[0..63] of TSumThread;
  Handles: array[0..63] of THandle;
  StopWatch: TStopWatch;

begin
  try
    if ParamCount <> 1 then
      raise Exception.Create('Number of concurrent threads not defined');
    Count:= StrToInt(ParamStr(1));
    if (Count < 0) or (Count > 64) then
      raise Exception.Create('Invalid number of concurrent threads');
    if Count <> 0 then
      Semaphore:= CreateSemaphore(nil, Count, Count, nil)
    else
      Semaphore:= 0;
    StopWatch:= TStopWatch.StartNew;
    for I:= 0 to 63 do begin
      Threads[I]:= TSumThread.Create(1 + I * CHUNK, CHUNK, Semaphore);
      Handles[I]:= Threads[I].Handle;
    end;
    if WaitForMultipleObjects(64, @Handles, True, INFINITE) = WAIT_FAILED
      then raise Exception.Create('WaitForMultipleObjects Failed');
    Sum:= 0;
    for I:= 0 to 63 do begin
      Sum:= Sum + Threads[I].Sum;
    end;
    Writeln(Count:5, ' -- ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Number of concurrent threads: ', Count);
//    Writeln('Time elapsed (seconds): ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Sum obtained: ', Sum);
//    Sum:= CHUNK * 64; // number of summands
//    Sum:= (Sum * (Sum + 1)) div 2;
//    Writeln('Sum expected: ', Sum);
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

I ran the tests on my notebook with 64-bit Windows 7 SP1 and an Intel Core i3 M380 CPU, using a simple batch file, throttle.bat:

@echo off
time /t
throttle 1
throttle 2
throttle 3
throttle 4
throttle 5
throttle 6
throttle 8
throttle 16
throttle 32
throttle 64
throttle 0
time /t

The resulting log is:

    1 -- 109.17
    2 -- 55.44
    3 -- 60.14
    4 -- 67.56
    5 -- 67.85
    6 -- 67.42
    8 -- 67.39
   16 -- 67.52
   32 -- 67.56
   64 -- 67.59
    0 -- 67.55

It can be clearly seen that my CPU has 2 physical cores (the number of logical cores for a Core i3 M380 is 4): the fastest execution time is achieved by allowing two threads to run concurrently. Running 3 concurrent threads is slower, and running 4 is slower still. But there is no further performance degradation after that.
Thinking about these results, one can guess that when N concurrent threads run on M CPU cores, performance should degrade from N = M up to N = 2M because of the growing number of thread context switches (I have not tested this, since I have no appropriate systems). For N > 2M the number of context switches should not grow any more: with at least 2 concurrent threads per core, every thread is preempted at the end of every scheduler timeslice. That is the 'worst case' situation, and it is reached at N = 2M.
A little arithmetic on the log above shows that a properly tuned semaphore throttle can give about a 15-20% performance gain on my notebook. Different systems can show different results.
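The arithmetic, spelled out using the two relevant lines of the log (55.44 s for the best setting of 2 concurrent threads, 67.55 s unthrottled):

```python
unthrottled = 67.55                 # the '0' (no throttle) line of the log
best = 55.44                        # the best line: 2 concurrent threads
gain = (unthrottled - best) / unthrottled
print(f"{gain:.1%}")                # roughly 18%, inside the 15-20% estimate
```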


You can download tests (both source code and executable) here



15 thoughts on “Semaphore throttle”

  1. More than from context switching itself, in more complex situations the performance degradation from having too many threads comes from the CPU memory cache.

    Switching is quite efficient on modern CPUs, but memory caches haven’t grown anywhere near as much as main memory in size and are still measured in kilobytes, and each thread typically has its own stack and its own heap-based objects, which all add up.

      • TNX.

        So the result of your exercise is basically that you should start as many threads as you have cores. Of course, then you can remove the semaphore part, as it is no longer needed 🙂

      • @gabr – Not so obvious. The idea of the semaphore throttle is to limit the number of concurrent threads without modifying the algorithm. Of course, if you can limit the number of threads in the algorithm itself, you don’t need a semaphore throttle.

      • Yes, I get that. It’s just that I can’t think of an algorithm, suited for N threads, which couldn’t be reduced down for any value of N.

  2. I gave it a run on my system (Core i7 920) with four physical cores:

    1 — 96.79
    2 — 48.43
    3 — 34.32
    4 — 25.88
    5 — 27.28
    6 — 28.41
    8 — 30.72
    16 — 30.92
    32 — 30.68
    64 — 30.53
    0 — 30.60

    So the numbers seem to follow your prediction.

    Using four cores not only gives the best run time, but also keeps the system responsive.

  3. 2x Xeon E5430, each with 2 hyperthreaded cores, Windows 7.

    1 — 137.17
    2 — 68.62
    3 — 47.20
    4 — 34.31
    5 — 27.76
    6 — 23.53
    8 — 18.04
    16 — 18.02
    32 — 17.69
    64 — 17.35
    0 — 17.34

    As you can see, the processor is relatively slow (it is a 3-year-old machine, if I recall correctly), but the thread switching is blindingly fast.

  4. i7-2630QM, 4 HT cores; Windows 7 64-bit (my previous testing on E5430 was also on 64-bit)

    1 — 112.15
    2 — 57.78
    3 — 40.45
    4 — 29.83
    5 — 29.67
    6 — 29.61
    8 — 29.58
    16 — 29.88
    32 — 29.90
    64 — 29.69
    0 — 29.79

    No speedup after 4 threads but also no slowdown.

    • So some systems show a noticeable (10-20%) slowdown if the number of threads exceeds the number of physical cores, and some don’t. I can’t imagine why.

  5. Interestingly, I came across threads (in the internet-forum sense) on GameDev that advised running about 2x as many threads as you have cores. I did a rather calculus-heavy bachelor thesis, which proved to be a nice playground for that, albeit written in C#. My results were closer to the prediction from there, but I cannot back them up with numbers, since that stuff has long gone down the drain in a series of HD failures.
    In general, my results showed a first minor peak at N = M, then worse, with the best performance at N = 2M (roughly 10% better than N = M), and a tiny decrease from there, staying almost constant up to values where it becomes obvious that this many threads will kill execution time on their own, even if bare of workload.
    This I ran on a Core2Duo with a peak at 4 threads, as well as on an i7 950 peaking at 8 threads. Yeah, HT kicking in at full benefit, and it was a huge load of single-precision float work (Catmull-Rom splines and their intersections and such, mostly on .NET).

    My conclusion from that was: if you are number crunching, N = M or N = 2M is the way to go. And since the difference wasn’t that huge, N = M seems a fair shot indeed, as one would expect. (With M being the number of logical cores.)
