Semaphore throttle

Suppose we have an algorithm that uses N parallel threads and a system with M CPU cores, N > M. Running the algorithm on such a system leads to a performance loss because the threads contend for the available CPU cores and cause time-consuming thread context switches. A better approach is to execute the threads sequentially on the available CPU cores, avoiding the unnecessary context switching. A simple technique called a semaphore throttle limits the number of active, contending threads.
It is interesting to estimate how effective a semaphore throttle is on a real system, or, in other words, how much performance is lost to thread context switching.
As a test problem I have chosen the sum of an arithmetic progression: S = 1 + 2 + 3 + … + Count * 64. To find the sum I run 64 threads; thread I calculates the partial sum from 1 + I * Count to (I + 1) * Count, I = 0..63.
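For reference, with the CHUNK value of 30,000,000 used for Count in the test program below, the number of summands is n = 64 * Count = 1,920,000,000, and the expected total is S = n * (n + 1) / 2 = 1,843,200,000,960,000,000, which fits comfortably into the Int64 range (maximum 9,223,372,036,854,775,807).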
To obtain valid timings it is important that each thread runs long enough to be preempted many times by the other contending threads (each time the scheduler's time quantum expires). To increase the thread execution time I turned optimization off and chose Count as large as possible while keeping the resulting sum within the Int64 range, but that proved insufficient. Finally I inserted an additional loop into the thread function so that the thread execution time exceeds 1 second on my system:

unit SumThreads;

interface

uses
  Windows, Classes;

type
  TSumThread = class(TThread)
  private
    FSum: Int64;
    FBase: Integer;
    FCount: Integer;
    FSemaphore: THandle;
  protected
    procedure Execute; override;
  public
    constructor Create(ABase, ACount: Integer; ASemaphore: THandle);
    property Sum: Int64 read FSum;
  end;

implementation

{$O-}

constructor TSumThread.Create(ABase, ACount: Integer; ASemaphore: THandle);
begin
  FBase:= ABase;
  FCount:= ACount;
  FSemaphore:= ASemaphore;
  inherited Create(False);
end;

procedure TSumThread.Execute;
var
  Cnt, Value, J: Integer;
  S: Int64;

begin
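  // throttle entry: wait for a free semaphore slot (a zero handle means the throttle is disabled)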
  if FSemaphore <> 0 then
    WaitForSingleObject(FSemaphore, INFINITE);
  try
// to increase execution time the calculation is repeated
    J:= 20;
    repeat
      Value:= FBase;
      Cnt:= FCount;
      S:= 0;
      repeat
        S:= S + Value;
        Inc(Value);
        Dec(Cnt);
      until Cnt = 0;
      FSum:= S;
      Dec(J);
    until J = 0;
  finally
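    // throttle exit: release the slot so that a waiting thread can start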
    if FSemaphore <> 0 then
      ReleaseSemaphore(FSemaphore, 1, nil);
  end;
end;

end.

To run the test I have written a simple console application that receives the number of concurrent threads as a command-line parameter and outputs that number together with the total execution time in seconds:

program throttle;

{$APPTYPE CONSOLE}

uses
  Windows,
  SysUtils,
  Classes,
  Diagnostics,
  SumThreads in 'SumThreads.pas';

const
  CHUNK = 30000000;

var
  Semaphore: THandle;
  Count, I: Integer;
  Sum: Int64;
  Threads: array[0..63] of TSumThread;
  Handles: array[0..63] of THandle;
  StopWatch: TStopWatch;

begin
  try
    if ParamCount <> 1 then
      raise Exception.Create('Number of concurrent threads not defined');
    Count:= StrToInt(ParamStr(1));
    if (Count < 0) or (Count > 64) then
      raise Exception.Create('Invalid number of concurrent threads');
    if Count <> 0 then begin
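      // create the throttle: initial and maximum semaphore count = allowed number of concurrent threads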
      Semaphore:= CreateSemaphore(nil, Count, Count, nil);
      Win32Check(Bool(Semaphore));
    end
    else
      Semaphore:= 0;
    try
      StopWatch:= TStopWatch.StartNew;
      for I:= 0 to 63 do begin
        Threads[I]:= TSumThread.Create(1 + I * CHUNK, CHUNK, Semaphore);
        Handles[I]:= Threads[I].Handle;
      end;
      if WaitForMultipleObjects(64, @Handles, True, INFINITE) = WAIT_FAILED
        then raise Exception.Create('WaitForMultipleObjects Failed');
      StopWatch.Stop;
      Sum:= 0;
      for I:= 0 to 63 do begin
        Sum:= Sum + Threads[I].Sum;
        Threads[I].Free;
      end;
    finally
      CloseHandle(Semaphore);
    end;
    Writeln(Count:5, ' -- ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Number of concurrent threads: ', Count);
//    Writeln('Time elapsed (seconds): ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Sum obtained: ', Sum);
//    Sum:= CHUNK * 64; // number of summands
//    Sum:= (Sum * (Sum + 1)) div 2;
//    Writeln('Sum expected: ', Sum);
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

I ran the tests on my notebook with 64-bit Windows 7 SP1 and an Intel Core i3 M380 CPU, using a simple batch file throttle.bat:

@echo off
time /t
throttle 1
throttle 2
throttle 3
throttle 4
throttle 5
throttle 6
throttle 8
throttle 16
throttle 32
throttle 64
throttle 0
time /t

The resulting log is

13:17
    1 -- 109.17
    2 -- 55.44
    3 -- 60.14
    4 -- 67.56
    5 -- 67.85
    6 -- 67.42
    8 -- 67.39
   16 -- 67.52
   32 -- 67.56
   64 -- 67.59
    0 -- 67.55
13:30

It can be clearly seen that my CPU has 2 physical cores (the Core i3 M380 has 4 logical cores); the fastest execution time is achieved by allowing two threads to run concurrently. Running 3 concurrent threads is slower, and running 4 is slower still, but there is no further performance degradation beyond that.
Thinking about the results, one can guess that when N concurrent threads are executed on M CPU cores, performance should degrade as N grows from M to 2M because of the increasing number of thread context switches (I have not tested this since I have no suitable systems). For N > 2M the number of context switches should not grow any more: we already have at least 2 contending threads per core, so each thread is preempted at the end of every scheduler timeslice; that is the ‘worst case’ situation, and it is reached at N = 2M.
A little arithmetic shows that a properly tuned semaphore throttle gives roughly a 15-20% performance gain on my notebook: the unthrottled run takes 67.55 seconds versus 55.44 seconds with two concurrent threads, that is (67.55 - 55.44) / 67.55 ≈ 18%. Different systems can show different results.
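As a side note, here is a minimal sketch, not part of the test program above, of how the throttle could be sized automatically instead of taking the limit from the command line: it asks Windows for the number of logical processors via GetSystemInfo and creates the semaphore with that count (the MakeThrottle name is just for illustration; the Windows and SysUtils units are assumed, as in the test program):

function MakeThrottle: THandle;
var
  Info: TSystemInfo;
  Cores: Integer;
begin
  // ask Windows how many logical processors are available
  GetSystemInfo(Info);
  Cores:= Info.dwNumberOfProcessors;
  // initial and maximum count = number of logical processors
  Result:= CreateSemaphore(nil, Cores, Cores, nil);
  Win32Check(Bool(Result));
end;

Whether the physical or the logical core count is the better limit depends on the workload, as the timings above and the comments below suggest.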

——————-

You can download the tests (both source code and executable) here.

 


15 thoughts on “Semaphore throttle”

  1. More than context switching itself, in more complex situations the performance degradation from having too many threads comes from the CPU memory cache.

    Switching is quite efficient on modern CPUs, but memory caches haven’t grown anywhere near as much as main memory and are still measured in kilobytes, while each thread typically has its own stack and its own heap-based objects, which all add up.

      • TNX.

        So the result of your exercise is basically that you should start as many threads as you have cores. Of course, then you can remove the semaphore part as it is no longer needed 🙂

      • @gabr – Not so obvious. The idea of the semaphore throttle is to limit the number of concurrent threads without modifying the algorithm. Of course, if you can limit the number of threads in the algorithm itself, you do not need a semaphore throttle.

      • Yes, I get that. It’s just that I can’t think of an algorithm, suited for N threads, which couldn’t be reduced down for any value of N.

  2. I gave it a run on my system (Core i7 920) with four physical cores:

    16:52
    1 — 96.79
    2 — 48.43
    3 — 34.32
    4 — 25.88
    5 — 27.28
    6 — 28.41
    8 — 30.72
    16 — 30.92
    32 — 30.68
    64 — 30.53
    0 — 30.60
    16:59

    So the numbers seem to follow your prediction.

    Using four cores not only gives the best run time, but also keeps the system responsive.

  3. 2x Xeon E5430, each with 2 hyperthreaded cores, Windows 7.

    1 — 137.17
    2 — 68.62
    3 — 47.20
    4 — 34.31
    5 — 27.76
    6 — 23.53
    8 — 18.04
    16 — 18.02
    32 — 17.69
    64 — 17.35
    0 — 17.34

    As you can see, the processor is relatively slow (it is a 3 year old machine, if I recall correctly), but the thread switching is blindingly fast.

  4. i7-2630QM, 4 HT cores; Windows 7 64-bit (my previous testing on E5430 was also on 64-bit)

    1 — 112.15
    2 — 57.78
    3 — 40.45
    4 — 29.83
    5 — 29.67
    6 — 29.61
    8 — 29.58
    16 — 29.88
    32 — 29.90
    64 — 29.69
    0 — 29.79

    No speedup after 4 threads but also no slowdown.

    • So some systems show a noticeable (10-20%) slowdown when the number of threads exceeds the number of physical cores, and some don’t. I can’t imagine why.

  5. Interestingly, I came across threads (in the internet forum sense) on GameDev that advised running about 2x as many threads as you have cores. I did a rather calculus-heavy bachelor thesis, which proved to be a nice playground for that, albeit written in C#. My results were closer to the prediction from there, but I cannot back it up with numbers, since that stuff has long since gone down the drain in a series of HD failures.
    In general, my results showed a first minor peak at N=M, then worse performance, then the best performance at N=2M (roughly 10% better than at N=M), and a tiny decrease from there, staying almost constant up to values where it becomes obvious that this many threads will kill the execution time on their own, even without any actual workload.
    This I ran on a Core2Duo peaking at 4 threads, as well as on an i7 950 peaking at 8 threads. Yeah, HT kicking in at full benefit, and it was a huge load of single-precision float work (Catmull-Rom splines and their intersections and such, mostly on .NET).

    My conclusion from that was: if you are number crunching, N=M or N=2M is the way to go. And since the difference wasn’t that huge, N=M seems to be a fair shot indeed, as one would expect. (With M being the number of logical cores.)
