Semaphore throttle

15

Suppose we have an algorithm that uses N parallel threads, and we have a system with M CPU cores, N > M. Running the algorithm on the system leads to performance loss because the threads are contend for available CPU cores and cause time-consuming thread context switching. A better approach is to execute threads sequentially on the available CPU cores, to avoid the unnecessary thread context switching. There is a simple technique called semaphore throttle which limits the number of active, contending threads.
It is interesting to estimate how efficient semaphore throttle is on real system. Or, in other words, how much is the performance loss caused by thread context switching.
As a test problem I have chosen the sum of arithmetic progression: S = 1 + 2 + 3 +… + Count * 64 . To find the sum I run 64 threads, each thread calculates a partial sum from 1 + I * Count to (I + 1) * Count, I = [0..63].
To obtain valid timings it is important that each thread is executed sufficiently long, so that the thread is preempted many times by other contending threads (when the scheduler’s time quantum ends). To increase the thread execution time I turned optimization off and chosen the Count value as much as possible for the resulting sum to fit into int64 range, but it appeared insufficient. Finally I decided to insert an additional loop into the thread function so that the thread execution time exceeded 1 second on my system:

unit SumThreads;

interface

uses
  Windows, Classes;

type
  TSumThread = class(TThread)
  private
    FSum: Int64;
    FBase: Integer;
    FCount: Integer;
    FSemaphore: THandle;
  protected
    procedure Execute; override;
  public
    constructor Create(ABase, ACount: Integer; ASemaphore: THandle);
    property Sum: Int64 read FSum;
  end;

implementation

{$O-}

constructor TSumThread.Create(ABase, ACount: Integer; ASemaphore: THandle);
begin
  FBase:= ABase;
  FCount:= ACount;
  FSemaphore:= ASemaphore;
  inherited Create(False);
end;

procedure TSumThread.Execute;
var
  Cnt, Value, J: Integer;
  S: Int64;

begin
  if FSemaphore <> 0 then
    WaitForSingleObject(FSemaphore, INFINITE);
  try
// to increase execution time the calculation is repeated
    J:= 20;
    repeat
      Value:= FBase;
      Cnt:= FCount;
      S:= 0;
      repeat
        S:= S + Value;
        Inc(Value);
        Dec(Cnt);
      until Cnt = 0;
      FSum:= S;
      Dec(J);
    until J = 0;
  finally
    if FSemaphore <> 0 then
      ReleaseSemaphore(FSemaphore, 1, nil);
  end;
end;

end.

To run the test I have written a simple console application that receives the number of concurrent threads as command line parameter, and outputs the number of concurrent threads and total execution time in seconds:

program throttle;

{$APPTYPE CONSOLE}

uses
  Windows,
  SysUtils,
  Classes,
  Diagnostics,
  SumThreads in 'SumThreads.pas';

const
  CHUNK = 30000000;

var
  Semaphore: THandle;
  Count, I: Integer;
  Sum: Int64;
  Threads: array[0..63] of TSumThread;
  Handles: array[0..63] of THandle;
  StopWatch: TStopWatch;

begin
  try
    if ParamCount <> 1 then
      raise Exception.Create('Number of concurrent threads not defined');
    Count:= StrToInt(ParamStr(1));
    if (Count < 0) or (Count > 64) then
      raise Exception.Create('Invalid number of concurrent threads');
    if Count <> 0 then begin
      Semaphore:= CreateSemaphore(nil, Count, Count, nil);
      Win32Check(Bool(Semaphore));
    end
    else
      Semaphore:= 0;
    try
      StopWatch:= TStopWatch.StartNew;
      for I:= 0 to 63 do begin
        Threads[I]:= TSumThread.Create(1 + I * CHUNK, CHUNK, Semaphore);
        Handles[I]:= Threads[I].Handle;
      end;
      if WaitForMultipleObjects(64, @Handles, True, INFINITE) = WAIT_FAILED
        then raise Exception.Create('WaitForMultipleObjects Failed');
      StopWatch.Stop;
      Sum:= 0;
      for I:= 0 to 63 do begin
        Sum:= Sum + Threads[I].Sum;
        Threads[I].Free;
      end;
    finally
      CloseHandle(Semaphore);
    end;
    Writeln(Count:5, ' -- ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Number of concurrent threads: ', Count);
//    Writeln('Time elapsed (seconds): ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Sum obtained: ', Sum);
//    Sum:= CHUNK * 64; // number of summands
//    Sum:= (Sum * (Sum + 1)) div 2;
//    Writeln('Sum expected: ', Sum);
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

I ran the tests on my notebook with 64-bit Windows 7 SP1 and CPU Intel Core i3 M380 using a simple batch file throttle.bat:

@echo off
time /t
throttle 1
throttle 2
throttle 3
throttle 4
throttle 5
throttle 6
throttle 8
throttle 16
throttle 32
throttle 64
throttle 0
time /t

The resulting log is

13:17
    1 -- 109.17
    2 -- 55.44
    3 -- 60.14
    4 -- 67.56
    5 -- 67.85
    6 -- 67.42
    8 -- 67.39
   16 -- 67.52
   32 -- 67.56
   64 -- 67.59
    0 -- 67.55
13:30

It can be clearly seen that my CPU has 2 physical cores (the number of logical cores for Core i3 M380 is 4); the fastest execution time is achieved by allowing two threads to be executed concurrently. Running 3 concurrent threads is slower, running 4 concurrent threads is even more slow. But there is no further performance degradation.
If you think of the results you can guess that if N concurrent threads are executed on M CPU cores the performance should degrade from N = M to N = 2M because of the increasing number of thread context switching (I have not tested it since I have no appropriate systems). For N > 2M the number of thread context switching should not grow anymore – now we have at least 2 concurrent threads per each core, each thread is preempted when every scheduler’s timeslice ends, that is the ‘worst case’ situation and it is reached at N = 2M.
A little arithmetic shows that properly tuned semaphore throttle can give about 15-20% performance gain on my notebook. Different systems can show different results.

——————-

You can download tests (both source code and executable) here