Suppose we have an algorithm that uses N parallel threads, and we have a system with M CPU cores, N > M. Running the algorithm on the system leads to performance loss because the threads are contend for available CPU cores and cause time-consuming thread context switching. A better approach is to execute threads sequentially on the available CPU cores, to avoid the unnecessary thread context switching. There is a simple technique called semaphore throttle which limits the number of active, contending threads.
It is interesting to estimate how efficient semaphore throttle is on real system. Or, in other words, how much is the performance loss caused by thread context switching.
As a test problem I have chosen the sum of arithmetic progression: S = 1 + 2 + 3 +… + Count * 64 . To find the sum I run 64 threads, each thread calculates a partial sum from 1 + I * Count to (I + 1) * Count, I = [0..63].
To obtain valid timings it is important that each thread is executed sufficiently long, so that the thread is preempted many times by other contending threads (when the scheduler’s time quantum ends). To increase the thread execution time I turned optimization off and chosen the Count value as much as possible for the resulting sum to fit into int64 range, but it appeared insufficient. Finally I decided to insert an additional loop into the thread function so that the thread execution time exceeded 1 second on my system:
unit SumThreads;
interface
uses
Windows, Classes;
type
TSumThread = class(TThread)
private
FSum: Int64;
FBase: Integer;
FCount: Integer;
FSemaphore: THandle;
protected
procedure Execute; override;
public
constructor Create(ABase, ACount: Integer; ASemaphore: THandle);
property Sum: Int64 read FSum;
end;
implementation
{$O-}
constructor TSumThread.Create(ABase, ACount: Integer; ASemaphore: THandle);
begin
FBase:= ABase;
FCount:= ACount;
FSemaphore:= ASemaphore;
inherited Create(False);
end;
procedure TSumThread.Execute;
var
Cnt, Value, J: Integer;
S: Int64;
begin
if FSemaphore <> 0 then
WaitForSingleObject(FSemaphore, INFINITE);
try
// to increase execution time the calculation is repeated
J:= 20;
repeat
Value:= FBase;
Cnt:= FCount;
S:= 0;
repeat
S:= S + Value;
Inc(Value);
Dec(Cnt);
until Cnt = 0;
FSum:= S;
Dec(J);
until J = 0;
finally
if FSemaphore <> 0 then
ReleaseSemaphore(FSemaphore, 1, nil);
end;
end;
end.
To run the test I have written a simple console application that receives the number of concurrent threads as command line parameter, and outputs the number of concurrent threads and total execution time in seconds:
program throttle;
{$APPTYPE CONSOLE}
uses
Windows,
SysUtils,
Classes,
Diagnostics,
SumThreads in 'SumThreads.pas';
const
CHUNK = 30000000;
var
Semaphore: THandle;
Count, I: Integer;
Sum: Int64;
Threads: array[0..63] of TSumThread;
Handles: array[0..63] of THandle;
StopWatch: TStopWatch;
begin
try
if ParamCount <> 1 then
raise Exception.Create('Number of concurrent threads not defined');
Count:= StrToInt(ParamStr(1));
if (Count < 0) or (Count > 64) then
raise Exception.Create('Invalid number of concurrent threads');
if Count <> 0 then begin
Semaphore:= CreateSemaphore(nil, Count, Count, nil);
Win32Check(Bool(Semaphore));
end
else
Semaphore:= 0;
try
StopWatch:= TStopWatch.StartNew;
for I:= 0 to 63 do begin
Threads[I]:= TSumThread.Create(1 + I * CHUNK, CHUNK, Semaphore);
Handles[I]:= Threads[I].Handle;
end;
if WaitForMultipleObjects(64, @Handles, True, INFINITE) = WAIT_FAILED
then raise Exception.Create('WaitForMultipleObjects Failed');
StopWatch.Stop;
Sum:= 0;
for I:= 0 to 63 do begin
Sum:= Sum + Threads[I].Sum;
Threads[I].Free;
end;
finally
CloseHandle(Semaphore);
end;
Writeln(Count:5, ' -- ', StopWatch.Elapsed.TotalSeconds:3:2);
// Writeln('Number of concurrent threads: ', Count);
// Writeln('Time elapsed (seconds): ', StopWatch.Elapsed.TotalSeconds:3:2);
// Writeln('Sum obtained: ', Sum);
// Sum:= CHUNK * 64; // number of summands
// Sum:= (Sum * (Sum + 1)) div 2;
// Writeln('Sum expected: ', Sum);
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
I ran the tests on my notebook with 64-bit Windows 7 SP1 and CPU Intel Core i3 M380 using a simple batch file throttle.bat:
@echo off time /t throttle 1 throttle 2 throttle 3 throttle 4 throttle 5 throttle 6 throttle 8 throttle 16 throttle 32 throttle 64 throttle 0 time /t
The resulting log is
13:17
1 -- 109.17
2 -- 55.44
3 -- 60.14
4 -- 67.56
5 -- 67.85
6 -- 67.42
8 -- 67.39
16 -- 67.52
32 -- 67.56
64 -- 67.59
0 -- 67.55
13:30
It can be clearly seen that my CPU has 2 physical cores (the number of logical cores for Core i3 M380 is 4); the fastest execution time is achieved by allowing two threads to be executed concurrently. Running 3 concurrent threads is slower, running 4 concurrent threads is even more slow. But there is no further performance degradation.
If you think of the results you can guess that if N concurrent threads are executed on M CPU cores the performance should degrade from N = M to N = 2M because of the increasing number of thread context switching (I have not tested it since I have no appropriate systems). For N > 2M the number of thread context switching should not grow anymore – now we have at least 2 concurrent threads per each core, each thread is preempted when every scheduler’s timeslice ends, that is the ‘worst case’ situation and it is reached at N = 2M.
A little arithmetic shows that properly tuned semaphore throttle can give about 15-20% performance gain on my notebook. Different systems can show different results.
——————-
You can download tests (both source code and executable) here