Bitwise operations on big integers

0

Standard fixed-sized negative integers are stored in two’s complement format; for arbitrary-precision big integers the two’s complement format means infinite size, so internally it is not used. Still bitwise operation on big integers are implemented as if negative big integer values are stored in two’s complement format.

As a result bitwise boolean operations (and, or, xor) applied to big integers produce the same results as with standard fixed-sized integers:

procedure Test1(I, J: Integer);
var
  BigI, BigJ: BigInteger;

begin
  BigI:= I;
  BigJ:= J;
  Assert(BigI and BigJ = I and J);
  Assert(BigI or BigJ = I or J);
  Assert(BigI xor BigJ = I xor J);
end;

There is a difference between standard Delphi integer types and big integers in shift operations. Shift operations on big integers preserve sign. That means any shift applied to negative big integer results in negative big integer (the same is for non-negative values):

procedure Test2(I: BigInteger; N: Cardinal);
begin
  Assert(((I shl N) < 0) = (I < 0));
  Assert(((I shr N) < 0) = (I < 0));
end;

That is a natural consequence of the infinite-sized two’s complement negative values. Shift operations on big integers are arithmetic shifts rather than logical shifts. On the other hand ‘shl’ and ‘shr’ operations on the standard Delphi integer types are implemented as logical shifts and does not preserve sign:

procedure Test3;
var
  I: Integer;

begin
  I:= -1;
  Writeln(I shr 2);   // 1073741823, because 'shr' is logical shift
  I:= 1000000000;
  Writeln(I shl 2);   // -294967296, because of 32-bit overflow
end;

PS: right now TForge does not support bitwise operations on BigInteger type, they will be implemented soon.

‘var’ vs ‘out’ as a function argument

4

The Delphi Documentation is known to be a confusing and untrustful source of information. I am not quoting the current documentation content here and hope it will be fixed – right now it is a nonsense.

You can replace ‘var’ by ‘out’ and get binary identical code if only POD types are used.

Still the ‘out’ keyword is not an alias for ‘var’. The difference appears when lifetime managed types (strings, dynamic arrays, interfaces) are involved:

program Test1;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TIntArray = array of Integer;

var
  S: string;
  II: IInterface;
  A: TIntArray;
  Obj: TObject;

procedure TestString(out AStr: string);
begin
  Assert(AStr = '');
  Writeln('Passed');
end;

procedure TestInterface(out AII: IInterface);
begin
  Assert(AII = nil);
  Writeln('Passed');
end;

procedure TestDynArray(out AArr: TIntArray);
begin
  Assert(AArr = nil);
  Writeln('Passed');
end;

procedure TestObject(out AObj: TObject);
begin
  Assert(AObj = nil);  // Failed
  Writeln('Passed');
end;

begin
  try
    try
      S:= '123';
      TestString(S);
      II:= TInterfacedObject.Create;
      TestInterface(II);
      A:= TIntArray.Create(42);
      TestDynArray(A);
      Obj:= TObject.Create;
      TestObject(Obj);
    except
      on E: Exception do
        Writeln(E.ClassName, ': ', E.Message);
    end;
  finally
    Obj.Free;
    ReadLn;
  end;
end.

The above code shows that when you use ‘out’ instead of ‘var’ in a function declaration and pass a reference counted instance the compiler makes the input value of an instance unavailable in a function. The ‘out’ keyword guaranties that the input value of a reference counted instance will never be changed. That is the only difference between ‘var’ and ‘out’ parameters I could find. The 4th test (procedure TestObject) shows that ‘out’ does not protect input value of an object (non-refcounted instance).
Sometimes use of ‘out’ keyword leads to unexpected results (the next example is taken from Alex Ciobanu blog):

procedure CopyStr(const AStr: String; out AOutStr: String);
begin
  AOutStr := AStr;
end;

var
  SomeStr: String;

begin
  SomeStr := 'Hello World';
  CopyStr(SomeStr, SomeStr);
  WriteLn(SomeStr);
  ReadLn;
end.

Big Integer musings

2

When writing a multiple precision integer arithmetic library you should have in mind two aims – usability and performance, and to reach both is not a trivial task. A modern multiple precision integer implementation in Delphi should be as easy to use as that:

var
  A, B, C: TBigInteger;

begin
  A:= 1;
  B:= 2;
  C:= A + B;
  Writeln(C.AsString);
end.

Internally a big integer is some data structure that should be initialized prior it can be used, but there is no explicit initialization in the above code example. Initialization occures when a big integer is assigned first time, and the requirement that a value should be assigned prior it can be used is the same for plain integer type. The only difference here is that a most probable result of using unassigned (= uninitialized) big integer is an access violation error.

Delphi provides operator overloading to implement initialization by assignment, so a big integer should be implemented as an advanced record with overloaded Implicit operator. Next point that should taken into account is an assignment of two big integers:

var
  A, B: TBigInteger;

begin
  A:= 1;
  B:= A;
  Writeln(B.AsString);
end.

The assignment of two big integers cannot be overloaded, and that means that a big integer data should be reference counted; the assignment B:= A just increments a ref counter. A ref counted implementation has its pros and contras from the performance point of view, but there is no alternative here if you are aiming usability too. Reference counting can be implemented either by using a dynamic array to hold big integer data or by using interface, and interface implementation leads to a cleaner code.

Next logical conclusion is that a big integer data should be immutable and write accessed only as a whole. You can’t change just one byte of big integer data because of side effect when ref counter > 1. Fortunately there is no need in byte-level write access to a big integer data. Any write access to a big integer should create a new data structure with ref counter = 1 and decrement ref counter of an old one.

TThread Facts

4

Delphi RTL TThread class has been an object of criticism since its introduction. Some criticism was deserved, some not. TThread implementation was gradually improving with every Delphi version. Here I am neither criticizing nor advocating the TThread class, just listing some details of TThread implementation in modern Delphi versions that Delphi programmer should know.

  1. TThread class does not allow setting a thread stack size – it is always equal to default value, usually 1 Mb. If you need a different value you can call BeginThread function directly, but that is a low level solution and not as handy as using TThread class;
  2. TThread constructor has the only parameter (CreateSuspended: Boolean). Irrespective of the parameter value the underlying Windows thread object is always created in suspended state:

        FCreateSuspended := CreateSuspended {...};
        FHandle := BeginThread(nil, 0, @ThreadProc, Pointer(Self), CREATE_SUSPENDED, FThreadID);
    

    A thread can start execution in AfterConstruction method (which is called after all inherited constructors):

      if not FCreateSuspended {...} then Resume;
    

    or later;

  3. Synchronization with the main thread does not work in console application. That means that you should not use Synchronize and Queue methods of TThreads class in console application. The same also applies to OnTerminate event which calls Synchronize method internally; you can override protected DoTerminate method instead;
  4. If you set FreeOnTerminate = True for a TThread instance, the instance is destroyed in the context of underlying background thread. Usually the destruction context does not matter, but in some cases it does. For example, create a TTimer instance in the constructor of TThread descendant, destroy timer in the destructor and see what happens (hint: an invisible window of TTimer instance is created in the main thread and destroyed in a different thread);
  5. TThread does not guarantee that Execute method will be called. I have bumped into this problem a little before.

Despite all issues we are better to use TThread class ‘as is’ in GUI applications. For my experiments I have written a simple alternative thread wrapper class that allows setting a thread stack size, always calls Execute method and does not contain synchronization methods; it is well suited for console applications.

Semaphore throttle

15

Suppose we have an algorithm that uses N parallel threads, and we have a system with M CPU cores, N > M. Running the algorithm on the system leads to performance loss because the threads are contend for available CPU cores and cause time-consuming thread context switching. A better approach is to execute threads sequentially on the available CPU cores, to avoid the unnecessary thread context switching. There is a simple technique called semaphore throttle which limits the number of active, contending threads.
It is interesting to estimate how efficient semaphore throttle is on real system. Or, in other words, how much is the performance loss caused by thread context switching.
As a test problem I have chosen the sum of arithmetic progression: S = 1 + 2 + 3 +… + Count * 64 . To find the sum I run 64 threads, each thread calculates a partial sum from 1 + I * Count to (I + 1) * Count, I = [0..63].
To obtain valid timings it is important that each thread is executed sufficiently long, so that the thread is preempted many times by other contending threads (when the scheduler’s time quantum ends). To increase the thread execution time I turned optimization off and chosen the Count value as much as possible for the resulting sum to fit into int64 range, but it appeared insufficient. Finally I decided to insert an additional loop into the thread function so that the thread execution time exceeded 1 second on my system:

unit SumThreads;

interface

uses
  Windows, Classes;

type
  TSumThread = class(TThread)
  private
    FSum: Int64;
    FBase: Integer;
    FCount: Integer;
    FSemaphore: THandle;
  protected
    procedure Execute; override;
  public
    constructor Create(ABase, ACount: Integer; ASemaphore: THandle);
    property Sum: Int64 read FSum;
  end;

implementation

{$O-}

constructor TSumThread.Create(ABase, ACount: Integer; ASemaphore: THandle);
begin
  FBase:= ABase;
  FCount:= ACount;
  FSemaphore:= ASemaphore;
  inherited Create(False);
end;

procedure TSumThread.Execute;
var
  Cnt, Value, J: Integer;
  S: Int64;

begin
  if FSemaphore <> 0 then
    WaitForSingleObject(FSemaphore, INFINITE);
  try
// to increase execution time the calculation is repeated
    J:= 20;
    repeat
      Value:= FBase;
      Cnt:= FCount;
      S:= 0;
      repeat
        S:= S + Value;
        Inc(Value);
        Dec(Cnt);
      until Cnt = 0;
      FSum:= S;
      Dec(J);
    until J = 0;
  finally
    if FSemaphore <> 0 then
      ReleaseSemaphore(FSemaphore, 1, nil);
  end;
end;

end.

To run the test I have written a simple console application that receives the number of concurrent threads as command line parameter, and outputs the number of concurrent threads and total execution time in seconds:

program throttle;

{$APPTYPE CONSOLE}

uses
  Windows,
  SysUtils,
  Classes,
  Diagnostics,
  SumThreads in 'SumThreads.pas';

const
  CHUNK = 30000000;

var
  Semaphore: THandle;
  Count, I: Integer;
  Sum: Int64;
  Threads: array[0..63] of TSumThread;
  Handles: array[0..63] of THandle;
  StopWatch: TStopWatch;

begin
  try
    if ParamCount <> 1 then
      raise Exception.Create('Number of concurrent threads not defined');
    Count:= StrToInt(ParamStr(1));
    if (Count < 0) or (Count > 64) then
      raise Exception.Create('Invalid number of concurrent threads');
    if Count <> 0 then begin
      Semaphore:= CreateSemaphore(nil, Count, Count, nil);
      Win32Check(Bool(Semaphore));
    end
    else
      Semaphore:= 0;
    try
      StopWatch:= TStopWatch.StartNew;
      for I:= 0 to 63 do begin
        Threads[I]:= TSumThread.Create(1 + I * CHUNK, CHUNK, Semaphore);
        Handles[I]:= Threads[I].Handle;
      end;
      if WaitForMultipleObjects(64, @Handles, True, INFINITE) = WAIT_FAILED
        then raise Exception.Create('WaitForMultipleObjects Failed');
      StopWatch.Stop;
      Sum:= 0;
      for I:= 0 to 63 do begin
        Sum:= Sum + Threads[I].Sum;
        Threads[I].Free;
      end;
    finally
      CloseHandle(Semaphore);
    end;
    Writeln(Count:5, ' -- ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Number of concurrent threads: ', Count);
//    Writeln('Time elapsed (seconds): ', StopWatch.Elapsed.TotalSeconds:3:2);
//    Writeln('Sum obtained: ', Sum);
//    Sum:= CHUNK * 64; // number of summands
//    Sum:= (Sum * (Sum + 1)) div 2;
//    Writeln('Sum expected: ', Sum);
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

I ran the tests on my notebook with 64-bit Windows 7 SP1 and CPU Intel Core i3 M380 using a simple batch file throttle.bat:

@echo off
time /t
throttle 1
throttle 2
throttle 3
throttle 4
throttle 5
throttle 6
throttle 8
throttle 16
throttle 32
throttle 64
throttle 0
time /t

The resulting log is

13:17
    1 -- 109.17
    2 -- 55.44
    3 -- 60.14
    4 -- 67.56
    5 -- 67.85
    6 -- 67.42
    8 -- 67.39
   16 -- 67.52
   32 -- 67.56
   64 -- 67.59
    0 -- 67.55
13:30

It can be clearly seen that my CPU has 2 physical cores (the number of logical cores for Core i3 M380 is 4); the fastest execution time is achieved by allowing two threads to be executed concurrently. Running 3 concurrent threads is slower, running 4 concurrent threads is even more slow. But there is no further performance degradation.
If you think of the results you can guess that if N concurrent threads are executed on M CPU cores the performance should degrade from N = M to N = 2M because of the increasing number of thread context switching (I have not tested it since I have no appropriate systems). For N > 2M the number of thread context switching should not grow anymore – now we have at least 2 concurrent threads per each core, each thread is preempted when every scheduler’s timeslice ends, that is the ‘worst case’ situation and it is reached at N = 2M.
A little arithmetic shows that properly tuned semaphore throttle can give about 15-20% performance gain on my notebook. Different systems can show different results.

——————-

You can download tests (both source code and executable) here