Big Integer musings

2

When writing a multiple precision integer arithmetic library you should have in mind two aims – usability and performance, and to reach both is not a trivial task. A modern multiple precision integer implementation in Delphi should be as easy to use as that:

var
  A, B, C: TBigInteger;

begin
  A:= 1;
  B:= 2;
  C:= A + B;
  Writeln(C.AsString);
end.

Internally a big integer is some data structure that should be initialized prior it can be used, but there is no explicit initialization in the above code example. Initialization occures when a big integer is assigned first time, and the requirement that a value should be assigned prior it can be used is the same for plain integer type. The only difference here is that a most probable result of using unassigned (= uninitialized) big integer is an access violation error.

Delphi provides operator overloading to implement initialization by assignment, so a big integer should be implemented as an advanced record with overloaded Implicit operator. Next point that should taken into account is an assignment of two big integers:

var
  A, B: TBigInteger;

begin
  A:= 1;
  B:= A;
  Writeln(B.AsString);
end.

The assignment of two big integers cannot be overloaded, and that means that a big integer data should be reference counted; the assignment B:= A just increments a ref counter. A ref counted implementation has its pros and contras from the performance point of view, but there is no alternative here if you are aiming usability too. Reference counting can be implemented either by using a dynamic array to hold big integer data or by using interface, and interface implementation leads to a cleaner code.

Next logical conclusion is that a big integer data should be immutable and write accessed only as a whole. You can’t change just one byte of big integer data because of side effect when ref counter > 1. Fortunately there is no need in byte-level write access to a big integer data. Any write access to a big integer should create a new data structure with ref counter = 1 and decrement ref counter of an old one.

Parsing UTF8 strings

0

Windows API functions like CharNextExA do not support UTF8 encoding; you can test CharNextExA with UTF8 codepage (65001) and UTF8 strings and see it does not work. If you need to calculate for example the number of unicode codepoints in UTF8 string you should parse UTF8 string manually. Fortunately the task is very simple and boils down to the following function:

function UTF8IsLeadChar(Ch: AnsiChar): Boolean;
begin
  Result:= Ord(Ch) and $C0 <> $80;
end;

the function that returns number of unicode codepoints in UTF8 string looks as follows:

function UTF8CharCount(const S: UTF8String): Integer;
var
  P: PAnsiChar;
  I: Integer;

begin
  I:= 0;
  P:= PAnsiChar(S);
  Result:= 0;
  while I < Length(S) do begin
    if UTF8IsLeadChar(P^) then Inc(Result);
    Inc(P);
    Inc(I);
  end;
end;