2011. június 12., vasárnap

Dynamic Arrays


Problem/Question/Abstract:

Dynamic Arrays overview

Answer:

The Long and Winding Way of the Dynamic Array

Dynamic arrays were introduced to Object Pascal in Delphi 4. It was, however, not the first attempt of the Pascal/Delphi team to evolve the static array of Wirth's original Pascal.

Before going any further, let's first clarify some terminology. The terms "static" and "dynamic" are now applied at least four ways:

For arrays whose boundaries may vary (dynamic), versus arrays with constant boundaries (static).
For the methods of assigning memory to the variables: Either their relative addresses are known at compile time (static), or the addresses are assigned by the system at run time (dynamic). Correspondingly, there are two methods of memory allocation: on the stack (static), or on the heap (dynamic).
There are two methods referring to variables: either directly by their names (static), with one-to-one correspondence between a variable and its instance in memory; or indirectly via pointers (dynamic) where a variable and its instance are not the same.
Methods in the class declaration may be either static or virtual/dynamic.

This article follows the evolution of the "dynamic" concept with regard to arrays only. We will analyze its particularities for all array types appearing in Borland's Object Pascal. In addition to the standard arrays (static), there are now three more types in the family of arrays: open, variant, and dynamic. Why so many? Although it's beyond the scope of this article, we should also add to this list all the different types of strings in Delphi 4, which are a family of special character arrays whose diversity is even larger.

The Origin

The idea of dynamic arrays has a long history, beginning as early as ALGOL-60. In ALGOL-60, the syntax of array type looks similar to that of Pascal. For example, a declaration:

Real array A[M1: N1, M2: N2, M3: N3]

defines a 3D array of Real numbers. Only constants (integer numbers) are allowed as the array boundaries in the outermost block - just as in Pascal; in this case, the array is called static. But in the inner blocks of ALGOL-60, unlike Pascal, the boundaries may be variables (see Figure 1).

begin
  Real array A[1: 100, 0..10]; { Static array in ALGOL-60. }
  Integer M1, N1, M2, N2; { Variables. }
  { M1, N1, M2, N2 have to be defined. }
  ...
  begin
    Real array B[M1: N1, M2: N2]; { Dynamic array in ALGOL-60. }
    Real array C[1: 3, 1: 4]; { "Static" array in ALGOL-60. }
    ...
  end;
  ...
end
Figure 1: Variable boundaries in the inner blocks of ALGOL-60.

In the inner blocks, the array boundaries could be any arithmetic expression with the only requirement that numeric values were assigned to all variables in the boundaries before entering the block. Therefore, beginning with ALGOL-60, the concept of a dynamic array means that its boundaries may vary during run time, while for static arrays only explicit numbers are allowed in the boundary expressions.

For arrays that are local variables (like arrays B and C in Figure 1), the memory was allocated when entering the block and released when leaving it. Because declarations of the variables local in a block could appear only in the beginning of that block, no complications with redefining dynamic arrays occurred, and obviously such arrays could not be kept till the next entrance in this block. Nevertheless, in ALGOL-60, one could use the specifier own for any local variable including dynamic arrays, which meant that, although the visibility of the concerned variables obeyed the rule of scope, their values were kept available after re-entering the block. According to the Revised Report on ALGOL-60, this persistency in case of dynamic arrays ought to be followed also for the subset of indexes that are valid for the current and previous versions of the own array, although specific compilers could limit or simplify that behavior.

Note: Regarding compile-time versus run-time assignment of memory to the variables, it was typically run-time assignment in ALGOL-60 versus compile time in Pascal, although both use the stack.

The Long and Winding Road

In Wirth's Pascal, programmers could derive an unlimited number of new types from the basic types, but, for some reasons, the basic array type strictly required only constant boundaries and, therefore, was allowed to deal only with static arrays. Wirth's motivation probably was to have very fast and efficient one-pass compilers, for which all but indirect variables (i.e. except instances of pointer types) are compile-time variables providing the most efficient access. As a drawback, there was no way to overcome the static nature of the arrays, e.g. to develop procedures that deal with vectors, matrixes, and other structures of arbitrary sizes. We had to hard-code all array sizes as the maximum possible numeric values in the const section at least once.

For one-dimensional arrays, with a constant low boundary Low1 and variable boundary High1, we could resort to a trick involving indirect variables (pointers):

type
  TMaxArray = array[Low1..MaxInteger] of AnyType;
  PArray = ^TMaxArray;

and later allocate memory to the instance of PArray according to the actual value of High1:

GetMem(PArray, (High1 - Low1 + 1) * SizeOf(AnyType));

Then, we can use index expressions like PArray^[k]; we can omit ^ because of the undocumented syntax feature of Delphi. Unfortunately, for multi-dimensional arrays, this approach with indirect variables doesn't work as simply with just one pointer. Another drawback is that we have to deal with pointers instead of direct variables (as we are responsible for allocating and de-allocating memory and other possible confusions connected with indirect access).

Open Arrays

The open array, introduced in Delphi 1, was the first extension of Pascal's concept of static arrays, but it wasn't really a type like the others that could be used to declare variables. Instead, it was an intrinsic type, applicable only to formal parameters in procedures and functions. If a formal parameter looks like this:

FormalArr: array of TSomeType;

then the actual parameter may be either a one-dimensional array of type TSomeType, just one variable, or the so-called open array constructor [Expr1, Expr2, ..., ExprN] - all of type TSomeType. The latter is a nice feature not available in ALGOL-60, otherwise the behavior of the open array as a formal parameter suffers two serious drawbacks. First, only one-dimensional arrays as actual parameters are allowed. Second, whatever the low and high boundaries of the actual array are, they're always mapped to the zero-based formal open array. The same is true for the open array constructor. The expressions are numbered starting with 0, which looks a little confusing. This zero-based indexing of the open arrays isn't consistent with the more convenient Pascal array boundaries of low..high type, but, for some reason, Borland still adheres to this principle.

The open arrays enabled us to overcome a serious limitation of static arrays, and allowed us to deal comfortably with one-dimensional arrays. Thus, procedures for simple vectors of any size like scalar product, average, maximum, minimum, etc. were no longer a problem.

The variant open array formal parameter looks like:

FormalVarArr: array of const ;

and is intended only to transfer the open array constructors containing expressions of different types as an actual parameter - an extension of the similar feature for open arrays and the predecessor of the more general idea of the type variant.

Variant Arrays

The Variant type, introduced by Delphi 2, was a very powerful extension of Pascal intended for different purposes. We're going to discuss it here only with regard to the concept of dynamic arrays. A variable declared as Variant may represent a multi-dimensional dynamic array, but you need a special non-Pascal statement to specify the dimensions and the type of elements:

var
  vArr: Variant;
  {...}
begin
  {...}
  vArr := VarArrayCreate([Low1, High1, ..., LowN, HighN],
    ElementType);

where the boundaries may be variables and ElementType belongs to the fixed list of basic Pascal types denoted by the identifiers of the format varXXXX, for example varInteger, varDouble. After that we can consider vArr as an N-dimensional index variable vArr[i1,i2,...,iN]. This implementation is the closest to the notion of dynamic arrays as it appeared in ALGOL-60; it really made possible the multi-dimensional rectangular arrays with the variable boundaries of low..high type. Unfortunately, access to elements of variant arrays is at least 10 times slower than to static arrays; a simple benchmark that transposes a big matrix (e.g. A[i,j]:=A[j,i], i=1,...,N; j = 1,...,i-1) demonstrates it well. Also, in terms of memory consumption, any variant variable requires a 16-byte overhead. Although it doesn't seem too much if one variable represents a big array, it's something to keep in mind in case of many non-array variants.

The re-dimensioning of variant arrays is possible within the same block, but only for the last (right-most) dimension; the special function, varArrayRedim, does the job.

As to the efficiency of access to the elements, it may be improved to the level as fast as that of static arrays via the special procedure varArrayLock. It returns a pointer that is assignment compatible with pointers to any static array, but meaningful only if that static type corresponds exactly to the dimensions specified in the varArrayCreate. For example, for vArr, the corresponding static array type must be:

TvArrStat = array[LoN..HiN, ..., Lo1..Hi1] of ElementType;

with the dimensions specified in the order inverse to that in the varArrayCreate (why?!) and all LoN, HiN, ... Lo1, Hi1 being constants equal to the current values of the corresponding variable boundaries. Then, providing the declaration:

var
  vArrLock: ^TvArrStat;

the statement:

vArrLock := varArrayLock(vArr)

allows us to use the index variable vArrLock^[iN,...,i1] (or simply vArrLock[iN,...,i1]) with the access speed as quick as static arrays. We increased speed, but, to deal with variable boundaries, we must explicitly declare as many different static array types as we are going to have in run time. For example, we may need to prepare in advance several type declarations:

type
  T200x200 = array[1..200; 1..200] of Real;
  T150x150 = array[1..150; 1..150] of Real;
  {...}

and then specify the variable dimensions in the varArrayCreate according to one of these types - not a very convenient technique.

The interesting feature of variant arrays is that the ElementType may be variant, too:

vArr := VarArrayCreate([Low1, High1, varVariant)

In particular, it means that individual elements vArr[k] may be defined as a variant array again with any number of dimensions of any size:

vArr[k] := VarArrayCreate([kLow, kHigh, varDouble)

This creates an illusion as though we can treat vArr as a two-dimensional, non-rectangular array. (The examples of non-rectangular arrays of two dimensions are triangle matrixes, or matrixes with just a few stripes. For three dimensions, it may be an integer grid of points inside a pyramid.) Unfortunately, the variant array of variant arrays doesn't work like the similar construction of the standard Pascal arrays. Providing the declaration of vArr given previously, the code compiles for the index variable like vArr[i,j], but stops with a run-time error (because vArr is created as one-dimensional). Surprisingly, vArr[i][j] - that should be the synonym in Pascal - shows different behavior: It even produces a syntax error if it appears in the left side of the assignment statement; a := vArr[i][j] compiles and runs correctly, while vArr[i][j] := b doesn't, resulting in a syntax error.

So we see that although the variant array type allows some functionality of the ALGOL-60's dynamic arrays, the variant arrays are far more complex, slow, and not consistent with the Pascal's concept of arrays both in syntax and semantics.

Dynamic Arrays of Delphi 4

Finally, here is the latest attempt to implement the dynamic arrays (covered only on two pages of the Object Pascal Language Guide!). The declaration of one-dimensional dynamic arrays looks like this:

type
  TDynArray1 = array of baseType;

boundaries [...] must be omitted, where the baseType may be also a static array type or a dynamic array type again. This allows the declaration of the multi-dimensional "mixture" as well as "purely" dynamic arrays. For example, the declaration for three dimensions takes the following form:

type
  TDynArray3 = array of array of array of baseType;

This syntax allows us to declare a certain number of dimensions, but not their sizes (which require special consideration). Thus, if the baseType is of the non-array type, a variable:

var
  A, B: TDynArray3

may be used with up to three indexes, e.g. A[i,j,k]. Otherwise, if the baseType = array[1..100;1..200] of Double, this variable may appear with up to five indexes A[k1,k2,k3,k4,k5].

After a dynamic array variable is declared, it still cannot be used unless the special statement SetLength specifies the sizes of all dimensions and allocates the required memory. This shows the important difference between the static and dynamic arrays. The latter are - but only partially behave like - hidden pointers, i.e. a dynamic array variable is not strictly associated with its memory image, the instance, but rather separates from it.

Thus, the above-mentioned variable A (without indexes), or A[i], or A[i,j] (with number of indexes less than the declared number) are all hidden pointers. As such, at certain moments they may point nowhere, or more than one hidden pointer may point to the same instance. For example, after the assignment A := B, both A and B point to the same instance of B, so that any change to the elements of A affects B; this contradicts the usual meaning of the assignment statement. While the instance of A (if it exists) seems to be lost because it's pointed to by nothing, it doesn't cause a memory leak, which is prevented by the so-called reference count technique implemented for the dynamic arrays. For that reason, two consecutive statements - SetLength(A, ...) and SetLength(A, ...) - don't cause the loss of the piece of memory allocated in the first statement (leak) - unlike the similar situation, say, for classes. The sequence:

X := TAnyClass.Create;
X := TAnyClass.Create;

is a mistake. Also, the assignment:

A := nil
  
actually signals to the system that the memory (instance) must be freed, which is never the case for classes or pointers.

And even if:

A[i1, i2, i3] = B[i1, i2, i3]

for all indexes, it never means that the conditions:

A = B or A[i1] = B[i1] or A[i1, i2] = B[i1, i2]

are True, because these partially-indexed variables point to different locations.

In terms of persistency, while leaving and re-entering a block, the dynamic array variables behave like all other local (static) variables: leaving the block, the variables and their instances are freed automatically. For local variables of the types class and pointer it's wrong to leave the block without freeing all the instances of all such variables - the reason why such variables must be declared, for example, global. The user should nil a dynamic array only if it's important to free the memory before leaving the block. Hence, dynamic arrays are much safer than classes and pointers.

Thus, both the syntax and semantics of dynamic arrays differ from those of static arrays. Two system procedures, SetLength and Copy, previously intended to deal with strings, are applied now also for dynamic arrays. To define the sizes of dimensions - and the allowed index space - we must use the system procedure SetLength(A, Length1,...) with a non-fixed number of the integer parameters Length1,..., LengthN. At least one of them is always required to specify the size of the left-most dimension. If the number of dimensions is more than 1, the remaining sizes may be specified either in the same SetLength statement, or later in other such statements individually for each sub-array element. The former method defines rectangular arrays, similar to those known in ALGOL-60 or standard Pascal (but with mandatory zero low indexes), while the latter enables the so-called non-rectangular arrays.

For example, providing:

var
  A: TDynArray3

the single statement:

SetLength(A, N1, N2, N3)

defines the rectangular array with the index field [0..N1-1; 0..N2-1; 0..N3-1], while the statement:

SetLength(A, N1)

defines the size for the first dimension as N1 and correspondingly the index field for the first index as [0.. N1-1]. This postpones the definition of the 2 other dimensions for each A[k] individually. Figure 2 defines two types of triangle matrixes.

var
  A, B: array of array of Double;
  N, i: Integer;
begin
  { Defining N. }
  SetLength(A, N);
  SetLength(B, N);
  for i := 0 to N - 1 do
  begin
    { Lower-left triangle matrix; index field 0<=i<=N-1, 0<=j<=i }
    SetLength(A[i], i + 1);
    { Upper-left triangle matrix; index field 0<=i<=N-1, 0<=j<=N-i-1 }
    SetLength(B[i], N - i);
  end;
  { ...}
end;
Figure 2: Two types of triangle matrixes.

Unfortunately, because of the limitation imposed by zero-based indexing, dynamic arrays don't allow us to define the lower- and upper-right triangle matrixes, matrixes with several diagonal stripes this way. Figure 3 shows some examples of three-dimensional dynamic arrays of a 3- and 4-lateral pyramid-type.

var
  C, D: array of array of array of Double;
  N, i, j: Integer;
begin
  { Defining N }
  SetLength(C, N);
  SetLength(D, N);
  for i := 0 to N - 1 do
  begin
    { 4-lateral pyramid; index field  0<=i<=N-1, 0<=j,k<=i }
    SetLength(C[i], i + 1, i + 1);
    { 3-lateral pyramid }
    SetLength(D[i], i + 1);
    for j := 0 to i do
      { index field  0<=i<=N-1, 0<=j<=i, 0<=k <=j }
      SetLength(D[i, j], j + 1)
  end;
  {...}
end;
Figure 3: 3D dynamic arrays of a 3- and 4-lateral pyramid type.

As to the speed of access to elements of dynamic arrays, it's almost as high as for static arrays, at least one- and two-dimensional ones, as the benchmark with matrix transposing proves. For static arrays, the memory location of each element in a multi-dimensional array is known as soon as the index expression computes. Thus, for a static element, such as A[k1,k2,k3], the relative location may look like this:

N2N3 * k1 + N3 * k2 + k3

Instead, for dynamic arrays to access an element of K-dimensional array, the code must sequentially de-reference K pointers to the respective one-dimensional sub-arrays. Both approaches seem compatible.

Language Barrier

The evolution of dynamic arrays in Borland Pascal/Delphi wasn't straightforward. With regard to the functionality, the multi-dimensional rectangular variant arrays are the closest to the concept of dynamic arrays as they first appeared in ALGOL-60, but variant arrays are 10 times slower than static ones, and they differ in syntax. In addition, for the concept of a variant array of variant arrays, both syntax and semantics remain not quite clear.

The dynamic arrays in Delphi 4 exceed the arrays of ALGOL-60, at least in that they can be non-rectangular and still be as fast as static arrays in Pascal. Unfortunately, this notion reveals several language inconsistencies:

Why the mandatory zero-based indexing when the static and variant arrays don't require it? The low..high indexing in standard Pascal is an important feature, and can be very helpful in many applications. Also, the most natural and safe method of numbering in structures like vectors and matrixes is to number the elements in a vector corresponding to the index field: 1..N, not 0..N-1.
Why the special syntax in the declaration that omits the boundaries [ ]? As a result, the separate statement, SetLength, is later always required. True, SetLength allows to re-define the dimensions several times within a block, if it's really needed, but for the more typical case when the dimensions and the sizes are declared once at the beginning of the block, the standard form array[low..high] is better, because it is consistent with the syntax. The compiler knows by itself if the boundary expression is constant or variable, therefore it could implement the static or dynamic models according to the situation.
The assignment statements with incomplete indexes for dynamic array variables such as A := B have a quite unusual meaning: Any change in an element A[k] affects also B[k] because A and B refer to the same instance. This behavior contradicts the standard meaning of the assignment statement and is dangerous, so that incomplete indexes in assignment statements for dynamic arrays shouldn't be allowed.
The behavior of dynamic arrays as hidden pointers is better and safer than the behavior of classes (also hidden pointers) or pointers. Why not improve the behavior of classes and pointers to the level of dynamic arrays so that all indirect reference mechanisms in the language follow uniform rules?
Logically, the Delphi type class must be simply an indirect reference version of the type object introduced in Turbo-Pascal 5.5. The only difference should be in the method of memory allocation: for the class on the heap, and for the object on the stack. This is safer and doesn't involve the dangerous separation of variables from their instances. Delphi 4 and all previous versions support the type object for backward compatibility, but there are still certain differences between syntax and semantics of both types.

Conclusion

The fact that now there are as many as four different array types with quite different and inconsistent syntax and semantics in Borland Pascal/Delphi doesn't seem to be a good thing. Too many - and not always good - new features have been added to standard Pascal, which makes it cluttered, hectic, and less safe. It looks like the language doesn't evolve according to a well-developed fundamental plan; rather, it's trying to cope somehow with all different and inconsistent features of the very complex operating environment.

Delphi is still an unparalleled software development tool, but it's getting more and more complex, while its documentation and Help system lag behind. Even the Object Pascal Language Guide, the fundamental document of the language, is neither complete nor clear, or strict and formal enough as one should expect from a document of this type. This bears no resemblance to that high standard of documentation that Borland was proud of in the era of Borland Turbo Pascal. Back to the future?

I am very thankful to Dr Manfred Mackeben for his patience in reading and improving this text and for many valuable notes.

Nincsenek megjegyzések:

Megjegyzés küldése