No More Array Errors (Part II) - Array_alg(C++)

Algorithm families

Array Algorithms often come in several versions. There are, in fact, 33 families of algorithms, each containing one or more versions. Each family is distinguished by its major purpose and has a single manual entry describing each of the versions. For example, you may recall reading about rem() and rem_s() in the section ``An array-based implementation of Set(C++)''. Both are versions of the rem(C++) family. All the versions in this family, according to the manual page, ``remove elements of an array that satisfy a given criterion.'' The family contains a total of twelve versions:

       rem         rem_ps          rem_rs
       rem_c       rem_psc         rem_rsc
       rem_p       rem_r           rem_s
       rem_pc      rem_rc          rem_sc

Fortunately for the programmer, the system of versions is highly regular and therefore easy to remember. As illustrated by the rem(C++) family, versions are identified by letter suffixes. A version without a suffix is called the plain version. Letters in the suffix have the following meanings:

c -- copy version.: A copy version leaves its input array undisturbed and writes its result to an output array.
p -- predicate version: A predicate version takes a programmer-defined function and calls it whenever it needs to test a single value.
r -- relational version: A relational version takes a programmer-defined function and calls it whenever it needs to compare two values.
s -- stable version: A stable version does not unnecessarily disturb the order of its input array.

As you can see by looking at the rem(C++) family, the version matrix can sometimes be densely populated. You should not expect every family to offer every possible version combination, however. Whenever a particular version is missing, it is for a good reason. For example, there is never a copy version of an algorithm which does not modify its array (for example, bin_loc(C++) or random(C++). Even algorithms that do modify their arrays sometimes lack a copy version. For example, there is no copy version of unique(C++). In this case, the version is missing because it would be just as efficient for you to first call copy() and then call the plain version of unique(C++), which modifies its array in place. As a final example, there is no stable version of merge_sort(C++) because the plain version is already stable.

The rationale for stable and copy versions should be easy to understand. It is perhaps not so easy to understand the rationale for predicate and relational versions. Some examples should help clarify the need for these.

Consider the problem of partitioning the values in an array into two groups of values. The specification for the part(C++) family says that its algorithms ``partition an array into two groups of elements such that the elements in the left group (the group with lower array indices) satisfy a given criterion and those in the right group (the group with higher array indices) do not.'' The plain version, part(), uses equality with a given value as the criterion. In the following example, the call to part() moves all zero elements to the left and returns a pointer just beyond the last zero:

       #include <Array_alg.h>
       int b[10] = {0,1,0,2,0,3,0,4,0,5};
       int* p = part(0,b,b+10)

After the call, b might contain

        0   0   0   0   0   5   4   3   2   1
                           *

or, it might contain

        0   0   0   0   0   3   1   4   2   5
                           *

because part() is not stable (the asterisk marks the location pointed to by the return value). We know this because the description of part() says so explicitly, but we could have guessed it from the fact that the part(C++) family contains several stable versions.

But what if we wanted to partition elements based on some criterion other than equality with a given value? For example, what if we wanted to move even elements to the left (toward lower array indices) and odd elements to the right (toward higher array indices)? The solution is to provide a function testing for evenness and pass it to part_p(), the predicate version of part(C++). If stability is desired, we must also use a stable version:

       #include <Array_alg.h>
       int b[10] = {0,1,0,2,0,3,0,4,0,5};
       int is_even(const int* ip){
           return (*ip)%2==0;
       }
       int* p = part_ps(is_even,b,b+10);

After the call, b will contain

        0   0   2   0   4   0   1   3   5
                               *

In this example, the function is_even() is of type PRED(int), a type defined in Array_alg.h. Functions of this type are expected to return 1 indicating ``the criterion is true'' and zero otherwise.

To illustrate the need for relational versions, consider the problem of sorting an array of structures:

       #include <String.h>
       #include <Array_alg.h>
       struct Record{
           String  key;
           int     value;
       };
       Record r[100];

The versions of the sort(C++) family ``sort an array in place.'' According to the manual entry, the plain version requires that Record have a less-than operator that defines a total ordering relation on type Record. To use the plain version, we must therefore define such an operator:

        int operator<(const Record& r1,const Record& r2){
           return r1.key<r2.key;
       }
       ...
       sort(r,r+100);

Suppose, however, that the records have multiple keys:

       struct Record{
           String  key1;
           String  key2;
           int     value;
       };
       Record r[100];

If we want to be able to sort Records on either key or key combination, the only solution would be to define a so-called ``relation'' for each key:

       int compare_1(const Record* r1,const Record* r2){
           return strcmp(r1->key1,r2->key2);
       }
       int compare_2(const Record* r1,const Record* r2){
           return strcmp(r1->key2,r2->key2);
       }

We could then use sort_r(), the relational version of sort(C++), to sort on the first key

       sort_r(compare_1,r,r+100);

or the second key,

       sort_r(compare_2,r,r+100);

but if we wanted to sort on both keys, with key1 as the primary key and key2 as the secondary key, the following

       sort_r(compare_2,r,r+100);
       sort_r(compare_1,r,r+100);

would not work because sort_r() is not stable. To maintain the original order of records with equal primary keys, the second sort must use the stable relational version:

       sort_r(compare_2,r,r+100);
       sort_rs(compare_1,r,r+100);

In this example, both compare_1 and compare_2 are of type REL(Record) , a type defined in Array_alg.h. The required semantics of such so-called ``relations'' vary from algorithm to algorithm. For algorithms of the sort(C++) family, relations are expected to define a total order relation on the type parameter: they are expected to return a negative, zero, or positive value, depending on whether their first argument is less than, equal to, or greater than their second argument, respectively. The String(C++) function strcmp() has just the semantics needed to implement such a relation.