Sets - Sets(C++)

Space-time tradeoffs

If we measure the execution time of the two word-counting programs developed in the section, ``A Bag example'' for moderate-size input files, we will observe that the Bag version runs faster than the Map version, and furthermore that the discrepancy between the two versions increases with the size of the input file. For example, running each version on a file containing 175,000 words gives:

       Map version:        real    2m39.48s
           user    2m32.28s
           sys     0m3.18s

       Bag version:        real    2m16.01s
           user    1m58.81s
           sys     0m5.38s

which represents a 22% user-time advantage for the Bag version (27% if we consider insertion alone). The difference can be partially explained by noting that while Map insertion runs in O(ln N), where N is the number of elements in the Map, Bag insertion, runs in O(L*L), where L is the average number of elements with the same hash value.

For the 175,000-word example, the value of L was 1.2. We determined this using the function histogram():

       main(){
           Bag<String> count;
           String word;
           while(cin >> word){
               count.insert(word);
           }
           Map<Set_or_Bag_hashval,unsigned> m;
           count.histogram(m);
           cout << "L = " << count.size()/m.size() << "\ n";
       }

histogram() returns a Map that gives, for each hash value currently in the Bag, the number of elements in the Bag with that hash value. Since m.size() is the number of unique hash values and count.size() is the number of elements in the Bag, their ratio is the desired average. You can use other statistics based on the histogram to fine-tune your hash functions.

As L gets larger, the execution speed of Set and Bag operations gets slower, approaching O(N*N) in the limit, which occurs when we use a hash function that hashes every element to the same value. Not surprisingly, however, it is this very degenerate case that results in the lowest storage overhead. (As we will see in, ``Internal representation of Sets'', the elements are all stored in a single list.) When prototyping Set or Bag applications, or when you know in advance that Sets and Bags will never have more than a few elements, this behavior may be acceptable.

To underscore the need for a good hash function, consider the following example:

       Set_or_Bag_hashval Set<int>::hash(const int& t) {
           if(t % 2 == 0){
               return 2;
           }else if (t < 0){
               return 1;
           }else{
               return t+2;
           }
       }

       main(){
           Set<int> s;
           for(int i = -20; i<1000; i++){
               s.insert(i);
           }
       }

With this hash function, the 510 even values hash to the value 2, the 10 negative odd values hash to the value 1, and each of the remaining 500 values hashes to a unique value. Timing the program on a Sun 3/60 gives

           real    0m55.55s
           user    0m28.18s
           sys     0m1.45s

an average user time of .03 seconds per element! Using the perfect hash function shown earlier:

       Set_or_Bag_hashval Set<int>::hash(const int& t) {
           return (Set_or_Bag_hashval)t;
       }

the same program gives

           real    0m3.11s
           user    0m0.93s
           sys     0m0.26s

an average user time of less than one millisecond per element.