Section 12.4
Programming with Collection Classes


IN THIS SECTION, we finish the chapter and the book by looking at a few programming examples that use the collection and map classes which were discussed in Sections 1, 2, and 3.


Symbol Tables

We begin with a straightforward but important application of Maps. When a compiler reads the source code of a program, it encounters definitions of variables, subroutines, and classes. The names of these things can be used later in the program. The compiler has to remember the definition of each name, so that it can recognize the name and apply the definition when the name is encountered later in the program. This is a natural application for a Map. The name can be used as a key in the map. The value associated to the key is the definition of the name, encoded somehow as an Object. A Map that is used in this way is called a symbol table.

In a compiler, the values in a symbol table can be quite complicated, since the compiler has to deal with names for various sorts of things, and it needs a different type of information for each different type of name. We will keep things simple by looking at a symbol table in another context. Suppose that we want a program that can evaluate expressions entered by the user, and suppose that the expressions can contain variables, in addition to operators, numbers, and parentheses. For this to make sense, we need some way of assigning values to variables. When a variable is used in an expression, we need to retrieve the variable's value. A symbol table can be used to store the variables' values. The keys for the symbol table are variable names. The value associated with a key is the value of that variable.

To demonstrate the idea, we'll use a rather simple-minded program in which the user types commands such as:

            let x = 3 + 12
            print 2 + 2
            print 10*x +17
            let rate = 0.06
            print 1000*(1+rate)

The only two commands that the program understands are "print" and "let". When a "print" command is executed, the computer evaluates the expression and displays the value. If the expression contains a variable, the computer has to look up the value of that variable in the symbol table. A "let" command is used to give a value to a variable. The computer has to store the value of the variable in the symbol table. (Note: The "variables" I am talking about here are not variables in the Java program. The Java program is executing a sort of program typed in by the user. I am talking about variables in the user's program. The user gets to make up variable names, so there is no way for the Java program to know in advance what the variables will be.)

In Section 11.5, we saw how to write a program, SimpleParser2.java, that can evaluate expressions that do not contain variables. The new program, SimpleParser5.java, is based on that older program. I will only talk about the parts that are relevant to the symbol table. Here is an applet that simulates the program:

(Applet "SimpleParser5Console" would be displayed here
if Java were available.)

The program uses a HashMap as the symbol table. A TreeMap could also be used, but since we don't have any reason to access the variables in alphabetical order, we don't need to have the keys stored in sorted order. The value of a variable is a double, but Java's collection and map classes can only hold objects. To get around this restriction, we have to use the wrapper class, Double. The variable's value, which is of type double is wrapped in an object of type Double. That object is stored in the HashMap, using the variable's name as the key.

Let symbolTable be the HashMap that is used as the symbol table. At the beginning of the program, we start with an empty map:

            symbolTable = new HashMap();

To execute a "let" command, the program uses the put() method to associate a value with the variable name. Suppose that the name of the variable is given by a String, varName. The command for setting the value of the variable to val would be:

            symbolTable.put(varName, new Double(val));

This associates the object new Double(val) with the key, varName. In the program SimpleParser5.java, you'll find this in the method named doLetCommand(). Just for fun, I decided to pre-define two variables named "pi" and "e" whose values are the usual mathematical constants pi and e. In Java, the values of these constants are given by Math.PI and Math.E. To make these variables available to the user of the program, they are added to the symbol table with the commands:

            symbolTable.put("pi", new Double(Math.PI));
            symbolTable.put("e", new double(Math.E));

When a variable is encountered in an expression, the get() method is used to retrieve its value. Since this method returns a value of type Object, we have to type-cast the return value to Double. If the variable has never been given a value, then the get() method returns null. We have to handle this in some way; I will consider it to be an error:

            Object symbolTableEntry = symbolTable.get(varName);
            if (symbolTableEntry == null) {
               ... // Throw an exception:  Undefined variable.
            }
            Double value = (Double)symbolTableEntry;
            double val = value.doubleValue();

The last line gets the double that is the actual value of the variable from the wrapper object. You will find this code, more or less, in a method named primaryValue() in the program SimpleParser5.java.

As you can see, aside from the necessity of using a wrapper class, Maps are really quite easy to use.


Sets Inside a Map

The objects in a collection or map can be of any type. They can even be collections. Here's an example where it's natural to store sets as values in a map.

Consider the problem of making an index for a book. An index consists of a list of terms that appear in the book. Next to each term is a list of the pages on which that term appears. To represent an index in a program, we need a data structure that can hold a list of terms, along with a list of pages for each term. Adding new data should be easy and efficient. When it's time to print the index, it should be easy to access the terms in alphabetical order. There are many ways this could be done, but I'd like to use Java's generic data structures and let them do as much of the work as possible.

We can think of an index as a Map that associates a list of page references to each term. The terms are keys, and the value associated with a given key is the list of page references for that term. A Map can be either a TreeMap or a HashMap, but only a TreeMap will make it easy to access the terms in sorted order. The value associated with a term is a list of page references. How can we represent such a value? If you think about it, you see that it's not really a list in the sense of Java's generic classes. If you look in any index, you'll see that a list of page references has no duplicates, so it's really a set rather than a list. Furthermore, the page references for a given term are always printed in increasing order, so we want a sorted set. This means that we should use a TreeSet to represent a list of page references. The values that we really want to put in this set are of type int, but once again we have to deal with the fact that generic data structures can only hold objects. We have to wrap each value of type int in an object belonging to the wrapper class, Integer.

To summarize, an index will be represented by a TreeMap. The keys for the map will be terms, which are of type String. The values in the map will be TreeSets. The TreeSet corresponding to a term contains Integers which give the page numbers of every page on which the term appears.

To make an index, we need to start with an empty TreeMap, look through the book, inserting every reference that we want to be in the index into the TreeMap, and then print out the data from the TreeMap. Let's leave aside the question of how we find the references to put in the index, and just look at how the TreeMap is used. The TreeMap can be created with the command:

         TreeMap index = new TreeMap();

Now, suppose that we find a reference to some term on some page. We need to insert this information into the index. To do this, we should look up the term in the index, using index.get(term). The return value is either null or is the set of page references that we already have for the term. If the return value is null, then this is the first page reference for the term, so we should add the term to the index, with a new set that contains the page reference we've just found. If the return value is non-null, we already have a set, and we should just add the new page reference to the set. Here is a subroutine that does this:

         void addReference(String term, int pageNum) {
               // Add a page reference to the index.
            TreeSet references; // The set of page references that we
                                //    have so far for the term.
            references = (TreeSet)index.get(term);  // Type-cast!
            if (references == null){
                   // This is the first reference that we have
                   // found for the term.  Make a new set containing
                   // the page number and add it to the index, with
                   // the term as the key.
                TreeSet firstRef = new TreeSet();
                firstRef.add( new Integer(pageNum) );
                index.put(term,firstRef);
            }
            else {
                  // references is the set of page references
                  // that we have found previously for the term.
                  // Add the new page number to that set.
               references.add( new Integer(pageNum) );
            }
         }

The only other thing we need to do with the index is print it out. We are going to have to print out sets of Integers. Let's write a separate subroutine to do that:

         void printIntegers( Set integers ) {
               // Assume that all the objects in the set are of
               // type Integer.  Print the integer values on
               // one line, separated by commas.  The commas
               // make this a little tricky.
            if (integers.isEmpty()) {
                 // There is nothing to print if the set is empty.
               return;
            }
            Integer integer;  // One of the Integers in the set.
            Iterator iter = integers.iterator(); // For traversing the set.
            integer = (Integer)iter.next(); // First item in the set.
                                            // We know the set is non-empty,
                                            // so this is OK.
            System.out.print(integer.intValue());  // Print the first item.
            while (iter.hasNext()) {
                  // Print additional items, if any, with separating commas.
               integer = (Integer)iter.next();
               System.out.print(", " + integer.intValue());
            }
         }

Finally, we need a routine that can iterate through all the terms in the map and print each term along with its list of page references. There are no iterators for maps. To iterate through a map, we need to use one of the Set views of the map. Since we want to print both the keys and the values, it's most efficient to use the entry set of the map. Here's the subroutine:

         void printIndex() {
               // Print each entry in the index.

            Set entries = index.entrySet();  
                 // The index viewed as a set of entries, where each
                 // entry has a key and a value.  The objects in
                 // this set are of type Map.Entry.

            Iterator iter = entries.iterator();
            
            while (iter.hasNext()) {
                  // Get the next entry from the entry set and print
                  // the term and list of page references that 
                  // it contains.
               Map.Entry entry = (Map.Entry)iter.next();
               String term = (String)entry.getKey();
               Set pages = (Set)entry.getValue();
               System.out.print(term + " ");
               printIntegers(pages);
               System.out.println();
            }
         }

This is not a lot of code, considering the complexity of the operations. The only really tricky part is the constant need for type-casting and the need to use wrapper objects for primitive types. Both of these are necessary because of the nature of generic programming in Java, that is, the fact that generic data structures hold values of type Object.

I have not written a complete indexing program, but the subroutines presented here are used in a related context in Exercise 12.5.


Using a Comparator

There is a potential problem with our solution to the indexing problem. If the terms in the index can contain both upper case and lower case letters, then the terms will not be in alphabetical order. The ordering on String is not alphabetical. It is based on the Unicode codes of the characters in the string. The codes for all the upper case letters are less than the codes for the lower case letters. So, for example, terms beginning with "Z" come before terms beginning with "a". If the terms are restricted to use lower case letters only (or upper case only), then the ordering would be alphabetical. But suppose that we allow both upper and lower case, and that we insist on alphabetical order. In that case, our index can't use the usual ordering for Strings. Fortunately, it's possible to specify a different method to be used for comparing the keys of a map. This is a typical use for a Comparator.

Recall that a Comparator is an object that implements the Comparator interface and defines the method:

            public int compare(Object obj1, Object obj2)

This method can be used to compare two objects. It should return an integer that is positive, zero, or negative, depending on whether obj1 is less than, equal to, or greater than obj2. We need a Comparator that will compare two Strings based on alphabetical order. The easiest (although not most efficient) way to do this is to convert the Strings to lower case and use the default comparison on the lower case Strings. The following class defines such a comparator:

         class AlphabeticalOrder implements Comparator {
                // Represents a Comparator that can be used for
                // comparing Strings according to alphabetical
                // order.  It is an error to apply this 
                // Comparator to objects that are non-strings.
            public int compare(Object obj1, Object obj2) {
               String str1 = (String)obj1;  // Type-cast objects to Strings.
               String str2 = (String)obj2;
               str1 = str1.toLowerCase();  // Convert to lower case.
               str2 = str2.toLowerCase();
               return str1.compareTo(str2);  // Compare lower-case Strings.
            }
         }

To solve our indexing problem, we just need to tell our index to use an object of type AlphabeticalOrder for comparing keys. This is done by providing the Comparator object as a parameter to the constructor. We just have to create the index with the command:

         TreeMap index = new TreeMap( new AlphabeticalOrder() );

This does work, but I've concealed one technicality. Suppose, for example, that the program calls addReference("aardvark",56) and that it later calls addReference("Aardvark",102). The words "aardvark" and "Aardvark" differ only in that one of them begins with an upper case letter. When we insert them into the index, do they count as two different terms or as one term? The answer depends on the way that a TreeMap tests objects for equality. In fact, TreeMaps and TreeSets always use a Comparator object or a compareTo method to test for equality. They do not use the equals() method. The Comparator that is used for the TreeMap in this example returns the value zero when it is used to compare "aardvark" and "Aardvark", so the TreeMap considers them to be the same. Page references to "aardvark" and "Aardvark" are combined into a single list. This is probably the correct behavior in this example. If not, some other technique must be used to sort the terms into alphabetical order.


Word Counting

The final example also deals with storing information about words. In Section 10.3, we looked at WordList.java, a program that makes a list of all the words in a file. In Section 2, we did the same thing using generic data structures. Now, we extend the problem so that instead of just listing the words, the program also counts the number of times each word occurs in the file. The output consists of two lists, one sorted alphabetically and one sorted according to the number of occurrences, with the most common words at the top and the least common at the bottom. The same problem was assigned in Exercise 10.1 and solved without using generic data structures.

As the program reads an input file, it must keep track of how many times it encounters each word. We could simply throw all the words, with duplicates, into a list and count them later. But that would require a lot of storage and would not be very efficient. A better method is to keep a counter for each word. The first time the word is encountered, the counter is initialized to 1. On subsequent encounters, the counter is incremented. To keep track of the data for one word, the program uses a simple class that holds a word and the counter for that word. The class is a static nested class:

      static class WordData { 
            // Represents the data we need about a word:  the word and
            // the number of times it has been encountered.
         String word;
         int count;
         WordData(String w) {
               // Constructor for creating a WordData object when
               // we encounter a new word.
            word = w;
            count = 1;  // The initial value of count is 1.
         }
      } // end class WordData

The program has to store all the WordData objects in some sort of data structure. We want to be able to add new words efficiently. Given a word, we need to check whether a WordData object already exists for that word, and if it does, we need to find that object so that we can increment its counter. A Map can be used to implement these operations. Given a word, we want to look up a WordData object in the Map. This means that the word is the key, and the WordData object is the value. (It might seem strange that the key is also one of the instance variables in the value object, but in fact this is probably the most common situation: The value object contains all the information about some entity, and the key is one of those pieces of information.) After reading the file, we want to output the words in alphabetical order, so we should use a TreeMap rather than a HashMap. This program converts all words to lower case so that the default ordering on Strings will put the words in alphabetical order.

When the program reads a word from a file, it calls words.get(word) to find out if that word is already in the map, where words is a variable that refers to the TreeMap. If the return value is null, then this is the first time the word has been encountered, so a new WordData object is created and inserted into the map with the command words.put(word, new WordData(word)). If words.get(word) is not null, then it's the WordData object for this word, and the program only has to increment the counter in that object. Here is the subroutine that is used to read the words from the file:

   static void readWords(TextReader inStream, Map words) {
        // Read all words from inStream, and store data about them in words.

      try {
         while (true) { 
            while (! inStream.eof() && ! Character.isLetter(inStream.peek()))
               inStream.getAnyChar();  // Skip past non-letters.
            if (inStream.eof())
               break;  // Exit because there is no more data to read.
            String word = inStream.getAlpha();  // Read one word from stream.
            word = word.toLowerCase();
            WordData data = (WordData)words.get(word);
                // Check whether the word is already in the Map.  If not,
                // the value of data will be null.  If it is not null, then
                // it is a WordData object containing the word and the
                // number of times we have encountered it so far.
            if (data == null) {
                  // We have not encountered word before.  Add it to
                  // the map.  The initial frequency count is
                  // automatically set to 1 by the WordData constructor.
               words.put(word, new WordData(word) );
            }
            else {
                  // The word has already been encountered, and data is
                  // the WordData object that holds data about the word.
                  // Add 1 to the frequency count in the WordData object.
               data.count = data.count + 1;
            }
         }
      }
      catch (TextReader.Error e) {
         System.out.println("An error occurred while reading the data.");
         System.out.println(e.toString());
         System.exit(1);
      }

   } // end readWords()

After reading the words and printing them out in alphabetical order, the program has to sort the words by frequency count and print them again. To do the sorting using a generic algorithm, I defined a Comparator class for comparing two word objects according to their frequency counts:

   static class CountCompare implements Comparator {
         // A comparator for comparing objects of type WordData
         // according to their counts.  This is used for
         // sorting the list of words by frequency.
      public int compare(Object obj1, Object obj2) {
         WordData data1 = (WordData)obj1;
         WordData data2 = (WordData)obj2;
         return data2.count - data1.count;
            // The return value is positive if data2.count > data1.count.
            // I.E., data1 comes after data2 in the ordering if there
            // were more occurrences of data2.word than of data1.word.
            // The words are sorted according to decreasing counts.
      }
   } // end class CountCompare

Given this class, we can sort the WordData objects according to frequency by copying them into a list and using the generic method Collections.sort(List,Comparator). The WordData objects are the values in the map, words, and words.values() is a Collection that contains all these values. The constructor for the ArrayList class lets you specify a collection to be copied into the list when it is created. So, we can create a List containing the word data and sort that list according to frequency count using the commands:

      ArrayList wordsByCount = new ArrayList( words.values() );
      Collections.sort( wordsByCount, new CountCompare() );

You should notice that these two lines replace a lot of code! It requires some practice to think in terms of generic data structures and algorithms, but the payoff is significant in terms of saved time and effort.

The only remaining problem is to print the data to the output file. We have to print the data from all the WordData objects twice, first in alphabetical order and then sorted according to frequency count. The data is in alphabetical order in the TreeMap, or more precisely, in the values of the TreeMap. We can use an Iterator for the value collection, words.values(), to access the words in alphabetical order. Similarly, the words are in the ArrayList wordsByCount in the correct order according to frequency count, so we could use an Iterator for the ArrayList to access and print the data according to frequency count. Since we have to do essentially the same thing twice, we might as well write a subroutine to do it:

      static void printWords(PrintWriter outStream, Collection wordData) {
            // wordData must contain objects of type WordData.  The words
            // and counts in these objects are written to the output stream.
         Iterator iter = wordData.iterator();
         while (iter.hasNext()) {
            WordData data = (WordData)iter.next();
            outStream.println("   " + data.word + " (" + data.count + ")");
         }
      } // end printWords()

This is a generic subroutine. Since its second parameter is of type Collection, it can be applied to the collection words.values() as well as to the collection wordsByCount. This is the last piece needed for the program. You can find the complete program in the file WordCount.java.


With this section, we reach the end of Introduction to Programming Using Java. It has been a long journey, but I hope a worthwhile one and one that has left you prepared to move on to more advanced study of Java, programming, and computer science in general. Good luck and have fun!



End of Chapter 12

[ Previous Section | Chapter Index | Main Index ]