Solution for
Programming Exercise 12.5


THIS PAGE DISCUSSES ONE POSSIBLE SOLUTION to the following exercise from this on-line Java textbook.

Exercise 12.5: One of the examples in Section 12.4 concerns the problem of making an index for a book. A related problem is making a concordance for a document. A concordance lists every word that occurs in the document, and for each word it gives the line number of every line in the document where the word occurs. All the subroutines for creating an index that were presented in Section 12.4 can also be used to create a concordance. The only real difference is that the integers in a concordance are line numbers rather than page numbers.

Write a program that can create a concordance. The document should be read from an input file, and the concordance data should be written to an output file. The names of the input file and output file should be specified as command line arguments when the program is run. You can use the indexing subroutines from Section 12.4, modified to write the data to a file instead of to System.out. You will also need to make the subroutines static. If you need some help with using files and command-line arguments, you can find an example in the sample program WordCount.java, which was also discussed in Section 12.4.

As you read the file, you want to take each word that you encounter and add it to the concordance along with the current line number. Your program will need to keep track of the line number. The end of each line in the file is marked by the newline character, '\n'. Every time you encounter this character, add one to the line number. One warning: The method in.eof(), which is defined in the TextReader, skips over end-of-lines. Since you don't want to skip end-of-line characters, you should not use in.eof() -- at least, you should not use it in the same way that it is used in the program WordCount.java. The function in.peek() from the TextReader class returns the nul character '\0' at the end of the file. Use this function instead of in.eof() to test for end-of-file.

Because it is so common, don't include the word "the" in your concordance. Also, do not include words that have length less than 3.


Discussion

Solving this exercise, for the most part, means collecting subroutines that were presented in Section 12.4 into a complete program. The subroutines addReference() and printIntegers() were copied directly, as was the class AlphabeticalOrder. The printIndex() subroutine was copied, with its name changed to printConcordance(). I made a few cosmetic changes, such as changing references to "page" into references to "line".

It remains to write the main() routine, which is similar in outline to the main() routine in WordCount.java. One difference is that we have to keep track of line numbers. We can do this by looking for new-line characters as we skip over the characters between words. A pseudocode algorithm can be given as:

         Open the input and output files.

         Let lineNum = 1.

         Repeat:
            while the next character is not end-of-file or a letter:
               Read the next character.
               If it is a new line character:
                  Add 1 to lineNum
            // After the while loop, we are looking at either the end of file
            //    or at a letter that is the beginning of the next word.
            If at end-of-file:
               Exit from the loop.
            Get the next word from the input file.
            if the word is not "the" and has length > 2:
               Add the word and lineNum to the concordance.

         Print the concordance.

This can be translated directly to give the main() routine. A subroutine for opening the input and output files is copied directly from WordCount.java.


The Solution

   /* 
      This program makes a concordance for a file.  A concordance
      lists all the words that occur in the file, along with all
      the line numbers on which each word occurs.  (Words of
      length less than 3 are omitted, and "the" is omitted.)  The
      concordance is written to an output file.  The names of the
      input and output files must be specified on the command line
      when the program is run.  This program uses the non-standard
      classes defined in the file TextReader.java.
   */

   import java.util.*;
   import java.io.*;

   public class Concordance {

      static class AlphabeticalOrder implements Comparator {
             // Represents a Comparator that can be used for
             // comparing Strings according to alphabetical
             // order.  It is an error to apply this 
             // Comparator to objects that are non-strings.
         public int compare(Object obj1, Object obj2) {
            String str1 = (String)obj1;  // Type-cast object to Strings.
            String str2 = (String)obj2;
            str1 = str1.toLowerCase();  // Convert to lower case.
            str2 = str2.toLowerCase();
            return str1.compareTo(str2);  // Compare lower-case Strings.
         }
      }

      static TextReader in;   // An input stream for reading the input file.
      static PrintWriter out; // Output stream for writing the output file.

      static TreeMap index = new TreeMap(new AlphabeticalOrder());
          // This TreeMap holds the concordance.  Words from the file
          // are used as keys in the map.  The value associated with
          // each word is a set that contains the line numbers on which
          // the word occurs in the file.  The set contains values
          // belonging to the wrapper class, Integer.


      public static void main(String[] args) {

         openFiles(args);   // Open input and output files.

         int lineNum = 1;  // The number of the line in the input
                           // file that is currently begin processed.

         while (true) {
            while (in.peek() != '\0' && ! Character.isLetter(in.peek())) {
                  // Skip over non-letter characters, stopping when 
                  // end-of-file ('\0') or a letter is seen.  If the
                  // character is an end-of-line character, add one
                  // to lineNum to reflect that fact that we are moving
                  // on to the next line in the file.
               char ch = in.getAnyChar();
               if (ch == '\n') {
                  lineNum++;  // Start a new line.
               }
            }
            if (in.eof()) {
                 // The end-of-file has been reached, so exit from the loop.
               break;
            }
            String word = in.getAlpha();  // The next word from the file.
            if (word.length() > 2  && !word.equalsIgnoreCase("the")) {
                 // Add the reference to word to the concordance, unless
                 // the word is "the" or the word has length <= 2.
               addReference(word,lineNum);
            }
         }

         printConcordance();  // Print the data to the output file.

         if (out.checkError()) {
               // Some error occurred on the output stream.
            System.out.println("An error occurred while writing the data.");
            System.out.println("Output file might be missing or incomplete.");
            System.exit(1);
         }

         System.out.println(index.size() + " distinct words were found.");

      }

      static void openFiles(String[] args) {
             // Open the global variable "in" as an input file with name 
             // args[0].  Open the global variable "out" as an output
             // file with name args[1].  If args.length != 2, or if an 
             // error occurs while trying to open the files, then an 
             // error message is printed and the program will be terminated.
         if (args.length != 2) {
            System.out.println("Error: Please specify file names on command line.");
            System.exit(1);
         }
         try {
            in = new TextReader(new FileReader(args[0]));
         }
         catch (IOException e) {
            System.out.println("Error: Can't open input file " + args[0]);
            System.exit(1);
         }
         try {
            out = new PrintWriter(new FileWriter(args[1]));
         }
         catch (IOException e) {
            System.out.println("Error: Can't open output file " + args[1]);
            System.exit(1);
         }
      } // end openFiles()


      static void addReference(String term, int lineNum) {
            // Add a line reference to the concordance.  The
            // concordance is stored in the global variable, index.
            // The word term has been found on line number lineNum.
         TreeSet references; // The set of line references that we
                             //    have so far for the term.
         references = (TreeSet)index.get(term);  // Type-cast!
         if (references == null){
                // This is the first reference that we have
                // found for the term.  Make a new set containing
                // the line number and add it to the index, with
                // the term as the key.
             TreeSet firstRef = new TreeSet();
             firstRef.add( new Integer(lineNum) );
             index.put(term,firstRef);
         }
         else {
               // references is the set of line references
               // that we have found previously for the term.
               // Add the new line number to that set.
            references.add( new Integer(lineNum) );
         }
      }

      static void printConcordance() {
            // Print each entry from the concordance to the output file.
         Set entries = index.entrySet();  
              // The index viewed as a set of entries, where each
              // entry has a key and a value.  The objects in
              // this set are of type Map.Entry.
         Iterator iter = entries.iterator();
         while (iter.hasNext()) {
               // Get the next entry from the entry set and print
               // the term and list of line references that 
               // it contains.
            Map.Entry entry = (Map.Entry)iter.next();
            String term = (String)entry.getKey();
            Set lines = (Set)entry.getValue();
            out.print(term + " ");
            printIntegers(lines);
            out.println();
         }
      }

      static void printIntegers( Set integers ) {
            // Assume that all the objects in the set are of
            // type Integer.  Print the integer values on
            // one line, separated by commas.  The commas
            // make this a little tricky, since no comma
            // is printed before the first integer.
         if (integers.isEmpty()) {
              // There is nothing to print if the set is empty.
            return;
         }
         Integer integer;  // One of the Integers in the set.
         Iterator iter = integers.iterator(); // For traversing the set.
         integer = (Integer)iter.next(); // First item in the set.
                                         // We know the set is non-empty,
                                         // so this is OK.
         out.print(integer.intValue());  // Print the first item.
         while (iter.hasNext()) {
               // Print additional items, if any, with separating commas.
            integer = (Integer)iter.next();
            out.print(", " + integer.intValue());
         }
      }


   } // end class Concordance

[ Exercises | Chapter Index | Main Index ]