CPSC 225, Spring 2012
Lab 7: TreeSet Exercises

The point of this lab is to gain some experience with another of the JFC's classes: TreeSet. You will write a program to create and perform some operations on some rather large sets of words. Each set will be created by reading all the words in a book. The books are "Pride and Prejudice," "A Tale of Two Cities," and "Leaves of Grass." You will find copies of these books in plain text format in the directory /classes/cs225/books. They were downloaded from the Top 100 list for February 24 at Project Gutenberg. These three books were chosen partly because they are all about the same size, between 700 and 775 kilobytes.

To help you avoid the bad habit of using global variables, here are a few extra rules for this lab:

You can begin the lab by making a lab7 project folder. You will need a copy of TextIO.java in that folder. This lab is due next Friday, March 9. You should copy your lab7 folder into your homework folder by Friday afternoon.

Reading Words from a File

TextIO has the ability to read from a file. If you call TextIO.readFile(fileName), where fileName is a string giving the name of the file, then TextIO will start reading from the file instead of from standard input. For this lab, there is no need to copy the files into your own account. You can use the full path names of the files and read them directly from their original locations:

                /classes/cs225/books/pride-and-prejudice.txt
                /classes/cs225/books/tale-of-two-cities.txt
                /classes/cs225/books/leaves-of-grass.txt

(It should also be possible to place the files into your project folder and read them from there, using their simple names, such as "pride-and-prejudice.txt". To do this, put the files in the top level of the project folder, not in the src folder. I'm not sure that this will work in all cases.)

After using TextIO to read from a file, you can go back to reading from standard input by calling TextIO.readStandardInput(). (Standard input refers to input typed in by the user.)

You will need to read "words" from the file, skipping over punctuation, spaces, and end-of-lines. You can use the following method, taken from an example in the textbook. You should copy-and-paste it into your program. Read the comment carefully to understand what it does and how to use it.

/**
 * Read the next word from TextIO, if there is one.  First, skip past
 * any non-letters in the input.  If an end-of-file is encountered before 
 * a word is found, return null.  Otherwise, read and return the word.
 * A word is defined as a sequence of letters.  Also, a word can include
 * an apostrophe if the apostrophe is surrounded by letters on each side.
 * @return the next word from TextIO, or null if an end-of-file is encountered
 */
private static String readNextWord() {
    char ch = TextIO.peek(); // Look at next character in input.
    while (ch != TextIO.EOF && ! Character.isLetter(ch)) {
        TextIO.getAnyChar();  // Read the character.
        ch = TextIO.peek();   // Look at the next character.
    }
    if (ch == TextIO.EOF) // Encountered end-of-file
        return null;
    // At this point, we know that the next character, so read a word.
    String word = "";  // This will be the word that is read.
    while (true) {
        word += TextIO.getAnyChar();  // Append the letter onto word.
        ch = TextIO.peek();  // Look at next character.
        if ( ch == '\'' ) {
            // The next character is an apostrophe.  Read it, and
            // if the following character is a letter, add both the
            // apostrophe and the letter onto the word and continue
            // reading the word.  If the character after the apostrophe
            // is not a letter, the word is done, so break out of the loop.
            TextIO.getAnyChar();   // Read the apostrophe.
            ch = TextIO.peek();    // Look at char that follows apostrophe.
            if (Character.isLetter(ch)) {
                word += "\'" + TextIO.getAnyChar();
                ch = TextIO.peek();  // Look at next char.
            }
            else
                break;
        }
        if ( ! Character.isLetter(ch) ) {
            // If the next character is not a letter, the word is
            // finished, so break out of the loop.
            break;
        }
        // If we haven't broken out of the loop, next char is a letter.
    }
    return word;  // Return the word that has been read.
}

The TreeSet API

TreeSet is one of the classes in the Java Collection Framework. It is defined in the package java.util. It can be used to represent sets of objects. A set is a collection of objects in which no object can occur more than once. A TreeSet has the additional property that the objects in the set are stored in increasing order, so that if you traverse the set, you will visit the objects in sorted order. In particular, TreeSet<String> represents sets of strings, with the strings stored in lexicographic order (which means alphabetical order for strings of lower case letters). Here are some of the constructors and methods from the API for TreeSets of strings, where set is of type TreeSet<String>:

The Assignment

For this lab, you should write a program that analyzes and compares the sets of words used in each of the three books "Pride and Prejudice", "A Tale of Two Cities", and "Leaves of Grass." The files that you will need are listed above, and a method for reading words from a file is also given above. For each file, you should read the words from the file and put the words in a set. Convert all the words to lower case, since we don't, for example, want to count "folly" and "Folly" as being different words.

To "analyze" the set of words in a book, you should print out: the number of words; the first and last words alphabetically not counting the word "a"; and the number of words beginning with each letter of the alphabet.

You should then apply the same analysis to the set of words that occur in all three books. You can use the same method to do all of these analyses.

To "compare" two books, you should print the number of words that are in both books, the number that are in the first but not in the second, and the number that are in the second but not in the first. You should compare all pairs of books.

Finally, you should let the user enter a word and test whether that word is in each book; you should do this in a loop that allows the user to enter as many words as they want, followed by a blank line as a signal to end the program.

To do this in a general way, you can put all the file names and book titles into arrays, and process the arrays in a way that would work for any list of books. If you feel like it, download another book or two and add them to the analysis.

Note that the operations that you need to do should be done using the TreeSet API. It would take a lot more code to do them yourself, but you are not being asked to do that.

Here is part of the output from my version of the program:

                Pride and Prejudice
                   Number of different words:  6617
                   Last word alphabetically:   zip
                   First word alphabetically:  abatement
                   Number of words beginning with each letter:
                      a: 530
                      b: 263
                      c: 614
                   .
                   .
                   .
                Words used in both Pride And Prejudice and Tale Of Two cities:  4322
                Words used in Pride And Prejudice but not Tale Of Two cities:   2295
                Words used in Tale Of Two cities but not Pride And Prejudice:   5775
                   .
                   .
                   .
                Enter a word, or a blank line to end: folly
                Pride and Prejudice DOES contain "folly".
                A Tale of Two Cities does NOT contain "folly".
                Leaves of Grass does NOT contain "folly".