CPSC 225, Spring 2012 -- Lab 7

CPSC 225, Spring 2012
Lab 7: TreeSet Exercises

The point of this lab is to gain some experience with another of the JFC's classes: TreeSet. You will write a program to create and perform some operations on some rather large sets of words. Each set will be created by reading all the words in a book. The books are "Pride and Prejudice," "A Tale of Two Cities," and "Leaves of Grass." You will find copies of these books in plain text format in the directory /classes/cs225/books. They were downloaded from the Top 100 list for February 24 at Project Gutenberg. These three books were chosen partly because they are all about the same size, between 700 and 775 kilobytes.

To help you avoid the bad habit of using global variables, here are a few extra rules for this lab:

Your program should not use any global variables.
You must write three methods: one to read all the words from a book, one to print information about the set of words in a book, and one to compare the sets of words in two different books.

You can begin the lab by making a lab7 project folder. You will need a copy of TextIO.java in that folder. This lab is due next Friday, March 9. You should copy your lab7 folder into your homework folder by Friday afternoon.

Reading Words from a File

TextIO has the ability to read from a file. If you call TextIO.readFile(fileName), where fileName is a string giving the name of the file, then TextIO will start reading from the file instead of from standard input. For this lab, there is no need to copy the files into your own account. You can use the full path names of the files and read them directly from their original locations:

                /classes/cs225/books/pride-and-prejudice.txt
                /classes/cs225/books/tale-of-two-cities.txt
                /classes/cs225/books/leaves-of-grass.txt

(It should also be possible to place the files into your project folder and read them from there, using their simple names, such as "pride-and-prejudice.txt". To do this, put the files in the top level of the project folder, not in the src folder. I'm not sure that this will work in all cases.)

After using TextIO to read from a file, you can go back to reading from standard input by calling TextIO.readStandardInput(). (Standard input refers to input typed in by the user.)

You will need to read "words" from the file, skipping over punctuation, spaces, and end-of-lines. You can use the following method, taken from an example in the textbook. You should copy-and-paste it into your program. Read the comment carefully to understand what it does and how to use it.

/**
 * Read the next word from TextIO, if there is one.  First, skip past
 * any non-letters in the input.  If an end-of-file is encountered before 
 * a word is found, return null.  Otherwise, read and return the word.
 * A word is defined as a sequence of letters.  Also, a word can include
 * an apostrophe if the apostrophe is surrounded by letters on each side.
 * @return the next word from TextIO, or null if an end-of-file is encountered
 */
private static String readNextWord() {
    char ch = TextIO.peek(); // Look at next character in input.
    while (ch != TextIO.EOF && ! Character.isLetter(ch)) {
        TextIO.getAnyChar();  // Read the character.
        ch = TextIO.peek();   // Look at the next character.
    }
    if (ch == TextIO.EOF) // Encountered end-of-file
        return null;
    // At this point, we know that the next character, so read a word.
    String word = "";  // This will be the word that is read.
    while (true) {
        word += TextIO.getAnyChar();  // Append the letter onto word.
        ch = TextIO.peek();  // Look at next character.
        if ( ch == '\'' ) {
            // The next character is an apostrophe.  Read it, and
            // if the following character is a letter, add both the
            // apostrophe and the letter onto the word and continue
            // reading the word.  If the character after the apostrophe
            // is not a letter, the word is done, so break out of the loop.
            TextIO.getAnyChar();   // Read the apostrophe.
            ch = TextIO.peek();    // Look at char that follows apostrophe.
            if (Character.isLetter(ch)) {
                word += "\'" + TextIO.getAnyChar();
                ch = TextIO.peek();  // Look at next char.
            }
            else
                break;
        }
        if ( ! Character.isLetter(ch) ) {
            // If the next character is not a letter, the word is
            // finished, so break out of the loop.
            break;
        }
        // If we haven't broken out of the loop, next char is a letter.
    }
    return word;  // Return the word that has been read.
}

The TreeSet API

TreeSet is one of the classes in the Java Collection Framework. It is defined in the package java.util. It can be used to represent sets of objects. A set is a collection of objects in which no object can occur more than once. A TreeSet has the additional property that the objects in the set are stored in increasing order, so that if you traverse the set, you will visit the objects in sorted order. In particular, TreeSet<String> represents sets of strings, with the strings stored in lexicographic order (which means alphabetical order for strings of lower case letters). Here are some of the constructors and methods from the API for TreeSets of strings, where set is of type TreeSet<String>:

set = new TreeSet<String>() --- construct a new, empty set.
set = new TreeSet<String>(anotherSet) --- constructs a copy of a set, where anotherSet is an existing TreeSet<String>.
set.size() --- returns the number of items in the set.
set.isEmpty() --- returns true if the set is empty, false if not.
set.add(str) --- adds a String, str, to the set; has no effect if str is already in the set.
set.remove(str) --- removes a String, str, from the set, if it is present; it is not an error to try to remove a string that is not in the set.
set.contains(str) --- tests whether str is in the set; returns true if it is, false if not.
set.first() --- returns the first element of the set.
set.last() --- returns the last element of the set
set.subset(start,end) --- returns a set, of type TreeSet<String>, that consists of all the elements of set that are greater than or equal to start and strictly less than end, where start and end are strings. For example, set.subset("a","b") returns the set of elements of set that begin with the letter 'a'.
set.removeAll(anotherSet) --- removes from set any element of set that is also an element of anotherSet, leaving only those elements that are in set but not in anotherSet. Note that this method call modifies set.
set.retainAll(anotherSet) --- removes from set any element of set that is not an element of anotherSet, leaving only those elements that are in both set and anotherSet. Mathematically, this computes the intersection of the two sets. Note that this method call modifies set.

The Assignment

For this lab, you should write a program that analyzes and compares the sets of words used in each of the three books "Pride and Prejudice", "A Tale of Two Cities", and "Leaves of Grass." The files that you will need are listed above, and a method for reading words from a file is also given above. For each file, you should read the words from the file and put the words in a set. Convert all the words to lower case, since we don't, for example, want to count "folly" and "Folly" as being different words.

To "analyze" the set of words in a book, you should print out: the number of words; the first and last words alphabetically not counting the word "a"; and the number of words beginning with each letter of the alphabet.

You should then apply the same analysis to the set of words that occur in all three books. You can use the same method to do all of these analyses.

To "compare" two books, you should print the number of words that are in both books, the number that are in the first but not in the second, and the number that are in the second but not in the first. You should compare all pairs of books.

Finally, you should let the user enter a word and test whether that word is in each book; you should do this in a loop that allows the user to enter as many words as they want, followed by a blank line as a signal to end the program.

To do this in a general way, you can put all the file names and book titles into arrays, and process the arrays in a way that would work for any list of books. If you feel like it, download another book or two and add them to the analysis.

Note that the operations that you need to do should be done using the TreeSet API. It would take a lot more code to do them yourself, but you are not being asked to do that.

Here is part of the output from my version of the program:

                Pride and Prejudice
                   Number of different words:  6617
                   Last word alphabetically:   zip
                   First word alphabetically:  abatement
                   Number of words beginning with each letter:
                      a: 530
                      b: 263
                      c: 614
                   .
                   .
                   .
                Words used in both Pride And Prejudice and Tale Of Two cities:  4322
                Words used in Pride And Prejudice but not Tale Of Two cities:   2295
                Words used in Tale Of Two cities but not Pride And Prejudice:   5775
                   .
                   .
                   .
                Enter a word, or a blank line to end: folly
                Pride and Prejudice DOES contain "folly".
                A Tale of Two Cities does NOT contain "folly".
                Leaves of Grass does NOT contain "folly".