CPSC 225, Spring 2016 -- Lab 8

CPSC 225, Spring 2016
Lab 8: Books and Sets

We have been looking at collection classes in the Java Collection Framework. This lab asks you to JFC set operations. You will create and manipulate large sets of words. Each set will consist of all the words from a classic book, and the sets will be used to investigate and compare the vocabularies used in the books.

You will need a copy of TextIO.java. You should already have a copy in your Lab 1 project; you can copy that one or get another copy from /classes/cs225/Lab1-files. Aside from TextIO, you will write this week's program from scratch; no starting point is given. Make a new project named Lab 8. When I grade your work, I will print out any .java file, other than TextIO.java, that is in your Lab 8 folder.

This lab is due next week. It will be collected on Friday afternoon, October 28, at 3:00.

Words from the Classics

The complete text of seven classic novels and one play can be found in the folder /classes/cs225/books. It is not necessary to copy these files into your own account. Your program can access them directly from that folder. (But if you are working on your own computer, you will need to copy them onto your computer. However, please do not turn in copies with your work.) Here are a couple of global variables that you can use to store the full names of the files:

static final String DIRECTORY = "/classes/cs225/books/";
static final String[] bookFiles = {
        DIRECTORY + "20000-leagues.txt",
                  // 20,000 Leagues Under the Sea (Jules Verne)
        DIRECTORY + "alice-in-wonderland.txt",
                  // Alice in Wonderland (Lewis Carroll)
        DIRECTORY + "bleak-house.txt",
                  // Bleak House (Charles Dickens)
        DIRECTORY + "hamlet.txt",
                  // Hamlet (William Shakespeare)
        DIRECTORY + "huckleberry-finn.txt",
                  // Huckleberry Finn (Mark Twain)
        DIRECTORY + "pride-and-prejudice.txt",
                  // Pride and Prejudice (Jane Austin)
        DIRECTORY + "something-new.txt",
                  // Something New (P.G. Wodehouse)
        DIRECTORY + "time-machine.txt"
                  // The Time Machine (H.G. Wells)
};

You will use TextIO to read from the files. Recall that TextIO has a method TextIO.readFile(filename) that tells it to start reading from the specified file. After this command is executed, input will be taken from the file instead of from the user. For this program, the filename will always be one of the array elements bookFiles[i].

We define a "word" to be a sequence of letters, except that a word can also include apostrophes as long as they occur between two letters. It can be tricky to extract the words from a file. Here is a method, taken from one of the examples in the textbook, that you should use to read the words:

/**
 * Read the next word from TextIO, if there is one.  First, skip past
 * any non-letters in the input.  If an end-of-file is encountered before 
 * a word is found, return null.  Otherwise, read and return the word.
 * A word is defined as a sequence of letters.  Also, a word can include
 * an apostrophe if the apostrophe is surrounded by letters on each side.
 * @return the next word from TextIO, or null if an end-of-file is encountered
 */
private static String readNextWord() {
    char ch = TextIO.peek(); // Look at next character in input.
    while (ch != TextIO.EOF && ! Character.isLetter(ch)) {
        TextIO.getAnyChar();  // Read the character.
        ch = TextIO.peek();   // Look at the next character.
    }
    // Either the next char is a letter, or we are at end-of-file.
    if (ch == TextIO.EOF) // Encountered end-of-file
        return null;
    // At this point, we know that the next character is a letter, so read a word.
    String word = "";  // This will be the word that is read.
    while (true) {
        word += TextIO.getAnyChar();  // Append the letter onto word.
        ch = TextIO.peek();  // Look at next character.
        if ( ch == '\'' ) {
                // The next character is an apostrophe.  Read it, and
                // if the following character is a letter, add both the
                // apostrophe and the letter onto the word and continue
                // reading the word.  If the character after the apostrophe
                // is not a letter, the word is done, so break out of the loop.
            TextIO.getAnyChar();   // Read the apostrophe.
            ch = TextIO.peek();    // Look at char that follows apostrophe.
            if (Character.isLetter(ch)) {
                word += "\'" + TextIO.getAnyChar();
                ch = TextIO.peek();  // Look at next char.
            }
            else
                break;
        }
        if ( ! Character.isLetter(ch) ) {
                // If the next character is not a letter, the word is
                // finished, so break out of the loop.
            break;
        }
        // If we haven't broken out of the loop, next char is a letter.
    }
    return word;  // Return the word that has been read.
}

Note that readNextWord() returns null if it has reached the end of the file.

The Program

As a first step, you should write a a method that can read the words from a file, store them in a set, and output certain required information. The method can take the file name as its parameter, and it should return a set that contains all the words in the file. All words must be converted into lower case (or upper case, if you like that better). Either HashSet<String> or TreeSet<String> would work as the type for the set. The method must print out: (1) The total number of words read from the file, counting duplicates (#1 does not use any sets); (2) The number of different words found in the file; (3) The number of words that only occurred once in the file; and (4) The percentage of words that occurred only once as a percentage of the number of different words. Here is a sample output with what I hope are the correct values for The Time Machine:

From the file: /classes/cs225/books/time-machine.txt:
    Total number of words: 32,653
    Number of different words: 4,607
    Number of words that occur only once: 2,415 (52.4%)

(If you are wondering how I got the commas in numbers such as 32,653, you can do that with System.out.printf using the output format like %,d with a comma between the percent sign and the d.)

We discussed in class how to find the number of words that occur only once: Make a second set containing words that are duplicates of previously found words. When you read a word, you can add it to the set of duplicates if it is not already in the set of all words. At the end, subtract the size of the set of duplicates from the size of the set of all words. You will have no further need for the set of duplicates, so it can be a local variable in the method. The return value of the method is the set of all words.

Your program should call the method for each file and save the result in an ArrayList. Assuming that your sets are HashSets, the type for the array list is

ArrayList<HashSet<String>>

That is, the elements of the list are themselves sets. (Note that for ugly technical reasons in Java, you shouldn't try to use an array to hold the sets.)

The rest of the lab consists of doing various things with the list of sets. You can do most of them using set methods such as addAll, removeAll and retainAll. In one case, you will have to traverse a set using a for-each loop. You should label the output clearly and arrange it attractively. You should not simply do everything in the main program! You should use well-designed subroutines to perform various tasks. Here are the assignments, some expressed as questions for which you should compute an answer:

How many different words, in total, are used in all eight books combined?
How many words are common to all eight books?
Among the words that are common to all eight books, what is the longest word? (There is only one.)
For each book, determine how many words from that book are not used in any of the other seven books.
For each pair of books, determine how many words they have in common. (It would be good to present the results in a neat table.)

Remember that the object is to do things efficiently, using set operations!

You can do other things for some extra credit. It would nice if you think of something yourself, but here are some ideas. Make a table of the number of words of each length in one of the sets. Compare the average lengths of the words in each book. Put the unsorted_words.txt from Lab 1 in a HashSet, and use it to find out how many words from a book are, or are not, in that list. Since the files are very different lengths, but all greater than 26,000, repeat some of the analysis using only 26,000 words from each file (which might make the comparisons more meaningful). Print out the set of words that appear in all eight books, in alphabetical order, in, say, six neat columns.