CPSC 225, Fall 2016
Lab 2: Sort and Search Times

Our second lab is a little different from most programming labs. The main goal is to measure and compare the run times of a variety of sorting and searching algorithms. You will write programs to do the measurements, but I am just as interested in your results and analysis as in the programs themselves. In addition to writing the programs, you will turn in a written report about your results.

You should make a new Eclipse project named Lab2 to hold your work for this lab. (If you choose not to use Eclipse, you should use a regular folder named Lab2.)

For Part 1 of the lab, you will need copies of the files WordList.java and unsorted_words.txt, which you used in Lab 1. To copy a file from Lab1 to Lab2 from within Eclipse, open "Lab1", "src", and "(default package)" in the Eclipse Project Explorer, select the file, and "Copy" it using a menu or Control-C. Then select the "src" folder in "Lab2" and "Paste" the file using a menu or Control-V.

You will write two programs from scratch, named SearchTimes.java and SortTimes.java. The exact design of these programs is to a large extent up to you. You will also create a new class file named SortedWordList.java. For your report, you can either turn in a printout or you can write your report in a plain text file named report.txt and turn in the file along with your programs. (Please do not submit a word processing document!) You can create report.txt in Eclipse by selecting the Lab2 src and using the menu command "New" / "File".

You should turn in your work by copying the Lab2 folder that contains your work into your homework folder in /classes/cs225/homework. You can copy an entire project from your Eclipse workspace. Alternatively, you can submit a folder named Lab2 that contains the all the files for the lab (SearchTimes.java, SortTimes.java, WordList.java, SortedWordList.java, unsorted_words.txt, and report.txt). The work should be turned in by the start of the next lab.

For this lab, you can work with a partner if you want. If two people work together, they should turn in only one assignment, in either person's homework folder. Be sure that both of your names are on the work!!

Measuring Elapsed Time

Java has a method System.currentTimeMillis() that returns the current time, measured in milliseconds since an arbitrary starting point. However, the time can only be accurate to within one millisecond, and it is sometimes useful to have a more accurate measure of time.

The function System.nanoTime() also returns the current time, measured from some arbitrary starting time, but it measures time in nanoseconds. There are one billion nanoseconds in a second. This function is not actually accurate to the nanosecond, but it probably has significantly more than one millisecond accuracy. You should use System.nanoTime to measure time for this lab. The return value is of type long. So, to measure the time taken by some computation in Java, you can do something like this:

long startTime, endTime;
double elapsedTimeInSeconds;
startTime = System.nanoTime();
   .
   .  // do some computation
   .
endTime = System.nanoTime();
elapsedTimeInSeconds = ( endTime - startTime ) / 1.0e9;

There is another problem with measuring compute time in Java: A Java program gets faster as it runs! That's because the Java virtual machine starts out in interpreted mode, where it inspects each instruction, decides what it means, and carries out. This is much slower than executing instructions in your computer's native machine language. (Instructions in a class file are Java bytecode instructions, not native machine language.) However, while it is interpreting instructions, the virtual machine can also compile those instructions into native machine language. Then when that segment of code is executed again, it can be run as fast native machine language rather than slow interpreted instructions. This is called "just-in-time compilation." (Java doesn't automatically compile all the bytecode that it executes; it has some rules for guessing when it worth the effort to do the compilation.)

This is great for running programs quickly, but if you are trying to get an accurate measure of compute time, it's a problem if the compute times change as the program runs!

To get more accurate measurements, you can tell Java to turn off its just-in-time compiler and to run in interpreted mode only. This is done with the -Xint option to the java command. For example:

java  -Xint  SearchTimes

In Eclipse, of course, you run a program with a "Run" command rather than on the command line. In Eclipse, you can use the "Run Configuration" dialog to add an option such as -Xint. It's easiest to do this if you have already run the program once. Use the "Run Configuration" command in the "Run" menu to open the Run Configuration dialog. Assuming that you have already run the program at least once, you will see a run configuration for the program. Select the run configuration for that program, click the "Arguments" tab, and enter -Xint into the VM Arguments box:

The run configuration will remember the option; you do not have to enter it every time you run the program. As an alternative, you can run a program on the command line in a Terminal, even if you are working on the program in Eclipse. In a Terminal, cd into your Eclipse workspace, then into the folder for your project, then into the "bin" directory. The bin directory contains your compiled class files. You can use the "java" command in that directory to run the program.

Part 1: Search Times

Last week, you worked with a WordList class that uses a deliberately very inefficient linear search algorithm. We have seen that binary search should be much faster than linear search. For the first part of the lab, you will make a new word list class that uses binary search, and you will write a program to compare the search times for the two algorithms.

Start by making a copy of WordList.java, and name the copy SortedWordList.java. (In Eclipse, you can use Control-C on WordList.java, then paste it into the same folder. You will be asked to specify the name for the new file.)

To use binary search, the array of legal words in SortedWordList will have to be sorted. You can sort the array, which is named words, in the construtor, after it has been read from the file. Java has a built-in method for sorting arrays:

Arrays.sort(array);

It works for arrays of numbers as well as arrays of strings. You can use this method to sort the list of words. You will have to import the Arrays class from the package java.util.

You will also have to modify the contains method in SortedWordList to do binary search. Fortunately, Java also has a built-in method for doing binary search. The function call

Arrays.binarySearch( array, item )

does a binary search for item in the array, and returns a int value. The return value is -1 if the item is not in the array. If the item is in the array, then the return value is >=0 (in fact, it is the location of the item in the array). Of course, it will only return a valid answer if the array has been sorted.


You have a class WordList that uses linear search and a class SortedWordList the uses binary search. You should design and write a program that compares the run times of the two algorithms. Create a new program named SearchTimes.java to do the comparison.

A single search, especially binary search, doesn't take enough time to measure accurately. Instead, you should do a large number of searches and measure the time that it takes to execute the whole group of searches.

One idea is to make a large list of words selected at random from the word list, and then to search for each word from the list, first using SortedWordList and then using WordList. (Note that methods named size and get in the word list classes.) You could even try searching for every word in the list, but that will probably take too long for linear search, especially if you use the -Xint option to run the program. You could also do a meaningful comparison by making up a word that is not in the list and searching for it some large number of times. Note that the first idea measures times for successful searches while the second measures times for unsuccessful searches. You might want to do both.

(Do you see why it's not a good idea to pick a valid word and search for the same word a large number of times? The search time for linear search depends strongly on the position of the word in the list. You need a random sample of words to get an accurate idea of the average search time for words in the list.)

Part 2: Sort Times

For the second part of the lab, you will write a program that measures the time required to sort arrays using selection sort, insertion sort, and the built-in Arrays.sort method. You will sort both arrays of double values and arrays of String. Start by creating a new program named SortTimes.java. You can copy and paste the following selection sort and insertion sort methods from this web page into your program:

	
/**
 * Sorts A[0], A[1], ..., A[count-1] into increasing order using insertion sort.
 */
public static void insertionSort(double[] A, int count) {
    for (int top = 1; top < count; top++) {
        int pos = top;
        double temp = A[top];
        while (pos > 0 && A[pos-1] > temp) {
            A[pos] = A[pos-1];
            pos--;
        }
        A[pos] = temp;
    }
}

/**
 * Sorts A[0], A[1], ..., A[count-1] into increasing order using insertion sort.
 */
public static void insertionSort(String[] A, int count) {
    for (int top = 1; top < count; top++) {
        int pos = top;
        String temp = A[top];
        while (pos > 0 && A[pos-1].compareTo(temp) > 0) {
            A[pos] = A[pos-1];
            pos--;
        }
        A[pos] = temp;
    }
}

/**
 * Sorts A[0], A[1], ..., A[count-1] into increasing order using selection sort.
 */
public static void selectionSort(double[] A, int count) {
    for (int top = count-1; top > 0; top--) {
        int maxLoc = 0;
        for (int i = 1; i <= top; i++) {
            if (A[i] > A[maxLoc])
                maxLoc = i;
        }
        double temp = A[maxLoc];
        A[maxLoc] = A[top];
        A[top] = temp;
    }
}

/**
 * Sorts A[0], A[1], ..., A[count-1] into increasing order using selection sort.
 */
public static void selectionSort(String[] A, int count) {
    for (int top = count-1; top > 0; top--) {
        int maxLoc = 0;
        for (int i = 1; i <= top; i++) {
            if (A[i].compareTo(A[maxLoc]) > 0)
                maxLoc = i;
        }
        String temp = A[maxLoc];
        A[maxLoc] = A[top];
        A[top] = temp;
    }
}

The basic idea is to make an array filled with random numbers or random strings, and sort that array using each of the three sorting algorithms. (For random strings, you might consider selecting words at random from the word lists used in the first part of the lab, or you can think of some other way to make random strings.)

It's a good idea to apply the algorithms to identical data, so you should make three copies of each array, one for each algorithm. Remember that just saying "B=A" does not make a copy of the array A; it just makes B point to the same array as A. To make a copy, you have to create a new array, and put the same data in the second array. (Note that sorting an array that is already sorted will not give a valid result! You want to make sure that you are actually applying the algorithms to different arrays.)

Make sure that you only measure the time that it takes to sort the array. Don't include the time that it takes to fill the arrays with random values.

You should apply insertionSort(), selectionSort(), and Arrays.sort() to arrays of type double[] and to arrays of type String[]. You should sort arrays of several different sizes between about 1000 and 100000 for doubles, and for sizes between about 1000 and 10000 for strings. (It takes longer to sort strings.) You can use Arrays.sort on even larger arrays, if you want. For my program, I used powers of two from 1024 to 65536 or 8192. Run your program with the -Xint option to get accurate measurements. The program should neatly output the sort times. You will want to copy the results into your report.

Of course you should follow the usual style rules for programming, including the use of comments.

Part 3: Written Report

In addition to your programs, you should turn in your responses to the following exercises, either on paper or in a plain text file named report.txt. Don't just give short answers; be sure to justify your answers and explain your reasoning!

Exercise 1. This exercise goes back the spell checker that you wrote for Lab 1. The word list contains 72875 words, which means that it takes 72875 comparisons to determine that a word is not in the list. Suppose that the user enters mxyzptlk to your program. (This string has no valid corrections.) Exactly how many comparisons will your program make when spell checking this word? You will need to compute how many different possible corrections it tries.

Exercise 2. Report your results from Part 1 of the lab. Describe the experiments that you performed and what you observed. What were the times that you measured for searching the two lists? How did the times for linear search and binary search compare? How much faster was binary search? (About how many times faster would your spellcheck program be if it used binary search instead of linear search?)

Exercise 3. In Part 2 of the lab, you should have found that it takes significantly longer to search for a string than to search for a number. Explain why. (This is a short exercise!)

Exercise 4. Report your results from Part 2 of the lab. What were the measured times for the various experiments? How do the times for selection sort, insertion sort, and Arrays.sort compare? What pattern do you see in the times for sorting arrays of different sizes using selectionSort? insertionSort? Arrays.sort?

Exercise 5. Finally, I would like you to predict how long it would take each of the sorting methods to sort an array of one billion double values. Show your work and explain your reasoning. If it's not too mathematical for you, you should use the facts that selection sort and insertion sort have Θ(n2) run time, while Arrays.sort has Θ(n*log(n)) sort time. (Remember, for example, that Θ(n*log(n)) means that the run time is approximately proportional to n*log(n), for large enough n; that is, the run time is approximately k*n*log(n) for some constant k. The mathematical way to do the prediction is to use your data for smaller values of n to solve for the constant k, and then use the formula to find the run time for n equal to one billion.) We will discuss this in class.