CPSC 225, Spring 2009, Lab 6

CPSC 225, Spring 2009
Lab 6: Random Expressions

This week is the first of two labs that deal with Expression Trees. You will want to be sure to finish this week's lab before next Tuesday, but your work for the two labs will be turned in together, two weeks from the date of this lab.

The basic idea here is that we imagine that we have a set of sample data, and we would like to find an expression that matches that data as well as possible. The data is in the form of input/output pairs. That is, we have a set of input values to some function or physical process, and we have the output that is produced for each input value. We would like to find an expression that will give approximately the same output for each of the sample input values.

In order to do this, we can try generating random expressions. The expressions can include the variable "x" to represent the value of an input. Each expression that we generate can be tested on all the sample inputs, and we can keep track of which expression gives the best overall approximation to the sample outputs. After a certain amount of time, we simply take the best expression that we have found so far.

For this lab, you will implement the procedure described in the preceding paragraph. Next week, we will try to apply genetic programming to the problem and see how much better it does (if at all).

The file /classes/s09/cs225/Expressions.java contains several static nested classes that define expression nodes, as we discussed in class. You should get a copy of this file and add it to your Eclipse project. To keep things simple, you can add put the code that you write for the two labs into this file.

The file contains a recursive copy() method that makes a full copy of an expression tree. You will need this method next week, but for now it just serves as a model of recursive tree processing. It might help you to think about the randomExpression() method that you will have to write. The file also contains a makeTestData() method that is meant to produce the sample input/output data that you want to match. Finally, there is a main() method whose purpose is to test the other stuff in the file; you will have to replace this method.

Random Expression

Your first task is to write a recursive method that creates a random expression. You don't want to make a tremendously complex expression (and certainly not one that is infinitely long...), so the method should have a parameter that limits the "height" of the expression tree. The height is the greatest number of steps from the "root" of the tree to one of the "leaves" at the bottom of the tree. In the recursive method, the height of the tree is the same as the number of levels in the recursion. The method should have a parameter should represent the maximum height:

            static ExpNode randomExpression(int maxHeight) {...

The method can randomly decide what type of node to create. If maxHeight is zero, the only possibilities should be a constant node or a variable node. If the maxHeight is greater than zero, any type of node is possible (even constant or variable nodes; maxHeight gives the maximum allowable height, not the actual height). To make a binary operator node, you have to create random expressions for the left and right operands of the binary operator; you can do this recursively, passing maxHeight-1 as the value of the parameter. (Note: For constant nodes, I suggest that you limit yourself to fairly small numerical values.)

Add some code to the main method to test your function. Print out a bunch of randomly created expressions and their values.

Measuring Closeness

Once you have a random expression, you have to test it against the sample input/output data created by the makeTestData() method. When you do this, you will have two outputs for each input -- the one in the sample data and the one produced by your expression. The expression's outputs are supposed to be an approximation for the sample outputs. The question is, how to measure the closeness of the approximation?

A common measurement for such situations is the Root Mean Square Error (RMSE). This is given by the square root of the average of the squares of the errors for all the individual inputs. Mathematically, the RMSE is defined as

where x_i are the sample inputs, y_i are the corresponding sample outputs, and f(x_i) are the values of the expression for the given inputs. You should write a method

            static RMSError( ExpNode f, double[][] sampleData ) {...

to compute the Root Mean Square Error. Once you have this method, you can write a main method that generates a bunch of random expressions, finds the RMSE for each of them on the sample data, and keeps track of the one that has the smallest RMSE. Do this in a while loop that runs for, say, 10 seconds. After the loop ends, print out the best expression found and its RMS Error.

(Note: There is one problem that you might want to think about. It is possible for the value of an expression to be infinite (if it divides a non-zero number by zero) or NaN (if it divides zero by zero). Such things can happen easily in randomly generated expressions. If you simply use the formula given above for the RMS Error, the answer will be infinite or NaN. This is not a big problem for today's lab, but you might want to think about what would be the right value for the RMSE when it happens. It will be more of an issue next week.)

Adding Another Node Type

The expression node classes that I have given you include only a constant node class, a variable node class, and a binary operator class. Add one more type of expression node, UnaryOpNode. Use it to represent (at least) the mathematical functions sin, cos, exp, and abs. Each of these functions can be considered to be an operator that applies to just one operand. (The "unary minus" operator discussed in the book is another example, and you can add that to the class if you want.)

I suggest that you use integers to code the possible functions (for example, 0 for sin, 1 for cos, and so on). You might want to define constants such as "final static int COS = 1" to give names to the constants.

Besides defining the UnaryOpNode class, you will have to take account of the existence of this class in the copy() method and in the randomExpression() method. Don't forget to test your work.

David Eck, for CPSC 225