CS 229: Foundations of Computation, Fall 2015
Regular Expressions Lab and Assignment

The class meets on Friday, October 23, in Rosenberg 009 for some work on using regular expressions on the command line. A programming assignment on using regular expressions in Java is due one week after the lab. The assignment consists of two short Java programs. They should be submitted to your homework folder by next Friday, October 30, at the end of the day.

There are a number of exercises in the lab that ask you for a command that can be used to accomplish some task. You can copy-and-paste your commands from the Terminal window into a text document. Or, if you prefer, you can hand-write them. Some of the exercises also ask you for the output of the command. You should include that answer in your respons. Turn in your hand-written responses or a printout of your text file. If you finish all the exercises in class, you can turn in your responses at the end of class. Otherwise, you should turn them in next Friday.



http://xkcd.org/208


UNIX Utilities and the Bash Shell

Both Linux and Mac OS X are "UNIX-like operating systems." Effectively, they are, at heart, versions of UNIX. What that means for us here is that they come with a number of standard command line utilities—small programs that can be run on the command line. The program that implements the command line itself is called a "shell" or "command shell." On Mac OS and most versions of Linux, the command shell program is bash. You run bash when you open a terminal window or when you log on remotely using a program such as ssh. Some of the commands that you are used to, such as cd are built into bash, but many are actually small programs.

Bash supports a basic programming langauge for scripting. The language has variables, assignment statements, if statements, loops, and functions. But here, we are interested in some of the more basic syntax.

First of all, note that many command-line programs are designed for processing text. These programs read from standard input and write to standard output. Typically, when a command is used to operate on a file, standard input comes from the file and standard output goes to the command line. For example, the command

head MyProgram.java

will read the first 10 lines from the file MyProgram.java in the current directory and will display those lines to the terminal window where the command was given. However, the shell can use input/output redirection. You can redirect the output from a command to a file by adding > filename to the end of the command. For example, the command

head MyProgram.java > MyProgram.tenlines

copies the first 10 lines of MyProgram.java to a file named MyProgram.tenlines. (Warning: An existing file will be overwritten without warning.) For redirecting output, use < in place of >.

But for this lab, we need the fact that you can pipe the output from one program into another program. This means that the standard output from the first program is sent to the standard input of the second program. The symbol for piping is the vertical bar, |. For example,

ls | head

runs the ls command, which lists the contents of a directory, and sends the output from that command into the head program. The head program then displays just the first ten lines of the output.

Another feature of UNIX commands is that they often have many features that can be enabled by adding an "option" to the command. Options have names that begin with "-" or "--". For example, The ls command options include -l for showing extra information for each directory item, -R for listing the contents of directories recursively, and -h, used along with -l for showing file sizes in more "human-readable" form. Single-letter options that begin with "-" can be combined; for example: ls -lh.

Here, then are a few of the command-line utilities that can be used in a UNIX-like opertating system, along with some of their options. For most of these commands, standard input can be taken from a file or can be piped from a previous program. If neither a file nor a pipe is provided, they might expect the user to type in the input.

Most commands can take multiple file names on a line. Note that file names use "globbing", which means that * and ? are treated as wildcard characters, with ? matching exactly one character and * matching any number of characters. For example, javac *.java will compile all .java files in the current directory. Note that *.java is not a regular expression here; the * does not mean repetition.

grep and egrep

The grep command takes a regular expression as its argument, and it prints every line from its input that contains a substring that matches the regular expression. grep uses a somewhat more limited syntax for regular expressions than we have studied. For the full set of features, use egrep instead of grep. For this lab, you should use egrep to avoid confusion about what features are and are not supported. When using egrep, enclose the regular expression in single quotes (but the quotes are only really necessary if the expression contains characters that are special in the bash shell). For example,

egrep '".*"' MyProgram.java

will print every line from MyProgram.java that contains a string, using the regular expression ".*" to match strings. (Grep would also work here.) And, using the pipe syntax discussed above,

egrep '".*"' MyProgram.java | wc -l

will just output the number of such lines that were found. The grep and egrep commands have several useful options, including

  • -i — does a case-insensitive match.
  • -v — prints out lines that do NOT match the regular expression.
  • -o — prints out just the part of the line that matches the expression. If there are several matching parts in one line, then they are all output, each one on a separate line.
  • -r — does recursive searching when applied to a directory; that is, all the files in the directory and in its sub-directories are searched.

Investigate User accounts.

On the Linux computers in our labs, the command ypcat passwd prints out account information for all the networked user accounts on our system. (Local, non-networked accounts are in the file /etc/passwd.) Try it. If you want to know how many such accounts there are, try the command

ypcat passwd | wc -l

Each output line contains seven fields, separated by colons. The first field is the user name. A student user name matches the regular expression '[a-z]{2}[0-9]{4}'.

Exercise 1. Write a command that will output the number of student accounts. You will need three individual commands, separated by two pipes. (Be sure to use egrep since grep doesn't support the {n} syntac.) How many are there? (Your response to this and all exercises should include the command that you used, plus your answer to any other question that was asked in the exercise.)

Exercise 2. The third field in the ouput of ypcat passwd is the user account number. Write a command that will print the smallest student account number in the file, with no other output. (Hint: Use sort -n as one of your commands. And you will need some of the other utilities discussed above.) What's the smallest student account number?

Exercise 3. The fifth field in the ouput of ypcat passwd is the user name. For student accounts, the name should consist a first name and a last name, separated by a space. Write a command that will print out all student names in the form similar to "Smith, John" including the quotation marks. Note that the order of the names is reversed, to give last name, first name. What happens for names that have extra spaces in them?

Exercise 4. Now, write a command that will print out the student user name, followed by a space, followed by the name in the same format as the previous exercise. For example: zz9999 "Smith, John". (For the previous exercise, you could use cut. For this one, you can use perl to get the data format that you need. You might use cut first to extract just the two fields that you need.)

Exercise 5. Finally, write a command that will output the same information in the same format as the previous example, but with the names in alphabetical order by last name. This is harder because you will have to put the information in one format for sorting, then rearrange the information into its final format..

(Note: I frequently do things like this with files. I might use grep and cut to filter some information from the file, but I am more likely to do the actual transformations in a text editor that supports regular expression find-and-replace.)

Investigate a Web Access Log

The 120-megabyte file /classes/cs229/access.log contains the access log for the web server on math.hws.edu for a recent six-day period. It has one line for each time someone on the Internet sent a request to our web server. In this part of the lab, you will work with that file. Since it is so large, don't copy it! You should cd into the directory /classes/cs229 and work there. The file has 483107 lines. (You could determine this with the command wc -l access.log.)

As you develop commands to operate on this file, you will sometimes need to see what the output from a command looks like, but You don't want to see 120 megabytes of output. Piping the output int0 head is a way to see just the first ten lines of output.

Exercise 6. The first thing on each line in the file is an IP address that identifies the computer that sent the request to our server. The IP address is followed by a space. Write a command that determines the number of different IP addresses that sent requests from the file; use cut to extract the IP address from the line. How many were there? Now, write a second command that does the same thing, but uses grep -o instead of cut to extract the IP address. The IP address is everything from the start of the line up to the first space.

Exercise 7. Lines that represent requests for specific files will contain a string of the form "GET followed by a space (a double quote, followed by GET, followed by a space). This is followed by the path name of the file and another space. The path name cannot contain a space. Write a command that determines how many requrests were made for files beginning with /javanotes/. How many were there? Now, write a command that will find out how many different IP addresses sent requrests for such files. How many were there?

Exercise 8. One of the fields on each line is the "referrer." If the request was generated by a user clicking on a link, the referrer is the web address of the page that contained that link. If the referrer field starts with http://www.google or https://www.google, then the request was the result of a Google search. You can assume that any line that contains this string represents a request generated from a Google search. Write a command to determine the number of such requests. (This is easy.) How many were there?

Exercise 9. The web site in the referrer field is terminated by the first slash (/) or a double quote ("). For example: http://www.google.com", https://www.google.co.in/, https://www.google.com.ng/. The last two are Google's Indian and Singapore sites. Write a command to determine how many different google sites occur in the referrer field. How many were there? (Note: The -o option for egrep will be useful.) What were some of the other countries where the Google sites were located? (See the list of country codes.)

Program 1: Use Regular Expressions in Java

You will write two short Java programs that use regular expressions. The Java support for regular expressions uses the classes Pattern and Matcher, which are in the package java.util.regex. So at the beginning of each program, you should say

import java.util.regex.*;

The first program is a simple exercise to make sure that you can use the two classes. (You can check the documentation for Pattern and Matcher. In particular, look at the example at the top of the Pattern documentation. Also, see the regular expression handout from class.)

Your program should ask the user to type in a regular expression. Read a line of input from the user, and pass it to the Pattern.compile method to create an object of type Pattern.

Then ask the user type in lines of text to be matched against the pattern. For each line of text that is input, create a Matcher for the text, using the matcher method in the Pattern object.

Call the matcher's matcher.find() method to test whether the string contains a substring matches the regular expression. This returns a boolean value to tell you whether a matching substring was found. Tell the user the result. If the regular expression included parentheses, then matcher.groupCount() is the number of left parentheses, and matcher.group(n) is the substring that matched group n for n between 1 and matcher.groupCount(). Also, matcher.group(0) is the entire matching substring. You should print out the substring for each group.

You can end the program when the user inputs an empty string. Here is an example of a session with a sample program:

   Input a regular expression:   ([a-zA-Z]*), ([a-zA-Z]*)
   
   Input a string:   Doe, John
   That string matches
         Group 0 matched  Doe, John
         Group 1 matched  Doe
         Group 2 matched  John

   Input a string:  Bond, 007
   That string does not match
   
   Input a string:  Fortunately, the reactor did not explode
   That string matches
         Group 0 matched  Fortunately, the
         Group 1 matched  Fortunately
         Group 2 matched  the

   Input a string:

Program 2: Extract Information from a File.

For the second program, you will use Pattern and Matcher in a more practical way. The program should take an HTML file as input. It should print out a list of all the web sites to which the page links. You can try out your program on the file /classes/cs124/index.html, which is a copy of the web page for this course. (You could get other web pages to try it on by using the Save command in a web browser.)

An HTML file is a text file that contains the content of the page as well as special "mark-up" code. In the file a link can look, for example, like one of these strings:

<a href="http://google.com">
<A target="mainframe" href=’data321.html’>
<a id=’link23’ HREF = "file23.html" target="_TOP">
<a href="images/myhouse.png">

The link must start with <a and contain href=. (These are case-insensitive.) There can be spaces around the "=", but not after the "<". The web site is in single or double quotes after href=. For the examples above, the web sites that I want you to extract are

http://google.com
data321.html
file23.html
images/myhouse.png

Your program should use a Pattern that will match such links and will extract the web site as a group. Use a Matcher for each line of input, to test whether it contains a link to a web site and, if so, to extract the web site. Print out all the the web sites, one to a line.

You can read the file using either a Scanner or TextIO. To create the scanner, use

in = new Scanner( new File("/classes/cs229/index.html") );

For TextIO, use

TextIO.readFile("/classes/cs229/index.html");

and use TextIO.eof() to test for end-of-file. To make a case-insensitive Pattern, use

pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

This basic operation—extract a list of web sites that a web page links to—is an important used, for example, by Google for building its map and index of the Web.