CS229, Regular Expressions Lab and Homework

CS 229: Foundations of Computation, Spring 2020
Homework 6: Regular Expressions Lab and Assignment

The class meets on Wednesday, March 11, in Rosenberg 009 for some work on using regular expressions on the command line. A programming assignment on using regular expressions in Java is due one week after the lab. The assignment consists of two short programs. The Java programs can be submitted to your folder in /classes/cs229/homework by 10:00 AM on March 26, the Thursday following Spring break. I will print out any .java files in that folder.

The exercises in Parts 1 and 2 of the lab ask you for commands that can be used to accomplish certain tasks. You can copy-and-paste your commands from the Terminal window into a text document. You can hand-write them if you prefer, but please write them up carefully and clearly. You should also answer any other questions that are asked, such as the output of the command. Turn in your hand-written responses or a printout of your file. If you finish all the exercises in class, you can turn in your responses at the end of class. Otherwise, you should turn them in in class on March 25.

You have the option of working on this lab either by yourself or with one partner. If two people are working together, make sure that both names appear on anything that is submitted.

About UNIX Utilities and the Bash Shell

Linux (like Mac OS X) is a "UNIX-like operating system." What that means for us here is that it comes with a number of standard command line utilities—small programs that can be run on the command line. The program that implements the command line itself is called a "shell" or "command shell." On Mac OS and most versions of Linux, the command shell program is bash. You run bash when you open a terminal window or when you log on remotely using a program such as ssh. Some of the commands that you are used to, such as cd, are built into bash, but many are actually small programs.

Bash supports a basic programming language for scripting. This so-called "Bash shell scripting language" has variables, assignment statements, if statements, loops, and functions. But here, we are interested in some of the more basic syntax.

First of all, note that many command-line programs are designed for processing text. These programs read from standard input and write to standard output. Typically, when a command is used to operate on a file, standard input comes from the file and standard output goes to the command line. For example, the command

head MyProgram.java

will read the first 10 lines from the file MyProgram.java as input, and it will output those lines to the terminal window where the command was given. However, the shell can use input/output redirection. You can redirect the output from a command to a file by adding > filename to the end of the command. For example, the command

head MyProgram.java > out.txt

copies the first 10 lines of MyProgram.java to a file named out.txt. (Warning: An existing file will be overwritten without warning!) For redirecting output, use < in place of >.

But for this lab, we need the fact that you can pipe the output from one program into another program. This means that the standard output from the first program is sent to the standard input of the second program. The symbol for piping is the vertical bar, |. For example,

ls | head

runs the ls command, which lists the contents of a directory, and sends the output from that command into the head program. The head program then displays just the first ten lines of the output from ls.

Another feature of UNIX commands is that they often have many features that can be enabled by adding an "option" to the command. Options have names that begin with "-" or "--". For example, the ls command options include -l for showing extra information for each directory item, -R for listing the contents of directories recursively, and -h, used along with -l for showing file sizes in more "human-readable" form. Single-letter options that begin with "-" can be combined; for example: ls -lh.

Here, then are a few of the command-line utilities that can be used in a UNIX-like operating system, along with some of their options. For most of these commands, standard input can be taken from a file or can be piped from a previous program. If neither a file nor a pipe is provided, they might expect the user to type in the input.

cat <files> — copies the contents of all the named files, one after the other, to standard output. (The word "cat" here is short for "concatenate", but in practice, cat is used mostly for one file at a time.)
head and tail — Display the first or last ten lines of their input. You can specify a different number of lines as an option. For example, head -100 file shows the first 100 lines of the file, and tail -3 shows just the last three lines of whatever input it is given.
wc — outputs the numbers of lines, words, and characters in the input. To get just the number of lines, use wc -l
sort — sorts lines of input alphabetically and sends the result to output. The option -u causes duplicate lines to be omitted from the output; the "u" stands for "unique". The option -i causes the sort to be case-insensitive. The option -n causes sort to do a numeric instead of an alphabetic sort.
cut — selects just parts of the input line for output, where the parts of the line are divided by a specified delimiter. The default delimiter is tab. To use a space as the delimiter, add the option -d " ". To use a colon as the delimiter, add the option -d ":". To specify which fields you want, use the -f option, which takes a field number or a range of field numbers or a list of field numbers, separated by commas. For example, use -f 1 to print the first field from each line (that is, everything on the line up to the first occurrence of the delimiter). For example, cut -d ":" -f 3 outputs everything between the second and third colon on each line.
perl -pe 's/regex/replacement/' applies a "search-and-replace" command to each line of the input, and prints the result. This is discussed in the handout on regular expressions.
grep and egrep — do regular expression matching. These are the main commands for this lab. More on them below.

Most commands can take multiple file names on a line. Note that file names use "globbing", which means that * and ? are treated as wildcard characters, with ? matching exactly one character and * matching any number of characters. For example, javac *.java will compile all .java files in the current directory. Note that *.java is not a regular expression here; the * does not mean repetition.

grep and egrep

The grep command takes a regular expression as its argument, and it prints every line from its input that contains a substring that matches the regular expression. grep uses a somewhat more limited syntax for regular expressions than we have studied. For the full set of features, use egrep instead of grep. For this lab, you can use egrep to avoid confusion about what features are and are not supported. When using egrep, enclose the regular expression in single quotes (but the quotes are only really necessary if the expression contains characters that are special in the bash shell). For example,

egrep '".*"' MyProgram.java

will print every line from MyProgram.java that contains a string literal, using the regular expression ".*" to match the strings. Grep would also work here. (The single quotes are needed because both " and * are special characters for the bash shell.) And, using the pipe syntax discussed above,

egrep '".*"' MyProgram.java | wc -l

will just output the number of such lines that were found. The grep and egrep commands have several useful options, including

-i — does a case-insensitive match.
-v — prints out lines that do NOT match the regular expression.
-o — prints out just the part of the line that matches the expression. If there are several matching parts in one line, then they are all output, each one on a separate line.
-r — does recursive searching when applied to a directory; that is, all the files in the directory and in its sub-directories are searched.

Part 1: Investigate User accounts.

(If you are working on your own Mac computer, you will need to ssh to math.hws.edu or to one of our lab computers for the first two parts of the lab. If you don't know how to do that, ask. You can only do the lab on a Windows computer if you have an ssh client installed on it.)

On the Linux computers in our labs, the command ypcat passwd prints out account information for all the networked user accounts on our system. (Local, non-networked accounts are in the file /etc/passwd.) Try it. If you want to know how many such accounts there are, try the command

ypcat passwd | wc -l

Each output line from ypcat password contains seven fields, separated by colons. The first field is the user name. A student user name, in particular, is one that matches the regular expression [a-z]{2}[0-9]{4}'.

Exercise 1. Write a command that will output the number of student accounts. You will need three individual commands, separated by two pipes. (Be sure to use egrep since grep doesn't support the {n} syntax.) How many student accounts are there? (Your responses to Exercises 1 through 10 should include the exact command that you used, plus your answer to any other question that was asked in the exercise. It is suggested to type your responses in a neatly formatted text file and print out the result to turn in.)

Exercise 2. The third field in the output of ypcat passwd is the user account number. Write a single command that will print the smallest student account number in the file, with no other output. (Hint: Use sort -n as one of your commands. And you will need some of the other utilities discussed above.) What's the smallest student account number?

Exercise 3. The fifth field in the output of ypcat passwd is the user name. For student accounts, the name should consist a first name and a last name, separated by a space. Write a command that will print out all student names in a form similar to "Smith, John" including the quotation marks. Note that the order of the names is reversed, to give last name first, followed by a command, and then the first name. You can use perl -pe to rearrange the data into the required format.

Exercise 4. Now, write a command that will print out the student user name, followed by a space, followed by the name in the same format as the previous exercise. For example: zz9999 "Smith, John".

Exercise 5. Finally, write a command that will output the same information in the same format as the previous example, but with the names in alphabetical order by last name. This is harder because you will have to put the information in one format for sorting, then rearrange the information into its final format.

(Note: I frequently do things like this with files. I might use grep and cut to filter some information from the file, but I am more likely to do the actual transformations in a text editor that supports regular expression find-and-replace.)

Part 2: Investigate a Web Access Log

The 86-megabyte file /classes/cs229/access.log contains the access log for the web server on math.hws.edu for a recent seven-day period. It has one line for each time someone on the Internet sent a request to our web server during that period. In this part of the lab, you will work with that file. >There is no need to make a copy of the file. You can cd into the directory /classes/cs229 and work there.

As you develop commands to operate on this file, you will sometimes need to see what the output from a command looks like, but you don't want to see 86 megabytes of output. Piping the output into head is a way to see just the first ten lines of output.

Exercise 6. The first thing on each line in the file is an IP address that identifies the computer that sent the request to our server. The IP address is followed by a space. Write a command that determines the number of different IP addresses that sent requests from the file. The first step can be to use the cut to extract the IP address from the line. The IP address is everything from the start of the line up to the first space. But you will need to pipe the output from cut into another command to get a list that does not contain duplicate entries, and another command to count the lines of output. How many different IP addresses are there?

Exercise 7. Lines that represent requests for specific files will contain a string of the form "GET followed by a space (a double quote, followed by GET, followed by a space). This is followed by the path name of the file. The path name cannot contain a space. Write a command that determines how many requests were made for files beginning with /javanotes. Such a request will start with the string: "GET /javanotes. How many requests did you find? Now, write a command that will find out how many different IP addresses sent requests for such files. How many were there?

Exercise 8. One of the fields on each line is the "referrer." If the request was generated by a user clicking on a link, the referrer is the web address of the page that contained that link. If the referrer field starts with "http://www.google. or "https://www.google. (including the double quote mark at the start!), then the request was the result of a Google search. You can assume that any line that contains such a string represents a request generated from a Google search. Write a command to determine the number of such requests. (This is easy.) How many were there?

Exercise 9. The web site name in the referrer field is terminated by the first slash (/). For example: "http://www.google.com/, "https://www.google.co.in/, "https://www.google.com.ng/. The last two are Google's Indian and Singapore sites. Write a command to determine how many different Google sites occur in the referrer field. How many were there? (Note: The -o option for egrep will be useful.) For your own interest, you might be curious about some of the other countries where the Google sites were located. If so, see this list of country codes.

Exercise 10. Finally, write a command to determine how many google searches led to a page whose name started with /javanotes. How many were there?

Program 1: Use Regular Expressions in Java

You will write two short Java programs that use regular expressions. The Java support for regular expressions uses the classes Pattern and Matcher, which are in the package java.util.regex. So at the beginning of each program, you should say

import java.util.regex.*;

The first program is mostly to make sure that you can use the two classes. (You can check the documentation for Pattern and Matcher. In particular, look at the example at the top of the Pattern documentation. Also, see the regular expression handout from class.)

Your program should ask the user to type in a regular expression. Read a line of input from the user, and pass it to the Pattern.compile method to create an object of type Pattern.

Then ask the user to type in additional lines of text to be matched against the pattern. For each line of text that is input, create a Matcher for the text, using the matcher() method in the Pattern object.

Call the matcher's matcher.find() method to test whether the string contains a substring that matches the regular expression. This returns a boolean value to tell you whether a matching substring was found. Tell the user the result. If the regular expression included parentheses, then matcher.groupCount() is the number of left parentheses, and matcher.group(n) is the substring that matched group n for n between 1 and matcher.groupCount(). Also, matcher.group(0) is in any case the entire matching substring. You should print out the substring for each group.

You can end the program when the user inputs an empty string. Here is an example of a session with a sample program:

   Input a regular expression:   ([a-zA-Z]*), ([a-zA-Z]*)
   
   Input a string:   Doe, John
   That string matches
         Group 0 matched  Doe, John
         Group 1 matched  Doe
         Group 2 matched  John

   Input a string:  Bond, 007
   That string does not match
   
   Input a string:  Fortunately, the reactor did not explode
   That string matches
         Group 0 matched  Fortunately, the
         Group 1 matched  Fortunately
         Group 2 matched  the

   Input a string:

Program 2: Extract Information from a Web Page.

For the second program, you will use Pattern and Matcher in a more practical way. The program will find links on a web page. The program GetPage.java reads lines from a web page and prints them out. You should get a copy of that program and modify it so that instead of printing out the lines from the file, it prints out the web addresses from any links that appear on the page. You will use regular expressions to find the links and extract the addresses. (There is a copy of GetPage.java in /classes/cs229.)

A web page is usually an HTML file, that is, a text file that contains the content of the page as well as special "mark-up" code. In the file, a link can look, for example, like one of these strings:

<a href="http://google.com">
<A target="mainframe" href=’data321.html’>
<a id=’link23’ HREF = "file23.html" target="_TOP">
<a href="images/myhouse.png">

The link must start with <a and must contain href=. (These are case-insensitive.) There can be spaces around the "=", but not after the "<". The web site is in single or double quotes after href=. For the examples above, the web addresses that your program would extract are

http://google.com
data321.html
file23.html
images/myhouse.png

Your program should use a Pattern that will match such links and will extract the web site name. Use a Matcher for each line of input, to test whether it contains a link to a web site and, if so, to extract the web site. Print out all the web sites that are found, one to a line.

Note that to make a case-insensitive Pattern, you can use

pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

The basic operation for this assignment—extract a list of web sites that a web page links to—is an important one, used, for example, by Google for building its index of the Web.