CPSC 225, Spring 2011
Lab 8: Web Collage, Part 2


The assignment for this lab is to complete the WebCollage project that was begun in Lab 7. We have already discussed this project extensively. The final program will be due at the next lab (which will be in two weeks, because of Spring Break).

Since Lab 7 will be graded separately, you should create a new project for Lab 8. You can copy-and-paste all the Java files from Lab 7 into Lab 8; you will not need the folder of images from Lab 7.

In addition to the files from Lab 7, you will need a copy of the file /classes/s11/cs225/LinkParser.java. This LinkParser class will be used to read the contents of a web page and extract the URLs from all the links on that page. You should carefully read the comment on the static method LinkParser.grabReferences() to see how it works.

Lab 7 is due today and should be either copied into your homework folder or shared into your CVS repository. (If you use CVS, don't forget to commit the latest version.) Lab 8 will be due in two weeks.


The Web Crawler

This week, you will write the final component of the WebCollage program, the "web crawler." You should create a WebCrawler class to represent this component. The web crawler is connected to the rest of the program by a BlockingQueue, which carries image URLs from the web crawler to the image grabber. This queue will be created by the main program and should be passed to the WebCrawler in its constructor.

The web crawler uses a second blocking queue of URLs. This blocking queue is for the internal use of the web crawler and can be created by the web crawler. It will hold links to web pages. Note: It is important that the web page queue does not have a limited capacity (for reasons discussed in class), so it should be a LinkedBlockingQueue.
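Here is one possible skeleton for the class, meant only as a sketch to fix the names used in the code fragments below. It assumes that the image queue holds URL objects; if your image grabber from Lab 7 expects something else, adjust the types accordingly.

    import java.net.URL;
    import java.util.HashSet;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class WebCrawler {

        private BlockingQueue<URL> imageURLQueue; // created by the main program; feeds the image grabber
        private BlockingQueue<URL> pageURLQueue;  // internal queue of web page URLs; unlimited capacity
        private HashSet<URL> urlSet;              // every URL that the crawler has encountered so far

        public WebCrawler(BlockingQueue<URL> imageURLQueue) {
            this.imageURLQueue = imageURLQueue;
            pageURLQueue = new LinkedBlockingQueue<URL>();
            urlSet = new HashSet<URL>();
            // Create and start the crawler threads here.
        }

    }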

The web crawler should create several threads to do the downloading. Each thread will run in an infinite loop in which it removes a URL from the queue of web pages and processes that URL. To process the URL, it should open a URLConnection to the URL and get the connection's input stream. The thread should check the content type of the connection (as returned by the getContentType() method of the connection). A content type of "text/html" or "application/xhtml+xml" indicates an HTML web page, which can contain links to other web pages. For connections with one of those content types, the LinkParser class can be used to obtain lists of web page URLs and image URLs from the connection's input stream. Web page URLs go into the page URL queue, and image URLs go into the image URL queue.

(The content type for a URL cannot always be determined from the file extension in the URL. For example, links that end with ".php" could lead to images just as easily as to web pages. If you open a connection and find that the content type starts with "image/", then the link actually leads to an image. In that case, you might consider dropping the link into the image URL queue instead.)
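As a rough sketch, each crawler thread might be a nested class inside WebCrawler along the following lines, where process() is a private method (sketched below) that handles one URL taken from the queue:

    private class CrawlerThread extends Thread {
        public void run() {
            while (true) {
                try {
                    URL pageURL = pageURLQueue.take();  // blocks until a URL is available
                    process(pageURL);
                }
                catch (InterruptedException e) {
                    // Not expected; just go back and wait for another URL.
                }
            }
        }
    }

The WebCrawler constructor can create several of these threads and start them.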

In order to avoid downloading the same page or image over and over, the WebCrawler class should use a HashSet<URL>, as discussed in class. Every URL that is found by the web crawler should be added to this set. If a URL that is found was already in the set, then it should be discarded without adding it to the web page or image queue. Access to the set should be synchronized. As discussed in class, an easy way to use the set is with a method:

    // Returns true if url was not already in the set (HashSet.add() returns
    // false when the item being added is already present).
    synchronized private boolean addURL(URL url) {
        return urlSet.add(url);
    }
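Putting these pieces together, the process() method called by the crawler threads might look something like the following rough sketch. The call to LinkParser.grabReferences() is left as a comment, since you should take its exact parameters from the comment in LinkParser.java. The sketch also needs imports for java.io.InputStream, java.net.URLConnection, and java.util.ArrayList in addition to those shown above.

    private void process(URL pageURL) {
        try {
            URLConnection connection = pageURL.openConnection();
            InputStream in = connection.getInputStream();
            String contentType = connection.getContentType();
            if (contentType == null) {
                in.close();
                return;
            }
            // startsWith() is used because the content type can include extra
            // information, such as "text/html; charset=UTF-8".
            if (contentType.startsWith("text/html")
                    || contentType.startsWith("application/xhtml+xml")) {
                ArrayList<URL> pageLinks = new ArrayList<URL>();
                ArrayList<URL> imageLinks = new ArrayList<URL>();
                // Call LinkParser.grabReferences(...) here to fill pageLinks and
                // imageLinks from the input stream; see the comment in
                // LinkParser.java for the exact parameters that it requires.
                for (URL link : pageLinks) {
                    if (addURL(link))   // true means this URL has not been seen before
                        pageURLQueue.put(link);
                }
                for (URL link : imageLinks) {
                    if (addURL(link))
                        imageURLQueue.put(link);
                }
            }
            else if (contentType.startsWith("image/")) {
                // The URL turned out to lead directly to an image.
                imageURLQueue.put(pageURL);
            }
            in.close();
        }
        catch (Exception e) {
            // A page that can't be downloaded or parsed is simply skipped.
        }
    }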

Finally, the WebCrawler class will need some way of getting things started. So include a method such as the following that can be called to add a starting URL to the web page queue:

    public void addStartingURL(String webPageURL) {
        try {
            pageURLQueue.put(new URL(webPageURL));
        }
        catch (Exception e) {
            // Either the URL was malformed, or put() was interrupted.
            throw new IllegalArgumentException("Illegal URL " + webPageURL);
        }
    }

The Main Program

For the main program, you can rename and modify the TestImageGrabber class from Lab 7. You will no longer need any of the image files, and the main routine will now also have to create and configure the web crawler.

For now, you can get things started by adding "http://math.hws.edu" as the starting URL. (Note that the "http://" is required.)
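For example, the relevant part of the main routine might look something like the following sketch. The queue is just whatever queue your Lab 7 program already creates to feed the image grabber; the names are placeholders for your own classes and variables.

    BlockingQueue<URL> imageURLQueue = new LinkedBlockingQueue<URL>();  // or whatever queue Lab 7 uses

    // ... create and start the image grabber threads, passing them imageURLQueue, as in Lab 7 ...

    WebCrawler crawler = new WebCrawler(imageURLQueue);
    crawler.addStartingURL("http://math.hws.edu");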


Enhancements

Before the project is complete, you should have some way to let the user specify the starting URL. A nice way to do this would be to include a text input box in the GUI -- say, in the window below the web collage -- where the user can enter the URL. Another possibility would be to use JOptionPane.showInputDialog() to get the URL from the user before opening the collage window. Note that this is the only required enhancement.
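For example, the dialog-box version could be as simple as the following sketch, placed in the main routine; it assumes the crawler variable from the sketch above. (JOptionPane is in the javax.swing package; showInputDialog() returns null if the user cancels the dialog.)

    String startURL = JOptionPane.showInputDialog("Enter the URL of the starting web page:");
    if (startURL == null || startURL.trim().length() == 0)
        System.exit(0);  // user canceled or entered nothing
    crawler.addStartingURL(startURL.trim());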

It might be nice for the user to be able to stop and restart the process of adding images to the collage. You can do this by stopping and starting the timer. Timers, unlike threads, can be stopped and restarted. If timer is a Timer, just call timer.stop() to stop it and timer.start() to start it.
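For example, a "Pause"/"Resume" button could be wired up along the following lines, assuming that timer and pauseButton are instance variables (or final local variables) in your GUI class. This needs imports for java.awt.event.ActionListener and java.awt.event.ActionEvent; isRunning() is a standard method of javax.swing.Timer.

    pauseButton.addActionListener(new ActionListener() {
        public void actionPerformed(ActionEvent evt) {
            if (timer.isRunning()) {
                timer.stop();
                pauseButton.setText("Resume");
            }
            else {
                timer.start();
                pauseButton.setText("Pause");
            }
        }
    });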

Other options include: giving the user the ability to clear the collage; making it possible to discard all the data currently in the queues, so that you can really start from scratch; and making it possible to save the collage image to a file (see Section 13.1.5).
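For the save-to-file option, here is a minimal sketch, assuming that your collage panel keeps its picture in a BufferedImage (called collageImage here). ImageIO is in the javax.imageio package; in a finished program you would probably use a JFileChooser to let the user select the output file rather than a fixed file name.

    try {
        boolean canSave = ImageIO.write(collageImage, "PNG", new File("collage.png"));
        if (!canSave)
            System.out.println("Sorry, PNG format is not available.");
    }
    catch (IOException e) {
        System.out.println("Sorry, the image could not be saved: " + e);
    }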


WARNING

It would not be a good idea to crank up the speed in your program and turn it loose on a popular web site. Web crawler programs are supposed to follow certain protocols, including checking with the web site to see how much crawling it allows. A web site might see a program that grabs large numbers of web pages very quickly as a kind of attack, and it might be monitoring for something like that.

You should run your program at a reasonable speed (with a non-trivial delay in the timer), and you should probably not run it over an extended period of time.


David Eck, for CPSC 225