CS441 -- Fall 2002 -- Lab 4

CPSC 441, Fall 2002
Lab 4: Programming a Web server, Part I

IN THIS LAB, you will start programming a Web server. The server will be developed in several stages. In the first stage, you will write a fairly simple, single-threaded Web server that can serve text documents. Later, you will add multi-threading and perhaps some other features. Your program should be able to interact with Web browsers such as Mozilla and Internet Explorer. You can work on this project yourself, or you can work on it with one other person. Here are the exercises for this lab:

Exercise 0: Be in class for a full 55 minutes!

Exercise 1: Write a Web server program, following the specification given in the rest of this lab. Note that there are some unresolved issues in this specification. You should ask for clarification where necessary. Part of the assignment is to write a program that can be easily extended later to include more features. Use functions and/or classes as necessary!

Exercise 2: Take a look at the HTTP 1.0 specification, RFC 1945. You can find a copy at http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.html. Read about some of the features available in HTTP 1.0, and think about which features you would like to implement in your server. Write a few paragraphs describing at least two features of HTTP 1.0 you would like to include in your server and two features that you believe would be unnecessary or too difficult to be worth implementing. Given the structure of HTTP, "feature" basically means a request such as HEAD, a response code such as "301 Moved Permanently", or a header such as "If-Modified-Since:". It could also mean working with non-text content types such as images.

The program and proposal will be due in two weeks.

The purpose of a Web server is to send responses to requests. The original HTTP (version 0.9) used the simple request format
            GET <path-to-document><crlf>
where <crlf> means "carriage return, line feed" or "\r\n". The <path-to-document> specifies the resource that is being requested. It's the part of the URL that comes after the computer name and optional port number. For example, in "http://math.hws.edu/eck/index.html", the document path is "/eck/index.html" and in "http://math.hws.edu:8080/", the document path is "/". In HTTP 1.0, the HTTP version was added to the first line of the request and the request can include headers on succeeding lines. The request has the form
               GET <path-to-document> HTTP/1.0<crlf>
               <header-name>: <header-data><crlf>
               <header-name>: <header-data><crlf>
                  ...
               <header-name>: <header-data><crlf>
               <crlf>
In addition to GET, HTTP 1.0 allows the HEAD and POST requests, but you will only implement GET for this assignment. The blank line at the end marks the end of the request. Note that although lines are supposed to end with <crlf>, in practice servers will also accept a carriage-return by itself or a line-feed by itself at the end of a line.
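As a rough illustration of the request formats described above, here is a minimal sketch of how the first line of a request might be broken into its parts. The RequestLine struct and the parseRequestLine function are hypothetical names made up for this example; they are not part of any class supplied for the lab, and you are free to organize your parsing differently.

```cpp
#include <string>

// Hypothetical holder for the parts of a request line (not supplied by the lab).
struct RequestLine {
    std::string method;   // e.g. "GET"
    std::string path;     // e.g. "/eck/index.html"
    std::string version;  // e.g. "HTTP/1.0"; empty for an HTTP 0.9 request
    bool valid;           // false if the line could not be parsed
};

// Parse the first line of a request. The line may end with "\r\n",
// with "\r" alone, with "\n" alone, or at the end of the data.
RequestLine parseRequestLine(const std::string& request) {
    RequestLine result;
    result.valid = false;

    // The first line stops at the first CR or LF, if there is one.
    std::string::size_type eol = request.find_first_of("\r\n");
    std::string line = request.substr(0, eol);

    // The method is everything before the first space.
    std::string::size_type sp1 = line.find(' ');
    if (sp1 == std::string::npos || sp1 == 0)
        return result;  // no method, or no path after it
    result.method = line.substr(0, sp1);

    // The path runs to the next space; anything after that is the version.
    std::string::size_type sp2 = line.find(' ', sp1 + 1);
    if (sp2 == std::string::npos) {
        result.path = line.substr(sp1 + 1);  // HTTP 0.9 style, no version
    } else {
        result.path = line.substr(sp1 + 1, sp2 - sp1 - 1);
        result.version = line.substr(sp2 + 1);
    }

    // Assumption for this sketch: a usable path must start with "/".
    result.valid = !result.path.empty() && result.path[0] == '/';
    return result;
}
```

An empty version field is how this sketch distinguishes an HTTP 0.9 request from an HTTP 1.0 (or later) request.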

For a simple request, the Web server simply sends back the requested document and closes the connection. If the document doesn't exist, the server usually sends back a short error document that contains an error message. For an HTTP/1.0 request, the server sends a response that has the form:
               <status-line><crlf>
               <header-name>: <header-data><crlf>
               <header-name>: <header-data><crlf>
                  ...
               <header-name>: <header-data><crlf>
               <crlf>
               <requested-document>
The <status-line> contains the HTTP version, a code number and a textual description. For this assignment, the only status lines you are required to use are:
     HTTP/1.0 200 OK                 -- Requested document is returned.
     HTTP/1.0 400 Bad Request        -- Illegal request was received.
     HTTP/1.0 404 Not Found          -- A GET request with bad document name.
     HTTP/1.0 501 Not Implemented    -- A POST or HEAD request.
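One possible way to pick among these four status lines is sketched below. The function name chooseStatusLine and its parameters are hypothetical. Treating any method other than GET, POST, or HEAD as a bad request is an assumption of this sketch; the specification above does not settle that case.

```cpp
#include <string>

// Choose one of the four required status lines.
// method:         the first word of the request line, e.g. "GET"
// documentExists: whether the requested file was found on disk
std::string chooseStatusLine(const std::string& method, bool documentExists) {
    if (method == "POST" || method == "HEAD")
        return "HTTP/1.0 501 Not Implemented";
    if (method != "GET")
        return "HTTP/1.0 400 Bad Request";  // assumption: unknown methods are illegal
    if (!documentExists)
        return "HTTP/1.0 404 Not Found";
    return "HTTP/1.0 200 OK";
}
```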
Your program should be able to handle both HTTP 0.9 and HTTP 1.0 requests. It will also handle HTTP 1.1 (or later) requests, but it will treat them no differently from HTTP 1.0. For now, you can ignore the headers in the request. The only response header that you have to include is the "Content-type:" header, which specifies the type of data that is being returned. This will have the form
               Content-type: text/html<crlf>
for a file whose name ends with .htm or .html. Since you are only going to be working with text files, it will have the form
               Content-type: text/plain<crlf>
for all other documents.
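The rule for choosing the content type, together with the layout of the response header, could be written along the following lines. This is only a sketch; the names endsWith, contentTypeFor, and buildResponseHeader are invented for the example.

```cpp
#include <string>

// Return true if name ends with the given suffix.
bool endsWith(const std::string& name, const std::string& suffix) {
    return name.length() >= suffix.length() &&
           name.compare(name.length() - suffix.length(),
                        suffix.length(), suffix) == 0;
}

// text/html for .htm and .html files, text/plain for everything else.
std::string contentTypeFor(const std::string& path) {
    if (endsWith(path, ".html") || endsWith(path, ".htm"))
        return "text/html";
    return "text/plain";
}

// Build the status line, the Content-type header, and the blank line
// that separates the headers from the document itself.
std::string buildResponseHeader(const std::string& statusLine,
                                const std::string& contentType) {
    return statusLine + "\r\n" +
           "Content-type: " + contentType + "\r\n" +
           "\r\n";
}
```

The document data would be sent immediately after the string returned by buildResponseHeader.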

The documents to be served by the server will all be stored in some specified directory. I suggest that you assume that it's a directory named documents in the same directory as the server program. This directory can contain sub-directories, and supporting them is not difficult: to get the full file path, you just have to add "documents" onto the beginning of the <path-to-document> from the request. (But see the remarks on "Attacks Based on File and Path Names" in the RFC, at http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.html#sec-12.5.)
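Here is a minimal sketch of that mapping, with one simple guard against the kind of attack the RFC warns about: it refuses any path that contains a ".." component, since such a path could reach files outside the documents directory. The function names are hypothetical, and returning an empty string for a rejected path is just a convention chosen for this example. A real server might need further checks.

```cpp
#include <string>

// Return true if any "/"-separated component of the path is exactly "..".
bool hasParentReference(const std::string& path) {
    std::string::size_type start = 0;
    while (start <= path.length()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos)
            end = path.length();
        if (path.substr(start, end - start) == "..")
            return true;
        start = end + 1;
    }
    return false;
}

// Map a document path from the request onto a file path inside the
// "documents" directory. Returns an empty string if the path is rejected.
std::string toFilePath(const std::string& documentPath) {
    if (documentPath.empty() || documentPath[0] != '/')
        return "";
    if (hasParentReference(documentPath))
        return "";
    return "documents" + documentPath;
}
```

Note that a request for "/" maps to "documents/", which is a directory rather than a file. What to do in that case is one of the issues you will have to resolve.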

You should keep a log file containing information about requests to your server, including the IP address of the client, the response code (such as 200 or 404), and the document that was requested by the client.
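A log entry could be as simple as one line of text per request. The sketch below assumes a format of IP address, response code, and document path separated by spaces. That format, the function names, and the idea of passing the log file name as a parameter are all choices made for this example, not requirements.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Format one log line: client IP, response code, requested document.
std::string formatLogEntry(const std::string& clientIP, int code,
                           const std::string& path) {
    std::ostringstream out;
    out << clientIP << ' ' << code << ' ' << path << '\n';
    return out.str();
}

// Append an entry to the end of the log file, creating the file if needed.
// Returns false if the file could not be opened or written.
bool appendToLog(const std::string& logFileName, const std::string& entry) {
    std::ofstream log(logFileName.c_str(), std::ios::app);
    if (!log)
        return false;
    log << entry;
    return log.good();
}
```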

You will have to do some parsing of the request that you receive from the client. Here are some functions in the string class that you might find useful. If str is a variable of type string then:

str.length() returns the length of the string.
str.c_str() returns the equivalent C-style string (for use in opening fstreams, for example).
str.find(string) finds the first occurrence of the string in str. For example: str.find(" "). The return value is string::npos if the string is not found (this becomes -1 if you store it in an int). Otherwise, it gives the position of the string. (Note that the first character in the string is in position 0.)
str.find(string,pos) finds the first occurrence of the string in str following position number pos.
str.substr(pos,ct) returns a sub-string of str. The first parameter gives the position of the beginning of the sub-string. The second parameter gives the number of characters in the sub-string.

For example, to find the substring consisting of the first 4 characters of str you could use:
            string sub = str.substr(0,4);
To find the sub-string of str starting at position 4 and consisting of all the characters from position 4 up to the next space character in the string:
            string::size_type pos = str.find(" ",4);
            // Assume that pos is not string::npos (that is, a space was found)
            string sub = str.substr(4,pos-4);
The send(string) function in my Socket class sends the zero character at the end of the string as part of the data. Your Web server must not add these zeros to the data that it sends. However, you can use sendBinary(), something like this:
            getline(file,line);
            line = line + "\r\n";
            socket->sendBinary(line.c_str(), line.length());
It will be OK to use socket->receive() to read the client's request, but note that you will not receive the request one line at a time. It's likely that you will get it all in one long string.
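Since the request is likely to arrive as one long string, you may want to split it into lines yourself. The following sketch accepts "\r\n", a lone "\r", or a lone "\n" as a line ending. The function name splitLines is hypothetical, and the sketch assumes that the whole request has already been collected into a single string.

```cpp
#include <string>
#include <vector>

// Split received data into lines, accepting "\r\n", "\r", or "\n"
// as the line terminator. Terminators are not included in the result.
std::vector<std::string> splitLines(const std::string& data) {
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    while (start < data.length()) {
        std::string::size_type end = data.find_first_of("\r\n", start);
        if (end == std::string::npos) {
            // Last line has no terminator.
            lines.push_back(data.substr(start));
            break;
        }
        lines.push_back(data.substr(start, end - start));
        // Treat "\r\n" as a single terminator.
        if (data[end] == '\r' && end + 1 < data.length() && data[end + 1] == '\n')
            end++;
        start = end + 1;
    }
    return lines;
}
```

With this approach, the blank line that marks the end of an HTTP 1.0 request shows up as an empty string in the returned vector.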

David Eck, 27 September 2002
