Lessons by Jon

Decoding Data in a CGI

In the last lesson you got a better look at the garbage that is returned by WWW clients from a form. While you could see the data there, it was interspersed with odd characters; lots of "=" and "%" and such. In this lesson we will learn how to extract the information you want from this apparent mess.

Required OSAX

Tokenize
DecodeURL
DePlus

NOTE: if you have not yet installed these OSAXen, then do it before starting this lesson. The script will not compile without them. Go back to the Requirements section to download the OSAXen if you need them.


Script5.txt - Decoding Data

Here is the entire script for this lesson. The comments have been removed so you see only the lines that actually get compiled. The full script, including comments and special characters, is in the archive with the name "Script5.txt".
property crlf : (ASCII character 13) & (ASCII character 10)
property http_10_header : "HTTP/1.0 200 OK" & crlf & "Server: MacHTTP" & crlf & ¬
	"MIME-Version: 1.0" & crlf & "Content-type: text/html" & crlf & crlf
property idletime : 300
property datestamp : 0

set datestamp to current date

on «event WWWΩsdoc» path_args ¬
   given «class kfor»:http_search_args, ¬
      «class post»:post_args, «class meth»:method, ¬
      «class addr»:client_address, «class user»:username, ¬
      «class pass»:password, «class frmu»:from_user, ¬
      «class svnm»:server_name, «class svpt»:server_port, ¬
      «class scnm»:script_name, «class ctyp»:content_type

 try

   set datestamp to current date

   set return_page to http_10_header ¬
      & "<HTML><HEAD><TITLE>Parsed Results</TITLE></HEAD>" ¬
      & "<BODY><H1>Parsed Results</H1>" & return
   set return_page to return_page & "<H4>post_args</H4><PRE>" & return

   set postarglist to tokenize post_args with delimiters {"&"}

   set AppleScript's text item delimiters to {"="}
   set postargtext to ""
   repeat with currpostarg in postarglist
      set postargtext to postargtext & ¬
         (Decode URL (DePlus (last text item of currpostarg))) & return & return
   end repeat

   set return_page to return_page & postargtext & "</PRE>" & return
   set return_page to return_page ¬
      & "<HR><I>Results generated at: " & (current date) ¬
      & "</I>" & "</BODY></HTML>"
   return return_page

 on error errMsg number errNum
   set return_page to http_10_header ¬
      & "<HTML><HEAD><TITLE>Error Page</TITLE></HEAD>" ¬
      & "<BODY><H1>Error Encountered!</H1>" & return ¬
      & "An error was encountered while trying to run this script." & return
   set return_page to return_page ¬
      & "<H3>Error Message</H3>" & return & errMsg & return ¬
      & "<H3>Error Number</H3>" & return & errNum & return ¬
      & "<H3>Date</H3>" & return & (current date) & return
   set return_page to return_page ¬
      & "<HR>Please notify Jon Wiederspan at " ¬
      & "<A HREF=\"mailto:jonwd@tjp.washington.edu\">jonwd@tjp.washington.edu</A>" ¬
      & " of this error." & "</BODY></HTML>"
   return return_page
 end try
end «event WWWΩsdoc»

on idle
   if (current date) > (datestamp + idletime) then
      quit
   end if
   return 5
end idle

on quit
   continue quit
end quit

Step By Step

There is really only one addition to this script. Instead of just separating the list into its separate items, we now do some decoding on each item. The decoding does two things:
  1. Converts all codes which look like "%XX" to a character, where XX is the hexadecimal code for that character. This means "%20" becomes a space and "%26" becomes an ampersand.
  2. Converts all occurrences of "+" to spaces. This is only necessary for the NCSA Mosaic and Netscape clients. In case you have a major vision problem and missed my previous comments on this subject, these two clients use the "+" character to encode spaces in text before passing it on to MacHTTP. Can you say "no-no"?
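
To make item 1 concrete, here is a rough sketch of how a single "%XX" code turns into a character in plain AppleScript. The handler names hexValue and charFromCode are made up for this illustration and are not part of DecodeURL. The sketch also assumes the two hex digits are valid and written in uppercase.

```applescript
-- Hypothetical helper: numeric value (0 to 15) of one uppercase hex digit
on hexValue(hexDigit)
   return (offset of hexDigit in "0123456789ABCDEF") - 1
end hexValue

-- Hypothetical helper: turns the two hex digits that follow a "%" into a character
on charFromCode(hexPair)
   set highPart to hexValue(character 1 of hexPair)
   set lowPart to hexValue(character 2 of hexPair)
   return ASCII character (highPart * 16 + lowPart)
end charFromCode

set spaceChar to charFromCode("20") -- 2 * 16 + 0 = 32, a space
set ampChar to charFromCode("26") -- 2 * 16 + 6 = 38, an ampersand
```

A real decoder would still have to find every "%" in the text and cope with bad input, which is the part the OSAX saves you from writing.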

We will use two new OSAXen to do the decoding. The first, DecodeURL, was written by Chuck Shotton (yes, that Chuck Shotton). The second, DePlus, was written by myself using tons of Chuck's original code. Both are free products and major time- and code-savers. Here is the section of code that does the decoding:

   set AppleScript's text item delimiters to {"="}
   set postargtext to ""
   repeat with currpostarg in postarglist
      set postargtext to postargtext & ¬
         (Decode URL (DePlus (last text item of currpostarg))) & return & return
   end repeat
I have used several commands on a single line, feeding the output of one to the input of the next, because it saves some typing and one variable. You could also use the following to do exactly the same thing:
   set AppleScript's text item delimiters to {"="}
   set postargtext to ""
   repeat with currpostarg in postarglist
      set temp to last text item of currpostarg
      set temp to DePlus temp
      set temp to Decode URL temp
      set postargtext to postargtext & temp & return & return
   end repeat
Looking at it this way, you can see more clearly what we are doing. First, we grab the piece of text we're interested in (the data portion, which follows the "=" character). Then we convert all +'s to spaces. Next, we scan the entire piece of text, converting hexadecimal encodings ("%XX") to the characters they stand for. Finally, we add this decoded text onto the end of postargtext, followed by two carriage returns, before looping for the next item.

NOTE: it is very important that DePlus be run before Decode URL. If you ran Decode URL first, you would convert all of the encoded + characters ("%2B") to real + characters, and DePlus would then convert them to spaces. Doing things in the order used in the script protects the real + characters from being misinterpreted.
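
If you want to see the ordering problem for yourself without the OSAXen, the following sketch fakes both steps with AppleScript's text item delimiters. The swapText handler is a hypothetical stand-in written for this example. It is not how DePlus or DecodeURL work internally, and it only knows about the single code "%2B".

```applescript
-- Hypothetical helper: replaces every findText in sourceText with newText
on swapText(sourceText, findText, newText)
   set savedDelims to AppleScript's text item delimiters
   set AppleScript's text item delimiters to {findText}
   set pieces to text items of sourceText
   set AppleScript's text item delimiters to {newText}
   set resultText to pieces as string
   set AppleScript's text item delimiters to savedDelims
   return resultText
end swapText

-- What a client might send for the text "1+1 is 2"
set rawValue to "1%2B1+is+2"

-- Right order: "+" to space first, then "%2B" to a real "+"
set rightOrder to swapText(swapText(rawValue, "+", space), "%2B", "+")
-- rightOrder is "1+1 is 2"

-- Wrong order: "%2B" becomes "+" first, then gets swept up with the others
set wrongOrder to swapText(swapText(rawValue, "%2B", "+"), "+", space)
-- wrongOrder is "1 1 is 2"
```

The real + sign survives only in the first version, which is the same order the lesson script uses.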

Now for you anal retentive types, yes, you could do this same processing in AppleScript without using the OSAX. However, unlike in the last lesson, this time we're talking some serious bulk in your script. Here is some sample code that would perform part of the same function as DecodeURL, except that it only decodes occurrences of "%20" to spaces:

on decodeSpaces(inText)
   set outText to inText
   set spacePos to offset of "%20" in outText
   repeat while spacePos > 0
      set textLen to length of outText
      -- keep whatever comes before the "%20", if anything
      if spacePos > 1 then
         set headText to text from character 1 to character (spacePos - 1) of outText
      else
         set headText to ""
      end if
      -- keep whatever comes after the "%20", if anything
      if (spacePos + 2) < textLen then
         set tailText to text from character (spacePos + 3) to character textLen of outText
      else
         set tailText to ""
      end if
      set outText to headText & space & tailText
      -- look for the next "%20" in the partly decoded text
      set spacePos to offset of "%20" in outText
   end repeat
   return outText
end decodeSpaces
Even more lines would be required to make it handle all possible encodings, not to mention converting "+" characters. I have better things to do than figure out how to do things more slowly. Of course, if you don't believe me, feel free to write the code yourself. You should be able to time the difference on your wristwatch (or a good hourglass) if you're dealing with large arguments (like >10K of text). On the other hand, maybe you'll just want to take my word for it. That should leave you enough spare time to watch this week's Superman episode.

This looks like a good place to bring up another performance issue. If you remember from the first lesson, there are a number of variables passed to your CGI from MacHTTP. With the exception of post_args and http_search_args, all of the information in these variables is put there by MacHTTP. That means there is no reason to decode the information in these variables, since it is the client that does the encoding. In general, the data in post_args is the only thing that will take a lot of processing in your scripts. If you are using the http_search_args to hold information as well, then you will need to decode that information also.


Test the Script

I have compiled the script above on my own server, with an accompanying form. You will want to try this form out with a variety of WWW clients to show that it handles all current variations on encoding spaces.

Wrap It Up

Now you are ready to really get into the meat of the CGI applications. Everything up to now has been the foundation; the next lesson will begin to return useful information. I only have one item to bring up as a reminder after this lesson:
Remember to thank the people who wrote this fine software you are using!
Products such as MacHTTP, Tokenize (and the accompanying ACME Suite), DecodeURL, and others are making your life and mine easier and more productive, and at a great price (in my case, FREE). Another good suggestion to remember is to give back to the Net. If you are helped by a free product, consider offering something of your own freely to others.

Jon Wiederspan
Last Edited: December 11, 1994