Javanotes 9, Section 11.5 -- A Brief Introduction to XML

Section 11.5

A Brief Introduction to XML

When data is saved to a file or transmitted over a network, it must be represented in some way that will allow the same data to be rebuilt later, when the file is read or the transmission is received. We have seen that there are good reasons to prefer textual, character-based representations in many cases, but there are many ways to represent a given collection of data as text. In this section, we'll take a brief look at one type of character-based data representation that has become increasingly common.

XML (eXtensible Markup Language) is a syntax for creating data representation languages. There are two aspects or levels of XML. On the first level, XML specifies a strict but relatively simple syntax. Any sequence of characters that follows that syntax is a well-formed XML document. On the second level, XML provides a way of placing further restrictions on what can appear in a document. This is done by associating a DTD (Document Type Definition) with an XML document. A DTD is essentially a list of things that are allowed to appear in the XML document. A well-formed XML document that has an associated DTD and that follows the rules of the DTD is said to be a valid XML document. The idea is that XML is a general format for data representation, and a DTD specifies how to use XML to represent a particular kind of data. (There are also alternatives to DTDs, such as XML schemas, for defining valid XML documents, but let's ignore them here.)

There is nothing magical about XML. It's certainly not perfect. It's a very verbose language, and some people think it's ugly. On the other hand it's very flexible. It can be used to represent almost any type of data. It was built from the start to support all languages and alphabets. Most important, it has become an accepted standard. There is support in just about any programming language for processing XML documents. There are standard DTDs for describing many different kinds of data. There are many ways to design a data representation language, but XML is one that has happened to come into widespread use. In fact, it has found its way into almost every corner of information technology. For example: There are XML languages for representing mathematical expressions (MathML), musical notation (MusicXML), molecules and chemical reactions (CML), vector graphics (SVG), and many other kinds of information. XML is used by OpenOffice and recent versions of Microsoft Office in the document format for office applications such as word processing, spreadsheets, and presentations. XML site syndication languages (RSS, ATOM) make it possible for web sites, newspapers, and blogs to make a list of recent headlines available in a standard format that can be used by other web sites and by web browsers; the same format is used to publish podcasts. And XML is a common format for the electronic exchange of business information.

My purpose here is not to tell you everything there is to know about XML. I will just explain a few ways in which it can be used in your own programs. In particular, I will not say anything further about DTDs and valid XML. For many purposes, it is sufficient to use well-formed XML documents with no associated DTDs.

11.5.1 Basic XML Syntax

If you know HTML, the language for writing web pages, then XML will look familiar. An XML document looks a lot like an HTML document. HTML is not itself an XML language, since it does not follow all the strict XML syntax rules, but the basic ideas are similar. Here is a short, well-formed XML document:

<?xml version="1.0"?>
<simplepaint version="1.0">
   <background red='1' green='0.6' blue='0.2'/>
   <curve>
      <color red='0' green='0' blue='1'/>
      <symmetric>false</symmetric>
      <point x='83' y='96'/>
      <point x='116' y='149'/>
      <point x='159' y='215'/>
      <point x='216' y='294'/>
      <point x='264' y='359'/>
      <point x='309' y='418'/>
      <point x='371' y='499'/>
      <point x='400' y='543'/>
   </curve>
   <curve>
      <color red='1' green='1' blue='1'/>
      <symmetric>true</symmetric>
      <point x='54' y='305'/>
      <point x='79' y='289'/>
      <point x='128' y='262'/>
      <point x='190' y='236'/>
      <point x='253' y='209'/>
      <point x='341' y='158'/>
   </curve>
</simplepaint>

The first line, which is optional, merely identifies this as an XML document. This line can also specify other information, such as the character encoding that was used to encode the characters in the document into binary form. If this document had an associated DTD, it would be specified in a "DOCTYPE" directive on the next line of the file.

Aside from the first line, the document is made up of elements, attributes, and textual content. An element starts with a tag, such as <curve> and ends with a matching end-tag such as </curve>. Between the tag and end-tag is the content of the element, which can consist of text and nested elements. (In the example, the only textual content is the true or false in the <symmetric> elements.) If an element has no content, then the opening tag and end-tag can be combined into a single empty tag, such as <point x='83' y='96'/>, with a "/" before the final ">". This is an abbreviation for <point x='83' y='96'></point>. A tag can include attributes such as the x and y in <point x='83' y='96'/> or the version in <simplepaint version="1.0">. A document can also include a few other things, such as comments, that I will not discuss here.

The author of a well-formed XML document gets to choose the tag names and attribute names, and meaningful names can be chosen to describe the data to a human reader. (For a valid XML document that uses a DTD, it's the author of the DTD who gets to choose the tag names.)

Every well-formed XML document follows a strict syntax. Here are some of the most important syntax rules: Tag names and attribute names in XML are case sensitive. A name must begin with a letter and can contain letters, digits and certain other characters. Spaces and ends-of-line are significant only in textual content. Every tag must either be an empty tag or have a matching end-tag. By "matching" here, I mean that elements must be properly nested; if a tag is inside some element, then the matching end-tag must also be inside that element. A document must have a root element, which contains all the other elements. The root element in the above example has tag name simplepaint. Every attribute must have a value, and that value must be enclosed in quotation marks; either single quotes or double quotes can be used for this. The special characters < and &, if they appear in attribute values or textual content, must be written as < and &. "<" and "&" are examples of entities. The entities >, ", and ' are also defined, representing >, double quote, and single quote. (Additional entities can be defined in a DTD.)

While this description will not enable you to understand everything that you might encounter in XML documents, it should allow you to design well-formed XML documents to represent data structures used in Java programs.

11.5.2 Working With the DOM

The sample XML file shown above was designed to store information about simple drawings made by the user. The drawings in question are ones that could be made using the sample program SimplePaint2.java from Subsection 7.3.3. We'll look at another version of that program that can save the user's drawing using an XML format for the data file. The new version is SimplePaintWithXML.java. The sample XML document shown earlier in this section can be used with that program. I designed the format of that document to represent all the data needed to reconstruct a picture in SimplePaint2. The document encodes the background color of the picture and a list of curves. Each <curve> element contains the data from one object of type CurveData.

It is easy enough to write data in a customized XML format, although we have to be very careful to follow all the syntax rules. Here is how SimplePaintWithXML writes the data for a SimplePaint2 picture to a PrintWriter, out. This produces an XML file with the same structure as the example shown above:

out.println("<?xml version=\"1.0\"?>");
out.println("<simplepaint version=\"1.0\">");
out.println("   <background red='" + backgroundColor.getRed() + "' green='" +
        backgroundColor.getGreen() + "' blue='" + backgroundColor.getBlue() + "'/>");
for (CurveData c : curves) {
    out.println("   <curve>");
    out.println("      <color red='" + c.color.getRed() + "' green='" +
            c.color.getGreen() + "' blue='" + c.color.getBlue() + "'/>");
    out.println("      <symmetric>" + c.symmetric + "</symmetric>");
    for (Point2D pt : c.points)
        out.println("      <point x='" + pt.getX() + "' y='" + pt.getY() + "'/>");
    out.println("   </curve>");
}
out.println("</simplepaint>");

Reading the data back into the program is another matter. To reconstruct the data structure represented by the XML Document, it is necessary to parse the document and extract the data from it. This could be difficult to do by hand. Fortunately, Java has a standard API for parsing and processing XML Documents. (Actually, it has two, but we will only look at one of them.)

A well-formed XML document has a certain structure, consisting of elements containing attributes, nested elements, and textual content. It's possible to build a data structure in the computer's memory that corresponds to the structure and content of the document. Of course, there are many ways to do this, but there is one common standard representation known as the Document Object Model, or DOM. The DOM specifies how to build data structures to represent XML documents, and it specifies some standard methods for accessing the data in that structure. The data structure is a kind of tree whose structure mirrors the structure of the document. The tree is constructed from nodes of various types. There are nodes to represent elements, attributes, and text. (The tree can also contain several other types of node, representing aspects of XML that we can ignore here.) Attributes and text can be processed without directly manipulating the corresponding nodes, so we will be concerned almost entirely with element nodes.

(The sample program XMLDemo.java lets you experiment with XML and the DOM. It has a text area where you can enter an XML document. Initially, the input area contains the sample XML document from this section. When you click a button named "Parse XML Input", the program will attempt to read the XML from the input box and build a DOM representation of that document. If the input is not well-formed XML, an error message is displayed. If it is legal, the program will traverse the DOM representation and display a list of elements, attributes, and textual content that it encounters. The program uses a few techniques for processing XML that I won't discuss here.)

In Java, the DOM representation of an XML document file can be created with just two statements. If selectedFile is a variable of type File that represents the XML file, and xmldoc is of type Document, then

DocumentBuilder docReader 
                 = DocumentBuilderFactory.newInstance().newDocumentBuilder();
xmldoc = docReader.parse(selectedFile);

will open the file, read its contents, and build the DOM representation. The classes DocumentBuilder and DocumentBuilderFactory are both defined in the package javax.xml.parsers. The method docReader.parse() does the actual work. It will throw an exception if it can't read the file or if the file does not contain a legal XML document. If it succeeds, then the value returned by docReader.parse() is an object that represents the entire XML document. (This is a very complex task! It has been coded once and for all into a method that can be used very easily in any Java program. We see the benefit of using a standardized syntax.)

The structure of the DOM data structure is defined in the package org.w3c.dom, which contains several data types that represent an XML document as a whole and the individual nodes in a document. The "org.w3c" in the name refers to the World Wide Web Consortium, W3C, which is the standards organization for the Web. DOM, like XML, is a general standard, not just a Java standard. The data types that we need here are Document, Node, Element, and NodeList. (They are defined as interfaces rather than classes, but that fact is not relevant here.) We can use methods that are defined in these data types to access the data in the DOM representation of an XML document.

An object of type Document represents an entire XML document. The return value of docReader.parse()—xmldoc in the above example—is of type Document. We will only need one method from this class: If xmldoc is of type Document, then

xmldoc.getDocumentElement()

returns a value of type Element that represents the root element of the document. (Recall that this is the top-level element that contains all the other elements.) In the sample XML document from earlier in this section, the root element consists of the tag <simplepaint version="1.0">, the end-tag </simplepaint>, and everything in between. The elements that are nested inside the root element are represented by their own nodes, which are said to be children of the root node. An object of type Element contains several useful methods. If element is of type Element, then we have:

element.getTagName() — returns a String containing the name that is used in the element's tag. For example, the name of a <curve> element is the string "curve".
element.getAttribute(attrName) — if attrName is the name of an attribute in the element, then this method returns the value of that attribute. For the element, <point x="83" y="42"/>, element.getAttribute("x") would return the string "83". Note that the return value is always a String, even if the attribute is supposed to represent a numerical value. If the element has no attribute with the specified name, then the return value is an empty string.
element.getTextContent() — returns a String containing all of the textual content that is contained in the element. Note that this includes text that is contained inside other elements that are nested inside the element.
element.getChildNodes() — returns a value of type NodeList that contains all the Nodes that are children of the element. The list includes nodes representing other elements and textual content that are directly nested in the element (as well as some other types of node that I don't care about here). The getChildNodes() method makes it possible to traverse the entire DOM data structure by starting with the root element, looking at children of the root element, children of the children, and so on. (There is a similar method that returns the attributes of the element, but I won't be using it here.)
element.getElementsByTagName(tagName) — returns a NodeList that contains all the nodes representing all elements that are nested inside element and which have the given tag name. Note that this includes elements that are nested to any level, not just elements that are directly contained inside element. The getElementsByTagName() method allows you to reach into the document and pull out specific data that you are interested in.

An object of type NodeList represents a list of Nodes. Unfortunately, it does not use the API defined for lists in the Java Collection Framework. Instead, a value, nodeList, of type NodeList has two methods: nodeList.getLength() returns the number of nodes in the list, and nodeList.item(i) returns the node at position i, where the positions are numbered 0, 1, ..., nodeList.getLength() - 1. Note that the return value of nodeList.get() is of type Node, and it might have to be type-cast to a more specific node type before it is used.

Knowing just this much, you can do the most common types of processing of DOM representations. Let's look at a few code fragments. Suppose that in the course of processing a document you come across an Element node that represents the element

<background red='1' green='0.6' blue='0.2'/>

This element might be encountered either while traversing the document with getChildNodes() or in the result of a call to getElementsByTagName("background"). Our goal is to reconstruct the data structure represented by the document, and this element represents part of that data. In this case, the element represents a color, and the red, green, and blue components are given by the attributes of the element. If element is a variable that refers to the node, the color can be obtained by saying:

double r = Double.parseDouble( element.getAttribute("red") );
double g = Double.parseDouble( element.getAttribute("green") );
double b = Double.parseDouble( element.getAttribute("blue") );
Color bgColor = Color.color(r,g,b);

Suppose now that element refers to the node that represents the element

<symmetric>true</symmetric>

In this case, the element represents the value of a boolean variable, and the value is encoded in the textual content of the element. We can recover the value from the element with:

String bool = element.getTextContent();
boolean symmetric;
if (bool.equals("true"))
   symmetric = true;
else
   symmetric = false;

Next, consider an example that uses a NodeList. Suppose we encounter an element that represents a list of Point2Ds:

<pointlist>
   <point x='17' y='42'/>   
   <point x='23' y='8'/>   
   <point x='109' y='342'/>   
   <point x='18' y='270'/>   
</pointlist>

Suppose that element refers to the node that represents the <pointlist> element. Our goal is to build the list of type ArrayList<Point2D> that is represented by the element. We can do this by traversing the NodeList that contains the child nodes of element:

ArrayList<Point2D> points = new ArrayList<>();
NodeList children = element.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
   Node child = children.item(i);   // One of the child nodes of element.
   if ( child instanceof Element ) {
      Element pointElement = (Element)child;  // One of the <point> elements.
      double x = Double.parseDouble( pointElement.getAttribute("x") );
      double y = Double.parseDouble( pointElement.getAttribute("y") );
      Point2D pt = new Point2D(x,y); // Create the Point represented by pointElement.
      points.add(pt);  // Add the point to the list of points.
   }
}

All the nested <point> elements are children of the <pointlist> element. The if statement in this code fragment is necessary because an element can have other children in addition to its nested elements. In this example, we only want to process the children that are elements.

All these techniques can be employed to write the file input method for the sample program SimplePaintWithXML.java. When building the data structure represented by an XML file, my approach is to start with a default data structure and then to modify and add to it as I traverse the DOM representation of the file. It's not a trivial process, but I hope that you can follow it:

Color newBackground = Color.WHITE;
ArrayList<CurveData> newCurves = new ArrayList<>();
Element rootElement = xmldoc.getDocumentElement();
if ( ! rootElement.getNodeName().equals("simplepaint") )
    throw new Exception("File is not a SimplePaint file.");
String version = rootElement.getAttribute("version");
try {
    double versionNumber = Double.parseDouble(version);
    if (versionNumber > 1.0)
        throw new Exception("File requires a newer version of SimplePaint.");
}
catch (NumberFormatException e) {
}
NodeList nodes = rootElement.getChildNodes();
for (int i = 0; i < nodes.getLength(); i++) {
   if (nodes.item(i) instanceof Element) {
      Element element = (Element)nodes.item(i);
      if (element.getTagName().equals("background")) {
         double r = Double.parseDouble(element.getAttribute("red"));
         double g = Double.parseDouble(element.getAttribute("green"));
         double b = Double.parseDouble(element.getAttribute("blue"));
         newBackground = Color.color(r,g,b);
      }
      else if (element.getTagName().equals("curve")) {
         CurveData curve = new CurveData();
         curve.color = Color.BLACK;
         curve.points = new ArrayList<>();
         newCurves.add(curve);
         NodeList curveNodes = element.getChildNodes();
         for (int j = 0; j < curveNodes.getLength(); j++) {
           if (curveNodes.item(j) instanceof Element) {
             Element curveElement = (Element)curveNodes.item(j);
             if (curveElement.getTagName().equals("color")) {
               double r = Double.parseDouble(curveElement.getAttribute("red"));
               double g = Double.parseDouble(curveElement.getAttribute("green"));
               double b = Double.parseDouble(curveElement.getAttribute("blue"));
               curve.color = Color.color(r,g,b);
             }
             else if (curveElement.getTagName().equals("point")) {
               double x = Double.parseDouble(curveElement.getAttribute("x"));
               double y = Double.parseDouble(curveElement.getAttribute("y"));
               curve.points.add(new Point2D(x,y));
             }
             else if (curveElement.getTagName().equals("symmetric")) {
               String content = curveElement.getTextContent();
               if (content.equals("true"))
                 curve.symmetric = true;
             }
           }
         }
      }
   }         
}
backgroundColor = newBackground;
curves = newCurves;

You can find the complete source code in SimplePaintWithXML.java.

XML has developed into an extremely important technology, and some applications of it are very complex. But there is a core of simple ideas that can be easily applied in Java. Knowing just the basics, you can make good use of XML in your own Java programs.

End of Chapter 11