SAX – Simple API for XML

Introduction

SAX, the Simple API for XML, lets you process a document while it is being read, so you do not have to wait for the whole document to be stored before acting on it. A DOM (Document Object Model) parser, by contrast, first loads the entire document into memory before taking action. Because a SAX parser creates no parse tree, it needs far less memory to process an XML document.

The SAX API allows a developer to capture the events generated by the SAX parser. For example, a short document with two child elements might produce the following sequence of events:

  • Start document
  • Start element
  • Characters
  • Start element
  • Characters
  • End element
  • Characters
  • Start element
  • Characters
  • End element
  • Characters
  • End element
  • End document
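The sequence above can be reproduced with a small recording handler. The following is only a sketch: the class name EventTrace, its trace() helper, and the sample document are assumptions made for illustration. Consecutive characters() calls are merged into one entry, because a parser is free to deliver a single run of text in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the name of each SAX event in the order received.
class EventTrace extends DefaultHandler {

    final List<String> events = new ArrayList<>();

    // Merge back-to-back "Characters" entries into a single entry.
    private void record(String event) {
        boolean repeatedText = event.equals("Characters")
                && !events.isEmpty()
                && events.get(events.size() - 1).equals("Characters");
        if (!repeatedText) {
            events.add(event);
        }
    }

    @Override
    public void startDocument() { record("Start document"); }

    @Override
    public void endDocument() { record("End document"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        record("Start element");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        record("End element");
    }

    @Override
    public void characters(char[] ch, int start, int length) { record("Characters"); }

    // Parses the given XML text and returns the recorded event names.
    static List<String> trace(String xml) {
        try {
            EventTrace handler = new EventTrace();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // The whitespace between the tags is what produces the extra Characters events.
        String xml = "<news>\n  <item>first</item>\n  <item>second</item>\n</news>";
        trace(xml).forEach(System.out::println);
    }
}
```

Note that the whitespace between tags is reported through characters() here because the sample document has no DTD.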

Document Parsing Steps

  1. Create an event handler.
  2. Create the SAX parser.
  3. Assign the event handler to the parser.
  4. Parse the document, sending each event to the handler.
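The four steps can be sketched end to end as follows. This is a minimal illustration, not the only way to wire things up: the ParsingSteps class, the ElementCounter handler, and the inline sample document are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParsingSteps {

    // Hypothetical handler: counts every element start tag it is told about.
    static class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    static int countElements(String xml) {
        try {
            // 1. Create an event handler.
            ElementCounter handler = new ElementCounter();
            // 2. Create the SAX parser.
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            XMLReader xmlReader = saxParser.getXMLReader();
            // 3. Assign the event handler to the parser.
            xmlReader.setContentHandler(handler);
            // 4. Parse the document, sending each event to the handler.
            xmlReader.parse(new InputSource(new StringReader(xml)));
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // One <news> element plus two <item> elements.
        System.out.println(countElements("<news><item/><item/></news>")); // prints 3
    }
}
```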

Creating SAX Parser

First declare the XMLReader variable and then use SAXParserFactory to create a SAXParser. It is the SAXParser that gives us the XMLReader.

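import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NewsReader extends DefaultHandler {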

    public static void main(String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }

}

Turning On/Off Validation

Pass true or false to the factory's setValidating() method to turn validation on or off, respectively.

public class NewsReader extends DefaultHandler {
    public static void main(String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            // Use one of the two calls below; if both are made, the later one wins.
            spfactory.setValidating(false); // turn validation off (the default)
            spfactory.setValidating(true);  // turn validation on
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}
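To see what the switch changes, the sketch below parses the same document twice, once with validation off and once with it on. The ValidationToggle class and the inline document are assumptions made for illustration. The document is well-formed but does not match its own DTD, so only the validating run should report validity errors through the error handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidationToggle {

    // The DTD allows only text inside <news>, so the <extra/> child is
    // invalid even though the document is well-formed.
    static final String XML =
            "<!DOCTYPE news [<!ELEMENT news (#PCDATA)>]>"
            + "<news><extra/></news>";

    // Returns how many validity errors the parser reported.
    static int countValidityErrors(boolean validating) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            spfactory.setValidating(validating);
            XMLReader xmlReader = spfactory.newSAXParser().getXMLReader();
            // Validity problems arrive as recoverable errors, not exceptions.
            xmlReader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            xmlReader.parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return errors.size();
    }

    public static void main(String[] args) {
        System.out.println("validation off: " + countValidityErrors(false) + " error(s)");
        System.out.println("validation on:  " + countValidityErrors(true) + " error(s)");
    }
}
```

Registering an error handler matters here: without one, a validating parser has nowhere useful to send the validity errors it finds.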

Events in SAX Parser

A SAX parser generates a different series of events for each document it reads. The major event-handling methods are: startDocument, endDocument, startElement, and endElement.

Document Events

The document events are invoked when the parser encounters the start and end points of the document being parsed.

startDocument()

The parser invokes startDocument() once, at the beginning of the XML document, so you would generally use this event to perform any setup before the content arrives.

This method, like the other SAX event methods, is declared to throw a SAXException.

public class NewsReader extends DefaultHandler {

    public void startDocument() throws SAXException {
        System.out.println("Tallying news results...");
    }
	
    public static void main(String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
	
}

endDocument()

Once the document has been completely parsed and the parser reaches the end of the document, the endDocument() method is invoked.

public class NewsReader extends DefaultHandler {

    int indent = 0;
    String thisNews = "";
    String thisElement = "";
	
    public void startDocument() throws SAXException {
        System.out.println("Tallying news results...");
        indent = -4;
    }
	
    private void printIndent(int indentSize) {
        for (int s = 0; s < indentSize; s++) {
            System.out.print(" ");
        }
    }
	
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        if ("news".equals(qName)) { // compare string contents, not references
            thisNews = atts.getValue("subject");
        }
        thisElement = qName;
    }
	
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        thisNews = "";
        thisElement = "";
    }
	
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if ("news".equals(thisElement)) { // compare string contents, not references
            printIndent(4);
            System.out.print(thisNews + ": ");
            System.out.println(new String(ch, start, length));
        }
    }
	
    public void endDocument() throws SAXException {
        // Called once, after the last event of the document.
        System.out.println("Finished tallying news results.");
    }
	
    public static void main (String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
	
}

Element Events

The parser processes the element tags, including any attributes defined in the start tag, to obtain the namespace Uniform Resource Identifier (URI), the local name, and the qualified name of that element.

startElement()

The parser passes several pieces of information to the startElement() event:

  • The namespace URI – An actual namespace is a URI of some sort and not the alias that gets added to an element or attribute name. For example, http://www.example.com.
  • The local name – This is the actual name of the element, such as news. If the document does not provide namespace information, the parser may not be able to determine which part of the qName is the localName.
  • The qualified name, or qName – This is actually a combination of namespace information, if any, and the actual name of the element. The qName also includes the colon (:) if there is one – for example, revised:news.
  • Any attributes – The attributes for an element are actually passed as a collection of objects.

public class NewsReader extends DefaultHandler {

    public void startDocument() throws SAXException {
        System.out.println("Tallying news results...");
    }
	
    public void startElement(String namespaceURI, String localName,
                String qName, Attributes atts) throws SAXException {
        System.out.println("Start element: " + qName);
    }
    
    public static void main(String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
	
}
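The difference between the namespace URI, the local name, and the qName is easiest to see with namespace processing switched on. The sketch below is an illustration only: the ElementNames class and its describeRoot() helper are hypothetical, and the sample element reuses the revised:news name and the http://www.example.com URI from the list above. Note that a parser from SAXParserFactory is not namespace-aware unless setNamespaceAware(true) is called.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ElementNames {

    // Returns { namespaceURI, localName, qName } as reported for the root element.
    static String[] describeRoot(String xml) {
        String[] names = new String[3];
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            // Without this call the namespace URI and local name are typically empty.
            spfactory.setNamespaceAware(true);
            XMLReader xmlReader = spfactory.newSAXParser().getXMLReader();
            xmlReader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String namespaceURI, String localName,
                        String qName, Attributes atts) {
                    if (names[0] == null) { // keep only the first (root) element
                        names[0] = namespaceURI;
                        names[1] = localName;
                        names[2] = qName;
                    }
                }
            });
            xmlReader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<revised:news xmlns:revised=\"http://www.example.com\"/>";
        String[] names = describeRoot(xml);
        System.out.println("namespace URI: " + names[0]);
        System.out.println("local name:    " + names[1]);
        System.out.println("qName:         " + names[2]);
    }
}
```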

endElement()

The parser fires the endElement() event when it reaches an element's end tag. This is often the signal to process the contents of that element.

public class NewsReader extends DefaultHandler {

    int indent = 0;
	
    public void startDocument() throws SAXException {
        System.out.println("Tallying news results...");
        indent = -4;
    }
	
    private void printIndent(int indentSize) {
        for (int s = 0; s < indentSize; s++) {
            System.out.print(" ");
        }
    }
	
    public void startElement(String namespaceURI, String localName,
                String qName, Attributes atts) throws SAXException {
        indent = indent + 4;
        printIndent(indent);
        System.out.print("Start element: ");
        System.out.println(qName);
        for (int att = 0; att < atts.getLength(); att++) {
            printIndent(indent + 4);
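            // Use the qualified name: the local name can be empty when the
            // parser is not namespace-aware (the SAXParserFactory default).
            String attName = atts.getQName(att);
            System.out.println(" " + attName + ": " + atts.getValue(att));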
        }
    }
	
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        printIndent(indent);
        System.out.println("End Element: " + qName); // localName may be empty without namespace processing
        indent = indent - 4;
    }
	
    public static void main(String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
	
}

Characters Event

The JAXP SAX API also lets you handle the characters that the parser delivers to your application, using the ContentHandler.characters() method.

The actual text data is retrieved through characters(). The signature of the method is given below:

public void characters(char[] ch, int start, int length) throws SAXException

Two important notes about character event are given below:

  • Range: The characters() event includes more than just a string of characters. It also includes start and length information. The ch character array may hold much more than the text for this event, possibly the entire document. The application must not attempt to read characters outside the range that the event passes to characters().
  • Frequency: Nothing in the SAX specification requires a processor to return characters in any particular way, so it is possible for a single chunk of text to be returned in several pieces. Always make sure that the endElement() event has occurred before assuming you have all the content of an element. Also, processors may use ignorableWhitespace() to return whitespace within an element. This is always the case for a validating parser.

public class NewsReader extends DefaultHandler {

    int indent = 0;
    String thisNews = "";
    String thisElement = "";
    
    public void startDocument() throws SAXException {
        System.out.println("Tallying news results...");
        indent = -4;
    }
    
    private void printIndent(int indentSize) {
        for (int s = 0; s < indentSize; s++) {
            System.out.print(" ");
        }
    }
    
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        if ("news".equals(qName)) { // compare string contents, not references
            thisNews = atts.getValue("subject");
        }
        thisElement = qName;
    }
    
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        thisNews = "";
        thisElement = "";
    }
	
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if ("news".equals(thisElement)) { // compare string contents, not references
            printIndent(4);
            System.out.print(thisNews + ": ");
            System.out.println(new String(ch, start, length));
        }
    }
	
    public static void main (String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
	
}
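Because text may arrive in several pieces, a common pattern is to append every chunk to a buffer and read the buffer only in endElement(). The following is a minimal sketch of that pattern. The TextCollector class and the sample document are hypothetical, and the sketch assumes that news elements contain only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the complete text of each <news> element.
class TextCollector extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();
    final List<String> newsTexts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("news".equals(qName)) {
            buffer.setLength(0); // start with an empty buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Read only the range [start, start + length) that the event describes.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("news".equals(qName)) {
            newsTexts.add(buffer.toString()); // the text is complete only now
        }
    }

    static List<String> collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.newsTexts;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // An entity reference such as &amp; is a typical place for a parser to split text.
        String xml = "<list><news>Fish &amp; chips</news><news>Second</news></list>";
        collect(xml).forEach(System.out::println);
    }
}
```

Printing directly from characters(), as the listing above does, can emit one element's text in several fragments. Buffering avoids that at the cost of holding one element's text in memory.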

ContentHandler in SAX

The most important interface in a SAX application is ContentHandler. This interface requires a number of methods that the SAX parser invokes in response to various parsing events.

The major event-handling methods are: startDocument, endDocument, startElement, and endElement.

The easiest way to implement this interface is to extend the DefaultHandler class, defined in the org.xml.sax.helpers package. This class provides do-nothing methods for all the ContentHandler events.

The source of the ContentHandler interface is given below:

public interface ContentHandler {
    public void setDocumentLocator (Locator locator);
    public void startDocument () throws SAXException;
    public void endDocument() throws SAXException;
    public void startPrefixMapping (String prefix, String uri) throws SAXException;
    public void endPrefixMapping (String prefix) throws SAXException;
    public void startElement (String namespaceURI, String localName,
                  String qName, Attributes atts) throws SAXException;
    public void endElement (String namespaceURI, String localName,
                String qName) throws SAXException;
    public void characters (char ch[], int start, int length) throws SAXException;
    public void ignorableWhitespace (char ch[], int start, int length) throws SAXException;
    public void processingInstruction (String target, String data) throws SAXException;
    public void skippedEntity (String name) throws SAXException;
}

Example

...
    xmlReader = saxParser.getXMLReader();
    xmlReader.setContentHandler(new NewsReader());
} catch (Exception e) {
    ...

ErrorHandler Event

Just as the ContentHandler has predefined events for handling content, the ErrorHandler has predefined events for handling errors. DefaultHandler also implements ErrorHandler, so when NewsReader is registered as the error handler as well as the content handler, you override the default implementations of those methods. You need to be concerned with three events: warning, error, and fatalError:

public class NewsReader extends DefaultHandler {
    int indent = 0;
    String thisNews = "";
    String thisElement = "";
    public void startDocument() throws SAXException {
        System.out.println("Tallying news results...");
        indent = -4;
    }
    private void printIndent(int indentSize) {
        for (int s = 0; s < indentSize; s++) {
            System.out.print(" ");
        }
    }
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        if ("news".equals(qName)) { // compare string contents, not references
            thisNews = atts.getValue("subject");
        }
        thisElement = qName;
    }
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        thisNews = "";
        thisElement = "";
    }
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if ("news".equals(thisElement)) { // compare string contents, not references
            printIndent(4);
            System.out.print(thisNews + ": ");
            System.out.println(new String(ch, start, length));
        }
    }
    public void endDocument() throws SAXException {
        // Called once, after the last event of the document.
        System.out.println("Finished tallying news results.");
    }
    public void error(SAXParseException e) {
        System.out.println("Error parsing the file: " + e.getMessage());
    }
    public void warning(SAXParseException e) {
        System.out.println("Problem parsing the file: " + e.getMessage());
    }
    public void fatalError(SAXParseException e) {
        System.out.println("Error parsing the file: " + e.getMessage());
        System.exit(1);
    }
    public static void main(String args[]) {
        XMLReader xmlReader = null;
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            SAXParser saxParser = spfactory.newSAXParser();
            xmlReader = saxParser.getXMLReader();
            xmlReader.setContentHandler(new NewsReader());
            xmlReader.setErrorHandler(new NewsReader());
            InputSource source = new InputSource("news.xml");
            ...
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}
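As an alternative to calling System.exit() from fatalError(), a handler can record the problem and rethrow it so that the caller decides what to do. The sketch below assumes a hypothetical ErrorReport class and feeds it a deliberately malformed document. It is one possible arrangement, not the only one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorReport {

    // Parses the XML text and returns a description of each reported problem.
    static List<String> parseAndReport(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xmlReader.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    problems.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
                }

                @Override
                public void error(SAXParseException e) {
                    problems.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    problems.add("fatal at line " + e.getLineNumber() + ": " + e.getMessage());
                    throw e; // parsing cannot continue reliably, so stop here
                }
            });
            xmlReader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError(); nothing more to do.
        } catch (Exception e) {
            throw new IllegalStateException("Could not set up or run the parser", e);
        }
        return problems;
    }

    public static void main(String[] args) {
        // <item> is never closed, so the document is not well-formed.
        parseAndReport("<news><item></news>").forEach(System.out::println);
    }
}
```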

EntityResolver in SAX

If a SAX application needs to implement customized handling for external entities, it must implement the EntityResolver interface and register an instance with the SAX driver using the setEntityResolver method.

The XML reader will then allow the application to intercept any external entities (including the external DTD subset and external parameter entities, if any) before including them.

Many SAX applications will not need to implement this interface, but it will be especially useful for applications that build XML documents from databases or other specialised input sources, or for applications that use URI types other than URLs.

The application can also use this interface to redirect system identifiers to local URIs or to look up replacements in a catalog (possibly by using the public identifier).

org.xml.sax.EntityResolver (http://www.saxproject.org/apidoc/org/xml/sax/EntityResolver.html) is a callback interface much like ContentHandler. It is attached to an org.xml.sax.XMLReader (http://www.saxproject.org/apidoc/org/xml/sax/XMLReader.html) interface with set and get methods:

public interface XMLReader {
    ...
    public void setEntityResolver(EntityResolver resolver);
    public EntityResolver getEntityResolver();
    ...
}

The EntityResolver interface contains just a single method, resolveEntity(…). If you register an EntityResolver with an XMLReader, then every time that XMLReader needs to load an external parsed entity, it will pass the entity’s public ID and system ID to resolveEntity(…) first. The external entities can be: external DTD subset, external parameter entities, etc.

The EntityResolver allows you to substitute your own URI lookup scheme for external entities. This is especially useful for entities that use URL and URI schemes not supported by Java’s protocol handlers, e.g. jdbc:/ or isbn:/.

The resolveEntity(…) can either return an InputSource or null. If it returns an InputSource, then this InputSource provides the entity’s replacement text. If it returns null, then the parser reads the entity in the same way it would have if there wasn’t an EntityResolver – by using the system ID and the java.net.URL class.

You could replace the host in the system ID to load the DTDs from a mirror site. You could bundle the DTDs into your application’s JAR file and load them from there. You could even hardwire the DTDs in the EntityResolver as string literals and load them with a StringReader.

public interface EntityResolver {
    public InputSource resolveEntity(String publicId,
        String systemId) throws SAXException, IOException;
}

The following resolver will redirect system identifier to local URI:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class MyEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId)
        throws FileNotFoundException {
        if (systemId.equals("http://server.com/DTD/news.dtd")) {
            return new InputSource(new FileInputStream("local/DTD/news.dtd"));
        } else {
            // use the default behaviour
            return null;
        }
    }
}
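The string-literal approach mentioned earlier, loading a hardwired DTD with a StringReader, can be sketched in the same way. Everything specific here is an assumption for illustration: the InMemoryDtdResolver class, the one-line DTD text, and the sample document. The system identifier is the one used in the resolver above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical resolver: serves one DTD from a string instead of the network.
class InMemoryDtdResolver implements EntityResolver {

    static final String DTD_SYSTEM_ID = "http://server.com/DTD/news.dtd";
    static final String DTD_TEXT = "<!ELEMENT news (#PCDATA)>"; // assumed DTD content

    final List<String> requested = new ArrayList<>(); // system IDs the parser asked for

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        requested.add(systemId);
        if (DTD_SYSTEM_ID.equals(systemId)) {
            InputSource source = new InputSource(new StringReader(DTD_TEXT));
            source.setSystemId(systemId); // keep the original ID for error messages
            return source;
        }
        return null; // fall back to the parser's default behaviour
    }

    // Parses the XML text and returns the system IDs passed to the resolver.
    static List<String> parseWithResolver(String xml) {
        try {
            InMemoryDtdResolver resolver = new InMemoryDtdResolver();
            XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xmlReader.setEntityResolver(resolver);
            xmlReader.parse(new InputSource(new StringReader(xml)));
            return resolver.requested;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE news SYSTEM \"" + DTD_SYSTEM_ID + "\"><news>hello</news>";
        System.out.println("Resolver was asked for: " + parseWithResolver(xml));
    }
}
```

Because the resolver returns an InputSource for the DTD, the parser should not need to contact server.com at all.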

When to Use SAX?

It is helpful to understand the SAX event model when you want to convert existing data to XML. The key to the conversion process is to modify an existing application to deliver SAX events as it reads the data.
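As a rough sketch of that idea, the program below plays the role of the "existing application": it walks over in-memory data and delivers SAX events to a handler, which serializes them as XML. The RecordsToXml class, the element names, and the sample data are hypothetical. The sketch also assumes that the default TransformerFactory is a SAXTransformerFactory, which holds for the JDK's built-in implementation but is checked by the cast rather than guaranteed.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class RecordsToXml {

    // Converts plain data to XML by firing SAX events at a serializing handler.
    static String toXml(String subject, String[] items) {
        try {
            // Assumption: the default factory supports SAX (true for the JDK's own).
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler out = factory.newTransformerHandler();
            out.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter xml = new StringWriter();
            out.setResult(new StreamResult(xml));

            AttributesImpl newsAtts = new AttributesImpl();
            newsAtts.addAttribute("", "subject", "subject", "CDATA", subject);

            // The same events a parser would generate, produced by hand.
            out.startDocument();
            out.startElement("", "news", "news", newsAtts);
            for (String item : items) {
                out.startElement("", "item", "item", new AttributesImpl());
                out.characters(item.toCharArray(), 0, item.length());
                out.endElement("", "item", "item");
            }
            out.endElement("", "news", "news");
            out.endDocument();
            return xml.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not generate XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("sports", new String[] {"alpha", "beta"}));
    }
}
```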

SAX is fast and efficient, but its event model makes it most useful for state-independent filtering. For example, a SAX parser calls one method in your application when an element tag is encountered and calls a different method when text is found. If the processing you are doing is state-independent (meaning that it does not depend on the elements that have come before), then SAX works fine.

On the other hand, for state-dependent processing, where the program needs to do one thing with the data under element A but something different with the data under element B, a pull parser such as the Streaming API for XML (StAX) would be a better choice. With a pull parser, you get the next node, whatever it happens to be, at any point in the code that you ask for it. So it is easy to vary the way you process text (for example), because you can process it in multiple places in the program.
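A small StAX sketch shows what this looks like in practice. The PullExample class, the A and B element names, and the two transformations (upper-casing and reversing) are made up for illustration. The point is only that the text under each element is pulled and handled at a different place in the code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullExample {

    // Handles text under <A> one way and text under <B> another way.
    static List<String> process(String xml) {
        List<String> results = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("A")) {
                        // The application asks for the text exactly where it wants it.
                        results.add(reader.getElementText().toUpperCase());
                    } else if (name.equals("B")) {
                        String text = reader.getElementText();
                        results.add(new StringBuilder(text).reverse().toString());
                    }
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(process("<doc><A>one</A><B>two</B></doc>")); // prints [ONE, owt]
    }
}
```

With SAX, the same logic would need a field that remembers which element is currently open, as the thisElement field does in the NewsReader listings.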

SAX requires much less memory than DOM, because SAX does not construct an internal representation (tree structure) of the XML data, as a DOM does. Instead, SAX simply sends data to the application as it is read; your application can then do whatever it wants to do with the data it sees.

Pull parsers and the SAX API both act like a serial I/O stream. You see the data as it streams in, but you cannot go back to an earlier position or leap ahead to a different position. In general, such parsers work well when you simply want to read data and have the application act on it.

But when you need to modify an XML structure – especially when you need to modify it interactively – an in-memory structure makes more sense. DOM is one such model. However, although DOM provides many powerful capabilities for large-scale documents (like books and articles), it also requires a lot of complex coding.

For simpler applications, that complexity may well be unnecessary. For faster development and simpler applications, one of the object-oriented XML-programming standards, such as JDOM (http://www.jdom.org) and DOM4J (http://dom4j.org/), might make more sense.

Thanks for reading.
