SAX (Simple API for XML)

What is SAX ?

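SAX is an event-based API that lets you process a document while it is being read, so you do not have to wait for the whole document to be loaded before acting on it. Because the parser does not keep the entire document in memory, SAX needs less memory than tree-based approaches to process an XML document.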

What events are generated by a SAX parser while processing an XML document ?

The SAX API lets a developer capture the events that the parser generates as it reads. For a small document with nested elements and text, the sequence of events might look like this:

Start document
Start element
Characters
Start element
Characters
End element
Characters
Start element
Characters
End element
Characters
End element
End document

What steps are involved in SAX ?

1. Create an event handler.

2. Create the SAX parser.

3. Assign the event handler to the parser.

4. Parse the document, sending each event to the handler.
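
The four steps above can be sketched in a few lines. This is only an illustration, not the only way to wire things up: the ElementCounter class and the sample document are made up for the example, and the parse() overload used here combines steps 3 and 4 in a single call.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounterDemo {

    // Step 1: an event handler (hypothetical example class) that counts start tags.
    static class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            count++;
        }
    }

    static int countElements(String xml) throws Exception {
        ElementCounter handler = new ElementCounter();                      // step 1
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();  // step 2
        // Steps 3 and 4: this overload takes the handler and then sends it
        // every event while the document is being read.
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        // Three start tags: news, item, item.
        System.out.println(countElements("<news><item>One</item><item>Two</item></news>"));
    }
}
```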

How to create a SAX parser ?

Declare an XMLReader variable, then use SAXParserFactory to create a SAXParser. The SAXParser in turn provides the XMLReader through its getXMLReader() method.
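
A minimal sketch of that sequence follows. The class and method names are illustrative, and the call to setNamespaceAware(true) is an optional assumption, included because the namespace URI and local name are only reported reliably when namespace processing is on.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class XmlReaderFactoryDemo {

    static XMLReader createReader() throws Exception {
        XMLReader reader;  // declared first, as described above
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // optional: enable namespace processing
        SAXParser parser = factory.newSAXParser();
        reader = parser.getXMLReader();   // the SAXParser hands out the XMLReader
        return reader;
    }

    public static void main(String[] args) throws Exception {
        // Prints the implementation class chosen by the factory, which varies by JDK.
        System.out.println(createReader().getClass().getName());
    }
}
```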

How to turn validation on/off in SAX ?
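
One common approach, sketched here under the assumption that the parser comes from JAXP's SAXParserFactory, is to call setValidating() on the factory before creating the parser. Validation is off by default. A reader that already exists can usually be switched through the standard SAX feature http://xml.org/sax/features/validation. The class and method names below are illustrative.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class ValidationToggleDemo {

    // Standard SAX feature name that controls validation on an XMLReader.
    static final String VALIDATION_FEATURE = "http://xml.org/sax/features/validation";

    // Option 1: decide on the factory, before the parser is created.
    static SAXParser newParser(boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        return factory.newSAXParser();
    }

    // Option 2: flip the feature on an existing reader (must be done before parsing starts).
    static void setValidation(XMLReader reader, boolean validating) throws Exception {
        reader.setFeature(VALIDATION_FEATURE, validating);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(newParser(true).isValidating());
        System.out.println(newParser(false).isValidating());
    }
}
```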

What is startDocument() event in SAX ?

The parser calls startDocument() once, at the very beginning of the document and before any other event, which makes it a natural place for initialization. Like the other SAX event methods, it is declared to throw SAXException.

What is startElement() event in SAX ?

The parser passes several pieces of information to the startElement() event:

The namespace URI – The actual namespace is a URI, not the prefix (alias) that gets added to an element or attribute name. For example, http://www.example.com.

The local name – This is the actual name of the element, such as news. If the document does not provide namespace information, the parser may not be able to determine which part of the qName is the localName.

The qualified name, or qName – This is the combination of the namespace prefix, if any, and the actual name of the element. The qName also includes the colon (:) if there is one – for example, revised:news.

Any attributes – The attributes for an element are passed in a single Attributes object, which can be queried by index or by name.
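
To see these four pieces side by side, a handler can simply record what it receives. The sketch below assumes a namespace-aware parser and a made-up sample element that reuses the revised:news name and the http://www.example.com URI from the text. The date attribute is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StartElementDemo {

    // Returns one descriptive line per start tag found in the document.
    static List<String> describeElements(String xml) throws Exception {
        List<String> lines = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                StringBuilder line = new StringBuilder();
                line.append("uri=").append(uri)
                    .append(" localName=").append(localName)
                    .append(" qName=").append(qName);
                // Attributes are read by index here; getValue(String qName) also exists.
                for (int i = 0; i < attributes.getLength(); i++) {
                    line.append(" @").append(attributes.getQName(i))
                        .append('=').append(attributes.getValue(i));
                }
                lines.add(line.toString());
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // without this, uri and localName may be empty
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<revised:news xmlns:revised=\"http://www.example.com\" date=\"2024-01-01\"/>";
        describeElements(xml).forEach(System.out::println);
    }
}
```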

What is endElement() event in SAX ?

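The parser calls endElement() when it reaches the closing tag of an element. This is often the signal to process the content that was collected for that element.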

What is characters() event in SAX ?

The characters() event delivers the actual text data of the document. Its signature is: public void characters(char[] ch, int start, int length) throws SAXException

Two important things here:

Range: The characters() event includes more than just a string of characters. It also includes start and length information. The "ch" array may be the parser's own buffer and can hold far more than the current text (possibly the entire document), so the application must read only the length characters that begin at start and must not touch anything outside that range.

Frequency: Nothing in the SAX specification requires a parser to deliver characters in any particular way, so a single run of text may arrive in several pieces. Always wait for the endElement() event before assuming you have all the content of an element. Parsers may also use ignorableWhitespace() to report whitespace in element content; a validating parser must report such whitespace this way.
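
Both points suggest the same pattern: append each chunk to a buffer using exactly the given range, and read the buffer only in endElement(). The following sketch assumes a hypothetical item element that holds plain text. The class names are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectorDemo {

    static class TextCollector extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        final List<String> values = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            buffer.setLength(0);  // start fresh for each element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Range: copy only [start, start + length).
            // Frequency: this may run several times for one piece of text.
            buffer.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("item".equals(qName)) {
                values.add(buffer.toString());  // the text is complete only now
            }
        }
    }

    static List<String> itemTexts(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), collector);
        return collector.values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(itemTexts("<news><item>One</item><item>Two &amp; Three</item></news>"));
    }
}
```

The entity reference in the second item is a typical case where a parser may split the text into several characters() calls, which is why the value is assembled in the buffer rather than taken from a single call.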

What is endDocument() event in SAX ?

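The parser calls endDocument() once, after the document has been completely parsed. It is the place for any work that should happen at the end of the document, such as reporting results or releasing resources.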

How to write a ContentHandler for SAX ?
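
A common way to write a ContentHandler is to extend org.xml.sax.helpers.DefaultHandler, which supplies empty implementations of every callback, and override only the events of interest. The sketch below is an assumption about how such a handler might look, not the original listing: the hypothetical EventLogger overrides the five events discussed above and records them in order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EventLoggerDemo {

    // Hypothetical handler: records every event it receives, in order.
    static class EventLogger extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            events.add("startElement:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("characters:" + new String(ch, start, length));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement:" + qName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    static List<String> trace(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EventLogger logger = new EventLogger();
        reader.setContentHandler(logger);  // register the handler with the reader
        reader.parse(new InputSource(new StringReader(xml)));
        return logger.events;
    }

    public static void main(String[] args) throws Exception {
        trace("<news>Hi<item>One</item></news>").forEach(System.out::println);
    }
}
```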

How to write an InputSource for SAX ?
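
An InputSource wraps whatever the parser should read: a character stream, a byte stream, or a system identifier (URI). The sketch below shows the first two constructors. The helper names and the file:///tmp/news.xml identifier are made up for the example, and rootElementName() exists only to show the sources being parsed.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InputSourceDemo {

    // From a character stream: the parser reads characters directly.
    static InputSource fromString(String xml) {
        return new InputSource(new StringReader(xml));
    }

    // From a byte stream: the parser detects the encoding itself. The system ID
    // is optional and serves as the base for resolving relative URIs.
    static InputSource fromBytes(byte[] bytes, String systemId) {
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setSystemId(systemId);
        return source;
    }

    // Parses the source and returns the qName of the first element seen.
    static String rootElementName(InputSource source) throws Exception {
        String[] root = new String[1];
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if (root[0] == null) {
                    root[0] = qName;
                }
            }
        });
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<news><item>One</item></news>";
        System.out.println(rootElementName(fromString(xml)));
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElementName(fromBytes(bytes, "file:///tmp/news.xml")));
    }
}
```

A third constructor, new InputSource(String systemId), lets the parser open the URI itself.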

What are the ErrorHandler events in SAX ?

Just as the ContentHandler has predefined events for handling content, the ErrorHandler has predefined events for handling errors. If the same class (NewsReader in this example) is registered as the error handler as well as the content handler, it needs to override the default implementations of those methods. There are three events to be concerned with: warning, error, and fatalError.
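
The three events can be sketched as follows. This example implements ErrorHandler directly in a hypothetical CollectingErrorHandler. A handler that extends DefaultHandler would override the same three methods. Rethrowing from fatalError() is a design choice made here, not a requirement of the interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorHandlerDemo {

    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            messages.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            // Recoverable errors, for example validity violations.
            messages.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            // Non-recoverable errors, for example a document that is not well-formed.
            messages.add("fatalError at line " + e.getLineNumber() + ": " + e.getMessage());
            throw e;  // stop parsing
        }
    }

    static List<String> parseAndCollect(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CollectingErrorHandler errors = new CollectingErrorHandler();
        reader.setErrorHandler(errors);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError(); nothing more to do here.
        }
        return errors.messages;
    }

    public static void main(String[] args) throws Exception {
        // The item element is never closed, so the document is not well-formed.
        parseAndCollect("<news><item></news>").forEach(System.out::println);
    }
}
```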

What is org.xml.sax.EntityResolver interface in SAX ?

If a SAX application needs to implement customized handling for external entities, it must implement this interface and register an instance with the SAX driver using the setEntityResolver method.

The XML reader will then allow the application to intercept any external entities (including the external DTD subset and external parameter entities, if any) before including them.

Many SAX applications will not need to implement this interface, but it will be especially useful for applications that build XML documents from databases or other specialised input sources, or for applications that use URI types other than URLs.

The application can also use this interface to redirect system identifiers to local URIs or to look up replacements in a catalog (possibly by using the public identifier).

org.xml.sax.EntityResolver (http://www.saxproject.org/apidoc/org/xml/sax/EntityResolver.html) is a callback interface much like ContentHandler. It is attached to an org.xml.sax.XMLReader (http://www.saxproject.org/apidoc/org/xml/sax/XMLReader.html) through the reader's set and get methods, setEntityResolver(EntityResolver resolver) and getEntityResolver().

The EntityResolver interface contains just a single method, resolveEntity(…). If you register an EntityResolver with an XMLReader, then every time that XMLReader needs to load an external parsed entity, it will pass the entity’s public ID and system ID to resolveEntity(…) first. The external entities can be: external DTD subset, external parameter entities, etc.

The EntityResolver allows you to substitute your own URI lookup scheme for external entities. This is especially useful for entities that use URL and URI schemes not supported by Java’s protocol handlers, for example jdbc:/ or isbn:/.

The resolveEntity(…) can either return an InputSource or null. If it returns an InputSource, then this InputSource provides the entity’s replacement text. If it returns null, then the parser reads the entity in the same way it would have if there wasn’t an EntityResolver – by using the system ID and the java.net.URL class.

You could replace the host in the system ID to load the DTDs from a mirror site. You could bundle the DTDs into your application’s JAR file and load them from there. You could even hardwire the DTDs in the EntityResolver as string literals and load them with a StringReader.
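
The last option, a DTD held as a string literal, might look like the sketch below. The DTD text, the news.dtd name, and the greeting entity are all invented for the example. It assumes a parser that reads the external DTD subset by default, as the JDK's built-in parser does.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InlineDtdResolverDemo {

    // Hypothetical DTD text; a real application would keep its own declarations here.
    static final String NEWS_DTD =
        "<!ELEMENT news (#PCDATA)>\n<!ENTITY greeting \"hello\">";

    static class InlineDtdResolver implements EntityResolver {
        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            if (systemId != null && systemId.endsWith("news.dtd")) {
                // Serve the DTD from memory instead of fetching the URL.
                return new InputSource(new StringReader(NEWS_DTD));
            }
            return null;  // let the parser resolve everything else as usual
        }
    }

    // Parses the document and returns all of its character data.
    static String parseText(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new InlineDtdResolver());
        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE news SYSTEM \"http://www.example.com/news.dtd\">"
                   + "<news>&greeting; world</news>";
        System.out.println(parseText(xml));
    }
}
```

Because the resolver answers for news.dtd itself, the parser has no need to contact www.example.com, and the greeting entity declared in the in-memory DTD is expanded in the document text.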

A resolver can also redirect a system identifier to a local URI.
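
A minimal sketch of such a redirecting resolver is shown here. The original listing is not available, so the class name, the prefix-mapping approach, and the example URIs are all assumptions.

```java
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps system IDs under a remote prefix to a local prefix.
class LocalRedirectResolver implements EntityResolver {

    private final String remotePrefix;
    private final String localPrefix;

    LocalRedirectResolver(String remotePrefix, String localPrefix) {
        this.remotePrefix = remotePrefix;
        this.localPrefix = localPrefix;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.startsWith(remotePrefix)) {
            String localUri = localPrefix + systemId.substring(remotePrefix.length());
            // The parser will open this URI instead of the original one.
            return new InputSource(localUri);
        }
        return null;  // not ours: fall back to the parser's default behaviour
    }

    public static void main(String[] args) {
        LocalRedirectResolver resolver =
            new LocalRedirectResolver("http://www.example.com/dtds/", "file:///opt/app/dtds/");
        InputSource source = resolver.resolveEntity(null, "http://www.example.com/dtds/news.dtd");
        System.out.println(source.getSystemId());
    }
}
```

It would be registered with reader.setEntityResolver(new LocalRedirectResolver(...)) before parsing.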

When to use SAX ?

It is helpful to understand the SAX event model when you want to convert existing data to XML. The key to the conversion process is to modify an existing application to deliver SAX events as it reads the data.
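
One way to picture this conversion, offered as a sketch rather than a recommended design, is to have the application call the ContentHandler methods itself while it walks its own data. Here a plain list of names stands in for the existing data, and a TransformerHandler from the JDK's javax.xml.transform.sax package serializes the events as XML. The names and name element names are made up.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class SaxEventProducerDemo {

    static String namesToXml(List<String> names) throws Exception {
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler out = factory.newTransformerHandler();  // a ContentHandler that writes XML
        out.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        out.setResult(new StreamResult(writer));

        AttributesImpl noAttributes = new AttributesImpl();
        // The application plays the role of the parser: it emits the events itself.
        out.startDocument();
        out.startElement("", "names", "names", noAttributes);
        for (String name : names) {
            out.startElement("", "name", "name", noAttributes);
            out.characters(name.toCharArray(), 0, name.length());
            out.endElement("", "name", "name");
        }
        out.endElement("", "names", "names");
        out.endDocument();
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(namesToXml(Arrays.asList("Alice", "Bob")));
    }
}
```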

SAX is fast and efficient, but its event model makes it most useful for state-independent filtering. For example, a SAX parser calls one method in your application when an element tag is encountered and calls a different method when text is found. If the processing you are doing is state-independent (meaning that it does not depend on the elements that have come before), then SAX works fine.

On the other hand, for state-dependent processing, where the program needs to do one thing with the data under element A but something different with the data under element B, a pull parser such as the Streaming API for XML (StAX) would be a better choice. With a pull parser, you get the next node, whatever it happens to be, at any point in the code where you ask for it. So it is easy to vary the way you process text (for example), because you can process it in multiple places in the program.

SAX requires much less memory than DOM, because SAX does not construct an internal representation (tree structure) of the XML data, as a DOM does. Instead, SAX simply sends data to the application as it is read; your application can then do whatever it wants to do with the data it sees.

Pull parsers and the SAX API both act like a serial I/O stream. You see the data as it streams in, but you cannot go back to an earlier position or leap ahead to a different position. In general, such parsers work well when you simply want to read data and have the application act on it.

But when you need to modify an XML structure – especially when you need to modify it interactively – an in-memory structure makes more sense. DOM is one such model. However, although DOM provides many powerful capabilities for large-scale documents (like books and articles), it also requires a lot of complex coding.

For simpler applications, that complexity may well be unnecessary. For faster development and simpler applications, one of the object-oriented XML-programming standards, such as JDOM (http://www.jdom.org) and DOM4J (http://dom4j.org/), might make more sense.