Advanced SAX features
The features we've covered so far are probably enough for
90% of SAX applications. But it's useful to know something of the rest of the
features, for those occasions when they are needed. This section of the chapter
gives a survey of these features and their intended purpose.
Alternative Input Sources
In our examples so far, the XML document to be parsed has
been described in the form of a URL. This is usually adequate, given the range
of resources that a URL can describe. It allows the document to be held in a
file locally or remotely, or for it to be generated dynamically by a web
server.
Taking Input from a Byte Stream or Character Stream
Sometimes you want to supply the parser with a stream of XML
that is generated by another program rather than being held in a file. For
example, the XML might be stored in a relational database, or it might be
output by an EDI message translation program, or it might be an XML section
embedded within a file or message in some non-XML format. You don't want to
have to write the XML to the file store (or to install a web server) just so
that the parser can read your document.
To handle this situation, SAX allows you to supply the XML
input in the form of a character stream or a byte stream. It provides the InputSource class to generalize all these
possible sources of input.
For example, let's suppose your program wants to parse XML
held in a character string that has just been read from a relational database
using JDBC. The following code will do the job:
public void parseString(String s) throws SAXException, IOException
{
    StringReader reader = new StringReader(s);
    InputSource source = new InputSource(reader);
    parser.parse(source);
}
InputSource is a
class (not an interface) provided with the SAX distribution. The application
can set various details of the input source, some of which are mutually
exclusive. These include supplying a URL, a Reader (as here), an InputStream,
an encoding name, and a "public identifier". (Public identifiers,
however, are as enigmatic in SAX as in the XML specification itself: there are
no clues as to what the parser should actually do with the public identifier.
But as we will see later, the application can use it.)
Why does SAX need to provide two options for in-memory data,
an InputStream and a Reader?
An InputStream is a stream of bytes. The XML standard
provides many rules about how a stream of bytes can be translated into a stream
of Unicode characters, including for example the encoding attribute (which is part of the xml declaration at the start of the
document content). To translate bytes to characters, it's not good enough to
leave the work to the standard Java libraries, because they don't understand
these rules, and they certainly can't be expected to read the encoding attribute. If the
XML comes from a binary source, complete with encoding attribute, we want to
hand the stream of bytes to the parser for it to interpret directly.
A Reader, by contrast, is a stream of Unicode characters. If
we already have the data in the form of characters, we don't want to have to
encode it first as a stream of bytes (say in the UTF-8 encoding) just so that
the parser can decode it again. Better to hand the character stream to the
parser directly. (Actually, there was some debate about the desirability of
providing this option in SAX. While it's obviously useful, it's not entirely in
the spirit of the XML specification, which defines an XML document strictly as
a sequence of bytes. It's perhaps best to think of the input character stream not
as an XML document, but as a preprocessed XML document in which the first stage
of processing, namely character decoding, has already been done.)
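To make the byte-stream case concrete, here is a minimal sketch. It assumes a SAX 1.0 Parser obtained through JAXP's SAXParserFactory (a later addition to Java that this chapter does not otherwise rely on). The names ByteStreamDemo, parseBytes and sampleLatin1 are ours, chosen purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

class ByteStreamDemo {

    // Parses an XML document supplied as raw bytes and returns its text content.
    static String parseBytes(byte[] xml) throws Exception {
        // Assumption: the SAX 1.0 Parser comes from JAXP rather than a named product.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();

        final StringBuilder text = new StringBuilder();
        parser.setDocumentHandler(new HandlerBase() {
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        // We give the parser bytes only: no Reader and no encoding name.
        // It has to work out the encoding from the XML declaration itself.
        parser.parse(new InputSource(new ByteArrayInputStream(xml)));
        return text.toString();
    }

    // A small document encoded in ISO-8859-1, with a matching encoding declaration.
    static byte[] sampleLatin1() {
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<word>caf\u00e9</word>";
        return doc.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

Calling parseBytes(sampleLatin1()) should give back the word with its accented character intact, because the parser, not the Java library, decided how to turn the bytes into characters.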
Whether we use a byte stream or a character stream, there is
one snag you need to be aware of: the parser has no way of resolving a relative
URL that appears in the document source. Suppose the document source contains
the line
<!DOCTYPE books SYSTEM "books.dtd">
Where is books.dtd to be
found? The XML specification says (in effect) that it should be found in the
same directory as the source document, but of course we don't have a directory
for the source document because it was in memory when parsing started.
SAX gets round this by allowing a system identifier (in
other words, a URL) to be supplied as
well as a byte stream or character stream. This URL is not used to read the
source document, only as a base for resolving any relative URLs found in the
source document.
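The following sketch shows one way this might look in practice. It supplies a character stream together with a system identifier, and uses an EntityResolver only to observe which URL the parser asks for when it meets the relative reference to the DTD. The parser is again assumed to come from JAXP, and BaseUrlDemo, resolvedDtdUrl and the file: URL in the usage note are hypothetical names of our own.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

class BaseUrlDemo {

    // Parses an in-memory document and returns the system identifier that the
    // parser requested for its external DTD.
    static String resolvedDtdUrl(String xml, String baseUrl) throws Exception {
        // Assumption: the SAX 1.0 Parser comes from JAXP.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();

        final String[] requested = new String[1];
        parser.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                requested[0] = systemId;
                // Supply a tiny DTD from memory so that nothing is read from disk.
                return new InputSource(new StringReader("<!ELEMENT books ANY>"));
            }
        });
        parser.setDocumentHandler(new HandlerBase());

        InputSource source = new InputSource(new StringReader(xml));
        // Not used to read the document, only as the base for relative URLs.
        source.setSystemId(baseUrl);
        parser.parse(source);
        return requested[0];
    }
}
```

With a document that starts <!DOCTYPE books SYSTEM "books.dtd"> and a base of file:///docs/catalog/books.xml, we would expect the parser to ask for books.dtd in the same /docs/catalog/ directory.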
Specifying a Filename rather than a URL
Another common source of input is a file name: for example,
command-line interfaces generally use file names as arguments rather than URLs,
and you may well want to use this form of argument in the interface to your
application.
The SAX InputSource class
does not directly allow you to specify a filename for the input; you have to
convert the filename into a URL so that the parser can process it. If you are
using Java 2, this is simplicity itself: the Java File
class has a suitable method. So to parse the file c:\sample.xml,
you can write:
parser.parse((new File("c:\\sample.xml")).toURL().toString());
(Note that the parse()
method expects the URL as a string, not as a Java URL object, hence the need to
call toString() to achieve the
conversion.)
With Java 1.1, the translation of a filename to a URL is a
little more difficult than you might expect if you want the code to work
equally on Windows and on UNIX, because of the wide variety of filename
formats. Here's a method that handles most cases, though the error handling
leaves something to be desired.
public String CreateURL(File file)
{
    String path = file.getAbsolutePath();
    try
    {
        // If the path already parses as a URL, use it unchanged
        return (new URL(path)).toString();
    }
    catch (MalformedURLException ex)
    {
        // Otherwise build a file: URL with forward slashes and a leading slash
        String fs = System.getProperty("file.separator");
        char sep = fs.charAt(0);
        if (sep != '/') path = path.replace(sep, '/');
        if (path.charAt(0) != '/') path = '/' + path;
        return "file://" + path;
    }
}
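For readers on later Java releases, a shorter route exists. The sketch below assumes Java 1.4 or later, where the File class gained a toURI() method; this postdates the releases discussed above, and the names FileUrlDemo and toUrlString are ours.

```java
import java.io.File;

class FileUrlDemo {

    // Converts a file name to a URL string suitable for passing to parse().
    static String toUrlString(String fileName) {
        // toURI() makes the path absolute and escapes characters, such as
        // spaces, that are not allowed in a URL.
        return new File(fileName).toURI().toString();
    }
}
```

The result always starts with file: and uses forward slashes, whatever the platform's own filename conventions.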
Input from Non-XML Sources
One of the more surprising ways in which SAX has been used
is to feed applications with data that is not stored in XML at all. So long as
the data is in a hierarchic format that can be mapped reasonably well to the
XML data model, you can write a driver that behaves in every way like an XML
parser. Your driver sends events such as startElement()
and endElement() to the application's DocumentHandler just as if the data
originated in an XML document, when in reality there is no XML document there
to be parsed.
Why would you want to do this? It allows you to take advantage
of applications that were written to accept XML data, without going through the
clumsy process of writing your data in XML format and then parsing it again.
For example, if you have an application designed to process incoming XML-EDI
messages for electronic commerce transactions, you might want also to write a
translator that feeds this application with messages arriving in older
proprietary formats. One way to do this is for your translator to create an XML
file and supply this file to the application. But a neat shortcut, if the
target application is written to use SAX, is for your translator to call the
application directly, pretending to be an XML parser.
The section below on SAX Filters discusses some of the
possibilities using this approach.
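As a rough illustration of the idea, here is a sketch of such a driver. It is entirely hypothetical: BookListDriver reads lines of the form "title, price" (a format we have invented for the example) and reports them to the application's DocumentHandler as if they had been parsed from a <books> document. EventRecorder stands in for an application written to accept XML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical driver: implements the SAX Parser interface over non-XML input.
class BookListDriver implements Parser {
    private DocumentHandler handler = new HandlerBase();

    public void setDocumentHandler(DocumentHandler handler) { this.handler = handler; }
    // This input has no entities, DTD or locale-specific messages,
    // so these three setters do nothing.
    public void setEntityResolver(EntityResolver resolver) { }
    public void setDTDHandler(DTDHandler handler) { }
    public void setLocale(Locale locale) { }
    // This sketch reports problems only by throwing, so the handler is ignored.
    public void setErrorHandler(ErrorHandler handler) { }

    public void parse(String systemId) throws SAXException, IOException {
        throw new SAXException("this sketch reads only from a character stream");
    }

    public void parse(InputSource source) throws SAXException, IOException {
        Reader stream = source.getCharacterStream();
        if (stream == null) {
            throw new SAXException("this sketch reads only from a character stream");
        }
        BufferedReader in = new BufferedReader(stream);
        handler.startDocument();
        handler.startElement("books", new AttributeListImpl());
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(",", 2);
            if (fields.length < 2) continue;          // skip blank or malformed lines
            AttributeListImpl atts = new AttributeListImpl();
            atts.addAttribute("price", "CDATA", fields[1].trim());
            handler.startElement("book", atts);
            char[] title = fields[0].trim().toCharArray();
            handler.characters(title, 0, title.length);
            handler.endElement("book");
        }
        handler.endElement("books");
        handler.endDocument();
    }
}

// Stands in for an application written for XML input; it records what it is told.
class EventRecorder extends HandlerBase {
    final List<String> events = new ArrayList<String>();

    public void startElement(String name, AttributeList atts) {
        String price = atts.getLength() > 0 ? " price=" + atts.getValue("price") : "";
        events.add("start " + name + price);
    }
    public void characters(char[] ch, int start, int length) {
        events.add("text " + new String(ch, start, length));
    }
    public void endElement(String name) {
        events.add("end " + name);
    }
}
```

The application registers itself with setDocumentHandler() and calls parse() exactly as it would with a real parser; nothing in EventRecorder knows that no XML was involved.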
Handling External Entities
We often think of XML entities as the markers like &aumlaut; appearing in the text of a
document. That's not quite accurate: &aumlaut;
isn't strictly an entity, but an entity reference. The entity is the
thing that &aumlaut; refers to, that is the
definition in the DTD that associates the name "aumlaut" with its
expanded text "ä".
There are many different kinds of entity in XML and we need
to be very careful which kinds we are talking about. As we saw in Chapter 3,
they include:
| Character references | Characters specified in terms of a numeric code (decimal or hexadecimal), for example &#10; or &#x0A;. (These are not technically entities at all but we include them here for completeness.) |
| Predefined entities | The special entity references defined in the XML standard, such as &lt; and &amp;. These are the only entity references you can use that do not need a matching definition (either internal or external) in the DTD. |
| Internal entities | Entities whose expanded text is defined in the DTD (and not as a reference to some external storage object). |
| External parsed entities | Entities whose expanded text is well-formed XML defined in a separate file referenced from the main XML document by a system identifier or URL. |
| Unparsed entities | Entities containing non-XML data (for example, binary encoded images): these are always external. The actual format may be identified by a notation. |
| Parameter entities | Entities containing parts of a DTD, rather than parts of a document body. |
| Document entity | The main source XML document is itself an entity. |
| External DTD | If the document references an external DTD, the DTD is also an entity. |
The facilities in SAX for handling
entities are concerned with resolving references to external entities, that is,
to data held in separate "files" – more strictly, in containers
identified by a system or public identifier. Internal entities, character
references, and predefined entities are dealt with automatically by the parser
and the application gets no chance to intervene in the way they are expanded.
External entities in XML are always identified by a system
identifier (which is a URI, which is for most practical purposes the same thing
as a URL) and, optionally, by a public identifier. Public identifiers are a
carry-forward from SGML: the XML standard (and SAX for that matter) doesn't
really say what public identifiers are or how they should be used, though there
are conventions based on established SGML practice.
There are various situations where the standard rules for
resolving an external entity reference by interpreting its system identifier or
URL are not really adequate. These include:
• When the entities are held in a
database (or any other place where they are not directly addressable by URL,
for example a phrase library in a word processing system).
• When the same entity reference is to
be interpreted differently depending on context. For example, the entity
reference &currentUser; might expand to the
name of the currently logged-in user.
• Where there is a versioning system in
use, with multiple versions of the same entity, and rules for determining which
version to use in given circumstances.
• Where there are many copies of a list
of standard entities and the system wants to locate the nearest copy, for
performance reasons.
• Where entities are referenced by
public identifier rather than URL. Public identifiers have become popular in
the SGML world and many publishing shops want to carry on using them with XML
too. Traditionally in SGML, public identifiers are mapped to actual files using
a lookup table known as a catalog. There is no such mechanism defined in XML,
but SAX allows the application to use such a mechanism if it wishes.
Where external entities cannot be found simply by URL, a SAX
application should provide an EntityResolver: that is, a class that implements
the org.xml.sax.EntityResolver interface.
The application can register an EntityResolver with the parser by calling the
parser's setEntityResolver() method.
An EntityResolver needs to implement only one method: resolveEntity(). This is called by the
parser with two parameters: a public identifier and a system identifier (or URL), in
that order. The public identifier will be null if no public identifier was
specified in the entity declaration. The task of the resolveEntity()
method is to return an InputSource
object, which the parser will use to read the content of the external entity.
There is a simple example of an EntityResolver in the SAX
specification, reproduced in Appendix C.
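To give a flavor of the catalog idea mentioned earlier, here is a minimal sketch of our own (it is not the example from the SAX specification). CatalogResolver is a hypothetical class that maps public identifiers to entity text held in memory; a real catalog would more likely map them to files or database records. The parser is assumed to come from JAXP.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical resolver: looks up external entities by public identifier.
class CatalogResolver implements EntityResolver {
    private final Map<String, String> catalog = new HashMap<String, String>();

    void register(String publicId, String entityText) {
        catalog.put(publicId, entityText);
    }

    public InputSource resolveEntity(String publicId, String systemId) {
        String text = (publicId == null) ? null : catalog.get(publicId);
        if (text == null) {
            return null;   // not in the catalog: let the parser use the system identifier
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}

class CatalogDemo {
    // Parses the document with the resolver installed and returns its text content.
    static String expand(String xml, CatalogResolver resolver) throws Exception {
        // Assumption: the SAX 1.0 Parser comes from JAXP.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setEntityResolver(resolver);

        final StringBuilder text = new StringBuilder();
        parser.setDocumentHandler(new HandlerBase() {
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

Returning null from resolveEntity() tells the parser to fall back on its normal behavior of opening the system identifier, so a resolver only needs to know about the entities it wants to handle specially.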
Unparsed Entities and Notations
In general SAX does not provide any information to the
application about the contents of the DTD. During the definition phase of SAX,
it was decided that this fell outside the needs of most applications, and it
was therefore shelved. (As we will see, SAX 2.0 extends the facilities
available in this area.)
However, a total ban on access to DTD contents would have
made it impossible for a SAX application to deal with a document containing
references to unparsed entities and notations. As it happens, these are features
of XML that have been very little used, but no-one could predict that at the
time, and they still have their enthusiasts. Unparsed entities allow an XML
document to contain references to non-XML objects such as binary images or
sound, and notations allow the format of such objects to be registered and
accurately identified. When an unparsed entity is encountered, the parser (by
definition) won't touch it with a barge-pole, so the job of interpreting it is
left to the application. But the application can only deal with it if it can
identify the external entity and notation, and for this it needs access to the
relevant declarations from the DTD.
So the SAX interface DTDHandler,
whose name suggests that it might provide access to all kinds of goodies in the
DTD, actually provides only this minimal and very specialized information
concerning unparsed entities and notations. If you need this information, you
use the DTDHandler just like the other
event-handling interfaces: you write a class that implements org.xml.sax.DTDHandler, and register it with
the parser using the setDTDHandler()
method. The parser will then tell you about the system identifiers and public
identifiers used in unparsed entity and notation declarations in the DTD, and
you can use this information later on when you encounter references to these
objects (in the form of attributes of type ENTITY,
ENTITIES, or NOTATION)
in the body of the document.
But don't be disappointed that DTDHandler
offers less than the name appears to promise!
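Here is a small sketch of how the two halves fit together: the declarations are noted while the DTD is read, and looked up again when an attribute of type ENTITY turns up in the document body. UnparsedEntityTracker and its members are hypothetical names of ours, and the parser is assumed to come from JAXP.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// HandlerBase supplies do-nothing versions of both the DTDHandler and the
// DocumentHandler methods, so we override only the ones we need.
class UnparsedEntityTracker extends HandlerBase {
    // Entity name -> notation name, filled in while the DTD is being read.
    final Map<String, String> notationOf = new HashMap<String, String>();
    // One entry per ENTITY-typed attribute met in the document body.
    final StringBuilder report = new StringBuilder();

    // DTDHandler callback: an unparsed entity has been declared.
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        notationOf.put(name, notationName);
    }

    // DocumentHandler callback: look for references to unparsed entities.
    public void startElement(String element, AttributeList atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            if ("ENTITY".equals(atts.getType(i))) {
                String entity = atts.getValue(i);
                report.append(element).append('/').append(atts.getName(i))
                      .append(" -> ").append(entity)
                      .append(" (").append(notationOf.get(entity)).append(")");
            }
        }
    }

    static String track(String xml) throws Exception {
        // Assumption: the SAX 1.0 Parser comes from JAXP.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        UnparsedEntityTracker tracker = new UnparsedEntityTracker();
        parser.setDTDHandler(tracker);
        parser.setDocumentHandler(tracker);
        parser.parse(new InputSource(new StringReader(xml)));
        return tracker.report.toString();
    }
}
```

What the application then does with the entity (display the image, play the sound) is entirely its own business; SAX's part ends with handing over the identifiers.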
Choosing a Parser
Under this heading we can usefully consider two separate
questions:
• As a designer, how do you decide which product to use?
• As a programmer, how do you make your application configurable so
that the parser can be selected at run time?
The first question is really outside the scope of this book.
We have listed some of the SAX parsers available, and to be honest there is
little to choose between them. They are all effectively free, though the small
print of the licensing conditions varies from one to another: try them all and
take your pick.
The parsers broadly fall into two categories, those produced
by individuals and those produced by corporations. The products in both
categories are equally reliable. Those produced by corporations may be better
documented and supported, and they are also likely to contain a lot more
ancillary features (like support for Mandarin Chinese character encoding, or a
COBOL/CICS interface module). Fine if you happen to need that feature, a waste
of disk space and download time if you don't.
If you want a parser that does SAX parsing and nothing else,
that is fast, reliable and highly conformant to the standard, and if you don't
want technical support, there are few products that can beat James Clark's xp
parser available from http://www.jclark.com/xp.
Ælfred (see http://www.microstar.com/aelfred.html)
is smaller, which makes it a good choice for embedding in your own application,
especially in applets where download time is important. The Sun and IBM parsers
probably produce more helpful diagnostics for incorrect XML files, so they can
be useful in an XML authoring environment. For the other parsers, the main
consideration is the environment they run in: the Oracle parser, for example,
is an obvious choice in an application that makes heavy use of Oracle products.
In practice it is a good idea to keep your options open: you
don't know what parsers will come along in the future, and you don't know
whether potential purchasers of your applications might have policies such as
"No unsupported software" or "No software that doesn't have
French error messages". This means
you want to write your application in a way that avoids the crucial statement
Parser p = new com.jclark.xml.sax.Driver();
which locks you and your customers into one particular
product.
If you were running in a distributed object environment such
as CORBA (Common Object Request Broker Architecture – see http://www.omg.org), the correct
architectural approach to this problem would be for your application to
delegate the task of finding a parser to the Trader, which could use all sorts
of rules to find one that met your run-time needs. The designers of SAX
understandably wanted to avoid being dependent on such a run-time environment.
Instead they left you with a number of choices:
• You can use the simple helper class ParserFactory that comes with the SAX
distribution. Your application calls the static method ParserFactory.makeParser().
This reads the system property org.xml.sax.parser and interprets it as a class
name. You can set a system property using the -D option on the Java command
line, and hence, by writing a command script, from an environment variable.
• You can implement your own mechanism for instantiating a Parser
class whose name is determined at run-time. You might hold the name in a
configuration file or in the Windows registry. Provided you can read the name
as a String, you can use a Java sequence such as the following to create a
Parser instance. In practice, you will need to add some error handling to catch
the various exceptions that can be thrown.
String parserName = [*** read name of parser ***];
Parser p = (Parser)(Class.forName(parserName).newInstance());
• You could also build a list of known parsers into your application,
and try loading them in turn until you find one that you can load successfully.
This allows your users to install any one of these parsers on their classpath,
but of course it doesn't allow them to substitute a parser that you didn't know
about.
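The last two options can be combined, with the missing error handling filled in, along the following lines. This is only a sketch: ParserLoader and firstAvailable are hypothetical names of ours, and the list of class names would come from your configuration or from the parsers you know about.

```java
import org.xml.sax.Parser;

class ParserLoader {

    // Tries each class name in turn and returns the first one that can be
    // loaded, instantiated and used as a SAX Parser; returns null if none can.
    static Parser firstAvailable(String... classNames) {
        for (String name : classNames) {
            try {
                Object candidate =
                    Class.forName(name).getDeclaredConstructor().newInstance();
                if (candidate instanceof Parser) {
                    return (Parser) candidate;
                }
                // Loaded, but not a SAX parser: fall through and try the next name.
            } catch (ReflectiveOperationException | LinkageError e) {
                // Class not on the classpath, or it could not be instantiated:
                // try the next name in the list.
            }
        }
        return null;
    }
}
```

A call such as ParserLoader.firstAvailable("com.jclark.xml.sax.Driver", "org.xml.sax.helpers.XMLReaderAdapter") would use xp if it is installed and otherwise fall back on XMLReaderAdapter, a helper class from later SAX releases that we use here only as a stand-in because it ships with recent Java runtimes.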
An example of the second technique can be found in the ParserManager class from Michael Kay's SAXON
package (see http://users.iclway.co.uk/mhkay/saxon/).
This class instantiates a parser from information in a configuration file
called ParserManager.properties (provided in
the SAXON package). To run the application with a different parser, all that is
needed is a quick edit to the configuration file (instructions for this are
written in the file). ParserManager is a
free-standing class which can be used independently of the rest of the SAXON
package, and is freely distributable. Once you have installed ParserManager and its properties file on the
classpath, you can create a SAX Parser simply by writing:
Parser p = ParserManager.makeParser();
We will do this in our subsequent examples.