Professional XML


Advanced SAX features

The features we've covered so far are probably enough for 90% of SAX applications. But it's useful to know something of the rest of the features, for those occasions when they are needed. This section of the chapter gives a survey of these features and their intended purpose.

Alternative Input Sources

In our examples so far, the XML document to be parsed has been described in the form of a URL. This is usually adequate, given the range of resources that a URL can describe. It allows the document to be held in a file locally or remotely, or for it to be generated dynamically by a web server.

Taking Input from a Byte Stream or Character Stream

Sometimes you want to supply the parser with a stream of XML that is generated by another program rather than being held in a file. For example, the XML might be stored in a relational database, or it might be output by an EDI message translation program, or it might be an XML section embedded within a file or message in some non-XML format. You don't want to have to write the XML to the file store (or to install a web server) just so that the parser can read your document.

 

To handle this situation, SAX allows you to supply the XML input in the form of a character stream or a byte stream. It provides the InputSource class to generalize all these possible sources of input.

 

For example, let's suppose your program wants to parse XML held in a character string that has just been read from a relational database using JDBC. The following code will do the job:

 

public void parseString(String s) throws SAXException, IOException
{
    StringReader reader = new StringReader(s);
    InputSource source = new InputSource(reader);
    parser.parse(source);
}

 

InputSource is a class (not an interface) provided with the SAX distribution. The application can set various details of the input source, some of which are mutually exclusive. These include supplying a URL, a Reader (as here), an InputStream, an encoding name, and a "public identifier". (Public identifiers, however, are as enigmatic in SAX as in the XML specification itself: there are no clues as to what the parser should actually do with the public identifier. But as we will see later, the application can use it.)

 

Why does SAX need to provide two options for in-memory data, an InputStream and a Reader?

 

An InputStream is a stream of bytes. The XML standard provides many rules about how a stream of bytes can be translated into a stream of Unicode characters, including for example the encoding attribute (which is part of the XML declaration at the start of the document content). To translate bytes to characters, it's not good enough to leave the work to the standard Java libraries, because they don't understand these rules, and they certainly can't be expected to read the encoding attribute. If the XML comes from a binary source, complete with encoding attribute, we want to hand the stream of bytes to the parser for it to interpret directly.

 

A Reader, by contrast, is a stream of Unicode characters. If we already have the data in the form of characters, we don't want to have to encode it first as a stream of bytes (say in the UTF-8 encoding) just so that the parser can decode it again. Better to hand the character stream to the parser directly. (Actually, there was some debate about the desirability of providing this option in SAX. While it's obviously useful, it's not entirely in the spirit of the XML specification, which defines an XML document strictly as a sequence of bytes. It's perhaps best to think of the input character stream not as an XML document, but as a preprocessed XML document in which the first stage of processing, namely character decoding, has already been done.)
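
As a rough sketch of the difference, the following class parses the same small document once from a byte stream and once from a character stream. The class and method names are our own, and we assume the JAXP parser bundled with the JDK rather than any parser named in this chapter. In the byte-stream case the parser itself reads the encoding attribute and decodes the accented character; in the character-stream case the decoding has already been done.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical demonstration class, not part of the SAX distribution.
class StreamInputDemo {

    // Parses the source and returns all character data found in it.
    static String textOf(InputSource source) throws Exception {
        final StringBuilder text = new StringBuilder();
        HandlerBase handler = new HandlerBase() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // Assumption: the JDK's bundled JAXP parser is used.
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    // Byte stream: the parser decodes the bytes using the declared encoding.
    static String fromBytes() throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>caf\u00e9</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        return textOf(new InputSource(new ByteArrayInputStream(bytes)));
    }

    // Character stream: the data is already Unicode, so no decoding is needed.
    static String fromChars() throws Exception {
        return textOf(new InputSource(new StringReader("<name>caf\u00e9</name>")));
    }
}
```

Both methods should deliver the same four characters to the application, which is the point: the choice between the two affects only who does the character decoding.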

 

Whether we use a byte stream or a character stream, there is one snag you need to be aware of: the parser has no way of resolving a relative URL that appears in the document source. Suppose the document source contains the line

 

<!DOCTYPE books SYSTEM "books.dtd">

 

Where is books.dtd to be found? The XML specification says (in effect) that it should be found in the same directory as the source document, but of course we don't have a directory for the source document because it was in memory when parsing started.

 

SAX gets round this by allowing a system identifier (in other words, a URL) to be supplied as well as a byte stream or character stream. This URL is not used to read the source document, only as a base for resolving any relative URLs found in the source document.
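
To illustrate, here is a minimal sketch in which the document comes from a string but is given a base URL with setSystemId(). The class name, the file URL and the in-memory DTD are all invented for the example, and we again assume the JDK's bundled JAXP parser. An EntityResolver (covered later in this section) is used only to capture the URL that the parser asks for, and to supply a DTD so that nothing is read from disk.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical demonstration class.
class BaseUrlDemo {

    // Parses an in-memory document and returns the system identifier that the
    // parser requests when it reaches the DOCTYPE declaration.
    static String requestedDtd(String baseUrl) throws Exception {
        String xml = "<!DOCTYPE books SYSTEM \"books.dtd\"><books/>";
        InputSource source = new InputSource(new StringReader(xml));
        source.setSystemId(baseUrl); // used as a base only; nothing is read from it

        final String[] requested = new String[1];
        EntityResolver resolver = new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                requested[0] = systemId;
                // Supply a tiny DTD from memory so no file access is needed.
                return new InputSource(new StringReader("<!ELEMENT books EMPTY>"));
            }
        };

        // Assumption: the JDK's bundled JAXP parser is used.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setEntityResolver(resolver);
        parser.setDocumentHandler(new HandlerBase());
        parser.parse(source);
        return requested[0];
    }
}
```

Calling requestedDtd("file:///tmp/docs/books.xml") should report a URL for books.dtd in the same docs directory, even though no such directory need exist.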

Specifying a Filename rather than a URL

Another common source of input is a file name: for example, command-line interfaces generally use file names as arguments rather than URLs, and you may well want to use this form of argument in the interface to your application.

 

The SAX InputSource class does not directly allow you to specify a filename for the input; you have to convert the filename into a URL so that the parser can process it. If you are using Java 2, this is simplicity itself: the Java File class has a suitable method. So to parse the file c:\sample.xml, you can write:

 

parser.parse((new File("c:\\sample.xml")).toURL().toString());

 

(Note that the parse() method expects the URL as a string, not as a Java URL object, hence the need to call toString() to achieve the conversion.)
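
One caveat for readers on later Java releases: File.toURL() has since been deprecated because it does not escape characters, such as spaces, that are illegal in URLs. A small sketch of the usual alternative, going via a URI, is shown here; the class and method names are our own.

```java
import java.io.File;

// Hypothetical helper class.
class FileUrlDemo {

    // Converts a filename to a URL string suitable for passing to parse().
    static String toUrlString(String fileName) {
        // toURI() makes the path absolute and escapes illegal characters.
        return new File(fileName).toURI().toString();
    }
}
```

A relative name is resolved against the current working directory, so the exact result depends on where the program is run.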

 

With Java 1.1, the translation of a filename to a URL is a little more difficult than you might expect if you want the code to work equally on Windows and on UNIX, because of the wide variety of filename formats. Here's a method that handles most cases, though the error handling leaves something to be desired.

 

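import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

public String CreateURL(File file)
{
    String path = file.getAbsolutePath();
    try
    {
        // Succeeds only if the path already happens to be a valid URL
        return (new URL(path)).toString();
    }
    catch (MalformedURLException ex)
    {
        // Replace the platform's separator with '/' and ensure a leading '/'
        String fs = System.getProperty("file.separator");
        char sep = fs.charAt(0);
        if (sep != '/') path = path.replace(sep, '/');
        if (path.charAt(0) != '/') path = '/' + path;
        return "file://" + path;
    }
}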

 

Input from Non-XML Sources

One of the more surprising ways in which SAX has been used is to feed applications with data that is not stored in XML at all. So long as the data is in a hierarchic format that can be mapped reasonably well to the XML data model, you can write a driver that behaves in every way like an XML parser. Your driver sends events such as startElement() and endElement() to the application's DocumentHandler just as if the data originated in an XML document, when in reality there is no XML document there to be parsed.

 

Why would you want to do this? It allows you to take advantage of applications that were written to accept XML data, without going through the clumsy process of writing your data in XML format and then parsing it again. For example, if you have an application designed to process incoming XML-EDI messages for electronic commerce transactions, you might want also to write a translator that feeds this application with messages arriving in older proprietary formats. One way to do this is for your translator to create an XML file and supply this file to the application. But a neat shortcut, if the target application is written to use SAX, is for your translator to call the application directly, pretending to be an XML parser.

 

The section below on SAX Filters discusses some of the possibilities using this approach.
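
As an illustration of the idea, here is a minimal sketch of a driver that presents comma-separated text as if it were XML. The CsvDriver class, and the element names rows, row and field, are invented for this example. It implements the SAX Parser interface, so an application written to use SAX can be handed this driver in place of a real XML parser. It reads only from a character stream and ignores the entity resolver, DTD handler, error handler and locale.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical driver: reports comma-separated lines as rows/row/field elements.
class CsvDriver implements Parser {
    private DocumentHandler handler = new HandlerBase();

    public void setDocumentHandler(DocumentHandler h) { handler = h; }

    // These settings are accepted but not used in this sketch.
    public void setLocale(Locale locale) { }
    public void setEntityResolver(EntityResolver r) { }
    public void setDTDHandler(DTDHandler h) { }
    public void setErrorHandler(ErrorHandler h) { }

    public void parse(String systemId) throws SAXException, IOException {
        throw new SAXException("this sketch reads character streams only");
    }

    public void parse(InputSource source) throws SAXException, IOException {
        BufferedReader in = new BufferedReader(source.getCharacterStream());
        handler.startDocument();
        handler.startElement("rows", new AttributeListImpl());
        String line;
        while ((line = in.readLine()) != null) {
            handler.startElement("row", new AttributeListImpl());
            for (String field : line.split(",")) {
                handler.startElement("field", new AttributeListImpl());
                handler.characters(field.toCharArray(), 0, field.length());
                handler.endElement("field");
            }
            handler.endElement("row");
        }
        handler.endElement("rows");
        handler.endDocument();
    }
}

// A stand-in "application" that writes the events it receives back out as tags.
class CsvDriverDemo {
    static String trace(String csv) throws Exception {
        final StringBuilder out = new StringBuilder();
        Parser parser = new CsvDriver();
        parser.setDocumentHandler(new HandlerBase() {
            @Override public void startElement(String name, AttributeList atts) {
                out.append('<').append(name).append('>');
            }
            @Override public void endElement(String name) {
                out.append("</").append(name).append('>');
            }
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        parser.parse(new InputSource(new StringReader(csv)));
        return out.toString();
    }
}
```

The application in CsvDriverDemo has no way of telling that its events did not come from an XML document, which is exactly what makes the technique attractive.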

Handling External Entities

We often think of XML entities as the markers like &aumlaut; appearing in the text of a document. That's not quite accurate: &aumlaut; isn't strictly an entity, but an entity reference. The entity is the thing that &aumlaut; refers to, that is, the definition in the DTD that associates the name "aumlaut" with its expanded text "ä".

 

There are many different kinds of entity in XML and we need to be very careful which kinds we are talking about. As we saw in Chapter 3, they include:

 

Character references

Characters specified in terms of a numeric code (decimal or hexadecimal), for example &#xa; or &#10;. (These are not technically entities at all, but we include them here for completeness.)

 

Predefined entities

The special entity references defined in the XML standard, such as &lt; and &amp;. These are the only entity references you can use that do not need a matching definition (either internal or external) in the DTD.

 

Internal entities

Entities whose expanded text is defined in the DTD (and not as a reference to some external storage object)

 

External parsed entities

Entities whose expanded text is well-formed XML defined in a separate file referenced from the main XML document by a system identifier or URL.

 

Unparsed entities

Entities containing non-XML data (for example, binary encoded images): these are always external. The actual format may be identified by a notation.

 

Parameter entities

Entities containing parts of a DTD, rather than parts of a document body.

 

Document entity

The main source XML document is itself an entity

 

External DTD

If the document references an external DTD, the DTD is also an entity.

 

 

The facilities in SAX for handling entities are concerned with resolving references to external entities, that is, to data held in separate "files" (or, more strictly, in containers identified by a system or public identifier). Internal entities, character references, and predefined entities are dealt with automatically by the parser, and the application gets no chance to intervene in the way they are expanded.

 

External entities in XML are always identified by a system identifier (which is a URI, which is for most practical purposes the same thing as a URL) and, optionally, by a public identifier. Public identifiers are a carry-forward from SGML: the XML standard (and SAX for that matter) doesn't really say what public identifiers are or how they should be used, though there are conventions based on established SGML practice.

 

There are various situations where the standard rules for resolving an external entity reference by interpreting its system identifier or URL are not really adequate. These include:

 

• When the entities are held in a database (or any other place where they are not directly addressable by URL, for example a phrase library in a word processing system).

• When the same entity reference is to be interpreted differently depending on context. For example, the entity reference &currentUser; might expand to the name of the currently logged-in user.

• Where there is a versioning system in use, with multiple versions of the same entity, and rules for determining which version to use in given circumstances.

• Where there are many copies of a list of standard entities and the system wants to locate the nearest copy, for performance reasons.

• Where entities are referenced by public identifier rather than URL. Public identifiers have become popular in the SGML world and many publishing shops want to carry on using them with XML too. Traditionally in SGML, public identifiers are mapped to actual files using a lookup table known as a catalog. There is no such mechanism defined in XML, but SAX allows the application to use such a mechanism if it wishes.

 

Where external entities cannot be found simply by URL, a SAX application should provide an EntityResolver: that is, a class that implements the org.xml.sax.EntityResolver interface. The application can register an EntityResolver with the parser by calling the parser's setEntityResolver() method.

 

An EntityResolver needs to implement only one method: resolveEntity(). This is called by the parser with two parameters, a system identifier (or URL) and a public identifier. The public identifier will be null if no public identifier was specified in the entity declaration. The task of the resolveEntity() method is to return an InputSource object, which the parser will use to read the content of the external entity.

 

There is a simple example of an EntityResolver in the SAX specification, reproduced in Appendix C.
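
By way of a further sketch, here is a resolver that plays the part of a very small catalog, mapping public identifiers to entity text held in memory. The CatalogResolver class, the public identifier and the example.invalid URL are all made up for the illustration, and we assume the JDK's bundled JAXP parser. Returning null from resolveEntity() tells the parser to fall back on the system identifier in the usual way.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

// Hypothetical catalog: maps public identifiers to entity text held in memory.
class CatalogResolver implements EntityResolver {
    private final Map<String, String> catalog = new HashMap<String, String>();

    void register(String publicId, String entityText) {
        catalog.put(publicId, entityText);
    }

    public InputSource resolveEntity(String publicId, String systemId) {
        String text = (publicId == null) ? null : catalog.get(publicId);
        if (text == null) {
            return null; // not in the catalog: let the parser use the system identifier
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}

class CatalogResolverDemo {
    // Parses a document whose DTD is found through the catalog, not the URL.
    static String parseWithCatalog() throws Exception {
        CatalogResolver resolver = new CatalogResolver();
        resolver.register("-//Example//DTD Books//EN",
                "<!ELEMENT books (#PCDATA)><!ENTITY publisher \"Wrox\">");

        String xml = "<!DOCTYPE books PUBLIC \"-//Example//DTD Books//EN\" "
                   + "\"http://example.invalid/books.dtd\">"
                   + "<books>&publisher;</books>";

        final StringBuilder text = new StringBuilder();
        // Assumption: the JDK's bundled JAXP parser is used.
        Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
        parser.setEntityResolver(resolver);
        parser.setDocumentHandler(new HandlerBase() {
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

Because the resolver supplies the DTD itself, the URL in the DOCTYPE declaration is never fetched; the entity reference in the document body is expanded using the definition from the catalog.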

Unparsed Entities and Notations

In general SAX does not provide any information to the application about the contents of the DTD. During the definition phase of SAX, it was decided that this fell outside the needs of most applications, and it was therefore shelved. (As we will see, SAX 2.0 extends the facilities available in this area.)

 

However, a total ban on access to DTD contents would have made it impossible for a SAX application to deal with a document containing references to unparsed entities and notations. As it happens, these are features of XML that have been very little used, but no-one could predict that at the time, and they still have their enthusiasts. Unparsed entities allow an XML document to contain references to non-XML objects such as binary images or sound, and notations allow the format of such objects to be registered and accurately identified. When an unparsed entity is encountered, the parser (by definition) won't touch it with a barge-pole, so the job of interpreting it is left to the application. But the application can only deal with it if it can identify the external entity and notation, and for this it needs access to the relevant declarations from the DTD.

 

So the SAX interface DTDHandler, whose name suggests that it might provide access to all kinds of goodies in the DTD, actually provides only this minimal and very specialized information concerning unparsed entities and notations. If you need this information, you use the DTDHandler just like the other event-handling interfaces: you write a class that implements org.xml.sax.DTDHandler, and register it with the parser using the setDTDHandler() method. The parser will then tell you about the system identifiers and public identifiers used in unparsed entity and notation declarations in the DTD, and you can use this information later on when you encounter references to these objects (in the form of attributes of type ENTITY, ENTITIES, or NOTATION) in the body of the document.

 

But don't be disappointed that DTDHandler offers less than the name appears to promise!
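
The pattern described above might be sketched as follows. The handler remembers each unparsed entity declaration as the parser reports it, then looks the entity up again when an attribute of type ENTITY turns up in the document body. The class name, the sample document and the example.invalid URL are our own inventions, and we assume the JDK's bundled JAXP parser.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical handler: HandlerBase already implements DTDHandler and
// DocumentHandler, so we override only the methods we need.
class UnparsedEntityTracker extends HandlerBase {
    final Map<String, String> entitySystemIds = new HashMap<String, String>();
    final Map<String, String> entityNotations = new HashMap<String, String>();
    final StringBuilder report = new StringBuilder();

    // Called while the DTD is being read.
    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        entitySystemIds.put(name, systemId);
        entityNotations.put(name, notationName);
    }

    // Called in the document body: resolve any ENTITY attributes.
    @Override
    public void startElement(String name, AttributeList atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            if ("ENTITY".equals(atts.getType(i))) {
                String entity = atts.getValue(i);
                report.append(atts.getName(i)).append(" -> ")
                      .append(entitySystemIds.get(entity))
                      .append(" (").append(entityNotations.get(entity)).append(")");
            }
        }
    }

    static String demo() throws Exception {
        String xml =
              "<!DOCTYPE doc [\n"
            + "  <!ELEMENT doc EMPTY>\n"
            + "  <!ATTLIST doc image ENTITY #IMPLIED>\n"
            + "  <!NOTATION gif PUBLIC \"-//Example//NOTATION GIF//EN\">\n"
            + "  <!ENTITY logo SYSTEM \"http://example.invalid/logo.gif\" NDATA gif>\n"
            + "]>\n"
            + "<doc image=\"logo\"/>";
        UnparsedEntityTracker tracker = new UnparsedEntityTracker();
        // Assumption: the JDK's bundled JAXP parser is used.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), tracker);
        return tracker.report.toString();
    }
}
```

Note that the parser never opens the image itself; the application is simply told where it is and what notation it is in, and must decide what to do with that information.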

Choosing a Parser

Under this heading we can usefully consider two separate questions:

 

• As a designer, how do you decide which product to use?

• As a programmer, how do you make your application configurable so that the parser can be selected at run time?

 

The first question is really outside the scope of this book. We have listed some of the SAX parsers available, and to be honest there is little to choose between them. They are all effectively free, though the small print of the licensing conditions varies from one to another: try them all and take your pick.

 

The parsers broadly fall into two categories, those produced by individuals and those produced by corporations. The products in both categories are equally reliable. Those produced by corporations may be better documented and supported, and they are also likely to contain a lot more ancillary features (like support for Mandarin Chinese character encoding, or a COBOL/CICS interface module). Fine if you happen to need that feature, a waste of disk space and download time if you don't.

 

If you want a parser that does SAX parsing and nothing else, that is fast, reliable and highly conformant to the standard, and if you don't want technical support, there are few products that can beat James Clark's xp parser available from http://www.jclark.com/xp. Ælfred (see http://www.microstar.com/aelfred.html) is smaller, which makes it a good choice for embedding in your own application, especially in applets where download time is important. The Sun and IBM parsers probably produce more helpful diagnostics for incorrect XML files, so they can be useful in an XML authoring environment. For the other parsers, the main consideration is the environment they run in: the Oracle parser, for example, is an obvious choice in an application that makes heavy use of Oracle products.

 

In practice it is a good idea to keep your options open: you don't know what parsers will come along in the future, and you don't know whether potential purchasers of your applications might have policies such as "No unsupported software" or "No software that doesn't have French error messages". This means you want to write your application in a way that avoids the crucial statement

 

Parser p = new com.jclark.xml.sax.Driver();

 

which locks you and your customers into one particular product.

 

If you were running in a distributed object environment such as CORBA (Common Object Request Broker Architecture; see http://www.omg.org), the correct architectural approach to this problem would be for your application to delegate the task of finding a parser to the Trader, which could use all sorts of rules to find one that met your run-time needs. The designers of SAX understandably wanted to avoid being dependent on such a run-time environment. Instead they left you with a number of choices:

 

• You can use the simple helper class ParserFactory that comes with the SAX distribution. Your application calls the static method ParserFactory.makeParser(). This reads the system property org.xml.sax.parser and interprets it as a class name. You can set a system property using the -D option on the Java command line, and hence, by writing a command script, from an environment variable.

• You can implement your own mechanism for instantiating a Parser class whose name is determined at run-time. You might hold the name in a configuration file or in the Windows registry. Provided you can read the name as a String, you can use a Java sequence such as the following to create a Parser instance. In practice, you will need to add some error handling to catch the various exceptions that can be thrown.

String parserName = [*** read name of parser ***];
Parser p = (Parser)(Class.forName(parserName).newInstance());

• You could also build a list of known parsers into your application, and try loading them in turn until you find one that you can load successfully. This allows your users to install any one of these parsers on their classpath, but of course it doesn't allow them to substitute a parser that you didn't know about.
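
The second and third techniques in the list above, with the error handling filled in, might be sketched as follows. The ParserLoader class and its loadFirst() method are our own invention. The method is written generically so that it can be tried out without any particular parser installed; in a real application the expected type would be org.xml.sax.Parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: tries a list of class names in turn.
class ParserLoader {

    // Returns an instance of the first named class that can be loaded,
    // instantiated with a no-argument constructor, and cast to expectedType.
    static <T> T loadFirst(Class<T> expectedType, String... classNames) {
        List<String> failures = new ArrayList<String>();
        for (String name : classNames) {
            try {
                Object candidate =
                        Class.forName(name).getDeclaredConstructor().newInstance();
                return expectedType.cast(candidate);
            } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
                // Class missing, not instantiable, or of the wrong type: try the next.
                failures.add(name + ": " + e);
            }
        }
        throw new IllegalStateException("No usable class found; tried " + failures);
    }
}
```

An application might then write something like Parser p = ParserLoader.loadFirst(Parser.class, "com.jclark.xml.sax.Driver", otherName), where otherName stands for whichever alternative parser classes it knows about. Passing a single name read from a configuration file gives the second technique; passing a built-in list gives the third.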

 

An example of the second technique can be found in the ParserManager class from Michael Kay's SAXON package (see http://users.iclway.co.uk/mhkay/saxon/). This class instantiates a parser from information in a configuration file called ParserManager.properties (provided in the SAXON package). To run the application with a different parser, all that is needed is a quick edit to the configuration file (instructions for this are written in the file). ParserManager is a free-standing class which can be used independently of the rest of the SAXON package, and is freely distributable. Once you have installed ParserManager and its properties file on the classpath, you can create a SAX Parser simply by writing:

 

Parser p = ParserManager.makeParser();

 

We will do this in our subsequent examples.

 

©1999 Wrox Press Limited, US and UK.


© Wattle Software 1998-2019. All rights reserved.