SAX 2.0
SAX 1.0 has been very widely implemented and has been in
widespread use almost since the day the first draft appeared on 12 January
1998 – a month earlier than the date of the final XML 1.0 recommendation. It
has met user needs well, in spite of a few criticisms, some of which are hinted
at in this chapter.
So it is perhaps unsurprising that the development of a
successor, SAX 2.0, has been comparatively leisurely. Requirements were
discussed on the XML-DEV mailing list during the early months of 1999, and an
alpha version of a revised spec was published by David Megginson (though not
widely advertised) on 1 June 1999. There has been little adverse comment, and
it seems likely that the final specification of SAX 2.0 will be close to its
current form, which can be found at http://www.megginson.com/SAX/SAX2/.
Whether the specification will be widely implemented is
another matter. Time will tell.
The way in which the original SAX interface has been
extended is in itself quite interesting. A standard mechanism has been defined
to allow the application to ask the parser to support particular features or to
set particular properties; the parser in all cases has the option to refuse.
The set of features and properties that can be requested is itself entirely
open-ended. SAX2 defines a core set, but additional features and properties can
be invented by anyone at any time. To make this possible, the features and
properties are identified by a URI, in rather the same way as XML namespaces.
The Configurable Interface
The key new interface in SAX2 is named Configurable. A SAX2 parser must implement
the org.xml.sax.Configurable interface as
well as the org.xml.sax.Parser interface. The Configurable interface contains four
methods:
| getFeature(featureName) | Allows the application to ask the parser whether a particular feature is currently on or off. |
| setFeature(featureName, boolean) | Allows the application to request that the parser should turn a particular feature on or off. |
| getProperty(propertyName) | Allows the application to request the current value of some particular property. |
| setProperty(propertyName, object) | Allows the application to set some particular property to the supplied value. |
In each case, if the parser does not recognize the feature
or property name, it must throw a SAXNotRecognizedException.
In that case the application has no way of telling whether the parser
actually provides the behavior in question. If the parser recognizes the name
of the feature or property, but cannot set it to the requested value, it must
throw a SAXNotSupportedException.
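The contract just described can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the Configurable interface and the two exception classes below are local stand-ins written for this example (they are not the org.xml.sax classes), and ToyNonValidatingParser and tryFeature are hypothetical names. The toy parser recognizes only the validation feature name and is unable to validate.

```java
public class ConfigurableSketch {

    // Local stand-ins for the two exceptions named in the text.
    static class SAXNotRecognizedException extends Exception {
        SAXNotRecognizedException(String name) { super("Not recognized: " + name); }
    }

    static class SAXNotSupportedException extends Exception {
        SAXNotSupportedException(String name) { super("Not supported: " + name); }
    }

    // Local stand-in for the four-method Configurable interface.
    interface Configurable {
        boolean getFeature(String name) throws SAXNotRecognizedException;
        void setFeature(String name, boolean state)
                throws SAXNotRecognizedException, SAXNotSupportedException;
        Object getProperty(String name) throws SAXNotRecognizedException;
        void setProperty(String name, Object value)
                throws SAXNotRecognizedException, SAXNotSupportedException;
    }

    static final String VALIDATION = "http://xml.org/sax/features/validation";

    // Hypothetical parser: knows the validation feature but cannot validate.
    static class ToyNonValidatingParser implements Configurable {
        public boolean getFeature(String name) throws SAXNotRecognizedException {
            if (VALIDATION.equals(name)) {
                return false; // validation is always off in this parser
            }
            throw new SAXNotRecognizedException(name);
        }

        public void setFeature(String name, boolean state)
                throws SAXNotRecognizedException, SAXNotSupportedException {
            if (!VALIDATION.equals(name)) {
                throw new SAXNotRecognizedException(name);
            }
            if (state) {
                // Name is recognized, but the requested value cannot be honored.
                throw new SAXNotSupportedException(name);
            }
            // Turning validation off is a no-op: it is already off.
        }

        public Object getProperty(String name) throws SAXNotRecognizedException {
            throw new SAXNotRecognizedException(name); // no properties known
        }

        public void setProperty(String name, Object value)
                throws SAXNotRecognizedException {
            throw new SAXNotRecognizedException(name); // no properties known
        }
    }

    // Reports which of the three possible outcomes a setFeature request had.
    static String tryFeature(Configurable parser, String name, boolean state) {
        try {
            parser.setFeature(name, state);
            return "accepted";
        } catch (SAXNotRecognizedException e) {
            return "not recognized";
        } catch (SAXNotSupportedException e) {
            return "not supported";
        }
    }

    public static void main(String[] args) {
        Configurable parser = new ToyNonValidatingParser();
        System.out.println(tryFeature(parser, VALIDATION, false));
        System.out.println(tryFeature(parser, VALIDATION, true));
        System.out.println(tryFeature(parser, "http://example.org/features/unknown", true));
    }
}
```

The point of the sketch is that the application sees three distinct outcomes: the request is accepted, the name is recognized but the value is refused, or the name is not recognized at all.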
To make this more concrete, consider one of the new core
features, whose name is http://xml.org/sax/features/validation.
This feature is provided to fix the problem in SAX 1.0 whereby an application
has no way of discovering or controlling whether the parser is a validating
one. With SAX 2.0, if this feature is on, the parser must validate the XML
document; if it is off, it must not do so (in other words, the parse must
succeed so long as the document is well-formed).
An application that explicitly requires a validating parser
may call:
parser.setFeature("http://xml.org/sax/features/validation", true);
This is a core feature, so every SAX2 parser should
recognize its name. A parser that can perform validation will return normally,
while a parser that cannot perform validation will throw a SAXNotSupportedException.
Equally, an application that explicitly requires the parser not to do validation may call:
parser.setFeature("http://xml.org/sax/features/validation", false);
This time, a parser that insists on doing validation must
respond to this request with a SAXNotSupportedException.
On the other hand, an application that simply wants to know
whether the parser is performing validation or not may call:
if (parser.getFeature("http://xml.org/sax/features/validation"))
...
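Put together, the three calls above amount to a small negotiation with the parser. The following is a minimal sketch of that negotiation, assuming the SAX2 API bundled with current JDKs, in which getFeature and setFeature are declared on org.xml.sax.XMLReader rather than on a separate Configurable interface. The feature name is the one given above; the class and method names (ValidationNegotiation, requestValidation) are invented for this example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

public class ValidationNegotiation {

    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /**
     * Asks the reader to validate, then reports whether it is actually
     * validating. Returns false if the request is refused or the feature
     * name is not recognized.
     */
    static boolean requestValidation(XMLReader reader) {
        try {
            reader.setFeature(VALIDATION, true);
            // Confirm the resulting state rather than assuming it.
            return reader.getFeature(VALIDATION);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Obtain whichever parser the JDK is configured to supply.
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println("validating: " + requestValidation(reader));
    }
}
```

An application that cannot work without validation would treat a false result as fatal; one that merely prefers validation can carry on and adjust its own checking accordingly.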
Core Features and Properties
The following core features and properties are defined in
SAX2. A feature is simply shorthand for a property whose value is a boolean.
| Name (prefixed http://xml.org/sax) | Value | Meaning |
|---|---|---|
| /features/validation | boolean | Perform validation |
| /features/external-general-entities | boolean | Expand general (i.e. parsed) external entities |
| /features/external-parameter-entities | boolean | Expand the external DTD subset and external parameter entities |
| /features/namespaces | boolean | Process namespace declarations. Element and attribute names with a prefix will have the prefix replaced by the URI of the namespace |
| /features/normalize-text | boolean | Normalize character data, by ensuring that all consecutive pieces of character data are passed in a single call of the characters() method |
| /features/use-locator | boolean | Supply the application with a Locator object by calling the setDocumentLocator() method |
| /properties/namespace-sep | String | Separator to be used between the URI and the local part of a name when the namespaces feature is enabled |
| /properties/dom-node | org.w3c.dom.Node | Read-only property: if the DOM for the source document exists in memory, this property identifies the DOM node relating to the current event |
| /properties/xml-string | String | Read-only property: a character string giving the XML representation of the current event |
| /handlers/DeclHandler | org.xml.sax.misc.DeclHandler | Set a handler to process element and attribute declarations encountered in the DTD |
| /handlers/LexicalHandler | org.xml.sax.misc.LexicalHandler | Set a handler to process lexical events. These include CDATA sections, entities, and comments |
| /handlers/NamespaceHandler | org.xml.sax.misc.NamespaceHandler | Set a handler to process namespace declarations |
The core properties in SAX2 thus include three new
event-handling interfaces: DeclHandler, LexicalHandler, and NamespaceHandler.
(Remember, however, that "core" simply means every parser must
recognize a request for these features and properties; it still has the right
to refuse the request.)
The declaration handler, DeclHandler,
meets the requirement for access to the structural definitions in the DTD. It
provides access to element declarations in the simplest possible way, as a
string that the application must parse.
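To show what "a string that the application must parse" looks like in practice, here is a hedged sketch. It assumes the SAX2 API bundled with current JDKs, where the corresponding interface is org.xml.sax.ext.DeclHandler and is registered under the property name http://xml.org/sax/properties/declaration-handler; both names differ from the alpha names listed in the table above. DeclHandlerSketch and elementDeclarations are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class DeclHandlerSketch {

    // Property name used by the JDK's SAX2 API (not the alpha name).
    static final String DECL_HANDLER =
            "http://xml.org/sax/properties/declaration-handler";

    /** Collects "name -> content model" for each element declaration. */
    static List<String> elementDeclarations(String xml) throws Exception {
        List<String> decls = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void elementDecl(String name, String model) {
                // The content model arrives as an unparsed string.
                decls.add(name + " -> " + model);
            }
        };
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setProperty(DECL_HANDLER, handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return decls;
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<!DOCTYPE doc ["
              + "<!ELEMENT doc (title, para*)>"
              + "<!ELEMENT title (#PCDATA)>"
              + "<!ELEMENT para (#PCDATA)>"
              + "]>"
              + "<doc><title>t</title></doc>";
        for (String decl : elementDeclarations(xml)) {
            System.out.println(decl);
        }
    }
}
```

Anything beyond listing the declarations, such as working out which children an element permits, is left to the application, which has to interpret the content model string itself.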
The lexical handler, LexicalHandler, meets the requirement for access to
information that was suppressed in SAX 1.0 because it was considered to be of
no interest to applications. This includes the boundaries of internal entities,
the boundaries of CDATA sections, and the existence of comments. Many
application writers asked for these features because they enable the
application to minimize the changes made to a document as it is being copied.
Comments are needed for other reasons as well: for example, the XSLT
recommendation allows a stylesheet to say what should happen to comments in the
source document, so an XSLT interpreter written using the SAX interface needs
access to this information.
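A brief sketch of a lexical handler may make this concrete. It assumes the SAX2 API bundled with current JDKs, where the interface is org.xml.sax.ext.LexicalHandler and the property name is http://xml.org/sax/properties/lexical-handler (again, not the alpha names in the table above). LexicalHandlerSketch and lexicalEvents are hypothetical names, and only comments and CDATA boundaries are recorded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class LexicalHandlerSketch {

    // Property name used by the JDK's SAX2 API (not the alpha name).
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    /** Records comments and CDATA section boundaries in document order. */
    static List<String> lexicalEvents(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                events.add("comment:" + new String(ch, start, length));
            }

            @Override
            public void startCDATA() {
                events.add("startCDATA");
            }

            @Override
            public void endCDATA() {
                events.add("endCDATA");
            }
        };
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><!-- keep me --><![CDATA[a < b]]></doc>";
        for (String event : lexicalEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

An application copying a document could use these events to reproduce comments and CDATA sections as they appeared in the source instead of losing them.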
The namespace handler, NamespaceHandler, meets more advanced namespace handling
requirements than the namespaces feature. Whereas the namespaces feature simply
expands element and attribute prefixes using the namespace definitions
currently in force, a namespace handler allows the namespace definitions
themselves to be processed as events in their own right. This is useful in
several circumstances:
• Where the application uses prefixes in contexts other than element
and attribute names (for example, it might use them in attribute values)
• Where the application needs to know the prefix that was used (for
example, for use in error messages, or in attempting to copy parts of the
original document)
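The first of these cases can be sketched as follows. The example assumes the SAX2 API bundled with current JDKs, in which namespace declarations are reported through the startPrefixMapping and endPrefixMapping callbacks of the content handler rather than through a separate NamespaceHandler. The type attribute, the example.org URI, and the names PrefixMappingSketch and resolveTypeAttributes are all invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PrefixMappingSketch {

    /** Resolves prefixed values of a hypothetical "type" attribute. */
    static List<String> resolveTypeAttributes(String xml) throws Exception {
        Map<String, String> inScope = new HashMap<>();
        List<String> resolved = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                inScope.put(prefix, uri);
            }

            @Override
            public void endPrefixMapping(String prefix) {
                // Simplification: a shadowed outer binding is not restored.
                inScope.remove(prefix);
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String value = atts.getValue("", "type");
                if (value == null || value.indexOf(':') < 0) {
                    return;
                }
                String prefix = value.substring(0, value.indexOf(':'));
                String local = value.substring(value.indexOf(':') + 1);
                // Keep both the expanded name and the prefix actually used.
                resolved.add("{" + inScope.get(prefix) + "}" + local
                        + " (prefix " + prefix + ")");
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for prefix mapping events
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return resolved;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc xmlns:x=\"http://example.org/x\">"
                   + "<item type=\"x:widget\"/></doc>";
        System.out.println(resolveTypeAttributes(xml));
    }
}
```

Because the declarations arrive as events, the application can keep its own table of prefixes in scope and apply it wherever prefixes turn up, not only in element and attribute names.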
As remarked earlier, the SAX 2.0 specification cannot yet be
regarded as stable, so even if you find a parser that supports it, use it with
care.
Summary
We've presented some information about the origins of the
SAX interface, which is implemented by a wide variety of parsers.
The thing that characterizes SAX, and that distinguishes it
from the DOM interface, is that it is event-based. We discussed some of the
factors that might cause you to use an event-based interface in preference to
the DOM.
We discussed the structure of a simple SAX application, and
the relationship of the three main classes, the application, the parser, and
the document handler. We showed several examples of how to write SAX
applications using these classes.
We presented some of the important design patterns for SAX
applications, in particular, the filter or pipeline pattern, and the rule-based
pattern.
Finally, we gave a preview of the features that are expected
to appear in SAX 2.0 when it stabilizes.
We should end with a word of caution. All the examples shown
in this chapter could be coded much more easily in XSLT, which we will discuss
in Chapter 7. Of course that doesn't mean there is no need for SAX: Java
applications can do many things that XSL stylesheets can't – for example,
loading data into a relational database; and they will usually be much faster.
But it's worth thinking twice about your problem before you rush into assuming
that SAX is the answer, because in many cases an XSL approach, or a hybrid
approach using XSL for preprocessing, may be preferable.