Parsing XML
The main reason for creating all of these
rules about writing well-formed XML documents is so that we can create a
computer program to read in the data, and easily tell markup from information.
According to
the XML specification (http://www.w3.org/TR/1998/REC-xml-19980210#sec-intro):
"A software module called an XML processor
is used to read XML documents and provide access to their content and
structure. It is assumed that an XML processor is doing its work on behalf of
another module, called the application."
An XML
processor is more commonly called a parser, since it
simply parses XML and provides the application with any information it needs.
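To make the split between processor and application concrete, here is a minimal sketch in Python. It assumes Python's standard xml.parsers.expat module, which is a binding to the Expat parser described below; the sample document and the collect_events function are hypothetical names invented for this illustration. The parser plays the role of the XML processor, and the three small handler functions play the role of the application.

```python
import xml.parsers.expat

# Hypothetical sample document, used only for illustration.
DOCUMENT = "<order id='42'><item>Widget</item><item>Gadget</item></order>"


def collect_events(xml_text):
    """Act as the application: record everything the processor reports."""
    events = []

    def on_start(name, attributes):
        # Called by the processor for each start tag (or empty-element tag).
        events.append(("start", name, attributes))

    def on_end(name):
        # Called by the processor for each end tag.
        events.append(("end", name))

    def on_text(data):
        # Called by the processor for character data between tags.
        events.append(("text", data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent text as one piece
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return events


if __name__ == "__main__":
    for event in collect_events(DOCUMENT):
        print(event)
```

Notice that the application code never looks at angle brackets or quotation marks. The parser has already separated markup from information, and simply hands over element names, attributes, and text.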
There are quite a number of XML parsers available, many of which are free. Some
of the better known ones are listed below.
Microsoft Internet Explorer Parser
Microsoft's
first XML parser shipped with Internet
Explorer 4 and implemented an early draft of the XML specification. With the
release of IE5, the XML implementation was upgraded to reflect the XML 1.0
specification. The latest version of the parser (March 2000 Technology
Preview Release) is available for download from http://msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp. In this book we'll be mainly using the IE5 version.
James Clark's Expat
Expat is
an XML 1.0 parser toolkit written in C. More information
can be found at http://www.jclark.com/xml/expat.html and Expat can be downloaded from ftp://ftp.jclark.com/pub/xml/expat.zip. It is free for both private and commercial use.
Vivid Creations ActiveDOM
Vivid
Creations (http://www.vivid-creations.com) offers several XML tools, including ActiveDOM. ActiveDOM contains a
parser similar to the Microsoft parser and, although it is a commercial
product, a demonstration version may be downloaded from the Vivid Creations web
site.
DataChannel XJ Parser
DataChannel,
a business solutions software company, worked
with Microsoft to produce an early XML parser written in Java. Their website (http://xdev.datachannel.com/directory/xml_parser.html) provides a link to their most recent version. However, they
are no longer developing the parser, and have opted instead to use the
xml4j parser from IBM.
IBM xml4j
IBM's
AlphaWorks site (http://www.alphaworks.ibm.com) offers a number of XML tools and applications, including the xml4j
parser. This is another parser written in Java, available for free, though
there are some licensing restrictions regarding its use.
Apache Xerces
The
Apache Software Foundation's Xerces sub-project of
the Apache XML Project (http://xml.apache.org/) has resulted in XML parsers in Java and C++, plus a Perl wrapper
for the C++ parser. These tools are in beta and are free; distribution of the
code is governed by the Apache Software License.
Errors in XML
As well as specifying how a parser
should get the information out of an XML document, the XML specification also
states how a parser should deal with errors in XML. The specification defines
two types of errors: errors and fatal
errors.
An error is simply a violation of the
rules in the specification, where the results are undefined; the XML processor
is allowed to recover from the error and continue processing.
Fatal errors are more serious:
according to the specification a parser is not
allowed to continue as normal when it encounters a fatal error. (It may,
however, keep processing the XML document to search for further errors.) Any
error which causes an XML document to cease being well-formed is a fatal error.
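We can watch a parser refuse a document that is not well-formed. The sketch below assumes Python's standard xml.etree.ElementTree module; check_well_formed is a hypothetical helper written for this illustration, not part of any library. When parsing fails, the module raises ParseError, whose position attribute holds the line and column at which the problem was detected.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, (line, column))."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        # The parser stops here; it does not hand back a partial document.
        return False, error.position
    return True, None


if __name__ == "__main__":
    # The first sample is well-formed; the second never closes <b>.
    for sample in ("<a><b/></a>", "<a><b></a>"):
        ok, position = check_well_formed(sample)
        print(sample, "well-formed" if ok else "fatal error at", position)
```

The second sample produces no element tree at all. The application receives an error and a location, which is the behavior the specification requires for a fatal error.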
The reason for this drastic handling of
non-well-formed XML is simple: it would be extremely hard for parser writers to
try and handle "well-formedness" errors, and it is extremely simple
to make XML well-formed. (HTML does not force documents to be as strict as XML
does, but this is one of the reasons why web browsers are so incompatible; they
must deal with all of the errors they
may encounter, and try to figure out what the person who wrote the document was
really trying to code.)
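The contrast can be sketched with two modules from Python's standard library: html.parser, which is deliberately lenient, and xml.etree.ElementTree, which enforces well-formedness. The sample markup, the TagCollector class, and the two helper functions are hypothetical names made up for this illustration. Note that html.parser only tokenizes the input; it does not attempt the kind of guessing a web browser does, so treat this as a rough picture of leniency rather than a model of browser behavior.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sloppy markup: neither <p> nor <b> is ever closed.
SLOPPY = "<p>Some <b>bold text<p>Next paragraph"


class TagCollector(HTMLParser):
    """Record the name of every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Feed markup to the HTML parser and return the start tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_accepts(markup):
    """Return True only if the XML parser accepts markup as well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("HTML parser saw start tags:", html_tags(SLOPPY))
    print("XML parser accepts it:", xml_accepts(SLOPPY))
```

The HTML parser works through the sloppy markup without complaint and leaves the application to decide what the author meant, while the XML parser rejects the same text outright.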
But draconian error handling doesn't just
benefit the parser writers; it also benefits us when we're creating XML
documents. If I write an XML document that doesn't properly follow XML's
syntax, I can find out right away and fix my mistake. On the other hand, if the
XML parser tried to recover from these errors, it might misinterpret what I was
trying to do, but I wouldn't know about it because no error would be raised. In
this case, bugs in my software would be much harder to track down, instead of
being caught right at the beginning when I was creating my data.
Summary
This chapter has provided you with the basic syntax for writing
well-formed XML documents.
We've seen:
Elements and empty elements
How to deal with white space in XML
Attributes
How to include comments
XML declarations and encodings
Processing instructions
Entity references, character references and CDATA sections
We've also learned why the strict rules of XML grammar actually
benefit us, in the long run, and how some of the rules for authoring HTML are
different from the rules for authoring well-formed XML.
Unfortunately, or perhaps fortunately, you probably won't spend
much of your time just authoring XML documents. Once you have the data in
XML form, you still have to be able to use that data. In the chapters that
follow we'll learn some of the other technologies surrounding XML, which will
help you to make use of your data, starting with one of the most common:
display.