Empty Elements
Sometimes an element has no data. Recall our earlier example, where the middle element
contained no name:
<name nickname='Shiny John'>
<first>John</first>
<!--John lost his middle name in a fire-->
<middle></middle>
<last>Doe</last>
</name>
In this case, you also have the option of writing this element using the special empty element syntax:
<middle/>
This is the one case where a start-tag doesn't need a separate end-tag, because the two are combined into this one tag. In all other cases, a start-tag must have a matching end-tag.
Recall from our discussion of element names that the only place we can have a space within the tag is before the closing ">". This rule is slightly different when it comes to empty elements. The "/" and ">" characters always have to be together, so you can create an empty element like this:
<middle />
but not like these:
<middle/ >
<middle / >
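The spacing rule can be checked against a real parser. The following is only a sketch in Python, a language this chapter does not otherwise use; it assumes the standard library's xml.etree.ElementTree module, and the helper name is_well_formed is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML.

    Hypothetical helper, written only for this illustration.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A space before the "/>" is allowed.
print(is_well_formed("<middle />"))   # True

# A space between the "/" and the ">" is not.
print(is_well_formed("<middle/ >"))   # False
print(is_well_formed("<middle / >"))  # False
```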
Empty elements really don't buy you
anything – except that they take less typing – so you can use them, or not, at
your discretion. Keep in mind, however, that as far as XML is concerned <middle></middle> is exactly the same as <middle/>; for
this reason, XML parsers will sometimes change your XML from one form to the
other. You should never count on your empty elements being in one form or the other, but since the two forms mean exactly the same thing, it doesn't matter. (This is the reason that IE5 felt free to change our earlier <parody></parody> syntax to just <parody/>.)
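To see that a parser really does treat the two forms alike, here is a small sketch in Python, assuming the standard library's xml.etree.ElementTree module (other parsers may report empty content slightly differently). The variable names are chosen just for this example.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<middle></middle>")
short_form = ET.fromstring("<middle/>")

# Neither element has any text or any child elements.
print(long_form.text, len(long_form))    # None 0
print(short_form.text, len(short_form))  # None 0

# Writing the elements back out produces the same markup for both,
# whichever form the original document used.
print(ET.tostring(long_form) == ET.tostring(short_form))  # True
```

Once the document has been parsed, nothing remains to say which form was typed, which is why software is free to write the element back out in either form.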
Interestingly, nobody in the XML community seems
to mind the empty element syntax, even though it doesn't add anything to the
language. This is especially interesting considering the passionate debates
that have taken place on whether attributes are really necessary.
One place where empty elements are very
often used is for elements that have no (or optional) PCDATA, but instead have
all of their information stored in attributes. So if we rewrote our <name> example
without child elements, instead of a start-tag and end-tag we would probably
use an empty element, like this:
<name first="John" middle="Fitzgerald Johansen" last="Doe"/>
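As a rough illustration of how a program would get at information stored this way, the sketch below reads the attributes back in Python, assuming the standard library's xml.etree.ElementTree module. The variable full_name is introduced only for the example.

```python
import xml.etree.ElementTree as ET

name = ET.fromstring(
    '<name first="John" middle="Fitzgerald Johansen" last="Doe"/>'
)

# The element itself is empty: no text and no child elements.
print(name.text, len(name))  # None 0

# All of the information lives in the attributes.
full_name = " ".join(name.get(part) for part in ("first", "middle", "last"))
print(full_name)  # John Fitzgerald Johansen Doe
```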
Another common example is the case where
just the element name is enough; for example, the HTML <BR> tag might
be converted to an XML empty element, such as the XHTML <br/> tag.
(XHTML is the latest "XML-compliant" version of HTML.)
XML Declaration
It is often very handy to be able to identify a document as being a certain type. XML
provides the XML declaration for us
to label documents as being XML, along with giving the parsers a few other
pieces of information. You don't need to have an XML declaration, but you
should include it anyway.
A typical XML declaration looks like this:
<?xml version='1.0' encoding='UTF-16' standalone='yes'?>
<name nickname='Shiny John'>
<first>John</first>
<!--John lost his middle name in a fire-->
<middle/>
<last>Doe</last>
</name>
Some things to note about the XML
declaration:
The XML declaration starts with the characters <?xml, and ends with the characters ?>.
If you include it, you must include
the version, but the encoding and standalone
attributes are optional.
The version, encoding, and standalone attributes must be in that order.
Currently, the version should be 1.0. If you use a number other than 1.0, XML parsers that were written for the version 1.0 specification
should reject the document. (As of yet, there have been no plans announced for
any other version of the XML specification. If there ever is one, the version
number in the XML declaration will be used to signal which version of the
specification your document claims to support.)
The XML declaration must be right at
the beginning of the file. That is, the first character in the file should be
that <; no line breaks
or spaces. Some parsers are more forgiving about this than others.
So an XML declaration can be as full as the one above, or as simple as:
<?xml version='1.0'?>
The next two sections will describe more
fully the encoding and standalone attributes of the XML declaration.
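The rules listed above for the XML declaration can be tried out against a parser. This is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module, which happens to be strict about these rules; the helper name parses and the sample documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def parses(data):
    """Return True if the parser accepts data (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# A full declaration, a minimal one, and none at all are all accepted.
print(parses(b"<?xml version='1.0' encoding='UTF-8' standalone='yes'?><name/>"))  # True
print(parses(b"<?xml version='1.0'?><name/>"))  # True
print(parses(b"<name/>"))                       # True

# The declaration must be the very first thing in the document.
print(parses(b" <?xml version='1.0'?><name/>"))   # False
print(parses(b"\n<?xml version='1.0'?><name/>"))  # False

# version has to come before encoding.
print(parses(b"<?xml encoding='UTF-8' version='1.0'?><name/>"))  # False
```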
Encoding
It should come as no surprise to us that text is
stored in computers using numbers, since numbers are all that computers really
understand.
A character code is a one-to-one mapping between a
set of characters and the corresponding numbers to represent those characters.
A character encoding is the method used to represent the numbers in a character code digitally (in other words, how many bytes should be used for each number, and so on).
One
character code/encoding that you might have come across is the American Standard Code for Information
Interchange (ASCII). For
example, in ASCII the character "a" is represented by the number 97,
and the character "A" is represented by the number 65.
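These numbers are easy to check in most programming languages. As a small illustration in Python, the built-in ord and chr functions convert between a character and its number; the variable names are chosen just for this example.

```python
# From character to number...
code_for_a = ord("a")
code_for_A = ord("A")
print(code_for_a, code_for_A)  # 97 65

# ...and from number back to character.
print(chr(97), chr(65))  # a A

# Encoded as ASCII, each of these characters is a single byte
# holding that number.
print(list("aA".encode("ascii")))  # [97, 65]
```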
Strictly speaking, ASCII is a seven-bit code that defines 128 characters, but there are also many eight-bit schemes built on top of it that are loosely called ASCII. An eight-bit scheme uses one byte (8 bits) for each character, which can only store 256 different values, so that limits it to 256 characters. That's enough to easily handle all of the characters needed for English, which is why ASCII was the predominant character encoding used on personal computers in the English-speaking world for many years. But there are way more than 256 characters in all of the world's languages, so obviously ASCII can only handle a small subset of these. This is the reason that Unicode was invented.
Unicode
Unicode is a character code designed from the ground up with internationalization in mind, aiming to have enough possible characters to cover all of the characters in any human language. There are two major character encodings for Unicode: UTF-16 and UTF-8. UTF-16 takes the easy way, and uses two bytes for most characters (two bytes = 16 bits = 65,536 possible values); characters outside that range take four bytes.
UTF-8 is
more clever: it uses one byte for the characters covered by 7-bit ASCII, and
then uses some tricks so that any other characters may be represented by two or
more bytes. This means that ASCII text can actually be considered a subset of
UTF-8, and processed as such. For text written in English, where most of the characters would fit into the ASCII character encoding, UTF-8 can result in smaller file sizes, but for text in languages such as Chinese or Japanese, where most characters need three bytes in UTF-8, UTF-16 is usually smaller.
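The size difference is easy to measure. The sketch below, in Python, encodes a short English string and a short Japanese string both ways and counts the bytes. The sample strings and the helper name sizes are chosen only for the illustration, and "utf-16-le" is used so that Python does not add a two-byte byte-order mark to the count.

```python
def sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-16) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "John Doe"
japanese = "\u65e5\u672c\u8a9e"  # three characters meaning "Japanese language"

print(sizes(english))   # (8, 16)
print(sizes(japanese))  # (9, 6)

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
print(english.encode("ascii") == english.encode("utf-8"))  # True
```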
Because of
the work done with Unicode to make it international, the XML specification
states that all XML processors must use Unicode internally. Unfortunately, very
few of the documents in the world are encoded in Unicode. Most are encoded in ISO-8859-1,
or windows-1252, or EBCDIC, or one of a large number of other character encodings. (Many of
these encodings, such as ISO-8859-1 and windows-1252, are actually variants of
ASCII. They are not, however, subsets of UTF-8 in the same way that
"pure" ASCII is.)
Specifying Character Encoding for XML
This is
where the encoding attribute
in our XML declaration comes in. It allows us to specify, to the XML
parser, what character encoding our text is in. The XML parser can then read
the document in the proper encoding, and translate it into Unicode internally.
If no encoding is specified, UTF-8 or UTF-16 is assumed (parsers must support
at least UTF-8 and UTF-16). If no encoding is specified, and the document is
not UTF-8 or UTF-16, it results in an error.
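Here is a small sketch of that behavior in Python, assuming the standard library's xml.etree.ElementTree module. The sample element and the variable names are made up for the illustration; the text contains an accented "é", which is a single byte in ISO-8859-1 but is not valid UTF-8 on its own.

```python
import xml.etree.ElementTree as ET

# The same element, stored as ISO-8859-1 bytes.
body = "<first>Jos\u00e9</first>".encode("iso-8859-1")

# With the encoding declared, the parser translates the bytes correctly.
declared = b"<?xml version='1.0' encoding='ISO-8859-1'?>" + body
print(ET.fromstring(declared).text == "Jos\u00e9")  # True

# With no declaration the parser assumes UTF-8, and the stray byte
# makes the document unreadable.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)  # False
```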
Sometimes
an XML processor is allowed to ignore the encoding specified in the XML declaration.
If the document is being sent via a network protocol such as HTTP, there may be
protocol-specific headers which specify a different encoding than the one
specified in the document. In such a case, the HTTP header would take
precedence over the encoding specified in the XML declaration. However, if
there are no external sources for the encoding, and the encoding specified is
different from the actual encoding of the document, it results in an error.
If you're
creating XML documents in Notepad on a machine running a Microsoft Windows
operating system, the character encoding you are using by default is
windows-1252. So the XML declarations in your documents should look like this:
<?xml version="1.0" encoding="windows-1252"?>
However,
not all XML parsers understand the windows-1252 character set. If that's the
case, try substituting ISO-8859-1, which happens to be very similar. Or, if
your document doesn't contain any special characters (like accented characters,
for example), you could use ASCII instead, or leave the encoding attribute out, and let the XML parser treat the document as UTF-8.
If you're
running Windows NT or Windows 2000, Notepad also gives you the option of saving
your text files in Unicode, in which case you can leave out the encoding attribute in your XML declarations.
Standalone
If the standalone attribute is included in the XML declaration, it must be either yes or no:
yes specifies that this document exists entirely on its own, without depending on any other files.
no indicates that the document may depend on other files.
This little attribute actually has its own
name: the Standalone Document Declaration, or SDD. The XML
specification doesn't actually require a parser to do anything with the SDD. It
is considered more of a hint to the parser than anything else.
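To see what a parser actually receives from the SDD, here is a sketch in Python using the standard library's xml.parsers.expat module, whose XmlDeclHandler callback reports the standalone value as 1 for yes, 0 for no, and -1 when it is left out. The helper name read_standalone is made up for the illustration.

```python
import xml.parsers.expat


def read_standalone(data):
    """Return the standalone flag the parser reports, or None if the
    document has no XML declaration (hypothetical helper)."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once if the document starts with an XML declaration.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: seen.append(standalone)
    )
    parser.Parse(data, True)
    return seen[0] if seen else None


print(read_standalone(b"<?xml version='1.0' standalone='yes'?><name/>"))  # 1
print(read_standalone(b"<?xml version='1.0' standalone='no'?><name/>"))   # 0
print(read_standalone(b"<?xml version='1.0'?><name/>"))                   # -1

# Any value other than yes or no is rejected outright.
try:
    read_standalone(b"<?xml version='1.0' standalone='maybe'?><name/>")
    bad_value_accepted = True
except xml.parsers.expat.ExpatError:
    bad_value_accepted = False
print(bad_value_accepted)  # False
```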
This is only a partial description of the
SDD. If it has whetted your appetite for more, you'll have to be patient until
Chapter 11, when all will be made clear.
It's time to take a look at how the XML
declaration works in practice.
Try It Out – Declaring Al's CD to the World
Let's declare our XML document, so that any
parsers will be able to tell right away what it is. And, while we're at it,
let's take care of that second <parody> element, which doesn't have any content.
1. Open up the file cd3.xml,
and make the following changes:
<?xml version='1.0' encoding='windows-1252' standalone='yes'?>
<CD serial='B6B41B' disc-length='36:55'>
<artist>"Weird Al" Yankovic</artist>
<title>Dare to be Stupid</title>
<genre>parody</genre>
<date-released>1990</date-released>
<!--date-released is the date released to CD, not to record-->
<song>
<title>Like A Surgeon</title>
<length>
<minutes>3</minutes>
<seconds>33</seconds>
</length>
<parody>
<title>Like A Virgin</title>
<artist>Madonna</artist>
</parody>
</song>
<song>
<title>Dare to be Stupid</title>
<length>
<minutes>3</minutes>
<seconds>25</seconds>
</length>
<parody/>
</song>
<!--There are more songs on this CD, but I didn't have time to include them!-->
</CD>
2. Save the file as cd5.xml, and view it in IE5:
How It Works
With our new XML declaration, any XML
parser can tell right away that it is indeed dealing with an XML document, and
that document is claiming to conform to version 1.0 of the XML specification.
Furthermore, the document indicates that it
is encoded using the windows-1252 character encoding. Again, many XML parsers
don't understand windows-1252, so you may have to play around with the
encoding. Luckily, the parser used by Internet Explorer 5 does understand
windows-1252, so if you're viewing the examples in IE5 you can leave the XML
declaration as it is here.
In addition, because the Standalone
Document Declaration declares that this is a standalone document, the parser
knows that this one file is all that it needs to fully process the information.
And finally, because "Dare to be
Stupid" is not a parody of any particular song, the <parody>
element has been changed to an empty element. That way we can visually
emphasize the fact that there is no information there. Remember, though, that to
the parser <parody/> is exactly the same as <parody></parody>, which is why this part of our document looks the same as it did in
our earlier screenshots.
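If you would rather check this with a program than with a browser, the following sketch in Python (assuming the standard library's xml.etree.ElementTree module) reads a cut-down copy of the document and looks at each <parody> element. The variable names are made up for the illustration. Because the text is handed over as a Python string that has already been decoded, the declared windows-1252 encoding is not actually exercised here.

```python
import xml.etree.ElementTree as ET

# A shortened copy of cd5.xml, keeping only what the example needs.
cd5 = """<?xml version='1.0' encoding='windows-1252' standalone='yes'?>
<CD serial='B6B41B' disc-length='36:55'>
<song>
<title>Like A Surgeon</title>
<parody>
<title>Like A Virgin</title>
<artist>Madonna</artist>
</parody>
</song>
<song>
<title>Dare to be Stupid</title>
<parody/>
</song>
</CD>"""

root = ET.fromstring(cd5)

# Map each song title to the title of the song it parodies, if any.
parodies = {}
for song in root.findall("song"):
    parody = song.find("parody")
    # An empty <parody/> has no child elements, so findtext returns None.
    parodies[song.findtext("title")] = parody.findtext("title")

print(parodies["Like A Surgeon"])     # Like A Virgin
print(parodies["Dare to be Stupid"])  # None
```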