Characters in a computer are just numbers. An XML document examined directly in a computer's memory is a long string of numbers. A character set encoding is a mapping of those numbers to particular characters. For example, in the iso-8859-1 encoding, the number 225 is mapped to á (a acute). Whenever the computer displays the XML document, it uses an encoding to convert the numbers to character glyphs for display. There are many ways to do such mappings, and many character sets the numbers can map to, so there are many possible encodings.
XML programs use Unicode internally to represent all characters in the computer's memory. However, your DocBook documents do not have to be written in Unicode, and your output does not have to be Unicode. Having a Unicode-aware text editor such as UniPad available can still help resolve many problems with characters and encodings.
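As a minimal sketch of this mapping, the following Python snippet decodes the number 225 as iso-8859-1 and then stores the resulting character in two Unicode encodings. Python is used here only for illustration and is not part of the DocBook toolchain; the variable names are arbitrary.

```python
# The number 225 stored as a single byte.
raw = bytes([225])

# Interpreted as iso-8859-1, that byte maps to the character a acute.
char = raw.decode("iso-8859-1")

# The Unicode number (code point) a Unicode-based program would use internally.
code_point = ord(char)

# The same character is stored as different byte sequences in other encodings.
as_utf8 = char.encode("utf-8")
as_utf16 = char.encode("utf-16-be")

# ascii() keeps the output printable on any console.
print(ascii(char), code_point, as_utf8, as_utf16)
```

The character is the same in every case; only the numbers used to store it change with the encoding.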
If you want more details on encoding in XML, see the in-depth tutorial at http://skew.org/xml/tutorial/.
The creators of the XML specification were well aware that different documents may need different character encodings. So they let you specify the encoding right at the top of each document in the XML declaration:
<?xml version="1.0" encoding="iso-8859-1"?>
In this example, the encoding is specified as iso-8859-1, which is also known as ISO Latin 1. If the encoding is not specified, then UTF-8 is assumed (or UTF-16, if the file begins with a byte order mark). With the encoding established, an XML program that opens the document knows how to convert the numbers it sees to logical characters, and then convert those characters into the Unicode numbers it uses internally. Of course, the content of the document must actually be encoded with this encoding. That is, you cannot just change the label at the top and think you have a new encoding. The document itself would have to be converted to the new mapping of characters. If the encoding declaration of the document does not match the actual encoding, then you may end up with gibberish.
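The effect of a mismatched declaration can be sketched in Python with the standard library's xml.etree.ElementTree parser. This is an illustration only: the `para` element is just sample content, and `transcode` is a hypothetical helper that assumes the declaration uses double quotes exactly as shown.

```python
import xml.etree.ElementTree as ET

DECL = b'<?xml version="1.0" encoding="iso-8859-1"?>'
text = "\u00e1"  # the character a acute

# Content bytes really are iso-8859-1, matching the declaration.
good = DECL + b"<para>" + text.encode("iso-8859-1") + b"</para>"

# Same declaration, but the content bytes are actually UTF-8.
mislabeled = DECL + b"<para>" + text.encode("utf-8") + b"</para>"

good_text = ET.fromstring(good).text       # the original character
bad_text = ET.fromstring(mislabeled).text  # two wrong characters (gibberish)


def transcode(document: bytes, old: str, new: str) -> bytes:
    """Hypothetical helper: convert the bytes AND update the label together."""
    decoded = document.decode(old)
    relabeled = decoded.replace(f'encoding="{old}"', f'encoding="{new}"', 1)
    return relabeled.encode(new)


converted = transcode(good, "iso-8859-1", "utf-8")
converted_text = ET.fromstring(converted).text

print(ascii(good_text), ascii(bad_text), ascii(converted_text))
```

Changing only the label corresponds to the `mislabeled` case; a real conversion has to rewrite the bytes as well, which is what the helper does.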
The following are several common encoding names. Uppercase and lowercase letters are usually both recognized, but do not forget the hyphens.
Table 20.1. Character encodings
|UTF-8|The default Unicode encoding.|
|UTF-16|Another Unicode encoding.|
|US-ASCII|Basic 128 characters.|
|ISO-8859-1|Western European languages.|
|ISO-8859-2|Central European languages.|
|ISO-8859-15|ISO-8859-1 plus the Euro symbol and other small changes.|
|Shift_JIS|Japanese on Windows.|
|EUC-JP|Japanese on Unix.|
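To get a feel for what each encoding in the table covers, the following Python sketch tries to encode a character with each name and reports which ones succeed. It relies on Python's own codecs, which happen to accept these names; an XML parser may support a different set. The function name `encodings_supporting` is made up for this example.

```python
# Encoding names exactly as listed in the table above.
TABLE_ENCODINGS = [
    "UTF-8", "UTF-16", "US-ASCII", "ISO-8859-1",
    "ISO-8859-2", "ISO-8859-15", "Shift_JIS", "EUC-JP",
]


def encodings_supporting(char):
    """Return the table encodings that have a mapping for the character."""
    supported = []
    for name in TABLE_ENCODINGS:
        try:
            char.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding has no number for the character
        supported.append(name)
    return supported


print(encodings_supporting("A"))       # a basic ASCII letter
print(encodings_supporting("\u20ac"))  # the Euro symbol
print(encodings_supporting("\u65e5"))  # a Japanese kanji
```

A character that is missing from the document's encoding cannot be typed directly and has to be entered another way.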
What if you need to enter a character that a document's encoding does not include? For example, iso-8859-1 does not include a character for the trademark symbol ™. The solution is to use numerical character references for any characters not in your encoding. The trademark symbol can be entered as &#x2122; in hexadecimal notation, or the equivalent &#8482; in decimal notation. Of course, having to remember that &#x2122; means trademark is an author's nightmare. Fortunately, the DocBook DTD provides more easily recognized text entities for hundreds of characters you might need. The following is one example from the DTD, this one declared in the iso-num.ent entities file.
<!ENTITY trade "&#x2122;"> <!-- TRADE MARK SIGN -->
So you just need to enter &trade; in your document, and the DTD converts that to the numerical Unicode character that all XML applications recognize. You can examine the complete set of available character entities by looking in the directory that contains the DocBook DTD. The ent subdirectory contains a number of iso-something.ent files, where something identifies the set of entity declarations in that file.
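The equivalence of the three notations can be checked with a short Python sketch using the standard library's xml.etree.ElementTree parser. As an assumption made to keep the example self-contained, the `trade` entity is declared in an internal DTD subset here instead of being loaded from the DocBook DTD.

```python
import xml.etree.ElementTree as ET

# Hexadecimal and decimal numerical character references.
hex_ref = ET.fromstring("<para>&#x2122;</para>").text
dec_ref = ET.fromstring("<para>&#8482;</para>").text

# A named entity must be declared before use. This internal subset
# stands in for the declaration that the DocBook DTD would supply.
doc = '<!DOCTYPE para [<!ENTITY trade "&#x2122;">]><para>&trade;</para>'
named = ET.fromstring(doc).text

# All three resolve to the same Unicode character, U+2122.
print(ascii(hex_ref), ascii(dec_ref), ascii(named))
```

Without a declaration for `trade`, the parser would reject `&trade;` as an undefined entity, which is why the DTD matters.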
It is entirely possible to write a document in any language that is supported by Unicode using only ASCII characters. All characters beyond the basic ASCII character set are written using numerical character references such as &#225;, and so on. If you want to see examples of such XML files, look at the files containing generated text strings, such as fr.xml in the common directory of the DocBook XSL distribution. Those files for all languages are encoded as ASCII XML files using numerical character references. The raw XML is not very readable, however, unless it is displayed in a program that converts such numerical Unicode references to displayable glyphs.
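One way to produce such ASCII-only content is sketched below in Python, using the built-in `xmlcharrefreplace` error handler. The sample text and the `para` wrapper are illustrative. Note that this handler only replaces characters outside ASCII; it does not escape markup characters such as & or <.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9 \u2122"  # e acute and the trademark sign

# Characters outside ASCII become decimal numerical character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'caf&#233; &#8482;'

# An XML parser turns the references back into the original characters.
document = (
    b'<?xml version="1.0" encoding="us-ascii"?><para>'
    + ascii_bytes
    + b"</para>"
)
round_trip = ET.fromstring(document).text
print(round_trip == text)  # True
```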
DocBook XSL: The Complete Guide - 4th Edition
Copyright © 2002-2007 Sagehill Enterprises