The DocBook XSL stylesheets support documents written in many languages. This support is made easier by the fact that XML itself supports Unicode, which includes characters for most of the world's languages. To write a DocBook document in a given language, you just have to identify a character encoding that expresses the language, and then indicate that character encoding in the XML declaration that must appear at the top of each XML file, such as
<?xml version="1.0" encoding="iso-8859-1"?>. You write the text of your document using that character encoding, and you use the standard DocBook tags (which have English names) to mark the XML elements. Then you just have to make sure the XSLT processor you use supports your encoding.
The language support in the DocBook XSL stylesheets is primarily for generated text that the stylesheets produce. For example, an English document should label a chapter with Chapter 3, while a German document's chapter should be labeled Kapitel 3.
The XML document encoding does not tell the stylesheets what language the document is written in. You have to supply that information with either a
lang attribute in the document or a stylesheet parameter at processing time.
Indexing in DocBook XSL does not sort properly for non-English languages. But there is a customization available that does sort properly. See the section “Internationalized indexes”.
The preferred method of indicating language is by adding a lang attribute with a language code value, usually on the document root element. This method records the language within the document itself, so it is clear to anyone examining the document. The attribute also triggers automatic processing in that language by the stylesheets, so you do not have to indicate the language on the processing command line.
Because lang is one of the common DocBook attributes, it is permitted on all DocBook elements. The attribute applies to the element it is in, and all of that element's descendants. If one of the descendants has a different lang attribute, then it overrides the ancestor's value for the scope of that descendant. For example, if a document's root element is book, you can put a lang attribute in the book start tag so it applies to the whole document. If one of your chapters is written in a different language, then it can have a lang attribute whose value applies only to that chapter. The following example illustrates this usage.
<book lang="de">
  ...
  <chapter>
    <title>Profil verwalten</title>
    ...
  </chapter>
  <chapter lang="en">
    <title>Special Features</title>
    ...
  </chapter>
  <chapter>
    <title>Junk-E-Mails vermeiden</title>
    ...
  </chapter>
</book>
In this example, the document root element sets the lang to
de (German) for the document. So the chapters
Profil verwalten and
Junk-E-Mails vermeiden are processed as German. But the
Special Features chapter has its own lang set to
en (English). So the second chapter is processed as English. Its label will be
Chapter in the chapter title page, the book's table of contents and any cross references to that chapter.
You can also indicate the language of a document at processing time by using a stylesheet parameter set to a language code. This is useful if you are processing a document that does not have a
lang attribute and you cannot edit it to add one, or if you want to override the attribute it does have. There are two stylesheet parameters that can be used to set the processing language:
l10n.gentext.language will override any lang attribute set in the document. This parameter is only needed if the document is in a single language that is not English, and one of the following conditions applies:
It does not have a lang attribute.
The lang attribute it does have is wrong.
The lang attribute it does have is not one of those supported by the stylesheets.
l10n.gentext.default.language can be used in the same circumstances as the l10n.gentext.language parameter, but it will not override any lang attributes in the document. It applies only to those elements for which no lang attribute applies. Thus if there is a lang attribute on the document's root element, the parameter will have no effect.
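As a minimal sketch, either parameter can be set in a customization layer rather than on the command line. The import path below is a placeholder that depends on where your stylesheet distribution is installed, and the value de is only an example.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Placeholder path: point this at your own copy of the stock stylesheet -->
  <xsl:import href="../docbook-xsl/html/docbook.xsl"/>

  <!-- Used only for elements that no lang attribute applies to -->
  <xsl:param name="l10n.gentext.default.language">de</xsl:param>

  <!-- Uncomment to override every lang attribute in the document instead -->
  <!-- <xsl:param name="l10n.gentext.language">de</xsl:param> -->

</xsl:stylesheet>
```

Setting the default-language parameter is the gentler choice, because a chapter that carries its own lang attribute keeps its own generated text.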
If you are wondering about the names of these parameters, you probably do not recognize the odd abbreviation l10n, which is a lower case L followed by the number 10 and the letter n. This is an abbreviation of “localization” (the first and last letters, and 10 letters in between). It means the gentext strings are adapted to a particular locale in the world. This abbreviation is similar to i18n, which is an abbreviation for “internationalization”.
As of this writing, DocBook XSL supports 45 languages. That means it has translations for the generated text strings in 45 languages. The translations are stored in XML files named for the language code, such as fr.xml. These are stored in the common subdirectory of the stylesheet distribution. So if you want to check whether a given language is supported, look in that directory for an XML file of that name. The top of each file looks like the following:
<?xml version="1.0" encoding="US-ASCII"?>
<l:l10n xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
        language="it"
        english-language-name="Italian">
The language attribute identifies the language code. It is this attribute value that the stylesheet uses to match to a lang attribute in a document. The filename just happens to have the same name. The english-language-name attribute gives the language name in English for each language.
Most of the language codes are two letters, named using the ISO 639 standard. A few have variations to reflect how a given language is used in a different country. For example, the pt_br language is for Portuguese as spoken in Brazil. The country codes used in the second part of the name are ISO 3166 alpha-2 codes.
When you specify a language code for your document in an attribute or parameter, you can use upper- or lower-case letters. If it has a country extension, you can use either dash or underscore as the separator. In all these cases the stylesheets will map the code to the supported value.
If you specify a country extension, and there is no translation for that extension, the stylesheet will fall back to using just the two-letter language code. If a two-letter code is not supported, then the stylesheets fall back to English.
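The following sketch shows how the matching rules just described would play out on lang values. The code xx is a made-up, unsupported value, and the fr-CA case assumes that your distribution has no fr_ca translation file; check your common directory to confirm.

```xml
<book lang="PT-BR">
  <!-- Upper case and a dash: mapped to the supported pt_br code -->
  <title>...</title>

  <chapter lang="fr-CA">
    <!-- Assuming no fr_ca file exists, falls back to fr -->
    <title>...</title>
    <para>...</para>
  </chapter>

  <chapter lang="xx">
    <!-- Hypothetical unsupported code: falls back to English -->
    <title>...</title>
    <para>...</para>
  </chapter>
</book>
```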
In theory, DocBook XSL can support any language that can be expressed in Unicode. In practice, only 45 languages have translated text strings that the stylesheets can access. If you need a language that is not currently available, then you can make the translations and add them to your stylesheets. You should copy the English file
common/en.xml to a new language code XML file, and then translate the
text attributes in the file. The translations should use Unicode numerical character references for any non-ASCII characters.
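A new language file might start out like the sketch below. It uses fy (Frisian) as the example code; the key names are assumed to match entries in your copy of common/en.xml, and the translated strings are illustrative only and should be checked by a native speaker. The character reference &#226; stands for a non-ASCII letter (a with circumflex).

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<l:l10n xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
        language="fy"
        english-language-name="Frisian">

  <!-- Keep each key unchanged; translate only the text attribute -->
  <l:gentext key="Chapter" text="Haadstik"/>

  <!-- Non-ASCII characters are written as numeric character references -->
  <l:gentext key="TableofContents" text="Ynh&#226;ld"/>

  <!-- ... the remaining entries copied from en.xml, translated ... -->

</l:l10n>
```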
The easiest way to add a new language to the stylesheets is to submit your translation to the DocBook XSL project for integration into the next release. Send email to the project admins at the DocBook SourceForge site. Then your new translation will be included in future stylesheet distributions. It also makes it available to other users, who can make contributions to it as well.
If you want to include your translation only in your own stylesheet, you need to do the following:
Copy the stylesheet file
common/l10n.xml to a new filename, such as
common/my-l10n.xml. It is best to keep it in the same directory because it references all the other language files in that directory.
Edit your new file to add a SYSTEM entity declaration to the DOCTYPE and an entity reference to the body of your copied file. Just copy similar lines from the file itself. The entity declaration should point to your new language file location, relative to your copied file:
<!ENTITY fy SYSTEM "../mystuff/fy.xml">
...
&fy;
Create a stylesheet customization layer if you do not already have one.
Add the following line to your customization file:
<xsl:param name="l10n.xml" select="document('../mystuff/my-l10n.xml')"/>
The path to your enhanced
my-l10n.xml file should be relative to your stylesheet customization file.
The document() function loads your customized file into the stylesheet parameter l10n.xml. The stylesheets search that parameter when they look up generated text for a language.
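Put together, a customization layer for the steps above might look like this sketch. Both paths are placeholders: the import path depends on your installation, and the document() path must be adjusted to wherever you actually saved my-l10n.xml.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Placeholder path to the stock stylesheet being customized -->
  <xsl:import href="../docbook-xsl/html/docbook.xsl"/>

  <!-- Replace the stock localization data with the copy that also
       pulls in the new language file through its entity reference.
       The path is resolved relative to this customization file. -->
  <xsl:param name="l10n.xml"
             select="document('../mystuff/my-l10n.xml')"/>

</xsl:stylesheet>
```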
This arrangement is a bit awkward, and will need to be repeated with each new stylesheet release. It's best to complete the translation and submit it to the DocBook project.
Some languages, such as Hebrew and Arabic, read from right to left. When viewing an XML source file, you might think that it reads from left to right, but that view is just an artifact of the viewing device. In fact, an XML file is a linear sequence of bytes, with no particular direction except from beginning to end. The file is in logical order, with the beginning of each word appearing earlier in the file than the end of the word, regardless of the language. Any device that interprets the bytes and assigns displayable characters has to choose how to lay out those characters in some readable fashion. For some languages, that presentation is left to right, and for others it is right to left.
There are two principal properties that determine the direction of text:
Writing mode sets the overall direction for the document.
dir attributes change the direction for specific spans of text.
Note that most right-to-left languages are actually bidirectional, because numbers still read from left-to-right, and any words in the Latin alphabet, such as technical terms, still read from left-to-right.
Writing mode is a term from XSL-FO that describes the overall plan for laying out text onto a page. A writing mode is a combination of horizontal direction and vertical direction for text flows. For example, an XSL-FO output with
writing-mode="lr-tb" displays inline text that flows from left-to-right (
lr), and lines that stack down the page from top-to-bottom (
tb). Similarly, in
rl-tb the inline text flows from right-to-left, and again the lines stack down the page from top-to-bottom.
If you are conditioned to Latin-based languages that read left-to-right, you may not realize how important the left side is for text layout. Indents that show hierarchy are indented from the left. Numbers in orderedlist and bullets in itemizedlist appear on the left. When outputting a right-to-left language such as Arabic or Hebrew, putting such features on the left does not work. The importance of these formatting features is not that they appear on the left, but that they appear at the start of the line. The XSL-FO standard recognizes this, and uses the term start-indent instead of a left indent. With writing-mode="rl-tb" (right-to-left), the start-indent property is applied to the right side. Similarly, bullets and numbers appear on the right, at the start of their line. Tables are also reversed, that is, the first table-cell in each row appears on the right.
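To make the effect of start-indent concrete, here is a small hand-written XSL-FO sketch, not output copied from the stylesheets. The page dimensions, master name, and sample Hebrew word are arbitrary choices for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"
         writing-mode="rl-tb">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page"
                           page-width="210mm" page-height="297mm"
                           margin="25mm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <!-- Starts at the right edge, because lines begin on the right -->
      <fo:block>שלום</fo:block>
      <!-- start-indent is measured from the right side in rl-tb -->
      <fo:block start-indent="2em">שלום</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

Changing writing-mode to lr-tb in the same file moves the indent to the left side without touching the start-indent value.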
You can set the writing mode for XSL-FO output by adding an attribute to the root.properties attribute-set:
<xsl:attribute-set name="root.properties">
  <xsl:attribute name="writing-mode">rl-tb</xsl:attribute>
</xsl:attribute-set>
When you set this property, you will find that your print pages are mirror images of the left-to-right writing mode. Even page headers and footers will be mirrored, because they use tables to lay out the different portions of the headers and footers, and the order of table cells is reversed. You may want to swap your values for the page.margin.inner and page.margin.outer parameters, because the binding side changes.
If you set
writing-mode="rl-tb" in a document using a Latin-based language, the text does not print backwards. Only the layout is mirrored. As described in the next section, the text direction is based on the Unicode character range in use.
For HTML output, a right-to-left writing mode can be established by adding a
dir="rtl" attribute to the
HTML document element in the output. This currently requires using a customization, which differs if you are doing single-page or chunked output.
Single-page HTML, customize this template from docbook.xsl:

<xsl:template match="*" mode="process.root">
  <xsl:variable name="doc" select="self::*"/>
  <xsl:call-template name="user.preroot"/>
  <xsl:call-template name="root.messages"/>
  <html>
    <xsl:variable name="lang">
      <xsl:call-template name="l10n.language"/>
    </xsl:variable>
    <xsl:if test="starts-with($lang, 'he') or starts-with($lang, 'ar')">
      <xsl:attribute name="dir">rtl</xsl:attribute>
    </xsl:if>
    ...

Chunked HTML, customize this template from chunk-common.xsl:

<xsl:template name="chunk-element-content">
  <xsl:param name="prev"/>
  <xsl:param name="next"/>
  <xsl:param name="nav.context"/>
  <xsl:param name="content">
    <xsl:apply-imports/>
  </xsl:param>
  <xsl:call-template name="user.preroot"/>
  <html>
    <xsl:variable name="lang">
      <xsl:call-template name="l10n.language"/>
    </xsl:variable>
    <xsl:if test="starts-with($lang, 'he') or starts-with($lang, 'ar')">
      <xsl:attribute name="dir">rtl</xsl:attribute>
    </xsl:if>
    ...
These customizations call the utility template named l10n.language to get the current document's lang attribute. They then check whether it starts with either he (Hebrew) or ar (Arabic), and if so add the dir="rtl" attribute.
When processing content for output, you will find that the inline text direction is mostly handled automatically. That is, if you process an XML document containing Arabic, the formatted output will present the Arabic words from right to left, and any English words from left to right.
How does the formatter know when to switch the direction of presentation? It knows by the range of Unicode characters used in each word. Part of the information in the Unicode standard is the text direction that each range of characters is expected to be presented in. Latin letters are to be presented left to right, and Hebrew characters from right to left. Modern browsers and XSL-FO processors use that information to decide the direction of presentation. In mixed language text, sometimes called bidirectional text, the direction can change in mid-sentence. When a formatter encounters a bit of text that should be displayed in the opposite direction, it has to read forward to find the end of such text, print it out character-by-character reading backwards from the end, and then resume normal layout of the text that follows.
There are some combinations of text that make this task harder for the formatter. Punctuation, parentheses, numbers mixed with letters, and other combinations may present ambiguous information to the formatter. In such cases, the author may need to provide some help to the formatter through the XML markup.
The DocBook schemas starting with version 4.3 have supported an attribute named
dir on almost all elements. The
dir attribute provides a hint to the formatter for which direction to display the text enclosed by the element with that attribute. There are four possible values:
|dir attribute value||Unicode Name||Description|
|ltr||Left-to-Right Embedding||Embed a span of left-to-right characters inside right-to-left text.|
|rtl||Right-to-Left Embedding||Embed a span of right-to-left characters inside left-to-right text.|
|lro||Left-to-Right Override||Force the characters to be treated as strong left-to-right characters.|
|rlo||Right-to-Left Override||Force the characters to be treated as strong right-to-left characters.|
You can put a
dir attribute on any inline element. Use
phrase if the text is not already inside an inline element. That is particularly useful for problems with parentheses or punctuation. You do not need to write a customization in order for these attribute values to have their effect. They automatically output the correct properties in HTML or XSL-FO for inline text elements. Then it is up to the browser or XSL-FO processor to handle it.
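As an illustration, a Hebrew paragraph that contains a parenthesized Latin-script term could wrap that term in phrase so the parentheses stay attached to it. The sample sentence and the product name are invented for this sketch.

```xml
<para lang="he">
  <!-- Surrounding text is Hebrew and flows right to left -->
  הגרסה הנוכחית היא
  <!-- The embedded span, including its parentheses, is laid out left to right -->
  <phrase dir="ltr">(Example Toolkit 2.0)</phrase>
</para>
```

Without the phrase wrapper, a formatter may place one of the parentheses on the wrong side of the term, because parentheses have no strong direction of their own.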
When working with a language such as Hebrew or Arabic that reads right to left, you need to pay attention to the XSL-FO start and end terminology. These designate the two sides of a page, but which side each refers to depends on the writing mode (see the section “Writing mode” for details). The term start refers to the side of a page that a sentence starts from. With the default writing-mode="lr-tb", the term start refers to the left side. If you set writing-mode="rl-tb" (right to left), then start means the right side. In each case, end means the opposite side, where a sentence ends.
For example, when you set the
body.start.indent stylesheet parameter to indent paragraphs relative to titles, it inserts a
start-indent property in the XSL-FO output. That creates an indent on the left by default. But when you use right-to-left writing mode, the indents will be on the right, which is appropriate for those languages.
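A print customization that combines the two settings might look like the following sketch. The import path is a placeholder, and 4pc is only an example indent value.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Placeholder path to the stock XSL-FO stylesheet -->
  <xsl:import href="../docbook-xsl/fo/docbook.xsl"/>

  <!-- Lay out pages right to left -->
  <xsl:attribute-set name="root.properties">
    <xsl:attribute name="writing-mode">rl-tb</xsl:attribute>
  </xsl:attribute-set>

  <!-- Emitted as start-indent, so it lands on the right side in rl-tb -->
  <xsl:param name="body.start.indent">4pc</xsl:param>

</xsl:stylesheet>
```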
DocBook XSL: The Complete Guide - 4th Edition
Copyright © 2002-2007 Sagehill Enterprises