Regardless of the encoding of your source documents, an XSL engine can write its output in a different encoding if you need it. When a document is loaded into memory, XML applications such as XSLT engines convert it to Unicode. The XSL engine then uses the stylesheet templates to create a transformed version of the content in memory structures. When it is done, it serializes the internal content into a stream of bytes that it feeds to the outside world. During the serialization process, it can convert the internal Unicode to some other encoding for the output. The output encoding is set by the encoding attribute of the stylesheet's xsl:output element, for example:
<xsl:output method="html" encoding="ISO-8859-1" indent="no"/>
The encoding="ISO-8859-1" attribute means all documents processed with that stylesheet are output in the ISO-8859-1 encoding. If a stylesheet's xsl:output element does not have an encoding attribute, then the default output encoding is UTF-8. That is what the fo/docbook.xsl stylesheet for print output does.
When the output method is html, the XSLT processor also adds a meta tag to the HTML head that identifies the file's encoding:
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
When a browser opens the HTML file, it reads this tag and knows the bytes it finds in the file map to the ISO-8859-1 character set for display. What if the document contains characters that are not available in the specified output encoding? As with input, the characters are expressed as numerical character references such as &#8482;. It is up to the browser to figure out how to display such characters. Most browsers cover a pretty wide range of character entities, but there are so many that sometimes a browser does not have a way to display a given character.
Most modern graphical browsers can display HTML files encoded with UTF-8, which covers a much wider set of characters than ISO-8859-1. To change the output encoding for the non-chunking docbook.xsl stylesheet, you have to use a stylesheet customization layer. That is because the XSLT specification does not permit the encoding attribute of xsl:output to be a variable or parameter value. Your stylesheet customization must provide a new <xsl:output> element such as the following:
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:import href="/path/to/html/docbook.xsl"/>
  <xsl:output method="html" encoding="UTF-8" indent="no"/>
</xsl:stylesheet>
This is a complete stylesheet customization that you can save in a file such as docbook-utf8.xsl and use in place of the stock html/docbook.xsl stylesheet. All it does is import the stock stylesheet and set a new output encoding, in this instance to UTF-8. Any HTML files generated with this stylesheet will have their characters encoded as UTF-8, and each file will include a meta tag like the following:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
Changing the output encoding of the chunking stylesheet is much easier. It can be done with the chunker.output.encoding parameter, either on the command line or in a customization layer. That's because the chunking stylesheet uses EXSLT extensions to generate HTML files. See the section “Output encoding for chunk HTML” for more information.
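As an illustration of the customization-layer route, a minimal sketch might look like the following. It assumes the stock chunking stylesheet is html/chunk.xsl and reuses the /path/to/ placeholder from the earlier example; substitute the location of your own installation.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Placeholder path (assumption): point this at your installed chunking stylesheet -->
  <xsl:import href="/path/to/html/chunk.xsl"/>
  <!-- Because this is an ordinary parameter, no new xsl:output element is needed -->
  <xsl:param name="chunker.output.encoding">UTF-8</xsl:param>
</xsl:stylesheet>
```

The same parameter can instead be passed on the command line, using whatever parameter syntax your XSLT processor provides.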
If you are using the Saxon processor with the chunking stylesheet for non-English HTML output, then you may want to set the stylesheet parameter saxon.character.representation to a value of 'native;decimal'. By default, this parameter (which is defined in html/chunker.xsl) is set to 'entity;decimal'. The default value of entity before the semicolon means that any non-ASCII characters within the encoding are converted to named entity references such as &aacute; instead of being written as the native character code for that encoding. For example, when using the iso-8859-1 output encoding, this means one native character is replaced by the 8 ASCII characters that form the named entity reference, which makes your files considerably larger. When entity is replaced with native, the single character code of the encoding is output. Note that when the output encoding is UTF-8 and the parameter value uses native, then no entity references will be output because there are no XML characters outside of UTF-8.
The value after the semicolon controls how characters that are not in the encoding are output by Saxon. They must be converted to some kind of character or entity reference, and the value can be entity (named entity reference such as &aacute; if one exists), decimal (decimal numerical character reference such as &#225;), or hex (hexadecimal numerical character reference such as &#xE1;). Saxon outputs named entity references only for characters in ISO-8859-1, not for all DocBook named character entities.
If you are using the chunking stylesheet, then you can use this parameter to set the Saxon output character representation. If you are using the non-chunking stylesheet, then your customization of xsl:output as described above needs to be enhanced as follows:
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:saxon="http://icl.com/saxon"
                extension-element-prefixes="saxon">
  <xsl:import href="file:///c:/docbook/xsl/html/docbook.xsl"/>
  <xsl:output method="html" encoding="UTF-8" indent="no"
              saxon:character-representation="native;decimal"/>
</xsl:stylesheet>
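For the chunking stylesheet, the equivalent setting could be made with parameters alone. The following is a sketch under assumptions: the import path is a placeholder, html/chunk.xsl is taken to be the stock chunking stylesheet, and the ISO-8859-1 encoding is chosen only to show a case where the value after the semicolon matters.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Placeholder path (assumption): adjust to your installation -->
  <xsl:import href="/path/to/html/chunk.xsl"/>
  <!-- Illustrative choice of encoding for the chunk files -->
  <xsl:param name="chunker.output.encoding">ISO-8859-1</xsl:param>
  <!-- native: characters within the encoding are written as single bytes;
       decimal: characters outside it become decimal character references -->
  <xsl:param name="saxon.character.representation" select="'native;decimal'"/>
</xsl:stylesheet>
```

With these settings, a character such as á that exists in ISO-8859-1 should be written natively, while a character outside that encoding should appear as a decimal character reference.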
DocBook XSL: The Complete Guide - 4th Edition
Copyright © 2002-2007 Sagehill Enterprises