Output encoding

Regardless of what the encoding is for your documents, an XSL engine can convert the output to a different encoding if you need it. When the document is loaded into memory, XML applications such as XSLT engines convert it to Unicode. The XSL engine then uses the stylesheet templates to create a transformed version of the content in memory structures. When it is done, it serializes the internal content into a stream of bytes that it feeds to the outside world. During the serialization process, it can convert the internal Unicode to some other encoding for the output.
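
The parse, transform, serialize sequence can be seen with an identity stylesheet, which copies its input unchanged so that the only difference in the result is the encoding. The following is a minimal sketch, not part of the DocBook distribution, and the choice of UTF-8 is an assumption made only for illustration:

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<!-- Serialize the result as UTF-8, whatever encoding the input file declares -->
<xsl:output method="xml"
            encoding="UTF-8"
            indent="no"/>

<!-- Identity template: copy every node and attribute as is -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

Applied to a document saved as ISO-8859-1, this should produce the same elements and text written out as UTF-8 bytes, with the XML declaration in the output naming the new encoding.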

An XSL stylesheet usually sets the output encoding in an xsl:output element at the top of the stylesheet file. The following shows that element for the html/docbook.xsl stylesheet:

<xsl:output method="html"
            encoding="ISO-8859-1"
            indent="no"/>

The encoding="ISO-8859-1" attribute means all documents processed with that stylesheet are to be output with the ISO-8859-1 encoding. If a stylesheet's xsl:output element does not have an encoding attribute, then the default output encoding is UTF-8. That is what the fo/docbook.xsl stylesheet for print output does.
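
For comparison, a declaration that leaves out the encoding attribute might look like the following. This is an illustrative fragment, not the actual declaration from fo/docbook.xsl. Note that for method="xml" the XSLT 1.0 specification allows a processor to choose either UTF-8 or UTF-16 as its default, so UTF-8 is the usual result rather than a guaranteed one:

```xml
<!-- No encoding attribute: the processor falls back to its default,
     which is UTF-8 in most processors for method="xml" -->
<xsl:output method="xml"
            indent="no"/>
```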

When the xsl:output element specifies method="html", the XSLT processor also adds an HTML meta tag that identifies the HTML file's encoding:

<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">

When a browser opens the HTML file, it reads this tag and knows that the bytes in the file map to the ISO-8859-1 character set for display. What if the document contains characters that are not available in the specified output encoding? As with input, such characters are expressed as numeric character references such as &#8482;. It is up to the browser to figure out how to display them. Most browsers can display a wide range of characters, but there are so many that a browser sometimes has no way to display a given character.
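
A small standalone stylesheet can show the difference between a character that fits the output encoding and one that does not. This is a hypothetical test stylesheet, not part of DocBook. It writes one character that ISO-8859-1 can represent (U+00E9, e with acute accent) and one that it cannot (U+2122, the trademark sign):

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:output method="html"
            encoding="ISO-8859-1"
            indent="no"/>

<!-- Runs against any input document and ignores its content -->
<xsl:template match="/">
  <html>
    <head><title>Encoding test</title></head>
    <body>
      <!-- U+00E9 exists in ISO-8859-1 -->
      <p>caf&#233;</p>
      <!-- U+2122 does not exist in ISO-8859-1 -->
      <p>DocBook&#8482;</p>
    </body>
  </html>
</xsl:template>

</xsl:stylesheet>
```

The first character may be written as a single byte or as a named entity, depending on the processor and its settings. The second has to appear as a reference of some kind, and whether that is numeric or named also depends on the processor.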

Most modern graphical browsers can display HTML files encoded with UTF-8, which covers a much wider set of characters than ISO-8859-1. To change the output encoding for the non-chunking docbook.xsl stylesheet, you have to use a stylesheet customization layer. That is because the XSLT specification does not permit the encoding attribute of xsl:output to be a variable or parameter value. Your stylesheet customization must provide a new xsl:output element such as the following:

<?xml version='1.0'?> 
<xsl:stylesheet  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"  
                 version="1.0"> 

<xsl:import href="/path/to/html/docbook.xsl"/> 
<xsl:output method="html"
            encoding="UTF-8"
            indent="no"/>
 
</xsl:stylesheet>  

This is a complete stylesheet customization that you can save in a file such as docbook-utf8.xsl and use in place of the stock html/docbook.xsl stylesheet. All it does is import the stock stylesheet and set a new output encoding, in this instance UTF-8. Any HTML file generated with this stylesheet will have its characters encoded as UTF-8 and will include a meta tag like the following:

<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

Changing the output encoding of the chunking stylesheet is much easier. It can be done with the chunker.output.encoding parameter, either on the command line or in a customization layer. That's because the chunking stylesheet uses EXSLT extensions to generate HTML files. See the section “Output encoding for chunk HTML” for more information.
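
As a sketch, a customization layer for chunked output that sets this parameter could look like the following. It assumes the chunking stylesheet is html/chunk.xsl in your installation. The import path is a placeholder, as in the earlier example, and UTF-8 is just one possible value:

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:import href="/path/to/html/chunk.xsl"/>

<!-- Encoding used for each chunked HTML file -->
<xsl:param name="chunker.output.encoding">UTF-8</xsl:param>

</xsl:stylesheet>
```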

Saxon output character representation

If you are using the Saxon processor with the chunking stylesheet for non-English HTML output, then you may want to set the stylesheet parameter saxon.character.representation to a value of 'native;decimal'. By default, this parameter (which is defined in html/chunker.xsl) is set to 'entity;decimal'. The default value of entity before the semicolon means that any non-ASCII character that exists in the output encoding is converted to a named entity reference such as &aacute; instead of being written as its native character code in that encoding. For example, with the ISO-8859-1 output encoding, a single-byte character is replaced by the 8 ASCII characters of the named entity reference, which makes files with many such characters considerably larger. When entity is replaced with native, the single character code of the encoding is output. Note that when the output encoding is UTF-8 and the parameter value uses native, no entity or character references are output at all, because UTF-8 can encode every XML character.

The value after the semicolon controls how characters that are not in the encoding are output by Saxon. They must be converted to some kind of entity reference, and the value can be entity (named entity reference such as &aacute; if one exists), decimal (decimal numerical character reference such as &#225;), or hex (hexadecimal numerical character reference such as &#xE1;). Saxon outputs named entity references only for characters in ISO-8859-1, not for all DocBook named character entities.
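
For the chunking stylesheet, both parts of the value go into a single parameter. The following is a minimal sketch of a customization layer, assuming ISO-8859-1 output, html/chunk.xsl as the chunking stylesheet, and a placeholder import path:

```xml
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:import href="/path/to/html/chunk.xsl"/>

<xsl:param name="chunker.output.encoding">ISO-8859-1</xsl:param>

<!-- native:  characters in ISO-8859-1 are written as single bytes
     decimal: characters outside it become references such as &#8482; -->
<xsl:param name="saxon.character.representation">native;decimal</xsl:param>

</xsl:stylesheet>
```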

If you are using the chunking stylesheet, then you can use this parameter to set the Saxon output character representation. If you are using the non-chunking stylesheet, then your customization of xsl:output as described above needs to be enhanced as follows:

<xsl:stylesheet version="1.0"  
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"  
                xmlns:saxon="http://icl.com/saxon"
                extension-element-prefixes="saxon">

<xsl:import href="file:///c:/docbook/xsl/html/docbook.xsl"/>

<xsl:output method="html" 
            encoding="UTF-8"
            indent="no" 
            saxon:character-representation="native;decimal"/>