You may want to split the output for a large document into several HTML files. That process is known in DocBook as chunking, and the individual output files are called chunks. The results are a coherent set of linked files, with a title page containing a table of contents as the starting point for browsing the set.
xsltproc /usr/share/docbook-xsl/html/chunk.xsl myfile.xml
The default behavior in chunking includes:
The name of the main title page/table of contents file is index.html.
Each of the following elements starts a new chunk:
appendix
article
bibliography in article or book
book
chapter
colophon
glossary in article or book
index in article or book
part
preface
refentry
reference
sect1 except first
section if equivalent to sect1
set
setindex
Each chunk filename is generated with an algorithm. It can instead be named after the id attribute value of its starting element, if it has one (see the section “Generated filename”).
A chunk's filename is taken from the first of the following that applies:
A dbhtml filename processing instruction embedded in the element.
The chunk element's id attribute value (but only if the use.id.as.filename parameter is set).
A unique name generated by the stylesheet.
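The choice among these naming sources can be sketched as a small shell function that checks for a processing instruction first, then the id, and finally falls back to the generated name. This only models the decision and is not the stylesheet's own code: the function name and its arguments are hypothetical, and the .html extension appended to the id is an assumption based on the default output extension.

```shell
# Sketch of the selection order only; the real logic lives in the stylesheets.
# Arguments (all hypothetical):
#   $1  filename from a dbhtml filename PI, or empty
#   $2  id attribute of the chunk element, or empty
#   $3  value of use.id.as.filename (0 means off)
#   $4  name the stylesheet would otherwise generate
pick_chunk_filename() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"          # PI value is used as-is, no extension added
    elif [ -n "$2" ] && [ "$3" != "0" ]; then
        printf '%s.html\n' "$2"     # id plus the assumed .html extension
    else
        printf '%s\n' "$4"          # generated fallback
    fi
}

pick_chunk_filename "intro.html" "intro" 0 "ch01.html"   # prints intro.html
pick_chunk_filename "" "install" 1 "ch02.html"           # prints install.html
pick_chunk_filename "" "install" 0 "ch02.html"           # prints ch02.html
```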
<chapter><?dbhtml filename="intro.html" ?>
<title>Introduction</title>
...
The dbhtml name indicates that this processing instruction is intended for DocBook HTML processing. This dbhtml filename processing instruction says that the HTML chunk file for this chapter should be intro.html. The stylesheet does not add a filename extension when dbhtml filename is used. The processing instruction needs to be an immediate child of the element you are naming, not inside one of its children. For example, it will not work if you put it inside the title element of a chapter. If there is more than one such PI in an element, then the first one is used.
If the element that starts a new chunk has an id attribute, then that value can be used as the start of the chunk filename. The stylesheet parameter use.id.as.filename controls that behavior. If that parameter is set to a non-zero value, then your chunk filenames will use the element's id attribute. By default, the parameter is set to zero, so you have to turn that behavior on if you want it. For example:
<chapter id="intro">
<title>Introduction</title>
...
With the parameter set, this chapter's chunk file would be named intro.html.
This works for all elements that have an id value and that start a chunk, except for the main index file. That file is named using the value of the root.filename parameter, which is index by default. To use your document root element's id as that filename, set the root.filename parameter to blank.
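A minimal sketch of a command line that turns on id-based filenames and also blanks root.filename, wrapped in a shell function. The function name is hypothetical, and the stylesheet path is an assumption taken from the command at the start of this section; adjust it for your installation.

```shell
# Assumed stylesheet location; it varies between installations.
CHUNK_XSL=/usr/share/docbook-xsl/html/chunk.xsl

# chunk_with_ids FILE  (hypothetical helper)
# Name each chunk after its id attribute, including the root element's chunk.
chunk_with_ids() {
    xsltproc \
        --stringparam use.id.as.filename 1 \
        --stringparam root.filename "" \
        "$CHUNK_XSL" "$1"
}

# Usage: chunk_with_ids myfile.xml
```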
There may be situations where you need to add a prefix to all the chunk filenames. For example, if you are putting the output for several chunked books into one directory, you could use a different prefix for each book to avoid filename duplication (and subsequent overwritten files).
If you need all of your chunk filenames to include some sort of prefix string, then you can use the base.dir stylesheet parameter. Normally the base.dir parameter is used to specify a directory to contain the chunked files, as described in the section “base.dir parameter”. When defining just an output directory with base.dir, you must end the parameter value with a literal / character. If you omit the trailing slash, then the chunk filename is appended to the value without a slash separator, effectively adding it as a prefix to each chunk filename. You can also combine a prefix and a directory name, as shown in the third example below.
|base.dir parameter value||Description||Example chunk filename|
| ||Output directory only.|| |
| ||Filename prefix only.|| |
| ||Output directory and filename prefix.|| |
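The three cases can be illustrated with a short shell sketch that mimics the plain string concatenation described above. The function name and the values output/, refman-, and ch01.html are hypothetical, chosen only for illustration.

```shell
# chunk_path BASE_DIR FILENAME  (hypothetical helper)
# Models the behavior described above: the chunk filename is appended to
# base.dir with no separator added.
chunk_path() {
    printf '%s%s\n' "$1" "$2"
}

chunk_path "output/"        "ch01.html"   # prints output/ch01.html
chunk_path "refman-"        "ch01.html"   # prints refman-ch01.html
chunk_path "output/refman-" "ch01.html"   # prints output/refman-ch01.html
```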
If not specified by a PI or id attribute, then the XSL stylesheet will generate a filename. The names are abbreviations of the element name and a count. For example, the first chapter element would be ch01.html, the second chapter would be ch02.html, and so on. The first sect1 in a chapter might be s01.html. But that filename would not be unique if each chapter had a sect1. To make each sect1 name unique, the stylesheet prepends the chapter part. So the first sect1 in the second chapter would be ch02s01.html. In general, the stylesheet keeps adding parent prefixes to make sure each name is unique. If a document is a set with multiple books, then the stylesheet would also add a book prefix to the name.
The names are not pretty, but they do have a recognizable logic. They are also somewhat stable, as opposed to random number names that might have been used instead. But the filenames may change if the document is edited, because when you insert a chapter, subsequent chapters are bumped up in number. If you are creating a website in which other files refer to these chunk filenames, then they are moving targets unless the document never changes. If you want to point to your generated files, it's best not to use generated filenames, and instead to use one of the other methods to name them. Using the id attribute is the easiest.
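The chapter and sect1 naming pattern can be mimicked with a few lines of shell, which also shows why the names shift when a chapter is inserted. This is only an illustration of the pattern for those two element types; the function name is hypothetical and it is not how the stylesheet computes names.

```shell
# generated_name CHAPTER [SECT1]  (hypothetical helper)
# Builds names in the chNN.html / chNNsNN.html pattern.
generated_name() {
    if [ -n "${2:-}" ]; then
        printf 'ch%02ds%02d.html\n' "$1" "$2"
    else
        printf 'ch%02d.html\n' "$1"
    fi
}

generated_name 1      # prints ch01.html
generated_name 2 1    # prints ch02s01.html
# After a new chapter is inserted ahead of it, the same section becomes:
generated_name 3 1    # prints ch03s01.html
```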
The first thing you will notice when you chunk a document is that it can produce a lot of HTML files! Suddenly your directory is very crowded with new HTML files. When chunking, most people choose to place the chunked files into a separate directory.
One method that does not work is to use the processor's --output option. That option is used to redirect the standard output of the processor to a file. During chunking, the stylesheet creates the filenames and files, and also needs to handle the output directory. Use the base.dir parameter to specify it instead:
xsltproc --stringparam base.dir /usr/apache/htdocs/ chunk.xsl myfile.xml
Things to watch out for:
Be sure to include that trailing / because the stylesheet simply appends the filename to this string. If you forget the trailing slash, you'll end up with all your filenames beginning with that name. If you need such a filename prefix, then see the section “Filename prefix” for details.
Be aware that the base.dir parameter only works with the chunk stylesheet, not the docbook.xsl stylesheet. It does work with the onechunk.xsl stylesheet, though.
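One way to avoid forgetting the trailing slash is to normalize the directory argument in a small wrapper. The following is a sketch under assumptions: the function name is hypothetical and the stylesheet path is the one used at the start of this section.

```shell
# Assumed stylesheet location; it varies between installations.
CHUNK_XSL=/usr/share/docbook-xsl/html/chunk.xsl

# chunk_into DIR FILE  (hypothetical helper)
# Passes DIR as base.dir, making sure it ends with a slash.
chunk_into() {
    out_dir="${1%/}/"    # strip one trailing slash if present, then add one
    xsltproc --stringparam base.dir "$out_dir" "$CHUNK_XSL" "$2"
}

# Usage: chunk_into /usr/apache/htdocs myfile.xml
```

Do not use a wrapper like this when you want a filename prefix, because it always turns the value into a directory.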
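You can also specify an output directory in the document itself with a dbhtml dir processing instruction:
<book><?dbhtml dir="UserGuide" ?>
<title>User Guide</title>
...
<chapter id="intro">
...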
This sets the output directory to be UserGuide for the root element chunk and all of its children and descendants (unless otherwise specified). Since this is a relative pathname, the output will be relative to the current directory. So in this example the root element chunk will be in UserGuide/index.html, and the first chapter chunk will be in UserGuide/intro.html since it is a child of the book element. Note that the dbhtml dir value does not have a trailing slash because the stylesheet inserts one.
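If you also set the base.dir parameter on the command line, the dir value is taken relative to it:
xsltproc --stringparam base.dir /usr/apache/htdocs/ chunk.xsl myfile.xml
Then the root element chunk will be in /usr/apache/htdocs/UserGuide/index.html. Unlike the dir value, base.dir does need a trailing slash.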
If any of the descendants of the root element also have a dir processing instruction, then that value is appended to the ancestor value. That means it is relative to its ancestor element's directory. This allows you to build up a longer pathname to divide the output into several subdirectories of the main directory. For example:
<book><?dbhtml dir="UserGuide" ?>
<title>User Guide</title>
...
<chapter id="intro"><?dbhtml dir="FrontMatter" ?>
...
<chapter id="installing">
...
<appendix id="reference"><?dbhtml dir="BackMatter" ?>
...
Now the output chunks will be:
UserGuide/index.html
UserGuide/FrontMatter/intro.html
UserGuide/installing.html
UserGuide/BackMatter/reference.html
Note that the second chapter is not a child of the first chapter, so
its directory reverts to that of the book-level PI. Again, if the
base.dir parameter is set, then all of these become relative to
that value. Remember that you need to create any directories you specify, because
the stylesheets will not.
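Since the directories have to exist before chunking, a small helper can create them up front. This is a sketch: the function name is hypothetical, output/ is an assumed base.dir value, and the subdirectory names are the ones from the example above.

```shell
# make_chunk_dirs BASE_DIR DIR...  (hypothetical helper)
# Creates each DIR under BASE_DIR. BASE_DIR should end with a slash,
# just like the base.dir parameter value.
make_chunk_dirs() {
    base="$1"
    shift
    for d in "$@"; do
        mkdir -p "${base}${d}"
    done
}

# Usage, before running the chunk stylesheet with base.dir set to output/:
#   make_chunk_dirs output/ UserGuide UserGuide/FrontMatter UserGuide/BackMatter
```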
A dir processing instruction can be used to specify a full pathname if you do not use a base.dir parameter, but that's not a good idea. That hard-codes the path into your file, which means you have to edit the file to put the output elsewhere. Generally this PI is used to create directories relative to some base output directory that you specify on the command line with a parameter. That gives you the flexibility to put the output where you want, yet maintains the relative structure of the subdirectories specified by the PIs.
In all cases, cross references between your chunked files should still resolve, regardless of what the relative locations are.
If you are chunking large documents, then there is a stylesheet variation you can use that will speed up the processing. The caveat is that the XSL processor you are using must support the EXSLT
node-set() function. That includes Saxon, Xalan, and xsltproc. It does not include MSXSL, however.
To speed up chunking, use the chunkfast.xsl stylesheet instead of the regular chunk.xsl stylesheet. The chunkfast.xsl stylesheet is a customization of chunk.xsl and is included with the distribution in the html (and xhtml) directory. It handles chunks in a more efficient manner. In the regular chunk.xsl stylesheet, the calculation of the Next and Previous elements for each chunk is performed each time a chunk is output. That calculation requires searching the document using XPath, which can take some time for large documents. When chunkfast.xsl is used instead, those calculations are all done once ahead of time, so that output can proceed without repeating them.
You may notice that there is a
chunk.fast parameter included in the stylesheets. Setting that parameter is not sufficient for getting the correct fast chunking behavior. You have to use the
chunkfast.xsl stylesheet in order for the headers and footers to be correct. That stylesheet sets the parameter and customizes some templates.
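In practice the switch amounts to naming a different stylesheet file on the command line. A minimal sketch, assuming the stylesheets are installed under /usr/share/docbook-xsl/html as in the command at the start of this section; the function name is hypothetical.

```shell
# Assumed stylesheet directory; it varies between installations.
STYLESHEET_DIR=/usr/share/docbook-xsl/html

# chunk_fast FILE [XSLTPROC_OPTIONS...]  (hypothetical helper)
# Runs chunkfast.xsl instead of chunk.xsl; extra options are passed through.
chunk_fast() {
    file="$1"
    shift
    xsltproc "$@" "$STYLESHEET_DIR/chunkfast.xsl" "$file"
}

# Usage: chunk_fast myfile.xml --stringparam base.dir output/
```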
When chunking a book, the DocBook XSL stylesheets normally put the table of contents (TOC) in the same chunk as the book's title page. The stylesheets provide options for generating separate chunks for the table of contents, and for any lists of titles such as List of Tables.
If you set the stylesheet parameter chunk.tocs.and.lots to 1, then the stylesheet will generate a separate chunk that contains the table of contents and all the lists of titles. The title page chunk will then contain a link to the new chunk. If you also set the parameter chunk.separate.lots to 1, then each of the lists of titles will get a separate chunk as well. If you set only chunk.separate.lots to 1, then your table of contents will appear in the title page chunk, and only the lists of titles will get separate chunks. The chunk.separate.lots parameter was added in version 1.66.1 of the stylesheets.
The chunk.toc parameter does not generate a separate table of contents chunk. Rather, it is used to manually designate chunking boundaries. See the section “Manually control chunking” for more information.
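As a sketch, both TOC-related parameters can be set on the command line at once. The function name is hypothetical and the stylesheet path is assumed from the command at the start of this section.

```shell
# Assumed stylesheet location; it varies between installations.
CHUNK_XSL=/usr/share/docbook-xsl/html/chunk.xsl

# chunk_separate_toc FILE  (hypothetical helper)
# Puts the TOC in its own chunk and each list of titles in a separate chunk.
chunk_separate_toc() {
    xsltproc \
        --stringparam chunk.tocs.and.lots 1 \
        --stringparam chunk.separate.lots 1 \
        "$CHUNK_XSL" "$1"
}

# Usage: chunk_separate_toc myfile.xml
```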
Set the parameters chunk.section.depth and chunk.first.sections.
Chunk based on a manually edited table of contents file.
If you only want to control what section levels get put into separate HTML files, then you should set the
chunk.section.depth parameter. By default it is set to 1. So if you want
sect2 elements to be chunked into
individual files, set the parameter to 2.
The chunk stylesheet by default includes the first
sect1 of a chapter (or article) with the content that precedes it in the chapter. If you want those also to be chunked to separate files, then set the
chunk.first.sections parameter to 1.
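A minimal sketch that sets both parameters together, with the section depth passed as an argument. The function name is hypothetical and the stylesheet path is assumed from the command at the start of this section.

```shell
# Assumed stylesheet location; it varies between installations.
CHUNK_XSL=/usr/share/docbook-xsl/html/chunk.xsl

# chunk_sections DEPTH FILE  (hypothetical helper)
# Chunks sections down to DEPTH and gives first sections their own files.
chunk_sections() {
    xsltproc \
        --stringparam chunk.section.depth "$1" \
        --stringparam chunk.first.sections 1 \
        "$CHUNK_XSL" "$2"
}

# Usage: chunk_sections 2 myfile.xml
```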
If the standard chunking process does not meet your needs, and you are willing to manually intervene, then you can completely control how content gets chunked. This might be useful if some sections are very short and you would rather keep them together. But since it requires hand editing of a generated table of contents file, it is only useful if done infrequently or with documents that have stable structure.
Here are the steps for manually chunking HTML output:
Make sure all the elements you want to become chunks have an
id attribute on them.
Process your document with the maketoc.xsl stylesheet to generate a TOC file:
xsltproc -o mytoc.xml \
    --stringparam chunk.section.depth 8 \
    --stringparam chunk.first.sections 1 \
    html/maketoc.xsl myfile.xml
The two parameters ensure that all sections are included in the generated TOC file.
Edit the generated
mytoc.xml file to remove any
tocentry elements that you do not want chunked, or add entries that you do want chunked.
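Process your document with the chunktoc.xsl stylesheet, passing the edited TOC file in the chunk.toc parameter:
xsltproc --output output/ \
    --stringparam chunk.toc mytoc.xml \
    html/chunktoc.xsl myfile.xml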
This will chunk your document based on the entries in the generated TOC file. You can still use any of the chunking parameters to modify the chunking behavior.
When you use this process, you must have an id attribute on every element that you want to start a new chunk. This includes the document element, which generates the title page and table of contents. You can see which elements do not have an id by examining the generated TOC file and looking for empty id attributes in the tocentry elements. Any such entries will be merged with their parent elements during chunking.
If you want to control what elements produce chunks, beyond just the section level choice, then you must modify the templates that do chunk processing. See the section “Chunking customization” for more information.
You may need to change the output encoding for your chunked HTML files. The chunker.output.encoding parameter lets you change the HTML character encoding from its default value of ISO-8859-1. For example, if you want your HTML files to use UTF-8 encoding instead, you could process your document with the following command:
xsltproc --output output/ \
    --stringparam chunker.output.encoding UTF-8 \
    html/chunk.xsl myfile.xml
This will produce the following line in each chunked HTML file:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
It will also encode the HTML content itself using UTF-8 encoding. When a browser opens the file, the meta tag informs it that the file is encoded in UTF-8, so it will decode and display the text as UTF-8. This feature is only available with Saxon and XSL processors that support EXSLT extensions (such as xsltproc). It does not work in Xalan, however.
By default, chunked HTML output from Saxon will not contain any non-ASCII characters, regardless of the encoding you specify. Any non-ASCII characters will be represented as named entities or numerical character references. This behavior is controlled by the saxon.character.representation stylesheet parameter. See the section “Saxon output character representation” for more information.
The default output encoding for XHTML is UTF-8, as described in the section “XHTML”.
There are two stylesheet parameters for the chunking stylesheet that affect the DOCTYPE:
See the section “Generating XHTML” for an example of using these parameters. Note that they do not work with the Xalan processor because it uses a different way of writing chunk files.
Unfortunately, there is no way to add an internal subset to the output DTD using XSLT. If you do not know what an internal DTD subset is, then you probably do not need it. See a good XML reference for more information.
If you use a text editor to open an HTML file produced by DocBook XSL, you will notice that by default it produces long text lines that contain many elements. If you would prefer your HTML elements to start on a new line and have nested indents to show the HTML element structure, you can do that by setting the
chunker.output.indent parameter to
yes. Note that this feature is only available with XSL
processors that support EXSLT extensions, but that
includes most of the major ones. Xalan does not support this indenting option.
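A sketch of turning on indentation for chunked output from the command line. The function name is hypothetical and the stylesheet path is assumed from the command at the start of this section.

```shell
# Assumed stylesheet location; it varies between installations.
CHUNK_XSL=/usr/share/docbook-xsl/html/chunk.xsl

# chunk_indented FILE  (hypothetical helper)
# Produces chunked HTML with elements on new lines and nested indents.
chunk_indented() {
    xsltproc --stringparam chunker.output.indent yes "$CHUNK_XSL" "$1"
}

# Usage: chunk_indented myfile.xml
```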
There are limits to which HTML elements can start an indented line. In general, any element that permits
#PCDATA (plain text) as part of its content model will not allow the extra line breaks inside it. That is because white space must be respected inside such elements, and that respect includes not adding extra white space.
To add indentation with the non-chunking
docbook.xsl stylesheet, you need to use a customization layer with an
xsl:output element similar to the example in the section “Output encoding”. Use the
indent="yes" attribute value to turn on indentation. The other approach for single-file output is to use the
onechunk.xsl stylesheet and its extra parameters, as described in the section “Single file options with onechunk”.
DocBook XSL: The Complete Guide - 4th Edition
Copyright © 2002-2007 Sagehill Enterprises