Jabber.org's Quick Guide to DocBook XML

Eliot Landrum

Jabber.org
<eliot@landrum.cx>
Revision History
18 May 2001
  • Updated entire document for XML use instead of SGML.

  • Converted source document to XML.

  • Added information for Windows and Macintosh authors.

14 Dec 2000
First version.

Table of Contents

Introduction
Fundamental Elements
Document Type
Metadata
The Content
Linking
Lists
Extras
Processing
Web Browser Transformation
XSLT Processor
References
DocBook XML Source for this Document

Introduction

While Jabber.org welcomes documents in any format (the content is what matters!), we prefer documents to be formatted using DocBook XML. DocBook allows conversion to many formats (PDF, HTML, RTF, ASCII text, etc.) and frees the writer from trying to maintain a consistent formatting style. DocBook XML is easy to learn and can be written with any plain text editor (some WYSIWYG editors exist, but I've found it easier just to use vim!). This document steps through the basics of starting a DocBook document from scratch and processing it to other formats.

As can be seen, this guide is very brief. For more information about DocBook, read the online version of DocBook: The Definitive Guide (a.k.a. TDG) authored by Norman Walsh and published by O'Reilly & Associates.

Windows users, I highly recommend getting a better editor than Notepad. In Notepad and WordPad, it is extremely difficult to find line numbers, which are very useful when processing documents (processors typically report errors by line number). NoteTab Light is a nice freeware editor that suits the needs of any DocBook author well.

If you're familiar with HTML, you're already well on your way to understanding DocBook XML. DocBook is a markup language much like HTML, just with different elements (tags). I highly recommend reviewing the "Making an XML Document" section in Norman Walsh's book if you are not familiar with XML documents.

Note

One critical difference between HTML (not XHTML) and XML is that "empty" elements must be closed internally (e.g. <xref/>).
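
As a small illustration of that rule, the fragment below sketches the two well-formed ways to write an empty element, followed by the HTML-style form that XML rejects. The ID value section1 is only a placeholder for illustration.

```xml
<!-- Well-formed: the empty element is closed internally -->
<para>See <xref linkend="section1"/> for details.</para>

<!-- Also well-formed: explicit start and end tags with nothing between -->
<para>See <xref linkend="section1"></xref> for details.</para>

<!-- NOT well-formed XML (HTML-style, never closed):
<para>See <xref linkend="section1"> for details.</para>
-->
```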

Ready to jump into some of the fundamental elements? Load up your favourite text editor and join the fun!

Fundamental Elements

Document Type

Before any DocBook XML can be started, the document type must be stated:

<?xml version='1.0' encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

The first line is standard for all XML documents; it specifies what version of XML the document is and what type of encoding the text is written in. The second and third lines specify that the document type is going to be an article and that we're using DocBook version 4.1.2. At the time of this writing, this is the most current version available.

The entire document must be wrapped within one main element. DocBook has several different main elements, depending on the size of the document. The most commonly used document type for Jabber documents is article. After the main element is stated, the actual DocBook elements can be started.

Metadata

First, some metadata must be specified. This usually includes the document title, author(s), contributors, and publication date. All of this information is enclosed within the articleinfo element.

Note

artheader was used in versions of DocBook prior to 4.0 instead of articleinfo.

The metadata for this document:

<articleinfo>
    <title>Jabber.org's Quick Guide to DocBook XML</title>

    <author>
        <firstname>Eliot</firstname><surname>Landrum</surname>
        <affiliation>
            <orgname>Jabber.org</orgname>
            <address>
                <email>eliot@landrum.cx</email>
            </address>
        </affiliation>
    </author>

    <pubdate>2001/12/14</pubdate>
</articleinfo>

				

The Content

Now the good stuff can begin! All text must be enclosed within sections. Each section is marked with, originally enough, the section element. Every section must have a title, marked with the title element. Sections can also have subsections, simply by enclosing sections inside other sections. All text within a section must be enclosed in para elements. Here's a simple section with an enclosed subsection:

<section>
	<title>Section 1</title>

	<para>This is section 1 with a nice little paragraph.</para>

	<section>
		<title>Section 1.1</title>

		<para>Here we have a subsection and the text of it.</para>
	</section>
</section>
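
Putting the pieces so far together, a complete minimal article might look like the sketch below. The title, author name, and date are placeholders rather than values from a real document, and the system identifier assumes the DocBook XML 4.1.2 DTD at its usual OASIS location.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<article>
    <!-- Metadata comes first -->
    <articleinfo>
        <title>My First Article</title>
        <author>
            <firstname>Jane</firstname><surname>Doe</surname>
        </author>
        <pubdate>2001-05-18</pubdate>
    </articleinfo>

    <!-- Then the content, organized into sections -->
    <section>
        <title>Getting Started</title>
        <para>Every paragraph of text lives inside a para element.</para>
    </section>
</article>
```

Everything between the opening and closing article tags is the document proper; the two lines above it only tell the parser what kind of document to expect.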
				

Linking

You can link to various parts of the document by using the link element. First, you must specify an ID for the element which you would like to link to. Nearly all the elements in DocBook may have the ID attribute. Each ID must be unique. The following is an example of using the ID attribute and the link element:

<section id="section1">
	<title>Section 1</title>

	<para>This is section 1 with a nice little paragraph.</para>
</section>

<section id="section2">
	<title>Section 2</title>

	<para>This is section 2. <link linkend="section1">Section 1</link>
              provides more in depth information.</para>
</section>
				

To link to a URL, use the ulink element. Simply provide the URL in the url attribute:

<ulink url="http://www.jabber.org">Jabber.org</ulink>
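
For internal cross-references there is also the xref element (the empty element mentioned in the Introduction). The fragment below is a sketch, assuming a section with the ID section1 exists as in the earlier example; section3 is a made-up ID. With xref, the stylesheets generate the link text rather than you typing it.

```xml
<section id="section3">
    <title>Section 3</title>

    <!-- xref is empty: the stylesheets generate the link text
         from the target (typically its title) -->
    <para>More background can be found in <xref linkend="section1"/>.</para>

    <!-- ulink may also be left empty; the URL itself is then
         typically used as the link text -->
    <para>Visit <ulink url="http://www.jabber.org"/> for more.</para>
</section>
```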

Lists

A bulleted list:

<itemizedlist>

     <listitem>
          <para>Item 1</para>
     </listitem>

     <listitem>
          <para>Item 2</para>
     </listitem>

</itemizedlist>
				

A variable list (useful for any definition lists):

	    
<variablelist>
    <title>List of Variables</title>

    <varlistentry>
        <term>Jabber</term>
        <term>ICQ</term>
        <term>AIM</term>
        <listitem>
            <para>Instant messaging systems.</para>
        </listitem>
    </varlistentry>

    <varlistentry>
        <term>DocBook XML</term>
        <listitem>
            <para>A markup language for creating structured documents.</para>
        </listitem>
    </varlistentry>
</variablelist>
				

Extras

In DocBook there is no "bold" or "italic" formatting; instead, words are marked up according to what they mean. For instance, a command would be wrapped in the command element. Commonly used elements of this kind are: wordasword, emphasis, varname, literal, filename and replaceable.
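
A short sketch of how several of these elements might be combined in ordinary paragraphs follows. The sentences themselves are invented for illustration; only the element names come from the list above.

```xml
<para>Run the <command>xsltproc</command> command on
<filename>infile.xml</filename>, replacing
<replaceable>outfile</replaceable> with the name you want. The
<literal>&gt;</literal> character redirects the output, so be
<emphasis>careful</emphasis> not to overwrite an existing file.</para>

<para>The word <wordasword>element</wordasword> is often used
interchangeably with <wordasword>tag</wordasword>.</para>
```

How each element is rendered (bold, italic, monospace) is left to the stylesheet, which is what keeps the formatting consistent across documents.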

Processing

To output to other formats, the XML must be processed. No matter the platform, the processing step is very similar. There are two basic ways that this can be done; both use the XSL stylesheets provided by Norman Walsh. I suggest downloading the current ZIP package, uncompressing it somewhere convenient, and following along as we turn XML into HTML.

Web Browser Transformation

The simplest way is to add this processing instruction at the top of the document, right after the DOCTYPE has been declared:

<?xml-stylesheet type="text/xsl" href="docbook/html/docbook.xsl"?>
				

In the example, docbook/html/docbook.xsl is a path relative to where the document is located on your hard drive. (docbook/html/docbook.xsl is in the downloadable package from Norman's web site.) After adding this processing instruction, load the XML in an XSL-capable browser and the browser will do the transformation. Unfortunately, few web browsers are capable of this task. Supposedly, versions of Microsoft Internet Explorer greater than 5 can do it, but I was unable to get it to render a DocBook XML document.
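
To make the placement concrete, here is a sketch of a small document with the processing instruction in place. The title and paragraph text are placeholders, and the DTD location assumes DocBook XML 4.1.2.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<!-- The stylesheet processing instruction sits between the DOCTYPE
     and the main element; href is relative to this file -->
<?xml-stylesheet type="text/xsl" href="docbook/html/docbook.xsl"?>
<article>
    <title>Browser Test</title>
    <para>If the browser applies the stylesheet, this renders as HTML.</para>
</article>
```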

XSLT Processor

Using a browser to view the XML is excellent for quick proof-reading, but to output the data to formats available to a wider audience (HTML, RTF, PDF, etc.), an XSLT processor must be used. Many exist for a variety of platforms and languages.

Microsoft Windows

I highly recommend SAXON, written in Java by Michael Kay, for processing XML files in Windows. I was able to quickly get Instant SAXON 5.5.1 running on my Windows ME system with very little fuss. Just download it, uncompress it and run saxon -o output.html input.xml docbook/html/docbook.xsl at the command prompt. If you include the stylesheet processing instruction shown above in the section called “Web Browser Transformation”, you can tell saxon to use that information with the -a flag.

SAXON's author recommends installing Sun's Java engine instead of Microsoft's for speed reasons. After using SAXON a few times, I have to agree.

Apple Macintosh

As it turns out, SAXON works really well on Macintosh too. I found a very helpful (albeit a bit old) email from Bruce Rosenstock on getting SAXON running on a Macintosh. Since the command line is somewhat frozen in the application that JBindery creates, I suggest using the -a flag as described before.

GNU/Linux

I saved the best for last! There are lots of great XSLT engines for GNU/Linux as a quick search on Freshmeat.net will testify. I installed xsltproc from Debian because it was the first match on my search for XSLT. It installed quickly and is quite fast. For xsltproc, the syntax is simple (and quite similar to any other XSLT processor): xsltproc docbook/html/docbook.xsl infile.xml > outfile.html

Cleaning the HTML

I often use tidy from the W3C to clean the HTML. More precisely, tidy with the -im flags seems to work best.

Outputting a PDF or PS document is nearly as easy as outputting HTML, though you may have to go through and make sure that your examples aren't destroyed in the process. You'll need FOP from the Apache XML Project first. Instead of using the docbook/html/docbook.xsl stylesheet, you need to use docbook/fo/docbook.xsl. This converts the XML to a formatting objects (FO) file, which can then be converted to formats such as PDF. See Norman Walsh's instructions on this matter.

References

I have found the following sites invaluable in my DocBook XML work (many of these have been referenced in the above text):