High Level XML Tutorial

Overview

A High Level XML document that describes a set of PDF pages consists of two sections. The first section contains all of the document resources (fonts, images, colorspaces and bookmarks). The second section contains all of the pages and their corresponding page descriptions. The first section is technically optional, but without it the PDF documents being created will be severely limited. Every document must specify at least one page.

Basic Document

PDF Tag

The base tag for the document is pdf. The pdf tag may contain the following optional attributes:

Example:
<pdf author="Steve Dunn" creationDate="D:20000215081632" modDate="D:20000601120000" creator="PDFC" title="The migration habits of wombats" subject="wombats" keywords="wombats, wandering" mode="show_bookmarks">
</pdf>
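
The creationDate and modDate values in the example follow the PDF date convention: a D: prefix followed by the year, month, day, hour, minute and second. As a minimal sketch, such a string could be produced in Python as shown below. The helper name pdf_date is hypothetical and is not part of PDFC.

```python
from datetime import datetime


def pdf_date(moment: datetime) -> str:
    """Return a date string in the D:YYYYMMDDHHMMSS form used above."""
    return moment.strftime("D:%Y%m%d%H%M%S")


# The creationDate from the example: 15 February 2000, 08:16:32.
print(pdf_date(datetime(2000, 2, 15, 8, 16, 32)))  # D:20000215081632
```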

Page Tag

The next tag is the page tag. Technically the pdf and page tags are all that are needed to produce a valid, if not very exciting, PDF document. Page tags have only one possible attribute, an id. The page id attribute is used to reference the page for the purpose of linking to it.

First Working Example:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
</page>

</pdf>
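
If the XML is generated by a program rather than written by hand, the same minimal document can be assembled with Python's standard xml.etree.ElementTree module. This is only a sketch: the function name minimal_document is hypothetical, and writing explicit closing tags instead of the short empty form is an assumption made to match the examples in this tutorial.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def minimal_document(author: str, title: str, page_ids: Sequence[str]) -> str:
    """Build a pdf element holding one empty page element per id."""
    if not page_ids:
        raise ValueError("every document must specify at least one page")
    root = ET.Element("pdf", {"author": author, "title": title})
    for page_id in page_ids:
        ET.SubElement(root, "page", {"id": page_id})
    # Write <page ...></page> rather than <page ... />, matching the tutorial.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


print(minimal_document("Steve Dunn", "The migration habits of wombats", ["1"]))
```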

Before we start putting content in the page, we need to define some resources. The three supported resources are fonts, images and colorspaces. All three of these resource types can be contained either within the document or within a particular page. If the resource is located within a page, then it is only visible to that page; otherwise every page within the document can access the resource. The first resource we will look at is the font. Fonts are used to specify what a given piece of text will look like. Any time text is used within a page, it must have a font associated with it.

Font Tag

There are several different font formats that can be used in PDF files. Currently PDFC supports Type1 and TrueType fonts. Font data can either be contained within the XML document, Base 85 encoded to represent the binary data as text, or be read from a referenced file. The PDF specification sets aside 14 special Type1 fonts. These fonts are expected to be present on every system and, as a result, the font data and metrics can be left out. The 14 base fonts are:

Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
Symbol, ZapfDingbats

In addition to the font data, the PDF specification requires that the font metrics be provided. This basically means the width of each character in the font plus a couple of other values. The PDFC program and library have the capability to retrieve this information from TrueType fonts automatically. For Type1 fonts outside of the Base 14, this information must still be provided. We will leave the discussion of how to do that until later.

Whenever a font is included within the document, it must contain an id attribute. The id attribute allows the page content commands to reference the font. If the font is included within the page tags, then the content commands reference it by that id directly. If the font is included within the pdf tags, but outside of any page tags, then the page must provide an intermediate reference object to map a name to the id. This is done so that different pages can access the same font using different names.

Font tags can have the following attributes:

We will start with an example using one of the Base 14 Type1 fonts within a page:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<font id="2" type="Type1" font_name="Helvetica"></font>
</page>

</pdf>

For TrueType fonts, PDFC can extract the font name and metrics; all you need to supply is the id, the type (TrueType) and the data. In this case we will supply the data in an external file:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<font id="2" type="TrueType" filename="arial.ttf"></font>
</page>

</pdf>

Page Contents

Now that we know how to add a font to the document, we can actually display some text. This is done through the contents tags within page objects. The contents tag is a container for PDF commands. Commands specify how to draw the page. By inserting commands into a page's contents tags, you create the layout of a page.

To draw a basic chunk of text, you must first specify the font and size that the text will be drawn in and then you specify what the text is and where on the page it should be drawn.

To do this you can use a drawstring command, giving it a font id, a size, a bounding box and the text itself. The font id and size must be wrapped in tags indicating the type of each value, and the bounding box is given through a box tag. The box tag has attributes denoting the bottom left X and Y coordinates as well as the width and height of the box.

A basic example of drawing a string on the page is shown below:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<font id="2" type="Type1" font_name="Helvetica"></font>
<!-- Content -->
<contents>
<drawstring>
<!-- Set Font Id -->
<name>2</name>
<!-- Set Font Size -->
<int>30</int>
<!-- Set Bounding Box -->
<box x="200" y="800" width="500" height="400"></box>
<!-- Text to Display -->
Hello World
</drawstring>
</contents>
</page>

</pdf>

This should generate a page with the words Hello World on it.
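
The drawstring command mixes child tags (name, int, box) with plain text, which is easy to get wrong when generating XML from code. The sketch below shows one way to build the same structure with Python's xml.etree.ElementTree. The function name drawstring and its parameters are hypothetical, and the sketch assumes PDFC does not care about whitespace around the text.

```python
import xml.etree.ElementTree as ET


def drawstring(font_id: str, size: int, box: dict, text: str) -> ET.Element:
    """Build a drawstring element: font id, size, bounding box, then the text."""
    command = ET.Element("drawstring")
    ET.SubElement(command, "name").text = font_id
    ET.SubElement(command, "int").text = str(size)
    attributes = {key: str(value) for key, value in box.items()}
    box_element = ET.SubElement(command, "box", attributes)
    # The text follows the box tag, so ElementTree stores it as the box's tail.
    box_element.tail = text
    return command


element = drawstring(
    "2", 30, {"x": 200, "y": 800, "width": 500, "height": 400}, "Hello World"
)
print(ET.tostring(element, encoding="unicode", short_empty_elements=False))
```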

Images

Images are another type of resource that can be added to a PDF document. Like font data, image data can either be contained within the XML document, Base 85 encoded to represent the binary data as text, or be read from a referenced file. All image resources must contain an id attribute. The id attribute allows the page content commands to reference the image. If the image is included within the page tags, then the content commands reference it by that id directly. If the image is included within the pdf tags, but outside of any page tags, then the page must provide an intermediate reference object to map a name to the id. This is done so that different pages can access the same image using different names.

Image tags can have the following attributes:

To place an image on a page you must specify a size and position in the form of a box tag, just like we used in the drawstring. In addition to the box, you must specify the id of the image that should be used. This is the name used in the image tag as the id attribute.

We will start with a basic example:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<image id="Image1" filename="crab_vlt.jpg"></image>
<!-- Content -->
<contents>
<!-- Draw Image -->
<drawimage>
<box x="300" y="350" width="200" height="250" >
</box>
Image1
</drawimage>
</contents>
</page>

</pdf>

This should generate a blank page with an image in the middle.

To combine everything we have seen so far, you could create the following XML document:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resources -->
<font id="2" type="Type1" font_name="Helvetica"></font>
<image id="Image1" filename="crab_vlt.jpg"></image>

<!-- Content -->
<contents>
<!-- Draw String -->
<drawstring>
<name>2</name>
<int>30</int>
<box x="200" y="800" width="400" height="200"></box>
Hello World
</drawstring>
<!-- Draw Image -->
<drawimage>
<box x="300" y="350" width="200" height="250" > </box>
Image1
</drawimage>
</contents>
</page>

</pdf>

This should generate a page with an image and the words Hello World on it.

Resource References

To use resources such as fonts and images across multiple pages, the resource must be declared outside of the page, and each page that uses that resource must contain a resource reference tag. The resource declaration must specify a global identifier and each local resource reference must specify a page identifier to be used for that page.

A very simple example of document level resources is as follows:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Global Resource -->
<font id="2" type="Type1" font_name="Helvetica"></font>

<!-- Page -->
<page id="1">
</page>

</pdf>

To reference the global resource within a particular page, you add a resource reference as follows:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Global Resource -->
<font id="font1" type="Type1" font_name="Helvetica"></font>

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<page_resource id="font1" name="page1_font1"></page_resource>
</page>

</pdf>

For multiple pages it would look like:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Global Resource -->
<font id="font1" type="Type1" font_name="Helvetica"></font>

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<page_resource id="font1" name="page1_font1"></page_resource>
</page>

<!-- Page -->
<page id="2">
<!-- Page Resource -->
<page_resource id="font1" name="page2_font1"></page_resource>
</page>

</pdf>

Putting it all together, we get:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Global Resources -->
<font id="font1" type="Type1" font_name="Helvetica"></font>
<image id="image1" filename="crab_vlt.jpg"></image>

<!-- Page -->
<page id="1">
<!-- Page Resources -->
<page_resource id="font1" name="page1_font1"></page_resource>
<page_resource id="image1" name="page1_image1"></page_resource>
<!-- Content -->
<contents>
<!-- Draw String -->
<drawstring>
<name>page1_font1</name>
<int>30</int>
<box x="200" y="800" width="400" height="200"></box>
Hello World
</drawstring>
<drawimage>
<box x="300" y="350" width="200" height="250" >
</box>
page1_image1
</drawimage>
</contents>
</page>

<!-- Page -->
<page id="2">
<!-- Page Resources -->
<page_resource id="font1" name="page2_font1"></page_resource>
<!-- Content -->
<contents>
<!-- Draw String -->
<drawstring>
<name>page2_font1</name>
<int>30</int>
<box x="200" y="800" width="400" height="200"></box>
Other Text
</drawstring>
</contents>
</page>

</pdf>

This XML should produce a document with two pages. The first page should contain the text Hello World and an image. The second page should just contain the text Other Text.
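
Because every page_resource must point at a resource declared at the document level, a quick consistency check before running the conversion can save some debugging. The following is a rough sketch in Python using the standard xml.etree.ElementTree module. The function name unresolved_references is hypothetical, and the sketch only looks at font and image resources.

```python
import xml.etree.ElementTree as ET

# Assumption: only font and image resources are checked in this sketch.
RESOURCE_TAGS = ("font", "image")


def unresolved_references(xml_text: str) -> list:
    """Return (page id, resource id) pairs that name no document level resource."""
    root = ET.fromstring(xml_text)
    declared = {child.get("id") for child in root if child.tag in RESOURCE_TAGS}
    missing = []
    for page in root.findall("page"):
        for reference in page.findall("page_resource"):
            if reference.get("id") not in declared:
                missing.append((page.get("id"), reference.get("id")))
    return missing


SAMPLE = """
<pdf>
<font id="font1" type="Type1" font_name="Helvetica"></font>
<page id="1">
<page_resource id="font1" name="page1_font1"></page_resource>
</page>
<page id="2">
<page_resource id="image1" name="page2_image1"></page_resource>
</page>
</pdf>
"""

# Page 2 refers to image1, which is never declared at the document level.
print(unresolved_references(SAMPLE))
```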

Inline Data

Both fonts and images can pull data from external files or store the data inline in a Base 85 encoded format. All of the examples above pulled data from external files. For images, the encoded data is simply placed between the image tags, as illustrated below:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<image id="Image1" encoding="Base85">Gb-6$*Ys`HIl
0AoOB5ljR-n7_!3J(q?:%e`$5bhE&quot;19cG?:Q_l7eNCg*
... More Base 85 encoded data goes here ...
TPiLEKr!P3IRaCmddb\ni?EfKIis(7S1&gt;*r7h07h]F,kb9$
6smS]I^@,3-=d!P[I=!!GJA(^IQc0/=Ot</image>
<!-- Content -->
<contents>
<!-- Draw Image -->
<drawimage>
<box x="300" y="350" width="200" height="250" >
</box>
Image1
</drawimage>
</contents>
</page>

</pdf>
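
Base 85 output can contain characters such as <, >, & and quotation marks, which is why the inline data above uses entities like &quot; and &gt;. The sketch below shows how binary data might be prepared for inlining with Python's standard library. It assumes that PDFC's Base85 encoding matches the Ascii85 output of base64.a85encode, which should be verified against PDFC before relying on it. The helper names inline_payload and decode_payload are hypothetical.

```python
import base64
from xml.sax.saxutils import escape, unescape

# Extra characters to escape beyond the &, < and > handled by escape().
QUOTES = {'"': "&quot;", "'": "&apos;"}


def inline_payload(raw: bytes) -> str:
    """Ascii85-encode raw bytes and escape the result for use as XML text."""
    encoded = base64.a85encode(raw).decode("ascii")
    return escape(encoded, QUOTES)


def decode_payload(payload: str) -> bytes:
    """Reverse inline_payload: undo the XML escaping, then Ascii85-decode."""
    entities = {entity: char for char, entity in QUOTES.items()}
    return base64.a85decode(unescape(payload, entities))


payload = inline_payload(b"wombat image bytes")
assert decode_payload(payload) == b"wombat image bytes"
```

In practice the bytes would come from reading the image or font file in binary mode, and the returned string would be placed between the image tags or the font_data tags.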

For fonts, the data is encapsulated within font_data tags. This is done to allow other types of data to be associated with a font. We will look at those other types of data later. To inline the data for a font, you use the following format:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<font id="2" type="TrueType">
<font_data encoding="Base85">!!*&apos;!ApY!!&lt;3t:K
`5Ta:aN;TJbg&quot;GZd*^:jeCE.%f\,!5t
... More Base 85 encoded data goes here ...
giEi8FY=![7UE!([)T!!N?.!.FqJ!AOUA!E0#-!I+WL
Z$,Y!\OIb!YPhD!s\u2!sJ`-!WrE&amp;!!</font_data>
</font>
<!-- Content -->
<contents>
<!-- Draw String -->
<drawstring>
<name>2</name>
<int>30</int>
<box x="200" y="800" width="400" height="100" >
</box>
Hello World
</drawstring>
</contents>
</page>

</pdf>

Font Metrics

PDF documents require a set of standard metrics for every font they use. The Base 14 Type1 fonts listed above are standard to the PDF specification and their metrics are already known. The PDFC program and libraries have the capability to extract this information from TrueType fonts. When using a TrueType font with PDFC, all that must be supplied is an identifier, the font type (which is TrueType) and the TrueType data. The data can either be supplied as an external file or inlined in the document. During the conversion process, PDFC extracts all of the necessary metrics from the font data. For all Type1 fonts, except the Base 14, these metrics must be explicitly supplied. The easiest way to get this information is to convert an existing PDF file that uses the desired font from PDF to High Level XML and copy the necessary metrics. To include an arbitrary Type1 font, the following metrics must be supplied:

An example of the full metrics is shown below:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<!-- Page Resource -->
<font id="2" type="Type1" font_name="Poetica-ChanceryI" filename="poetica.afm">
<font_metrics first_char="32" last_char="255" ascent="719" descent="-299"
flags="34">
<font_widths encoding="MacRomanEncoding">
<width>150</width>
<width>190</width>
<width>310</width>
<width>403</width>
<width>329</width>
<width>582</width>
<width>604</width>
... More Widths ...
<width>342</width>
<width>266</width>
<width>323</width>
<width>289</width>
<width>359</width>
<width>266</width>
<width>340</width>
</font_widths>
</font_metrics>
<font_bounding_box llx="-191" lly="-264" urx="965" ury="750"></font_bounding_box>
</font>
<!-- Content -->
<contents>
<!-- Draw String -->
<drawstring>
<name>2</name>
<int>30</int>
<box x="200" y="800" width="400" height="100" >
</box>
Hello World
</drawstring>
</contents>
</page>

</pdf>
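
When widths are copied by hand it is easy to drop or duplicate an entry. In a PDF font, the widths array holds one entry per character code from the first character through the last, so the example above (first_char 32, last_char 255) would need 224 width entries. The Python sketch below builds a font_widths element and checks that count. It assumes PDFC expects the same one width per code layout, and the function name font_widths is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def font_widths(first_char: int, last_char: int, widths: Sequence[int],
                encoding: str = "MacRomanEncoding") -> ET.Element:
    """Build a font_widths element, checking one width per character code."""
    expected = last_char - first_char + 1
    if len(widths) != expected:
        raise ValueError(f"expected {expected} widths, got {len(widths)}")
    element = ET.Element("font_widths", {"encoding": encoding})
    for width in widths:
        ET.SubElement(element, "width").text = str(width)
    return element


# Placeholder widths only; real values come from the font's own metrics.
widths_element = font_widths(32, 255, [500] * 224)
print(len(widths_element))  # number of width children
```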

Media Boxes

Pages may specify a media box. A media box is used to set the natural size of the page. This allows the user to create letter size pages, legal size pages, envelopes or any other page size they wish. Since PDF documents are scaled to the viewer, this size specifies intentions and helps with layout more than anything else. By default, PDFC uses an 8.5 by 11 inch page size, but by specifying the page size explicitly, the user can create any page size they want. Page sizes are specified in points. The convention is that there are 72 points to an inch. Therefore an 8.5 by 11 inch page is 612 by 792 points. The arguments for a media box are:

An 11 by 15 inch page would be specified in the following way:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<media_box llx="0" lly="0" urx="792" ury="1080"></media_box>
<!-- Content -->
<contents>
</contents>
</page>

</pdf>
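
The conversion from inches to points is simple enough to do by hand, but a small helper avoids arithmetic slips. This Python sketch multiplies by 72 and returns the four media_box attribute values. The function name media_box_attributes is hypothetical, and rounding to whole points and placing the lower left corner at 0, 0 are assumptions.

```python
POINTS_PER_INCH = 72


def media_box_attributes(width_inches: float, height_inches: float) -> dict:
    """Return media_box attributes for a page with its origin at (0, 0)."""
    return {
        "llx": "0",
        "lly": "0",
        "urx": str(round(width_inches * POINTS_PER_INCH)),
        "ury": str(round(height_inches * POINTS_PER_INCH)),
    }


print(media_box_attributes(8.5, 11))  # urx 612, ury 792
print(media_box_attributes(11, 15))   # urx 792, ury 1080
```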

Links

Link tags allow the user to set up hyperlinks within the document. Hyperlinks are rectangular regions specified on a particular page that link to an arbitrary page within the document. The region is defined in points and the page to link to is specified by the page id. Link tags must be contained within a page tag. The following is an example that creates a block of text that links to another page.

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Global Resources -->
<font id="font1" type="Type1" font_name="Helvetica"></font>
<image id="image1" filename="crab_vlt.jpg"></image>

<!-- Page -->
<page id="1">
<link llx="200" lly="700" urx="800" ury="800" page="2"></link>
<!-- Page Resources -->
<page_resource id="font1" name="page1_font1"></page_resource>
<page_resource id="image1" name="page1_image1"></page_resource>
<!-- Content -->
<contents>
<drawstring>
<name>page1_font1</name>
<int>30</int>
<box x="200" y="800" width="400" height="200" >
</box>
Hello World </drawstring>
<drawimage>
<box x="300" y="350" width="200" height="250" >
</box>
page1_image1
</drawimage>
</contents>
</page>

<!-- Page -->
<page id="2">
<!-- Page Resources -->
<page_resource id="font1" name="page2_font1"></page_resource>
<!-- Content -->
<contents>
<drawstring>
<name>page2_font1</name>
<int>30</int>
<box x="200" y="800" width="400" height="200" >
</box>
Other Text </drawstring>
</contents>
</page>

</pdf>
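
Since a link names its target by page id, a link whose page attribute matches no page in the document is most likely a mistake. The Python sketch below lists such links. The function name dangling_links is hypothetical, and how PDFC itself reacts to a missing target is not covered here.

```python
import xml.etree.ElementTree as ET


def dangling_links(xml_text: str) -> list:
    """Return (source page id, target page id) pairs whose target is missing."""
    root = ET.fromstring(xml_text)
    page_ids = {page.get("id") for page in root.findall("page")}
    return [
        (page.get("id"), link.get("page"))
        for page in root.findall("page")
        for link in page.findall("link")
        if link.get("page") not in page_ids
    ]


LINKED = """
<pdf>
<page id="1">
<link llx="200" lly="700" urx="800" ury="800" page="2"></link>
<link llx="200" lly="500" urx="800" ury="600" page="7"></link>
</page>
<page id="2">
</page>
</pdf>
"""

# The second link points at page 7, which does not exist in this document.
print(dangling_links(LINKED))
```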

Bookmarks

Bookmarks are a way of providing a set of named links into the document. Basically, bookmarks are used to create a table of contents. Bookmarks are hierarchical in nature. Bookmarks specify a title, a page and an optional status attribute that specifies whether they are open or closed. Open bookmarks display their children by default, whereas closed bookmarks hide their children by default. A basic table of contents may look like:

Main Section 1
    SubSection 1.1
Main Section 2
Main Section 3
    SubSection 3.1
    SubSection 3.2

To produce this structure in a PDF document, the following XML could be used:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<bookmark title="Main Section 1" page="1" status="open">
<bookmark title="SubSection 1.1" page="1"></bookmark>
</bookmark>

<bookmark title="Main Section 2" page="2"></bookmark>

<bookmark title="Main Section 3" page="3">
<bookmark title="SubSection 3.1" page="3"></bookmark>
<bookmark title="SubSection 3.2" page="3"></bookmark>
</bookmark>

<!-- Pages -->
<page id="1">
</page>

<page id="2">
</page>

<page id="3">
</page>

</pdf>
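
Because bookmarks nest to any depth, generating them from an outline is a natural fit for a short recursive function. The Python sketch below rebuilds the bookmark tree from the example. The tuple layout of OUTLINE and the function name add_bookmarks are hypothetical, and the optional status attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Each entry is (title, page id, children); this tuple layout is illustrative.
OUTLINE = [
    ("Main Section 1", "1", [("SubSection 1.1", "1", [])]),
    ("Main Section 2", "2", []),
    ("Main Section 3", "3", [
        ("SubSection 3.1", "3", []),
        ("SubSection 3.2", "3", []),
    ]),
]


def add_bookmarks(parent: ET.Element, entries: list) -> None:
    """Append one bookmark element per entry, recursing into its children."""
    for title, page, children in entries:
        node = ET.SubElement(parent, "bookmark", {"title": title, "page": page})
        add_bookmarks(node, children)


root = ET.Element("pdf")
add_bookmarks(root, OUTLINE)
print(ET.tostring(root, encoding="unicode", short_empty_elements=False))
```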

Text Annotations

Text annotations are sort of like sticky notes that sit on top of the document. All text annotations must be specified within a page. In order to create a text annotation, you must specify the size and location of the annotation, the text to be displayed and whether or not the annotation is open by default. A basic text annotation looks like this:

<pdf author="Steve Dunn" title="The migration habits of wombats">

<!-- Page -->
<page id="1">
<text_annotation llx="10" lly="11" urx="100" ury="101" open="true">This is a test annotation</text_annotation>
</page>

</pdf>

Colorspaces

Colorspaces are a resource type that specifies which colors should be used when displaying a page or image. The default colorspace is DeviceRGB. Colorspaces are fairly complicated and a full description of them is beyond the scope of this document. Further information can be found in the PDF specification. All of the PDF 1.2 colorspaces except the Separation and Pattern colorspaces are supported. This support was mostly added to allow modification of pre-existing PDF documents.