The PDFC program and function library convert documents back and forth between PDF and XML formats. Essentially, PDFC handles three document formats: PDF, a high level XML representation, and a low level XML representation. The basic function of PDFC is to convert from any one of these three formats to any other.
The high level XML representation is a conceptual representation of the PDF document. In the high level XML, the basic elements of the PDF document, such as strings, dictionaries, streams and arrays, are read and parsed to create abstractions such as pages, images, and fonts. In the low level XML representation, the basic elements are converted directly into XML tags, so the low level XML has a one-to-one mapping with the objects in the PDF representation. The low level XML representation is mostly useful as a debugging tool.
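The one-to-one mapping used by the low level XML can be illustrated with a minimal Python sketch. It is not PDFC's actual schema or code: the tag names (`string`, `number`, `array`, `dictionary`, `entry`, `stream`, `data`), the `Stream` class and the function `to_low_level_xml` are all assumptions made for illustration, and PDF objects are modelled as plain Python values.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Stream:
    """Hypothetical model of a PDF stream: a dictionary plus raw bytes."""
    attributes: dict
    data: bytes


def to_low_level_xml(obj):
    """Map one basic object to exactly one XML element (tag names are assumed)."""
    if isinstance(obj, str):
        el = ET.Element("string")
        el.text = obj
    elif isinstance(obj, (int, float)):
        el = ET.Element("number")
        el.text = str(obj)
    elif isinstance(obj, list):
        el = ET.Element("array")
        for item in obj:
            el.append(to_low_level_xml(item))
    elif isinstance(obj, dict):
        el = ET.Element("dictionary")
        for key, value in obj.items():
            entry = ET.SubElement(el, "entry", {"key": key})
            entry.append(to_low_level_xml(value))
    elif isinstance(obj, Stream):
        el = ET.Element("stream")
        el.append(to_low_level_xml(obj.attributes))
        data = ET.SubElement(el, "data")
        data.text = obj.data.hex()  # raw bytes stored as hex text
    else:
        raise TypeError(f"unsupported object: {obj!r}")
    return el


# Each nested object becomes one nested element; nothing is interpreted.
page_dict = {"Type": "Page", "MediaBox": [0, 0, 612, 792]}
xml_text = ET.tostring(to_low_level_xml(page_dict), encoding="unicode")
```

The sketch deliberately attaches no meaning to the dictionary: recognising it as a page would be the job of the high level representation.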
Because PDFC is in one of its initial revisions and the PDF specification is fairly expansive, the PDFC program and function library do not support all of the functionality that the PDF specification describes. As a result, the high level XML representation may not completely represent all of the elements of some of the more complex PDF documents.
Currently PDFC supports the following structures:
PDFC represents documents internally in one of two ways: a high level representation or a low level representation. To read from or write to a PDF or low level XML format, PDFC must have the document in the low level internal representation. To read from or write to a high level XML format, PDFC must have the document in the high level internal representation. When needed, PDFC can convert the document from its low level representation to its high level representation or the other way around. Each XML format closely mirrors PDFC's corresponding internal representation.
[Figure: conversion paths. Input: Low Level XML → Low Level Internal Representation → Output: Low Level XML. Input: High Level XML → High Level Internal Representation → Output: High Level XML. The two internal representations are linked to each other.]
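The routing between formats and internal representations described above can be sketched as a small lookup. This is an illustration only: the format keys (`pdf`, `low-xml`, `high-xml`) and the function `conversion_steps` are hypothetical names, not part of PDFC.

```python
LOW = "low level internal representation"
HIGH = "high level internal representation"

# Which internal representation each external format is read into or written from.
INTERNAL_FOR = {"pdf": LOW, "low-xml": LOW, "high-xml": HIGH}


def conversion_steps(source, target):
    """List the steps needed to convert between two external formats."""
    steps = [f"read {source} into {INTERNAL_FOR[source]}"]
    if INTERNAL_FOR[source] != INTERNAL_FOR[target]:
        # Only cross between internal representations when the formats require it.
        steps.append(f"convert {INTERNAL_FOR[source]} to {INTERNAL_FOR[target]}")
    steps.append(f"write {target} from {INTERNAL_FOR[target]}")
    return steps


pdf_to_high = conversion_steps("pdf", "high-xml")
pdf_to_low = conversion_steps("pdf", "low-xml")
```

Converting PDF to low level XML needs only a read and a write, while any conversion involving high level XML on one side only needs the extra internal conversion step.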
The basic objects used in the PDF specification and the PDFC internal representation are:
Objects are divided into two categories: direct objects and indirect objects. All of the above objects are direct objects. Any direct object may have a label associated with it, at which point it becomes an indirect object. In addition to direct and indirect objects, there are object references. A reference is essentially a pointer to an indirect object.
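The relationship between direct objects, indirect objects and references can be sketched with a few Python types. The class and function names here (`Reference`, `IndirectObject`, `resolve`) are assumptions for illustration and are not taken from PDFC. In PDF, the label consists of an object number and a generation number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Pointer to an indirect object, identified by that object's label."""
    number: int
    generation: int = 0


@dataclass
class IndirectObject:
    """A direct object (value) together with its label."""
    number: int
    generation: int
    value: object


def resolve(obj, table):
    """Follow references until a direct object is reached."""
    while isinstance(obj, Reference):
        obj = table[(obj.number, obj.generation)].value
    return obj


# A dictionary (direct object) becomes indirect once it is given a label.
font = IndirectObject(4, 0, {"Type": "Font", "BaseFont": "Helvetica"})
table = {(font.number, font.generation): font}

# Another object points at it through a reference instead of embedding it.
page = {"Type": "Page", "Font": Reference(4)}
resolved = resolve(page["Font"], table)
```

Passing a value that is not a reference to `resolve` returns it unchanged, so callers do not need to know whether an entry was stored directly or by reference.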
A PDF document consists of a header, body, xref table and trailer. Sometimes the body, xref table and trailer sections are repeated, but these repeated sections can be condensed into one set. The header simply identifies the document as a PDF document. The xref, or cross reference, table is a list of byte offsets indicating the location of each indirect object within the document. The trailer contains meta information about the document, such as the main indirect object, the creation date and the byte offset of the start of the xref table. Everything else in the document is part of the body. All of the elements contained in the body are direct objects, indirect objects or object references.
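The four-part layout described above can be illustrated with a minimal Python sketch that writes a header, a body of indirect objects, an xref table of byte offsets and a trailer. The function `build_pdf` is a hypothetical helper, not part of PDFC, and the two objects it is given are only enough to show the structure, not to produce a useful document.

```python
def build_pdf(objects):
    """Serialise pre-encoded object bodies as header, body, xref table, trailer."""
    out = bytearray(b"%PDF-1.4\n")                     # header
    offsets = []
    for number, body in enumerate(objects, start=1):   # body: indirect objects
        offsets.append(len(out))                       # byte offset of this object
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)        # xref table
    out += b"0000000000 65535 f \n"                    # entry 0 is the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset            # fixed-width 20-byte entries
    # The trailer names the main (root) object and is followed by the xref offset.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets


objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
]
pdf, offsets = build_pdf(objects)
```

Because each xref entry is a byte offset, seeking to `offsets[n - 1]` in the output lands exactly on the text `n 0 obj`, which is how a reader finds an indirect object without scanning the whole body.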
The low level XML format and PDFC's low level internal document representation follow essentially the same structure as described above.
The basic set of objects used in the high level XML and the high level internal document representation is:
Within a page object, the content for that page is represented through a set of PsCommand objects and low level direct objects.