    This is the archived PXP page. The new homepage can be found here!

The XML parser for O'Caml: PXP

In October 1999 I started writing a validating XML parser for O'Caml; the first published versions were called "Markup" (simply because the package name was "markup"). After this parser had some success, I decided to revise the whole code and to redesign the parser where needed. The result of this work is PXP, the Polymorphic XML Parser. The name reflects an important property of the parser, namely that the type of the XML nodes can be customized, a feature that is missing in most other XML parsers.

Now, one year later, I can announce the first stable version of PXP. "Stable" mostly means that the interface of the parser has become stable, i.e. future changes will extend but not break the current interface. The parser worked relatively well from the very beginning, and during the pre-release phase (several months) users reported only a few bugs. I am now relatively sure that PXP is mature enough to be used in applications.

In general, the task of an XML parser is to read XML text and to represent it somehow in memory. There are several models for the data structures; for PXP I have chosen the luxurious representation as an object tree, in which every XML node is stored as two objects. One object contains the set of methods describing the fixed properties of every node; the other object is called the extension object and can be configured by the user of the parser.

The extension object is the polymorphic part of the representation. The type of its class may be arbitrary (except for three base methods which connect the object to the tree), and the parser has a mechanism to dynamically select the class of the object depending on the element type of the XML node.
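The two-object model can be illustrated with a small, self-contained sketch. It does not use PXP, and every name in it (renderer, markup_ext, emphasis_ext, node, make_node, render) is hypothetical and chosen only for this illustration. It also leaves out the base methods that link an extension back to its node, so it shows only the idea: a fixed node class parameterised over a user-defined extension type, plus a table that selects the extension class by element type.

```ocaml
(* Illustrative sketch only: none of these names are taken from PXP. *)

(* The user-chosen extension type. Here an extension knows how to render
   the element it is attached to, given the element name and its content. *)
class type renderer = object
  method render : string -> string -> string
end

(* Default extension: reproduce the element as markup. *)
class markup_ext = object
  method render name content = "<" ^ name ^ ">" ^ content ^ "</" ^ name ^ ">"
end

(* Special extension for emphasised text: ignore the name, add asterisks. *)
class emphasis_ext = object
  method render (_ : string) content = "*" ^ content ^ "*"
end

(* The fixed part of a node, parameterised over the extension type 'ext. *)
class ['ext] node (name : string) (ext : 'ext) = object
  val mutable data = ""
  val mutable children : 'ext node list = []
  method name = name
  method extension = ext
  method data = data
  method set_data s = data <- s
  method children = children
  method add_child (c : 'ext node) = children <- children @ [c]
end

(* Pick the extension constructor by element type, falling back to a
   default when the element type has no entry in the table. *)
let make_node spec default name =
  let make_ext = try Hashtbl.find spec name with Not_found -> default in
  new node name (make_ext ())

(* Walk the tree and let each node's extension decide how it is rendered. *)
let rec render (n : renderer node) =
  let inner = String.concat "" (List.map render n#children) in
  n#extension#render n#name (n#data ^ inner)

(* Build a two-node tree: a "p" element containing an "em" element. *)
let example () =
  let spec = Hashtbl.create 8 in
  Hashtbl.add spec "em" (fun () -> (new emphasis_ext :> renderer));
  let default () = (new markup_ext :> renderer) in
  let p = make_node spec default "p" in
  let em = make_node spec default "em" in
  p#set_data "Hello ";
  em#set_data "world";
  p#add_child em;
  render p
```

In this sketch the tree code never needs to know which extension classes exist; adding behaviour for a new element type means adding one table entry.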

The other features of the parser simply implement the XML standard:

  • The XML instance is validated against the DTD; any violation of a validation constraint leads to the rejection of the instance. The validator has been carefully implemented, and conforms strictly to the standard. If needed, it is also possible to run the parser in a well-formedness mode.

  • If possible, the validator applies a deterministic finite automaton to validate the content models. In that case validation runs in linear time. If a content model is not deterministic, the parser falls back to a backtracking algorithm, which can be much slower. It is also possible to reject non-deterministic content models.

  • The parser can read XML text encoded in a variety of character sets. Independent of this, it is possible to choose the encoding of the internal representation of the tree nodes; the parser automatically converts the input text to this encoding. Currently, the parser supports UTF-8 and ISO-8859-1 as internal encodings.

  • The interface of the parser has been designed to fit well into the O'Caml language. The first goal was simplicity of usage, which is achieved by many convenience methods and functions, and by allowing the user to select which parts of the XML text are actually represented in the tree. For example, it is possible to store processing instructions as tree nodes, but the parser can also be configured such that these instructions are put into hashtables. The information model is compatible with the requirements of XML-related standards such as XPath.

  • There is also an interface for DTDs; you can parse and access sequences of declarations.
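To make the linear-time claim for deterministic content models concrete, here is a minimal sketch of automaton-based validation. It is not PXP's implementation: the type dfa and the functions make_dfa and accepts are hypothetical, and the transition table for the example content model (title, para+, footer?) is written by hand rather than derived from a DTD.

```ocaml
(* Illustrative sketch only: not PXP's validator. *)

(* A deterministic automaton over element names. *)
type dfa = {
  start : int;                              (* initial state *)
  accepting : int list;                     (* states in which the content may end *)
  delta : (int * string, int) Hashtbl.t;    (* (state, element name) -> next state *)
}

let make_dfa start accepting transitions =
  let delta = Hashtbl.create 16 in
  List.iter
    (fun (s, name, s') -> Hashtbl.replace delta (s, name) s')
    transitions;
  { start; accepting; delta }

(* Each child element is looked at exactly once, so the running time is
   linear in the number of children. There is no backtracking because
   every (state, name) pair has at most one successor. *)
let accepts dfa children =
  let rec go state = function
    | [] -> List.mem state dfa.accepting
    | name :: rest ->
      (match Hashtbl.find_opt dfa.delta (state, name) with
       | Some next -> go next rest
       | None -> false)
  in
  go dfa.start children

(* Hand-built automaton for the content model (title, para+, footer?). *)
let section_model =
  make_dfa 0 [2; 3]
    [ (0, "title", 1);
      (1, "para", 2);
      (2, "para", 2);
      (2, "footer", 3) ]
```

With this table, accepts section_model ["title"; "para"; "para"] holds, while a child list that starts with "para" or ends after "title" is rejected. A non-deterministic model would need several possible successors for the same (state, name) pair, which is where backtracking comes in.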

Although the parser is now complete, development does not stop. I have plans to implement namespace support, and there are rumours that other people are programming an XPath extension. The xduce project already uses a pre-release version of PXP for their XML transformation language.

For more details, see the announcement.

Documentation

I have begun to write a manual, but it is still incomplete. It includes a short introduction to XML and many examples of how to use the parser.

There are also Release Notes.

How to get PXP

The distribution tarball of the parser contains the full source code. As usual, it is a Caml component, and requires that the Findlib library has been installed first. The parser also needs the Ocamlnet package. Note that PXP works only with O'Caml 3.00.

Sample applications

The distribution contains three working applications which demonstrate how XML can be used:

  • pxpvalidate: Reads an XML instance and prints all violations against the standard and against the DTD (if present).

  • readme: This is a simple DTD for text documents, intended for writing README and INSTALL files. I have written two converters, one transforming the XML instance to HTML, the other converting it to ASCII text files. This example is discussed in the manual.

  • xmlforms: This application allows you to edit a single record of key/value pairs which is stored as an XML instance. The editor itself is configurable by a style sheet, which is also given in XML. The editor has a GUI realised with the CamlTk widget set.

 
This web site is published by Informatikbüro Gerd Stolpmann