Polymorphic XML Parser is a validating XML parser for the programming language Objective Caml.
In October 1999, I started writing a validating XML parser for O'Caml; the first published versions were called "Markup" (simply because the package name was "markup"). After this parser had some success, I decided to revise the whole code and to redesign the parser where needed. The result of this work is PXP, the Polymorphic XML Parser. The name reflects an important property of the parser, namely that the type of the XML nodes can be customized, a feature which is missing in most other XML parsers.
Now, one year later, I can announce the first stable version of PXP. "Stable" mostly means that the interface of the parser has become stable, i.e. future changes will extend but not break the current interface. The parser worked relatively well from the very beginning, and during the pre-release phase (several months) users reported only a few bugs. I am now relatively sure that PXP is mature enough to be used in applications.
In general, the task of an XML parser is to read XML text and to represent the text somehow in memory. There are several models for the data structures; for PXP I have chosen a rather luxurious representation, an object tree in which every XML node is stored as two objects. One object contains the set of methods describing the fixed properties of every node; the other object is called the extension object and can be configured by the user of the parser.
The extension object is the polymorphic part of the representation. Its class may be arbitrary, except that it must provide three base methods which connect the object to the tree, and the parser has a mechanism to dynamically select the class of the object depending on the element type of the XML node.
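The two-object design can be illustrated with a minimal sketch in plain O'Caml. It does not use PXP's actual classes: the names `node`, `spec`, `make_node`, `render`, `plain` and `emphasis` are hypothetical, and the base methods that link an extension object back to its node are left out for brevity. The sketch only shows how a node class can be parameterized over the extension type, and how the extension can be selected by element type.

```ocaml
(* Hypothetical sketch; these are not PXP's real class definitions. *)

(* The fixed part of every node, parameterized over the extension type. *)
class ['ext] node (element_type : string) (ext : 'ext) =
  object
    val mutable children : 'ext node list = []
    method element_type = element_type
    method extension : 'ext = ext
    method children = children
    method append (child : 'ext node) = children <- children @ [child]
  end

(* Maps element types to functions creating the matching extension object;
   [default] is used for element types that are not listed. *)
type 'ext spec = {
  by_element : (string * (unit -> 'ext)) list;
  default : unit -> 'ext;
}

(* Create a node whose extension object depends on the element type. *)
let make_node spec element_type =
  let make_ext =
    try List.assoc element_type spec.by_element
    with Not_found -> spec.default
  in
  new node element_type (make_ext ())

(* A user-defined extension type: here, a trivial text renderer. *)
class type render = object
  method render : string -> string
end

let plain () : render =
  object method render (s : string) = s end

let emphasis () : render =
  object method render (s : string) = "*" ^ s ^ "*" end

(* "em" elements get the emphasis extension, everything else the plain one. *)
let spec = { by_element = [ ("em", emphasis) ]; default = plain }
let em = make_node spec "em"
let p = make_node spec "p"
let () = p#append em
```

Because the extension type is a type parameter of the node class, calls such as `em#extension#render "text"` are checked statically against the user's own class type.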
Here are some key features of "Polymorphic XML Parser":
- The XML instance is validated against the DTD; any violation of a validation constraint leads to the rejection of the instance. The validator has been carefully implemented, and conforms strictly to the standard. If needed, it is also possible to run the parser in a well-formedness mode.
- If possible, the validator applies a deterministic finite automaton to validate the content models. In this case, validation is performed in linear time. However, if a content model is not deterministic, the parser uses a backtracking algorithm which can be much slower. It is also possible to reject non-deterministic content models.
- The parser can read XML text encoded in a variety of character sets. Independent of this, it is possible to choose the encoding of the internal representation of the tree nodes; the parser automatically converts the input text to this encoding. Currently, the parser supports UTF-8 and ISO-8859-1 as internal encodings.
- The interface of the parser has been designed to integrate well with the O'Caml language. The first goal was simplicity of use, which is achieved by many convenience methods and functions, and by allowing the user to select which parts of the XML text are actually represented in the tree. For example, it is possible to store processing instructions as tree nodes, but the parser can also be configured such that these instructions are put into hashtables. The information model is compatible with the requirements of XML-related standards such as XPath.
- There is also an interface for DTDs; you can parse and access sequences of declarations.
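
To make the point about deterministic content models concrete, here is a small sketch of validating a list of child elements with a deterministic finite automaton. It is not taken from PXP's implementation; the record type `dfa`, the automaton `title_then_paras` (written by hand for the content model `(title, para*)`) and the function `validate` are assumptions made for illustration.

```ocaml
(* Hypothetical sketch; PXP's validator is not implemented this way verbatim. *)

type dfa = {
  start : int;                          (* initial state *)
  accepting : int list;                 (* states in which the content may end *)
  step : int -> string -> int option;   (* None means: no transition, reject *)
}

(* Hand-written automaton for: one title, then any number of para elements. *)
let title_then_paras = {
  start = 0;
  accepting = [1];
  step = (fun state child ->
    match state, child with
    | 0, "title" -> Some 1
    | 1, "para" -> Some 1
    | _ -> None);
}

(* Each child element is inspected exactly once, so the running time is
   linear in the number of children. *)
let validate dfa children =
  let rec run state = function
    | [] -> List.mem state dfa.accepting
    | child :: rest ->
      (match dfa.step state child with
       | Some next -> run next rest
       | None -> false)
  in
  run dfa.start children
```

With this automaton, `validate title_then_paras ["title"; "para"]` accepts, while a child list that starts with `para` is rejected at the first element. A non-deterministic content model has no such single-transition table, which is why a backtracking strategy is needed there.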