W4 XML Parser

(c) Carlos Viegas Damásio, July 2003

 

Description
This software package implements a non-validating XML parser written entirely in Prolog. It is almost conformant with the W3C recommendations: we deviate from XML 1.0 by adopting the XML 1.1 syntax for XML names and XML whitespace. It fully supports XML Namespaces, and the internal representation it produces follows the guidelines of the XML Information Set. The parser processes XML Base attributes but does not (yet) resolve relative references.

Internal DTDs are parsed and a Prolog representation of them is also produced. Default attributes declared in the DTD are automatically added to the XML term representation. Furthermore, internal entities and parameter entities are expanded. Attribute-value normalization and whitespace elimination are also performed.
The following encodings are currently recognized: US-ASCII, UTF-8, UTF-16, UTF-16LE, UTF-16BE, and ISO-8859-1.

The package was developed for XSB Prolog 2.5, but porting to other Prolog systems is foreseen.

 

Download and Installation

The W4 XML Parser Library distribution contains the following XSB Prolog source files:

  • The file containing the main XML parser predicates (xml.P)
  • The XML parser (xmlparser.H and xmlparser.P), automatically produced by our Parser Generator, with lookahead information.
  • The Document Object Model (xmldom.H and xmldom.P) with the predicates for constructing the Prolog representation of XML documents.
  • A translator to the compact representation of XML in Prolog (xml2termns.P), with Namespace support.
  • The I/O stream support predicates (iostream.P and utf.P).
  • Several utility predicates (utilities.P)
  • Support of URIs (uri.P) - preliminary file.

You can also find some files with the extension .G, which are the original source files from which the XML Parser is generated.

To start using the W4 XML Parser library, unpack the .zip file and compile the file xml.P within XSB Prolog.

 

Utilization

The main predicate is parse_xml_document( +Name, +DocURI, ?Encoding, -Document, -Timing ), where:

  • Name is a list of character codes, an atom (which is converted to a list of character codes), or a term of the form stream(StreamName). In the latter case, the stream is opened, read and closed by the predicate.
  • DocURI is the list of character codes holding the Document Base URI (see RFC-2396 for details). If you don't require XML Base support, you can simply use [].
  • Encoding can be used to indicate the encoding of the document. If the argument is not given (i.e. it is an unbound variable), the parser tries to determine the encoding from the document itself, using either the Byte-Order-Mark or the encoding declared in the XML declaration. It can take the following values:
    'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-32', 'UTF-32LE', and 'UTF-32BE'.
  • Document returns the Prolog representation, according to XML INFOSET, which is described below.
  • Timing is a term of the form time(LoadTime,ParseTime) providing the reading and parsing timings (in msecs).

The following two simpler versions of parse_xml_document are also provided:

  • parse_xml_document( +Name, ?Encoding, -Document )
  • parse_xml_document( +Name, +DocURI, ?Encoding, -Document )

In some situations it may be desirable to skip the reading phase. In that case the xml_document( +UnicodeList, +BaseURI, -Document ) predicate can be called directly. Note that the first argument is a Unicode list (i.e. a list of integer code points) terminated with -1. The other arguments are as before.

Example 1:

| ?- xml_document( [0'<,0'h,0'e,0'l,0'l,0'o,0'/,0'>,-1], [], T ).

T = document([element([],hello,[],[],[],[],[[] = [],xml = http://www.w3.org/XML/1998/namespace],[],[])],
              element([],hello,[],[],[],[],[[] = [],xml = http://www.w3.org/XML/1998/namespace],[],[]),
              [],[],[],1.0,UTF-8,[],[]
            );

If the document is not well-formed, all of the previous predicates fail.
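Example 1 spells out its input one character code at a time. The sketch below builds the same kind of -1-terminated list from an atom and turns the failure on non-well-formed input into an explicit result. The predicates terminated_codes/2, add_terminator/2 and parse_atom/2 are hypothetical helpers, not part of the library. The sketch assumes that atom_codes/2 yields Unicode code points, which holds at least for plain ASCII text.

```prolog
% terminated_codes(+Atom, -Codes): Codes is the character-code list of Atom
% followed by the -1 terminator that xml_document/3 expects.
terminated_codes(Atom, Codes) :-
    atom_codes(Atom, Plain),
    add_terminator(Plain, Codes).

% add_terminator(+List, -Terminated): copy List and append -1 at the end.
add_terminator([], [-1]).
add_terminator([C|Cs], [C|Ts]) :-
    add_terminator(Cs, Ts).

% parse_atom(+Atom, -Result): hypothetical wrapper; calling it requires
% xml_document/3 from the W4 library. Result is parsed(Document) on success
% and not_well_formed when xml_document/3 fails.
parse_atom(Atom, Result) :-
    terminated_codes(Atom, Codes),
    (   xml_document(Codes, [], Document)
    ->  Result = parsed(Document)
    ;   Result = not_well_formed
    ).
```

With the library loaded, the goal parse_atom('<hello/>', R) passes xml_document/3 the same code list as Example 1.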

Representation of XML documents

The Prolog representation of the XML documents follows closely the XML Information Set (XML INFOSET), whenever defined, with the exception of the parent and references properties. In this way, the creation of cyclic terms is avoided since these are difficult to handle and correctly use in most Prolog implementations.
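Since no parent links are stored, a document term is navigated top-down. The following sketch shows two accessors whose argument positions are read off the output of Example 1: the second argument of document/9 is taken to be the document element, and the first two arguments of element/9 the namespace URI and local name. The names document_element/2, element_name/3 and sample_document/1 are hypothetical. In the sample term, the URI, version and encoding are written as quoted atoms, which is an assumption about how the parser represents them.

```prolog
% document_element(+Document, -Element): second argument of document/9.
document_element(document(_, Element, _, _, _, _, _, _, _), Element).

% element_name(+Element, -NamespaceURI, -LocalName): first two arguments
% of element/9.
element_name(element(NamespaceURI, LocalName, _, _, _, _, _, _, _),
             NamespaceURI, LocalName).

% sample_document(-Document): the term printed in Example 1, retyped by hand
% (quoting of atoms is an assumption, see above).
sample_document(document([E], E, [], [], [], '1.0', 'UTF-8', [], [])) :-
    E = element([], hello, [], [], [], [],
                [[] = [], xml = 'http://www.w3.org/XML/1998/namespace'],
                [], []).
```

For instance, the goal sample_document(D), document_element(D, E), element_name(E, NS, Name) retrieves the name of the root element without the parser being loaded.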

We illustrate the features of our parser with a simple example, and refer the reader to an auxiliary document for the full description of the Prolog representation adopted. Consider the following example XML document:

<?xml version="1.0"?>
<!-- A comment -->
<?log    this  file  ?>

<tag1 a='abc &lt;' xmlns="http://xpto.org" n1:b='1234' xmlns:n1="http://abc.com">
A very simple text
	<n1:tag2 xml:space="preserve">
		<!-- whitespace between markup should appear -->
		<tag3 xml:space="default">
		</tag3>
		<tag3/>
	</n1:tag2>
	<tag3 xml:lang="en" attrib1='This attribute has    spaces    and 
					     a line feed'>
		<tag4 xmlns="">
			<tag5>This tag shouldn't have a namespace</tag5>
		</tag4>
		<tag4>
			<!-- Whitespace shouldn't appear -->
		</tag4>
	</tag3>
</tag1>

The representation produced is the following (very complex...) term:

document(
  [ comment([32,65,32,99,111,109,109,101,110,116,32]),
    pi(log,[116,104,105,115,32,32,102,105,108,101,32,32],file:/example.xml),
    element(http://xpto.org,tag1,[],
      [ pcdata([10,65,32,118,101,114,121,32,115,105,109,112,108,101,32,116,101,120,116,10,9]),
        element(http://abc.com,tag2,n1,
          [ whitespace([10,9,9]),
            comment([32,119,104,105,116,101,115,112,97,99,101,32,98,101,116,119,101,101,110,32,
                     109,97,114,107,117,112,32,115,104,111,117,108,100,32,97,112,112,101,97,114,32]),
            whitespace([10,9,9]),
            element(http://xpto.org,tag3,[],
              [],
              [ename(http://www.w3.org/XML/1998/namespace,space) = attribute(http://www.w3.org/XML/1998/namespace,space,xml,[100,101,102,97,117,108,116],no,[])],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,[]
            ),
            whitespace([10,9,9]),
            element(http://xpto.org,tag3,[],
              [],
              [],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,
              []
            ),
            whitespace([10,9])
          ],
          [ename(http://www.w3.org/XML/1998/namespace,space) = attribute(http://www.w3.org/XML/1998/namespace,space,xml,[112,114,101,115,101,114,118,101],no,[])],
          [],
          [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
          file:/example.xml,
          []
        ),
        element(http://xpto.org,tag3,[],
          [ element([],tag4,[],
              [ element([],tag5,[],
                  [ pcdata([84,104,105,115,32,116,97,103,32,115,104,111,117,108,100,110,39,116,32,104,97,118,101,32,97,32,110,97,109,101,115,112,97,99,101])
                  ],
                  [],
                  [],
                  [[] = [],n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
                  file:/example.xml,
                  [101,110]
                )
              ],
              [],
              [ename(http://www.w3.org/2000/xmlns/,[]) = attribute(http://www.w3.org/2000/xmlns/,xmlns,[],[],no,[])],
              [[] = [],n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,
              [101,110]
            ),
            element(http://xpto.org,tag4,[],
              [ comment([32,87,104,105,116,101,115,112,97,99,101,32,115,104,111,117,108,100,110,39,116,32,97,112,112,101,97,114,32])
              ],
              [],
              [],
              [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
              file:/example.xml,
              [101,110]
            )
          ],
          [ ename([],attrib1) = attribute([],attrib1,[],[84,104,105,115,32,97,116,116,114,105,98,117,116,101,32,104,97,115,32,32,32,32,115,112,97,99,101,115,
                                                         32,32,32,32,97,110,100,32,32,32,32,32,32,32,32,32,32,32,32,97,32,108,105,110,101,32,102,101,101,100],no,[]),
            ename(http://www.w3.org/XML/1998/namespace,lang) = attribute(http://www.w3.org/XML/1998/namespace,lang,xml,[101,110],no,[]) 
          ],
          [],
          [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
          file:/example.xml,
          [101,110]
        )
      ],
      [ ename([],a) = attribute([],a,[],[97,98,99,32,60],no,[]),
        ename(http://abc.com,b) = attribute(http://abc.com,b,n1,[49,50,51,52],no,[])],
      [ ename(http://www.w3.org/2000/xmlns/,[]) = attribute(http://www.w3.org/2000/xmlns/,xmlns,[],[104,116,116,112,58,47,47,120,112,116,111,46,111,114,103],no,[]),
        ename(http://www.w3.org/2000/xmlns/,n1) = attribute(http://www.w3.org/2000/xmlns/,n1,xmlns,[104,116,116,112,58,47,47,97,98,99,46,99,111,109],no,[])],
      [[] = http://xpto.org,n1 = http://abc.com,xml = http://www.w3.org/XML/1998/namespace],
      file:/example.xml,
      []
    )
  ], 
  ... The representation of the document element again ...,
  [],
  [],
  file:/example.xml,
  1.0,
  UTF-8,
  yes,
  []
)
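Terms of this size are easier to handle through small accessor predicates. The sketch below collects the text below an element and looks up an attribute value. It assumes the argument layout visible in the term above: the fourth argument of element/9 holds the children, the fifth a list of ename(NamespaceURI,LocalName) = attribute(...) pairs, and the fourth argument of attribute/6 the value as a code list. All predicate names are hypothetical helpers, and namespace URIs are assumed to be atoms.

```prolog
% element_text(+Element, -Codes): concatenation, in document order, of all
% pcdata/1 code lists found below Element. Other node kinds (comment/1,
% whitespace/1, pi/3) contribute nothing.
element_text(element(_, _, _, Children, _, _, _, _, _), Codes) :-
    children_text(Children, Codes).

children_text([], []).
children_text([Node|Nodes], Codes) :-
    node_text(Node, Here),
    children_text(Nodes, Rest),
    join_codes(Here, Rest, Codes).

node_text(pcdata(Codes), Codes) :- !.
node_text(element(_, _, _, Children, _, _, _, _, _), Codes) :- !,
    children_text(Children, Codes).
node_text(_, []).

% join_codes(+Xs, +Ys, -Zs): Zs is Xs followed by Ys.
join_codes([], Ys, Ys).
join_codes([X|Xs], Ys, [X|Zs]) :-
    join_codes(Xs, Ys, Zs).

% attribute_value(+Element, +NamespaceURI, +LocalName, -Value): Value is
% the attribute value converted to an atom; fails if there is no such
% attribute.
attribute_value(element(_, _, _, _, Attributes, _, _, _, _), NS, Local, Value) :-
    lookup(Attributes, ename(NS, Local), attribute(_, _, _, Codes, _, _)),
    atom_codes(Value, Codes).

lookup([Key = Value|_], Key, Value) :- !.
lookup([_|Pairs], Key, Value) :-
    lookup(Pairs, Key, Value).

% sample_element(-Element): a small hand-written element in the same layout.
sample_element(element('http://xpto.org', tag3, [],
    [ whitespace([10,9]),
      element([], tag5, [], [pcdata([72,105])], [], [], [],
              'file:/example.xml', [101,110]) ],
    [ ename('http://www.w3.org/XML/1998/namespace', lang) =
        attribute('http://www.w3.org/XML/1998/namespace', lang, xml,
                  [101,110], no, []) ],
    [], [], 'file:/example.xml', [101,110])).
```

These helpers only inspect terms, so they can be tried on sample_element/1 without loading the parser.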

 

Current Limitations
  • Reading documents is rather inefficient, because several encodings must be supported. If you have a more efficient way of obtaining terminated Unicode lists, use it and then call xml_document/3.
  • Relative URI references in xml:base attributes are not resolved.
  • External parsed entities are not expanded.

Future developments
  • Complete the documentation
  • Improve the generation of the tree representation

 

Disclaimer

THIS IS AN EXPERIMENTAL TOOL. I DO NOT GIVE ANY GUARANTEE.

 
 
Last update: July 30th, 2003