4. Expat Interface

The ‘(gamma expat)’ module provides an interface to libexpat, a library for parsing XML documents. See http://expat.sourceforge.net for a description of the library.

Usage:

 
(use-modules ((gamma expat)))

4.1 Expat Basics

Parsing of XML documents using Expat is based on user-defined callback functions. You create a parser object and associate callback (or handler) functions with the events you are interested in. Such events may be, for instance, encountering an opening or closing tag, encountering a comment block, etc. Once the parser object is ready, you start feeding the document to it. As the parser recognizes XML constructs, it calls the callbacks that are registered for them.

Parsers are created using the xml-make-parser function. In the simplest case, it takes no arguments, e.g.:

 
(let ((parser (xml-make-parser)))
  …

The function xml-parse takes the parser as its argument, reads the document from the current input stream and feeds it to the parser. Thus, the simplest program for parsing XML documents is:

 
(use-modules ((gamma expat)))
(xml-parse (xml-make-parser))

This program is perhaps not very useful, but you can already use it to check whether its input is a well-formed XML document. If xml-parse encounters an error, it signals the gamma-xml-error error. See section Error Handling, for a discussion of how to handle it.

The xml-make-parser function takes optional arguments, which allow you to set callback functions for the new parser. For example, the following code sets the function ‘elt-start’ as a handler for start elements:

 
(xml-make-parser #:start-element-handler elt-start)

The #:start-element-handler keyword informs the function that the argument following it is a handler for start elements. Any number of handlers may be set this way, e.g.:

 
(xml-make-parser #:start-element-handler elt-start
                 #:end-element-handler elt-end
                 #:comment-handler comment)

Definitions of particular handler functions differ depending on their purpose, i.e. on the event they are defined to handle. For example, a start element handler must be defined as taking two arguments. The first is the name of the tag, and the second is a list of attributes supplied for that tag. Thus, for example, the following start handler prints the tag and the number of attributes:

 
(define (elt-start name attrs)
  (format #t "~A (~A)~%" name (length attrs)))

For a detailed description of all available handlers and handler keywords, see Expat Handlers.

To further improve our example, suppose you need a program that will take an XML document as its input and create a description of its structure on output, showing element nesting levels by indenting their description. Here is how to write it.

First, define handlers for start and end elements. The start element handler prints two indenting spaces for each level of ancestor elements, followed by the element name, its attributes, and a newline. It then increases the global level variable:

 
(define level 0)

(define (elt-start name attrs)
  (display (make-string (* 2 level) #\space))
  (display name)
  (for-each
   (lambda (x)
    (display " ")
    (display (car x))
    (display "=")
    (display (cdr x)))
   attrs)
  (newline)
  (set! level (1+ level)))

The handler for end tags is simpler: it must only decrease the level:

 
(define (elt-end name)
  (set! level (1- level)))

Finally, create a parser and parse the input:

 
(xml-parse (xml-make-parser #:start-element-handler elt-start
                            #:end-element-handler elt-end))

4.2 Creating XML Parsers

Gamma provides several functions for creating and modifying XML parsers. The xml-primitive-make-parser and xml-primitive-set-handler functions are lower-level interfaces, provided for those who wish to extend Gamma functionality. The higher-level interfaces, xml-make-parser and xml-set-handler, are recommended for regular users.

Scheme procedure: xml-primitive-make-parser [enc [sep]]

Return a new XML parser. If enc is given, it must be one of: ‘US-ASCII’, ‘UTF-8’, ‘UTF-16’, ‘ISO-8859-1’. If sep is given, the returned parser has namespace processing in effect. In that case, sep is a character which is used as a separator between the namespace URI and the local part of the name in returned namespace element and attribute names.

Scheme procedure: xml-set-encoding parser enc

Set the encoding to be used by the parser. The parser argument must be a value returned from a previous call to xml-primitive-make-parser or xml-make-parser.

The sequence:

 
  (let ((parser (xml-primitive-make-parser)))
    (xml-set-encoding parser encoding)
    …
   

is equivalent to:

 
  (let ((parser (xml-primitive-make-parser encoding)))
   …

and to:

 
  (let ((parser (xml-make-parser encoding)))
   …

Scheme procedure: xml-primitive-set-handler parser key handler

Set XML handler for an event. Arguments are:

parser

A valid XML parser.

key

A key identifying the event. For example, ‘#:start-element-handler’ sets the handler which is called for start tags.

See section Expat Handlers, for the valid keys and their meaning.

handler

Handler procedure.

Scheme function: xml-set-handler parser args…

Sets several handlers at once. Optional arguments (args) consist of keywords (see section Expat Handlers), each followed by its handler, for example:

 
(xml-set-handler parser
      #:start-element-handler elt-start
      #:end-element-handler elt-end)

Scheme function: xml-make-parser [enc [sep]] args…

Create a parser and set its handlers. Optional enc and sep have the same meaning as in xml-primitive-make-parser. The rest of the arguments define handlers for the new parser. They must be supplied in pairs: a keyword (see section Expat Handlers), followed by its handler. For example:

 
(xml-make-parser "US-ASCII"
      #:start-element-handler elt-start
      #:end-element-handler elt-end)

This call creates a new parser for documents in ‘US-ASCII’ encoding and sets two handlers: for element start and for element end. This call is equivalent to:

 
(let ((p (xml-primitive-make-parser "US-ASCII")))
   (xml-primitive-set-handler p #:start-element-handler elt-start)
   (xml-primitive-set-handler p #:end-element-handler elt-end)
   …

4.3 Parser Functions

Scheme procedure: xml-primitive-parse parser input isfinal

Parse the next piece of input. Arguments are:

parser

A parser returned from a previous call to xml-primitive-make-parser or xml-make-parser.

input

A piece of input text.

isfinal

Boolean value indicating whether input is the last part of the document.

Scheme function: xml-parse-more parser input

Equivalent to:

 
(xml-primitive-parse parser input #f)

unless input is an end-of-file object, in which case it is equivalent to:

 
(xml-primitive-parse parser "" #t)

Scheme function: xml-parse parser [port]

Reads XML input from port (or from the current input port, if port is not given) and parses it using xml-primitive-parse.
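As an illustration of how xml-parse-more can be used, the following sketch feeds a document to a parser one line at a time. The helper name parse-by-line is an assumption made for this example and is not part of Gamma. The read-line procedure comes from the Guile ‘(ice-9 rdelim)’ module.

```scheme
(use-modules ((gamma expat))
             (ice-9 rdelim))

;; Feed the contents of PORT to PARSER one line at a time.
;; parse-by-line is a hypothetical helper, not part of Gamma.
(define (parse-by-line parser port)
  (let loop ((line (read-line port 'concat)))
    ;; At end of file LINE is the end-of-file object, which makes
    ;; xml-parse-more finish the document.
    (xml-parse-more parser line)
    (if (not (eof-object? line))
        (loop (read-line port 'concat)))))
```

A call such as (parse-by-line (xml-make-parser) (current-input-port)) should then behave much like xml-parse, except that the chunk size is under your control.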

4.4 Error Handling

When encountering an error, the ‘gamma xml’ functions use the Guile error reporting mechanism (see Procedures for Signaling Errors: (guile)Error Reporting section `Error Reporting' in The Guile Reference Manual). The error key indicates what type of error it was, and the rest of the arguments supply additional information about the error. Recommended ways of handling errors in Guile are described in How to Handle Errors ((guile)Handling Errors section `Handling Errors' in The Guile Reference Manual). This section describes how to handle errors in XML input and other errors reported by the underlying ‘libexpat’ library.

Error Key: gamma-xml-error

An error of this type is signalled when one of the ‘gamma xml’ functions encounters an XML-related error.

The arguments supplied with this error are:

key

The error key (gamma-xml-error).

func

Name of the function that generated the error.

fmt

Format string.

fmt-args

Arguments for ‘fmt’.

descr

Error description. If there is no additional information, it is #f. Otherwise it is a list of 5 elements which describes the error and its location in the input stream:

  1. Error code (number).
  2. Line number (starts at 1).
  3. Column number (starts at 0).
  4. Context in which the error occurred, i.e. a part of the input text which was found to contain the error.
  5. Offset of point that caused the error within the context.

A special syntax is provided to extract parts of the ‘descr’ list:

Gamma Syntax: xml-error-descr descr key

Extract from descr the part identified by key. Use this macro in the error handlers. Valid values for key are:

xml-error-descr key: #:error-code

Return the error code.

xml-error-descr key: #:line

Return line number.

xml-error-descr key: #:column

Return column number.

xml-error-descr key: #:has-context?

Return #t if the description has context part. Use the two keywords below only if

 
(xml-error-descr d #:has-context?)

returned #t.

xml-error-descr key: #:context

Return context string.

xml-error-descr key: #:error-offset

Return the location within #:context where the error occurred.

If no special handler is set, the default Guile error handler displays the error and its approximate location on the standard error port. For example, given the following input file:

 
$ cat input.xml
<input>
 <ref a=1/>
</input>

the ‘xmlck.scm’ example (see xmlck.scm) produces:

 
$ guile -s examples/xmlck.scm < input.xml
ERROR: In procedure xml-primitive-parse:
ERROR: not well-formed (invalid token) near line 2

To provide more detailed diagnostics, catch the gamma-xml-error key and use the information from the ‘descr’ list. For example:

 
(catch 'gamma-xml-error
       (lambda ()
	 (xml-parse (xml-make-parser)))
       (lambda (key func fmt args descr)
	 (with-output-to-port
	     (current-error-port)
	   (lambda ()
	     (cond
	      ((not descr)
	       (apply format #t fmt args)
	       (newline))
	      (else
	       (format #t
		       "~A:~A: ~A~%"
		       (xml-error-descr descr #:line)
		       (xml-error-descr descr #:column)
		       (xml-error-string (xml-error-descr descr #:error-code)))
	       (if (xml-error-descr descr #:has-context?)
		   (let ((ctx-text (xml-error-descr descr #:context))
			 (ctx-pos  (xml-error-descr descr #:error-offset)))
		     (format #t
			     "Context (^ marks the point): ~A^~A~%"
			     (substring ctx-text 0 ctx-pos)
			     (substring ctx-text ctx-pos))))
	       (exit 1)))))))

When applied to the same input document as in the previous example, this code produces:

 
$ guile -s examples/xml-check.scm < input.xml
2:8: not well-formed (invalid token)
Context (^ marks the point): <input>
 <ref a=^1/>
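
When only a yes-or-no answer is needed, the same catch form can be wrapped in a predicate. The following is a minimal sketch. The name well-formed? is hypothetical and is not provided by Gamma.

```scheme
(use-modules ((gamma expat)))

;; Return #t if the document read from PORT parses without error,
;; #f otherwise.  well-formed? is a hypothetical helper name.
(define (well-formed? port)
  (catch 'gamma-xml-error
    (lambda ()
      (xml-parse (xml-make-parser) port)
      #t)
    ;; The handler ignores the error details and just reports failure.
    (lambda (key . args)
      #f)))
```

For example, (well-formed? (current-input-port)) checks the document supplied on the standard input. Errors signalled with other keys are not caught by this sketch.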

4.5 Expat Handlers

This section describes all available element handlers. For clarity, each handler is described in its own subsection. For each handler, we indicate a keyword that is used when registering this handler and the handler prototype.

To register handlers, use the xml-make-parser or xml-set-handler functions. See section Creating XML Parsers, for a detailed discussion of these functions.

4.5.1 start-element-handler

Handler Keyword: #:start-element-handler

Sets a handler for start (and empty) tags.

The handler must be defined as follows:

Handler prototype: start-element name attrs

Arguments:

name

Element name.

attrs

A list of element attributes. Each attribute is represented by a cons (‘car’ holds attribute name, ‘cdr’ holds its value).

4.5.2 end-element-handler

Handler Keyword: #:end-element-handler

Sets a handler for end (and empty) tags. An empty tag generates a call to both start and end handlers (in that order).

The handler must be defined as follows:

Handler prototype: end-element name

Arguments:

name

Element name.

4.5.3 character-data-handler

Handler Keyword: #:character-data-handler

Sets a text handler. A single block of contiguous text free of markup may result in a sequence of calls to this handler. So, if you are searching for a pattern in the text, it may be split across calls to this handler.

The handler itself is defined as:

Handler prototype: character-data text

Arguments:

text

The text.
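
Because a block of text may arrive in several pieces, a common approach is to collect the pieces and join them only when a tag boundary is reached. The sketch below prints, for each element, the text that immediately precedes its end tag. The helper name print-element-text is an assumption of this example, not part of Gamma.

```scheme
(use-modules ((gamma expat)))

;; For each element read from PORT, print the text found between the
;; most recent start tag and the element's end tag.
;; print-element-text is a hypothetical helper, not part of Gamma.
(define (print-element-text port)
  (let ((chunks '()))                   ; pieces of text, newest first
    (xml-parse
     (xml-make-parser
      #:start-element-handler
      (lambda (name attrs)
        (set! chunks '()))              ; drop text seen before this tag
      #:character-data-handler
      (lambda (text)
        (set! chunks (cons text chunks)))
      #:end-element-handler
      (lambda (name)
        ;; Join the pieces only now, so that a pattern split across
        ;; several handler calls is seen as a whole.
        (let ((text (apply string-append (reverse chunks))))
          (set! chunks '())
          (format #t "~A: ~S~%" name text))))
     port)))
```

Any pattern matching should be applied to the joined string rather than to the individual arguments of the character data handler.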

4.5.4 processing-instruction-handler

Handler Keyword: #:processing-instruction-handler

Set a handler for processing instructions.

Handler prototype: processing-instruction target data

Arguments are:

target

First word in the processing instruction.

data

The rest of the characters in the processing instruction, after target and whitespace following it.

4.5.5 comment-handler

Handler Keyword: #:comment-handler

Sets a handler for comments.

Handler prototype: comment text

Arguments:

text

The text inside the comment delimiters.

4.5.6 start-cdata-section-handler

Handler Keyword: #:start-cdata-section-handler

Sets a handler that gets called at the beginning of a CDATA section.

The handler is defined as follows:

Handler prototype: start-cdata-section

4.5.7 end-cdata-section-handler

Handler Keyword: #:end-cdata-section-handler

Sets a handler that gets called at the end of a CDATA section.

The handler is defined as:

Handler prototype: end-cdata-section

4.5.8 default-handler

Handler Keyword: #:default-handler

Sets a handler for any characters in the document which wouldn't otherwise be handled. This includes both data for which no handlers can be set (like some kinds of DTD declarations) and data which could be reported but which currently has no handler set.

Handler prototype: default text

Arguments:

text

A string containing all non-handled characters, which are passed exactly as they were present in the input XML document except that they will be encoded in UTF-8 or UTF-16. Line boundaries are not normalized. Note that a byte order mark character is not passed to the default handler. There are no guarantees about how characters are divided between calls to the default handler: for example, a comment might be split between multiple calls. Setting the ‘default’ handler has the side effect of turning off expansion of references to internally defined general entities. Such references are passed to the default handler verbatim.

4.5.9 default-handler-expand

Handler Keyword: #:default-handler-expand

This sets a default handler as above, but does not inhibit the expansion of internal entity references. Entity references are not passed to the handler.

The handler prototype is the same as in default-handler.

4.5.10 skipped-entity-handler

Handler Keyword: #:skipped-entity-handler

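Set a skipped entity handler, i.e. a handler which is called if:

  1. An entity reference is encountered for which no declaration has been read and this is not an error.
  2. An internal entity reference is read, but not expanded, because a default handler has been set (see section default-handler).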

Handler prototype: skipped-entity entity-name parameter?

Arguments are:

entity-name

Name of the entity.

parameter?

This argument is #t if the entity is a parameter entity, and #f otherwise.

4.5.11 start-namespace-decl-handler

Handler Keyword: #:start-namespace-decl-handler

Set a handler to be called when a namespace is declared.

Handler prototype: start-namespace-decl prefix uri

Arguments:

prefix

Namespace prefix.

uri

Namespace URI.

4.5.12 end-namespace-decl-handler

Handler Keyword: #:end-namespace-decl-handler

Set a handler to be called when leaving the scope of a namespace declaration. This will be called, for each namespace declaration, after the handler for the end tag of the element in which the namespace was declared.

The handler prototype is:

Handler prototype: end-namespace-decl prefix

4.5.13 xml-decl-handler

Handler Keyword: #:xml-decl-handler

Sets a handler that is called for XML declarations and also for text declarations discovered in external entities.

Handler prototype: xml-decl version encoding . detail

Arguments:

version

Version specification (string), or #f, for text declarations.

encoding

Encoding. May be #f.

detail

‘Unspecified’, if there was no standalone parameter in the declaration. Otherwise, #t or #f depending on whether it was given as ‘yes’ or ‘no’.

4.5.14 start-doctype-decl-handler

Handler Keyword: #:start-doctype-decl-handler

Set a handler that is called at the start of a ‘DOCTYPE’ declaration, before any external or internal subset is parsed.

Handler prototype: start-doctype-decl name sysid pubid has-internal-subset?

Arguments:

name

Declaration name.

sysid

System ID. May be #f.

pubid

Public ID. May be #f.

has-internal-subset?

#t if the ‘DOCTYPE’ declaration has an internal subset, #f otherwise.

4.5.15 end-doctype-decl-handler

Handler Keyword: #:end-doctype-decl-handler

Set a handler that is called at the end of a ‘DOCTYPE’ declaration, after parsing any external subset.

The handler takes no arguments:

Handler prototype: end-doctype-decl

4.5.16 attlist-decl-handler

Handler Keyword: #:attlist-decl-handler

Sets a handler for ‘attlist’ declarations in the DTD. This handler is called for each attribute, which means, in particular, that a single attlist declaration with multiple attributes causes multiple calls to this handler.

The handler prototype is:

Handler prototype: attlist-decl el-name att-name att-type detail

Arguments:

el-name

Name of the element for which the attribute is being declared.

att-name

Attribute name.

att-type

Attribute type.

detail

Default value, if att-name is a ‘#FIXED’ attribute, #t, if it is a ‘#REQUIRED’ attribute, and #f, if it is a ‘#IMPLIED’ attribute.

4.5.17 entity-decl-handler

Handler Keyword: #:entity-decl-handler

Sets a handler that will be called for all entity declarations.

Handler prototype: entity-decl name param? value base sys-id pub-id notation

Arguments:

name

Entity name.

param?

For parameter entities, #t. Otherwise, #f.

value

For internal entities, entity value. Otherwise, #f.

base

Base.

sys-id

System ID. For internal entities – #f.

pub-id

Public ID. For internal entities – #f.

notation

Notation name, for unparsed entity declarations. Otherwise, #f. Unparsed entity declarations are those that have a notation (‘NDATA’) field, such as:

 
<!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
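
As an illustration of this prototype, the following sketch reports only unparsed entities and ignores all other entity declarations. The names report-unparsed-entity and list-unparsed-entities are hypothetical and are not part of Gamma.

```scheme
(use-modules ((gamma expat)))

;; Entity declaration handler: print unparsed entities, i.e. those
;; for which NOTATION is not #f.  The name is hypothetical.
(define (report-unparsed-entity name param? value base sys-id pub-id notation)
  (if notation
      (format #t "~A: notation ~A, system ID ~A~%" name notation sys-id)))

;; Parse the document on PORT, reporting its unparsed entities.
(define (list-unparsed-entities port)
  (xml-parse (xml-make-parser #:entity-decl-handler report-unparsed-entity)
             port))
```

For the declaration shown above, the handler would be expected to print the entity name ‘logo’, the notation ‘gif’ and the system ID ‘images/logo.gif’.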

4.5.18 notation-decl-handler

Handler Keyword: #:notation-decl-handler

Sets a handler that receives notation declarations.

The handler prototype is:

Handler prototype: notation-decl notation-name base system-id public-id

4.5.19 not-standalone-handler

Handler Keyword: #:not-standalone-handler

Sets a handler that is called if the document is not standalone, i.e. when it has an external subset or a reference to a parameter entity, but does not have ‘standalone’ set to "yes" in its XML declaration.

The handler takes no arguments:

Handler prototype: not-standalone

4.6 Miscellaneous Functions

Scheme function: xml-expat-version-string

Return the version of the expat library as a string.

For example:

 
(xml-expat-version-string) ⇒ "expat_2.0.1"

Scheme function: xml-expat-version

Return the version of the expat library as a triplet: ‘(major minor micro)’.

For example:

 
(xml-expat-version) ⇒ (2 0 1)

Scheme function: xml-default-current

Pass current markup to the default handler (see section default-handler). This function may be called only from a callback handler.

Scheme function: xml-error-string code

Return a textual description corresponding to the code argument. See section Error Handling, for an example of using this function.

Scheme function: xml-current-line-number parser

Return the number of the current input line in parser. Input lines are numbered from ‘1’.

Scheme function: xml-current-column-number parser

Return the number of the current column in the current input line.

Scheme function: xml-current-byte-count parser

Return the number of bytes in the current event. Returns ‘0’ if the event is inside a reference to an internal entity and for the end-tag event for empty-element tags (the latter can be used to distinguish empty-element tags from empty elements using separate start and end tags).
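
The following sketch shows one way to use this property. It prints the name of every element whose end-tag event has a byte count of zero. The helper name print-empty-element-tags is an assumption of this example. Note that the sketch would also report end tags that occur inside references to internal entities, since the byte count is zero there as well.

```scheme
(use-modules ((gamma expat)))

;; Print the names of elements written as empty-element tags in the
;; document read from PORT.
;; print-empty-element-tags is a hypothetical helper, not part of Gamma.
(define (print-empty-element-tags port)
  (let ((parser (xml-make-parser)))
    ;; The handler refers to the parser itself, so it is registered
    ;; after the parser has been created.
    (xml-set-handler parser
      #:end-element-handler
      (lambda (name)
        (if (zero? (xml-current-byte-count parser))
            (format #t "~A~%" name))))
    (xml-parse parser port)))
```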