Collection of Guile Modules
The ‘(gamma expat)’ module provides an interface to
libexpat, a library for parsing XML documents.
See http://expat.sourceforge.net for a description of the
library. To use the module, load it as follows:
(use-modules ((gamma expat)))
Parsing of XML documents using Expat is based on user-defined callback functions. You create a parser object and associate callback (or handler) functions with the events you are interested in. Such events may be, for instance, encountering an opening or closing tag, encountering a comment block, etc. Once the parser object is ready, you start feeding the document to it. As the parser recognizes XML constructs, it calls the callbacks that are registered for them.
Parsers are created using the
xml-make-parser function. In the
simplest case, it takes no arguments, e.g.:
(let ((parser (xml-make-parser))) …
The xml-parse function takes the parser as its argument, reads
the document from the current input port and feeds it to the parser.
Thus, the simplest program for parsing XML documents is:
(use-modules ((gamma expat)))
(xml-parse (xml-make-parser))
This program is perhaps not very useful, but you can already use it to
check whether its input is a well-formed XML document.
If xml-parse encounters an error, it signals the
gamma-xml-error error. See section Error Handling for a
discussion of how to handle it.
The xml-make-parser function takes optional arguments, which
allow you to set callback functions for the new parser. For example, the
following code sets the function ‘elt-start’ as a handler for start elements:
(xml-make-parser #:start-element-handler elt-start)
The #:start-element-handler keyword informs the function that
the argument following it is a handler for start elements.
Any number of handlers may be set this way, e.g.:
(xml-make-parser #:start-element-handler elt-start
                 #:end-element-handler elt-end
                 #:comment-handler comment)
Definitions of particular handler functions differ depending on their purpose, i.e. on the event they are defined to handle. For example, a start element handler must take two arguments. The first is the name of the tag, and the second is a list of attributes supplied for that tag. Thus, for example, the following start handler prints the tag and the number of attributes:
(define (elt-start name attrs) (format #t "~A (~A)~%" name (length attrs)))
For a detailed description of all available handlers and handler keywords, see Expat Handlers.
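As a further illustration of a start element handler, the sketch below counts how often each element name occurs. The names `element-counts` and `count-element` are hypothetical, and the handler is called by hand so that the snippet runs without Gamma; in a real program it would be passed to xml-make-parser after ‘#:start-element-handler’.

```scheme
;; Hypothetical example: tally element names seen by a start element handler.
(define element-counts (make-hash-table))

;; Start element handler: NAME is the tag name, ATTRS the attribute list.
(define (count-element name attrs)
  (hash-set! element-counts name
             (1+ (hash-ref element-counts name 0))))

;; Simulate the calls a parser would make for <doc><p/><p/></doc>.
(count-element "doc" '())
(count-element "p" '())
(count-element "p" '())

(display (hash-ref element-counts "p" 0))
(newline)
```

The attribute list is ignored here; only the element name matters for the tally.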
To further improve our example, suppose you need a program that will take an XML document as its input and create a description of its structure on output, showing element nesting levels by indenting their description. Here is how to write it.
First, define handlers for start and end elements. The start element handler prints two indenting spaces for each level of ancestor elements, followed by the element name, its attributes and a newline. It then increases the global level variable:
(define level 0)

(define (elt-start name attrs)
  (display (make-string (* 2 level) #\space))
  (display name)
  (for-each
   (lambda (x)
     (display " ")
     (display (car x))
     (display "=")
     (display (cdr x)))
   attrs)
  (newline)
  (set! level (1+ level)))
The handler for end tags is simpler: it must only decrease the level:
(define (elt-end name) (set! level (1- level)))
Finally, create a parser and parse the input:
(xml-parse (xml-make-parser #:start-element-handler elt-start #:end-element-handler elt-end))
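The global level variable keeps the example short, but it means that only one document can be processed at a time. A possible alternative, sketched here under the assumption that handlers may be arbitrary closures, keeps the nesting level private to a pair of handlers. The helper `make-outline-handlers` is hypothetical, and the handlers are invoked by hand so that the snippet runs without Gamma.

```scheme
;; Sketch: build start/end handlers that share a private nesting level.
;; make-outline-handlers is a hypothetical helper, not part of Gamma.
(define (make-outline-handlers)
  (let ((level 0))
    (values
     ;; Start element handler: print the indented name, then go one level down.
     (lambda (name attrs)
       (display (make-string (* 2 level) #\space))
       (display name)
       (newline)
       (set! level (1+ level)))
     ;; End element handler: go one level up.
     (lambda (name)
       (set! level (1- level))))))

;; Simulate the events for <a><b/></a>; the empty tag <b/> produces
;; a start call followed by an end call.
(define outline
  (with-output-to-string
    (lambda ()
      (call-with-values make-outline-handlers
        (lambda (start end)
          (start "a" '())
          (start "b" '())
          (end "b")
          (end "a"))))))

(display outline)
```

With Gamma loaded, the two procedures would be passed to xml-make-parser after ‘#:start-element-handler’ and ‘#:end-element-handler’ respectively.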
Gamma provides several functions for creating and modifying
XML parsers. The xml-primitive-make-parser and
xml-primitive-set-handler functions are lower level interfaces, provided
for those who wish to further extend Gamma functionality. The higher level
interfaces, xml-make-parser and xml-set-handler, are the ones
we recommend for regular users.
Return a new XML parser. If enc is given, it must be one of: ‘US-ASCII’, ‘UTF-8’, ‘UTF-16’, ‘ISO-8859-1’. If sep is given, the returned parser has namespace processing in effect. In that case, sep is a character which is used as a separator between the namespace URI and the local part of the name in returned namespace element and attribute names.
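When namespace processing is in effect, a handler typically needs to take such combined names apart again. The following sketch assumes that names arrive as strings of the form URI, separator, local part; the helper `split-qualified-name` is hypothetical and not part of Gamma.

```scheme
;; Hypothetical helper: split "URI<sep>local" into the pair (URI . local).
;; Names that carry no separator yield (#f . name).
(define (split-qualified-name name sep)
  ;; Search from the right, so that a separator character occurring
  ;; inside the URI does not confuse the split.
  (let ((pos (string-rindex name sep)))
    (if pos
        (cons (substring name 0 pos)
              (substring name (1+ pos)))
        (cons #f name))))

(write (split-qualified-name "http://example.org/ns|item" #\|))
(newline)
(write (split-qualified-name "item" #\|))
(newline)
```

Choosing a separator that cannot appear in an XML name, such as ‘|’, keeps the split unambiguous.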
Set the encoding to be used by the parser. The latter must be a
value returned from a previous call to xml-primitive-make-parser or xml-make-parser.
(let ((parser (xml-primitive-make-parser))) (xml-set-encoding parser encoding) …
is equivalent to:
(let ((parser (xml-primitive-make-parser encoding))) …
and to:
(let ((parser (xml-make-parser encoding))) …
Set XML handler for an event. Arguments are:
A valid XML parser
A key, identifying the event. For example, ‘#:start-element-handler’ sets handler which is called for start tags.
See section Expat Handlers, for its values and their meaning.
Sets several handlers at once. Optional arguments (args) consist of keywords (see handler-keyword), each followed by its argument, for example:
(xml-set-handler parser #:start-element-handler elt-start #:end-element-handler elt-end)
Create a parser and set its handlers. Optional enc and sep have the same meaning as in xml-primitive-make-parser. The rest of the arguments define handlers for the new parser. They must be supplied in pairs: a keyword (see handler-keyword), followed by its argument. For example:
(xml-make-parser "US-ASCII" #:start-element-handler elt-start #:end-element-handler elt-end)
This call creates a new parser for documents in ‘US-ASCII’ encoding and sets two handlers: for element start and for element end. This call is equivalent to:
(let ((p (xml-primitive-make-parser "US-ASCII")))
  (xml-primitive-set-handler p #:start-element-handler elt-start)
  (xml-primitive-set-handler p #:end-element-handler elt-end)
  …
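The primitive interface can also be driven from data. The sketch below, which assumes that Gamma is installed, installs handlers from a table of keyword and procedure pairs. The names `handler-table` and `install-handlers` are hypothetical.

```scheme
(use-modules ((gamma expat)))

;; Placeholder handlers for the sketch.
(define (elt-start name attrs)
  (display name)
  (newline))
(define (elt-end name) #t)

;; Hypothetical table pairing handler keywords with procedures.
(define handler-table
  (list (cons #:start-element-handler elt-start)
        (cons #:end-element-handler elt-end)))

;; install-handlers is a hypothetical helper built on the primitive call.
;; It returns the parser so that calls can be nested.
(define (install-handlers parser table)
  (for-each (lambda (entry)
              (xml-primitive-set-handler parser (car entry) (cdr entry)))
            table)
  parser)

(define parser
  (install-handlers (xml-primitive-make-parser) handler-table))
```

For a fixed set of handlers, calling xml-make-parser with keyword arguments is shorter; a table is mainly useful when the set of handlers is computed at run time.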
Parse next piece of input. Arguments are:
A parser returned from a previous call to xml-primitive-make-parser or xml-make-parser.
A piece of input text.
Boolean value indicating whether input is the last part of input.
(xml-primitive-parse parser input #f)
unless input is an end-of-file object, in which case it is equivalent to:
(xml-primitive-parse parser "" #t)
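To illustrate incremental parsing, the following sketch feeds a document to the parser in pieces, using only the two forms of the xml-primitive-parse call shown above. It assumes that Gamma is installed; `feed-chunks` and `seen` are hypothetical names.

```scheme
(use-modules ((gamma expat)))

;; Record element names as they are reported.
(define seen '())
(define (elt-start name attrs)
  (set! seen (cons name seen)))

;; feed-chunks is a hypothetical helper: every chunk is passed with a
;; false "last part" flag, then an empty string marks the end of input.
(define (feed-chunks parser chunks)
  (for-each (lambda (chunk)
              (xml-primitive-parse parser chunk #f))
            chunks)
  (xml-primitive-parse parser "" #t))

;; The document is deliberately split in the middle of a tag.
(feed-chunks (xml-make-parser #:start-element-handler elt-start)
             '("<doc><it" "em/></doc>"))

(write (reverse seen))
(newline)
```

Because the parser keeps its state between calls, chunk boundaries need not coincide with markup boundaries.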
Reads XML input from port (or the standard input port,
if it is not given) and parses it.
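Since the input port is optional, a document held in a string can be parsed through a string port. This is a minimal sketch, assuming that Gamma is installed and that the port is passed to xml-parse as the argument following the parser; `parse-xml-string` and `on-comment` are hypothetical names.

```scheme
(use-modules ((gamma expat)))

;; Hypothetical helper: parse TEXT by reading it from a string port.
;; Assumption: xml-parse accepts the port as its second argument.
(define (parse-xml-string parser text)
  (call-with-input-string text
    (lambda (port)
      (xml-parse parser port))))

;; Collect the text of every comment in the document.
(define comments '())
(define (on-comment text)
  (set! comments (cons text comments)))

(parse-xml-string (xml-make-parser #:comment-handler on-comment)
                  "<doc><!-- hello --></doc>")

(write comments)
(newline)
```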
When encountering an error, the ‘gamma xml’ functions use the Guile error reporting mechanism (see section `Error Reporting' in The Guile Reference Manual). The error key indicates what type of error it was, and the rest of the arguments supply additional information about the error. Recommended ways of handling errors in Guile are described in section `Handling Errors' in The Guile Reference Manual. In this chapter we describe how to handle errors in XML input and other errors reported by the underlying ‘libexpat’ library.
An error of this type is signalled when one of the ‘gamma xml’ functions encounters an XML-related error.
The arguments supplied with this error are:
key
The error key (gamma-xml-error).
func
Name of the function that generated the error.
fmt
Format string for the error message.
args
Arguments for ‘fmt’.
descr
Error description. If there is no additional information, it is
#f. Otherwise it is a list of 5 elements which describes the
error and its location in the input stream.
A special syntax is provided to extract parts of the ‘descr’ list:
Extract from descr the part identified by key. Use this macro in the error handlers. Valid values for key are:
#:error-code
Return the error code.
#:line
Return line number.
#:column
Return column number.
#:has-context?
#t if the description has a context part. Use the two
keywords below only if
(xml-error-descr d #:has-context?) returns #t.
#:context
Return context string.
#:error-offset
Return the location within
#:context where the error occurred.
If no special handler is set, the default
handler displays the error and its approximate location on the
standard error port. For example, given the following input file:
$ cat input.xml
<input>
<ref a=1/>
</input>
the ‘xmlck.scm’ example (see xmlck.scm) produces:
$ guile -s examples/xmlck.scm < input.xml
ERROR: In procedure xml-primitive-parse:
ERROR: not well-formed (invalid token) near line 2
To provide more detailed diagnostics, catch the
gamma-xml-error key and use information from the ‘descr’
list. For example:
(catch 'gamma-xml-error
  (lambda ()
    (xml-parse (xml-make-parser)))
  (lambda (key func fmt args descr)
    (with-output-to-port
        (current-error-port)
      (lambda ()
        (cond
         ((not descr)
          (apply format #t fmt args)
          (newline))
         (else
          (format #t
                  "~A:~A: ~A~%"
                  (xml-error-descr descr #:line)
                  (xml-error-descr descr #:column)
                  (xml-error-string (xml-error-descr descr #:error-code)))
          (if (xml-error-descr descr #:has-context?)
              (let ((ctx-text (xml-error-descr descr #:context))
                    (ctx-pos (xml-error-descr descr #:error-offset)))
                (format #t
                        "Context (^ marks the point): ~A^~A~%"
                        (substring ctx-text 0 ctx-pos)
                        (substring ctx-text ctx-pos))))
          (exit 1)))))))
When applied to the same input document as in the previous example, this code produces:
$ guile -s examples/xml-check.scm < input.xml
2:8: not well-formed (invalid token)
Context (^ marks the point): <input> <ref a=^1/>
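When only a yes or no answer is needed, the same error key can be caught to turn the parser into a predicate. This is a minimal sketch, assuming that Gamma is installed and that xml-parse accepts the input port as the argument following the parser; `well-formed?` is a hypothetical name.

```scheme
(use-modules ((gamma expat)))

;; Hypothetical helper: return #t if TEXT parses without an XML error,
;; #f if the parser signals gamma-xml-error.
(define (well-formed? text)
  (catch 'gamma-xml-error
    (lambda ()
      (call-with-input-string text
        (lambda (port)
          ;; Assumption: the port is the second argument of xml-parse.
          (xml-parse (xml-make-parser) port)))
      #t)
    (lambda (key . rest)
      #f)))

;; A quoted attribute value, then the unquoted one from the example above.
(display (well-formed? "<input><ref a=\"1\"/></input>"))
(newline)
(display (well-formed? "<input><ref a=1/></input>"))
(newline)
```

Errors signalled under other keys are deliberately not caught, so that unrelated problems are not reported as malformed XML.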
This section describes all available element handlers. For clarity, each handler is described in its own subsection. For each handler, we indicate a keyword that is used when registering this handler and the handler prototype.
To register handlers, use the xml-make-parser or
xml-set-handler functions. See section Creating XML Parsers for a
detailed discussion of these functions.
Sets handler for start (and empty) tags.
The handler must be defined as follows:
A list of element attributes. Each attribute is represented by a cons (‘car’ holds attribute name, ‘cdr’ holds its value).
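Because the attributes form an association list, the standard list procedures can be used to look values up. The sketch below assumes that attribute names and values are strings; `attr-ref` is a hypothetical helper, and the handler is called by hand so that the snippet runs without Gamma.

```scheme
;; Hypothetical helper: return the value of attribute NAME in ATTRS,
;; or DEFAULT if the attribute is not present.
(define (attr-ref attrs name default)
  (let ((entry (assoc name attrs)))
    (if entry
        (cdr entry)
        default)))

;; Start handler sketch: report the "id" attribute of every element
;; that has one.
(define (elt-start name attrs)
  (let ((id (attr-ref attrs "id" #f)))
    (when id
      (format #t "~A has id ~A~%" name id))))

;; Simulated call for <ref id="r1" a="1"/>.
(elt-start "ref" '(("id" . "r1") ("a" . "1")))
```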
Sets handler for end (and empty) tags. An empty tag generates a call to both start and end handlers (in that order).
The handler must be defined as follows:
Sets a text handler. A single block of contiguous text free of markup may result in a sequence of calls to this handler. So, if you are searching for a pattern in the text, it may be split across calls to this handler.
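One way to cope with split text, sketched here, is to buffer the pieces and search only once the whole run has been collected, for instance from the end element handler. The sketch assumes that the text handler receives the text as a single string argument; `on-text` and `take-text!` are hypothetical names, and the calls are simulated so that the snippet runs without Gamma.

```scheme
;; Pieces of text collected so far, most recent first.
(define chunks '())

;; Text handler: TEXT may be only a fragment of a longer run of text.
(define (on-text text)
  (set! chunks (cons text chunks)))

;; Join the collected fragments in arrival order and reset the buffer.
(define (take-text!)
  (let ((whole (string-concatenate (reverse chunks))))
    (set! chunks '())
    whole))

;; Simulate one run of text delivered in two calls, with the pattern
;; "haystack" straddling the boundary.
(on-text "needle in a hay")
(on-text "stack")

(define text (take-text!))
(display text)
(newline)
(display (if (string-contains text "haystack") "found" "not found"))
(newline)
```

Searching each fragment separately would miss the pattern in this case, which is why the search is deferred until the fragments are joined.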
The handler itself is defined as:
Set a handler for processing instructions.
First word in the processing instruction.
The rest of the characters in the processing instruction, after target and whitespace following it.
Sets a handler for comments.
The text inside the comment delimiters.
Sets a handler that gets called at the beginning of a CDATA section.
The handler is defined as follows:
Sets a handler that gets called at the end of a CDATA section.
The handler is defined as:
Sets a handler for any characters in the document which wouldn't otherwise be handled. This includes both data for which no handlers can be set (like some kinds of DTD declarations) and data which could be reported but which currently has no handler set.
A string containing all non-handled characters, which are passed exactly as they were present in the input XML document except that they will be encoded in UTF-8 or UTF-16. Line boundaries are not normalized. Note that a byte order mark character is not passed to the default handler. There are no guarantees about how characters are divided between calls to the default handler: for example, a comment might be split between multiple calls. Setting the ‘default’ handler has the side effect of turning off expansion of references to internally defined general entities. Such references are passed to the default handler verbatim.
This sets a default handler as above, but does not inhibit the expansion of internal entity references. Entity references are not passed to the handler.
The handler prototype is the same as in default-handler.
Set a skipped entity handler, i.e. a handler which is called if:
Name of the entity.
This argument is
#t if the entity is a parameter entity, and #f otherwise.
Set a handler to be called when a namespace is declared.
Set a handler to be called when leaving the scope of a namespace declaration. This will be called, for each namespace declaration, after the handler for the end tag of the element in which the namespace was declared.
The handler prototype is:
Sets a handler that is called for XML declarations and also for text declarations discovered in external entities.
Version specification (string), or
#f, for text declarations.
Encoding. May be #f.
‘Unspecified’, if there was no standalone parameter in the
declaration. Otherwise, #t or #f depending on whether
it was given as ‘yes’ or ‘no’.
Set a handler that is called at the start of a ‘DOCTYPE’ declaration, before any external or internal subset is parsed.
System ID. May be #f.
Public ID. May be #f.
#t if the ‘DOCTYPE’ declaration has an internal subset, #f otherwise.
Set a handler that is called at the end of a ‘DOCTYPE’ declaration, after parsing any external subset.
The handler takes no arguments:
Sets a handler for ‘attlist’ declarations in the DTD. This handler is called for each attribute, which means, in particular, that a single attlist declaration with multiple attributes causes multiple calls to this handler.
The handler prototype is:
Name of the element for which the attribute is being declared.
Default value, if the attribute is ‘#FIXED’;
#t, if it is a ‘#REQUIRED’ attribute; and
#f, if it is a ‘#IMPLIED’ attribute.
Sets a handler that will be called for all entity declarations.
For parameter entities,
For internal entities, entity value. Otherwise, #f.
System ID. For internal entities, #f.
Public ID. For internal entities, #f.
Notation name, for unparsed entity declarations. Otherwise,
#f. Unparsed entity declarations are those that have a notation
(‘NDATA’) field, such as:
<!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
Sets a handler that receives notation declarations.
Handler prototype is:
Sets a handler that is called if the document is not standalone, i.e. when there is an external subset or a reference to a parameter entity, but the XML declaration does not have ‘standalone’ set to "yes".
The handler takes no arguments:
Return the version of the expat library as a string.
(xml-expat-version-string) ⇒ "expat_2.0.1"
Return the version of the expat library as a triplet: ‘(major minor micro)’.
(xml-expat-version) ⇒ (2 0 1)
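The triplet form is convenient for version checks. The following sketch compares two triplets element by element; `version>=?` is a hypothetical helper, and the literal list stands in for the value shown in the example above so that the snippet runs without Gamma.

```scheme
;; Hypothetical helper: #t if version list V is at least MINIMUM,
;; comparing major, then minor, then micro.
(define (version>=? v minimum)
  (cond ((or (null? v) (null? minimum)) #t)   ; all components equal
        ((> (car v) (car minimum)) #t)
        ((< (car v) (car minimum)) #f)
        (else (version>=? (cdr v) (cdr minimum)))))

;; '(2 0 1) stands in for the result of (xml-expat-version).
(display (version>=? '(2 0 1) '(2 0 0)))
(newline)
(display (version>=? '(2 0 1) '(2 1 0)))
(newline)
```

With Gamma loaded, the check would read (version>=? (xml-expat-version) '(2 0 0)).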
Pass current markup to the default handler (see section default-handler). This function may be called only from a callback handler.
Return a textual description corresponding to the code argument. See catching gamma-xml-error, for an example of using this function.
Return number of the current input line in parser. Input lines are numbered from ‘1’.
Return number of column in the current input line.
Return the number of bytes in the current event. Returns ‘0’ if the event is inside a reference to an internal entity, and for the end-tag event of empty-element tags (the latter can be used to distinguish empty-element tags from empty elements written with separate start and end tags).
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.