BOOST WIKI | RecentChanges | Preferences | Page List | Links List

boost::xml - Discussion of requirements for a new XML processing library

Some thoughts posted by People/ChrisRussell:

I have a specific problem that I'm working out that involves a lot of XML processing. My set of requirements is perhaps a bit more involved than we should try to tackle for a first go at this. But I throw them out to start the discussion...

Validation against schema should be optional

In my project, I use XML-encoded data in two distinct contexts: internal to the program, and externally. Internally, I'm typically working with a well-known and fixed data schema and don't want to incur the overhead of formal validation (a well-formed XML document is fine). The project additionally sources and sinks XML data from external sources. It would be nice to be able to validate these against a schema.

The boost::xml library should be flexible enough to allow for both non-validating and validating deserialization of XML data.

Currently I use the MIT/X-licensed non-validating [expat] XML Parser Toolkit written by James Clark to generate a series of callbacks that I use to build a directed graph representation of the document using the BGL. I'll expound further on this approach if it's not met with overwhelming resistance. It's super fast, and super flexible. Using BGL's visitor concept, I can convert the graph representation created by my expat callbacks directly into a structure containing STL containers. Because of the way I've implemented the visitor (and the mapping tables it uses), I essentially get lightweight schema validation without ever formally dealing with an external schema document (the schema is fixed when I create the maps passed to the visitor).
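
To make the pipeline above concrete, here is a minimal sketch of the callback-to-graph-to-containers idea. It is not the implementation described above, only an illustration under stated assumptions: a plain vector of vertices stands in for a BGL graph, the GraphBuilder methods stand in for expat's start/end/character-data handlers, and a recursive function stands in for a BGL visitor. All names (DocGraph, GraphBuilder, FieldMap, Output, extract) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a BGL graph: one vertex per element, with
// out-edges pointing from a parent to its children.
struct DocGraph {
    struct Vertex {
        std::string name;                   // element name
        std::string text;                   // accumulated character data
        std::vector<std::size_t> children;  // indices of child vertices
    };
    std::vector<Vertex> vertices;
};

// Receives SAX-style events (the role played by the parser callbacks) and
// grows the graph. The stack holds the currently open elements.
class GraphBuilder {
public:
    explicit GraphBuilder(DocGraph& g) : graph_(g) {}

    void start_element(const std::string& name) {
        std::size_t v = graph_.vertices.size();
        graph_.vertices.push_back(DocGraph::Vertex{name, "", {}});
        if (!open_.empty())
            graph_.vertices[open_.back()].children.push_back(v);
        open_.push_back(v);
    }
    void character_data(const std::string& s) {
        assert(!open_.empty());
        graph_.vertices[open_.back()].text += s;
    }
    void end_element() {
        assert(!open_.empty());
        open_.pop_back();
    }

private:
    DocGraph& graph_;
    std::vector<std::size_t> open_;
};

// Mapping table: element name -> key under which leaf text is stored.
using FieldMap = std::map<std::string, std::string>;
using Output = std::map<std::string, std::vector<std::string>>;

// Depth-first walk from vertex v. Any element missing from the table is
// rejected, so the table acts as a fixed, lightweight schema.
void extract(const DocGraph& g, std::size_t v,
             const FieldMap& fields, Output& out) {
    const DocGraph::Vertex& node = g.vertices[v];
    FieldMap::const_iterator it = fields.find(node.name);
    if (it == fields.end())
        throw std::runtime_error("unexpected element: " + node.name);
    if (node.children.empty())
        out[it->second].push_back(node.text);  // only leaves carry data here
    for (std::size_t child : node.children)
        extract(g, child, fields, out);
}
```

Feeding the builder an "order" element with two "item" children and then calling extract with a table that maps "item" to "items" collects both texts under out["items"]. Removing "item" from the table makes the same document fail, which is the validation effect described above.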

Currently I just skim the output structure off and free the graph when I'm done. However, keeping the object around and maintaining the data in a graph is an approach that should be carefully considered because it makes several other features I'd like to see in boost::xml quite easy to implement.

Some have suggested that Spirit, and not expat, be used. I haven't used Spirit yet. I'm concerned that Spirit will be much larger than expat. Personally, I don't see the problem with using a small, proven, non-validating parser as a front end, but I am interested in hearing what other people have to say about it. (Note that I have my own ideas about how to validate against a schema that use BGL visitors and maps. This is why expat works for me: all I care about is that the XML document is well-formed, and expat tells me this. Parenthetically, validation against a schema in my scheme requires that the maps fed to the BGL visitor be created from the schema document. Spirit is probably an excellent choice for parsing the schema document in order to create these maps.)

File under "would be nice to have"

That's enough for now. Hack up this page and add your comments. A boost::xml library would be very useful to me and many others I think.

When I think of an 'XML library', I think of much more than just parsing xml files. The DOM API in particular involves in-memory manipulation of a document tree, and the specs suggest a specifically optimized internal structure to make access efficient.

So what's really at stake here is a tree interface and implementation that can support DOM-like manipulation (node insertion, removal, xpath-based node lookup, etc., etc.). The parser is only a small part of it.

As XML and co. is quite a huge set of specs, I wouldn't dare suggest creating yet another implementation. Rather, I'm suggesting that a C++-like API be built that can wrap existing implementations, such as libxml2 (http://xmlsoft.org).

 -- Stefan Seefeld

Hi Stefan, I think you've codified my point for me. Basically what I'm saying is that the parser is a front-end (it could be expat, for example). Of course we're all free to download the [Apache Xerces C++ Parser], but it's huge. Given some non-validating XML parser front-end, I think the C++ interface you suggest is most naturally supported by a BGL representation of the document tree, which is quite trivial to construct given a series of events from the front-end parser. And once the data is in a graph, you can re-arrange it easily by adding/removing edges, and quite elegantly expand the scope of the library by adding algorithms using the BGL visitor concept.

- Chris
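
As an illustration of the "re-arrange by adding/removing edges" point, here is a minimal sketch of moving a subtree to a new parent. It deliberately uses a plain index-based tree rather than BGL, and every name in it (Tree, Node, add_node, move_subtree, no_parent) is hypothetical. The one subtlety it shows is that a document tree must stay a tree, so a move that would place a node under its own descendant has to be refused.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Marker for "this node has no parent" (the root).
const std::size_t no_parent = static_cast<std::size_t>(-1);

struct Node {
    std::string name;
    std::size_t parent;                 // index of parent, or no_parent
    std::vector<std::size_t> children;  // out-edges to children
};

struct Tree {
    std::vector<Node> nodes;
};

// Adds a vertex and, unless it is the root, the edge parent -> vertex.
std::size_t add_node(Tree& t, const std::string& name, std::size_t parent) {
    assert(parent == no_parent || parent < t.nodes.size());
    std::size_t v = t.nodes.size();
    t.nodes.push_back(Node{name, parent, {}});
    if (parent != no_parent) t.nodes[parent].children.push_back(v);
    return v;
}

// True if a is v itself or lies on the path from v up to the root.
bool is_ancestor_or_self(const Tree& t, std::size_t a, std::size_t v) {
    for (; v != no_parent; v = t.nodes[v].parent)
        if (v == a) return true;
    return false;
}

// Removes the edge old_parent -> v and adds the edge new_parent -> v.
void move_subtree(Tree& t, std::size_t v, std::size_t new_parent) {
    if (v >= t.nodes.size() || new_parent >= t.nodes.size())
        throw std::out_of_range("no such node");
    if (is_ancestor_or_self(t, v, new_parent))
        throw std::invalid_argument("move would create a cycle");
    std::size_t old_parent = t.nodes[v].parent;
    if (old_parent != no_parent) {
        std::vector<std::size_t>& sibs = t.nodes[old_parent].children;
        sibs.erase(std::remove(sibs.begin(), sibs.end(), v), sibs.end());
    }
    t.nodes[new_parent].children.push_back(v);
    t.nodes[v].parent = new_parent;
}
```

With a root "doc" holding "a" and "b", and "c" under "a", calling move_subtree(t, c, b) leaves "a" childless and makes "c" the only child of "b". A follow-up move_subtree(t, b, c) throws, because "c" is now a descendant of "b".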

Hi Chris, I'm not arguing about the possibility to internally represent a dom tree using BGL. However, lots of tree manipulations can be highly optimized taking the semantical specifics of xml and related standards into account (xpath, xml namespaces, xinclude, xlink, etc.). I doubt you can get as efficient (speed and memory wise) with a generic graph library as you can get with a domain specific implementation such as libxml2.

[Actually, it may be an interesting experiment: the examples I include in my submission are fairly small. Could you rewrite them with a dom tree (manually) built using BGL? Could you measure performance for things like xpath lookup or node insertion (respecting all the specs, such as namespace adjustments, etc.)?]

 -- Stefan Seefeld
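
For anyone wanting to try the measurement suggested above, here is a minimal sketch of the kind of lookup in question, written against a generic tree. It is an assumption-laden toy: it handles only absolute paths made of plain child steps such as "/doc/section/title", with no predicates, axes or namespaces, so it is nowhere near full XPath and is not how libxml2 does it. The names Element, split_path, collect and find_all are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Generic tree node: an element name plus its child elements.
struct Element {
    std::string name;
    std::vector<Element> children;
};

// Splits "/a/b/c" into {"a", "b", "c"}; empty steps are skipped.
std::vector<std::string> split_path(const std::string& path) {
    std::vector<std::string> steps;
    std::istringstream in(path);
    std::string step;
    while (std::getline(in, step, '/'))
        if (!step.empty()) steps.push_back(step);
    return steps;
}

// Matches e against steps[depth]; on the last step records e, otherwise
// descends into every child. Cost grows with the number of nodes visited.
void collect(const Element& e, const std::vector<std::string>& steps,
             std::size_t depth, std::vector<const Element*>& out) {
    assert(depth < steps.size());
    if (e.name != steps[depth]) return;
    if (depth + 1 == steps.size()) {
        out.push_back(&e);
        return;
    }
    for (const Element& child : e.children)
        collect(child, steps, depth + 1, out);
}

// Returns pointers to all elements reached by the path, in document order.
std::vector<const Element*> find_all(const Element& root,
                                     const std::string& path) {
    std::vector<const Element*> out;
    std::vector<std::string> steps = split_path(path);
    if (!steps.empty()) collect(root, steps, 0, out);
    return out;
}
```

A domain-specific implementation could beat this naive walk, for example by indexing element names, which is the sort of difference a benchmark would have to expose.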

... lots of tree manipulations can be highly optimized taking the semantical specifics of xml and related standards into account (xpath, xml namespaces, xinclude, xlink, etc.)

Stefan, it seems to me that we're ultimately dealing with a tree of vertices that correspond to entities in the XML document. Remember that BGL is generic like the STL is generic. That is, once you compile the container it stores specific types of data in an extremely efficient manner (underlying storage is provided by STL containers). I'm having trouble imagining any traversal or edge/vertex insert/remove operation that might be required by any of the XML-related specifications you cite that couldn't be handled with extreme efficiency and elegance by the BGL.

Could you rewrite them with a dom tree (manually) built using BGL?

Yes - I'm swamped right now trying to get a product demo running but will be able to do this early in June. No big deal to work up some simple examples that use expat to create the graph and visitor algorithms to operate on it. It will be a small amount of work to extract my current stuff from its present context and package it as a little toy for us to play with. Note: I think all these navigation- and indexing-related XML specifications are fairly trivial to get going. I don't have a good story about how to do XSLT transforms yet though. Something to think about.

This is a great discussion. Keep it going.

- Chris


We started to work on an XML library based on the C++ iostreams specification two and a half years ago, a little before we discovered the boost libraries. In the meantime, the library has come to a point where the main features of the base layer have been developed, and simultaneously we have become enthusiastic users of the boost libraries. Slowly, we have come to the conclusion that we should submit the library to boost.

We do not think that the work can undergo a formal review yet, but we feel it has reached a stage where we can reasonably expect comments, criticism and maybe help from other developers.

The library is called XiMoL (XML input/output) and is compliant with the C++/STL streams specification. It introduces a new type of stream (xiostream) that derives from wiostream and is used for XML input/output. The library is divided into three parts.

A first part tackles character encoding. We rely on the iconv library from GNU for the raw functionality, which has been wrapped as a facet called codecvt. The GNU licence may be unacceptable for boost; in that case we could change the underlying library, which would not affect the interface of the facet.

A second part of the library is dedicated to XML parsing. This part could have benefited from Spirit but it was written long before the parser generator appeared on our radar-screen.

Finally, the third part is the API of the library, which consists mainly of functions for stream parsing.
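
To give a flavour of what stream-based XML output can look like, here is a minimal sketch built directly on std::wostream. It is not XiMoL's API and does not use xiostream; escape_text, scoped_element and write_sample are hypothetical names chosen for illustration, and attributes, encodings and name validation are all left out.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Escapes &, < and > for use in element content.
std::wstring escape_text(const std::wstring& s) {
    std::wstring out;
    for (wchar_t c : s) {
        switch (c) {
            case L'&': out += L"&amp;"; break;
            case L'<': out += L"&lt;";  break;
            case L'>': out += L"&gt;";  break;
            default:   out += c;
        }
    }
    return out;
}

// Hypothetical RAII helper: writes <name> when constructed and </name>
// when destroyed, so C++ scopes mirror the nesting of the elements.
class scoped_element {
public:
    scoped_element(std::wostream& os, const std::wstring& name)
        : os_(os), name_(name) {
        assert(!name_.empty());
        os_ << L'<' << name_ << L'>';
    }
    ~scoped_element() { os_ << L"</" << name_ << L'>'; }

    scoped_element(const scoped_element&) = delete;
    scoped_element& operator=(const scoped_element&) = delete;

    // Writes escaped character data inside the element.
    void text(const std::wstring& s) { os_ << escape_text(s); }

private:
    std::wostream& os_;
    std::wstring name_;
};

// Usage example: any std::wostream works; a string stream keeps it local.
std::wstring write_sample() {
    std::wostringstream out;
    {
        scoped_element book(out, L"book");
        scoped_element title(out, L"title");
        title.text(L"Tom & Jerry");
    }  // title is closed first, then book
    return out.str();
}
```

Because the helper only needs a std::wostream reference, the same code could write to a file stream or to a stream with a different codecvt facet imbued in its locale, which is where the character encoding part of such a library would plug in.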

For those who are interested, a CUJ article is due for publication, and for those who cannot wait, XiMoL is available as a CVS module on SourceForge.

We are looking forward to any suggestions and criticism.

Florent Tournois and Cyril Godart.

Revision 2004.05.09 by People/ChrisRussell

In the interest of wrapping up a dangling thread: I started this discussion by proffering ideas about using James Clark's expat C library to drive the creation of a parse tree stored in a BGL graph container. I have recently started working with Spirit and see that this library has much to offer. I need to get further into Spirit before I decide just how bad my BGL-based XML parser solution is. At a minimum, expat is history in favor of a Spirit-based front-end for my current implementation.

I still believe that BGL can be usefully applied to this problem. But it's also likely that Spirit can handle the entire task as well. I'm not prepared to comment on which technique is preferable in terms of performance and code readability - I just haven't gotten that far with Spirit yet.

Parenthetically, my current expat / BGL-based XML parser implementation is part of a larger project destined for SourceForge in the hopefully not-too-distant future (see my profile for more info). Once this larger project is released, I will be able to refer directly to my implementation source in CVS and would be happy to debate the merits of various approaches to XML document parsing then.

- Chris

add your comments/suggestions/flames here...

Last edited December 18, 2004 12:19 pm (diff)
Disclaimer: This site not officially maintained by Boost Developers