The elephant in the living room, or: The case for an open source C++ parser
Evaluating Existing Approaches
Potential SoC Mentors
Note: This document is a draft; we will review it over the coming days and weeks. Please feel free to make whatever additions you consider useful!
A full ISO C++ parser, implemented in or based on Boost.Spirit, has been requested numerous times over the last 5 years in dozens of discussions on the Spirit and Boost mailing lists (where it is still brought up on a regular basis), as well as in various related feature requests.
If you check the Spirit mailing list archives for discussions about a Spirit C++ parser, you will probably agree that this is one of the most frequently requested features (if not the most frequently requested one) for the Boost.Spirit parser framework library. This is mainly because there is currently not a single free/open source tool, framework or library available that can properly parse ISO C++ and provide the complete parsed data in tree format for further evaluation and processing.
While several Spirit users have attempted to address the need for an open C++ parser using Spirit, little progress has been made so far. It has turned out to be a pretty complex job: on the one hand, Spirit may not yet offer all the infrastructure code required to make an implementation based solely on Spirit feasible; on the other hand, there are the various difficulties of parsing a language as ambiguous as C++.
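To illustrate the kind of ambiguity meant here, consider the well-known statement "a * b;": it declares a pointer named b if a names a type, and it multiplies two values otherwise. The following is a minimal sketch only; all names in it are hypothetical and not part of Spirit or of any existing parser. It shows that a C++ parser has to consult semantic information (here reduced to a plain set of known type names) before it can even decide what kind of statement it is looking at.

```cpp
#include <set>
#include <string>

// Hypothetical names, for illustration only.
enum class StatementKind { Declaration, Expression };

// Classifies a statement of the shape "<lhs> * <rhs>;".
// The token sequence alone is not enough: the answer depends on whether
// <lhs> currently names a type, which only the symbol table can tell us.
StatementKind classify_star_statement(const std::set<std::string>& known_types,
                                      const std::string& lhs) {
    return known_types.count(lhs) != 0 ? StatementKind::Declaration
                                       : StatementKind::Expression;
}
```

With a in the set of known types, "a * b;" is classified as a declaration; with an empty set, the very same tokens are classified as an expression. A real parser would have to maintain this information per scope and also deal with templates, which is a large part of what makes the job hard.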
So, while requests for an open source C++ parser have usually been communicated via the Spirit mailing lists, they are in no way specific or limited to Spirit users. Rather, there is apparently a whole population of C++ developers hoping to finally get access to an open source C++ parser that they can use to build tools that process C++ source code in an abstract fashion, thereby facilitating new tools and technologies that advance software development. The repeated requests thus reflect a significant demand for such a tool/library.
This, again, can mainly be attributed to the lack of a generic, ISO compliant C++ parser framework. The vast majority of advanced (read: working!) C++ parsers are not standalone parsers; they are usually tightly integrated with compilers or other complex pieces of software and thus do not lend themselves to immediate use in other scenarios, even though they have access to the required data, because the corresponding data structures were never designed to be exposed via standardized interfaces for use in other applications.
So, while there are a few commercial C++ re-engineering frontends available for industrial use (often with a pretty limited scope), even fewer vendors provide standalone C++ parsers that could be used to develop new C++ source code processing software. Moreover, the few standalone parsers available are extremely costly products and thus not well suited to low-budget projects, not to mention open source software.
So, even apart from the aforementioned Spirit related feature requests and mailing list discussions, there's still high (not to say daily increasing) demand for such a C++ Parser framework/library:
There are numerous open source projects that would sooner or later benefit directly from such a tool/library: pretty much every big established IDE project that supports C++ (e.g. Eclipse, KDevelop, CodeBlocks, Anjuta) uses some sort of self-made minimalistic C++ parser, e.g. to provide syntax highlighting, symbol lookup and auto-completion, or to compute class hierarchy diagrams and caller trees. There is a whole variety of potential application areas!
Likewise, source code documentation systems such as the popular Doxygen tool depend on having reliable (and complete) meta information available for the source code they process. This meta information can only be obtained by parsing and analyzing the affected source code.
Given the current lack of an open source, ISO compliant C++ parser framework, source code processing projects like Doxygen have to resort to compromises and have, for example, come up with their own implementation of a C++ parser. Such an implementation is, again, highly specific to the requirements of the corresponding project and thus only of limited use for other purposes/projects.
While this is only one example, it illustrates quite well the predominant dilemma of projects that (try to) deal with C++ code on the source level: rather than being able to concentrate immediately on the actual scope of the original project, such projects usually have to come up with some sort of custom C++ parser first, or (if lucky) do plenty of research to find open source approaches to parsing and analyzing C++ that sufficiently fulfill their own requirements or can at least be modified accordingly.
Additionally, it is crucial to keep in mind that parser writing is a science in itself (particularly for a language as complex as C++) and that the required expertise is only rarely available. Developers will usually have to become familiar both with the details of the C++ language and with the various approaches to writing a parser/lexer, both of which are inherently complex domains. This means that developers are forced to work outside their actual problem domain and thus may not be as efficient, capable or motivated as they would be if they could concentrate directly on the scope of their own project.
Essentially, this drains important development resources (manpower and time) from the original project while providing, even under ideal circumstances, only a limited compromise compared to having a full ISO C++ compliant parser available.
However, pretty much all of the existing custom (open source) approaches to parsing and analyzing C++ are very specific to each project's individual requirements and goals. They are (understandably) also very restricted in usability and functionality, e.g. because they do not expose well-defined interfaces. Having been developed with a very specific problem domain in mind, they do not lend themselves well to more generic application or generalization, and are thus of limited use for purposes other than those they were originally conceived for. What is more, these custom approaches often have shortcomings of their own and, for example, do not even fulfill some of their own project's most fundamental requirements.
Another potential problem is project-specific dependencies (e.g. big libraries) that may not be available, suitable or acceptable in a different setting or project. In fact, some of the projects that deal with C++ source code processing (e.g. ANTLR (Java), Source Navigator (Tcl/Tk)) do not even employ C/C++ for the necessary source code parsing or analysis. Thus, even if such a project provides the required, reusable functionality, the overhead of integrating it with other projects may simply be too significant to be feasible, e.g. because it might cause the host project to no longer scale well or be too inconvenient to set up for non-developers.
All of this renders most currently available approaches to parsing C++ basically unusable for other purposes or projects, so that dozens of projects have to keep trying to "re-invent the wheel", each addressing merely its own specific requirements, only to come up with yet another solution that is neither particularly generic or modular, nor reusable.
All software that deals with C++ sources could eventually benefit from an open source C++ parser. There are many potential usage scenarios; the following is just a short list of examples:
etc. (feel free to add ideas!)
While such tools would have different requirements and scopes and may in fact appear unrelated, the common denominator is that all of them would first require an ISO C++ compliant parser in order to obtain an abstract representation of the processed source code.
We need a C++ parser that's able to parse ISO C++ in order to:
A Generic Solution for a General Problem
Amongst others, we feel that the Boost.Spirit parser framework library is an extremely suitable candidate for this effort, mainly for the following reasons:
As mentioned previously, even almost 8 years after the C++ programming language was first standardized, there are still no advanced re-engineering (or, more generally, metaprogramming) tools for C++ source code commonly available. With the exception of a few proprietary products with extremely limited functionality or scope, this holds in both the open source and the closed source world.
This can be considered a major shortcoming, as metaprogramming support in other mainstream programming languages such as Java has meanwhile been available for several years.
Obviously, there is a strong and growing demand for C++ re-engineering tools for the most important programming language in use today. However, the few commercial products meant to fill this void in the commercial world are certainly not viable options for non-commercial settings, and for open source software in particular.
As has also been pointed out, the underlying problem is the lack of an open C++ parser framework, which in turn is due to the complexities of fully parsing the C++ programming language. The ability to properly and completely parse C++ source code is the most fundamental prerequisite for dealing with C++ source code in an abstract fashion, and ultimately for enabling C++ source-level metaprocessing.
While there were (and are) dozens of approaches of varying scope trying to satisfy the demand for metaprogramming tools, there has so far not been a single comprehensive open source C++ parser framework available. Even though there are admittedly a number of projects that provide useful functionality, these approaches are again far too specific, restricted and hardly reusable. This is why many of the tools that depend on C++ source code processing capabilities try to implement their own custom C++ parsers, concentrating again only on very specific functionality.
While all this is in itself quite an unfortunate situation, it is also an exceptional chance for open source software to fill this very void while setting standards at the same time. This is exactly where open source software can prove its merits and leverage all its power, simply because there are currently no equivalent solutions available. This creates many exciting new possibilities.
However, this will require some serious pioneering work, so that we can actually create the basis for future C++ metaprogramming applications.
We feel that it is now time to bring this dilemma to an end and solve the problem on a broader scale, by finally giving OSS developers the option to use one generic and comprehensive approach to parsing C++ properly: an approach that has been designed from the beginning to be modular, portable, extensible and reusable, without requiring unreasonable external dependencies.
While it is correct to say that only a few tools are likely to be interested in all of the information that such a comprehensive parser could provide, we think the advantages of using a full ISO C++ parser would by far outweigh the potential disadvantages, particularly when compared with the various approaches currently in use. Additionally, the final integration and runtime overhead should be negligible if an approach is pursued that provides data on demand, on a sort of "subscription" basis where a host application can request which data is to be made available.
Ultimately, this would ensure that there is one generic and comprehensive solution, rather than dozens of highly limited and hardly reusable approaches. That is, one working C++ parser framework could easily satisfy all of the requirements of dozens of C++ source code processing tools.
While we are aware of the complexity involved in actually making this happen, we are sure that this is an important step towards making a whole variety of new metaprogramming software for C++ developers possible, because an open source ISO C++ parser can probably be considered an essential milestone on that path.
Thus, we are convinced that the right thing to do now is to start discussing the steps needed to establish the required infrastructure in Spirit in order to make a Spirit based ISO C++ parser possible. By concentrating on establishing the required low level infrastructure in the Spirit framework library, we will ensure an elegant design and robust architecture, so that we don't end up with a hardly maintainable parser. At the same time, this will also result in the Spirit library being improved accordingly.
While this approach might mean that we don't end up with a usable parser anytime soon, it also ensures a robust implementation. Apart from that, even a generic parser that, during its development, supports only a subset of the required features is better than no parser at all, and also better than parsers that cannot be easily employed. This applies in particular if that parser has been designed to be modular and extensible, because potential users could contribute features back to the parser.
TODO: design, ideas, potential problems etc.
Roadmap: (this is preliminary, feel free to augment this)
Ideas: How about a subscription based approach, so that backends can subscribe to specific details (e.g. parse tree, AST, ASG)?
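As a starting point for discussion, the subscription idea could be sketched roughly as follows. This is not a proposed Spirit interface: the names Detail, Frontend, subscribe, wanted and process are all hypothetical, and the "trees" are stand-in strings rather than real data structures. The point it tries to show is only that a representation nobody subscribed to is never built, which is where the low overhead would come from.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; the payload is a placeholder string,
// not a real parse tree, AST or ASG.
enum class Detail { ParseTree, AST, ASG };

class Frontend {
public:
    using Callback = std::function<void(const std::string&)>;

    // A backend registers interest in one kind of representation.
    void subscribe(Detail detail, Callback callback) {
        subscribers_[detail].push_back(std::move(callback));
    }

    // True if at least one backend asked for this representation.
    bool wanted(Detail detail) const {
        const auto it = subscribers_.find(detail);
        return it != subscribers_.end() && !it->second.empty();
    }

    // Builds only the subscribed representations and hands each one to its
    // subscribers. Returns how many representations were actually built.
    int process(const std::string& source) const {
        int built = 0;
        for (Detail detail : {Detail::ParseTree, Detail::AST, Detail::ASG}) {
            if (!wanted(detail)) continue;  // skip work nobody asked for
            const std::string payload = build(detail, source);
            ++built;
            for (const Callback& callback : subscribers_.at(detail)) callback(payload);
        }
        return built;
    }

private:
    // Placeholder for the real (expensive) construction step.
    static std::string build(Detail detail, const std::string& source) {
        switch (detail) {
            case Detail::ParseTree: return "PT(" + source + ")";
            case Detail::AST:       return "AST(" + source + ")";
            case Detail::ASG:       return "ASG(" + source + ")";
        }
        return std::string();
    }

    std::map<Detail, std::vector<Callback>> subscribers_;
};
```

A documentation tool might subscribe only to the AST, while a refactoring tool might additionally request the ASG; in both cases the frontend stays the same and only the set of subscriptions differs.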
We need a concerted effort to start implementing the required infrastructure in Spirit in order to be able to create a Spirit based parser for the C++ programming language.
Goals: The goal is to come up with an implementation that parses ISO C++ in order to provide said trees (PT/AST/ASG).
Requirements: C++, STL, C++ Template Meta Programming, familiarity with Boost.Spirit, markup languages (XML)
Related Boost Efforts: Spirit.Wave, Synopsis, Boost.Tree, Hartmut Kaiser's Spirit C Parser
Ideas: GXL looks like a worthwhile thing to support with such an effort. Alternatively, the possibility to serialize the tree to XML or some sort of RDF markup would also seem like a good idea.
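To make the XML idea a bit more concrete, here is a minimal sketch of serializing a tree to XML. It assumes a deliberately simplified, hypothetical Node type (a kind, a name and child nodes) and emits plain ad-hoc XML; it does not produce GXL or RDF, and the element and attribute names are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, heavily simplified tree node.
// The kind is used as the element name and is assumed to be a valid XML name.
struct Node {
    std::string kind;            // e.g. "class" or "function"
    std::string name;            // identifier as written in the source
    std::vector<Node> children;  // nested declarations
};

// Escapes the characters that are not allowed verbatim in an attribute value.
std::string escape_xml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;        break;
        }
    }
    return out;
}

// Serializes a node and its children, indenting two spaces per level.
std::string to_xml(const Node& node, int depth = 0) {
    const std::string pad(static_cast<std::size_t>(depth) * 2, ' ');
    std::string out = pad + "<" + node.kind + " name=\"" + escape_xml(node.name) + "\"";
    if (node.children.empty()) return out + "/>\n";
    out += ">\n";
    for (const Node& child : node.children) out += to_xml(child, depth + 1);
    return out + pad + "</" + node.kind + ">\n";
}
```

Note that names such as "Vec<T>" have to be escaped, which is one of the small details any serialization backend for C++ would need to handle.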
Please feel free to augment this entry.
(If you would be willing to mentor such an effort, please add yourself to the list of potential mentors)
TODO: we will have to populate this some more; there are really numerous archived discussions. Simply search the archives for "C++ parse/r"; the following topics should bring up dozens of relevant discussions: General-List:
Related Mailing List Discussions: Spirit-General:
Related Feature Requests:
A Spirit/Wave related attempt
Projects that would benefit from such a lib