blob: 093bba38006c410c0a97733442fe59a12b42c05a [file] [log] [blame]
<title>Treecc: An Aspect-Oriented Approach to Writing Compilers</title>
<body bgcolor="#ffffff">
<h1>Treecc: An Aspect-Oriented Approach to Writing Compilers</h1>
Rhys Weatherley, <a href=""></a>.<p>
Copyright &copy; 2001, 2002 Rhys Weatherley<br>
Verbatim copying and distribution of this entire article is permitted
in any medium, provided this copyright notice is preserved.<p>
This is a copy of an article that was published in
<a href="">Issue 2</a>
of Free Software Magazine.<p>
<h2>1. Introduction</h2>
The C# compiler in Portable.NET <a href="#pnetref">[1]</a> is built on top
of the "Tree Compiler Compiler" (treecc) utility program.
Treecc <a href="#treeccref">[2]</a> is distributed as Free Software
under the terms of the GNU General Public License.<p>
This article provides some background of why treecc came about. It discusses
two common compiler implementation techniques, and the reasons why they
often fail to manage the complexity of large programming languages like C#.<p>
We then discuss a new programming technique called "Aspect-Oriented
Programming". Treecc is an example of applying this technique to managing
the complexity of compiler construction.<p>
<h2>2. Patterns and Compiler Design</h2>
Compiler writing is generally seen as a black art, but in reality it isn't
all that hard. The basic compilation steps are:<p>
<li>Convert the program into an abstract syntax tree.</li>
<li>Perform type-checking and semantic analysis on the tree.</li>
<li>Rearrange the tree to perform optimisations.</li>
<li>Convert the tree into the target code.</li>
The difficulty in writing compilers is not the steps involved, but rather
the sheer number of tiny little details to keep straight. Modern languages
contain large numbers of operators, which are all very similar, but slightly
Add, substract, multiply, divide, remainder, shift left, shift right,
shift right unsigned, bitwise and, bitwise or, exclusive-or, negate,
bitwise not, logical not, logical and, logical or, ...
And that's just the operators. Introduce statements, arrays, type coercion,
method invocation, and class definition, and it becomes very easy to
forget something amongst the forest of code.<p>
In an attempt to control this complexity, two common pattern-based
approaches have arisen over the years: Inheritance and Visitor.<p>
The inheritance pattern can be characterised as follows:<p>
<li>Declare a node type for every syntactic element in the language.
All types ultimately inherit from "<code>Node</code>".</li>
<li>Declare virtual methods in "<code>Node</code>" for
operations on node types: semantic analysis, optimization,
code generation, etc.</li>
<li>Override the virtuals in sub-classes to provide the
compiler implementation.</li>
The visitor pattern can be characterised as follows:<p>
<li>Declare a node type for every syntactic element in the language.
All types ultimately inherit from "<code>Node</code>".</li>
<li>Declare a "<code>Visitor</code>" class with abstract virtual
methods such as "<code>VisitAdd</code>", "<code>VisitSub</code>",
"<code>VisitIf</code>", "<code>VisitFunction</code>",
etc for all of the node types.</li>
<li>Define a "walking procedure" over "<code>Node</code>" objects for
walking around a syntax tree, making callbacks on a supplied
visitor object.</li>
<li>Create multiple classes that inherit from "<code>Visitor</code>",
one for each operation. Implement the "<code>Visit*</code>"
functions for that type of operation.</li>
In the following sections, we will explore why these two patterns often
fail to manage compiler complexity, even when the programmer applies them
<h2>3. The implementation language is your worst enemy</h2>
We will start with the inheritance pattern. Consider that we've written
the following C# code during the "declare all the node types" phase of
the project:<p>
<pre>public class UnaryExpression : Expression
protected Expression expr;
public UnaryExpression(Expression _expr) { expr = _expr; }
public class NegateExpression : UnaryExpression
public NegateExpression(Expression _expr) : base(_expr) {}
public class UnaryPlusExpression : UnaryExpression
public UnaryPlusExpression(Expression _expr) : base(_expr) {}
public class BitwiseNotExpression : UnaryExpression
public BitwiseNotExpression(Expression _expr) : base(_expr) {}
We continue this process for several hundred other node types, gradually
building up the entire syntax tree. This will probably take several weeks
to complete for a substantial language like C#, assuming that we are writing
the parser alongside the node types.<p>
We now want to go in and implement type-checking, so we modify the
"<code>UnaryExpression</code>" class as follows:<p>
<pre>public class UnaryExpression : Expression
protected Expression expr;
public UnaryExpression(Expression _expr) { expr = _expr; }
public override LanguageType TypeCheck()
LanguageType type = expr.TypeCheck();
if(type.IsInteger() || type.IsFloat())
return type;
throw new TypeCheckException();
We've put the common unary expression type-checking code in a common base
class. This makes it easier to maintain because there is only one copy.
We continue the process over the next few weeks and months for all the
other operators, statements, and declarations in the language. So far,
so good.<p>
Unfortunately, we've made a mistake. The "<code>BitwiseNot</code>"
operator is only legal on integer values; not floating-point.<p>
But will we find this bug? It was several weeks ago when we first wrote
the "<code>BitwiseNotExpression</code>" class, and we have since forgotten
all about it. It may even have been written by another programmer on the
team, who has also forgotten all about it. When we build our compiler,
we don't get any errors because the implemention language is perfectly
happy with the above code.<p>
Surely testing will find it? We are building a test suite alongside
the code, right? Unfortunately, that doesn't help either. The test
suite for a major language will be at least as complex as the compiler
itself, and so there is always the temptation to abstract common tests
into common test classes. We've just shifted the bug into the test suite
and given ourselves a false sense of security.<p>
So we keep coding for several more months, adding lots more code. And then
a really nasty bug pops up. The "<code>BitwiseNot</code>" operator is acting
strangely, and we have no idea why. The system is now so complex, with
so many common base classes implementing fallback defaults, that tracking
this down becomes very hard.<p>
The inheritance pattern has a fatal flaw. Adding a new operation entails
a very large maintainence burden, because hundreds of classes must be
modified. This is very error-prone, so we try to abstract details into
common base classes. But this introduces other errors.<p>
The problem basically boils down to semantic analysis: the implementation
language does not have enough knowledge about the application domain to
spot the problem and warn us about it. So it happily compiles the code
and leaves us to hang ourselves on the system's complexity later.
Programming languages are supposed to help us manage complexity, not
make the problem worse!<p>
I wrote a number of compilers using the inheritance approach, and every
single time the complexity killed me. I needed fallback defaults for
code maintainence reasons, but using fallbacks introduced massive numbers
of bugs. I was stuck.<p>
<h2>4. Design patterns aren't always what they are cracked up to be</h2>
After much hair-pulling, I searched <em>Design Patterns</em>
by Gamma, et al <a href="#gammaref">[3]</a>. "<em>Is there a better
way of doing this?</em>". Visitor patterns are the answer:
the book even gives a compiler example.<p>
Visitors solve the "<em>I forgot to implement the <code>BitwiseNot</code></em>"
problem. Because every node type has its own "<code>Visit*</code>" method,
we will get an error when we try to build the compiler without implementing
the operation for a node type. Of course, this assumes that we haven't
been dumb and implemented fallback defaults in the "<code>Visitor</code>"
base class.<p>
Instead of using virtual methods, we can implement visitors using
switch statements over node types. e.g.<p>
case Negate: ...
case UnaryPlus: ...
case BinaryNot: ...
However, switch statements have a similar flaw to inheritance: the
implementation language will not warn us if we forget to put in
a case for every node type. It will happily fall out through the
"<code>default</code>" case with no warning. Tracking down these bugs
can be just as hard as tracking down inheritance fallback bugs.
Using virtual methods is "safer", if not quite as efficient.<p>
Unfortunately, there is a catch with visitors, as explained in
<em>Design Patterns</em>:<p>
<em>Use the Visitor pattern when ... the classes defining the object
structure rarely change, but you often want to define new operations
over the structure. Changing the object structure classes requires
redefining the interface to all visitors, which is potentially
costly. If the object structure classes change often, then it's
probably better to define the operations in those classes.</em>
During the early stages of writing a compiler, the node types change very
frequently. This activates the Achilles heel of the Visitor pattern,
and creates a maintainence nightmare. The book suggests that we should
use the inheritance approach to solve this problem.<p>
<h2>5. So which one do we use? Inheritance or Visitor?</h2>
The inheritance pattern becomes a problem when new operations are needed.
The solution is visitors. Visitors become a problem when new node types
are needed. The solution is inheritance.<p>
What we have is a situation that the design patterns gurus didn't
consider: if the set of nodes and operations are both changing rapidly,
then we will have problems no matter what we do.<p>
We need a solution that combines the strengths of both patterns without
the drawbacks of either. We want the implementation language to catch
us when we forget something, but we also want it to handle large numbers
of nodes and operations smoothly. We want to split different operations
into different modules, but also keep them closely associated with the
node type.<p>
None of the standard patterns provide this combination of functionality,
because none of the existing implementation languages support both styles
of program design at the same time.<p>
<h2>6. Aspect-Oriented Programming</h2>
A new field in language design has emerged in recent years called
"Aspect-Oriented Programming" (AOP). A good review of the field
can be found in the October 2001 issue of the "<em>Communications of
the ACM</em>" <a href="#aopref">[4]</a>, and on the AspectJ Web site
<a href="#aspectjref">[5]</a>.<p>
The following excerpt from the introduction to the AOP section in the
CACM issue describes the essential aspects of AOP, and the difference
between OOP and AOP:<p>
<em>AOP is based on the idea that computer systems are better programmed
by separately specifying the various concerns (properties or areas
of interest) of a system and some description of their relationships,
and then relying on mechanisms in the underlying AOP environment to
weave or compose them together into a coherent program. ...
While the tendancy in OOP's is to find commonality among classes
and push it up the inheritance tree, AOP attempts to realize
scattered concerns as first-class elements, and eject them
horizontally from the object structure.</em>
Aspect-orientation gives us some hope of solving our compiler
complexity problems. When we moved from the inheritance pattern
to the visitor pattern, we were attempting to eject the operations
horizontally. But it didn't quite work as well as we had hoped: the
intrinsic complexity of the set of nodes kept interfering. AOP
allows us to take the idea further, without re-introducing the problems
that visitors have.<p>
We can view each operation on node types (semantic analysis,
optimization, code generation, etc) as an "aspect" of the compiler's
construction. The AOP language weaves these aspects with the node
types to create the final compiler.<p>
However, we don't really want to invent a completely new programming
language for compiler construction. We would have to implement this
new language using all of the flawed techniques that makes writing
compilers hard. It's a classic chicken and egg problem: we don't
want to replace a buggy compiler with a buggy compiler implementation
We can strike a compromise, similar to that used by lex and yacc.
Those tools use a custom syntax for the difficult parts, and a
pre-existing underlying language (usually C) to implement everything
else. The code is expanded by the tool and then compiled with the
underlying language's compiler.<p>
Treecc uses a domain-specific aspect-oriented programming language for
declaring and managing nodes and operations, and uses an underlying language
to implement the body of the operations. C, C++, C#, or Java can be used
as the underlying language, depending upon your personal preference.<p>
Treecc is about 13,000 lines of code in size, which is relatively
easy to debug by hand.<p>
<em>Aside:</em> treecc does not support all of the AOP features that are
described in the literature. Treecc weaves together classes from multiple
method definitions. Other AOP languages can also weave together methods
from fragments in multiple aspects. We concentrated on those AOP features
that were useful for compiler construction. Other features could be
incorporated at a later date.<p>
<h2>7. Using Treecc to Beat Inheritance Bugs</h2>
Now that we've identified the problems of inheritance and visitor patterns
for compiler implementation, we will show how treecc helps the programmer
avoid these traps.<p>
The following is the treecc definition of our example node types:
<pre>%option lang = "C#"
%node Expression %abstract %typedef
%node UnaryExpression Expression %abstract =
Expression expr;
%node NegateExpression UnaryExpression
%node UnaryPlusExpression UnaryExpression
%node BitwiseNotExpression UnaryExpression</pre>
Treecc converts this into a number of C# classes, one for each
node type. It also inserts helper methods for testing the type
of a node and for tracking source line numbers. If the output language
is C or C++, treecc will also insert code for allocating large numbers
of nodes efficiently.<p>
The type-checking operation (or "aspect") is declared as follows:<p>
<pre>%operation %virtual LanguageType TypeCheck(Expression e)
LanguageType type = e.expr.TypeCheck();
if(type.IsInteger() || type.IsFloat())
return type;
throw new TypeCheckException();
LanguageType type = e.expr.TypeCheck();
return type;
throw new TypeCheckException();
We have not declared the operation to cover "<code>Expression</code>" or
"<code>UnaryExpression</code>". Instead, we list all of the applicable
subtypes explicitly. When treecc is run on the above file, it will check
that every non-abstract node type is handled by an operation case. If it finds
a missing type, it will report an error. Let's demonstrate that by adding a
new unary expression type:<p>
%node LogicalNotExpression UnaryExpression<br>
<br> node type `LogicalNotExpression' is not handled in operation `TypeCheck'<br>
This is at the heart of treecc's power: it performs a large amount of
semantic analysis over the node types to determine if the programmer has
forgotten something. The programmer is notified of this early in
development process, when it is easier to fix the problem.<p>
As we discussed earlier, we want to implement common code in common
base classes to improve code sharing. However, this introduces
hard to find bugs. Treecc avoids the need to do this by allowing
the programmer to attach multiple cases to the same code block,
as in the case of "<code>NegateExpression</code>" and
"<code>UnaryPlusExpression</code>" above.<p>
An important feature of aspect-oriented languages is aspect modularity:
it should be possible to implement separate aspects in different parts
of the code. Treecc supports this by separating node and operation
definitions. Operations do not need to be implemented in the same
file as the nodes to which they apply, and multiple operations on
the same node types can be scattered through-out the code.<p>
Portable.NET takes this even further by separating individual operations
across multiple files for expressions, statements, declarations, etc.
This introduces a clearer structure to the code that makes it easier
to navigate the source during compiler construction. The semantic analysis
routines in treecc ensure that nothing is missed when all of the
separate modules are recombined.<p>
<h2>8. Painless Visitors</h2>
The previous example used a "<code>%virtual</code>" operation, which
is defined over all node types. Treecc inserts the virtual method
body wherever it is required.<p>
Sometimes we don't want to define an operation as a virtual method.
We would prefer to use the visitor approach. The following is what
a visitor version of "<code>TypeCheck</code>" would look like:
<pre>%operation LanguageType TypeChecker::TypeCheck(Expression e) = {null}
LanguageType type = TypeCheck(e.expr);
if(type.IsInteger() || type.IsFloat())
return type;
throw new TypeCheckException();
LanguageType type = TypeCheck(e.expr);
return type;
throw new TypeCheckException();
This is almost identical to the previous version, except for the
definition of the operation, and the calling conventions for
"<code>TypeCheck</code>". Behind the scenes, treecc creates
a class called "<code>TypeChecker</code>" that contains a static
method called "<code>TypeCheck</code>". This is the visitor.<p>
Interestingly, if we had used C as the underlying language, then no
changes are necessary to the bodies of the operation cases. Only
the "<code>%virtual</code>" keyword changes. The C macro pre-processor
is used to smooth out the differences.<p>
This illustrates another useful property of treecc: it is very simple
to flip operations between inhertance-based virtuals and visitor-based
non-virtuals. This allows the programmer to start developing the compiler
one way, change their mind, and quickly flip to the other way.<p>
Normally, changing inheritance-based code into visitor-based code
would entail a complete system rewrite. Changing patterns with
treecc is trivial.<p>
This is a common property of aspect-oriented programming languages:
because the language takes care of the inserting the aspects into
the main classes, it is easier to change the style of insertion
without a major system overhaul.<p>
Non-virtual operations can be applied to multiple arguments, which
can be very useful when implementing coercions:
<pre>%enum SimpleType =
%operation %inline SimpleType Binary::Coerce
([SimpleType type1], [SimpleType type2]) = {Error}
Coerce(Integer, Integer)
return Integer;
Coerce(Integer, Long),
Coerce(Long, Integer),
Coerce(Long, Long)
return Long;
Coerce(Integer, Float),
Coerce(Float, Integer),
Coerce(Long, Float),
Coerce(Float, Float)
return Float;
Coerce(Integer, Error),
Coerce(Error, Integer),
Coerce(Long, Error),
Coerce(Error, Long),
Coerce(Float, Error),
Coerce(Error, Float),
Coerce(Error, Error)
return Error;
Treecc turns this into a highly efficient nested switch statement,
which would be extremely difficult to debug by hand. We actually
left out one of the cases above, so we get an error:
<blockquote><code> case `Float, Long' is missing from operation `Coerce'</code></blockquote>
Casts and coercions on primitive types can now be implemented as
simple table lookups, with treecc checking the completeness of the
table for us.<p>
<h2>9. Conclusion and Future Directions</h2>
Treecc provides a new way to attack the complexity of compiler implementation
by automating error-prone tasks. It performs a large amount of semantic
analysis on the program to ensure that common problems are caught early
in the development cycle.<p>
Because treecc is based on an aspect-oriented foundation, it allows the
programmer to separate out concerns and deal with them individually.
Treecc puts the whole system back together in the most efficient manner
The system is not necessarily complete. We'd like to experiment with
rule-based code generation techniques. At present, optimizers and
code generators must be written by hand, as operations on node types.<p>
A rule-based system would make it easier to build clever optimizers
as a set of pattern matching directives. Operations are already a
special class of pattern matcher, but they don't have any back-tracking
and retry capabilities.<p>
Another area that treecc can be applied to is the construction of
"Just In Time" compilers. The first phase of the JIT process is the
reconstruction of the intermediate code form of the program. This
intermediate code typically takes the form of abstract syntax trees or
three-address statements.<p>
Treecc is well-suited to the management of JIT intermediate forms.
Register allocation, dynamic flow analysis, and machine-dependent
code generation can be added as JIT aspects. We are currently
exploring the use of treecc to assist in the construction of a
JIT for Portable.NET<p>
<a name="pnetref">[1] Portable.NET Web Site, <a href=""></a>.<p>
<a name="treeccref">[2] Treecc Web Site, <a href=""></a>.<p>
<a name="gammaref">[3] Gamma, et al., <em>Design Patterns: Elements of
Reusable Object-Oriented Software</em>, Addison-Wesley, 1995.<p>
<a name="aopref">[4] Aspect-Oriented Programming, <em>Communications of
the ACM</em>, October 2001.<p>
<a name="aspectjref">[5] AspectJ Web Site, <a href=""></a>.<p>