Monday, August 09, 2010

Parsing Expressions with Xtext

Parsing simple XML-like, structural languages with Xtext is a no-brainer. However, parsing nested expressions is often considered a bit more complicated. This is because they are more complicated due to their recursive nature and also because with Xtext you have to avoid left recursive parser rules. As the underlying parser (generated by Antlr) uses a top-down approach it would recurse endlessly if you had a left recursive grammar.

Let's have a look at parsing a simple arithmetic expression:
2 + 20 * 2
If you know EBNF a bit and wouldn't think about avoiding left recursion, operator precedence or associativity, you'ld probably write a grammar like this:
Expression :
Expression '+' Expression |
Expression '-' Expression |
INT;
This grammar would be left recursive because the parser reads the grammar top down and left to right and would endlessly call the Expression rule without consuming any characters, i.e. altering the underlying state of the parser. While this kind of grammars can be written for bottom-up parsers, you'ld still have to deal with operator precedence in addition. That is define that a multiplication has higher precedence than an addition for example.

In Xtext you define the precedence implicitly when left-factoring such a grammar. Left-factoring means you get rid of left recursion by applying a certain technique, which I will show in the following.

So here is a left-factored grammar (not yet working with Xtext) for the expression language above :
Addition :
Multiplication ('+' Multiplication)*;

Multiplication:
NumberLiteral ('*' NumberLiteral)*;

NumberLiteral:
INT;
As you can see the main difference is that we have three rules instead of one and if you look a bit closer you see, that there's a certain delegation pattern involved. The rule Addition doesn't call itself but calls Multiplication instead. The operator precedence is defined by the order of delegation. The later the rule is called the higher is its precedence. This is at least the case for the first two rules which are of a left recursive nature (but we've left-factored them now). The last rule is not left recursive which is why you can write them down without applying this pattern.

We should allow users to explicitly adjust precedence by adding parenthesis, e.g. write something like (2 + 20) * 2.
So let's add support for that (note that the grammar is still not working with Xtext):
Addition :
Multiplication ('+' Multiplication)*;

Multiplication:
Primary ('*' Primary)*;

Primary :
NumberLiteral |
'(' Addition ')';

NumberLiteral:
INT;
So once again: if you have some construct that recurses on the left hand side, you need to put it into the delegation chain according to their operator precedence. The pattern is always the same, the thing that recurses delegates to the rule with the next higher precedence.

Construction of an AST

Now that we know how to avoid left recursion, let's have a look at what the parser produces. In Xtext each rule returns some value. Parser rules return AST nodes (i.e. EObject instances), enum rules return enum literals and datatype rules as well as terminal rules return simple values like strings and the like (EDatatype in EMF jargon).
Xtext can automatically infer whether some rule is a parser rule, i.e. constructs and returns an AST node or if it is a datatype rule. Above's grammars only consisted of datatype rules so all they would produce is a string.
In order to construct an AST we need to add Assignments and Actions. But before we do that we need to talk about return types.

The return type of a rule can be specified explicitly using the 'returns' keyword but can be inferred if the type's name is the same as the rule's name.
That is
NumberLiteral : … ;
is a short form of
NumberLiteral returns NumberLiteral : …. ;
However in the case of the expressions grammar above, the rules all need to return the same type since they are recursive. So in order to make the grammar functional we need to add a common return type explicitly (but the grammar is still missing some bits):
Addition returns Expression:
Multiplication ('+' Multiplication)*;

Multiplication returns Expression:
Primary ('*' Primary)*;

Primary returns Expression:
NumberLiteral |
'(' Addition ')';

NumberLiteral:
INT;
The AST type inference mechanism of Xtext will infer two types: Expression and NumberLiteral. Now we need to add assignments and Actions in order to store all the important information in the AST and to create reasonable subtypes for the two operations.

In the following you see the final fully working Xtext grammar:

Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Multiplication)*;

Multiplication returns Expression:
Primary ({Multiplication.left=current} '*' right=Primary)*;

Primary returns Expression:
NumberLiteral |
'(' Addition ')';

NumberLiteral:
value=INT;
Let's go through the grammar as the parser would do it for the expression
( 1 + 20 ) * 2
(I'm sure it's pretty hard to follow what's going just by reading this text. Therefore I have prepared a small video where I visualize and explain what is going on. You can find it at the end of this section.)

The parser always starts with the first rule (Addition). Therein the first element is an unassigned rule call to Multiplication which in turn calls Primary. Primary now has two alternatives. The first on is calling NumberLiteral which consists only of one assignment to a feature called 'value' of what the INT rule returns.

But as the first token in the expression is an opening parenthesis '(' the parser will take the second alternative in Primary, consume the '(' and call the rue Addition. Now the value '1' is the look ahead token and again Addition calls Multiplication and Multiplication calls Primary. This time the parser takes the first alternative because '1' was consumed by the INT rule (which btw. is a reused library terminal rule).

As soon as the parser hits an assignment it checks whether an AST node for the current rule was already created. If not it will create one based on the return type, which is NumberLiteral. The Xtext generator will have created an EClass 'NumberLiteral' before which can now be instantiated. That type will also have a property called value of type Integer, which will get the value '1' set. This is what the Java equivalent would look like:
// value=INT
if (current == null)
current = new NumberLiteral();
current.setValue(ruleINT());
...
Now that the rule has been completed the created EObject is returned to the calling rule Primary, which in turn returns the object unchanged to its own caller. Within Multiplication the call to Primary has been successfully parsed and returned an instance of NumberLiteral. The second part of the rule is a so called group (everything within the parenthesis). The asterisk behind the closing parenthesis states that this part can be consumed zero or more times. The first token to consume in this part is the multiplication operator '*'. Unfortunately in the current situation the next token to consume is the plus operator '+', so the group is not consumed at all and the rule returns what they got from the unassigned rule call (the NumberLiteral) .

In rule Addition there's a similar group but this time it expects the correct operator so the parser goes into the group.
The first element in the group is a so called action. As Xtext grammars are highly declarative and bi-directional it is not a good idea to allow arbitrary expression within actions as it is usually the case with other parser generators. Instead we only support two kinds of actions. This one will create a new instance of type Addition and assign what was the to be returned object to the feature left. In Java this would have been something like:
// Multiplication rule call
current = ruleMultiplication();
// {Addition.left=current}
Addition temp = new Addition();
temp.setLeft(current);
current = temp;
...
As a result the rule would now return an instance of Addition which has a NumberLiteral set to its property left. Next up the parser consumes the '+' operator. We do not store the operator in the AST because we have an explicit Addition type, which implicitly contains this information.
The assignment ('right=Multiplication') calls Multiplication another time and assigns the returned object (a NumberLiteral of value=20) to the property named right.

If we now had an additional plus operation '+' (e.g. 1 + 2 + 3) the group would match another time and create another instance of Addition. But we don't and therefore the rule is completed and returns the created instance of Addition to its caller which was the second alternative in Primary. Now the closing parenthesis is matched and consumed and the stack is reduced once more.

We are now in rule Multiplication and have the multiplication operator '*' on the look ahead. The parser goes into the group and applies the action. Finally it calls the Primary rule gets another instance of NumberLiteral (value=2), assigns it as the 'right' operand of the Multiplication and returns the Multiplication to Addition which in turn returns the very same object as there's nothing left to parse.

The resulting AST looks like this:


The following video tries to explain what you just have read. The grammar is slightly different, but I'm sure this won't be problem for you ;-).




Associativity

There is still one topic I should mention, which is associativity. There is left and right associativity as well as none associativity. In the example we have seen left associativity. Associativity tells the parser how to construct the AST when there are two infix operations with the same precedence. The following example is taken from the corresponding wikipedia entry:

Consider the expression a ~ b ~ c. If the operator ~ has left associativity, this expression would be interpreted as (a ~ b) ~ c and evaluated left-to-right. If the operator has right associativity, the expression would be interpreted as a ~ (b ~ c) and evaluated right-to-left. If the operator is non-associative, the expression might be a syntax error, or it might have some special meaning.
We already know the most important form which is left associativity:
Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Multiplication)*;
Right associativity is done using the following pattern (note the quantity operator and the call to the rule itself at the end):
Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Addition)?;
And if you don't want to allow multiple usages of the same expression in a row (hence non-associativity) you write:
Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Multiplication)?;
Note that sometimes it's better to allow associativity on parser level, but forbid it later using validation, because you can come up with a better error message. Also the whole parsing process won't be interrupted, so your tooling will generally be more forgiving.

(This post has also been included into the Xtext User Guide since version 1.0.1)