Let's have a look at parsing a simple arithmetic expression:

2 + 20 * 2If you know EBNF a bit and wouldn't think about avoiding left recursion, operator precedence or associativity, you'ld probably write a grammar like this:

Expression :This grammar would be left recursive because the parser reads the grammar top down and left to right and would endlessly call the Expression rule without consuming any characters, i.e. altering the underlying state of the parser. While this kind of grammars can be written for bottom-up parsers, you'ld still have to deal with operator precedence in addition. That is define that a multiplication has higher precedence than an addition for example.

Expression '+' Expression |

Expression '-' Expression |

INT;

In Xtext you define the precedence implicitly when left-factoring such a grammar. Left-factoring means you get rid of left recursion by applying a certain technique, which I will show in the following.

So here is a left-factored grammar (not yet working with Xtext) for the expression language above :

Addition :As you can see the main difference is that we have three rules instead of one and if you look a bit closer you see, that there's a certain delegation pattern involved. The rule Addition doesn't call itself but calls Multiplication instead. The operator precedence is defined by the order of delegation. The later the rule is called the higher is its precedence. This is at least the case for the first two rules which are of a left recursive nature (but we've left-factored them now). The last rule is not left recursive which is why you can write them down without applying this pattern.

Multiplication ('+' Multiplication)*;

Multiplication:

NumberLiteral ('*' NumberLiteral)*;

NumberLiteral:

INT;

We should allow users to explicitly adjust precedence by adding parenthesis, e.g. write something like (2 + 20) * 2.

So let's add support for that (note that the grammar is still not working with Xtext):

Construction of an AST

Now that we know how to avoid left recursion, let's have a look at what the parser produces. In Xtext each rule returns some value.

Xtext can automatically infer whether some rule is a parser rule, i.e. constructs and returns an AST node or if it is a datatype rule. Above's grammars only consisted of datatype rules so all they would produce is a string.

In order to construct an AST we need to add Assignments and Actions. But before we do that we need to talk about return types.

The return type of a rule can be specified explicitly using the '

That is

In the following you see the final fully working Xtext grammar:

Addition :So once again: if you have some construct that recurses on the left hand side, you need to put it into the delegation chain according to their operator precedence. The pattern is always the same, the thing that recurses delegates to the rule with the next higher precedence.

Multiplication ('+' Multiplication)*;

Multiplication:

Primary ('*' Primary)*;

Primary :

NumberLiteral |

'(' Addition ')';

NumberLiteral:

INT;

Construction of an AST

Now that we know how to avoid left recursion, let's have a look at what the parser produces. In Xtext each rule returns some value.

*Parser rules*return AST nodes (i.e. EObject instances),*enum rules*return enum literals and*datatype rules*as well as*terminal rules*return simple values like strings and the like (*EDatatype*in EMF jargon).Xtext can automatically infer whether some rule is a parser rule, i.e. constructs and returns an AST node or if it is a datatype rule. Above's grammars only consisted of datatype rules so all they would produce is a string.

In order to construct an AST we need to add Assignments and Actions. But before we do that we need to talk about return types.

The return type of a rule can be specified explicitly using the '

*returns*' keyword but can be inferred if the type's name is the same as the rule's name.That is

NumberLiteral : … ;is a short form of

NumberLiteral returns NumberLiteral : …. ;However in the case of the expressions grammar above, the rules all need to return the same type since they are recursive. So in order to make the grammar functional we need to add a common return type explicitly (but the grammar is still missing some bits):

Addition returns Expression:The AST type inference mechanism of Xtext will infer two types: Expression and NumberLiteral. Now we need to add assignments and Actions in order to store all the important information in the AST and to create reasonable subtypes for the two operations.

Multiplication ('+' Multiplication)*;

Multiplication returns Expression:

Primary ('*' Primary)*;

Primary returns Expression:

NumberLiteral |

'(' Addition ')';

NumberLiteral:

INT;

In the following you see the final fully working Xtext grammar:

Let's go through the grammar as the parser would do it for the expression

Addition returns Expression:

Multiplication ({Addition.left=current} '+' right=Multiplication)*;

Multiplication returns Expression:

Primary ({Multiplication.left=current} '*' right=Primary)*;

Primary returns Expression:

NumberLiteral |

'(' Addition ')';

NumberLiteral:

value=INT;

**( 1 + 20 ) * 2**

(I'm sure it's pretty hard to follow what's going just by reading this text. Therefore I have prepared a small video where I visualize and explain what is going on. You can find it at the end of this section.)

The parser always starts with the first rule (

*Addition*). Therein the first element is an unassigned rule call to

*Multiplication*which in turn calls

*Primary*.

*Primary*now has two alternatives. The first on is calling

*NumberLiteral*which consists only of one assignment to a feature called 'value' of what the

*INT*rule returns.

But as the first token in the expression is an opening parenthesis '

*(*' the parser will take the second alternative in

*Primary*, consume the '(' and call the rue

*Addition*. Now the value '1' is the look ahead token and again

*Addition*calls

*Multiplication*and

*Multiplication*calls

*Primary*. This time the parser takes the first alternative because '1' was consumed by the

*INT*rule (which btw. is a reused library terminal rule).

As soon as the parser hits an assignment it checks whether an AST node for the current rule was already created. If not it will create one based on the return type, which is

In rule

The first element in the group is a so called a

The assignment ('right=Multiplication') calls

If we now had an additional plus operation '+' (e.g. 1 + 2 + 3) the group would match another time and create another instance of

We are now in rule

The resulting AST looks like this:

The following video tries to explain what you just have read. The grammar is slightly different, but I'm sure this won't be problem for you ;-).

*NumberLiteral*. The Xtext generator will have created an EClass 'NumberLiteral' before which can now be instantiated. That type will also have a property called*value*of type Integer, which will get the value '1' set. This is what the Java equivalent would look like:// value=INTNow that the rule has been completed the created EObject is returned to the calling rule

if (current == null)

current = new NumberLiteral();

current.setValue(ruleINT());

...

*Primary*, which in turn returns the object unchanged to its own caller. Within*Multiplication*the call to*Primary*has been successfully parsed and returned an instance of*NumberLiteral*. The second part of the rule is a so called g*roup*(everything within the parenthesis). The asterisk behind the closing parenthesis states that this part can be consumed zero or more times. The first token to consume in this part is the multiplication operator '*'. Unfortunately in the current situation the next token to consume is the plus operator '+', so the group is not consumed at all and the rule returns what they got from the unassigned rule call (the*NumberLiteral*) .In rule

*Addition*there's a similar group but this time it expects the correct operator so the parser goes into the group.The first element in the group is a so called a

*ction*. As Xtext grammars are highly declarative and bi-directional it is not a good idea to allow arbitrary expression within actions as it is usually the case with other parser generators. Instead we only support two kinds of actions. This one will create a new instance of type*Addition*and assign what was the to be returned object to the feature*left*. In Java this would have been something like:// Multiplication rule call

current = ruleMultiplication();

// {Addition.left=current}As a result the rule would now return an instance of

Addition temp = new Addition();

temp.setLeft(current);

current = temp;

...

*Addition*which has a*NumberLiteral*set to its property*left*. Next up the parser consumes the '+' operator. We do not store the operator in the AST because we have an explicit*Addition*type, which implicitly contains this information.The assignment ('right=Multiplication') calls

*Multiplication*another time and assigns the returned object (a*NumberLiteral*of value=20) to the property named*right*.If we now had an additional plus operation '+' (e.g. 1 + 2 + 3) the group would match another time and create another instance of

*Addition*. But we don't and therefore the rule is completed and returns the created instance of*Addition*to its caller which was the second alternative in*Primary*. Now the closing parenthesis is matched and consumed and the stack is reduced once more.We are now in rule

*Multiplication*and have the multiplication operator '*' on the look ahead. The parser goes into the group and applies the action. Finally it calls the Primary rule gets another instance of*NumberLiteral*(value=2), assigns it as the 'right' operand of the*Multiplication*and returns the*Multiplication*to*Addition*which in turn returns the very same object as there's nothing left to parse.The resulting AST looks like this:

The following video tries to explain what you just have read. The grammar is slightly different, but I'm sure this won't be problem for you ;-).

Associativity

There is still one topic I should mention, which is associativity. There is left and right associativity as well as none associativity. In the example we have seen left associativity. Associativity tells the parser how to construct the AST when there are two infix operations with the same precedence. The following example is taken from the corresponding wikipedia entry:

Consider the expression a ~ b ~ c. If the operator ~ has left associativity, this expression would be interpreted as (a ~ b) ~ c and evaluated left-to-right. If the operator has right associativity, the expression would be interpreted as a ~ (b ~ c) and evaluated right-to-left. If the operator is non-associative, the expression might be a syntax error, or it might have some special meaning.We already know the most important form which is

**left associativity**:

Addition returns Expression:

Multiplication ({Addition.left=current} '+' right=Multiplication)*;

**Right associativity**is done using the following pattern (note the quantity operator and the call to the rule itself at the end):

Addition returns Expression:And if you don't want to allow multiple usages of the same expression in a row (hence

Multiplication ({Addition.left=current} '+' right=Addition)?;

**non-associativity**) you write:

Addition returns Expression:Note that sometimes it's better to allow associativity on parser level, but forbid it later using validation, because you can come up with a better error message. Also the whole parsing process won't be interrupted, so your tooling will generally be more forgiving.

Multiplication ({Addition.left=current} '+' right=Multiplication)?;

(This post has also been included into the Xtext User Guide since version 1.0.1)