Monday, August 09, 2010

Parsing Expressions with Xtext

Parsing simple XML-like, structural languages with Xtext is a no-brainer. However, parsing nested expressions is often considered a bit more complicated. This is because they are more complicated due to their recursive nature and also because with Xtext you have to avoid left recursive parser rules. As the underlying parser (generated by Antlr) uses a top-down approach it would recurse endlessly if you had a left recursive grammar.

Let's have a look at parsing a simple arithmetic expression:
2 + 20 * 2
If you know EBNF a bit and wouldn't think about avoiding left recursion, operator precedence or associativity, you'ld probably write a grammar like this:
Expression :
Expression '+' Expression |
Expression '-' Expression |
INT;
This grammar would be left recursive because the parser reads the grammar top down and left to right and would endlessly call the Expression rule without consuming any characters, i.e. altering the underlying state of the parser. While this kind of grammars can be written for bottom-up parsers, you'ld still have to deal with operator precedence in addition. That is define that a multiplication has higher precedence than an addition for example.

In Xtext you define the precedence implicitly when left-factoring such a grammar. Left-factoring means you get rid of left recursion by applying a certain technique, which I will show in the following.

So here is a left-factored grammar (not yet working with Xtext) for the expression language above :
Addition :
Multiplication ('+' Multiplication)*;

Multiplication:
NumberLiteral ('*' NumberLiteral)*;

NumberLiteral:
INT;
As you can see the main difference is that we have three rules instead of one and if you look a bit closer you see, that there's a certain delegation pattern involved. The rule Addition doesn't call itself but calls Multiplication instead. The operator precedence is defined by the order of delegation. The later the rule is called the higher is its precedence. This is at least the case for the first two rules which are of a left recursive nature (but we've left-factored them now). The last rule is not left recursive which is why you can write them down without applying this pattern.

We should allow users to explicitly adjust precedence by adding parenthesis, e.g. write something like (2 + 20) * 2.
So let's add support for that (note that the grammar is still not working with Xtext):
Addition :
Multiplication ('+' Multiplication)*;

Multiplication:
Primary ('*' Primary)*;

Primary :
NumberLiteral |
'(' Addition ')';

NumberLiteral:
INT;
So once again: if you have some construct that recurses on the left hand side, you need to put it into the delegation chain according to their operator precedence. The pattern is always the same, the thing that recurses delegates to the rule with the next higher precedence.

Construction of an AST

Now that we know how to avoid left recursion, let's have a look at what the parser produces. In Xtext each rule returns some value. Parser rules return AST nodes (i.e. EObject instances), enum rules return enum literals and datatype rules as well as terminal rules return simple values like strings and the like (EDatatype in EMF jargon).
Xtext can automatically infer whether some rule is a parser rule, i.e. constructs and returns an AST node or if it is a datatype rule. Above's grammars only consisted of datatype rules so all they would produce is a string.
In order to construct an AST we need to add Assignments and Actions. But before we do that we need to talk about return types.

The return type of a rule can be specified explicitly using the 'returns' keyword but can be inferred if the type's name is the same as the rule's name.
That is
NumberLiteral : … ;
is a short form of
NumberLiteral returns NumberLiteral : …. ;
However in the case of the expressions grammar above, the rules all need to return the same type since they are recursive. So in order to make the grammar functional we need to add a common return type explicitly (but the grammar is still missing some bits):
Addition returns Expression:
Multiplication ('+' Multiplication)*;

Multiplication returns Expression:
Primary ('*' Primary)*;

Primary returns Expression:
NumberLiteral |
'(' Addition ')';

NumberLiteral:
INT;
The AST type inference mechanism of Xtext will infer two types: Expression and NumberLiteral. Now we need to add assignments and Actions in order to store all the important information in the AST and to create reasonable subtypes for the two operations.

In the following you see the final fully working Xtext grammar:

Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Multiplication)*;

Multiplication returns Expression:
Primary ({Multiplication.left=current} '*' right=Primary)*;

Primary returns Expression:
NumberLiteral |
'(' Addition ')';

NumberLiteral:
value=INT;
Let's go through the grammar as the parser would do it for the expression
( 1 + 20 ) * 2
(I'm sure it's pretty hard to follow what's going just by reading this text. Therefore I have prepared a small video where I visualize and explain what is going on. You can find it at the end of this section.)

The parser always starts with the first rule (Addition). Therein the first element is an unassigned rule call to Multiplication which in turn calls Primary. Primary now has two alternatives. The first on is calling NumberLiteral which consists only of one assignment to a feature called 'value' of what the INT rule returns.

But as the first token in the expression is an opening parenthesis '(' the parser will take the second alternative in Primary, consume the '(' and call the rue Addition. Now the value '1' is the look ahead token and again Addition calls Multiplication and Multiplication calls Primary. This time the parser takes the first alternative because '1' was consumed by the INT rule (which btw. is a reused library terminal rule).

As soon as the parser hits an assignment it checks whether an AST node for the current rule was already created. If not it will create one based on the return type, which is NumberLiteral. The Xtext generator will have created an EClass 'NumberLiteral' before which can now be instantiated. That type will also have a property called value of type Integer, which will get the value '1' set. This is what the Java equivalent would look like:
// value=INT
if (current == null)
current = new NumberLiteral();
current.setValue(ruleINT());
...
Now that the rule has been completed the created EObject is returned to the calling rule Primary, which in turn returns the object unchanged to its own caller. Within Multiplication the call to Primary has been successfully parsed and returned an instance of NumberLiteral. The second part of the rule is a so called group (everything within the parenthesis). The asterisk behind the closing parenthesis states that this part can be consumed zero or more times. The first token to consume in this part is the multiplication operator '*'. Unfortunately in the current situation the next token to consume is the plus operator '+', so the group is not consumed at all and the rule returns what they got from the unassigned rule call (the NumberLiteral) .

In rule Addition there's a similar group but this time it expects the correct operator so the parser goes into the group.
The first element in the group is a so called action. As Xtext grammars are highly declarative and bi-directional it is not a good idea to allow arbitrary expression within actions as it is usually the case with other parser generators. Instead we only support two kinds of actions. This one will create a new instance of type Addition and assign what was the to be returned object to the feature left. In Java this would have been something like:
// Multiplication rule call
current = ruleMultiplication();
// {Addition.left=current}
Addition temp = new Addition();
temp.setLeft(current);
current = temp;
...
As a result the rule would now return an instance of Addition which has a NumberLiteral set to its property left. Next up the parser consumes the '+' operator. We do not store the operator in the AST because we have an explicit Addition type, which implicitly contains this information.
The assignment ('right=Multiplication') calls Multiplication another time and assigns the returned object (a NumberLiteral of value=20) to the property named right.

If we now had an additional plus operation '+' (e.g. 1 + 2 + 3) the group would match another time and create another instance of Addition. But we don't and therefore the rule is completed and returns the created instance of Addition to its caller which was the second alternative in Primary. Now the closing parenthesis is matched and consumed and the stack is reduced once more.

We are now in rule Multiplication and have the multiplication operator '*' on the look ahead. The parser goes into the group and applies the action. Finally it calls the Primary rule gets another instance of NumberLiteral (value=2), assigns it as the 'right' operand of the Multiplication and returns the Multiplication to Addition which in turn returns the very same object as there's nothing left to parse.

The resulting AST looks like this:


The following video tries to explain what you just have read. The grammar is slightly different, but I'm sure this won't be problem for you ;-).




Associativity

There is still one topic I should mention, which is associativity. There is left and right associativity as well as none associativity. In the example we have seen left associativity. Associativity tells the parser how to construct the AST when there are two infix operations with the same precedence. The following example is taken from the corresponding wikipedia entry:

Consider the expression a ~ b ~ c. If the operator ~ has left associativity, this expression would be interpreted as (a ~ b) ~ c and evaluated left-to-right. If the operator has right associativity, the expression would be interpreted as a ~ (b ~ c) and evaluated right-to-left. If the operator is non-associative, the expression might be a syntax error, or it might have some special meaning.
We already know the most important form which is left associativity:
Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Multiplication)*;
Right associativity is done using the following pattern (note the quantity operator and the call to the rule itself at the end):
Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Addition)?;
And if you don't want to allow multiple usages of the same expression in a row (hence non-associativity) you write:
Addition returns Expression:
Multiplication ({Addition.left=current} '+' right=Multiplication)?;
Note that sometimes it's better to allow associativity on parser level, but forbid it later using validation, because you can come up with a better error message. Also the whole parsing process won't be interrupted, so your tooling will generally be more forgiving.

(This post has also been included into the Xtext User Guide since version 1.0.1)

25 comments:

  1. Great post, thank you! Actually I spent a day figuring out how to left-factor a grammar with Xtext (because simply left-factoring creates ugly trees), and I already have prepared a blog post myself, but I haven't had the time to post it...

    One thing I was wondering whether it won't be possible to write
    -------------------------------
    Addition :
    Multiplication ('+' Expression)*;
    -------------------------------

    I tried that in my grammar and it works. This enables a little bit more flexible grammars, as you will be able to use less parentheses. This is especially useful if you have to combine this left-factored rules with a non-left-recursive rule. I finally solved this second problem by using pseudo expressions, in which the real expressions are defined using the Xtext current-feature. This patterns seems to work quite nice, and it is very easy to apply. In my case I wanted to added quantifiers to logic expressions. Logic expressions like "or" or "and" are to be left-factored, but the quantifier doesn't need that. The problem was to enable expression like a and forAll x ..., as the forAll part should not be written in parentheses. The number of the pseudo expression directly reflect the precedence of operators. Here is a snippet from my grammar:

    -------------------------------
    Expression: Expression_1 | QuantifiedExpression;

    QuantifiedExpression returns Expression:
    ( {ForAllExpression} "forAll" | {ExistExpression} "exists" |
    {NotExistExpression} "nexists"
    )
    vardecl+=VariableDecl ("," vardecl+=VariableDecl)* expression=Expression
    ;

    Expression_1 returns Expression: Expression_2
    (
    {AndExpression.lhs=current} "and"
    rhs=Expression
    );

    Expression_2 returns Expression: Expression_3
    (
    {OrExpression.lhs=current} "or"
    rhs=Expression
    );
    ...

    Expression_10 returns Expression: PrimaryExpression
    |
    {SetExpression} "{" element+=PrimaryExpression ( "," element+=PrimaryExpression )* "}";


    PrimaryExpression returns Expression:
    '(' Expression ')'
    |
    TerminalExpression;
    -------------------------------

    This grammar is working well, but maybe I've missed something.

    Cheers,

    Jens

    P.S.: I really missed that topic in the user guide! Thank you for adding it!

    ReplyDelete
  2. I have been following all your Xtext posts. thay are ll really helped me a lot ot understand the xtext.

    Now i am designing a DSL to host c grammar and few tags based syntax. i like to know how to avoid/ignore some code section which is not needed to be taken to my xtext model. Thanks in advance.

    Raj

    ReplyDelete
  3. Hi Raj,

    please ask such question on the newsgroup.
    You'll basically have to find lexical terminals for the ranges you want to skip.

    Sven

    ReplyDelete
  4. Hello Sven,

    thank your for the enlightened video!
    Without it, i would hardly understand
    your blog post!


    Greetings,
    Andreas

    ReplyDelete
  5. Great Post! It really help me.. Your explanation is more clear than eclipse documentation..

    Two thumbs up!

    ReplyDelete
  6. Note, that I've added the description to the general Xtext documentation.

    ReplyDelete
  7. First of all thanks for the great post. It's my first day of Xtext so bear with me: Why did you call the action {Addition.left=current}. Not the current part but the Addition.left. Is left a specific keyword? And why a composite one (Addtion.left)?

    ReplyDelete
  8. @dierre

    Addition.left points to the EClass 'Addition' and its feature 'left'. Such an action creates a new instance of 'Addition' and assigns the value so far constructed to the feature 'left'. It's also explained in the documentation of Xtext.

    ReplyDelete
    Replies
    1. Ok, understood what Addition.left refers to. What does 'right' mean in "right=Multiplication"?
      Why is there no reference to an EClass?
      Finally, is this information in the official documentation? I haven't been able to find it.

      Delete
    2. It's an assignment and 'right' refers to the EReference 'right' of the type Addition. It assigns another value to the Addition AST object.
      You can find more explanation here : http://www.eclipse.org/Xtext/documentation.html#assignments

      Delete
  9. Hello,

    I've been playing with xtext a bit and it looks great, but what I need is an Interpreter that can interpret string runtime.

    Something like:
    Object result = MyDslInterpreter.parse("( 1 + 20 ) * 2)";

    Where the resulting object would contain the value 42.

    I can't seem to find a clear example for that. Could someone here point me in the right direction?

    I can be contacted through if necessary.

    ReplyDelete
  10. @Arjan: In Xtext 1 we had an example which did exactly that. We removed it because it was broken and it was too late when we realized it. But you can have a look at Xbase's interpreter to get an idea of how an interpreter could be implemented :

    https://github.com/eclipse/xtext/tree/master/plugins/org.eclipse.xtext.xbase/src/org/eclipse/xtext/xbase/interpreter

    ReplyDelete
  11. Thanks for the helpfull hints. I'm quite jealous about your visualized AST in the video. Is there a way to visualize an AST (i.e. the parsers' output) graphically? E.g. with GraphViz?

    Regards Frank

    ReplyDelete
  12. Hallo Sven,

    thank you very much for your post, it is really good and easy to understand. But I have got some more subtle problem to solve: what can be done if DSL allows creating and using operators at runtime, it means, there are not only predefined operators, but also operators created by user at runtime according to predefined grammar rule:

    MyDSLOperator:
    name = Name
    precedence = INT
    specifier = Specifier
    ;

    Specifier:
    xfx |xfy | yfx | xf | yf
    ;

    Here specifier can be understood as an associativity of the operator. Is it possible to solve this problem with XTEXT at grammar level ? Or do I have to try to solve it with manipulating Model Objects in AST, let say within validators ?

    ReplyDelete
  13. You can't use parsed information for alter how the parser behaves. I.e. the precedence is something which needs to be defined in the grammar if you want the AST to reflect that. I'd rethink if you really need to specify the precedence. In the end relying on precedence is error prone ...

    ReplyDelete
  14. Hello Sven,
    Thank you for your helpful video.
    I'm a beginner in Xtext. I've understood your post but I have a question.
    We can parse an Expression whit your example, My question is how to calculate the expression?
    I found out that we cannot write the following expression in xtext:

    Addition returns Expression:
    Multiplication ({Addition.left=current} '+' right=Multiplication)* {Addition.value = Addition.left+Addition.right}

    I was wondering if you could please answer my question.
    Thank you.

    ReplyDelete
  15. You would write an interpreter which works on the created AST.
    Typically this would involve a dispatch over the different expression kinds where you return the results. Example (using Xtend):

    def dispatch BigDecimal invoke(Multiplication it) {
    invoke(left) * invoke(right)
    }

    def dispatch BigDecimal invoke(Addition it) {
    invoke(left) + invoke(right)
    }

    def dispatch BigDecimal invoke(NumberLiteral it) {
    value
    }

    In addition (not shown in the example) you would pass a context around containing the currently visible variables. Hope that helps. If not, please ask additional questions in the official Xtext forum @Eclipse.

    ReplyDelete
    Replies
    1. Thank you for your helpful answer.
      I understood and it works out :)

      Delete
  16. Hi,

    I am very new to ANTLR and working on a parsing Script like below.

    I have a Stirng ACTION and ACTION GROUP in the script.
    Here Action can allow Numeric Range by ',' separated or with '-'.
    Action Group can allow Integer only.

    my grammar is like below, but it is not working.

    callingScript :(actionBlock | actiongroupblock)*;

    actionBlock :ACTION numericRange;
    actionGroupBlock : ACTION GROUP INTEGER;
    numericRange : i1= INTEGER ((',' | '-') i2=INTEGER){

    Evaluator eval = new ActionCodeEvaluator($i1.text);
    Evaluate(eval,$i1.line, $i1.pos, $i1.text);

    eval = new ActionCodeEvaluator($i2.text);
    Evaluate(eval, $i2.line, $i2.pos, $i2.text);
    }
    ;

    The requirement is like, when my script starts with Action with out number (Action), it is invalid.
    Action 1 or Action 1-10 or Action Group 1 or Action Group 10 are valid.

    When i am giving the Action with out any number then my script should not parse but actually it is not throwing error.

    In the parser
    case ACTION:
    {
    int LA1_3 = input.LA(2);

    if ( (LA1_3==GROUP) ) {
    alt1=3;
    }
    else if ( (LA1_3==INTEGER) ) {
    alt1=2;
    }


    }
    break;
    When we are giving Action in Script with out any integer or numeric Range, this case is not getting handled in above generated parser.

    Can any one please help me what is wrong with my grammar.

    ReplyDelete
    Replies
    1. Please ask Antlr specific questions on the Antlr user group : http://www.antlr.org/mailman/listinfo/antlr-interest

      Delete
  17. How did I miss that article ?!!

    ReplyDelete
  18. Hi, i'm trying this but can't make it works in my dsl.

    This is what i have:

    Variable:
    direct=DirectVariable | symbolic=SymbolicVariable
    ;
    SymbolicVariable returns Expression:
    name=ID | element=MultiElementVariable
    ;

    MultiElementVariable returns Expression:
    array=ArrayVariable | StructuredVariable
    ;

    ArrayVariable returns Expression:
    subs=SubscriptedVariable right=SubscriptList
    ;
    SubscritpedVariable returns Expression:
    value=SymbolicVariable
    ;

    SubscriptList returns Expression:
    {Subscript}'['expr=Expression ']' (',' expression+=Expression)* ']'
    ;

    And the expression rule in some part calls Variable in your definition. I tried your solution but i get the "An action is not allowed, when the current may still be unassigned" message.
    What i'm doing wrong?

    ReplyDelete
  19. This comment has been removed by the author.

    ReplyDelete
  20. Hello Sven Efftinge,

    i have a small question
    consider an example 2+3*3 , in java the result would be 11 . since multiplication has higher precedence than addition

    but according to this posting , the result will be 15 ie.. addition is performed first then multiplication

    how can you differentiate between the associativity and precedence ??

    correct me if i am wrong

    thanks
    regards
    vinayak

    ReplyDelete
  21. Precedence is handled by the order of rules. See the picture on the AST in this post. You'll see that the multiplication is the root AST.
    It's working correctly. It's explained in the text.

    ReplyDelete