Eclipse with plugin for DSL with following grammar (xtext)
AbstractStatement returns AbstractStatement:
IfStructureStatement | DeclarativeStatement | BreakStatement | EqualityStatement | SignalStatement;
Component returns Component:
LED_Panel | Switch | Timer | LED_Light;
Setup returns Setup:
{Setup}
'SETUP BEGIN'
( abstractstatement+=AbstractStatement ( "\r" abstractstatement+=AbstractStatement)* )?
'SETUP END';
DeclarativeStatement returns DeclarativeStatement:
{DeclarativeStatement}
'DECLARE'
( component+=[Component|EString] ( "," component+=[Component|EString])* )?
( variable+=[Variable|EString] ( "," variable+=[Variable|EString])* )?
( constant+=[Constant|EString] ( "," constant+=[Constant|EString])* )?";";
LED_Panel returns LED_Panel:
{LED_Panel}
'LED_PANEL'
ElementName=EString
('{'
'PanelWidth' PanelWidth=EInt
'PanelHeight' PanelHeight=EInt
'PanelText' PanelText=EString
'ON' '{' pin+=Pin ( "," pin+=Pin)* '}'
'}')?;
And the following source file:
SETUP BEGIN
DECLARE LED_PANEL p;
SETUP END
This code gives me error "missmatched input LED_PANEL", expecting ";"
It is acting like he can not recognize Component LED_PANEL
I expect that he can validate this code.
In your DeclarativeStatement rule you have component+=[Component|EString]. This means "match an EString token; that token should be the name of a Component (meaning an instance of the Component class)". As far as the parser is concerned, that's equivalent to component+=EString - the fact that it's a cross reference only comes into play once we get to the linker.
It does not mean "match a Component". If that's what you want, you should just write component+=Component (or even better components+=Component since lists should have plural names).
Cross references are intended for situations where you expect the name of something defined elsewhere. If you expect the whole thing, there should be no cross reference.
Related
I've found a strange issue regarding error recovery in ANTLR4. If I take the grammar example from the ANTLR book
grammar simple;
prog: classDef+ ; // match one or more class definitions
classDef
: 'class' ID '{' member+ '}' // a class has one or more members
;
member
: 'int' ID ';' // field definition
| 'int' f=ID '(' ID ')' '{' stat '}' // method definition
;
stat: expr ';'
| ID '=' expr ';'
;
expr: INT
| ID '(' INT ')'
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
and use the input
class T {
y;
int x;
}
it will see the first member as an error (as it expects 'int' before 'y').
classDef
| "class"
| ID 'T'
| "{"
|- member
| | ID "y" -> error
| | ";" -> error
|- member
| | "int"
| | ID "x"
| | ";"
In this case ANTLR4 recovers from the error in the first member subrule and parses the second member correct.
But if the member classDef is changed from mandatory member+ to optional member*
classDef
: 'class' ID '{' member* '}' // a class has zero or more members
;
then the parsed tree will look like
classDef
| "class" -> error
| ID "T" -> error
| "{" -> error
| ID "y" -> error
| ";" -> error
| "int" -> error
| ID "x" -> error
| ";" -> error
| "}" -> error
It seems that the error recovery cannot solve the issue inside the member subrule anymore.
Obviously using member+ is the way forward as it provides the correct error recovery result. But how do I allow empty class bodies? Am I missing something in the grammar?
The DefaultErrorStrategy class is quite complex with token deletions and insertions and the book explains the theory of this class in a very good way. But what I'm missing here is how to implement custom error recovery for specific rules?
In my case I would add something like "if { is already consumed, try to find int or }" to optimize the error recovery for this rule.
Is this possible with ANTLR4 error recovery in a reasonable way at all? Or do I have to implement manual parser by hand to really gain control over error recovery for those use cases?
It is worth noting that the parser never enters the sub rule for the given input. The classDef rule fails before trying to match a member.
Before trying to parse the sub-rule, the sync method on DefaultErrorStrategy is called. This sync recognizes there is a problem and tries to recover by deleting a single token to see if that fixes things up.
In this case it doesn't, so an exception is thrown and then tokens are consumed until a 'class' token is found. This makes sense because that is what can follow a classDef and it is the classDef rule, not the member rule that is failing at this point.
It doesn't look simple to do correctly, but if you install a custom subclass of DefaultErrorStrategy and override the sync() method, you can get any recovery strategy you like.
Something like the following could be a starting point:
#Override
public void sync(Parser recognizer) throws RecognitionException {
if (recognizer.getContext() instanceof simpleParser.ClassDefContext) {
return;
}
super.sync(recognizer);
}
The result being that the sync doesn't fail, and the member rule is executed. Parsing the first member fails, and the default recovery method handles moving on to the next member in the class.
I am trying to update an ANTLR grammar that follows the following spec
https://github.com/facebook/graphql/pull/327/files
In logical terms its defined as
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Not the above is logical - not ANTLR syntax)
I tried the follow in ANTRL 4 but it wont recognize more than 1 character inside a triple quoted string
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""' but as soon as I add more characters it fails
eg: '"""abc"""' wont parse and the IntelliJ plugin for ANTLR says
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple quoted strings in ANTLR with '\"""' escaping?
Some of your parer rules should really be lexer rules. And SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
I would like to be able to use 'NULL' as both a parameter (the value null) and a function name in my grammar. See this reduced example :
grammar test;
expr
: value # valueExpr
| FUNCTION_NAME '(' (expr (',' expr)* )* ')' # functionExpr
;
value
: INT
| 'NULL'
;
FUNCTION_NAME
: [a-zA-Z] [a-zA-Z0-9]*
;
INT: [0-9]+;
Now, trying to parse:
NULL( 1 )
Results in the parse tree failing because it parses NULL as a value, and not a function name.
Ideally, I should even be able to parse NULL(NULL)..
Can you tell me if this is possible, and if yes, how to make this happen?
That 'NULL' string in your grammar defines an implicit token type, it's equivalent to adding something along this:
NULL: 'NULL';
At the start of the lexer rules. When a token matches several lexer rules, the first one is used, so in your grammar the implicit rule get priority, and you get a token of type 'NULL'.
A simple solution would be to introduce a parser rule for function names, something like this:
function_name: FUNCTION_NAME | 'NULL';
and then use that in your expr rule. But that seems brittle, if NULL is not intended to be a keyword in your grammar. There are other solution to this, but I'm not quite sure what to advise since I don't know how you expect your grammar to expand.
But another solution could be to rename FUNCTION_NAME to NAME, get rid of the 'NAME' token type, and rewrite expr like that:
expr
: value # valueExpr
| NAME '(' (expr (',' expr)* )* ')' # functionExpr
| {_input.LT(1).getText().equals("NULL")}? NAME # nullExpr
;
A semantic predicate takes care of the name comparison here.
Hey I have a quick question. I am using ANTLRworks to create an interpreter in Java from a set of grammar. I was going to write it out by hand but then realized I didn't have to because of antlrworks. I am getting this error though
T.g:9:23: label ID conflicts with token with same name
Is ANTLRworks the way to go when creating a interpreter from grammar. And do y'all see any error in my code?
I am trying to make ID one letter from a-z and not case sensitive. and to have white space in between every lexeme. THANK YOU
grammar T;
programs : ID WS compound_statement;
statement:
if_statement|assignment_statement|while_statement|print_statement|compound_statement;
compound_statement: 'begin' statement_list 'end';
statement_list: statement|statement WS statement_list;
if_statement: 'if' '(' boolean_expression ')' 'then' statement 'else' statement;
while_statement: 'while' boolean_expression 'do' statement;
assignment_statement: ID = arithmetic_expression;
print_statement: 'print' ID;
boolean_expression: operand relative_op operand;
operand : ID |INT;
relative_op: '<'|'<='|'>'|'>='|'=='|'/=';
arithmetic_expression: operand|operand WS arithmetic_op WS operand;
arithmetic_op: '+'|'-'|'*'|'/';
ID : ('a'..'z'|'A'..'Z'|'_').
;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
and here is the grammar
<program> → program id <compound_statement>
<statement> → <if_statement> | <assignment_statement> | <while_statement> |
<print_statement> | <compound_statement>
<compound_statement> → begin <statement_list> end
<statement_list> → <statement> | <statement> ; <statement_list>
<if_statement> → if <boolean_expression> then <statement> else <statement>
<while_statement> → while <boolean_expression> do <statement>
<assignment_statement> -> id := <arithmetic_expression>
<print_statement> → print id
<boolean_expression> → <operand> <relative_op> <operand>
<operand> → id | constant
<relative_op> → < | <= | > | >= | = | /=
<arithmetic_expression> → <operand> | <operand> <arithmetic_op> <operand>
<arithmetic_op> → + | - | * | /
Is ANTLRworks the way to go when creating a interpreter from grammar.
No.
ANTLRWorks can only be used to write your grammar and possibly test to see if it input properly (through its debugger or interpreter). It cannot be used to create an interpreter for the language you've written the grammar for. ANTLRWorks is just a fancy text-editor, nothing more.
And do y'all see any error in my code?
As indicated by Treebranch: you didn't have quotes around the = sign in:
assignment_statement: ID = arithmetic_expression;
making ANTLR "think" you wanted to assign the label ID to the parser rule arithmetic_expression, which is illegal: you can't have a label-name that is also the name of a rule (ID, in your case).
Some possible issues in your code:
I think you want your ID tag to have a + regex so that it can be of length 1 or more, like so:
ID : ('a'..'z'|'A'..'Z'|'_')+
;
It also looks like you are missing quotes around your = sign:
assignment_statement: ID '=' arithmetic_expression;
EDIT
Regarding your left recursion issue: ANTLR is very powerful because of the regex functionality. While an EBNF (like the one you have presented) may be limited in the way it can express things, ANTLR can be used to express certain grammar rules in a much simpler way. For instance, if you want to have a statement_list in your compound_statement, just use your statement rule with closure (*). Like so:
compound_statement: 'begin' statement* 'end';
Suddently, you can remove unnecessary rules like statement_list.
During taking advantage of ANTLR 3.3, I'm changing the current grammar to support inputs without parenthesis too. Here's the first version of my grammar :
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOLS : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
prog : formula EOF ;
formula : NOT formula
| OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| SYMBOLS ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
Then I changed it this way to support the appropriate features :
grammar PropLogic;
NOT : '!' ;
OR : '+' ;
AND : '.' ;
IMPLIES : '->' ;
SYMBOL : ('a'..'z') | '~' ;
OP : '(' ;
CP : ')' ;
EM : '' ;
prog : formula EOF ;
formula : OP formula( AND formula CP | OR formula CP | IMPLIES formula CP)
| ( NOT formula | SYMBOL )( AND formula | OR formula | IMPLIES formula | EM ) ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
But I've been faced with following error :
error<100>: syntax error: invalid char literal: ''
error<100>: syntax error: invalid char literal: ''
Does anybody know that how can I overcome this error?
Your EM token:
EM : '' ;
is invalid: you can't match an empty string in lexer rules.
To match epsilon (nothing), you should do:
rule
: A
| B
| /* epsilon */
;
Of course, the comment /* epsilon */ can safely be removed.
Note that when you do it like that in your current grammar, ANTLR will complain that there can be rules matched using multiple alternatives. This is because your grammar is ambiguous.
I'm not an ANTLR expert, but you might try:
formula : term ((AND | OR | IMPLIES ) term )*;
term : OP formula CP | NOT term | SYMBOL ;
If you want traditional precedence of operators this won't do the trick, but that's another issue.
EDIT: OP raised the ante; he wants precedence too. I'll meet him halfway, since it wasn't part
of the orginal question. I've added precedence to the grammar that makes IMPLIES
the lower precedence than other operators, and leave it to OP to figure out how to do the rest.
formula: disjunction ( IMPLIES disjunction )* ;
disjunction: term (( AND | OR ) term )* ;
term: OP formula CP | NOT term | SYMBOL ;
OP additionally asked, "how to convert (!p or q ) into p -> q". I think he should
have asked this as a separate question. However, I'm already here.
What he needs to do is walk the tree, looking for the pattern he doesn't
like, and change the tree into one he does, and then prettyprint the answer.
It is possible to do all this with ANTLR, which is part of the reason
it is popular.
As a practical matter, procedurally walking the tree and checking the node
types, and splicing out old nodes and splicing in new is doable, but a royal PitA.
Especially if you want to do this for lots of transformations.
A more effective way to do this is to use a
program transformation system, which allows surface syntax patterns to be expressed for matching and replacement. Program transformation systems of course include parsing machinery and more powerful ones let you (and indeed insist) that you define
a grammar up front much as you for ANTLR.
Our DMS Software Reengineering Toolkit is such a program transformation tool, and with a suitably defined grammar for propositions,
the following DMS transformation rule would carry out OP's additional request:
domain proplogic; // tell DMS to use OP's definition of logic as a grammar
rule normalize_implies_from_or( p: term, q: term): formula -> formula
" NOT \p OR \q " -> " \p IMPLIES \q ";
The " ... " is "domain notation", e.g, surface syntax from the proplogic domain, the "\" are meta-escapes,
so "\p" and "\q" represent any arbitrary term from the proplogic grammar. Notice the rule has to reach "across" precedence levels when being applied, as "NOT \p OR \q" isn't a formula and "\p IMPLIES \q" is; DMS takes care of all this (the "formula -> formula" notation is how DMS knows what to do). This rule does a tree-to-tree rewrite. The resulting tree can be prettyprinted by DMS.
You can see a complete example of something very similar, e.g., a grammar for conventional algebra and rewrite rule to simplify algebraic equations.