ANTLR Grammar - Error parsing script block - java

I'm trying to create a grammar in ANTLR, which is shown below.
grammar EPL2;
standard_rule:
'STANDARD' 'RULE' ':'
'FILTER' SCRIPT
'SINK' SCRIPT;
SCRIPT
: '{' SCRIPT_ATOM* '}'
;
fragment SCRIPT_ATOM
: ~[{}]
| '"' ( ~('"') )* '"'
| '//' ~[\r\n]*
| SCRIPT
;
MultiLineComment
: '/*' .*? '*/' -> channel(HIDDEN)
;
SingleLineComment
: '//' ~[\r\n\u2028\u2029]* -> channel(HIDDEN)
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
When I run the grammar against the following input:
STANDARD RULE:
FILTER { data.get("abc") == "a"; }
SINK { data.get("xyz") > 10 ;}
I get this error:
line 3:36 mismatched input '<EOF>' expecting 'SINK'
I'm using an IntelliJ plugin to visualize the parse tree and hierarchy. I can see that the second SCRIPT is getting attached to the first, as illustrated by the following figure.
When I close the bracket (}) the parser should advance to the SINK clause, but this isn't happening.
If I add an opening or closing bracket to the second string ("{a"), the second script appears correctly.
I don't know what I am doing wrong. Any clues?

Related

Antlr - Parsing Multiline #define for C.g4

I am using Antlr4 to parse C code.
I want to parse multiline #defines along with the C.g4 grammar provided in
C.g4
But the grammar mentioned in the link above does not support preprocessor directives, so I have added the following new rules to support preprocessing.
Link to my previous question
Whitespace
: [ \t]+
-> channel(HIDDEN)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> channel(HIDDEN)
;
BlockComment
: '/*' .*? '*/'
;
LineComment
: '//' ~[\r\n]*
;
IncludeBlock
: '#' Whitespace? 'include' ~[\r\n]*
;
DefineStart
: '#' Whitespace? 'define'
;
DefineBlock
: DefineStart ~[\r\n]*
;
MultiDefine
: DefineStart MultiDefineBody
;
MultiDefineBody
: [\\] [\r\n]+ MultiDefineBody
| ~[\r\n]
;
preprocessorDeclaration
: includeDeclaration
| defineDeclaration
;
includeDeclaration
: IncludeBlock
;
defineDeclaration
: DefineBlock | MultiDefine
;
comment
: BlockComment
| LineComment
;
declaration
: declarationSpecifiers initDeclaratorList ';'
| declarationSpecifiers ';'
| staticAssertDeclaration
| preprocessorDeclaration
| comment
;
It works only for single-line preprocessor directives, and only if the MultiDefine rule is removed.
For multiline #defines it does not work.
Any help will be appreciated.
By multiline #define I mean:
#define MACRO(num, str) {\
printf("%d", num);\
printf(" is");\
printf(" %s number", str);\
printf("\n");\
}
Basically, I need a grammar that can parse the above block.
I'm shamelessly copying part of my answer from here:
This is because of how ANTLR's lexer chooses between rules: it picks the rule that matches the most characters and, if several rules match the same number of characters, the one specified first in the grammar. Once a rule has won, the lexer does not try to match that input with the other rules.
In your case the input sequence DefineStart \\\r\n (where DefineStart stands for an input sequence corresponding to the respective rule) will be matched by DefineBlock because the \\ is consumed by the ~[\r\n]* construct.
You now have two possibilities: either you tweak your current set of rules to circumvent this problem or (my suggestion) you simply use one rule for matching a define statement (single and multiline).
Such a merged rule could look like this:
DefineBlock:
DefineStart (~[\\\r\n] | '\\' '\r'? '\n' | '\\' . )*
;
Note that this code is untested, but it should read like this: match DefineStart, followed by an arbitrarily long sequence in which each element is either a character other than \, \r or \n, an escaped newline (a backslash directly followed by a line break), or a backslash followed by any other character.
This should allow for the desired newline escaping.
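To see how the three alternatives of the merged rule interact, here is a rough Java regex analogue. It is only a sketch and not ANTLR itself: the class name DefineBlockSketch and the helper firstDefine are made up for illustration, and the escaped-newline alternative is assumed to be a single backslash followed by an optional \r and a \n.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DefineBlockSketch {
    // Regex analogue of the merged lexer rule (assumption: one backslash escapes a newline):
    //   [^\\\r\n]  any character except backslash, CR, LF
    //   \\\r?\n    an escaped newline
    //   \\.        a backslash followed by any other (non-line-break) character
    private static final Pattern DEFINE_BLOCK = Pattern.compile(
            "#[ \\t]*define(?:[^\\\\\\r\\n]|\\\\\\r?\\n|\\\\.)*");

    // Returns the first #define (single or multiline) found in source, or null.
    static String firstDefine(String source) {
        Matcher m = DEFINE_BLOCK.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "#define MACRO(num) {\\\n  printf(\"%d\", num);\\\n}\nint x;\n";
        // Prints the three macro lines and stops before "int x;".
        System.out.println(firstDefine(source));
    }
}
```

The match ends at the first line break that is not preceded by a backslash, which is the behaviour the single DefineBlock rule is meant to have.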

antlr grammar for triple quoted string

I am trying to update an ANTLR grammar that follows the following spec
https://github.com/facebook/graphql/pull/327/files
In logical terms it's defined as
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Note: the above is logical notation, not ANTLR syntax.)
I tried the following in ANTLR 4, but it won't recognize more than 1 character inside a triple-quoted string:
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""', but as soon as I add more characters it fails,
e.g. '"""abc"""' won't parse and the IntelliJ plugin for ANTLR says
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple quoted strings in ANTLR with '\"""' escaping?
Some of your parser rules should really be lexer rules. And SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
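The same difference shows up with ordinary regular expressions. As a small Java sketch (the class and method names are invented for the example, and 'a' and 'b' merely stand in for the two kinds of string part):

```java
import java.util.regex.Pattern;

final class RepetitionSketch {
    // Analogue of EscapedTripleQuote* | SourceCharacter*: a run of one kind OR a run of the other.
    static final Pattern SEPARATE_RUNS = Pattern.compile("a*|b*");
    // Analogue of ( EscapedTripleQuote | SourceCharacter )*: any mix of both kinds.
    static final Pattern MIXED_RUN = Pattern.compile("(?:a|b)*");

    static boolean matchesSeparateRuns(String s) {
        return SEPARATE_RUNS.matcher(s).matches();
    }

    static boolean matchesMixedRun(String s) {
        return MIXED_RUN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesSeparateRuns("aaa"));      // true
        System.out.println(matchesSeparateRuns("aababbba")); // false
        System.out.println(matchesMixedRun("aababbba"));     // true
    }
}
```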
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];

Antlr Eclipse IDE White Space not being skipped

I apologize in advance if this question has already been asked, can't seem to find it.
I'm just beginning with Antlr, using the antlr4IDE for Eclipse to create a parser for a small subset of Java. For some reason, unless I explicitly state the presence of a white space in my regex, the parser will throw an error.
My grammar:
grammar Hello;
r :
(Statement ';')+
;
Statement:
DECL | INIT
;
DECL:
'int' ID
;
INIT:
DECL '=' NUMEXPR
;
NUMEXPR :
Number OP Number | Number
;
OP :
'+'
| '-'
| '/'
| '*'
;
WS :
[ \t\r\n\u000C]+ -> skip
;
Number:
[0-9]+
;
ID :
[a-zA-Z]+
;
When trying to parse
int hello = 76;
I receive the error:
Hello::r:1:0: mismatched input 'int' expecting Statement
Hello::r:1:10: token recognition error at: '='
However, when I manually add the token WS into the rules, I receive no error.
Any ideas where I'm going wrong? I'm new to Antlr, so I'm probably making a stupid mistake. Thanks in advance.
EDIT : Here is my parse tree and error log:
Error Log:
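Change the syntax like this, turning the composite rules into parser rules (lowercase names). A lexer rule matches its characters contiguously, so the skipped WS can never appear between 'int' and ID inside a lexer rule such as DECL, whereas whitespace between the tokens of a parser rule is skipped.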
grammar Hello;
r : (statement ';')+ ;
statement : decl | init ;
decl : 'int' ID ;
init : decl '=' numexpr ;
numexpr : Number op Number | Number ;
op : '+' | '-' | '/' | '*' ;
WS : [ \t\r\n\u000C]+ -> skip ;
Number : [0-9]+ ;
ID : [a-zA-Z]+ ;
After looking at the documentation on antlr4, it seems like you have to have a specification for all of the character combinations that you expect to see in your file, from start to finish - not just those that you want to handle.
In that regard, it's expected that you would have to explicitly state the whitespace, with something like:
WS : [ \t\r\n]+ -> skip;
That's why the skip command exists:
A 'skip' command tells the lexer to get another token and throw out the current text.
Though note that sometimes this can cause a little trouble such as in this post.

antlr how to define optional parts in any order

Suppose I need a grammar to parse the following template:
1. REPORT
2. BEGIN
3. QUERY
4. BEGIN
5. AGGREGATION: day
6. DIMENSION: department
7. END
8. END
Where line #5 and #6 are optional and the order of the 2 lines doesn't matter. How can I specify this in my grammar file? Below is my solution (see line #12):
1. grammar PRL;
2. report
3. : REPORT
4. BEGIN
5. query
6. END
7. ;
8.
9. query
10. : QUERY
11. BEGIN
12. (aggregation_decl dimension_decl | dimension_decl aggregation_decl)?
13. END
14. ;
It works, but it looks ugly, and if I have more than 2 parts it's going to become unmanageable very quickly. Any advice?
Something like this? Generally you would enforce only one of each item exists at a later processing step. Otherwise, as you see, the grammar gets unwieldy.
grammar PRL;
report
: REPORT
BEGIN
query
END
;
query
: QUERY
BEGIN
body_decl*
END
;
body_decl :
aggregation_decl
| dimension_decl;
As already mentioned by Adam: this is generally something done after the parser has created some sort of (abstract) parse tree. You simply collect all types of declarations like this:
grammar PRL;
report
: REPORT BEGIN query END
;
query
: QUERY BEGIN decl* END
;
decl
: NAME ':' NAME
;
REPORT : 'REPORT';
BEGIN : 'BEGIN';
END : 'END';
QUERY : 'QUERY';
NAME : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
and after that, check if there are duplicates in decl* in your AST.
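A post-parse check of that kind could look roughly like the following Java sketch. It assumes you have already walked the tree and collected the left-hand NAME of every decl into a list; DuplicateDeclCheck and duplicateKeys are hypothetical names, not part of ANTLR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateDeclCheck {
    // Returns every key that occurs more than once, in order of first appearance.
    static List<String> duplicateKeys(List<String> declKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : declKeys) {
            counts.merge(key, 1, Integer::sum); // count occurrences per key
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }
}
```

An empty result means every declaration appeared at most once; otherwise you can report the offending keys however you like.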
But if you really want to do this during parsing, you need to grab the left hand side of decl and add these in a Set and when you stumble upon a duplicate, throw a predicate exception:
grammar PRL;
@parser::header {
import java.util.Set;
import java.util.HashSet;
}
report
: REPORT BEGIN query END
;
query
: QUERY BEGIN unique_decls END
;
unique_decls
@init{Set<String> set = new HashSet<String>();}
: (decl {set.add($decl.key)}?)*
;
decl returns[String key]
: k=NAME ':' NAME {$key = $k.text;}
;
REPORT : 'REPORT';
BEGIN : 'BEGIN';
END : 'END';
QUERY : 'QUERY';
NAME : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
The {set.add($decl.key)}?, called a validating semantic predicate, will throw an exception when the code inside it (set.add($decl.key)) evaluates to false. In this case, it evaluates to false whenever the set already contains that key.
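The predicate relies on the fact that java.util.Set.add returns false when the element is already present. Outside of ANTLR, that behaviour can be sketched like this (PredicateSketch and allUnique are made-up names for illustration):

```java
import java.util.HashSet;
import java.util.Set;

final class PredicateSketch {
    // Mirrors what the predicate checks for each decl: true only while every key is new.
    static boolean allUnique(String... keys) {
        Set<String> seen = new HashSet<>();
        for (String key : keys) {
            if (!seen.add(key)) { // add() returns false for a key that is already in the set
                return false;     // this is the point where the generated parser would throw
            }
        }
        return true;
    }
}
```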

ANTLR Decision can match input using multiple alternatives

I have this simple grammar:
expr: factor;
factor: atom (('*' ^ | '/'^) atom)*;
atom: INT
| ':' expr;
INT: ('0'..'9')+;
When I run it, it says:
Decision can match input such as '*' using multiple alternatives 1,2
Decision can match input such as '/' using multiple alternatives 1,2
I can't spot the ambiguity. How are the red arrows pointing ?
Any help would be appreciated.
Let's say you want to parse the input:
:3*4*:5*6
The parser generated by your grammar could match this input into the following parse trees:
and:
(I omitted the colons to keep the trees more clear)
Note that what you see is just a warning. By specifically instructing ANTLR that (('*' | '/') atom)* needs to be matched greedily, like this:
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
the parser "knows" which alternative to take, and no warning is emitted.
EDIT
I tested the grammar with ANTLR 3.3 as follows:
grammar T;
options {
output=AST;
}
parse
: expr EOF!
;
expr
: factor
;
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
atom
: INT
| ':'^ expr
;
INT : ('0'..'9')+;
And then from the command line:
java -cp antlr-3.3.jar org.antlr.Tool T.g
which does not produce any warning (or error).
