antlr grammar for triple quoted string - java

I am trying to update an ANTLR grammar to follow this spec:
https://github.com/facebook/graphql/pull/327/files
In logical terms it's defined as:
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Note: the above is the logical definition, not ANTLR syntax.)
I tried the following in ANTLR 4, but it won't recognize more than one character inside a triple-quoted string:
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""', but as soon as I add more characters it fails.
E.g. '"""abc"""' won't parse, and the IntelliJ plugin for ANTLR says:
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple-quoted strings in ANTLR with '\"""' escaping?

Some of your parser rules should really be lexer rules, and SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match interleavings like aababbba...
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
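As a sanity check, here is a minimal, untested sketch that dumps the token stream for a triple-quoted input. The GraphqlStringsLexer class name is an assumption (use whatever ANTLR generates for your grammar), and it presumes the EscapedChar fragment referenced above is defined elsewhere:

import org.antlr.v4.runtime.*;

public class TripleQuoteDump {
    public static void main(String[] args) {
        // The Java literal below is the input """abc""" (three quotes on each side).
        CharStream input = CharStreams.fromString("\"\"\"abc\"\"\"");
        GraphqlStringsLexer lexer = new GraphqlStringsLexer(input); // assumed generated lexer name
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            String name = lexer.getVocabulary().getSymbolicName(t.getType());
            System.out.println(name + " -> " + t.getText());
        }
    }
}

With the rules above, the whole """abc""" input should come back as a single Triplequotedstring token (plus EOF).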

Related

How do I make a Simple ANTLR grammar extension?

I'm writing a framework that uses ANTLR to parse Java-style expressions. I had in mind to create a new type of free-form literal. The literal will look similar to a string, so I thought to extend the Java8 grammar I'm using with a new literal identical to StringLiteral, but bounded by '`' characters instead of '"'.
So I created:
ExternalLiteral
: '`' StringCharacters? '`'
;
in the Lexer and modified:
fragment
StringCharacter
: ~["`\\\r\n]
| EscapeSequence
;
and
fragment
EscapeSequence
: '\\' [btnfr"'`\\]
| OctalEscape
| UnicodeEscape // This is not in the spec but prevents having to preprocess the input
;
so '`' would be treated as a special character, identically to '"'. Then in the grammar I modified
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| ExternalLiteral
| NullLiteral
;
That seems like it would work to me, but when I try to parse any such expression, e.g. `0`, I get:
line 1:3 mismatched input '<EOF>' expecting {'boolean', 'byte', 'char', 'double', 'float', 'int', 'long', 'new', 'short', 'super', 'this', 'void', IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, ExternalLiteral, 'null', '(', '!', '~', '++', '--', '+', '-', Identifier, '#'}
line 1:3 missing {IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, ExternalLiteral, 'null'} at '<EOF>'
I've had fights with ANTLR before, and I don't know whether the bigger problem is ANTLR or me. Does anyone with more experience see what I might have done wrong?
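A quick way to narrow this down is to print the token stream and see what the lexer actually produces for an input like `0` before the parser ever runs. A rough, untested sketch, where the Java8Lexer class name is an assumption (use whatever lexer class your grammar generates):

import org.antlr.v4.runtime.*;

public class BacktickTokenDump {
    public static void main(String[] args) {
        Java8Lexer lexer = new Java8Lexer(CharStreams.fromString("`0`")); // assumed generated lexer name
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.println(lexer.getVocabulary().getSymbolicName(t.getType()) + " : '" + t.getText() + "'");
        }
    }
}

If no ExternalLiteral token shows up in that output, the problem is on the lexer side rather than in the literal parser rule.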

Antlr - Parsing Multiline #define for C.g4

I am using ANTLR 4 to parse C code.
I want to parse multiline #defines along with the C.g4 grammar provided in
C.g4
But the grammar mentioned in the link above does not support preprocessor directives, so I have added the following new rules to support preprocessing.
Link to my previous question
Whitespace
: [ \t]+
-> channel(HIDDEN)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> channel(HIDDEN)
;
BlockComment
: '/*' .*? '*/'
;
LineComment
: '//' ~[\r\n]*
;
IncludeBlock
: '#' Whitespace? 'include' ~[\r\n]*
;
DefineStart
: '#' Whitespace? 'define'
;
DefineBlock
: DefineStart ~[\r\n]*
;
MultiDefine
: DefineStart MultiDefineBody
;
MultiDefineBody
: [\\] [\r\n]+ MultiDefineBody
| ~[\r\n]
;
preprocessorDeclaration
: includeDeclaration
| defineDeclaration
;
includeDeclaration
: IncludeBlock
;
defineDeclaration
: DefineBlock | MultiDefine
;
comment
: BlockComment
| LineComment
;
declaration
: declarationSpecifiers initDeclaratorList ';'
| declarationSpecifiers ';'
| staticAssertDeclaration
| preprocessorDeclaration
| comment
;
It works for single-line preprocessor directives, but only if the MultiDefine rule is removed.
For multiline #defines it is not working.
Any help will be appreciated.
By multiline #define I mean:
#define MACRO(num, str) {\
printf("%d", num);\
printf(" is");\
printf(" %s number", str);\
printf("\n");\
}
Basically, I need a grammar that can parse the above block.
I'm shamelessly copying part of my answer from here:
This is because ANTLR's lexer matches "first come, first serve". That means it will try to match the given input with the first specified (in the source code) rule, and if that one can match the input, it won't try to match it with the other ones.
In your case the input sequence DefineStart \\\r\n (where DefineStart stands for an input sequence corresponding to the respective rule) will be matched by DefineBlock, because the \\ is consumed by the ~[\r\n]* construct.
You now have two possibilities: either you tweak your current set of rules to circumvent this problem, or (my suggestion) you simply use one rule for matching a define statement (single and multiline).
Such a merged rule could look like this:
DefineBlock:
DefineStart (~[\\\r\n] | '\\\\' '\r'? '\n' | '\\'. )*
;
Note that this code is untested, but it should read like this: match DefineStart, and afterwards an arbitrarily long character sequence where each element is either a character that is not \, \r or \n, an escaped newline, or a backslash followed by an arbitrary character.
This should allow for the desired newline escaping.
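To check the merged rule, it is enough to feed the macro from the question to the lexer and print the tokens; if it works, the whole #define ... } block should come back as one DefineBlock token. An untested sketch, assuming the generated lexer class is called CLexer:

import org.antlr.v4.runtime.*;

public class DefineLexCheck {
    public static void main(String[] args) {
        String src = "#define MACRO(num, str) {\\\n"
                   + "printf(\"%d\", num);\\\n"
                   + "printf(\" is\");\\\n"
                   + "}\n";
        CLexer lexer = new CLexer(CharStreams.fromString(src)); // assumed generated lexer class
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            String name = lexer.getVocabulary().getSymbolicName(t.getType());
            System.out.println(name + " covers: " + t.getText().replace("\n", "\\n"));
        }
    }
}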

Antlr Eclipse IDE White Space not being skipped

I apologize in advance if this question has already been asked; I can't seem to find it.
I'm just beginning with ANTLR, using the antlr4IDE for Eclipse to create a parser for a small subset of Java. For some reason, unless I explicitly state the presence of whitespace in my rules, the parser throws an error.
My grammar:
grammar Hello;
r :
(Statement ';')+
;
Statement:
DECL | INIT
;
DECL:
'int' ID
;
INIT:
DECL '=' NUMEXPR
;
NUMEXPR :
Number OP Number | Number
;
OP :
'+'
| '-'
| '/'
| '*'
;
WS :
[ \t\r\n\u000C]+ -> skip
;
Number:
[0-9]+
;
ID :
[a-zA-Z]+
;
When trying to parse
int hello = 76;
I receive the error:
Hello::r:1:0: mismatched input 'int' expecting Statement
Hello::r:1:10: token recognition error at: '='
However, when I manually add the token WS into the rules, I receive no error.
Any ideas where I'm going wrong? I'm new to Antlr, so I'm probably making a stupid mistake. Thanks in advance.
EDIT: Here are my parse tree and error log (screenshots not included).
Change your syntax like this. The rules that combine other tokens (statement, decl, init, numexpr, op) should be parser rules, starting with a lower-case letter: skipped whitespace is only discarded between tokens, never inside a single lexer rule, which is why your version only works when you spell the whitespace out.
grammar Hello;
r : (statement ';')+ ;
statement : decl | init ;
decl : 'int' ID ;
init : decl '=' numexpr ;
numexpr : Number op Number | Number ;
op : '+' | '-' | '/' | '*' ;
WS : [ \t\r\n\u000C]+ -> skip ;
Number : [0-9]+ ;
ID : [a-zA-Z]+ ;
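For reference, a minimal driver for the corrected grammar; this sketch assumes the grammar above is saved as Hello.g4 with the Java target, so ANTLR generates HelloLexer and HelloParser:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class HelloDriver {
    public static void main(String[] args) {
        HelloLexer lexer = new HelloLexer(CharStreams.fromString("int hello = 76;"));
        HelloParser parser = new HelloParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.r(); // r is the start rule of the grammar
        // Prints the parse tree in LISP form, e.g. (r (statement (init ...)) ;)
        System.out.println(tree.toStringTree(parser));
    }
}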
After looking at the ANTLR 4 documentation, it seems you have to specify all of the character combinations you expect to see in your input, from start to finish, not just those you want to handle.
In that regard, it's expected that you have to explicitly state the whitespace, with something like:
WS : [ \t\r\n]+ -> skip;
That's why the skip command exists:
A 'skip' command tells the lexer to get another token and throw out the current text.
Though note that sometimes this can cause a little trouble such as in this post.

Treat invalid chars as a single token in ANTLR4 lexer

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.
The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add unknowns to the lexer that will match multiples of these...
unknowns : Unknown+ ;
...
Unknown : . ;
There was an edit made to this post to cater for the case where you are only using a lexer and not a parser. If you are using a parser, then you do not need to override the nextToken method, because the error can be handled in the parser in a much cleaner way, i.e. unknowns are just another token type as far as the lexer is concerned. The lexer passes these to the parser, which can then handle the errors.
If using a parser, I'd normally recognize all tokens as individual tokens and then emit the errors in the parser, i.e. group them or not. The reason for doing this is that all error handling is done in one place: it's not split between the lexer and the parser. It also makes the lexer simpler to write and test, i.e. it must recognize all text and never fail on any UTF-8 input. Some people would likely do it differently, but this has worked for me with hand-written lexers in C. The parser is in charge of determining what's actually valid and how to report errors for it. One other benefit is that the lexer is fairly generic and can be reused.
For a lexer-only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method, and it goes into some detail on how to do that:
#Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
The code that comes after this handles the case where you can match the bad tokens.
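For completeness, here is a rough, untested sketch of that override, written as a subclass of the generated lexer (the JSONLexer name and the Unknown token type are assumptions based on the linked answer). It buffers one token ahead and glues a run of consecutive Unknown tokens into a single merged token:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;

// Subclass of the generated lexer (class and token names are assumptions).
public class MergingJSONLexer extends JSONLexer {
    private Token pending;

    public MergingJSONLexer(CharStream input) { super(input); }

    @Override
    public Token nextToken() {
        Token next = (pending != null) ? pending : super.nextToken();
        pending = null;
        if (next.getType() != Unknown) {
            return next;
        }
        // Collect the run of consecutive Unknown tokens.
        Token last = next;
        Token lookahead = super.nextToken();
        while (lookahead.getType() == Unknown) {
            last = lookahead;
            lookahead = super.nextToken();
        }
        pending = lookahead; // hand this one out on the next call
        // Build one token spanning the whole run of bad characters.
        CommonToken merged = new CommonToken(next);
        merged.setStopIndex(last.getStopIndex());
        merged.setText(_input.getText(Interval.of(next.getStartIndex(), last.getStopIndex())));
        return merged;
    }
}

The copy constructor keeps the line and position of the first bad character, and the merged text spans the whole run, so an editor plugin can underline it as one region.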
What you could do is use lexer modes. For this you'd have to split the grammar into a parser grammar and a lexer grammar. Let's start with the lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']* ->type(NON_VALID_STRING);
Basically I have added some tokens used in the parser part (like LCURL, COL, COMA, etc.) and introduced a NON_VALID_STRING token, which is basically the first character that is not matched by anything else. Once this token is detected, I switch the lexer to MODE_ERR mode. In this mode I go back to the default mode once : is detected (this can be changed and maybe refined, but it serves the purpose here :) ), or I say that everything else is MY_ERR_TOKEN, to which I assign the NON_VALID_STRING token type. Here is what ANTLRWorks says when I run the interpret-lexer option with your input (screenshot not included):
So s is NON_VALID_STRING type and so is everything else until :. So, same type but two different tokens. If you want them not to be of the same type, simply omit the type call in the lexer grammar.
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
and if you run the test rig (I do it with ANTLRWorks) you'll get a single error (screenshot not included).
Also you could accumulate lexer errors by overriding the generated lexer class, but I understood in the question that this is not desired or I didn't understand that part :)
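If you go the split-grammar route, wiring it up in Java and counting the reported errors is straightforward. A sketch using the class names generated from the grammars above, with a small error listener that just counts:

import org.antlr.v4.runtime.*;

public class JsonErrorCount {
    public static void main(String[] args) {
        String src = "{\n sometext-without-quotes : 42\n}";
        JSONLexer lexer = new JSONLexer(CharStreams.fromString(src));
        JSONParser parser = new JSONParser(new CommonTokenStream(lexer));
        final int[] errorCount = {0};
        parser.removeErrorListeners();
        parser.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                    int line, int charPositionInLine,
                                    String msg, RecognitionException e) {
                errorCount[0]++;
                System.out.println("line " + line + ":" + charPositionInLine + " " + msg);
            }
        });
        parser.json(); // start rule from JSONParser.g4
        System.out.println(errorCount[0] + " syntax error(s)");
    }
}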

ANTLR Decision can match input using multiple alternatives

I have this simple grammar:
expr: factor;
factor: atom (('*' ^ | '/'^) atom)*;
atom: INT
| ':' expr;
INT: ('0'..'9')+;
When I run it, it says:
Decision can match input such as '*' using multiple alternatives 1,2
Decision can match input such as '/' using multiple alternatives 1,2
I can't spot the ambiguity. How are the red arrows pointing?
Any help would be appreciated.
Let's say you want to parse the input:
:3*4*:5*6
The parser generated by your grammar could match this input with two different parse trees (tree images not included; the colons are omitted to keep the trees clearer).
Note that what you see is just a warning. By specifically instructing ANTLR that (('*' | '/') atom)* needs to be matched greedily, like this:
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
the parser "knows" which alternative to take, and no warning is emitted.
EDIT
I tested the grammar with ANTLR 3.3 as follows:
grammar T;
options {
output=AST;
}
parse
: expr EOF!
;
expr
: factor
;
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
atom
: INT
| ':'^ expr
;
INT : ('0'..'9')+;
And then from the command line:
java -cp antlr-3.3.jar org.antlr.Tool T.g
which does not produce any warning (or error).
