ANTLR: parse NULL as a function name and a parameter - java

I would like to be able to use 'NULL' as both a parameter (the value null) and a function name in my grammar. See this reduced example:
grammar test;
expr
: value # valueExpr
| FUNCTION_NAME '(' (expr (',' expr)* )* ')' # functionExpr
;
value
: INT
| 'NULL'
;
FUNCTION_NAME
: [a-zA-Z] [a-zA-Z0-9]*
;
INT: [0-9]+;
Now, trying to parse:
NULL( 1 )
Results in the parse tree failing because it parses NULL as a value, and not a function name.
Ideally, I should even be able to parse NULL(NULL).
Can you tell me if this is possible, and if so, how to make it happen?

That 'NULL' string in your grammar defines an implicit token type; it's equivalent to adding something along the lines of:
NULL: 'NULL';
at the start of the lexer rules. When input can be matched by several lexer rules, the first one is used, so in your grammar the implicit rule gets priority and you get a token of type 'NULL' rather than FUNCTION_NAME.
A simple solution would be to introduce a parser rule for function names, something like this:
function_name: FUNCTION_NAME | 'NULL';
and then use that in your expr rule. But that seems brittle if NULL is not intended to be a keyword in your grammar. There are other solutions to this, but I'm not quite sure what to advise since I don't know how you expect your grammar to evolve.
Another solution could be to rename FUNCTION_NAME to NAME, get rid of the implicit 'NULL' token type (i.e. drop the 'NULL' literal from the value rule), and rewrite expr like this:
expr
: value # valueExpr
| NAME '(' (expr (',' expr)* )* ')' # functionExpr
| {_input.LT(1).getText().equals("NULL")}? NAME # nullExpr
;
A semantic predicate takes care of the name comparison here.
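If you want to sanity-check the result, a small driver like the one below works. This is only a minimal sketch assuming ANTLR has generated testLexer and testParser from the grammar named test above and a 4.7+ runtime (CharStreams.fromString); the NullCallDemo class name is mine:
import org.antlr.v4.runtime.*;

public class NullCallDemo {
    public static void main(String[] args) {
        // NULL is used both as a function name and as an argument here.
        CharStream input = CharStreams.fromString("NULL(NULL)");
        testLexer lexer = new testLexer(input);
        testParser parser = new testParser(new CommonTokenStream(lexer));
        // Printing the tree lets you check that the outer NULL was matched
        // by the functionExpr alternative and the inner one by nullExpr.
        System.out.println(parser.expr().toStringTree(parser));
    }
}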

Related

ANTLR4 error recovery issues for class bodies

I've found a strange issue regarding error recovery in ANTLR4. If I take the grammar example from the ANTLR book
grammar simple;
prog: classDef+ ; // match one or more class definitions
classDef
: 'class' ID '{' member+ '}' // a class has one or more members
;
member
: 'int' ID ';' // field definition
| 'int' f=ID '(' ID ')' '{' stat '}' // method definition
;
stat: expr ';'
| ID '=' expr ';'
;
expr: INT
| ID '(' INT ')'
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
and use the input
class T {
y;
int x;
}
it will see the first member as an error (as it expects 'int' before 'y').
classDef
| "class"
| ID 'T'
| "{"
|- member
| | ID "y" -> error
| | ";" -> error
|- member
| | "int"
| | ID "x"
| | ";"
In this case ANTLR4 recovers from the error in the first member subrule and parses the second member correctly.
But if member in classDef is changed from mandatory member+ to optional member*
classDef
: 'class' ID '{' member* '}' // a class has zero or more members
;
then the parsed tree will look like
classDef
| "class" -> error
| ID "T" -> error
| "{" -> error
| ID "y" -> error
| ";" -> error
| "int" -> error
| ID "x" -> error
| ";" -> error
| "}" -> error
It seems that the error recovery cannot solve the issue inside the member subrule anymore.
Obviously using member+ is the way forward as it provides the correct error recovery result. But how do I allow empty class bodies? Am I missing something in the grammar?
The DefaultErrorStrategy class is quite complex, with token deletions and insertions, and the book explains the theory behind this class very well. But what I'm missing is how to implement custom error recovery for specific rules.
In my case I would add something like "if { is already consumed, try to find int or }" to optimize the error recovery for this rule.
Is this possible with ANTLR4 error recovery in a reasonable way at all? Or do I have to implement a parser by hand to really gain control over error recovery for these use cases?
It is worth noting that the parser never enters the sub rule for the given input. The classDef rule fails before trying to match a member.
Before trying to parse the sub-rule, the sync method on DefaultErrorStrategy is called. This sync recognizes there is a problem and tries to recover by deleting a single token to see if that fixes things up.
In this case it doesn't, so an exception is thrown and then tokens are consumed until a 'class' token is found. This makes sense because that is what can follow a classDef and it is the classDef rule, not the member rule that is failing at this point.
It doesn't look simple to do correctly, but if you install a custom subclass of DefaultErrorStrategy and override the sync() method, you can get any recovery strategy you like.
Something like the following could be a starting point:
@Override
public void sync(Parser recognizer) throws RecognitionException {
    // Skip the pre-loop sync inside classDef so the parser still enters
    // the member rule and can recover per member.
    if (recognizer.getContext() instanceof simpleParser.ClassDefContext) {
        return;
    }
    super.sync(recognizer);
}
The result being that the sync doesn't fail, and the member rule is executed. Parsing the first member fails, and the default recovery method handles moving on to the next member in the class.
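To use it, install the strategy on the parser before invoking the start rule. Below is a minimal sketch, assuming the sync() override above lives in a DefaultErrorStrategy subclass I'll call ClassDefSyncStrategy (that name is mine) and that ANTLR generated simpleLexer/simpleParser from the grammar above:
import org.antlr.v4.runtime.*;

public class RecoveryDemo {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("class T {\n y;\n int x;\n}\n");
        simpleLexer lexer = new simpleLexer(input);
        simpleParser parser = new simpleParser(new CommonTokenStream(lexer));
        // Swap in the custom strategy so sync() no longer bails out of
        // classDef before the member loop has been entered.
        parser.setErrorHandler(new ClassDefSyncStrategy());
        parser.prog();
    }
}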

Treat invalid chars as a single token in ANTLR4 lexer

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.
The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add an unknowns parser rule that will match runs of the Unknown lexer token...
unknowns : Unknown+ ;
...
Unknown : . ;
There was an edit made to that post to cater for the case where you are only using a lexer and not a parser. If you are using a parser, you do not need to override the nextToken method, because the error can be handled in the parser in a much cleaner way: unknowns are just another token type as far as the lexer is concerned, and the lexer passes them to the parser, which can then handle the errors.
When using a parser I would normally recognize all tokens as individual tokens and then emit the errors in the parser, i.e. group them or not. The reason for doing this is that all error handling is done in one place, instead of being split between the lexer and the parser. It also makes the lexer simpler to write and test: it must recognize all text and never fail on any UTF-8 input. Some people would likely do it differently, but this has worked for me with hand-written lexers in C. The parser is in charge of determining what is actually valid and how to report errors for it. One other benefit is that the lexer stays fairly generic and can be reused.
For lexer only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method and goes into some detail on how to do that:
@Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
The code that comes after this handles the case where you can match the bad tokens.
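For reference, the shape of that override is roughly as follows. This is only a sketch under my own assumptions: I'm calling the generated lexer JSONLexer, keeping the catch-all token name Unknown from the snippet above, and MergingLexer is my own name; adapt these to your grammar.
import java.util.ArrayDeque;
import java.util.Deque;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.Interval;

public class MergingLexer extends JSONLexer {
    // Tokens read ahead while collapsing a run of Unknown tokens.
    private final Deque<Token> pending = new ArrayDeque<>();

    public MergingLexer(CharStream input) {
        super(input);
    }

    @Override
    public Token nextToken() {
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        Token next = super.nextToken();
        if (next.getType() != Unknown) {
            return next;
        }
        // Collapse the whole run of consecutive Unknown tokens into one.
        Token first = next;
        Token last = next;
        Token lookahead = super.nextToken();
        while (lookahead.getType() == Unknown) {
            last = lookahead;
            lookahead = super.nextToken();
        }
        pending.add(lookahead); // emit the first non-Unknown token next time
        CommonToken merged = new CommonToken(first);
        merged.setStopIndex(last.getStopIndex());
        merged.setText(first.getInputStream().getText(
                Interval.of(first.getStartIndex(), last.getStopIndex())));
        return merged;
    }
}
Whatever consumes the token stream then sees one Unknown token per run of invalid characters instead of one per character.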
What you could do is use lexer modes. For this you'd have to split the grammar into a parser grammar and a lexer grammar. Let's start with the lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']* ->type(NON_VALID_STRING);
Basically I have added some tokens used in the parser part (like LCURL, COL, COMA etc.) and introduced a NON_VALID_STRING token, which matches the first character that isn't matched by anything else. Once this token is detected, I switch the lexer to the MODE_ERR mode. In this mode I go back to the default mode once : is detected (this can be changed and maybe refined, but it serves the purpose here :) ), or I say that everything else is MY_ERR_TOKEN, to which I assign the NON_VALID_STRING token type. Here is what ANTLRWorks shows when I run the interpret lexer option on your input:
So s is of type NON_VALID_STRING, and so is everything else up to the :. Same type, but two different tokens. If you want them not to be of the same type, simply omit the type call in the lexer grammar.
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
and if you run the test rig (I do it with ANTLRWorks) you'll get a single error for the whole invalid run.
Also, you could accumulate lexer errors by overriding the generated lexer class, but I understood from the question that this is not desired, or maybe I didn't understand that part :)
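If you do want to accumulate errors, one option (not the only one) is to attach a custom error listener rather than overriding the lexer class itself. A minimal sketch, where the CollectingErrorListener name and the driver are mine and the lexer/parser names assume the split JSONLexer/JSONParser grammars above:
import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.*;

public class CollectingErrorListener extends BaseErrorListener {
    public final List<String> errors = new ArrayList<>();

    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine, String msg,
                            RecognitionException e) {
        errors.add("line " + line + ":" + charPositionInLine + " " + msg);
    }

    public static void main(String[] args) {
        CollectingErrorListener collector = new CollectingErrorListener();
        JSONLexer lexer = new JSONLexer(
                CharStreams.fromString("{\n sometext-without-quotes : 42\n}"));
        lexer.removeErrorListeners();       // drop the console listener
        lexer.addErrorListener(collector);  // collect instead of printing
        JSONParser parser = new JSONParser(new CommonTokenStream(lexer));
        parser.removeErrorListeners();
        parser.addErrorListener(collector);
        parser.json();
        collector.errors.forEach(System.out::println);
    }
}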

ANTLR4: context-sensitive spaces?

In a grammar I would like to allow text values like xxx without string delimiters.
The idea is to define things like
a = xxx;
instead of
a = "xxx";
to simplify typing. Apart from that there should be variable definitions
and other kinds of constructs as well.
As a first approach I experimented with this grammar:
grammar SpaceNoSpace;
prog: stat+;
stat:
'somethingelse' ';'
| typed description* content
;
typed:
'something' '-'
| 'anotherthing' '-'
;
description:
'someSortOfDetails' COLON ID HASH
| 'otherSortOfDetails' COLON ID HASH
;
content:
contenttext ';'
;
contenttext:
(~';')*
;
COLON: ':' ;
HASH: '#';
SEMI: ';';
SPACE: ' ';
ID: [a-zA-Z][a-zA-Z0-9]+;
WS : [ \t\n\r]+ -> channel(HIDDEN);
ANY_CHAR : . ;
This works fine for input files like this:
something-someSortOfDetails: aVariableName#
this is the content of this;
anotherthing-someSortOfDetails: aVariableName#
here spaces are accepted as much as you like;
somethingelse;
But modifying the last line to
somethingelse ;
leads to a syntax error:
line 7:15 extraneous input ' ' expecting ';'
This probably reveals that the lexer rule
WS : [ \t\n\r]+ -> channel(HIDDEN);
is not applied, (but the SPACE rule???).
On the other hand, if I delete the SPACE lexer rule, the space
in "somethingelse ;" is ignored (by the WS rule), so that the parser rule
stat : 'somethingelse' ';' is matched correctly.
But as a consequence of deleting the SPACE rule, runs of spaces in the content text are collapsed,
so "this      here" will be reduced to "this here".
This is not a big problem, but nevertheless it is an
interesting question:
is it possible to implement context-sensitive WS or SPACE
lexer rules, so that
within the content parser rule any space is preserved,
while in any other rule spaces are ignored?
Is it possible to define such context-sensitive lexer-rule behavior in ANTLR4?
Have you considered Lexer Modes? The section with mode(), pushMode(), popMode is probably interesting for you.
Yet I think that lexer modes are more a problem than a solution. Their purpose is to use (parser) context in the lexer. Consequently one should discard the paradigm of separating lexer and parser - and use a PEG-Parser instead.
Since the SPACE rule comes before the WS rule, a single space is matched by SPACE (both rules match one character, and when match lengths are equal the earlier rule wins), so the lexer returns a SPACE token to the parser on the default channel. The ' ' is not being placed on the hidden channel.
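A quick way to see this is to dump the tokens. The sketch below assumes a recent ANTLR 4 runtime and the SpaceNoSpaceLexer generated from the combined grammar above; TokenDump is my own name:
import org.antlr.v4.runtime.*;

public class TokenDump {
    public static void main(String[] args) {
        // Tokenize the problematic input and print each token with its type.
        SpaceNoSpaceLexer lexer =
                new SpaceNoSpaceLexer(CharStreams.fromString("somethingelse ;"));
        for (Token t : lexer.getAllTokens()) {
            // With the SPACE rule present, the blank before ';' comes out as
            // a SPACE token on the default channel instead of a hidden WS token.
            System.out.println(
                    SpaceNoSpaceLexer.VOCABULARY.getDisplayName(t.getType())
                    + " '" + t.getText() + "'");
        }
    }
}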

Modifying a plain ANTLR file in the background

I would like to modify a grammar file by adding some Java code programmatically in the background. What I mean is: suppose you have a println statement that you want to add to a grammar before ANTLR runs (i.e. before it creates the lexer and parser files).
I have this trivial code: {System.out.println("print");}
Here is the simple grammar; I want to add the above snippet to the 'prog' rule after 'expr':
Before:
grammar Expr;
prog: (expr NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
After:
grammar Expr;
prog: (expr {System.out.println("print");} NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
Again, note that I want to do this at runtime, so that the grammar file itself does not show any Java code (the 'before' snippet).
Is it possible to do this before ANTLR generates the lexer and parser files? Is there any way to visit (like an AST visitor for ANTLR) a plain grammar?
ANTLR 4 generates a listener interface and base class (empty implementation) by default. If you also specify the -visitor flag when generating your parser, it will create a visitor interface and base class. Either of these features may be used to execute code using the parse tree rather than embedding actions directly in the grammar file.
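As a concrete illustration of the listener route, here is a minimal sketch. ExprLexer, ExprParser and ExprBaseListener follow ANTLR's naming for the Expr grammar above; PrintListener and the sample input are my own:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class PrintListener extends ExprBaseListener {
    @Override
    public void exitExpr(ExprParser.ExprContext ctx) {
        // Only react to the expr directly inside prog, i.e. the same spot
        // where the inline action {System.out.println("print");} would sit.
        if (ctx.getParent() instanceof ExprParser.ProgContext) {
            System.out.println("print");
        }
    }

    public static void main(String[] args) {
        ExprLexer lexer = new ExprLexer(CharStreams.fromString("3*4\n"));
        ExprParser parser = new ExprParser(new CommonTokenStream(lexer));
        // The listener runs while walking the finished parse tree, so the
        // grammar file itself stays free of Java code.
        ParseTreeWalker.DEFAULT.walk(new PrintListener(), parser.prog());
    }
}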
If the code is always in the same place, simply insert a function call that acts as a hook to include real code afterwards.
This way you don't have to modify the source or generate the lexer/parser again.
If you want to insert code at predefined points (like enter rule/leave rule), go with Sam's solution to insert them into a listener. In either case it should not be necessary to modify the grammar file.
grammar Expr;
prog: (expr {Hooks.programHook();} NEWLINE)* ;
expr: expr ('*'|'/') expr
| INT
;
NEWLINE : [\r\n]+ ;
INT : [0-9]+ ;
In a java file of your choice (I'm no Java programmer, so the real syntax may be different):
public class Hooks
{
    public static void programHook()
    {
        System.out.println("print");
    }
}

ANTLR Decision can match input using multiple alternatives

I have this simple grammar:
expr: factor;
factor: atom (('*' ^ | '/'^) atom)*;
atom: INT
| ':' expr;
INT: ('0'..'9')+;
When I run it, it says:
Decision can match input such as '*' using multiple alternatives 1,2
Decision can match input such as '/' using multiple alternatives 1,2
I can't spot the ambiguity. How are the red arrows pointing?
Any help would be appreciated.
Let's say you want to parse the input:
:3*4*:5*6
The parser generated by your grammar can match this input in more than one way: whenever the inner expr that follows a ':' sees a '*', it can either keep consuming atoms itself or stop and let the enclosing factor loop match the rest, producing different parse trees for the same input.
Note that what you see is just a warning. By specifically instructing ANTLR that (('*' | '/') atom)* needs to be matched greedily, like this:
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
the parser "knows" which alternative to take, and no warning is emitted.
EDIT
I tested the grammar with ANTLR 3.3 as follows:
grammar T;
options {
output=AST;
}
parse
: expr EOF!
;
expr
: factor
;
factor
: atom (options{greedy=true;}: ('*'^ | '/'^) atom)*
;
atom
: INT
| ':'^ expr
;
INT : ('0'..'9')+;
And then from the command line:
java -cp antlr-3.3.jar org.antlr.Tool T.g
which does not produce any warning (or error).
