ANTLR4 error recovery issues for class bodies

I've found a strange issue regarding error recovery in ANTLR4. If I take the grammar example from the ANTLR book
grammar simple;
prog: classDef+ ; // match one or more class definitions
classDef
: 'class' ID '{' member+ '}' // a class has one or more members
;
member
: 'int' ID ';' // field definition
| 'int' f=ID '(' ID ')' '{' stat '}' // method definition
;
stat: expr ';'
| ID '=' expr ';'
;
expr: INT
| ID '(' INT ')'
;
INT : [0-9]+ ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
and use the input
class T {
y;
int x;
}
it will see the first member as an error (as it expects 'int' before 'y').
classDef
| "class"
| ID 'T'
| "{"
|- member
| | ID "y" -> error
| | ";" -> error
|- member
| | "int"
| | ID "x"
| | ";"
In this case ANTLR4 recovers from the error in the first member subrule and parses the second member correctly.
But if the classDef rule is changed from the mandatory member+ to the optional member*
classDef
: 'class' ID '{' member* '}' // a class has zero or more members
;
then the parsed tree will look like
classDef
| "class" -> error
| ID "T" -> error
| "{" -> error
| ID "y" -> error
| ";" -> error
| "int" -> error
| ID "x" -> error
| ";" -> error
| "}" -> error
It seems that the error recovery cannot solve the issue inside the member subrule anymore.
Obviously using member+ is the way forward as it provides the correct error recovery result. But how do I allow empty class bodies? Am I missing something in the grammar?
The DefaultErrorStrategy class is quite complex with token deletions and insertions and the book explains the theory of this class in a very good way. But what I'm missing here is how to implement custom error recovery for specific rules?
In my case I would add something like "if { is already consumed, try to find int or }" to optimize the error recovery for this rule.
Is this possible with ANTLR4 error recovery in a reasonable way at all? Or do I have to write a parser by hand to really gain control over error recovery for those use cases?

It is worth noting that the parser never enters the sub rule for the given input. The classDef rule fails before trying to match a member.
Before trying to parse the sub-rule, the sync method on DefaultErrorStrategy is called. This sync recognizes there is a problem and tries to recover by deleting a single token to see if that fixes things up.
In this case it doesn't, so an exception is thrown and then tokens are consumed until a 'class' token is found. This makes sense because that is what can follow a classDef and it is the classDef rule, not the member rule that is failing at this point.
It doesn't look simple to do correctly, but if you install a custom subclass of DefaultErrorStrategy and override the sync() method, you can get any recovery strategy you like.
Something like the following could be a starting point:
@Override
public void sync(Parser recognizer) throws RecognitionException {
    // Skip the default sync inside classDef so that the member rule is always entered
    if (recognizer.getContext() instanceof simpleParser.ClassDefContext) {
        return;
    }
    super.sync(recognizer);
}
As a result, the sync doesn't fail and the member rule is executed. Parsing the first member fails, and the default recovery method handles moving on to the next member in the class.
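To make the "if { is already consumed, try to find int or }" idea from the question concrete, here is a minimal hand-written sketch in plain Java. It does not use the ANTLR runtime at all; the class name ClassBodyRecoverySketch, the method parseMembers and the "<error>" marker are made up for illustration, and it only handles field members. It just shows the resynchronization logic one might try to reproduce inside a custom error strategy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy parser for: 'class' ID '{' member* '}' with member: 'int' ID ';' (fields only). */
class ClassBodyRecoverySketch {
    // Tokens at which parsing of the class body can safely resume.
    private static final Set<String> SYNC = Set.of("int", "}");

    /** Returns one entry per member: the field name, or "<error>" for a skipped run of tokens. */
    static List<String> parseMembers(List<String> tokens) {
        List<String> members = new ArrayList<>();
        int i = tokens.indexOf("{") + 1; // start right after the opening brace
        while (i < tokens.size() && !tokens.get(i).equals("}")) {
            boolean isField = i + 2 < tokens.size()
                    && tokens.get(i).equals("int")
                    && tokens.get(i + 1).matches("[a-zA-Z]+")
                    && tokens.get(i + 2).equals(";");
            if (isField) {
                members.add(tokens.get(i + 1));
                i += 3;
            } else {
                members.add("<error>");
                i++; // always consume at least one token so the loop makes progress
                while (i < tokens.size() && !SYNC.contains(tokens.get(i))) {
                    i++; // skip ahead to the next 'int' or '}'
                }
            }
        }
        return members;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("class", "T", "{", "y", ";", "int", "x", ";", "}");
        System.out.println(parseMembers(tokens)); // prints [<error>, x]
    }
}
```

The bad tokens y ; are reported once and skipped, and parsing resumes at int x ;. This is the same result the member+ version of the grammar produces.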

Related

What is right way to initialize nested entities?

I am using Eclipse with the Xtext plugin for a DSL with the following grammar:
AbstractStatement returns AbstractStatement:
IfStructureStatement | DeclarativeStatement | BreakStatement | EqualityStatement | SignalStatement;
Component returns Component:
LED_Panel | Switch | Timer | LED_Light;
Setup returns Setup:
{Setup}
'SETUP BEGIN'
( abstractstatement+=AbstractStatement ( "\r" abstractstatement+=AbstractStatement)* )?
'SETUP END';
DeclarativeStatement returns DeclarativeStatement:
{DeclarativeStatement}
'DECLARE'
( component+=[Component|EString] ( "," component+=[Component|EString])* )?
( variable+=[Variable|EString] ( "," variable+=[Variable|EString])* )?
( constant+=[Constant|EString] ( "," constant+=[Constant|EString])* )?";";
LED_Panel returns LED_Panel:
{LED_Panel}
'LED_PANEL'
ElementName=EString
('{'
'PanelWidth' PanelWidth=EInt
'PanelHeight' PanelHeight=EInt
'PanelText' PanelText=EString
'ON' '{' pin+=Pin ( "," pin+=Pin)* '}'
'}')?;
And the following source file:
SETUP BEGIN
DECLARE LED_PANEL p;
SETUP END
This code gives me the error "mismatched input LED_PANEL", expecting ";".
It acts as if it cannot recognize the Component LED_PANEL.
I expect this code to validate.

In your DeclarativeStatement rule you have component+=[Component|EString]. This means "match an EString token; that token should be the name of a Component (meaning an instance of the Component class)". As far as the parser is concerned, that's equivalent to component+=EString - the fact that it's a cross reference only comes into play once we get to the linker.
It does not mean "match a Component". If that's what you want, you should just write component+=Component (or even better components+=Component since lists should have plural names).
Cross references are intended for situations where you expect the name of something defined elsewhere. If you expect the whole thing, there should be no cross reference.

Antlr Eclipse IDE White Space not being skipped

I apologize in advance if this question has already been asked, can't seem to find it.
I'm just beginning with Antlr, using the antlr4IDE for Eclipse to create a parser for a small subset of Java. For some reason, unless I explicitly state the presence of a white space in my regex, the parser will throw an error.
My grammar:
grammar Hello;
r :
(Statement ';')+
;
Statement:
DECL | INIT
;
DECL:
'int' ID
;
INIT:
DECL '=' NUMEXPR
;
NUMEXPR :
Number OP Number | Number
;
OP :
'+'
| '-'
| '/'
| '*'
;
WS :
[ \t\r\n\u000C]+ -> skip
;
Number:
[0-9]+
;
ID :
[a-zA-Z]+
;
When trying to parse
int hello = 76;
I receive the error:
Hello::r:1:0: mismatched input 'int' expecting Statement
Hello::r:1:10: token recognition error at: '='
However, when I manually add the token WS into the rules, I receive no error.
Any ideas where I'm going wrong? I'm new to Antlr, so I'm probably making a stupid mistake. Thanks in advance.

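Change the grammar like this. Rules whose names start with an uppercase letter are lexer rules, and whitespace is only skipped between tokens, never inside one, so Statement, DECL, INIT and NUMEXPR have to become (lowercase) parser rules: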
grammar Hello;
r : (statement ';')+ ;
statement : decl | init ;
decl : 'int' ID ;
init : decl '=' numexpr ;
numexpr : Number op Number | Number ;
op : '+' | '-' | '/' | '*' ;
WS : [ \t\r\n\u000C]+ -> skip ;
Number : [0-9]+ ;
ID : [a-zA-Z]+ ;

After looking at the ANTLR4 documentation, it seems that you have to specify all of the character combinations you expect to see in your file, from start to finish, not just those you want to handle.
In that regard, it's expected that you have to state the whitespace explicitly, with something like:
WS : [ \t\r\n]+ -> skip;
That's why the skip command exists:
A 'skip' command tells the lexer to get another token and throw out the current text.
Though note that sometimes this can cause a little trouble such as in this post.

ANTLR 4: Parsing grammar

I want to parse some data from an AppleSoft Basic script.
I chose ANTLR and downloaded this grammar: jvmBasic
I'm trying to extract the function name without its parameters:
return parser.prog().line(0).amprstmt(0).statement().getText();
but it returns PRINT"HELLO", i.e. the full statement except the line number.
Here is the string I want to parse:
10 PRINT "Hello!"

I think this really depends on how your program uses ANTLR, but if you are using a tree walker/listener you probably want to target the rule for the specific tokens rather than the entire statement rule, which is circular and encompasses many types of statement:
//each line can have one to many amprstmt's
line
: (linenumber ((amprstmt (COLON amprstmt?)*) | (COMMENT | REM)))
;
amprstmt
: (amperoper? statement) //encounters a statement here
| (COMMENT | REM)
;
//statements can be made of 1 to many sub statements
statement
: (CLS | LOAD | SAVE | TRACE | NOTRACE | FLASH | INVERSE | GR | NORMAL | SHLOAD | CLEAR | RUN | STOP | TEXT | HOME | HGR | HGR2)
| prstmt
| printstmt1 //the print rule
//MANY MANY OTHER RULES HERE TOO LONG TO PASTE........
;
//the example rule that occurs when the token's "print" is encountered
printstmt1
: (PRINT | QUESTION) printlist?
;
printlist
: expression (COMMA | SEMICOLON)? printlist*
;
As you can see from the BNF-style grammar above, the statement rule includes the rule for a print statement as well as every other type of statement. For your input it therefore covers both PRINT and "Hello!", and getText() on it returns all of that text. The only thing left out is the line number, which is matched by linenumber, a rule outside the statement rule.
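To illustrate why getText() on statement gives PRINT"Hello!" with no line number and no space, here is a small stand-alone sketch. It is not ANTLR's ParseTree API: GetTextSketch and its helper methods are hypothetical names, and the tree is a simplified version of what the jvmBasic grammar would build. It only mimics the fact that a rule node's text is the concatenation of the text of its child tokens, with skipped whitespace never appearing in the tree:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy parse-tree node: either a token leaf or a rule node with children. */
class GetTextSketch {
    final String label;      // token text for leaves, rule name for rule nodes
    final boolean isToken;
    final List<GetTextSketch> children = new ArrayList<>();

    private GetTextSketch(String label, boolean isToken) {
        this.label = label;
        this.isToken = isToken;
    }

    static GetTextSketch token(String text) {
        return new GetTextSketch(text, true);
    }

    static GetTextSketch rule(String name, GetTextSketch... kids) {
        GetTextSketch node = new GetTextSketch(name, false);
        node.children.addAll(List.of(kids));
        return node;
    }

    /** Concatenates the text of all leaf tokens below this node, left to right. */
    String getText() {
        if (isToken) {
            return label;
        }
        StringBuilder text = new StringBuilder();
        for (GetTextSketch child : children) {
            text.append(child.getText());
        }
        return text.toString();
    }

    /** Simplified tree for: 10 PRINT "Hello!" (whitespace was skipped by the lexer). */
    static GetTextSketch sampleLine() {
        GetTextSketch printstmt1 =
                rule("printstmt1", token("PRINT"), rule("printlist", token("\"Hello!\"")));
        GetTextSketch statement = rule("statement", printstmt1);
        return rule("line", rule("linenumber", token("10")), rule("amprstmt", statement));
    }

    public static void main(String[] args) {
        GetTextSketch line = sampleLine();
        GetTextSketch statement = line.children.get(1).children.get(0);
        GetTextSketch printstmt1 = statement.children.get(0);
        System.out.println(statement.getText());                  // PRINT"Hello!"
        System.out.println(printstmt1.children.get(0).getText()); // PRINT
        System.out.println(line.children.get(0).getText());       // 10
    }
}
```

The narrower the node you call getText() on, the less text you get back, which is why going down to printstmt1 and its PRINT token isolates the keyword.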
If you want to handle these specific rules when they are encountered, you most likely want to add behaviour to the methods ANTLR generates for each rule by extending the generated listener (jvmBasicListener).
Example:
- jvmBasicListener.java
- extended by jvmBasicCustomListener.java
@Override
public void enterPrintstmt1(jvmBasicParser.Printstmt1Context ctx) {
    System.out.println(ctx.getText());
}
However, if all of this is set up and you just want to return a string value from a single expression like the one you have, then reaching the lower-level rules through the child nodes of statement may work (amprstmt -> statement -> printstmt1 -> value):
return parser.prog().line(0).amprstmt(0).statement().printstmt1().getText();
To narrow my answer slightly, the rules that specifically match your input 10 PRINT "HELLO" would be:
linenumber (contains Number) , statement->printstmt1 and statement->datastmt->datum (contains STRINGLITERAL)
So, as shown above, the linenumber rule exists on its own and the other two rules that define your text are children of statement, which explains why the text of the statement rule contains everything except the line number.
Addressing each of these and calling getText() on it, rather than on an encompassing rule such as statement, may give you the result you are looking for.
I will update this to address your follow-up question, since the answer may be slightly longer. In my opinion, the easiest way to handle specific rules without writing a listener or visitor is to embed actions in the grammar rules, like this:
printstmt1
: (PRINT | QUESTION) printlist? { System.out.println("Print"); /* your Java code */ }
;
This lets you attach whatever Java action you want to each rule. You can then generate the parser with something like:
java -jar antlr-4.5.3-complete.jar jvmBasic.g4 -visitor
After this you can simply run your code however you wish, here is an example:
import JVM1.jvmBasicLexer;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Jvm extends jvmBasicBaseVisitor<Object> {
    public static void main(String[] args) {
        jvmBasicLexer lexer = new jvmBasicLexer(new ANTLRInputStream("10 PRINT \"Hello!\""));
        jvmBasicParser parser = new jvmBasicParser(new CommonTokenStream(lexer));
        // The embedded action runs while prog() is being parsed
        ParseTree tree = parser.prog();
    }
}
The output for this example would then be just :
Print
You could also call whatever Java methods you like from within the grammar to handle each rule encountered, either developing your own classes and methods for it or directly printing out a result.
Update
Just to address the latest question now :
parser.line().linenumber().getText() - For line Number, as line is not part of a statement
parser.prog().line(0).amprstmt(0).statement().printstmt1().PRINT().getText() - For PRINT as it is isolated in printstmt1, however does not include CLR in the rule
parser.prog().line(0).amprstmt(0).statement().printstmt1().printlist().expression().getText() - To get the value "hello" as it is part of an expression contained within the printstmt1 rule.
:) Good luck

ANTLR: parse NULL as a function name and a parameter

I would like to be able to use 'NULL' as both a parameter (the value null) and a function name in my grammar. See this reduced example :
grammar test;
expr
: value # valueExpr
| FUNCTION_NAME '(' (expr (',' expr)* )* ')' # functionExpr
;
value
: INT
| 'NULL'
;
FUNCTION_NAME
: [a-zA-Z] [a-zA-Z0-9]*
;
INT: [0-9]+;
Now, trying to parse:
NULL( 1 )
Results in the parse tree failing because it parses NULL as a value, and not a function name.
Ideally, I should even be able to parse NULL(NULL)..
Can you tell me if this is possible, and if yes, how to make this happen?

That 'NULL' string in your grammar defines an implicit token type; it's equivalent to adding something along these lines:
NULL: 'NULL';
at the start of the lexer rules. When the same text matches several lexer rules, the first one is used, so in your grammar the implicit rule gets priority and you get a token of type 'NULL'.
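The rule-priority behaviour can be simulated with a small stand-alone Java sketch. This is not ANTLR's lexer: LexerPrioritySketch and tokenTypes are made-up names, the rules are plain java.util.regex patterns listed in the order the grammar would define them, and whitespace is simply skipped (the reduced grammar has no WS rule). The assumption being illustrated is "longest match wins, ties go to the earlier rule":

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy maximal-munch tokenizer: longest match wins, ties go to the earlier rule. */
class LexerPrioritySketch {
    // Insertion order models rule order; the implicit 'NULL' literal comes first.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'NULL'", Pattern.compile("NULL"));
        RULES.put("FUNCTION_NAME", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("'('", Pattern.compile("\\("));
        RULES.put("')'", Pattern.compile("\\)"));
    }

    /** Returns the token type chosen for each token in the input. */
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is skipped in this sketch
                continue;
            }
            String bestType = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestType = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            types.add(bestType);
            pos += bestLength;
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("NULL(1)"));  // ['NULL', '(', INT, ')']
        System.out.println(tokenTypes("NULLS(1)")); // [FUNCTION_NAME, '(', INT, ')']
    }
}
```

NULL on its own matches both rules with the same length, so the earlier 'NULL' rule wins and the parser never sees a FUNCTION_NAME there. NULLS is longer than the literal, so it still becomes a FUNCTION_NAME.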
A simple solution would be to introduce a parser rule for function names, something like this:
function_name: FUNCTION_NAME | 'NULL';
and then use that in your expr rule. But that seems brittle if NULL is not intended to be a keyword in your grammar. There are other solutions to this, but I'm not quite sure what to advise since I don't know how you expect your grammar to expand.
Another solution could be to rename FUNCTION_NAME to NAME, get rid of the 'NULL' token type, and rewrite expr like this:
expr
: value # valueExpr
| NAME '(' (expr (',' expr)* )* ')' # functionExpr
| {_input.LT(1).getText().equals("NULL")}? NAME # nullExpr
;
A semantic predicate takes care of the name comparison here.

How can I determine which alternative node was chosen in ANTLR

Suppose I have the following:
variableDeclaration: Identifier COLON Type SEMICOLON;
Type: T_INTEGER | T_CHAR | T_STRING | T_DOUBLE | T_BOOLEAN;
where those T_ names are just defined as "integer", "char" etc.
Now suppose I'm in the exitVariableDeclaration method of a test program called LittleLanguage. I can refer to LittleLanguageLexer.T_INTEGER (etc.) but I can't see how to determine which of these types was found through the context.
I had thought it would be context.Type().getSymbol().getType() but that doesn't return the right value. I know that I COULD use context.Type().getText() but I really don't want to be doing string comparisons.
What am I missing?
Thanks

You are losing information in the lexer by combining the tokens prematurely. Better to combine them in a parser rule:
variableDeclaration: Identifier COLON type SEMICOLON;
type: T_INTEGER | T_CHAR | T_STRING | T_DOUBLE | T_BOOLEAN;
Now the only child of type is a TerminalNode whose underlying token instance has a unique type:
VariableDeclarationContext ctx = .... ;
TerminalNode typeNode = (TerminalNode) ctx.type().getChild(0);
switch(typeNode.getSymbol().getType()) {
case YourLexer.T_INTEGER:
...
