Antlr4 how to detect unrecognized token and given sentence is invalid - java

I am trying to develop a new language with Antlr. Here is my grammar file :
grammar test;
program : vr'.' to'.' e
;
e: be
| be'.' top'.' be
;
be: 'fg'
| 'fs'
| 'mc'
;
to: 'n'
| 'a'
| 'ev'
;
vr: 'er'
| 'fp'
;
top: 'b'
| 'af'
;
Whitespace : [ \t\r\n]+ ->skip
;
Main.java
String expression = "fp.n.fss";
//String expression = "fp.n.fs.fs";
ANTLRInputStream input = new ANTLRInputStream(expression);
testLexer lexer = new testLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
testParser parser = new testParser(tokens);
//remove listener and add listener does not work
ParseTree parseTree = parser.program();
Everything is good for valid sentences. But I want to catch unrecognized tokens and invalid sentences in order to return meaningful messages. Here are two test cases for my problem.
fp.n.fss => ANTLR reports token recognition error at: 's', but I could not handle this error. There are some example error handler classes that use BaseErrorListener, but in my case they do not work.
fp.n.fs.fs => this sentence is invalid for my grammar, but I could not catch it. How can I catch invalid sentences like this one?

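Firstly, welcome to SO and also to the ANTLR section! Error handling seems to be one of those topics that is asked about frequently; there's a really good thread here about handling errors in Java/ANTLR4.
You most likely want to extend DefaultErrorStrategy so that errors are handled in your own way instead of just being printed, as in line 1:12 token recognition error at: 's'. Be aware that this particular message is emitted by the lexer, not the parser, so an error strategy on the parser alone will not intercept it; for that case an error listener also has to be registered on the lexer (lexer.removeErrorListeners() followed by lexer.addErrorListener(...)).
For parser errors, you can implement your own version of the default error strategy class: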
// imports needed: org.antlr.v4.runtime.* and org.antlr.v4.runtime.misc.ParseCancellationException
testParser parser = new testParser(tokens);
parser.setErrorHandler(new DefaultErrorStrategy()
{
    // Called when the parser would normally try to resynchronise after an error.
    @Override
    public void recover(Parser recognizer, RecognitionException e) {
        // Record the exception on every enclosing rule context, then abort the parse.
        for (ParserRuleContext context = recognizer.getContext(); context != null; context = context.getParent()) {
            context.exception = e;
        }
        throw new ParseCancellationException(e);
    }
    // Called when a single expected token is missing or mismatched.
    @Override
    public Token recoverInline(Parser recognizer)
            throws RecognitionException
    {
        InputMismatchException e = new InputMismatchException(recognizer);
        for (ParserRuleContext context = recognizer.getContext(); context != null; context = context.getParent()) {
            context.exception = e;
        }
        throw new ParseCancellationException(e);
    }
});
parser.program(); // back to the first rule in your grammar
I would also recommend separating your lexer rules from your parser rules (rather than using string literals inside parser rules), both for readability and because many tools used to analyse the .g4 file (ANTLRWorks 2 in particular) will complain about implicit token definitions.
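For your example, the grammar can be restructured as follows. EOF is added to the start rule so that leftover input (such as the trailing .fs in fp.n.fs.fs) is reported as an error instead of being silently ignored:
grammar test;
program : vr DOT to DOT e EOF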
;
e: be
| be DOT top DOT be
;
be: FG
| FS
| MC
;
to: N
| A
| EV
;
vr: ER
| FP
;
top: B
| AF
;
Whitespace : [ \t\r\n]+ ->skip
;
DOT : '.'
;
A: 'A'|'a'
;
AF: 'AF'|'af'
;
N: 'N'|'n'
;
MC: 'MC'|'mc'
;
EV:'EV'|'ev'
;
FS: 'FS'|'fs'
;
FP: 'FP'|'fp'
;
FG: 'FG'|'fg'
;
ER: 'ER'|'er'
;
B: 'B'|'b'
;
You can also find all the methods available in the DefaultErrorStrategy class here; by overriding those methods in your own error strategy implementation you can handle whatever exceptions you require.
Hope this helps and Good luck with your project!

Related

Why does getInterpreter().adaptivePredict in generated parser return incorrect value?

I am creating custom language (EO language) plugin for IntelliJ. I use antlr4 adapter and I've already generated parser and lexer. I am working on syntax highlighting.
ANTLR4 plugin allows building tokens tree for EO code. I used this option and pasted simple EO program, the tree was built successfully. Nevertheless when I paste this code in IDEA I see an error.
After doing some research I found that the problem appears in getInterpreter().adaptivePredict().
In my grammar there are 2 possible alternatives (case 1 and case 2 in the code below), but getInterpreter().adaptivePredict() returns 3.
Why does it work like this?
Maybe my grammar is not correct, but it works in the ANTLR tree viewer (I checked several times and added a picture to this post).
EO program
+alias org.eolang.io.stdout
+alias org.eolang.txt.sprintf
[args...] > main # This line is an abstraction. Highlighting error (">" is red, expecting '' or EOL)
[y] > leap
or. > #
and.
eq. (mod. y 4) 0
not. (eq. (mod. y 100) 0)
eq. (mod. y 400) 0
stdout > #
sprintf
"%d is %sa leap year!"
(args.get 0).as-int > year!
if. (leap year:y) "" "not "
It is how tree draws "abstraction" line in IDEA. It is correct.
Here is the code from my parser class.
public final AbstractionContext abstraction() throws RecognitionException {
AbstractionContext _localctx = new AbstractionContext(_ctx, getState());
enterRule(_localctx, 10, RULE_abstraction);
int _la;
try {
enterOuterAlt(_localctx, 1);
{
setState(99);
attributes();
setState(107);
_errHandler.sync(this);
/* ? */ System.out.println(getInterpreter().adaptivePredict(_input,14,_ctx)); // prints 3
switch ( getInterpreter().adaptivePredict(_input,14,_ctx) ) {
case 1:
{
{
setState(100);
suffix();
setState(104);
_errHandler.sync(this);
switch ( getInterpreter().adaptivePredict(_input,13,_ctx) ) {
case 1:
{
setState(101);
match(SPACE);
setState(102);
match(SLASH);
setState(103);
_la = _input.LA(1);
if ( !(_la==QUESTION || _la==NAME) ) {
_errHandler.recoverInline(this);
}
else {
if ( _input.LA(1)==Token.EOF ) matchedEOF = true;
_errHandler.reportMatch(this);
consume();
}
}
break;
}
}
}
break;
case 2:
{
setState(106);
htail();
}
break;
}
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.reportError(this, re);
_errHandler.recover(this, re);
}
finally {
exitRule();
}
return _localctx;
}
Here is part from EO.g4
abstraction
:
attributes
(
(suffix (SPACE SLASH (NAME | QUESTION))?)
| htail
)?
;
github link of project: https://github.com/yasamprom/EOplugin_antlrbased/blob/main/src/main/java/org/antlr/jetbrains/eo/parser/EOParser.java

Throwing error from grammar to java in antlr3 [closed]

How can I throw the custom error message from the grammar file to the java class(where the parsing and lexing are defined)?
<----------Parser Grammar----------->
parser grammar EParser;
@members {
public void displayRecognitionError(String[] tokenNames, RecognitionException e) {
String hdr = getErrorHeader(e);
String msg = getErrorMessage(e, tokenNames);
System.out.println("hdr and msg...."+hdr+">>>>>>"+msg);
throw new RuntimeException(hdr + ":" + msg);
}
}
prog
: stat+
;
stat
: expr SEMI
| ID EQU expr SEMI
;
expr
: multExpr ((PRM) multExpr)*
;
multExpr
: atom (MUL atom)*
;
atom
:INT| OPEN expr CLS
;
<-------------------Java code--------------->
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
public class TestE {
public static void main(String[] args) throws Exception {
ELexer lexer = new ELexer(new ANTLRStringStream("a=9+8;"));
EParser parser = new EParser(new CommonTokenStream(lexer));
try
{
parser.prog();
System.out.println("Parsing successfully...");
}
catch (Exception e)
{
System.out.println("Other exception : " + e.toString());
}
}
}
<------------------Lexer grammar-------------->
lexer grammar ELexer;
tokens
{
ID;
INT;
WS;
EQU;
PRM;
OPEN;
CLS;
SEMI;
MUL;
}
@members {
Stack<String> paraphrase = new Stack<String>();
}
ID :('a'..'z'|'A'..'Z')+ ;
INT : '0'..'9'+ ;
EQU:'=';
PRM:'+'|'-';
OPEN:'(';
SEMI:';';
CLS :')';
MUL:'*';
WS : (' '|'\t'|'\n'|'\r')+ {skip();} ;
Here my input is a=9+8.
When I leave out the 8 it must give the error message "Expecting an integer", and when I leave out the ; it must say "Missing semicolon".
I have to produce error messages like these (I don't want the default error messages produced by ANTLR, I need my own).
How can I achieve this? Do I have to write the error messages in the grammar file, or in the Java code?
You don't need to throw custom errors in your grammar. Instead you install your custom error handler and handle exceptions in there. I have written a fairly complete error handling (however, for the ANTLR3 C target). It might give you some hints what you can use to construct your own error messages.
For Java target, you might want to override one or more of these methods:
org.antlr.runtime.BaseRecognizer.getErrorHeader()
creates error header (where in the input the error occurred)
org.antlr.runtime.BaseRecognizer.getErrorMessage()
creates the error message itself (what happened)
org.antlr.runtime.BaseRecognizer.emitErrorMessage()
displays the error (by default on error output / console)
org.antlr.runtime.BaseRecognizer.displayRecognitionError()
glues it all together:
public void displayRecognitionError(String[] tokenNames,
RecognitionException e)
{
String hdr = getErrorHeader(e);
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr+" "+msg);
}
You can override them in the @members section of the grammar, as you already did with displayRecognitionError(). If the code is longer, it is more convenient to subclass org.antlr.runtime.Parser and put superClass = MyParser; in the grammar's options section (note that to handle lexer errors the same way, you will also have to create a subclass of org.antlr.runtime.Lexer for the lexer to use).

Consume only commented (/** ..... */ ) section of java file through ANTLR 4 and skip the rest

I'm new to ANTLR and getting familiar with ANTLR 4. How can I consume only the commented section (/** ... */) of a Java file (or any file) and skip the rest?
I have the following file "t.txt":
t.txt
/**
#Key1("value1")
#Key2("value2")
*/
This is the text that we need to skip. Only wanted to read the above commented section.
//END_OF_FILE
AND My grammar file as below:-
MyGrammar.g4
grammar MyGrammar;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : '#' ('A'..'Z') (~('('|'\r'|'\n') | '\\)')* ;
VALUE : '(' (~('\r'|'\n'))*;
COMMENT : '/**' .*? '*/';
WS : [\t\r\n]+ -> skip;
LINE_COMMENT
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
;
I know the COMMENT rule will read the commented section, but I'm stuck on how to skip the rest of the file content and force ANTLR to read ID and VALUE from the COMMENT content only.
You can use lexical modes for this. Simply switch to another mode when the lexer stumbles upon "/**" and ignore everything else.
Note that lexical modes cannot be used in a combined grammar. You will have to define a separate lexer- and parser-grammar.
A small demo:
AnnotationLexer.g4
lexer grammar AnnotationLexer;
ANNOTATION_START
: '/**' -> mode(INSIDE), skip
;
IGNORE
: . -> skip
;
mode INSIDE;
ID
: '#' [A-Z] (~[(\r\n] | '\\)')*
;
VALUE
: '(' ~[\r\n]*
;
ANNOTATION_END
: '*/' -> mode(DEFAULT_MODE), skip
;
IGNORE_INSIDE
: [ \t\r\n] -> skip
;
file: AnnotationParser.g4
parser grammar AnnotationParser;
options {
tokenVocab=AnnotationLexer;
}
parse
: pair* EOF
;
pair
: ID VALUE {System.out.println("ID=" + $ID.text + ", VALUE=" + $VALUE.text);}
;
And now simply use the lexer and parser:
String input = "/**\n" +
"\n" +
"#Key1(\"value1\")\n" +
"#Key2(\"value2\")\n" +
"\n" +
"*/\n" +
"\n" +
"This is the text that we need to skip. Only wanted to read the above commented section.\n" +
"\n" +
"//END_OF_FILE";
AnnotationLexer lexer = new AnnotationLexer(new ANTLRInputStream(input));
AnnotationParser parser = new AnnotationParser(new CommonTokenStream(lexer));
parser.parse();
which will produce the following output:
ID=#Key1, VALUE=("value1")
ID=#Key2, VALUE=("value2")
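The mode switching used by the lexer grammar above can be pictured in plain Java. The following is only a rough sketch of the idea, not what ANTLR generates: the class name TwoModeScanner and its scan method are made up for illustration, the ID rule is simplified (the escaped-parenthesis alternative is dropped), and characters that match nothing inside a comment are silently skipped, whereas the real lexer would report a token recognition error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner that mimics the two lexer modes.
final class TwoModeScanner {
    enum Mode { DEFAULT, INSIDE }

    static List<String> scan(String in) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int i = 0;
        while (i < in.length()) {
            if (mode == Mode.DEFAULT) {
                if (in.startsWith("/**", i)) {   // ANNOTATION_START: enter INSIDE
                    mode = Mode.INSIDE;
                    i += 3;
                } else {                         // IGNORE: skip any other character
                    i++;
                }
            } else if (in.startsWith("*/", i)) { // ANNOTATION_END: back to DEFAULT
                mode = Mode.DEFAULT;
                i += 2;
            } else if (in.charAt(i) == '#' && i + 1 < in.length() && isUpper(in.charAt(i + 1))) {
                int end = i + 2;                 // ID: '#' [A-Z] then anything up to '(' or end of line
                while (end < in.length() && "(\r\n".indexOf(in.charAt(end)) < 0) end++;
                tokens.add("ID=" + in.substring(i, end));
                i = end;
            } else if (in.charAt(i) == '(') {
                int end = i + 1;                 // VALUE: '(' then the rest of the line
                while (end < in.length() && "\r\n".indexOf(in.charAt(end)) < 0) end++;
                tokens.add("VALUE=" + in.substring(i, end));
                i = end;
            } else {
                i++;                             // whitespace (and, in this sketch, anything else)
            }
        }
        return tokens;
    }

    private static boolean isUpper(char c) {
        return c >= 'A' && c <= 'Z';
    }

    public static void main(String[] args) {
        String input = "/**\n#Key1(\"value1\")\n#Key2(\"value2\")\n*/\nText to skip (ignored) #Ignored\n";
        scan(input).forEach(System.out::println);
    }
}
```

The point of the sketch is that the same characters ('#', '(') are only meaningful while the scanner is in the INSIDE mode; outside it they fall through to the catch-all skip, which is what the IGNORE rule does in the grammar.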

JavaCC: How to handle tokens that contain common words

I'm trying to create a parser for source code like this:
[code table 1.0]
code table code_table_name
id = 500
desc = "my code table one"
end code table
... and here below is the grammar I defined:
PARSER_BEGIN(CodeTableParser)
...
PARSER_END(CodeTableParser)
/* skip spaces */
SKIP: {
" "
| "\t"
| "\r"
| "\n"
}
/* reserved words */
TOKEN [IGNORE_CASE]: {
<CODE_TAB_HEADER: "[code table 1.0]">
| <CODE_TAB_END: "end" (" ")+ <CODE_TAB_BEGIN>>
| <CODE_TAB_BEGIN: <IDENT> | "code" (" ")+ "table">
| <ID: "id">
| <DESC: "desc">
}
/* token images */
TOKEN: {
<NUMBER: (<DIGIT>)+>
| <IDENT: (<ALPHA>)+>
| <VALUE: (<ALPHA> ["[", "]"])+>
| <STRING: <QUOTED>>
}
TOKEN: {
<#ALPHA: ["A"-"Z", "a"-"z", "0"-"9", "$", "_", "."]>
| <#DIGIT: ["0"-"9"]>
| <#QUOTED: "\"" (~["\""])* "\"">
}
void parse():
{
}
{
expression() <EOF>
}
void expression():
{
Token tCodeTab;
}
{
<CODE_TAB_HEADER>
<CODE_TAB_BEGIN>
tCodeTab = <IDENT>
(
<ID>
<DESC>
)*
<CODE_TAB_END>
}
The problem is that the parser correctly identifies the token CODE_TAB_BEGIN ("code table"), but it doesn't identify the token IDENT ("code_table_name"), since it starts with a word already contained in CODE_TAB_BEGIN (i.e. "code"). The parser complains that "code is followed by invalid character _"...
Having said that, I'm wondering what I'm missing in order to let the parser work correctly. I'm a newbie and any help would be really appreciated ;-)
Thanks,
j3d
Your lexer will never produce an IDENT because the production
<CODE_TAB_BEGIN: <IDENT> | "code" (" ")+ "table">
says that every IDENT can also be a CODE_TAB_BEGIN. Both productions then match the same, equally long text and, as this production comes first, it beats the production for IDENT by the first match rule (see the JavaCC FAQ).
Replace that production by
<CODE_TAB_BEGIN: "code" (" ")+ "table">
You will run into trouble with ID and DESC, but this gets you past the second line of input.
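To see why the original production shadows IDENT, here is a small plain-Java simulation of the matching rule: the longest match wins, and on a tie the kind declared first wins. It is only a sketch under simplifying assumptions: the class FirstMatchDemo and its methods are invented for illustration, regular expressions stand in for JavaCC productions, IGNORE_CASE is not modelled, and only the two relevant token kinds are included.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of "longest match, then first declared" token selection.
final class FirstMatchDemo {
    private static final String IDENT_RE = "[A-Za-z0-9$_.]+"; // mirrors (<ALPHA>)+
    private static final String CODE_TABLE_RE = "code +table";

    // Returns the token kind that wins at the start of the input, or null if none matches.
    static String winner(Map<String, List<String>> kinds, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, List<String>> kind : kinds.entrySet()) { // declaration order
            for (String alternative : kind.getValue()) {
                Matcher m = Pattern.compile(alternative).matcher(input);
                // Only a strictly longer match replaces the current winner,
                // so on a tie the earlier declaration is kept.
                if (m.lookingAt() && m.end() > bestLen) {
                    best = kind.getKey();
                    bestLen = m.end();
                }
            }
        }
        return best;
    }

    // The production from the question: CODE_TAB_BEGIN also accepts any IDENT.
    static Map<String, List<String>> original() {
        Map<String, List<String>> kinds = new LinkedHashMap<>();
        kinds.put("CODE_TAB_BEGIN", List.of(IDENT_RE, CODE_TABLE_RE));
        kinds.put("IDENT", List.of(IDENT_RE));
        return kinds;
    }

    // The corrected production: CODE_TAB_BEGIN is only "code" (" ")+ "table".
    static Map<String, List<String>> fixed() {
        Map<String, List<String>> kinds = new LinkedHashMap<>();
        kinds.put("CODE_TAB_BEGIN", List.of(CODE_TABLE_RE));
        kinds.put("IDENT", List.of(IDENT_RE));
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(winner(original(), "code_table_name")); // prints CODE_TAB_BEGIN
        System.out.println(winner(fixed(), "code_table_name"));    // prints IDENT
    }
}
```

With the corrected production, "code table" still wins as CODE_TAB_BEGIN because it is the longer match, while "code_table_name" can only be matched by IDENT.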

How can I access blocks of text as an attribute that are matched using a greedy=false option in ANTLR?

I have a rule in my ANTLR grammar like this:
COMMENT : '/*' (options {greedy=false;} : . )* '*/' ;
This rule simply matches c-style comments, so it will accept any pair of /* and */ with any arbitrary text lying in between, and it works fine.
What I want to do now is capture all the text between the /* and the */ when the rule matches, to make it accessible to an action. Something like this:
COMMENT : '/*' e=((options {greedy=false;} : . )*) '*/' {System.out.println("got: " + $e.text);} ;
This approach doesn't work, during parsing it gives "no viable alternative" upon reaching the first character after the "/*"
I'm not really clear on if/how this can be done - any suggestions or guidance welcome, thanks.
Note that you can simply do:
getText().substring(2, getText().length()-2)
on the COMMENT token since the first and the last 2 characters will always be /* and */.
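As a standalone illustration of that substring call, outside of any ANTLR action, the sketch below strips the delimiters from a comment string. The class and method names (CommentBody, strip) are made up for this example, and the argument check is an addition that the one-liner above does not have.

```java
// Hypothetical helper showing the substring(2, length - 2) trick on a plain String.
final class CommentBody {
    // Returns the text between the leading "/*" and the trailing "*/".
    static String strip(String comment) {
        if (comment.length() < 4 || !comment.startsWith("/*") || !comment.endsWith("*/")) {
            throw new IllegalArgumentException("not a /* ... */ comment: " + comment);
        }
        return comment.substring(2, comment.length() - 2);
    }

    public static void main(String[] args) {
        System.out.println(">" + strip("/* abc */") + "<"); // prints "> abc <"
    }
}
```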
You could also remove the options {greedy=false;} : part, since in ANTLR 3 both .* and .+ are ungreedy by default (other subrules, without the ., are greedy) (i).
EDIT
Or use setText(...) on the Comment token to discard the /* and */ immediately. A little demo:
file T.g:
grammar T;
@parser::members {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream(
"/* abc */ \n" +
" \n" +
"/* \n" +
" DEF \n" +
"*/ "
);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
parser.parse();
}
}
parse
: ( Comment {System.out.printf("parsed :: >\%s<\%n", $Comment.getText());} )+ EOF
;
Comment
: '/*' .* '*/' {setText(getText().substring(2, getText().length()-2));}
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
Then generate a parser & lexer, compile all .java files and run the parser containing the main method:
java -cp antlr-3.2.jar org.antlr.Tool T.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar TParser
(or `java -cp .;antlr-3.2.jar TParser` on Windows)
which will produce the following output:
parsed :: > abc <
parsed :: >
DEF
<
(i) The Definitive ANTLR Reference, Chapter 4, Extended BNF Subrules, page 86.
Try this:
COMMENT :
'/*' {StringBuilder comment = new StringBuilder();} ( options {greedy=false;} : c=. {comment.appendCodePoint(c);} )* '*/' {System.out.println(comment.toString());};
Another way which will actually return the StringBuilder object so you can use it in your program:
COMMENT returns [StringBuilder comment]:
'/*' {comment = new StringBuilder();} ( options {greedy=false;} : c=. {comment.append((char)c);} )* '*/';
