I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token, so I created a .jj file, but it doesn't match such strings. I was under the impression that JavaCC matches the longest possible match for each token specification, but that doesn't seem to be the case here.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
public static void main(String args[]) throws ParseException {
ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
Token token = manager.getNextToken();
while (token != null && token.kind != ParserConstants.EOF) {
System.out.println(token.toString() + "[" + token.kind + "]");
token = manager.getNextToken();
}
Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
parser.start();
}
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.
STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
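To make the point about skip tokens concrete, here is a minimal sketch that uses java.util.regex as a stand-in for the token manager. The class name, the pattern names, and the regex translation of QUOTED_STRING are illustrative assumptions, not generated JavaCC code. In the grammar itself, the corresponding change would be along the lines of putting (" " | "\t")* on either side of "+" inside CONCAT_STRING.

```java
import java.util.regex.Pattern;

class ConcatStringDemo {
    // Rough regex counterpart of QUOTED_STRING from the question (an approximation).
    static final String QUOTED =
        "(?:'[^']*'|\"(?:[^\"\\\\]|\\\\[nt\"\\\\])*\")";

    // As posted: nothing but "+" may appear between two quoted strings.
    static final Pattern STRICT =
        Pattern.compile(QUOTED + "(?:\\+" + QUOTED + ")+");

    // With the fix: optional blanks are part of the token on either side of "+".
    static final Pattern LENIENT =
        Pattern.compile(QUOTED + "(?:[ \\t]*\\+[ \\t]*" + QUOTED + ")+");

    public static void main(String[] args) {
        String input = "\"foo\" + ' bar'";
        System.out.println(STRICT.matcher(input).matches());  // prints false
        System.out.println(LENIENT.matcher(input).matches()); // prints true
    }
}
```

The strict pattern stops after "foo" for the same reason the token manager does: the blank is simply not part of the token's definition, and nothing outside the definition can remove it.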
As an aside, I recommend adding a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.
ANTLR 3 reports an error when it encounters the pound character ("£") of the French language, which is the equivalent of the hash character "#" in English, even though the Unicode values for the three special characters @, #, and $ are specified in the lexer rule.
FYI: The Unicode value of Pound char (of the French language) = The Unicode value of Hash char (of ENGLISH language).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
{
if ($exp.text.equals("+"))
System.out.println("Plus Result = " + $n1.text + $n2.text);
else
System.out.println("Minus Result = " + $n1.text + $n2.text);
}
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read as UTF-8:
public static void main(String[] args) throws Exception
{
try
{
args = new String[1];
args[0] = new String("antlr_test.txt");
SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
CommonTokenStream tokens = new CommonTokenStream(lex);
SimpleCalcParser parser = new SimpleCalcParser(tokens);
parser.expr();
//System.out.println(tokens);
}
catch (Exception e)
{
e.printStackTrace();
}
}
The input file has only one line:
£3 + 4£
The error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach? Or did I miss something?
I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
if ($exp.text.equals("+"))
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
else
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
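For reference, the string handling inside that action can be tried on its own. This is a small sketch; the class and method names are made up for illustration, and the token texts are the two NUMBER tokens from the sample input.

```java
class StripNonDigitsDemo {
    // Drops every non-digit character, then parses what is left.
    static int toInt(String tokenText) {
        return Integer.parseInt(tokenText.replaceAll("\\D", ""));
    }

    public static void main(String[] args) {
        // "\u00A3" is the pound sign, so these are the texts "£3" and "4£".
        System.out.println(toInt("\u00A33") + toInt("4\u00A3")); // prints 7
    }
}
```

Note that a token consisting only of special characters would leave an empty string behind, and Integer.parseInt("") throws a NumberFormatException, so a real action may want to guard against that case.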
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
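Why the escape might help can be sketched in plain Java. The assumption here, which the thread does not confirm, is that the grammar file was saved in one encoding and read by the tool in another. The sketch also shows that the pound sign and the hash sign are different code points. Class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class PoundSignDemo {
    // UTF-8 encoding of a single character.
    static byte[] utf8Bytes(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8);
    }

    // What those bytes look like when they are wrongly decoded as ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        char pound = '\u00A3';
        char hash = '\u0023';
        System.out.println((int) pound);                // prints 163
        System.out.println((int) hash);                 // prints 35
        byte[] bytes = utf8Bytes(pound);
        System.out.println(bytes.length);               // prints 2
        System.out.println(decodeAsLatin1(bytes).length()); // prints 2
    }
}
```

If the literal '£' in the grammar is decoded with the wrong charset, the lexer ends up with a two-character literal instead of U+00A3, which would fit the "no viable alternative" message. Writing '\u00A3' sidesteps the file encoding entirely.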
I am writing a parser with JavaCC. This is my current progress:
PARSER_BEGIN(Compiler)
public class Compiler {
public static void main(String[] args) {
try {
(new Compiler(new java.io.BufferedReader(new java.io.FileReader(args[0])))).S();
System.out.println("Syntax is correct");
} catch (Throwable e) {
e.printStackTrace();
}
}
}
PARSER_END(Compiler)
<DEFAULT, INBODY> SKIP: { " " | "\t" | "\r" }
<DEFAULT> TOKEN: { "(" | ")" | <ID: (["a"-"z","A"-"Z","0"-"9","-","_"])+ > | "\n" : INBODY }
<DEFAULT> TOKEN: { <#RAND: (" " | "\t" | "\r")* > | <END: <RAND> "\n" <RAND> ("\n" <RAND>)+ > }
<INBODY> TOKEN: { <STRING: (~["\n", "\r"])*> : DEFAULT }
void S(): {}
{
(Signature() "\n" Body() (["\n"] <EOF> | <END> [<EOF>]) )+
}
void Signature(): {}
{
"(" <ID> <ID> ")"
}
void Body(): {}
{
<STRING> ("\n" <STRING> )*
}
My goal is to parse a language looking like this:
(test1 pic1)
This line is a <STRING> token
After the last <STRING> one empty line is necessary
(test2 pic1)
String1
It is also allowed to have an arbitrary number (>=1) of empty lines
(test3 pic1)
String1
String2
(test4 pic1)
String1
String2
An arbitrary number (also zero) of empty lines follow till <EOF>
It almost works fine, but the problem I am now facing is the following:
At the end of the parsed text it is (as stated in the example above) allowed to have an arbitrary number (including zero) of empty lines before <EOF>. If there is no empty line before <EOF>, it works as expected (it prints "Syntax is correct"). If there are at least two empty lines before <EOF>, it also works as expected. If there is exactly one empty line before <EOF>, it should also print "Syntax is correct", but I get the following exception stack trace instead:
ParseException: Encountered "<EOF>" at line 19, column 9.
Was expecting:
<STRING> ...
at Compiler.generateParseException(Compiler.java:284)
at Compiler.jj_consume_token(Compiler.java:217)
at Compiler.Body(Compiler.java:83)
at Compiler.S(Compiler.java:18)
at Compiler.main(Compiler.java:6)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Does someone have an idea what the problem might be?
UPDATE:
Changing the line
(Signature() "\n" Body() (["\n"] <EOF> | <END> [<EOF>]) )+
to
(Signature() "\n" Body() (<EOF> | <END> [<EOF>]) )+
produces the same behavior. It seems that ["\n"] is (for some reason) completely ignored.
I found the core of the issue. Changing the line
<STRING> ("\n" <STRING> )*
to
<STRING> (LOOKAHEAD(2) "\n" <STRING> )*
solved the problem.
It just needed a local LOOKAHEAD(2).
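The effect of the extra token of lookahead can be sketched with a tiny hand-written parser for Body. This is a simplified model, not the code JavaCC generates: the token kinds "STRING", "NL", and "EOF" and all method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadDemo {
    // Token kind at position pos, or "EOF" past the end of the list.
    static String peek(List<String> toks, int pos) {
        return pos < toks.size() ? toks.get(pos) : "EOF";
    }

    // Consumes one token of the given kind or fails like a parser would.
    static int expect(List<String> toks, int pos, String kind) {
        if (!peek(toks, pos).equals(kind)) {
            throw new IllegalStateException(
                "Encountered " + peek(toks, pos) + ", was expecting " + kind);
        }
        return pos + 1;
    }

    // Body: STRING ( NL STRING )*, deciding loop entry with k tokens of lookahead.
    // Returns the index of the first token that Body did not consume.
    static int body(List<String> toks, int k) {
        int pos = expect(toks, 0, "STRING");
        while (peek(toks, pos).equals("NL")
                && (k < 2 || peek(toks, pos + 1).equals("STRING"))) {
            pos = expect(toks, pos, "NL");
            pos = expect(toks, pos, "STRING");
        }
        return pos;
    }

    public static void main(String[] args) {
        // Two strings followed by a single trailing newline, then end of input.
        List<String> toks = Arrays.asList("STRING", "NL", "STRING", "NL");
        System.out.println(body(toks, 2)); // prints 3
        try {
            body(toks, 1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints Encountered EOF, was expecting STRING
        }
    }
}
```

With one token of lookahead, the trailing newline alone is enough to commit to another loop iteration, which then fails on <EOF>. With two tokens, the loop is only entered when a <STRING> really follows, so the newline is left for the enclosing production.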
I am very new to JavaCC and JTB. I am trying to get a moderately complicated parser/validator going using a JTB file to generate the AST for me. I'm having a lot of problems making that work, so I decided to do the simplest example possible.
I am using Eclipse 3.8.1. My JavaCC and JTB plugins are:
plugins/sf.eclipse.javacc_1.5.30/jars/javacc-5.0.jar
plugins/sf.eclipse.javacc_1.5.30/jars/jtb-1.4.9.jar
I believe these to be fairly recent versions of what's available and so they should work OK.
I created a project and inside that project, I created a new JTB file. The plugin generated a bunch of code.
/**
* JTB template file created by SF JavaCC plugin 1.5.28+ wizard for JTB 1.4.0.2+ and JavaCC 1.5.0+
*/
options
{
static = true;
JTB_P = "";
}
PARSER_BEGIN(SimpleGrammar)
// this import is not needed as it is generated by JTB
// import syntaxtree.*;
// this import is needed as it is not generated by JTB
import visitor.*;
public class SimpleGrammar
{
public static void main(String args [])
{
System.out.println("Reading from standard input...");
System.out.print("Enter an expression like \"1+(2+3)*var;\" :");
new SimpleGrammar(System.in);
try
{
Start start = SimpleGrammar.Start();
DepthFirstVoidVisitor v = new MyVisitor();
start.accept(v);
}
catch (Exception e)
{
System.out.println("Oops.");
System.out.println(e);
System.out.println(e.getMessage());
}
}
}
class MyVisitor extends DepthFirstVoidVisitor
{
public void visit(NodeToken n)
{
System.out.println("visit " + n.tokenImage);
}
}
PARSER_END(SimpleGrammar)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
| < "//" (~[ "\n", "\r" ])*
(
"\n"
| "\r"
| "\r\n"
) >
| < "/*" (~[ "*" ])* "*"
(
~[ "/" ] (~[ "*" ])* "*"
)*
"/" >
}
TOKEN : /* LITERALS */
{
< INTEGER_LITERAL :
< DECIMAL_LITERAL > ([ "l", "L" ])?
| < HEX_LITERAL > ([ "l", "L" ])?
| < OCTAL_LITERAL > ([ "l", "L" ])?
>
| < #DECIMAL_LITERAL : [ "1"-"9" ] ([ "0"-"9" ])* >
| < #HEX_LITERAL : "0" [ "x", "X" ] ([ "0"-"9", "a"-"f", "A"-"F" ])+ >
| < #OCTAL_LITERAL : "0" ([ "0"-"7" ])* >
}
TOKEN : /* IDENTIFIERS */
{
< IDENTIFIER :
< LETTER >
(
< LETTER >
| < DIGIT >
)* >
| < #LETTER : [ "_", "a"-"z", "A"-"Z" ] >
| < #DIGIT : [ "0"-"9" ] >
}
void Start() :
{}
{
Expression() ";"
}
void Expression() :
{}
{
AdditiveExpression()
}
void AdditiveExpression() :
{}
{
MultiplicativeExpression()
(
(
"+"
| "-"
)
MultiplicativeExpression()
)*
}
void MultiplicativeExpression() :
{}
{
UnaryExpression()
(
(
"*"
| "/"
| "%"
)
UnaryExpression()
)*
}
void UnaryExpression() :
{}
{
"(" Expression() ")"
| Identifier()
| MyInteger()
}
void Identifier() :
{}
{
< IDENTIFIER >
}
void MyInteger() :
{}
{
< INTEGER_LITERAL >
}
It actually had a bunch of ?parser_name? tags wherever you see SimpleGrammar which I noticed and changed.
So I right-click on the SimpleGrammar.jtb file and hit "Compile with JavaCC | JJTree | JTB" and it generates all the files just like it's supposed to. So I then press F11 to run the program. Here is the output of that console session:
Reading from standard input...
Enter an expression like "1+(2+3)*var;" :1+2
Oops.
java.lang.NullPointerException
null
That's funny, it probably shouldn't be doing that. So I change the exception catching to ParseException so that it won't catch the NullPointerException and I'll get a debug console. After I do that and hunt around a bit, I find this snippet of code:
static final public MyInteger MyInteger() throws ParseException {
// --- JTB generated node declarations ---
NodeToken n0 = null;
Token n1 = null;
jj_consume_token(INTEGER_LITERAL);
n0 = JTBToolkit.makeNodeToken(n1);
{if (true) return new MyInteger(n0);}
throw new Error("Missing return statement in function");
}
So what's happening here is that n1 is getting referenced before assigned to. I don't know why this is happening.
I've also looked in the generated jtb.out.jj file to see what's there. Here's the snippet corresponding to the code that I'm seeing:
MyInteger MyInteger() :
{
// --- JTB generated node declarations ---
NodeToken n0 = null;
Token n1 = null;
}
{
< INTEGER_LITERAL >
{ n0 = JTBToolkit.makeNodeToken(n1); }
{ return new MyInteger(n0); }
}
Is that wrong? I'm honestly not sure. I was under the impression that you did everything in the JTB file and all the files that were generated after that were not to be touched.
Any insight as to what's going wrong here would be greatly appreciated.
EDIT
So I've done a bit of poking around and I've found that if I change
<INTEGER_LITERAL>
to
n1 = <INTEGER_LITERAL>
in the above-mentioned snippet, then there won't be a null pointer exception anymore, and I can successfully enter "1;" at the console and it'll parse.
I think what I'm trying to figure out is why JTB is generating a bogus jtb.out.jj rather than what can I do to stop the null pointer exceptions. I've got a decent sized, non-toy grammar I want to work with and manually editing the jtb.out.jj file every time is not a scalable solution.
I had the same problem; it's a bug in the code generation. If you use JTB 1.4.7, it will probably work.
You can find the available versions here: https://java.net/projects/jtb/sources/svn/show/trunk/lib?rev=75 (the download section doesn't really work for me)
I have the following TT.jj. If I uncomment the SomethingElse part below, it successfully parses a language of the form create create blahblah or create blahblah. But if I comment out the SomethingElse part and retain the LOOKAHEAD, JavaCC complains that the lookahead is not necessary and is "ignored", yet the resulting parser only accepts an empty string.
I thought that since JavaCC says the lookahead is "ignored", it should have no effect. Instead, a superfluous LOOKAHEAD causes an error. How does that work exactly? Is JavaCC's implementation of LOOKAHEAD perhaps not quite up to the spec?
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=false;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
/**
* The parser generated by JavaCC
*/
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
LOOKAHEAD(2)
CreateTable()
//|
//SomethingElse()
}
void CreateTable():
{
}
{
<K_CREATE> <K_CREATE> <S_IDENTIFIER>
}
//void SomethingElse():
//{}{
// <K_CREATE> <S_IDENTIFIER>
//}
//
//////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "@">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}
The lookahead specification that JavaCC says it is ignoring is not ignored. Moral: don't put lookahead specifications at nonchoice points.
In more detail: when a lookahead (other than a purely semantic lookahead) appears at a nonchoice point, JavaCC appears to generate a lookahead method that always returns false. The lookahead therefore fails and, there being no other choice, an exception is thrown.
Here is the generated code from the bad .jj:
final public void Statement() throws ParseException {
trace_call("Statement");
try {
if (jj_2_1(5)) {
} else {
jj_consume_token(-1);
throw new ParseException();
}
CreateTable();
} finally {
trace_return("Statement");
}
}
Here is the good one:
final public void Statement() throws ParseException {
trace_call("Statement");
try {
if (jj_2_1(3)) {
CreateTable();
} else {
switch ((jj_ntk==-1)?jj_ntk():jj_ntk) {
case K_CREATE:
SomethingElse();
break;
default:
jj_la1[0] = jj_gen;
jj_consume_token(-1);
throw new ParseException();
}
}
} finally {
trace_return("Statement");
}
}
That is, the superfluous LOOKAHEAD is not ignored at all: JavaCC mechanically lists all the alternatives (of which there are none in the bad case) in the if-else structure, which leads to a parser that looks directly for EOF.
Is it possible to get the "active" ANTLR rule from which an action method was called?
Something like this log function in ANTLR pseudocode, which should show the start and end positions of some rules without handing over the $start and $end tokens with every log() call:
@members {
  private void log() {
    System.out.println("Start: " + $activeRule.start.pos +
                       "End: " + $activeRule.stop.pos);
  }
}
expr: multExpr (('+'|'-') multExpr)* {log(); }
;
multExpr
: atom('*' atom)* {log(); }
;
atom: INT
| ID {log(); }
| '(' expr ')'
;
No, ANTLR offers no built-in way to get the name of the rule the parser is currently in. Keep in mind that parser rules are, by default, simply Java methods returning void, and Java gives a method no direct way to ask for its own name at run time (short of inspecting the stack trace).
If you set output=AST in the options { ... } of your grammar, every parser rule creates (and returns) an instance of a ParserRuleReturnScope called retval: so you could use that for your purposes:
// ...
options {
output=AST;
}
// ...
@parser::members {
  private void log(ParserRuleReturnScope rule) {
    System.out.println("Rule: " + rule.getClass().getName() +
                       ", start: " + rule.start +
                       ", end: " + rule.stop);
  }
}
expr: multExpr (('+'|'-') multExpr)* {log(retval);}
;
multExpr
: atom('*' atom)* {log(retval);}
;
atom: INT
| ID {log(retval);}
| '(' expr ')'
;
// ...
This is however not a very reliable thing to do: the name of the variable may very well change in the next version of ANTLR.
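A different route, not taken in the answer above, relies on the fact that each parser rule becomes a Java method: a helper can read its caller's name from a stack trace. The sketch below uses only the standard Throwable API. The methods expr and multExpr are hand-written stand-ins, not ANTLR output, and it is an assumption that the generated method names match the rule names in your ANTLR version.

```java
class RuleNameDemo {
    // Name of the method that called log(): frame 0 is log itself, frame 1 its caller.
    static String log() {
        StackTraceElement[] frames = new Throwable().getStackTrace();
        return frames[1].getMethodName();
    }

    // Stand-ins for generated rule methods (hypothetical, not ANTLR output).
    static String expr() {
        return log();
    }

    static String multExpr() {
        return log();
    }

    public static void main(String[] args) {
        System.out.println(expr());     // prints expr
        System.out.println(multExpr()); // prints multExpr
    }
}
```

This yields only the name, not the start and stop tokens, and a JVM is allowed to omit frames from a stack trace, so treat it as a debugging aid rather than something to build on.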
(for Antlr4)
I was googling how to get the name of the active rule and found this post. After some more research, I found how to do it:
prog: statement[this.getRuleNames() /* parser rule names */]* EOF
;
statement [String[] rule_names]
locals [String rule_name]
@after { System.out.println("The statement is a " + $rule_name + " : `" + $text + "`"); }
: stmt_a[rule_names] {$rule_name = $stmt_a.rule_name;}
;
stmt_a [String[] rule_names] returns [String rule_name]
: 'stmt_a' { $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
A more general solution passes the context on to the surrounding rule, from which you can extract all the information about the last active rule.
File RuleName.g4 :
grammar RuleName;
prog
@init {System.out.println("Last update 1026");}
: statement[this.getRuleNames() /* parser rule names */]* EOF
;
statement [String[] rule_names]
locals [String rule_name, ParserRuleContext context]
@after { $rule_name = rule_names[$context.getRuleIndex()];
System.out.println("The statement is a " + $rule_name + " : `" + $text + "`" + " from " + $start + " to " + $stop); }
: stmt_a {$context = (ParserRuleContext)$stmt_a.context;}
| stmt_b {$context = (ParserRuleContext)$stmt_b.context;}
| stmt_c {$context = (ParserRuleContext)$stmt_c.context;}
;
stmt_a returns [Stmt_aContext context]
: 'stmt_a' more { $context = $ctx; }
;
stmt_b returns [Stmt_bContext context]
: 'stmt_b' more { $context = $ctx; }
;
stmt_c returns [Stmt_cContext context]
: 'stmt_c' more { $context = $ctx; }
;
more
: ID+
;
ID : [A-Z] ;
WS : [ \t]+ -> channel(HIDDEN) ;
NL : [\r\n]+ -> skip ;
File input.txt :
stmt_c X Y Z
stmt_a A B C
stmt_b D E F
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 -no-listener RuleName.g4
$ javac RuleName*.java
$ grun RuleName prog -tokens input.txt
[@0,0:5='stmt_c',<'stmt_c'>,1:0]
[@1,6:6=' ',<WS>,channel=1,1:6]
[@2,7:7='X',<ID>,1:7]
[@3,8:8=' ',<WS>,channel=1,1:8]
[@4,9:9='Y',<ID>,1:9]
[@5,10:10=' ',<WS>,channel=1,1:10]
[@6,11:11='Z',<ID>,1:11]
...
[@21,39:38='<EOF>',<EOF>,4:0]
Last update 1026
The statement is a stmt_c : `stmt_c X Y Z` from [@0,0:5='stmt_c',<3>,1:0] to [@6,11:11='Z',<4>,1:11]
The statement is a stmt_a : `stmt_a A B C` from [@7,13:18='stmt_a',<1>,2:0] to [@13,24:24='C',<4>,2:11]
The statement is a stmt_b : `stmt_b D E F` from [@14,26:31='stmt_b',<2>,3:0] to [@20,37:37='F',<4>,3:11]