superfluous LOOKAHEAD in javacc causes error? - java

I have the following TT.jj. If I uncomment the SomethingElse part below, it successfully parses a language of the form create create blahblah or create blahblah. But if I comment out the SomethingElse part and retain the LOOKAHEAD, javacc complains that the lookahead is not necessary and "ignored", yet the resulting parser only accepts an empty string.
Since javacc says the LOOKAHEAD is "ignored", I thought it should have no effect, but the superfluous LOOKAHEAD causes an error. How does that work exactly? Maybe javacc's implementation of LOOKAHEAD is not exactly up to the spec?
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=false;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
/**
* The parser generated by JavaCC
*/
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
LOOKAHEAD(2)
CreateTable()
//|
//SomethingElse()
}
void CreateTable():
{
}
{
<K_CREATE> <K_CREATE> <S_IDENTIFIER>
}
//void SomethingElse():
//{}{
// <K_CREATE> <S_IDENTIFIER>
//}
//
//////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "#">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}

The lookahead specification that JavaCC says it is ignoring is not ignored. Moral: don't put lookahead specifications at nonchoice points.
In more detail: when a lookahead (other than a purely semantic lookahead) appears at a nonchoice point, JavaCC appears to generate a lookahead method that always returns false. The lookahead therefore fails and, there being no other choice, an exception is thrown.

Here is the generated code from the bad .jj:
final public void Statement() throws ParseException {
trace_call("Statement");
try {
if (jj_2_1(5)) {
} else {
jj_consume_token(-1);
throw new ParseException();
}
CreateTable();
} finally {
trace_return("Statement");
}
}
Here is the good one:
final public void Statement() throws ParseException {
trace_call("Statement");
try {
if (jj_2_1(3)) {
CreateTable();
} else {
switch ((jj_ntk==-1)?jj_ntk():jj_ntk) {
case K_CREATE:
SomethingElse();
break;
default:
jj_la1[0] = jj_gen;
jj_consume_token(-1);
throw new ParseException();
}
}
} finally {
trace_return("Statement");
}
}
That is, the superfluous LOOKAHEAD is not ignored at all. javacc mechanically tries to list all the alternatives in the if-else structure (there are none in the bad case), which leads to a parser that looks directly for EOF.
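For contrast with the nonchoice case, here is a hand-written Java sketch of what a two-token lookahead does at a real choice point. It is not JavaCC output. The class LookaheadSketch and its helper methods are hypothetical names, and tokenization is reduced to splitting on whitespace. Both alternatives start with CREATE, so the decision needs the second token.

```java
import java.util.Arrays;
import java.util.List;

/** Hand-written sketch of a two-token lookahead at a choice point (not JavaCC output). */
final class LookaheadSketch {

    /** Returns the name of the alternative chosen for the input, or throws if none applies. */
    static String statement(String input) {
        List<String> tokens = Arrays.asList(input.trim().split("\\s+"));
        // Choice point: both alternatives start with CREATE, so one token is not
        // enough to decide. Peek at the first two tokens, like LOOKAHEAD(2).
        if (isCreate(tokens, 0) && isCreate(tokens, 1)) {
            requireIdentifier(tokens, 2);
            return "CreateTable";
        }
        if (isCreate(tokens, 0)) {
            requireIdentifier(tokens, 1);
            return "SomethingElse";
        }
        throw new IllegalArgumentException("no alternative matches: " + input);
    }

    /** True if the token at index i exists and is the CREATE keyword (case-insensitive). */
    private static boolean isCreate(List<String> tokens, int i) {
        return i < tokens.size() && tokens.get(i).equalsIgnoreCase("create");
    }

    /** Throws unless the token at index i exists, is non-empty, and is not CREATE. */
    private static void requireIdentifier(List<String> tokens, int i) {
        if (i >= tokens.size() || tokens.get(i).isEmpty() || isCreate(tokens, i)) {
            throw new IllegalArgumentException("identifier expected at token " + i);
        }
    }

    public static void main(String[] args) {
        System.out.println(statement("create create blahblah"));
        System.out.println(statement("create blahblah"));
    }
}
```

With only one alternative there is nothing to choose between, which is why the lookahead is superfluous once SomethingElse is commented out.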

Related

Antlr3 grammar generates parsing error on encountering the Pound char

ANTLR 3 generates an error on encountering the pound char ("£") of the French language, which is the equivalent of the hash char "#" of English, even though the Unicode values for three special characters ('\u0040', '\u0023', and '\u0024') are specified in the lexer rule.
FYI: the Unicode value of the pound char (of the French language) = the Unicode value of the hash char (of the English language).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
{
if ($exp.text.equals("+"))
System.out.println("Plus Result = " + $n1.text + $n2.text);
else
System.out.println("Minus Result = " + $n1.text + $n2.text);
}
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read as UTF-8:
public static void main(String[] args) throws Exception
{
try
{
args = new String[1];
args[0] = new String("antlr_test.txt");
SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
CommonTokenStream tokens = new CommonTokenStream(lex);
SimpleCalcParser parser = new SimpleCalcParser(tokens);
parser.expr();
//System.out.println(tokens);
}
catch (Exception e)
{
e.printStackTrace();
}
}
The input file has only one line:
£3 + 4£
The error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach? Or did I miss something?
I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
if ($exp.text.equals("+"))
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
else
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
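As a quick check of the characters involved, the following Java sketch prints the code points of the pound sign and the hash sign, and shows what happens when UTF-8 bytes are decoded with a different charset. The class PoundSignDemo and its methods are hypothetical names. ISO-8859-1 is only an assumed example of a wrong encoding; the question does not say which encoding the grammar file was saved in.

```java
import java.nio.charset.StandardCharsets;

/** Shows that the pound sign and the hash sign are different characters. */
final class PoundSignDemo {

    /** Formats a character's code point as U+XXXX. */
    static String codePoint(char c) {
        return String.format("U+%04X", (int) c);
    }

    /** Decodes the UTF-8 bytes of s as ISO-8859-1, mimicking a wrong-encoding read. */
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        char pound = (char) 0xA3; // the pound sign, written numerically to avoid source-encoding issues
        System.out.println(codePoint(pound));                      // U+00A3
        System.out.println(codePoint('#'));                        // U+0023
        System.out.println(misread(String.valueOf(pound)).length()); // 2: one character became two
    }
}
```

The two characters have different code points, so a rule that lists '\u0023' does not cover the pound sign. Writing the escape '\u00A3' in the grammar sidesteps any mismatch between the encoding of the grammar file and the encoding the tool assumes.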

Antlr parser rule fails to match either of specified lexer rules

I have a small work-in-progress Antlr grammar that looks like:
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR (PACE | NUMBER) {
if ($PACE != null) {
$pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $PACE.text);
} else {
$pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $NUMBER.text);
}
};
OPERATOR: ('>' | '<' | '=') ;
NAME: ('A'..'Z' | 'a'..'z')+ ;
NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)? ;
PACE: ('0'..'9')('0'..'9')? ':' ('0'..'5')('0'..'9');
WS: (' ' | '\t' | '\r'| '\n')+ -> skip;
I am hoping to parse things like:
distance = 4 or pace < 8:30
However, parsing either of those inputs results in null for both PACE and NUMBER.
Dropping the option and just picking PACE works fine (it also works fine the other way, opting for NUMBER):
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR PACE { ... };
Why is it that when I provide the option, they're both null?
Try this.
filterExpression returns [ActivityPredicate pred]
: n=NAME o=OPERATOR (p=PACE | i=NUMBER) {
if ($p != null) {
$pred = new SingleActivityPredicate(
$n.text, Operator.fromCharacter($o.text), $p.text);
} else {
$pred = new SingleActivityPredicate(
$n.text, Operator.fromCharacter($o.text), $i.text);
}
};

JTB generating jj file that compiles but throws NullPointerExceptions on grammatically valid input

I am very new to JavaCC and JTB. I am trying to get a moderately complicated parser/validator going using a JTB file to generate the AST for me. I'm having a lot of problems making that work, so I decided to do the simplest example possible.
I am using Eclipse 3.8.1. My JavaCC and JTB plugins are:
plugins/sf.eclipse.javacc_1.5.30/jars/javacc-5.0.jar
plugins/sf.eclipse.javacc_1.5.30/jars/jtb-1.4.9.jar
I believe these to be fairly recent versions of what's available and so they should work OK.
I created a project and inside that project, I created a new JTB file. The plugin generated a bunch of code.
/**
* JTB template file created by SF JavaCC plugin 1.5.28+ wizard for JTB 1.4.0.2+ and JavaCC 1.5.0+
*/
options
{
static = true;
JTB_P = "";
}
PARSER_BEGIN(SimpleGrammar)
// this import is not needed as it is generated by JTB
// import syntaxtree.*;
// this import is needed as it is not generated by JTB
import visitor.*;
public class SimpleGrammar
{
public static void main(String args [])
{
System.out.println("Reading from standard input...");
System.out.print("Enter an expression like \"1+(2+3)*var;\" :");
new SimpleGrammar(System.in);
try
{
Start start = SimpleGrammar.Start();
DepthFirstVoidVisitor v = new MyVisitor();
start.accept(v);
}
catch (Exception e)
{
System.out.println("Oops.");
System.out.println(e);
System.out.println(e.getMessage());
}
}
}
class MyVisitor extends DepthFirstVoidVisitor
{
public void visit(NodeToken n)
{
System.out.println("visit " + n.tokenImage);
}
}
PARSER_END(SimpleGrammar)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
| < "//" (~[ "\n", "\r" ])*
(
"\n"
| "\r"
| "\r\n"
) >
| < "/*" (~[ "*" ])* "*"
(
~[ "/" ] (~[ "*" ])* "*"
)*
"/" >
}
TOKEN : /* LITERALS */
{
< INTEGER_LITERAL :
< DECIMAL_LITERAL > ([ "l", "L" ])?
| < HEX_LITERAL > ([ "l", "L" ])?
| < OCTAL_LITERAL > ([ "l", "L" ])?
>
| < #DECIMAL_LITERAL : [ "1"-"9" ] ([ "0"-"9" ])* >
| < #HEX_LITERAL : "0" [ "x", "X" ] ([ "0"-"9", "a"-"f", "A"-"F" ])+ >
| < #OCTAL_LITERAL : "0" ([ "0"-"7" ])* >
}
TOKEN : /* IDENTIFIERS */
{
< IDENTIFIER :
< LETTER >
(
< LETTER >
| < DIGIT >
)* >
| < #LETTER : [ "_", "a"-"z", "A"-"Z" ] >
| < #DIGIT : [ "0"-"9" ] >
}
void Start() :
{}
{
Expression() ";"
}
void Expression() :
{}
{
AdditiveExpression()
}
void AdditiveExpression() :
{}
{
MultiplicativeExpression()
(
(
"+"
| "-"
)
MultiplicativeExpression()
)*
}
void MultiplicativeExpression() :
{}
{
UnaryExpression()
(
(
"*"
| "/"
| "%"
)
UnaryExpression()
)*
}
void UnaryExpression() :
{}
{
"(" Expression() ")"
| Identifier()
| MyInteger()
}
void Identifier() :
{}
{
< IDENTIFIER >
}
void MyInteger() :
{}
{
< INTEGER_LITERAL >
}
It actually had a bunch of ?parser_name? tags wherever you see SimpleGrammar which I noticed and changed.
So I right-click on the SimpleGrammar.jtb file and hit "Compile with JavaCC | JJTree | JTB" and it generates all the files just like it's supposed to. So I then press F11 to run the program. Here is the output of that console session:
Reading from standard input...
Enter an expression like "1+(2+3)*var;" :1+2
Oops.
java.lang.NullPointerException
null
That's funny, it probably shouldn't be doing that. So I change the exception catching to ParseException so that it won't catch the NullPointerException and I'll get a debug console. After I do that and hunt around a bit, I find this snippet of code:
static final public MyInteger MyInteger() throws ParseException {
// --- JTB generated node declarations ---
NodeToken n0 = null;
Token n1 = null;
jj_consume_token(INTEGER_LITERAL);
n0 = JTBToolkit.makeNodeToken(n1);
{if (true) return new MyInteger(n0);}
throw new Error("Missing return statement in function");
}
So what's happening here is that n1 is used before anything is assigned to it. I don't know why this is happening.
I've also looked in the generated jtb.out.jj file to see what's there. Here's the snippet corresponding to the code that I'm seeing:
MyInteger MyInteger() :
{
// --- JTB generated node declarations ---
NodeToken n0 = null;
Token n1 = null;
}
{
< INTEGER_LITERAL >
{ n0 = JTBToolkit.makeNodeToken(n1); }
{ return new MyInteger(n0); }
}
Is that wrong? I'm honestly not sure. I was under the impression that you did everything in the JTB file and all the files that were generated after that were not to be touched.
Any insight as to what's going wrong here would be greatly appreciated.
EDIT
So I've done a bit of poking around and I've found that if I change
<INTEGER_LITERAL>
to
n1 = <INTEGER_LITERAL>
in the above-mentioned snippet, then there is no null pointer exception anymore. I can successfully enter "1;" at the console and it'll parse.
I think what I'm trying to figure out is why JTB is generating a bogus jtb.out.jj rather than what can I do to stop the null pointer exceptions. I've got a decent sized, non-toy grammar I want to work with and manually editing the jtb.out.jj file every time is not a scalable solution.
I had the same problem. It's a bug in the generation. If you use JTB version 1.4.7, it will probably work.
You can find the available versions here: https://java.net/projects/jtb/sources/svn/show/trunk/lib?rev=75 (the download section doesn't really work for me)

Why am I getting syntax checking failed everytime I parse an assignment statement using javacc tool?

I have made an AssignStatement class and I am trying to parse a string using javacc.
The assignment statement is of the form: a=b+c*d
Here is the source code:
PARSER_BEGIN(AssignStatement)
public class AssignStatement
{
public static void main(String s[])
{
try
{
AssignStatement as=new AssignStatement(System.in);
as.StartSymbol();
System.out.println("Syntax checking successfully");
}
catch(Throwable e)
{
System.out.println("Syntex checking failed"+e.getMessage());
}
}
}
PARSER_END(AssignStatement)
SKIP: {"" | "\t" | "\n" | "\r" }
TOKEN:{ "(" | ")" | "+" | "*" | ":="| <NUM: (["0"-"9"])+> | <ID:(["0"-"9"])+>
}
void StartSymbol(): {}
{
(AStmt())*<EOF>
}
void AStmt(): {}
{
LOOKAHEAD(2) <ID> "=" AStmt()
| Term() ("+" Term())*
}
void Term(): {}
{
Factor() ("*" Factor())*
}
void Factor(): {}
{
<NUM>
| <ID>
| "(" AStmt() ")"
}
This is the output I get after running java AssignStatement and entering "a=10+20*30" or a=10+20*30:
Syntax checking failed error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 1.
From my point of view there are two possibilities:
I am taking input from the user incorrectly (in that case, please suggest a fix, and also how to take input from a file).
My grammar is wrong. Please suggest a fix.
Can anyone guide me?
The problem is on the line
SKIP: {"" | "\t" | "\n" | "\r" }
This says that you want to skip any 0 length strings. The problem is that having found such a token, the lexer removes 0 characters from the input, and then, of course, it finds the same 0 length token and so on ad infinitum.
Perhaps you meant
SKIP: {" " | "\t" | "\n" | "\r" }
Now, on input "a=10+20*30", no regular expression will match and you will get a TokenMgrError.
Matching the empty string has its (rare) uses. This is not one of them.
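The empty-match problem can be illustrated with a toy skip loop in Java. This is only a sketch of the idea, not JavaCC's actual token manager. The class EmptySkipDemo and the method skipLeading are hypothetical names. The guard plays the role of the "bailing out of infinite loop" check in the error message.

```java
/** Toy model of a lexer's skip loop; not JavaCC's actual implementation. */
final class EmptySkipDemo {

    /**
     * Repeatedly skips skipToken at the current position of input.
     * Returns how many times it was skipped, or -1 if the skip token has
     * zero length, because such a match never advances the position.
     */
    static int skipLeading(String input, String skipToken) {
        int pos = 0;
        int skipped = 0;
        while (input.startsWith(skipToken, pos)) {
            if (skipToken.isEmpty()) {
                return -1; // matched 0 characters: the loop would never end
            }
            pos += skipToken.length();
            skipped++;
        }
        return skipped;
    }

    public static void main(String[] args) {
        System.out.println(skipLeading("  a=10+20*30", " ")); // 2
        System.out.println(skipLeading("a=10+20*30", " "));   // 0
        System.out.println(skipLeading("a=10+20*30", ""));    // -1
    }
}
```

The empty string matches at every position, so without the guard the loop would keep finding the same match at position 0.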
A second problem is with the rule
TOKEN:{ ... <NUM: (["0"-"9"])+> | <ID:(["0"-"9"])+> }
Since the definition of ID is the same as the definition of NUM, ID will never be matched. Perhaps you want something like
TOKEN:{ ... <NUM: (["0"-"9"])+> | <ID:(["a"-"z"])+> }
If you do that you won't get the TokenMgrError on the input "a=10+20*30".
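The effect of the duplicate definition can be sketched with a small longest-match classifier in Java, where ties go to the rule declared first. It uses java.util.regex in place of JavaCC's regular expressions. The class TokenOrderDemo and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy longest-match classifier: on equal length, the earlier rule wins. */
final class TokenOrderDemo {

    /** Builds the two rules in declaration order: NUM first, then ID. */
    static Map<String, String> rules(String idPattern) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("NUM", "[0-9]+");
        rules.put("ID", idPattern);
        return rules;
    }

    /** Returns the name of the rule matching the longest prefix, or null if none matches. */
    static String classify(Map<String, String> rules, String input) {
        String bestName = null;
        int bestLength = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // Strict '>' keeps the earlier rule when two rules match the same length.
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                bestName = rule.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        System.out.println(classify(rules("[0-9]+"), "10")); // NUM: ID has the same pattern and loses the tie
        System.out.println(classify(rules("[0-9]+"), "a"));  // null: nothing matches a letter
        System.out.println(classify(rules("[a-z]+"), "a"));  // ID
    }
}
```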

JavaCC lexer doesn't work as expected (whitespace not ignored)

I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token. So I created a jj file, but it doesn't match such strings. I was under the impression that JavaCC is supposed to match the longest possible match for each token spec. But that doesn't seem to be the case for me.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
public static void main(String args[]) throws ParseException {
ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
Token token = manager.getNextToken();
while (token != null && token.kind != ParserConstants.EOF) {
System.out.println(token.toString() + "[" + token.kind + "]");
token = manager.getNextToken();
}
Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
parser.start();
}
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.
STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
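The point about whitespace inside a token definition can be checked with ordinary Java regular expressions standing in for the JavaCC token definitions. This is a simplified sketch: escapes inside double quotes are left out, and the class ConcatStringDemo with its constants STRICT and WITH_WS is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex stand-in for CONCAT_STRING, with and without whitespace around "+". */
final class ConcatStringDemo {

    // Single- or double-quoted string (simplified: no escape sequences).
    static final String QUOTED = "(?:'[^']*'|\"[^\"]*\")";

    // Mirrors the original definition: no whitespace allowed around "+".
    static final String STRICT = QUOTED + "(?:\\+" + QUOTED + ")+";

    // Whitespace written into the definition on either side of "+".
    static final String WITH_WS = QUOTED + "(?:[ \\t]*\\+[ \\t]*" + QUOTED + ")+";

    /** Returns the length of the prefix of input matched by pattern, or 0 if none. */
    static int matchLength(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        String input = "\"foo\" + ' bar';";
        System.out.println(matchLength(STRICT, input));  // 0: stuck at the space after "foo"
        System.out.println(matchLength(WITH_WS, input)); // 14: everything up to the ';'
    }
}
```

The separate skip rule for spaces plays no part here, just as SKIP does not apply inside a token definition in the grammar.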
As an aside, I recommend having a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.
