Issue parsing a custom grammar using JavaCC

I am writing a parser with JavaCC. This is my current progress:
PARSER_BEGIN(Compiler)
public class Compiler {
    public static void main(String[] args) {
        try {
            (new Compiler(new java.io.BufferedReader(new java.io.FileReader(args[0])))).S();
            System.out.println("Syntax is correct");
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}
PARSER_END(Compiler)
<DEFAULT, INBODY> SKIP: { " " | "\t" | "\r" }
<DEFAULT> TOKEN: { "(" | ")" | <ID: (["a"-"z","A"-"Z","0"-"9","-","_"])+ > | "\n" : INBODY }
<DEFAULT> TOKEN: { <#RAND: (" " | "\t" | "\r")* > | <END: <RAND> "\n" <RAND> ("\n" <RAND>)+ > }
<INBODY> TOKEN: { <STRING: (~["\n", "\r"])*> : DEFAULT }
void S(): {}
{
    (Signature() "\n" Body() (["\n"] <EOF> | <END> [<EOF>]) )+
}
void Signature(): {}
{
    "(" <ID> <ID> ")"
}
void Body(): {}
{
    <STRING> ("\n" <STRING> )*
}
My goal is to parse a language looking like this:
(test1 pic1)
This line is a <STRING> token
After the last <STRING> one empty line is necessary
(test2 pic1)
String1
It is also allowed to have an arbitrary number (>=1) of empty lines
(test3 pic1)
String1
String2
(test4 pic1)
String1
String2
An arbitrary number (also zero) of empty lines follow till <EOF>
It almost works, but I am now facing the following problem:
At the end of the parsed text, an arbitrary number (including zero) of empty lines is allowed before <EOF>, as stated in the example above. With no empty line before <EOF>, it works as expected (it prints "Syntax is correct"). With at least two empty lines before <EOF>, it also works as expected. With exactly one empty line before <EOF>, it should also print "Syntax is correct", but I get the following exception stack trace instead:
ParseException: Encountered "<EOF>" at line 19, column 9.
Was expecting:
<STRING> ...
at Compiler.generateParseException(Compiler.java:284)
at Compiler.jj_consume_token(Compiler.java:217)
at Compiler.Body(Compiler.java:83)
at Compiler.S(Compiler.java:18)
at Compiler.main(Compiler.java:6)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Does someone have an idea what the problem might be?
UPDATE:
Changing the line
(Signature() "\n" Body() (["\n"] <EOF> | <END> [<EOF>]) )+
to
(Signature() "\n" Body() (<EOF> | <END> [<EOF>]) )+
produces the same behavior. It seems that ["\n"] is (for some reason) completely ignored.

I found the core of the issue. Changing the line
<STRING> ("\n" <STRING> )*
to
<STRING> (LOOKAHEAD(2) "\n" <STRING> )*
solved the problem.
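It just needed a local LOOKAHEAD(2). With the default one-token lookahead, the parser enters the loop as soon as it sees "\n" and then demands a <STRING>, which fails when the "\n" is followed directly by <EOF> (this matches the stack trace, where Body() was expecting <STRING>). With LOOKAHEAD(2), the loop is entered only when "\n" is actually followed by <STRING>.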
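The effect of the lookahead depth on the Body() loop can be sketched in plain Java. This is a hand-written stand-in, not the code JavaCC generates: the Kind enum, the body and expect methods, and the token list are all made up for illustration, and the input models a body line followed by a single "\n" and then the end of the file.

```java
import java.util.List;

public class LookaheadSketch {
    // Hypothetical token kinds standing in for <STRING>, "\n" and <EOF>.
    enum Kind { STRING, NEWLINE, EOF }

    // Parses STRING ( NEWLINE STRING )* and returns the index of the first unconsumed token.
    // With lookahead 1 the loop is entered as soon as NEWLINE is seen;
    // with lookahead 2 it is entered only when NEWLINE is followed by STRING.
    static int body(List<Kind> tokens, int pos, int lookahead) {
        pos = expect(tokens, pos, Kind.STRING);
        while (tokens.get(pos) == Kind.NEWLINE
                && (lookahead < 2 || tokens.get(pos + 1) == Kind.STRING)) {
            pos = expect(tokens, pos, Kind.NEWLINE);
            pos = expect(tokens, pos, Kind.STRING);
        }
        return pos;
    }

    // Consumes one token of the given kind or fails, like a parser's consume step.
    static int expect(List<Kind> tokens, int pos, Kind kind) {
        if (tokens.get(pos) != kind) {
            throw new IllegalStateException(
                    "Encountered " + tokens.get(pos) + ", was expecting " + kind);
        }
        return pos + 1;
    }

    public static void main(String[] args) {
        // One body line, one newline, then end of input.
        List<Kind> input = List.of(Kind.STRING, Kind.NEWLINE, Kind.EOF);

        // prints "lookahead 2 stops at index 1"
        System.out.println("lookahead 2 stops at index " + body(input, 0, 2));
        try {
            body(input, 0, 1);
        } catch (IllegalStateException e) {
            // prints "lookahead 1 fails: Encountered EOF, was expecting STRING"
            System.out.println("lookahead 1 fails: " + e.getMessage());
        }
    }
}
```

With two tokens of lookahead the trailing newline is left unconsumed, so the enclosing rule can match it against its optional "\n" before <EOF>.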

Related

Antlr3 grammar generates parsing error on encountering the Pound char

ANTLR 3 reports an error when it encounters the pound character ("£") of the French language, which is the equivalent of the English hash character "#", even though the Unicode values for the three special characters @, #, and $ are specified in the lexer/parser rules.
FYI: the Unicode value of the pound character (French) = the Unicode value of the hash character (English).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
{
if ($exp.text.equals("+"))
System.out.println("Plus Result = " + $n1.text + $n2.text);
else
System.out.println("Minus Result = " + $n1.text + $n2.text);
}
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read as UTF-8:
public static void main(String[] args) throws Exception
{
    try
    {
        args = new String[1];
        args[0] = new String("antlr_test.txt");
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        SimpleCalcParser parser = new SimpleCalcParser(tokens);
        parser.expr();
        //System.out.println(tokens);
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
The input file has only one line:
£3 + 4£
The error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach? Or did I miss something?

I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
if ($exp.text.equals("+"))
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
else
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
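As a side check on the assumption in the question: in Unicode the pound sign £ is U+00A3, while @, # and $ are U+0040, U+0023 and U+0024, so the \u0023 alternative in DIGIT does not cover £. The following plain-Java sketch prints the code points; the class and method names are made up for illustration, and whether the escape cures the reported error still depends on how the grammar file itself is encoded.

```java
public class CodePoints {
    // Formats a character as its Unicode code point, for example U+0023.
    static String codePoint(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        // The pound sign is written as an escape so the result does not
        // depend on the encoding of this source file.
        System.out.println("pound:  " + codePoint('\u00A3')); // prints "pound:  U+00A3"
        System.out.println("hash:   " + codePoint('#'));      // prints "hash:   U+0023"
        System.out.println("at:     " + codePoint('@'));      // prints "at:     U+0040"
        System.out.println("dollar: " + codePoint('$'));      // prints "dollar: U+0024"
    }
}
```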

Why do Symbol's fields (left and right) always return 0?

I'm working on a Pascal compiler using CUP and JFlex. One of the requirements is to recover from errors and show where the errors are.
I've been using CUP's syntax_error and unrecovered_syntax_error methods.
This is the parser code in my parser.cup:
parser code
{:
    public static Nodo padre;
    public int cont = 0;
    public void syntax_error(Symbol s) {
        System.out.println("Error sintáctico. No se esperaba el siguiente token: <" + s.value + ">. Línea: " + (s.left + 1) + ", Columna: " + (s.right + 1));
    }
    public void unrecovered_syntax_error(Symbol s) throws java.lang.Exception {
        System.out.println("Error sintáctico cerca del token: <" + s.value + ">. Línea: " + (s.left + 1) + ", Columna: " + (s.right + 1));
    }
:}
This is the part of my CFG where I declare the error production:
programa ::=
inicioPrograma:inicProg declaraciones_const:declConst declaraciones_tipo:declTipo declaraciones_var:declVar declaraciones_subprogramas:declSubp proposicion_compuesta:propComp DOT
;
proposicion_compuesta ::=
BEGIN proposiciones_optativas END;
proposiciones_optativas ::=
lista_proposiciones
| /* lambda */;
lista_proposiciones ::=
proposicion SEMICOLON
| proposicion SEMICOLON lista_proposiciones;
proposicion ::=
variable ASSIGNMENT expresion
{:
System.out.println("hola");
:}
| proposicion_procedimiento
| proposicion_compuesta
| IF expresion THEN proposicion
| IF expresion THEN proposicion ELSE proposicion
| WHILE expresion DO proposicion
| FOR ID ASSIGNMENT expresion TO expresion DO proposicion
| FOR ID ASSIGNMENT expresion DOWNTO expresion DO proposicion
| REPEAT proposicion UNTIL expresion
| READ LPAR ID RPAR
| WRITE LPAR CONST_STR RPAR
| WRITE LPAR CONST_STR COMMA ID RPAR
| error;
And this is my Main.java
import java.io.*;
public class Main {
    static public void main(String argv[]) {
        /* Start the parser */
        try {
            parser p = new parser(new LexicalElements(new FileReader(argv[0])));
            p.parse();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I believe that s.left and s.right should change when a new error comes in, but when I have two errors, as in:
program ejemplo;
begin
a 1;
b := 2;
while x+2
a:= 1;
end.
It should return
Error sintáctico. No se esperaba el siguiente token: <1>. Línea: [numberX], Columna: [numberY]
Error sintáctico. No se esperaba el siguiente token: <a>. Línea: [numberW], Columna: [numberZ]
Here numberX may equal numberY, and numberW may equal numberZ, but the two pairs cannot be identical.
Yet it prints:
Error sintáctico. No se esperaba el siguiente token: <1>. Línea: 1, Columna: 1
Error sintáctico. No se esperaba el siguiente token: <a>. Línea: 1, Columna: 1
I would appreciate it if someone could help me understand why this is occurring and/or how to solve it.

I can't really answer your question (why do Symbol's fields left and right always return 0?), since you didn't show us your scanner specification; Symbols are initialized in the scanner.
However, if you want the left and right values of each symbol to equal its line and column respectively, you have to set that information in the scanner. The parser doesn't keep track of a symbol's location in the source code.
I assume you are using JFlex as your scanner generator. If so, here is an example of how you can return Symbols with line and column information using the %line and %column options:
import java_cup.runtime.*;
%%
%cup
// switches line counting on (the current line is accessed via yyline)
%line
// switches column counting on (the current column is accessed via yycolumn)
%column
IntegerRegex = [0-9]+
%%
<YYINITIAL> {
    "keyword" {
        return new Symbol(sym.KEYWORD, yyline, yycolumn);
    }
    {IntegerRegex} {
        return new Symbol(sym.INTEGER_LITERAL, yyline, yycolumn, Integer.valueOf(yytext()));
    }
}

Superfluous LOOKAHEAD in JavaCC causes error?

I have the following TT.jj. If I uncomment the SomethingElse part below, it successfully parses a language of the form create create blahblah or create blahblah. But if I comment out the SomethingElse part and retain the LOOKAHEAD, JavaCC complains that the lookahead is not necessary and is "ignored", yet the resulting parser only accepts an empty string.
I thought that, since JavaCC says it is "ignored", it should not have any effect. Basically, a superfluous LOOKAHEAD causes an error. How does that work exactly? Maybe JavaCC's implementation of LOOKAHEAD is not exactly up to the spec?
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=false;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
/**
* The parser generated by JavaCC
*/
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
LOOKAHEAD(2)
CreateTable()
//|
//SomethingElse()
}
void CreateTable():
{
}
{
<K_CREATE> <K_CREATE> <S_IDENTIFIER>
}
//void SomethingElse():
//{}{
// <K_CREATE> <S_IDENTIFIER>
//}
//
//////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "#">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}

The lookahead specification that JavaCC says it is ignoring is not ignored. Moral: don't put lookahead specifications at non-choice points.
In more detail: when a lookahead (other than a purely semantic lookahead) appears at a non-choice point, it appears to generate a lookahead method that always returns false. The lookahead therefore fails and, there being no other choice, an exception is thrown.

Here is the code generated from the bad .jj file:
final public void Statement() throws ParseException {
    trace_call("Statement");
    try {
        if (jj_2_1(5)) {
        } else {
            jj_consume_token(-1);
            throw new ParseException();
        }
        CreateTable();
    } finally {
        trace_return("Statement");
    }
}
Here is the code generated from the good one:
final public void Statement() throws ParseException {
    trace_call("Statement");
    try {
        if (jj_2_1(3)) {
            CreateTable();
        } else {
            switch ((jj_ntk==-1)?jj_ntk():jj_ntk) {
            case K_CREATE:
                SomethingElse();
                break;
            default:
                jj_la1[0] = jj_gen;
                jj_consume_token(-1);
                throw new ParseException();
            }
        }
    } finally {
        trace_return("Statement");
    }
}
That is, the superfluous LOOKAHEAD is not ignored at all. JavaCC mechanically lists all the alternatives in the if-else structure (there are none in the bad case), which leads to a parser that looks directly for EOF.

Why am I getting "syntax checking failed" every time I parse an assignment statement using JavaCC?

I have made an AssignStatement class and I am trying to parse a string using JavaCC.
The assignment statement is of the form: a=b+c*d
Here is the source code:
PARSER_BEGIN(AssignStatement)
public class AssignStatement
{
    public static void main(String s[])
    {
        try
        {
            AssignStatement as=new AssignStatement(System.in);
            as.StartSymbol();
            System.out.println("Syntax checking successfully");
        }
        catch(Throwable e)
        {
            System.out.println("Syntex checking failed"+e.getMessage());
        }
    }
}
PARSER_END(AssignStatement)
SKIP: {"" | "\t" | "\n" | "\r" }
TOKEN:{ "(" | ")" | "+" | "*" | ":="| <NUM: (["0"-"9"])+> | <ID:(["0"-"9"])+>
}
void StartSymbol(): {}
{
(AStmt())*<EOF>
}
void AStmt(): {}
{
LOOKAHEAD(2) <ID> "=" AStmt()
| Term() ("+" Term())*
}
void Term(): {}
{
Factor() ("*" Factor())*
}
void Factor(): {}
{
<NUM>
| <ID>
| "(" AStmt() ")"
}
After running java AssignStatement and entering "a=10+20*30" or a=10+20*30, I got this output:
Syntax checking failed error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 1.
From my point of view, there are two possibilities:
1. I am taking the input from the user incorrectly (in that case, please suggest a fix, and also how to take input from a file).
2. My grammar is wrong. Please suggest a fix.
Can anyone please guide me?

The problem is on the line
SKIP: {"" | "\t" | "\n" | "\r" }
This says that you want to skip any 0 length strings. The problem is that having found such a token, the lexer removes 0 characters from the input, and then, of course, it finds the same 0 length token and so on ad infinitum.
Perhaps you meant
SKIP: {" " | "\t" | "\n" | "\r" }
With that change, no regular expression matches the start of the input "a=10+20*30" (the first character, a, fits no token), so you will get a TokenMgrError.
Matching the empty string has its (rare) uses. This is not one of them.
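Why an empty alternative stalls the lexer can be illustrated with java.util.regex as a rough analogue. This is not JavaCC's own matching machinery: the class, the pattern constants and the skippedLength helper are assumptions made up for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchSketch {
    // Rough regex analogue of SKIP: {"" | "\t" | "\n" | "\r"}; (?:) is the empty string.
    static final Pattern WITH_EMPTY = Pattern.compile("(?:)|\\t|\\n|\\r");
    // Rough regex analogue of SKIP: {" " | "\t" | "\n" | "\r"}.
    static final Pattern WITH_SPACE = Pattern.compile(" |\\t|\\n|\\r");

    // Returns how many characters the pattern consumes at the start of the input,
    // or -1 when it does not match there at all.
    static int skippedLength(Pattern skip, String input) {
        Matcher m = skip.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "a=10+20*30";
        // A match that consumes 0 characters: the position never advances,
        // so the same empty match would be found again and again.
        System.out.println(skippedLength(WITH_EMPTY, input));   // prints 0
        // No match at all: the lexer moves on to the other token definitions.
        System.out.println(skippedLength(WITH_SPACE, input));   // prints -1
        // A real space is consumed and the position advances by one.
        System.out.println(skippedLength(WITH_SPACE, " a=10")); // prints 1
    }
}
```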
A second problem is with the rule
TOKEN:{ ... <NUM: (["0"-"9"])+> | <ID:(["0"-"9"])+> }
Since the definition of ID is identical to that of NUM and NUM is listed first, ID will never be matched. Perhaps you want something like
TOKEN:{ ... <NUM: (["0"-"9"])+> | <ID:(["a"-"z"])+> }
If you do that, you won't get the TokenMgrError on the input "a=10+20*30".
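JavaCC picks the longest match and, when two definitions match the same length, the one that appears first in the grammar file. A minimal Java sketch of that selection rule follows; it uses java.util.regex as a stand-in for the generated token manager, and the class and the kinds and firstToken helpers are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TieBreakSketch {
    // Builds the token kinds in declaration order: NUM first, then ID.
    static Map<String, Pattern> kinds(String idRegex) {
        Map<String, Pattern> kinds = new LinkedHashMap<>();
        kinds.put("NUM", Pattern.compile("[0-9]+"));
        kinds.put("ID", Pattern.compile(idRegex));
        return kinds;
    }

    // Picks the kind with the longest match at the start of the input.
    // The strict comparison means the kind declared first wins a tie.
    // Returns null when no kind matches.
    static String firstToken(Map<String, Pattern> kinds, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> entry : kinds.entrySet()) {
            Matcher m = entry.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                best = entry.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Pattern> original = kinds("[0-9]+");  // ID defined like NUM
        Map<String, Pattern> corrected = kinds("[a-z]+"); // ID defined as letters

        System.out.println(firstToken(original, "10+20"));  // prints NUM, never ID
        System.out.println(firstToken(original, "a=10"));   // prints null: nothing matches "a"
        System.out.println(firstToken(corrected, "a=10"));  // prints ID
    }
}
```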

JavaCC lexer doesn't work as expected (whitespace not ignored)

I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token. So I created a .jj file, but it doesn't match such strings. I was under the impression that JavaCC is supposed to find the longest possible match for each token spec, but that doesn't seem to be the case for me.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
public static void main(String args[]) throws ParseException {
ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
Token token = manager.getNextToken();
while (token != null && token.kind != ParserConstants.EOF) {
System.out.println(token.toString() + "[" + token.kind + "]");
token = manager.getNextToken();
}
Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
parser.start();
}
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.

STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
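The same point can be shown with a java.util.regex analogy: a single pattern only spans whitespace if the whitespace is written into the pattern itself. This is a sketch, not JavaCC syntax; the class, constants and match helper are made up, and the quoted-string pattern is simplified (no escape handling). In the grammar, the corresponding change would be an optional whitespace expression on both sides of "+" inside CONCAT_STRING.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConcatStringSketch {
    // Simplified quoted string: single- or double-quoted, no escapes.
    static final String QUOTED = "(?:'[^']*'|\"[^\"]*\")";
    // Optional run of whitespace characters.
    static final String WS = "[ \\t\\n\\r\\f]*";

    // Mirrors the original definition: nothing is allowed around '+'.
    static final Pattern STRICT =
            Pattern.compile(QUOTED + "(?:\\+" + QUOTED + ")+");
    // Whitespace is part of the token definition on both sides of '+'.
    static final Pattern LENIENT =
            Pattern.compile(QUOTED + "(?:" + WS + "\\+" + WS + QUOTED + ")+");

    // Returns the text matched at the start of the input, or null if there is no match.
    static String match(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"foo\" + ' bar';";
        System.out.println(match(STRICT, input));  // prints null
        System.out.println(match(LENIENT, input)); // prints "foo" + ' bar'
    }
}
```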
As an aside, I recommend adding a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.
