ANTLR 3 generates an error on encountering the pound character ('£') of the French language, which is the equivalent of the hash character '#' in English, even though the Unicode values for the three special characters @, #, and $ are specified in the lexer/parser rules.
FYI: I understand the Unicode value of the pound character (French) to be the same as the Unicode value of the hash character (English).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
       {
           if ($exp.text.equals("+"))
               System.out.println("Plus Result = " + $n1.text + $n2.text);
           else
               System.out.println("Minus Result = " + $n1.text + $n2.text);
       }
     ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read in as UTF-8:
public static void main(String[] args) throws Exception
{
    try
    {
        args = new String[1];
        args[0] = new String("antlr_test.txt");
        SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        SimpleCalcParser parser = new SimpleCalcParser(tokens);
        parser.expr();
        //System.out.println(tokens);
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
The input file has only one line:
£3 + 4£
The error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach, or did I miss something?
I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
    if ($exp.text.equals("+"))
        System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
    else
        System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate the lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
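If the grammar file's encoding is the culprit, a quick way to confirm which code points the lexer actually received is to dump the token stream before parsing. This is only a diagnostic sketch against the ANTLR 3 runtime, reusing the file name from the question:

SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream("antlr_test.txt", "UTF-8"));
CommonTokenStream tokens = new CommonTokenStream(lex);
for (Object o : tokens.getTokens()) {                 // getTokens() forces the stream to fill
    Token t = (Token) o;
    StringBuilder codes = new StringBuilder();
    for (int i = 0; i < t.getText().length(); i++) {
        codes.append(String.format("U+%04X ", (int) t.getText().charAt(i)));
    }
    System.out.println(t.getText() + " -> " + codes); // a real pound sign shows as U+00A3
}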
Related
I am building an expression evaluator using Java, JFlex (lexer generator), and Jacc (parser generator). I need to:
generate the lexer
generate the parser
generate the AST
display the AST graph
evaluate expression
I was able to create the lexer, the parser, and the AST. Now I am trying to make the AST graph using the visitor pattern, but this made a problem with my generated AST evident (so to speak). In my calculator I need to handle parentheses, and they create empty nodes in my AST (which makes my parse tree not really an AST, I suppose). Here is the relevant part of my grammar:
Calc : /* empty */
| AddExpr { ast = new Calc($1); }
;
AddExpr : ModExpr
| AddExpr '+' ModExpr { $$ = new AddExpr($1, $3, "+"); }
| AddExpr '-' ModExpr { $$ = new AddExpr($1, $3, "-"); }
;
ModExpr : IntDivExpr
| ModExpr MOD IntDivExpr { $$ = new ModExpr($1, $3); }
;
IntDivExpr : MultExpr
| IntDivExpr DIV MultExpr { $$ = new IntDivExpr($1, $3); }
;
MultExpr : UnaryExpr
| MultExpr '*' UnaryExpr { $$ = new MultExpr($1, $3, "*"); }
| MultExpr '/' UnaryExpr { $$ = new MultExpr($1, $3, "/"); }
;
UnaryExpr : ExpExpr
| '-' UnaryExpr { $$ = new UnaryExpr($2, "-"); }
| '+' UnaryExpr { $$ = new UnaryExpr($2, "+"); }
;
ExpExpr : Value
| ExpExpr '^' Value { $$ = new ExpExpr($1, $3); }
;
Value : DoubleLiteral
| '(' AddExpr ')' { $$ = new Value($2); }
;
DoubleLiteral : DOUBLE { $$ = $1; }
;
Here is an example expression:
1*(2+3)/(4-5)*((((6))))
and the resulting image (not reproduced here) leaves me with a Value node for each pair of parentheses. I have a few ideas on how to handle this, but I am not sure how to proceed:
Try to handle this in my grammar (not sure how as I am not allowed to use precedence directives)
Handle this in my evaluator
If you don't want Value nodes, then just replace { $$ = new Value($2); } with { $$ = $2; }.
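A minimal sketch of the changed rule (everything else in the grammar stays the same):

Value : DoubleLiteral
      | '(' AddExpr ')' { $$ = $2; }  /* pass the inner AddExpr through; no wrapper node */
      ;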
I have a small work-in-progress Antlr grammar that looks like:
filterExpression returns [ActivityPredicate pred]
    : NAME OPERATOR (PACE | NUMBER) {
        if ($PACE != null) {
            $pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $PACE.text);
        } else {
            $pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $NUMBER.text);
        }
    };
OPERATOR: ('>' | '<' | '=') ;
NAME: ('A'..'Z' | 'a'..'z')+ ;
NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)? ;
PACE: ('0'..'9')('0'..'9')? ':' ('0'..'5')('0'..'9');
WS: (' ' | '\t' | '\r'| '\n')+ -> skip;
Hoping to parse things like:
distance = 4 or pace < 8:30
However, both of those inputs result in null for both $PACE and $NUMBER. Dropping the alternatives and just picking PACE works fine (it also works fine the other way, opting for NUMBER):
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR PACE { ... };
Why is it that when I provide both alternatives, they're both null?
Try labeling the tokens, so the action can check which alternative actually matched:
filterExpression returns [ActivityPredicate pred]
    : n=NAME o=OPERATOR (p=PACE | i=NUMBER) {
        if ($p != null) {
            $pred = new SingleActivityPredicate(
                $n.text, Operator.fromCharacter($o.text), $p.text);
        } else {
            $pred = new SingleActivityPredicate(
                $n.text, Operator.fromCharacter($o.text), $i.text);
        }
    };
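For completeness, a minimal driver sketch exercising both inputs; the grammar name Filter (hence FilterLexer/FilterParser) and the use of ANTLR 4.7+'s CharStreams are assumptions, not taken from the question:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class FilterDemo {
    public static void main(String[] args) {
        for (String input : new String[] { "distance = 4", "pace < 8:30" }) {
            // Hypothetical generated classes: FilterLexer and FilterParser.
            FilterLexer lexer = new FilterLexer(CharStreams.fromString(input));
            FilterParser parser = new FilterParser(new CommonTokenStream(lexer));
            // 'returns [ActivityPredicate pred]' exposes a public 'pred' field on the rule context.
            System.out.println(parser.filterExpression().pred);
        }
    }
}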
I have the following TT.jj. If I uncomment the SomethingElse part below, it successfully parses a language of the form "create create blahblah" or "create blahblah". But if I comment out the SomethingElse part (as shown) while retaining the LOOKAHEAD, javacc complains that the lookahead is not necessary and is "ignored", and the resulting parser only accepts an empty string.
I thought javacc said it is "ignored", so it should not take any effect? Basically, a superfluous LOOKAHEAD causes an error. How does that work exactly? Maybe javacc's implementation of LOOKAHEAD is not exactly up to spec?
options{
IGNORE_CASE=true ;
STATIC=false;
DEBUG_PARSER=true;
DEBUG_LOOKAHEAD=false;
DEBUG_TOKEN_MANAGER=false;
// FORCE_LA_CHECK=true;
UNICODE_INPUT=true;
}
PARSER_BEGIN(TT)
import java.util.*;
/**
* The parser generated by JavaCC
*/
public class TT {
}
PARSER_END(TT)
///////////////////////////////////////////// main stuff concerned
void Statement() :
{ }
{
LOOKAHEAD(2)
CreateTable()
//|
//SomethingElse()
}
void CreateTable():
{
}
{
<K_CREATE> <K_CREATE> <S_IDENTIFIER>
}
//void SomethingElse():
//{}{
// <K_CREATE> <S_IDENTIFIER>
//}
//
//////////////////////////////////////////////////////////
SKIP:
{
" "
| "\t"
| "\r"
| "\n"
}
TOKEN: /* SQL Keywords. prefixed with K_ to avoid name clashes */
{
<K_CREATE: "CREATE">
}
TOKEN : /* Numeric Constants */
{
< S_DOUBLE: ((<S_LONG>)? "." <S_LONG> ( ["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> "." (["e","E"] (["+", "-"])? <S_LONG>)?
|
<S_LONG> ["e","E"] (["+", "-"])? <S_LONG>
)>
| < S_LONG: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
TOKEN:
{
< S_IDENTIFIER: ( <LETTER> | <ADDITIONAL_LETTERS> )+ ( <DIGIT> | <LETTER> | <ADDITIONAL_LETTERS> | <SPECIAL_CHARS>)* >
| < #LETTER: ["a"-"z", "A"-"Z", "_", "$"] >
| < #SPECIAL_CHARS: "$" | "_" | "#" | "@">
| < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
| < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])+ "\"" | ("`" (~["\n","\r","`"])+ "`") | ( "[" ~["0"-"9","]"] (~["\n","\r","]"])* "]" ) >
/*
To deal with database names (columns, tables) using not only latin base characters, one
can expand the following rule to accept additional letters. Here is the addition of german umlauts.
There seems to be no way to recognize letters by an external function to allow
a configurable addition. One must rebuild JSqlParser with this new "Letterset".
*/
| < #ADDITIONAL_LETTERS: ["ä","ö","ü","Ä","Ö","Ü","ß"] >
}
The lookahead specification that JavaCC says it is ignoring is not ignored. Moral: don't put lookahead specifications at non-choice points.
In more detail: when a lookahead (other than a purely semantic lookahead) appears at a non-choice point, it appears to generate a lookahead method that always returns false; therefore the lookahead fails and, there being no other choice, an exception is thrown.
Here is the generated code from the bad .jj:
final public void Statement() throws ParseException {
    trace_call("Statement");
    try {
        if (jj_2_1(5)) {
        } else {
            jj_consume_token(-1);
            throw new ParseException();
        }
        CreateTable();
    } finally {
        trace_return("Statement");
    }
}
Here is the good one:
final public void Statement() throws ParseException {
    trace_call("Statement");
    try {
        if (jj_2_1(3)) {
            CreateTable();
        } else {
            switch ((jj_ntk==-1)?jj_ntk():jj_ntk) {
            case K_CREATE:
                SomethingElse();
                break;
            default:
                jj_la1[0] = jj_gen;
                jj_consume_token(-1);
                throw new ParseException();
            }
        }
    } finally {
        trace_return("Statement");
    }
}
That is, the superfluous LOOKAHEAD is not ignored at all: javacc mechanically tries to list all the remaining options (of which there are none in the bad case) in the if-else structure, which leads to a parser that looks directly for EOF.
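In other words, once SomethingElse() is commented out there is no choice left at that point, so the fix is simply to drop the LOOKAHEAD. A sketch:

void Statement() :
{ }
{
    CreateTable()
}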
I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token. So I created a jj file, but it doesn't match such strings. I was under the impression that JavaCC is supposed to match the longest possible match for each token spec, but that doesn't seem to be the case for me.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
    public static void main(String args[]) throws ParseException {
        ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
        Token token = manager.getNextToken();
        while (token != null && token.kind != ParserConstants.EOF) {
            System.out.println(token.toString() + "[" + token.kind + "]");
            token = manager.getNextToken();
        }
        Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
        parser.start();
    }
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.
STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
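A sketch of that change, allowing optional spaces and tabs around each '+' (extend the character list if other whitespace can appear there):

| <#CONCAT_STRING : <QUOTED_STRING>
      ( ([" ","\t","\r","\n"])* "+" ([" ","\t","\r","\n"])* <QUOTED_STRING> )+ >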
As an aside, I recommend having a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.
I've been looking through the ANTLR v3 documentation (and my trusty copy of "The Definitive ANTLR reference"), and I can't seem to find a clean way to implement escape sequences in string literals (I'm currently using the Java target). I had hoped to be able to do something like:
fragment
ESCAPE_SEQUENCE
: '\\' '\'' { setText("'"); }
;
STRING
: '\'' (ESCAPE_SEQUENCE | ~('\'' | '\\'))* '\''
{
// strip the quotes from the resulting token
setText(getText().substring(1, getText().length() - 1));
}
;
For example, I would want the input token "'Foo\'s House'" to become the String "Foo's House".
Unfortunately, the setText(...) call in the ESCAPE_SEQUENCE fragment sets the text for the entire STRING token, which is obviously not what I want.
Is there a way to implement this grammar without adding a method to go back through the resulting string and manually replace escape sequences (e.g., with something like setText(escapeString(getText())) in the STRING rule)?
Here is how I accomplished this in the JSON parser I wrote.
STRING
@init{StringBuilder lBuf = new StringBuilder();}
:
'"'
( escaped=ESC {lBuf.append(getText());} |
normal=~('"'|'\\'|'\n'|'\r') {lBuf.appendCodePoint(normal);} )*
'"'
{setText(lBuf.toString());}
;
fragment
ESC
: '\\'
( 'n' {setText("\n");}
| 'r' {setText("\r");}
| 't' {setText("\t");}
| 'b' {setText("\b");}
| 'f' {setText("\f");}
| '"' {setText("\"");}
| '\'' {setText("\'");}
| '/' {setText("/");}
| '\\' {setText("\\");}
| ('u')+ i=HEX_DIGIT j=HEX_DIGIT k=HEX_DIGIT l=HEX_DIGIT
{setText(ParserUtil.hexToChar(i.getText(),j.getText(),
k.getText(),l.getText()));}
)
;
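Note that ParserUtil.hexToChar is the answerer's own helper, not part of ANTLR; a plausible reconstruction under that assumption:

// Hypothetical reconstruction of ParserUtil.hexToChar: combines four
// hex-digit strings (e.g. "0", "0", "A", "3") into the matching character.
public static String hexToChar(String a, String b, String c, String d) {
    int codePoint = Integer.parseInt(a + b + c + d, 16);
    return String.valueOf((char) codePoint);
}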
For ANTLR4, the Java target, and the standard escaped-string grammar, I used a dedicated helper class, CharSupport, to translate the string. It is available in the ANTLR API:
STRING : '"'
( ESC
| ~('"'|'\\'|'\n'|'\r')
)*
'"' {
setText(
org.antlr.v4.misc.CharSupport.getStringFromGrammarStringLiteral(
getText()
)
);
}
;
As I saw in the v4 documentation and confirmed by experiment, @init is no longer supported in the lexer part!
Another (possibly more efficient) alternative is to use rule arguments:
STRING
@init { final StringBuilder buf = new StringBuilder(); }
:
'"'
(
ESCAPE[buf]
| i = ~( '\\' | '"' ) { buf.appendCodePoint(i); }
)*
'"'
{ setText(buf.toString()); };
fragment ESCAPE[StringBuilder buf] :
'\\'
( 't' { buf.append('\t'); }
| 'n' { buf.append('\n'); }
| 'r' { buf.append('\r'); }
| '"' { buf.append('\"'); }
| '\\' { buf.append('\\'); }
| 'u' a = HEX_DIGIT b = HEX_DIGIT c = HEX_DIGIT d = HEX_DIGIT { buf.append(ParserUtil.hexChar(a, b, c, d)); }
);
I needed to do just that, but my target was C and not Java. Here's how I did it, based on the first answer (and a comment on it), in case anyone needs something similar:
QUOTE : '\'';
STR
@init{ pANTLR3_STRING unesc = GETTEXT()->factory->newRaw(GETTEXT()->factory); }
: QUOTE ( reg = ~('\\' | '\'') { unesc->addc(unesc, reg); }
| esc = ESCAPED { unesc->appendS(unesc, GETTEXT()); } )+ QUOTE { SETTEXT(unesc); };
fragment
ESCAPED : '\\'
( '\\' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\\")); }
| '\'' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\'")); }
)
;
HTH.