antlr3 - ignore a token / parse it only once - java

I am trying to parse text such as
"play by the way by band by"
as a command for a media player.
I run into problems when the play or by token also appears in the name of a song or an artist.
How can I ignore these tokens inside the song name and artist, or match each token only once, in the position I want?
Here is my .g file:
text returns [String value]
: speech=wordExp (space s1=name)? (byartist space a1=name)? {
command = $speech.text;
match = $s1.text;
artist = $a1.text;
}
;
name
: s1 = (WORD (space s2 = WORD)*)
;
byartist
: space BY
;
wordExp
: PLAY | STOP
;
//Lexer
PLAY : 'play';
STOP : 'stop';
BY : 'by';
space : ' ';
WORD : ( 'a'..'z' | 'A'..'Z' )*; // digits in here?
WS : ('\t' | '\r'| '\n') {
$channel=HIDDEN;
}
;

Related

Antlr3 grammar generates a parsing error on encountering the pound character

ANTLR 3 generates an error on encountering the pound character ("£") of the French language, which is the equivalent of the English hash character "#", even though the Unicode values for the three special characters @, #, and $ are specified in the lexer rule.
FYI: The Unicode value of Pound char (of the French language) = The Unicode value of Hash char (of ENGLISH language).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
{
if ($exp.text.equals("+"))
System.out.println("Plus Result = " + $n1.text + $n2.text);
else
System.out.println("Minus Result = " + $n1.text + $n2.text);
}
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read as UTF-8:
public static void main(String[] args) throws Exception
{
try
{
args = new String[1];
args[0] = new String("antlr_test.txt");
SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
CommonTokenStream tokens = new CommonTokenStream(lex);
SimpleCalcParser parser = new SimpleCalcParser(tokens);
parser.expr();
//System.out.println(tokens);
}
catch (Exception e)
{
e.printStackTrace();
}
}
The input file has only one line:
£3 + 4£
The error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach? Did I miss something?
I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
if ($exp.text.equals("+"))
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
else
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
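The arithmetic in the rewritten action can be checked outside ANTLR. This is only a minimal sketch, assuming the two NUMBER token texts are "£3" and "4£" as in the input file; the class and method names are made up for illustration.

```java
public class StripNonDigits {
    // Removes every non-digit character, then parses what is left.
    // Throws NumberFormatException if no digits remain.
    public static int digitsOnly(String tokenText) {
        return Integer.parseInt(tokenText.replaceAll("\\D", ""));
    }

    public static void main(String[] args) {
        String n1 = "\u00A33"; // pound sign followed by 3
        String n2 = "4\u00A3"; // 4 followed by pound sign
        System.out.println("Result = " + (digitsOnly(n1) + digitsOnly(n2)));
    }
}
```

Note that this treats the pound sign as noise and discards it, which is what the replaceAll call in the action does as well.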
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
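Some background on why the escape might help: the pound sign is U+00A3 and the hash sign is U+0023, so they are different characters, and in UTF-8 the pound sign takes two bytes. If the grammar file were saved in one encoding and read in another, the literal '£' could turn into something else, whereas the escape is plain ASCII. The sketch below only demonstrates those facts in plain Java; the class and method names are assumptions for illustration and it does not involve ANTLR at all.

```java
import java.nio.charset.StandardCharsets;

public class PoundVersusHash {
    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1,
    // which is what happens when a reader assumes the wrong encoding.
    public static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pound = "\u00A3";
        System.out.printf("pound = U+%04X, hash = U+%04X%n",
                (int) pound.charAt(0), (int) '#');
        System.out.println("UTF-8 byte count: "
                + pound.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("chars after wrong decode: "
                + decodeUtf8BytesAsLatin1(pound).length());
    }
}
```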

Antlr parser rule fails to match either of specified lexer rules

I have a small work-in-progress Antlr grammar that looks like:
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR (PACE | NUMBER) {
if ($PACE != null) {
$pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $PACE.text);
} else {
$pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $NUMBER.text);
}
};
OPERATOR: ('>' | '<' | '=') ;
NAME: ('A'..'Z' | 'a'..'z')+ ;
NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)? ;
PACE: ('0'..'9')('0'..'9')? ':' ('0'..'5')('0'..'9');
WS: (' ' | '\t' | '\r'| '\n')+ -> skip;
Hoping to parse things like:
distance = 4 or pace < 8:30
However, both of those inputs result in null for both PACE and NUMBER.
Dropping the alternative and matching only PACE works fine (it also works the other way, with only NUMBER):
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR PACE { ... };
Why is it that when I provide the option, they're both null?
Try this.
filterExpression returns [ActivityPredicate pred]
: n=NAME o=OPERATOR (p=PACE | i=NUMBER) {
if ($p != null) {
$pred = new SingleActivityPredicate(
$n.text, Operator.fromCharacter($o.text), $p.text);
} else {
$pred = new SingleActivityPredicate(
$n.text, Operator.fromCharacter($o.text), $i.text);
}
};
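To see which of the two operand forms a given input takes, the PACE and NUMBER rules from the question can be approximated with Java regular expressions. This is a rough stand-alone sketch, not generated ANTLR code; the class name and the "UNKNOWN" result are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class PaceOrNumber {
    // Regex equivalents of the PACE and NUMBER lexer rules in the question.
    private static final Pattern PACE = Pattern.compile("[0-9][0-9]?:[0-5][0-9]");
    private static final Pattern NUMBER = Pattern.compile("[0-9]+(\\.[0-9]+)?");

    // Returns the name of the rule that matches the whole value.
    public static String classify(String value) {
        if (PACE.matcher(value).matches()) return "PACE";
        if (NUMBER.matcher(value).matches()) return "NUMBER";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(classify("8:30"));
        System.out.println(classify("4"));
    }
}
```

Exactly one of the two forms matches each operand, which is why the action only needs to test one label for null.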

Antlr 3.3 return values in java

I am trying to figure out how to get values from the parser.
My input is 'play the who' and it should return a string with 'the who'.
Sample.g:
text returns [String value]
: speech = wordExp space name {$value = $speech.text;}
;
name returns [String value]
: SongArtist = WORD (space WORD)* {$value = $SongArtist.text;}
;
wordExp returns [String value]
: command = PLAY {$value = $command.text;} | command = SEARCH {$value = $command.text;}
;
PLAY : 'play';
SEARCH : 'search';
space : ' ';
WORD : ( 'a'..'z' | 'A'..'Z' )*;
WS
: ('\t' | '\r'| '\n') {$channel=HIDDEN;}
;
If I enter 'play the who', this tree comes up:
http://i.stack.imgur.com/ET61P.png
I created a Java file to capture the output. If I call parser.wordExp() I expect to get 'the who', but it returns the object and this EOF failure (see the output below). parser.text() returns 'play'.
import org.antlr.runtime.*;
import a.b.c.SampleLexer;
import a.b.c.SampleParser;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("play the who");
SampleLexer lexer = new SampleLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SampleParser parser = new SampleParser(tokens);
System.out.println(parser.text());
System.out.println(parser.wordExp());
}
}
The console prints this:
play
a.b.c.SampleParser$wordExp_return@1d0ca25a
line 1:12 no viable alternative at input '<EOF>'
How can I capture 'the who'? I find it odd that I cannot get this string, since the interpreter builds the tree correctly.
First, in your grammar, speech only gets assigned the return value of parser rule wordExp. If you want to manipulate the return value of rule name as well, you can do this with an additional variable like the example below.
text returns [String value]
: a=wordExp space b=name {$value = $a.text+" "+$b.text;}
;
Second, invoking parser.text() parses the entire input. A second invocation (in your case parser.wordExp()) therefore finds only EOF. If you remove the second call, the "no viable alternative at input '<EOF>'" error goes away.
There may be a better way to do this, but in the meantime this may help you out.
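The second point can be pictured with a toy cursor over a token list. This is a deliberately simplified stand-in and not the ANTLR runtime: the class and its methods are hypothetical and only mimic how two rule calls share one position in the token stream.

```java
import java.util.Arrays;
import java.util.List;

public class SharedTokenCursor {
    private final List<String> tokens;
    private int position = 0;

    public SharedTokenCursor(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    // Returns the next token, or "<EOF>" once the input is exhausted.
    public String next() {
        return position < tokens.size() ? tokens.get(position++) : "<EOF>";
    }

    // Stand-in for the "text" rule: consumes the command and every remaining word.
    public String text() {
        StringBuilder sb = new StringBuilder(next());
        while (position < tokens.size()) {
            sb.append(' ').append(next());
        }
        return sb.toString();
    }

    // Stand-in for the "wordExp" rule: consumes exactly one token.
    public String wordExp() {
        return next();
    }

    public static void main(String[] args) {
        SharedTokenCursor parser = new SharedTokenCursor("play", "the", "who");
        System.out.println(parser.text());
        System.out.println(parser.wordExp());
    }
}
```

Because text() already advanced the cursor to the end, wordExp() has nothing left to read, which mirrors the error in the question.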

JavaCC lexer doesn't work as expected (whitespace not ignored)

I'm trying to implement a parser for the example file listed below. I'd like to recognize quoted strings with '+' between them as a single token. So I created a jj file, but it doesn't match such strings. I was under the impression that JavaCC is supposed to match the longest possible match for each token spec, but that doesn't seem to be the case for me.
What am I doing wrong here? Why isn't my <STRING> token matching the '+' even though it's specified in there? Why is whitespace not being ignored?
options {
TOKEN_FACTORY = "Token";
}
PARSER_BEGIN(Parser)
package com.example.parser;
public class Parser {
public static void main(String args[]) throws ParseException {
ParserTokenManager manager = new ParserTokenManager(new SimpleCharStream(Parser.class.getResourceAsStream("example")));
Token token = manager.getNextToken();
while (token != null && token.kind != ParserConstants.EOF) {
System.out.println(token.toString() + "[" + token.kind + "]");
token = manager.getNextToken();
}
Parser parser = new Parser(Parser.class.getResourceAsStream("example"));
parser.start();
}
}
PARSER_END(Parser)
// WHITE SPACE
<DEFAULT, IN_STRING_KEYWORD>
SKIP :
{
" " // <-- skipping spaces
| "\t"
| "\n"
| "\r"
| "\f"
}
// TOKENS
TOKEN :
{
< KEYWORD1 : "keyword1" > : IN_STRING_KEYWORD
}
<IN_STRING_KEYWORD>
TOKEN : {<STRING : <CONCAT_STRING> | <UNQUOTED_STRING> > : DEFAULT
| <#CONCAT_STRING : <QUOTED_STRING> ("+" <QUOTED_STRING>)+ >
// <-- CONCAT_STRING never matches "+" part when input is "'smth' +", because whitespace is not ignored!?
| <#QUOTED_STRING : <SINGLEQUOTED_STRING> | <DOUBLEQUOTED_STRING> >
| <#SINGLEQUOTED_STRING : "'" (~["'"])* "'" >
| <#DOUBLEQUOTED_STRING :
"\""
(
(~["\"", "\\"]) |
("\\" ["n", "t", "\"", "\\"])
)*
"\""
>
| <#UNQUOTED_STRING : (~[" ","\t", ";", "{", "}", "/", "*", "'", "\"", "\n", "\r"] | "/" ~["/", "*"] | "*" ~["/"])+ >
}
void start() :
{}
{
(<KEYWORD1><STRING>";")+ <EOF>
}
Here's an example file that should get parsed:
keyword1 "foo" + ' bar';
I'd like to match the argument of the first keyword1 as a single <STRING> token.
Current output:
keyword1[6]
Exception in thread "main" com.example.parser.TokenMgrError: Lexical error at line 1, column 15. Encountered: " " (32), after : "\"foo\""
at com.example.parser.ParserTokenManager.getNextToken(ParserTokenManager.java:616)
at com.example.parser.Parser.main(Parser.java:12)
I'm using JavaCC 5.0.
STRING is expanding to the longest sequence that can be matched, which is "foo" as the error indicates. The space after the closing double quote is not part of the definition of the private token CONCAT_STRING. Skip tokens do not apply within the definition of other tokens, so you must incorporate the space directly into the definition, on either side of the +.
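The effect of whitespace inside a token definition can be reproduced with Java regular expressions. This is only an analogy under stated assumptions: java.util.regex is not a JavaCC token manager, the class and field names are invented, and the patterns are hand-written approximations of QUOTED_STRING and CONCAT_STRING from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConcatStringMatch {
    // Single- or double-quoted string, approximating QUOTED_STRING.
    private static final String QUOTED =
            "(?:'[^']*'|\"(?:[^\"\\\\]|\\\\[nt\"\\\\])*\")";

    // CONCAT_STRING as written in the question: nothing allowed around '+'.
    public static final Pattern STRICT =
            Pattern.compile(QUOTED + "(?:\\+" + QUOTED + ")+");

    // Variant that allows spaces or tabs on either side of '+'.
    public static final Pattern WITH_SPACE =
            Pattern.compile(QUOTED + "(?:[ \\t]*\\+[ \\t]*" + QUOTED + ")+");

    // Returns the text matched at the start of the input, or null if none.
    public static String prefixMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"foo\" + ' bar';";
        System.out.println("strict:     " + prefixMatch(STRICT, input));
        System.out.println("with space: " + prefixMatch(WITH_SPACE, input));
    }
}
```

Only the variant that names the whitespace explicitly covers both quoted strings and the '+' between them.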
As an aside, I recommend having a final token definition like so:
<each-state-in-which-the-empty-string-cannot-be-recognized>
TOKEN : {
< ILLEGAL : ~[] >
}
This prevents TokenMgrErrors from being thrown and makes debugging a bit easier.

How to handle escape sequences in string literals in ANTLR 3?

I've been looking through the ANTLR v3 documentation (and my trusty copy of "The Definitive ANTLR reference"), and I can't seem to find a clean way to implement escape sequences in string literals (I'm currently using the Java target). I had hoped to be able to do something like:
fragment
ESCAPE_SEQUENCE
: '\\' '\'' { setText("'"); }
;
STRING
: '\'' (ESCAPE_SEQUENCE | ~('\'' | '\\'))* '\''
{
// strip the quotes from the resulting token
setText(getText().substring(1, getText().length() - 1));
}
;
For example, I would want the input token "'Foo\'s House'" to become the String "Foo's House".
Unfortunately, the setText(...) call in the ESCAPE_SEQUENCE fragment sets the text for the entire STRING token, which is obviously not what I want.
Is there a way to implement this grammar without adding a method to go back through the resulting string and manually replace escape sequences (e.g., with something like setText(escapeString(getText())) in the STRING rule)?
Here is how I accomplished this in the JSON parser I wrote.
STRING
@init{StringBuilder lBuf = new StringBuilder();}
:
'"'
( escaped=ESC {lBuf.append(getText());} |
normal=~('"'|'\\'|'\n'|'\r') {lBuf.appendCodePoint(normal);} )*
'"'
{setText(lBuf.toString());}
;
fragment
ESC
: '\\'
( 'n' {setText("\n");}
| 'r' {setText("\r");}
| 't' {setText("\t");}
| 'b' {setText("\b");}
| 'f' {setText("\f");}
| '"' {setText("\"");}
| '\'' {setText("\'");}
| '/' {setText("/");}
| '\\' {setText("\\");}
| ('u')+ i=HEX_DIGIT j=HEX_DIGIT k=HEX_DIGIT l=HEX_DIGIT
{setText(ParserUtil.hexToChar(i.getText(),j.getText(),
k.getText(),l.getText()));}
)
;
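ParserUtil.hexToChar is not shown in the answer. One plausible shape for such a helper is sketched below; this is an assumption about what it does (combine four hex digits into one UTF-16 code unit), not the author's actual implementation, and the class name is invented.

```java
public class HexEscape {
    // Hypothetical stand-in for ParserUtil.hexToChar: joins four hex digits
    // (one per argument) and returns the single UTF-16 code unit they encode.
    // Throws NumberFormatException if any argument is not a hex digit.
    public static String hexToChar(String i, String j, String k, String l) {
        int codeUnit = Integer.parseInt(i + j + k + l, 16);
        return String.valueOf((char) codeUnit);
    }

    public static void main(String[] args) {
        System.out.println(hexToChar("0", "0", "4", "1")); // letter A
    }
}
```

It returns a String rather than a char so that the result can be passed straight to setText.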
For ANTLR4 with the Java target and a standard escaped-string grammar, I used a dedicated utility class, CharSupport, to translate the string. It is available in the ANTLR API:
STRING : '"'
( ESC
| ~('"'|'\\'|'\n'|'\r')
)*
'"' {
setText(
org.antlr.v4.misc.CharSupport.getStringFromGrammarStringLiteral(
getText()
)
);
}
;
As I saw in the V4 documentation and confirmed by experiment, @init is no longer supported in lexer rules!
Another (possibly more efficient) alternative is to use rule arguments:
STRING
@init { final StringBuilder buf = new StringBuilder(); }
:
'"'
(
ESCAPE[buf]
| i = ~( '\\' | '"' ) { buf.appendCodePoint(i); }
)*
'"'
{ setText(buf.toString()); };
fragment ESCAPE[StringBuilder buf] :
'\\'
( 't' { buf.append('\t'); }
| 'n' { buf.append('\n'); }
| 'r' { buf.append('\r'); }
| '"' { buf.append('\"'); }
| '\\' { buf.append('\\'); }
| 'u' a = HEX_DIGIT b = HEX_DIGIT c = HEX_DIGIT d = HEX_DIGIT { buf.append(ParserUtil.hexChar(a, b, c, d)); }
);
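The same idea, one shared StringBuilder handed to the code that handles escapes, can be tried in plain Java without a grammar. The sketch below is an illustration only: the class and method names are invented, it expects the literal body with the quotes already removed, and it does not validate malformed escapes.

```java
public class UnescapeSketch {
    // Appends the character denoted by the escape letter at s[pos] (the
    // character right after the backslash) to buf and returns the index of
    // the next unread character.
    static int appendEscape(String s, int pos, StringBuilder buf) {
        char c = s.charAt(pos);
        switch (c) {
            case 't': buf.append('\t'); return pos + 1;
            case 'n': buf.append('\n'); return pos + 1;
            case 'r': buf.append('\r'); return pos + 1;
            case 'u': // four hex digits follow the letter u
                buf.append((char) Integer.parseInt(s.substring(pos + 1, pos + 5), 16));
                return pos + 5;
            default:  // quotes and backslash: keep the character itself
                buf.append(c);
                return pos + 1;
        }
    }

    // Replaces escape sequences in the body of a string literal.
    public static String unescape(String body) {
        StringBuilder buf = new StringBuilder(body.length());
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                i = appendEscape(body, i + 1, buf);
            } else {
                buf.append(c);
                i++;
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Foo\\'s House"));
    }
}
```

Passing the buffer in avoids building a temporary string per escape, which is presumably the efficiency gain the answer has in mind.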
I needed to do just that, but my target was C and not Java. Here's how I did it based on answer #1 (and comment), in case anyone needs something similar:
QUOTE : '\'';
STR
@init{ pANTLR3_STRING unesc = GETTEXT()->factory->newRaw(GETTEXT()->factory); }
: QUOTE ( reg = ~('\\' | '\'') { unesc->addc(unesc, reg); }
| esc = ESCAPED { unesc->appendS(unesc, GETTEXT()); } )+ QUOTE { SETTEXT(unesc); };
fragment
ESCAPED : '\\'
( '\\' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\\")); }
| '\'' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\'")); }
)
;
HTH.
