How to handle escape sequences in string literals in ANTLR 3?

I've been looking through the ANTLR v3 documentation (and my trusty copy of "The Definitive ANTLR reference"), and I can't seem to find a clean way to implement escape sequences in string literals (I'm currently using the Java target). I had hoped to be able to do something like:
fragment
ESCAPE_SEQUENCE
: '\\' '\'' { setText("'"); }
;
STRING
: '\'' (ESCAPE_SEQUENCE | ~('\'' | '\\'))* '\''
{
// strip the quotes from the resulting token
setText(getText().substring(1, getText().length() - 1));
}
;
For example, I would want the input token "'Foo\'s House'" to become the String "Foo's House".
Unfortunately, the setText(...) call in the ESCAPE_SEQUENCE fragment sets the text for the entire STRING token, which is obviously not what I want.
Is there a way to implement this grammar without adding a method to go back through the resulting string and manually replace escape sequences (e.g., with something like setText(escapeString(getText())) in the STRING rule)?
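For comparison, the post-processing fallback mentioned above could look roughly like the sketch below. The class and method names (StringLiteralUnescaper, unescape) are hypothetical, and the sketch assumes the token text is still wrapped in single quotes and that a backslash simply makes the next character literal.

```java
public final class StringLiteralUnescaper {
    private StringLiteralUnescaper() {}

    /**
     * Strips the surrounding quotes and resolves backslash escapes.
     * Assumption: any character after a backslash is taken literally,
     * so \' becomes ' and \\ becomes \.
     */
    public static String unescape(String literal) {
        String body = literal.substring(1, literal.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char ch = body.charAt(i);
            if (ch == '\\' && i + 1 < body.length()) {
                out.append(body.charAt(++i)); // skip the backslash, keep the escaped char
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("'Foo\\'s House'")); // prints Foo's House
    }
}
```

In the STRING rule this would be called as setText(StringLiteralUnescaper.unescape(getText())), which is exactly the extra pass the question hopes to avoid.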

Here is how I accomplished this in the JSON parser I wrote.
STRING
@init{StringBuilder lBuf = new StringBuilder();}
:
'"'
( escaped=ESC {lBuf.append(getText());} |
normal=~('"'|'\\'|'\n'|'\r') {lBuf.appendCodePoint(normal);} )*
'"'
{setText(lBuf.toString());}
;
fragment
ESC
: '\\'
( 'n' {setText("\n");}
| 'r' {setText("\r");}
| 't' {setText("\t");}
| 'b' {setText("\b");}
| 'f' {setText("\f");}
| '"' {setText("\"");}
| '\'' {setText("\'");}
| '/' {setText("/");}
| '\\' {setText("\\");}
| ('u')+ i=HEX_DIGIT j=HEX_DIGIT k=HEX_DIGIT l=HEX_DIGIT
{setText(ParserUtil.hexToChar(i.getText(),j.getText(),
k.getText(),l.getText()));}
)
;
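The grammar above calls ParserUtil.hexToChar, which the answer does not show. A minimal sketch of what such a helper might look like, assuming it receives the text of four single hex digits and returns the decoded character as a String (so it can be passed to setText):

```java
public final class ParserUtil {
    private ParserUtil() {}

    /**
     * Combines four hex digits, e.g. "0", "0", "4", "1", into the
     * UTF-16 code unit they encode. This signature is an assumption
     * inferred from how the grammar calls the helper.
     */
    public static String hexToChar(String a, String b, String c, String d) {
        int codeUnit = Integer.parseInt(a + b + c + d, 16);
        return String.valueOf((char) codeUnit);
    }

    public static void main(String[] args) {
        System.out.println(hexToChar("0", "0", "4", "1")); // prints A
    }
}
```

Integer.parseInt throws NumberFormatException on non-hex input, but the HEX_DIGIT fragment should already guarantee valid digits by the time the action runs.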

For ANTLR 4 with the Java target and a standard escaped-string grammar, I used the CharSupport utility class to translate the string. It is available in the ANTLR API:
STRING : '"'
( ESC
| ~('"'|'\\'|'\n'|'\r')
)*
'"' {
setText(
org.antlr.v4.misc.CharSupport.getStringFromGrammarStringLiteral(
getText()
)
);
}
;
As I saw in the V4 documentation and confirmed by experiment, @init is no longer supported in lexer rules.

Another (possibly more efficient) alternative is to use rule arguments:
STRING
@init { final StringBuilder buf = new StringBuilder(); }
:
'"'
(
ESCAPE[buf]
| i = ~( '\\' | '"' ) { buf.appendCodePoint(i); }
)*
'"'
{ setText(buf.toString()); };
fragment ESCAPE[StringBuilder buf] :
'\\'
( 't' { buf.append('\t'); }
| 'n' { buf.append('\n'); }
| 'r' { buf.append('\r'); }
| '"' { buf.append('\"'); }
| '\\' { buf.append('\\'); }
| 'u' a = HEX_DIGIT b = HEX_DIGIT c = HEX_DIGIT d = HEX_DIGIT { buf.append(ParserUtil.hexChar(a, b, c, d)); }
);

I needed to do just that, but my target was C, not Java. Here's how I did it, based on answer #1 (and its comment), in case anyone needs something similar:
QUOTE : '\'';
STR
@init{ pANTLR3_STRING unesc = GETTEXT()->factory->newRaw(GETTEXT()->factory); }
: QUOTE ( reg = ~('\\' | '\'') { unesc->addc(unesc, reg); }
| esc = ESCAPED { unesc->appendS(unesc, GETTEXT()); } )+ QUOTE { SETTEXT(unesc); };
fragment
ESCAPED : '\\'
( '\\' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\\")); }
| '\'' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\'")); }
)
;
HTH.

Related

Antlr3 grammar generates parsing error on encountering the Pound char

ANTLR 3 reports an error when it encounters the pound character ("£") of the French language, which is the equivalent of the English hash character "#", even though the Unicode values for the three special characters @, #, and $ are specified in the lexer/parser rule.
FYI: The Unicode value of Pound char (of the French language) = The Unicode value of Hash char (of ENGLISH language).
The lexer/parser rules:
grammar SimpleCalc;
options
{
k = 8;
language = Java;
//filter = true;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expr : n1=NUMBER ( exp = ( PLUS | MINUS ) n2=NUMBER )*
{
if ($exp.text.equals("+"))
System.out.println("Plus Result = " + $n1.text + $n2.text);
else
System.out.println("Minus Result = " + $n1.text + $n2.text);
}
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NUMBER : (DIGIT)+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
The text file is also read as UTF-8:
public static void main(String[] args) throws Exception
{
try
{
args = new String[1];
args[0] = new String("antlr_test.txt");
SimpleCalcLexer lex = new SimpleCalcLexer(new ANTLRFileStream(args[0], "UTF-8"));
CommonTokenStream tokens = new CommonTokenStream(lex);
SimpleCalcParser parser = new SimpleCalcParser(tokens);
parser.expr();
//System.out.println(tokens);
}
catch (Exception e)
{
e.printStackTrace();
}
}
The input file has only one line:
£3 + 4£
the error is:
antlr_test.txt line 1:1 no viable alternative at character '£'
antlr_test.txt line 1:7 no viable alternative at character '£'
What is wrong with my approach?
or did I miss something?

I cannot reproduce what you describe. When I test your grammar without modifications, I get a NumberFormatException, which is expected, because Integer.parseInt("£3") cannot succeed.
When I change your embedded code into this:
{
if ($exp.text.equals("+"))
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) + Integer.parseInt($n2.text.replaceAll("\\D", ""))));
else
System.out.println("Result = " + (Integer.parseInt($n1.text.replaceAll("\\D", "")) - Integer.parseInt($n2.text.replaceAll("\\D", ""))));
}
and regenerate lexer and parser classes (something you might not have done) and rerun the driver code, I get the following output:
Result = 7
EDIT
Perhaps the pound sign in the grammar is the issue? What if you try:
fragment DIGIT : '0'..'9' | '\u00A3' | ('\u0040' | '\u0023' | '\u0024');
instead of:
fragment DIGIT : '0'..'9' | '£' | ('\u0040' | '\u0023' | '\u0024');
?
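The hunch in the EDIT can be checked outside ANTLR. The sketch below (class and method names are hypothetical) shows two things: the pound sign is U+00A3, not U+0023 like the hash, and a literal '£' saved as UTF-8 turns into two different characters if the file is later decoded as ISO-8859-1. Whether such an encoding mismatch on the grammar file is the actual cause here is an assumption; the thread does not confirm it.

```java
import java.nio.charset.StandardCharsets;

public final class PoundSignCheck {
    private PoundSignCheck() {}

    /** Formats the first code point of s as U+XXXX. */
    public static String codePointOf(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    /** Encodes s as UTF-8, then decodes the bytes as ISO-8859-1. */
    public static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(codePointOf("\u00A3"));        // prints U+00A3 (pound sign)
        System.out.println(codePointOf("#"));             // prints U+0023 (hash)
        System.out.println(misdecode("\u00A3").length()); // prints 2: one char became two
    }
}
```

Writing the character as '\u00A3' in the grammar sidesteps the question of how the grammar file itself is encoded.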

(DEFINE) feature in regex does not work in Java

I am trying to validate a JSON string using a regex. I found a valid regex in another post: https://stackoverflow.com/a/3845829/7493427
It uses the (DEFINE) feature of regex, but I think the JRegex library does not support that feature. Is there a workaround?
I used java.util.regex first, then found out about the JRegex library, but that doesn't work either.
String regex = "(?(DEFINE)" +
"(?<number> -? (?= [1-9]|0(?!\\d) ) \\d+ (\\.\\d+)? ([eE] [+-]? \\d+)? )" +
"(?<boolean> true | false | null )" +
"(?<string> \" ([^\"\\n\\r\\t\\\\\\\\]* | \\\\\\\\ [\"\\\\\\\\bfnrt\\/] | \\\\\\\\ u [0-9a-f]{4} )* \" )" +
"(?<array> \\[ (?: (?&json) (?: , (?&json) )* )? \\s* \\] )" +
"(?<pair> \\s* (?&string) \\s* : (?&json) )" +
"(?<object> \\{ (?: (?&pair) (?: , (?&pair) )* )? \\s* \\} )" +
"(?<json> \\s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \\s* )" +
")" +
"\\A (?&json) \\Z";
String test = "{\"asd\" : \"asdasdasdasdasdasd\"}";
jregex.Pattern pattern = new jregex.Pattern(regex);
jregex.Matcher matcher = pattern.matcher(test);
if(matcher.find()) {
System.out.println(matcher.groups());
}
I expected a match as the test json is valid, but I get an exception.
Exception in thread "main" jregex.PatternSyntaxException: unknown group name in conditional expr.: DEFINE
at jregex.Term.makeTree(Term.java:360)
at jregex.Term.makeTree(Term.java:219)
at jregex.Term.makeTree(Term.java:206)
at jregex.Pattern.compile(Pattern.java:164)
at jregex.Pattern.<init>(Pattern.java:150)
at jregex.Pattern.<init>(Pattern.java:108)
at com.cloak.utilities.regex.VariableValidationHelper.main(VariableValidationHelper.java:305)

You can use this rather simple Jackson setup:
private static final ObjectMapper MAPPER = new ObjectMapper();
public static boolean isValidJson(String json) {
try {
MAPPER.readValue(json, Map.class);
return true;
} catch(IOException e) {
return false;
}
}
ObjectMapper#readValue() will throw JsonProcessingExceptions (a sub class of IOException) when the input is invalid.
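As for the regex itself: java.util.regex supports neither (?(DEFINE)...) groups nor (?&name) subroutine calls, so the PCRE pattern cannot be ported as-is. A small check, using only the standard library, that illustrates this (the class name is hypothetical):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public final class DefineSupportCheck {
    private DefineSupportCheck() {}

    /** Returns true if java.util.regex accepts the pattern. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // PCRE-style DEFINE block plus subroutine call: rejected
        System.out.println(compiles("(?(DEFINE)(?<digit>\\d))(?&digit)")); // prints false
        // Named group with a back-reference: accepted
        System.out.println(compiles("(?<digit>\\d)\\k<digit>"));           // prints true
    }
}
```

Because JSON nests arbitrarily deep, a recursion-free regex cannot validate it in general, which is why delegating to a real parser is the more practical route.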

AST how to deal with empty nodes

I am building an expression evaluator using Java, JFlex(lexer gen) and Jacc(parser gen). I need to:
generate the lexer
generate the parser
generate the AST
display the AST graph
evaluate expression
I was able to create the lexer and the parser and the AST. Now I am trying to make the AST graph using the visitor pattern, but this made a problem with my generated AST evident (so to speak). In my calculator I need to handle parentheses and they create empty nodes in my AST (and that makes my parse tree not an AST I suppose). Here is the relevant part of my grammar:
Calc : /* empty */
| AddExpr { ast = new Calc($1); }
;
AddExpr : ModExpr
| AddExpr '+' ModExpr { $$ = new AddExpr($1, $3, "+"); }
| AddExpr '-' ModExpr { $$ = new AddExpr($1, $3, "-"); }
;
ModExpr : IntDivExpr
| ModExpr MOD IntDivExpr { $$ = new ModExpr($1, $3); }
;
IntDivExpr : MultExpr
| IntDivExpr DIV MultExpr { $$ = new IntDivExpr($1, $3); }
;
MultExpr : UnaryExpr
| MultExpr '*' UnaryExpr { $$ = new MultExpr($1, $3, "*"); }
| MultExpr '/' UnaryExpr { $$ = new MultExpr($1, $3, "/"); }
;
UnaryExpr : ExpExpr
| '-' UnaryExpr { $$ = new UnaryExpr($2, "-"); }
| '+' UnaryExpr { $$ = new UnaryExpr($2, "+"); }
;
ExpExpr : Value
| ExpExpr '^' Value { $$ = new ExpExpr($1, $3); }
;
Value : DoubleLiteral
| '(' AddExpr ')' { $$ = new Value($2); }
;
DoubleLiteral : DOUBLE { $$ = $1; }
;
Here is an example expression:
1*(2+3)/(4-5)*((((6))))
and the resulting image:
This leaves me with Value nodes for each pair of parentheses. I have a few ideas on how to handle this, but I am not sure how to proceed:
Try to handle this in my grammar (not sure how as I am not allowed to use precedence directives)
Handle this in my evaluator

If you don't want Value nodes, then just replace { $$ = new Value($2); } with { $$ = $2; }.
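The effect of that change can be sketched with a tiny hand-written parser. This is not Jacc output, only an illustration: the grammar is reduced to single digits, '+', and parentheses, the tree is rendered as a string instead of real node classes, and the input is assumed to be well formed.

```java
public final class ParenPassThrough {
    private final String src;
    private int pos = 0;

    private ParenPassThrough(String src) {
        this.src = src;
    }

    /** Parses src and returns a textual rendering of the resulting tree. */
    public static String parse(String src) {
        return new ParenPassThrough(src).addExpr();
    }

    // AddExpr : Value ('+' Value)*
    private String addExpr() {
        String left = value();
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++; // consume '+'
            left = "Add(" + left + ", " + value() + ")";
        }
        return left;
    }

    // Value : DIGIT | '(' AddExpr ')'
    private String value() {
        if (src.charAt(pos) == '(') {
            pos++; // consume '('
            String inner = addExpr();
            pos++; // consume ')'
            return inner; // like $$ = $2: hand back the inner node, no wrapper
        }
        return String.valueOf(src.charAt(pos++));
    }

    public static void main(String[] args) {
        System.out.println(parse("((((6))))")); // prints 6
        System.out.println(parse("1+(2+3)"));   // prints Add(1, Add(2, 3))
    }
}
```

The parentheses still steer how the tree is built, since the grouping of 2+3 is preserved, but they leave no node of their own behind.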

ANTLR3: No viable alternative at character

I have this ANTLR3 grammar:
grammar wft;
@header {
package com.mycompany.wftdiff.parser;
import com.mycompany.wftdiff.model.*;
}
@lexer::header {
package com.mycompany.wftdiff.parser;
}
@members {
private final WftFile wftFile = new WftFile();
public WftFile getParsingResult() {
return wftFile;
}
}
wftFile:
{
System.out.println("Heyo!");
}
(CommentLine | assignment | NewLine)*
itemTypeDefinition
EOF
;
/**
* ItemTypeDefinition
* DEFINE ITEM_TYPE
* END ITEM_TYPE
*/
itemTypeDefinition:
'DEFINE ITEM_TYPE' NewLine
(KeyName|TransStmt|BaseStmt|NewLine)+
WhiteSpace* 'DEFINE ITEM_ATTRIBUTE' NewLine
(KeyName|TransStmt|BaseStmt)*
WhiteSpace* 'END ITEM_ATTRIBUTE' NewLine
'END ITEM_TYPE'
;
/**
* KeyName
* KEY NAME VARCHAR2(8)
*/
KeyName: WhiteSpace* KeyNameStart .* {$channel = HIDDEN;} NewLine;
fragment KeyNameStart: 'KEY NAME VARCHAR2(';
/**
* TransStmt
* TRANS DISPLAY_NAME VARCHAR2(80)
*/
TransStmt: WhiteSpace* TransStmtStart .* {$channel = HIDDEN;} NewLine;
fragment TransStmtStart: 'TRANS';
/**
* BaseStmt
BASE PROTECT_LEVEL NUMBER
*/
BaseStmt: WhiteSpace* BaseStmtStart .* {$channel = HIDDEN;} NewLine;
fragment BaseStmtStart: 'BASE';
/**
* Assignment
*/
assignment returns [Assignment assignment]:
{
System.out.println("Assignment found!");
}
target=AssignmentTarget
WhiteSpace '=' WhiteSpace
value=String {
assignment = new Assignment(target.getText(), value.getText());
wftFile.addAssignment(new Assignment(target.getText(), value.getText()));
}
NewLine;
AssignmentTarget: A (A|D|'_')*;
String: '"' ~'"'* '"'
;
/**
* Comment
*/
CommentLine: CommentStart .* {$channel = HIDDEN;} NewLine;
fragment CommentStart: '#';
// Lexer rules
fragment D: '0'..'9';
fragment A: 'A'..'Z'
| 'a'..'z';
StringLength: D+;
NewLine : '\r' '\n' | '\n' | '\r';
WhiteSpace: ' ';
Then I generate a parser for it using
java -cp "D:\wftdiff\lib\antlr-3.5.2\antlr-3.5.2-complete.jar" org.antlr.Tool -o src/com/mycompany/wftdiff/parser/ grammar-src/wft.g
...and call it like this:
val lexer = wftLexer(ANTLRFileStream(fileName))
val parser = wftParser(CommonTokenStream(lexer))
parser.wftFile()
System.out.println("Test")
fileName points to a text file with the following contents:
# Oracle Workflow Process Definition
# $Header$
VERSION_MAJOR = "2"
VERSION_MINOR = "6"
LANGUAGE = "GERMAN"
ACCESS_LEVEL = "100"
DEFINE ITEM_TYPE
KEY NAME VARCHAR2(8)
TRANS DISPLAY_NAME VARCHAR2(80)
TRANS DESCRIPTION VARCHAR2(240)
BASE PROTECT_LEVEL NUMBER
BASE CUSTOM_LEVEL NUMBER
BASE WF_SELECTOR VARCHAR2(240)
BASE READ_ROLE REFERENCES ROLE
BASE WRITE_ROLE REFERENCES ROLE
BASE EXECUTE_ROLE REFERENCES ROLE
BASE PERSISTENCE_TYPE VARCHAR2(8)
BASE PERSISTENCE_DAYS NUMBER
DEFINE ITEM_ATTRIBUTE
KEY NAME VARCHAR2(30)
TRANS DISPLAY_NAME VARCHAR2(80)
TRANS DESCRIPTION VARCHAR2(240)
BASE PROTECT_LEVEL NUMBER
BASE CUSTOM_LEVEL NUMBER
BASE TYPE VARCHAR2(8)
BASE FORMAT VARCHAR2(240)
BASE VALUE_TYPE VARCHAR2(8)
BASE DEFAULT VARCHAR2(4000)
END ITEM_ATTRIBUTE
END ITEM_TYPE
I get the following output:
Heyo!
Assignment found!
Assignment found!
Assignment found!
Assignment found!
test-data/partialSample01.wft line 25:2 no viable alternative at character 'D'
test-data/partialSample01.wft line 35:2 no viable alternative at character 'E'
Test
How should I change my grammar in order to get rid of the no viable alternative at character 'D' error?
Note that I don't need to parse this section of the file (I'm not interested in this particular information; it comes later in the file).
Update 1: Tried to ignore the whole thing as suggested here (using skip()), but it didn't help.
New grammar file:
grammar wft;
@header {
package com.mycompany.wftdiff.parser;
import com.mycompany.wftdiff.model.*;
}
@lexer::header {
package com.mycompany.wftdiff.parser;
}
@members {
private final WftFile wftFile = new WftFile();
public WftFile getParsingResult() {
return wftFile;
}
}
wftFile:
{
System.out.println("Heyo!");
}
(CommentLine | assignment | NewLine)*
itemTypeDefinition
EOF
;
/**
* ItemTypeDefinition
* DEFINE ITEM_TYPE
* END ITEM_TYPE
*/
itemTypeDefinition:
'DEFINE ITEM_TYPE' NewLine
(KeyName|TransStmt|BaseStmt|NewLine)+
WhiteSpace*
NewLine
DefineItemAttribute
WhiteSpace*
'END ITEM_TYPE'
;
DefineItemAttribute: 'DEFINE ITEM_ATTRIBUTE' .* 'END ITEM_ATTRIBUTE' {skip();};
/**
* KeyName
* KEY NAME VARCHAR2(8)
*/
KeyName: WhiteSpace* KeyNameStart .* {$channel = HIDDEN;} NewLine;
fragment KeyNameStart: 'KEY NAME VARCHAR2(';
/**
* TransStmt
* TRANS DISPLAY_NAME VARCHAR2(80)
*/
TransStmt: WhiteSpace* TransStmtStart .* {$channel = HIDDEN;} NewLine;
fragment TransStmtStart: 'TRANS';
/**
* BaseStmt
BASE PROTECT_LEVEL NUMBER
*/
BaseStmt: WhiteSpace* BaseStmtStart .* {$channel = HIDDEN;} NewLine;
fragment BaseStmtStart: 'BASE';
/**
* Assignment
*/
assignment returns [Assignment assignment]:
{
System.out.println("Assignment found!");
}
target=AssignmentTarget
WhiteSpace '=' WhiteSpace
value=String {
assignment = new Assignment(target.getText(), value.getText());
wftFile.addAssignment(new Assignment(target.getText(), value.getText()));
}
NewLine;
AssignmentTarget: A (A|D|'_')*;
String: '"' ~'"'* '"'
;
/**
* Comment
*/
CommentLine: CommentStart .* {$channel = HIDDEN;} NewLine;
fragment CommentStart: '#';
// Lexer rules
fragment D: '0'..'9';
fragment A: 'A'..'Z'
| 'a'..'z';
StringLength: D+;
NewLine : '\r' '\n' | '\n' | '\r';
WhiteSpace: ' ';
Parsing result:
Heyo!
Assignment found!
Assignment found!
Assignment found!
Assignment found!
test-data/partialSample01.wft line 25:2 no viable alternative at character 'D'
test-data/partialSample01.wft line 36:0 missing DefineItemAttribute at 'END ITEM_TYPE'
Test
Bounty terms
I will award the bounty to a person who accomplishes the following heroic deeds:
Creates a parser capable of recognizing all parts of this file that are marked as relevant in the comments, that is:
1.1. everything inside BEGIN ACTIVITY and END ACTIVITY tags,
1.2. everything inside BEGIN ACTIVITY_TRANSITION and END ACTIVITY_TRANSITION,
1.3. everything inside BEGIN PROCESS_ACTIVITY and END PROCESS_ACTIVITY tags.
By "recognize everything" I mean there must be ANTLR 3 code, which allows me to put Java statements that would process the data extracted from the file like in the assignment rule in the original post. You don't need to write any Java code there, but it must be possible for me to add that code later.
All parts which are not marked as relevant can be ignored by the parser (similar to the comments in the original grammar).
Your grammar must be compatible with ANTLR 3, Java 8, and Windows 7.
You can remove the code in the original version (like here), so you don't get compiler errors.
The parser must either be generable using java -cp "D:\wftdiff\lib\antlr-3.5.2\antlr-3.5.2-complete.jar" org.antlr.Tool -o src/com/mycompany/wftdiff/parser/ grammar-src/wft.g, or, if you use any special settings, you must specify them in your answer. The point is that I need to be able to reproduce your result.
When I feed the sample file to the parser, it must consume it without complaining (without printing any ANTLR error messages, without crashing and without throwing technical exceptions like NullPointerException).

Here is the grammar. It recognizes all parts, and you can add Java actions wherever you want.
Compiled and tested with jdk1.8, antlr 3.5.2 and the provided sample input.
grammar wft;
@header {
package com.mycompany.wftdiff.parser;
}
@lexer::header {
package com.mycompany.wftdiff.parser;
}
@members {
}
wftFile : (COMMENT|assignment|definition|flow)*
;
assignment
: ID EQ STRING
;
definition
: 'DEFINE' ID
(COMMENT | (dclass ID type) | definition)*
'END' ID
;
dclass : 'KEY' | 'BASE' | 'TRANS'
;
type : tnum | tvarchar | tref | tdate
;
tnum : 'NUMBER'
;
tvarchar: 'VARCHAR2' '(' INT ')'
;
tref : 'REFERENCES' ID
;
tdate : 'DATE'
;
flow : 'BEGIN' ID (STRING)+
(COMMENT|assignment|flow)+
'END' ID
;
EQ : '='
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
NL : '\r'? '\n' {$channel=HIDDEN;}
;
COMMENT
: '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
) {$channel=HIDDEN;}
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
INT : '0'..'9'+
;

Antlr parser rule fails to match either of specified lexer rules

I have a small work-in-progress Antlr grammar that looks like:
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR (PACE | NUMBER) {
if ($PACE != null) {
$pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $PACE.text);
} else {
$pred = new SingleActivityPredicate($NAME.text, Operator.fromCharacter($OPERATOR.text), $NUMBER.text);
}
};
OPERATOR: ('>' | '<' | '=') ;
NAME: ('A'..'Z' | 'a'..'z')+ ;
NUMBER: ('0'..'9')+ ('.' ('0'..'9')+)? ;
PACE: ('0'..'9')('0'..'9')? ':' ('0'..'5')('0'..'9');
WS: (' ' | '\t' | '\r'| '\n')+ -> skip;
Hoping to parse things like:
distance = 4 or pace < 8:30
However, both of those inputs result in null for both PACE and NUMBER.
Dropping the alternative and using only PACE works fine (it also works the other way, using only NUMBER):
filterExpression returns [ActivityPredicate pred]
: NAME OPERATOR PACE { ... };
Why is it that when I provide the option, they're both null?

Try this.
filterExpression returns [ActivityPredicate pred]
: n=NAME o=OPERATOR (p=PACE | i=NUMBER) {
if ($p != null) {
$pred = new SingleActivityPredicate(
$n.text, Operator.fromCharacter($o.text), $p.text);
} else {
$pred = new SingleActivityPredicate(
$n.text, Operator.fromCharacter($o.text), $i.text);
}
};
