I have a problem figuring out how to parse a date in my grammar.
The issue is that a date shares its definition with a string, but according to the ANTLR 4 documentation, precedence should follow the order in which the alternatives are declared.
Here is my grammar:
grammar formula;
/* entry point */
parse: expr EOF;
expr
: value # argumentArithmeticExpr
| l=expr operator=('*'|'/'|'%') r=expr # multdivArithmeticExpr // TODO: test the % operator
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| '-' expr # minusArithmeticExpr
| FUNCTION_NAME '(' (expr ( ',' expr )* ) ? ')'# functionExpr
| '(' expr ')' # parensArithmeticExpr
;
value
: number
| variable
| date
| string
| bool;
/* Atoms */
bool
: BOOL
;
variable
: '[' (~(']') | ' ')* ']'
;
date
: DQUOTE date_format DQUOTE
| QUOTE date_format QUOTE
;
date_format
: year=INT '-' month=INT '-' day=INT (hour=INT ':' minutes=INT ':' seconds=INT)?
;
string
: STRING_LITERAL
;
number
: ('+'|'-')? NUMERIC_LITERAL
;
/* basic lexemes */
QUOTE : '\'';
DQUOTE : '"';
MINUS : '-';
COLON : ':';
DOT : '.';
PIPE : '|';
BOOL : T R U E | F A L S E;
FUNCTION_NAME: IDENTIFIER ;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO: do we need more chars in this set?
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )? // ex: 0.05e3
| '.' DIGIT+ ( E [-+]? DIGIT+ )? // ex: .05e3
;
INT: DIGIT+;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
| '"' ( ~'"' | '""' )* '"'
;
WS: [ \t\n]+ -> skip;
UNEXPECTED_CHAR: . ;
fragment DIGIT: [0-9];
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
The important part here is this:
value
: number
| variable
| date
| string
| bool;
date
: DQUOTE date_format DQUOTE
| QUOTE date_format QUOTE
;
date_format
: year=INT '-' month=INT '-' day=INT (hour=INT ':' minutes=INT ':' seconds=INT)?
;
My grammar expects these things:
"a quoted string" -> gives a string
"2015-03 TOTOTo" -> gives a string because the date format doesn't match.
"2015-03-15" -> gives a date because it matches DQUOTE INT '-' INT '-' INT DQUOTE
And I tried to make sure that the parser attempts to match a date before a string: value: ... | date | string | ....
But when I use the grun utility (and my unit tests...), I can see that it categorizes the date as a string, as if it never bothered to check the date format.
Can you tell me why that is?
I suspect there's a catch in the order in which I declare my grammar rules, but I tried several permutations without success.
The problem stems from a misunderstanding: the lexer runs to completion before any parser rule is considered.
That means the STRING_LITERAL lexer rule consumes all quoted text, dates included, and emits plain STRING_LITERAL tokens. The date rule and its subrules are never even reached by the parser.
Perhaps the minimal solution is to modify the STRING_LITERAL lexer rule to:
STRING_LITERAL
: { notDateString() }?
( QUOTE .*? QUOTE
| DQUOTE .*? DQUOTE
)
;
The notDateString predicate requires target-language (Java) code to perform the essential disambiguation between date-formatted strings and other strings.
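For illustration, here is one possible shape for such a predicate, placed in a @lexer::members block of the grammar. This is only a sketch: the helper name comes from the rule above, but the regex and the 30-character look-ahead window are assumptions, not part of the original grammar.
// Hypothetical helper backing the { notDateString() }? predicate above.
// It peeks at the characters at the current input position and rejects
// anything that starts like a quoted yyyy-M-d date, so STRING_LITERAL
// will not match those inputs and the parser's date rule gets a chance.
private static final java.util.regex.Pattern DATE_LIKE =
        java.util.regex.Pattern.compile("['\"]\\d{4}-\\d{1,2}-\\d{1,2}");

boolean notDateString() {
    int start = _input.index();
    int stop = Math.min(start + 30, _input.size() - 1);
    if (stop < start) return true; // at end of input: nothing date-like ahead
    String ahead = _input.getText(org.antlr.v4.runtime.misc.Interval.of(start, stop));
    return !DATE_LIKE.matcher(ahead).lookingAt();
}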
Another alternative is to promote the STRING_LITERAL rule entirely to the parser. Doable, but a bit messy depending on whether you need to preserve whitespace within 'real' strings.
BTW, you may wish to add a token stream dump to your standard series of unit tests.
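Something along these lines works as a quick diagnostic (a sketch; the class names assume ANTLR's default naming for the formula grammar above):
formulaLexer lexer = new formulaLexer(CharStreams.fromString("'2015-03-15'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // force the lexer to run to completion
for (Token t : tokens.getTokens()) {
    // print the symbolic token type next to the matched text
    System.out.printf("%-18s %s%n",
            formulaLexer.VOCABULARY.getDisplayName(t.getType()), t.getText());
}
With the grammar as posted, the whole quoted date shows up as a single STRING_LITERAL token, which is exactly why the date parser rule never fires.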
Related
I am playing with ANTLR 4 grammar files, and I wanted to write my own JSONPath grammar.
I've come up with this:
grammar ObjectPath;
objectPath : dnot;
dnot : ROOT expr ('.' expr)
| EOF
;
expr : select #selectExpr
| ID #idExpr
;
select : ID '[]' #selectAll
| ID '[' INT ']' #selectIndex
| ID '[' INT (',' INT)* ']' #selectIndexes
| ID '[' INT ':' INT ']' #selectRange
| ID '[' INT ':]' #selectFrom
| ID '[:' INT ']' #selectUntil
| ID '[-' INT ':]' #selectLast
| ID '[?(' query ')]' #selectQuery
;
query : expr (AND|OR) expr # andOr
| ALL # all
| QPREF ID # prop
| QPREF ID GT INT # gt
| QPREF ID LT INT # lt
| QPREF ID EQ INT # eq
| QPREF ID GTE INT # gte
| QPREF ID LTE INT # lte
;
/** Lexer **/
ROOT : '$.' ;
QPREF : '#.' ;
ID : [a-zA-Z][a-zA-Z0-9]* ;
INT : '0' | [1-9][0-9]* ;
AND : '&&' ;
OR : '||' ;
GT : '>' ;
LT : '<' ;
EQ : '==' ;
GTE : '>=' ;
LTE : '<=' ;
ALL : '*' ;
After running this on a simple expression:
CharStream input = CharStreams.fromString("$.name");
ObjectPathLexer lexer = new ObjectPathLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ObjectPathParser parser = new ObjectPathParser(tokens);
ParseTree parseTree = parser.dnot();
ObjectPathDefaultVisitor visitor = ...
System.out.println(visitor.visit(parseTree));
System.out.println(parseTree.toStringTree(parser));
The output is OK, meaning that "name" is actually retrieved from the JSON, but there's a warning I cannot explain:
line 1:6 mismatched input '<EOF>' expecting '.'
I've read that I need to explicitly add EOF to my start rule (dnot), but this doesn't seem to work.
Any idea what I can do?
Your input $.name cannot be parsed by your rule:
dnot : ROOT expr ('.' expr)
| EOF
;
$.name produces 2 tokens:
ROOT
ID
But your first alternative, ROOT expr ('.' expr), expects two expressions separated by a '.'. Perhaps you meant to make the trailing ('.' expr) part optional, or repeatable, like this:
dnot : ROOT expr ('.' expr)*
| EOF
;
And the EOF is generally added at the end of your start rule, to force the parser to consume all tokens. As it stands now, the parser successfully parsed ROOT expr, but then failed to parse further, producing the warning you saw (expecting '.').
Since objectPath seems to be your start rule, I think this is what you want to do:
objectPath : dnot EOF;
dnot : ROOT expr ('.' expr)?
;
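With that change, the driver from your question would start parsing at objectPath instead of dnot (a sketch based on your posted code, just swapping the entry rule):
CharStream input = CharStreams.fromString("$.name");
ObjectPathLexer lexer = new ObjectPathLexer(input);
ObjectPathParser parser = new ObjectPathParser(new CommonTokenStream(lexer));
ParseTree parseTree = parser.objectPath(); // the start rule now anchors on EOF
System.out.println(parseTree.toStringTree(parser));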
Also, tokens like '[]', '[?(', etc. look suspicious. I'm not really familiar with Object Path, but by gluing these characters together, input like [ ] (a [ and a ] separated by a space) will not be matched by '[]'. So if foo[ ] is valid, I'd write it like this instead:
select : ID '[' ']' #selectAll
| ...
and skip spaces in the lexer:
SPACES : [ \t\r\n]+ -> skip;
Please bear with me, I'm not a coding expert.
I built a grammar in ANTLR 4 using ANTLRWorks 2. I tested the grammar with various test strings and it works fine there. Now what I'm having trouble with is using the generated lexer and parser in my own code. As the code generation target I'm using Java.
Here is the code I'm trying:
String s = "query(std::map .find(x) == y): bla";
ANTLRInputStream input = new ANTLRInputStream(s);
TokenStream tokens = new CommonTokenStream(new pqlcLexer(input));
pqlcParser parser = new pqlcParser(tokens);
ParseTree tree = parser.query();
System.out.println(tree.toStringTree());
The output of that is just "query", which is my start rule. I would expect something like the output from ANTLRWorks:
"(query (quant_expr query ( (match std::map . find ( (cm x) ) == (cm (numeral 256))) ) : (query (qexpr bla))))"
Here is the tree visually: http://puu.sh/94Nlx/00dc35bb05.png
Which methods do I have to call to get the proper syntax tree as output?
Here is the generated Parser for reference: http://pastebin.com/Lb34TyRW and the grammar:
// Lexer
// Keywords
EXISTS: 'exists';
REDUCE: 'reduce';
QUERY: 'query';
INT: 'int';
DOUBLE: 'double';
CONST: 'const';
STDVECTOR: 'std::vector';
STDMAP: 'std::map';
STDSET: 'std::set';
INTEGER_LITERAL : (DIGIT)+ ;
fragment DIGIT: '0'..'9';
DOUBLE_LITERAL : DIGIT '.' DIGIT+;
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
EQUAL : '==';
LE : '<=';
GE : '>=';
GT : '>';
LT : '<';
ADD : '+';
MUL : '*';
AND : '&&';
COLON : ':';
IDENTIFIER : JavaLetter JavaLetterOrDigit*;
fragment JavaLetter : [a-zA-Z$_]; // these are the "java letters" below 0xFF
fragment JavaLetterOrDigit : [a-zA-Z0-9$_]; // these are the "java letters or digits" below 0xFF
WS
: [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Parser
//start_rule: query;
query :
quant_expr
| qexpr+
| IDENTIFIER // order IDENTIFIER and qexpr+?
| numeral
//| c_expr TODO
;
c_type : INT | DOUBLE | CONST;
bin_op: AND | ADD | MUL | EQUAL | LT | GT | LE| GE;
qexpr:
LPAREN query RPAREN bin_op_query?
// query bin_op query
| IDENTIFIER bin_op_query? // copied from query to resolve left recursion problem
| numeral bin_op_query? // ^
| quant_expr bin_op_query? // ^
// query.find(query)
| IDENTIFIER find_query? // copied from query to resolve left recursion problem
| numeral find_query? // ^
| quant_expr find_query?
// query[query]
| IDENTIFIER array_query? // copied from query to resolve left recursion problem
| numeral array_query? // ^
| quant_expr array_query?
// | qexpr bin_op_query // bad, resolved by quexpr+ in query
;
bin_op_query: bin_op query bin_op_query?; // resolve left recursion of query bin_op query
find_query: '.''find' LPAREN query RPAREN;
array_query: LBRACK query RBRACK;
quant_expr:
quant id ':' query
| QUERY LPAREN match RPAREN ':' query
| REDUCE LPAREN IDENTIFIER RPAREN id ':' query
;
match:
STDVECTOR LBRACK id RBRACK EQUAL cm
| STDMAP '.''find' LPAREN cm RPAREN EQUAL cm
| STDSET '.''find' LPAREN cm RPAREN
;
cm:
IDENTIFIER
| numeral
// | c_expr TODO
;
quant :
EXISTS;
id :
c_type IDENTIFIER
| IDENTIFIER // Per page 2 but not the overview. According to the overview id ->, but then rule 1 would have no +
;
numeral :
INTEGER_LITERAL
| DOUBLE_LITERAL
;
Apart from the fact that Java classes should start with an uppercase letter (so you should rename your grammar so that it does, too), your last line should be
System.out.println(tree.toStringTree(parser));
to print the tree. Otherwise the tree doesn't know which parser to use and only outputs what you described.
EDIT
When naming your grammar PQLC, the following code
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
public class Test {
public static void main(String[] args) throws Exception {
String query = "query(std::map .find(x) == y): bla";
ANTLRInputStream input = new ANTLRInputStream(query);
PQLCLexer lexer = new PQLCLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PQLCParser parser = new PQLCParser(tokens);
ParseTree tree = parser.query(); // begin parsing at query rule
System.out.println(tree.toStringTree(parser)); // print LISP-style tree
}
}
produces this output with ANTLR v4.2 on my machine:
(query (quant_expr query ( (match std::map . find ( (cm x) ) == (cm y)) ) : (query (qexpr bla))))
I've coded a simple lexer and parser using ANTLR 4 grammars to make a language plugin for NetBeans 7.3, to help my team write our layout files more quickly (a mix of XHTML and widget definitions, the latter also in the form of XHTML tags but with custom properties and characteristics, and with some differences from XHTML syntax).
Template file example:
<div style="dyn_layout_panel">
#symbol#
<w_label=label, text="Try to close this window" />
<w_buttonclose=button, text = "CLOSE", on_press=press_close />
<w_buttonterminate=button, text="TERMINATE", on_press=press_terminate />
<w_mydatepicker=datepicker, parent=tab0, ary=[10, "str", /regex/i], start_date=2013-10-05, on_selected=datepicker_selected />
<w_myeditbox=editbox, parent=tab0, validation=USER_REGEX, validation_regex=/^[0-9]+[a-z]*$/i,
validation_msg="User regex don't match editbox contents.", on_keyreturn=tab0_editbox_keyreturn />
<div style="dyn_layout_panel">
$SYMBOL_2$
Some text that make a text node.
</div>
</div>
I use ANTLRWorks 2 to write and debug the lexer and parser, and everything seems fine there. In NetBeans I also don't get any exceptions and the parser works properly, but during editing/typing I lose token colors near the cursor.
Screenshot of problem:
Adding debug console output for each keystroke, I can see that the lexer enters the IN_TAG or IN_WIDGET mode correctly, but after a WHITESPACE it returns to the default mode and matches the rest of the text inside the tag as a TEXT_NODE token.
I know that a lexer can have only one active mode at a time, so why does it match the TEXT_NODE rule while it is in IN_TAG or IN_WIDGET mode?
Lexer grammar file:
lexer grammar LayoutLexer;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
WS : ( ' '
| '\t'
| EOL
)+? -> channel(HIDDEN)
;
WDG_START_OPEN : '<w_' PROPERTY -> pushMode(IN_WIDGET) ;
WDG_END_OPEN : '</w_' PROPERTY -> pushMode(IN_WIDGET) ;
TAG_START_OPEN : '<' ATTRIBUTE -> pushMode(IN_TAG) ;
TAG_END_OPEN : '</' ATTRIBUTE -> pushMode(IN_TAG) ;
EXT_REF
: ( ('#' REF_NAME '#') | ('$' SYMBOL '$') | ('§' REF_NAME '§') )
;
fragment
REF_NAME
: ( [a-z]+ [0-9a-z_]*? )
;
fragment
EOL : ( '\r\n' | '\n\r' | '\n' )
;
EQUAL
: '='
;
TEXT_NODE
: ( (~('\r'|'\n'|'<'|'#'|'$'|'§'))+ )
;
ERROR
: ( .+? )
;
mode IN_TAG;
TAG_CLOSE : '>' -> popMode ;
TAG_EMPTY_CLOSE : '/>' -> popMode ;
TAG_WS : WS -> type(WS), channel(HIDDEN) ;
TAG_COMMENT : COMMENT -> type(COMMENT), channel(HIDDEN) ;
TAG_EQ : EQUAL -> type(EQUAL) ;
ATTRIBUTE
: ( LITERAL [0-9a-zA-Z_]* )
;
VAL
: ( '"' ( ESC_SEQ | ~('\\'|'"') )*? '"'
| '\'' ( ESC_SEQ | ~('\\'|'\'') )*? '\'' )
;
TAG_ERR : ERROR -> type(ERROR) ;
mode IN_WIDGET;
WDG_CLOSE : '>' -> popMode ;
WDG_EMPTY_CLOSE : '/>' -> popMode ;
WDG_WS : WS -> type(WS), mode(IN_WIDGET), channel(HIDDEN) ;
WDG_COMMENT : COMMENT -> type(COMMENT), channel(HIDDEN) ;
WDG_EQ : EQUAL -> type(EQUAL), pushMode(WDG_ASSIGN) ;
COMMA
: ','
;
fragment
MINUS
: '-'
;
STRING
: ( '"' ( ESC_SEQ | ~('\\'|'"') )*? '"'
| '\'' ( ESC_SEQ | ~('\\'|'\'') )*? '\'' )
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment
HEX_DIGIT
: [0-9a-fA-F]
;
fragment
DIGIT
: [0-9]
;
fragment
HEX_NUMBER
: '0x' HEX_DIGIT+
;
fragment
HTML_NUMBER
: (INT_NUMBER | FLOAT_NUMBER) HTML_UNITS
;
fragment
FLOAT_NUMBER
: MINUS? INT_NUMBER '.' DIGIT+
;
fragment
INT_NUMBER
: MINUS? DIGIT+
;
EVENT_HANDLER
: 'on_' PROPERTY
;
PROPERTY
: ( LITERAL [0-9a-zA-Z_]* )
;
fragment
LITERAL
: ( LITERAL_U | LITERAL_L )
;
fragment
LITERAL_U
: [A-Z]+
;
fragment
LITERAL_L
: [a-z]+
;
WDG_ERR : ERROR -> type(ERROR) ;
mode WDG_ASSIGN;
PHP_REF
: ( LITERAL_L ('_' | LITERAL_L | [0-9])* ) -> popMode
;
VALUE : (WDG_VAL | ARRAY) -> popMode;
ASGN_WS : WS -> type(WS), channel(HIDDEN);
ASGN_COMMA : COMMA -> type(COMMA);
ARY_START
: '['
;
ARY_END
: ']'
;
BIT_OR
: '|'
;
ARRAY
: ARY_START ARY_VALUE (ASGN_COMMA ARY_VALUE)* ARY_END
;
fragment
ARY_VALUE : ASGN_WS? WDG_VAL ASGN_WS? -> type(VALUE);
fragment
WDG_VAL
: (STRING
| UTC_DATE
| HEX_NUMBER
| HTML_NUMBER
| FLOAT_NUMBER
| INT_NUMBER
| BOOLEAN
| BITFIELD
| REGEX
| CSS_CLASS)
;
fragment
HTML_UNITS
: ('%'|'in'|'cm'|'mm'|'em'|'ex'|'pt'|'pc'|'px')
;
fragment
BOOLEAN
: ('true'|'false')
;
fragment
BITFIELD
: SYMBOL (WS? BIT_OR WS? SYMBOL)*
;
SYMBOL
: LITERAL_U [0-9A-Z_]*
;
UTC_DATE
: (DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT)
;
REGEX
: ('/' ('\\'.|.)*? '/' ('g'|'m'|'i')* )
;
CSS_CLASS
: ( LITERAL_L ('-' | '_' | LITERAL_L | [0-9])* )
;
WDG_ASSIGN_ERR : ERROR -> type(ERROR), popMode;
Parser grammar file:
parser grammar LayoutParser;
options
{
tokenVocab=LayoutLexer;
language=Java;
}
document : (element | TEXT_NODE | EXT_REF)* EOF;
element
locals
[
String currentTag
]
: ( ( html_open_tag (element | TEXT_NODE | EXT_REF)* html_close_tag )
| ( wdg_open_tag (element | TEXT_NODE | EXT_REF)* wdg_close_tag )
| ( html_empty_tag | wdg_empty_tag ) )
;
html_empty_tag
: TAG_START_OPEN (ATTRIBUTE EQUAL VAL)* TAG_EMPTY_CLOSE
;
html_open_tag
: ( tag=TAG_START_OPEN (ATTRIBUTE EQUAL VAL)* TAG_CLOSE )
{$element::currentTag = $tag.text.substring(1);}
;
html_close_tag
: tag=TAG_END_OPEN TAG_CLOSE
{
if (!$element::currentTag.equals($tag.text.substring(2)))
notifyErrorListeners("HTML tag mismatch '" + $element::currentTag + "' - '" + $tag.text.substring(2) + "'");
}
;
wdg_empty_tag
: WDG_START_OPEN EQUAL PHP_REF ( COMMA (wdg_prop | wdg_event) )* WDG_EMPTY_CLOSE
;
wdg_open_tag
: tag=WDG_START_OPEN EQUAL PHP_REF ( COMMA (wdg_prop | wdg_event) )* WDG_CLOSE
{$element::currentTag = $tag.text.substring(1);}
;
wdg_close_tag
: tag=WDG_END_OPEN WDG_CLOSE
{
if (!$element::currentTag.equals($tag.text.substring(2)))
notifyErrorListeners("Widget alias mismatch '" + $element::currentTag + "' - '" + $tag.text + "'");
}
;
wdg_prop
: PROPERTY (EQUAL (ARRAY | VALUE | PHP_REF | UTC_DATE | REGEX | CSS_CLASS))?
;
wdg_event
: EVENT_HANDLER EQUAL PHP_REF
;
Depending on the implementation of syntax highlighting, the IDE may or may not start at the beginning of the document when lexing the input for syntax highlighting. If it does not start at the beginning of the document, then before returning any tokens, you need to ensure that the lexer instance is initialized in the correct mode (both the _mode and _modeStack fields need to be initialized to their correct state at the point where lexing starts).
If your lexer reads or writes any custom fields during lexing, you may need to restore those fields as well.
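In essence, the lexer state is captured when the IDE suspends lexing and pushed into the next lexer instance before it produces tokens. A minimal sketch of that hand-off (the field names are those of the ANTLR 4 Lexer base class; lexer and freshLexer are placeholders for whatever your editor integration creates):
// When the IDE suspends lexing, remember where the ANTLR lexer was.
int savedMode = lexer._mode;
IntegerStack savedModeStack = new IntegerStack(lexer._modeStack);

// When lexing resumes with a fresh lexer instance, restore that state
// before asking it for any tokens.
freshLexer._mode = savedMode;
freshLexer._modeStack.addAll(savedModeStack);
A full NetBeans implementation of this pattern appears below under "For future reference".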
Examples
GoWorks (NetBeans based, LGPL License). This implementation does not use the lexer facilities in the NetBeans API, but instead implements the functionality at a lower level. For now you can ignore the MarkOccurrences* and SemanticHighlighter classes.
package org.tvl.goworks.editor.go.highlighter
package org.antlr.works.editor.antlr4.highlighting
ANTLR 4 IntelliJ Plugin (IntelliJ IDEA, BSD license).
package org.antlr.intellij.adaptor.lexer
package org.antlr.intellij.plugin (in particular, the SyntaxHighlighter classes)
Additional efficiency notes
Your REF_NAME, VAL, and STRING rules use non-greedy loops that do not need to be non-greedy. In each of these rules, change +? to + and change *? to *.
Your WS and ERROR rules use a non-greedy operator +? which is equivalent to not having a closure at all. The unnecessary use of a non-greedy operator in these cases only serves to slow down your lexer. To preserve the existing behavior, you can remove +? from these rules (replacing with + would change behavior).
Additional functionality notes
ANTLR 4 does not perform any error correction during lexing. If the input does not match a token, then the input simply does not match a token. This issue affects your VAL and STRING tokens in particular, which will not get syntax highlighting prior to adding the closing " or ' character. For syntax highlighting these types of tokens, I prefer to use an additional mode in the lexer, allowing me to produce separate tokens for the escape sequences embedded in the string, as well as syntax highlighting an unterminated string at the end of the line (unless your language allows strings to span multiple lines, in which case you'd stop at the end of the input).
For future reference
All problems were related to my incorrect implementation of the NetBeans Lexer<T> class; many tutorials on the web do not take into account that a lexer may have more than one mode and that the lexer state must be backed up and restored between Lexer allocations/releases, as mentioned by 280Z28.
This is the code I use to make syntax highlighting consistent:
public class LayoutEditorLexer implements Lexer<LayoutTokenId> {
private LexerRestartInfo<LayoutTokenId> info;
private LayoutLexer lexer;
private class LexerState {
public int Mode = -1;
public IntegerStack Stack = null;
public LexerState(int mode, IntegerStack stack)
{
Mode = mode;
Stack = new IntegerStack(stack);
}
}
public LayoutEditorLexer(LexerRestartInfo<LayoutTokenId> info) {
this.info = info;
AntlrCharStream charStream = new AntlrCharStream(info.input(), "LayoutEditor", false);
lexer = new LayoutLexer(charStream);
lexer.removeErrorListeners();
lexer.addErrorListener(ErrorListener.INSTANCE);
LexerState lexerMode = (LexerState)info.state();
if (lexerMode != null)
{
lexer._mode = lexerMode.Mode;
lexer._modeStack.addAll(lexerMode.Stack);
}
}
@Override
public org.netbeans.api.lexer.Token<LayoutTokenId> nextToken() {
Token token = lexer.nextToken();
int ttype = token.getType();
if (ttype != LayoutLexer.EOF)
{
LayoutTokenId tokenId = LayoutLanguageHierarchy.getToken(ttype);
return info.tokenFactory().createToken(tokenId);
}
return null;
}
@Override
public Object state()
{
// Here many tutorials simply return null.
return new LexerState(lexer._mode, lexer._modeStack);
}
@Override
public void release()
{
}
}
I know that this has been discussed a thousand times, but I still cannot figure out why the following grammar is failing. In the interpreter everything works fine, without any errors or warnings. However, when running the generated code I'm getting mismatched input errors, as shown below.
For this grammar:
grammar xxx;
options {
language = Java;
output = AST;
}
@members {
@Override
public String getErrorMessage(RecognitionException e,
String[] tokenNames)
{
List stack = getRuleInvocationStack(e, this.getClass().getName());
String msg = null;
if ( e instanceof NoViableAltException ) {
NoViableAltException nvae = (NoViableAltException)e;
msg = " no viable alt; token="+e.token+
" (decision="+nvae.decisionNumber+
" state "+nvae.stateNumber+")"+
" decision=<<"+nvae.grammarDecisionDescription+">>";
}
else {
msg = super.getErrorMessage(e, tokenNames);
}
return stack+" "+msg;
}
@Override
public String getTokenErrorDisplay(Token t) {
return t.toString();
}
}
obj
: first=subscription
(COMMA other=subscription)*
;
subscription
: ID
(EQUALS arguments_in_brackets)?
filters
;
arguments_in_brackets
: LOPAREN arguments ROPAREN
;
arguments
: (arguments_body)
;
arguments_body
: argument (arguments_more)?
;
arguments_more
: SEMICOLON arguments_body
;
argument
: id_equals argument_body
;
argument_body
: STRING
| INT
| FLOAT
;
filters
: LSPAREN expression RSPAREN
;
expression
: or
;
or
: first=and
(OR^ second=and)*
;
and : first=atom
(AND^ second=atom)*
;
atom
: filter
| atom_expression
;
atom_expression
: LCPAREN
expression
RCPAREN
;
filter
: id_equals arguments_in_brackets
;
id_equals
: WS* ID WS* EQUALS WS*
;
COMMA: WS* ',' WS*;
LCPAREN : WS* '(' WS*;
RCPAREN : WS* ')' WS*;
LSPAREN : WS* '[' WS*;
RSPAREN : WS* ']' WS*;
LOPAREN : WS* '{' WS*;
ROPAREN : WS* '}' WS*;
AND: WS* 'AND' WS*;
OR: WS* 'OR' WS*;
NOT: WS* 'NOT' WS*;
EQUALS: WS* '=' WS*;
SEMICOLON: WS* ';' WS*;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
// : '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
// : '"' (~'"')* '"'
STRING
: '"' (~'"')* '"'
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
NEWLINE: '\r'? '\n' {skip();} ;
WS: (' '|'\t')+ {skip();} ;
And for this input:
status={name="Waiting";val=5}[ownerEmail1={email="dsa#fdsf.ds"} OR internalStatus={status="New"}],comments={type="fds"}[(internalStatus={status="Owned"} AND ownerEmail2={email="dsa#fds.ds"}) OR (role={type="Contributor"} AND status={status="Closed"})]
I'm getting:
line 1:67 [obj, subscription, filters, expression, or, and, atom, filter, arguments_in_brackets] mismatched input [#18,67:80='internalStatus',<11>,1:67] expecting ROPAREN
line 1:157 [obj, subscription, filters, expression, or, and, atom, atom_expression, expression, or, and, atom, filter, arguments_in_brackets] mismatched input [#42,157:167='ownerEmail2',<11>,1:157] expecting ROPAREN
Can someone give me any clues as to why this is failing, please? I've tried to rewrite it in many ways but the error is still the same.
The problem is that you're using the WS rule inside other lexer rules while WS itself skips its token. This causes the lexer to skip, and thus discard, those other tokens entirely as well, so they can never be used in parser rules.
So, if you have a rule like:
WS : ' ' {skip();};
and then use this rule in NOT:
NOT : WS* 'NOT' WS*;
it causes the NOT token to be skipped as well.
If you're already skipping these WS chars, you don't need to include them in your other lexer rules: simply remove all WS* in other rules:
...
NOT : 'NOT';
...
(also remove them from parser rules: all skipped tokens from the lexer are never available in parser rules anyway!)
I am currently creating a more or less simple expression evaluator using ANTLR.
My grammar is straightforward (at least I hope so) and looks like this:
grammar SXLGrammar;
options {
language = Java;
output = AST;
}
tokens {
OR = 'OR';
AND = 'AND';
NOT = 'NOT';
GT = '>'; //greater than
GE = '>='; //greater than or equal
LT = '<'; //lower than
LE = '<='; //lower than or equal
EQ = '=';
NEQ = '!='; //Not equal
PLUS = '+';
MINUS = '-';
MULTIPLY = '*';
DIVISION = '/';
CALL;
}
#header {
package somepackage;
}
#members {
}
#lexer::header {
package rise.spics.sxl;
}
rule
: ('='|':')! expression
;
expression
: booleanOrExpression
;
booleanOrExpression
:
booleanAndExpression ('OR'^ booleanAndExpression)*
;
booleanAndExpression
:
booleanNotExpression ('AND'^ booleanNotExpression)*
;
booleanNotExpression
:
('NOT'^)? booleanAtom
;
booleanAtom
:
| compareExpression
;
compareExpression
:
commonExpression (('<' | '>' | '=' | '<=' | '>=' | '!=' )^ commonExpression)?
;
commonExpression
:
multExpr
(
(
'+'^
| '-'^
)
multExpr
)*
| DATE
;
multExpr
:
atom (('*'|'/')^ atom)*
| '-'^ atom
;
atom
:
INTEGER
| DECIMAL
| BOOLEAN
| ID
| '(' expression ')' -> expression
| functionCall
;
functionCall
:
ID '(' arguments ')' -> ^(CALL ID arguments?)
;
arguments
:
(expression) (','! expression)*
| WS
;
BOOLEAN
:
'true'
| 'false'
;
ID
:
(
'a'..'z'
| 'A'..'Z'
)+
;
INTEGER
:
('0'..'9')+
;
DECIMAL
:
('0'..'9')+ ('.' ('0'..'9')*)?
;
DATE
:
'!' '0'..'9' '0'..'9' '0'..'9' '0'..'9' '-' '0'..'9' '0'..'9' '-' '0'..'9' '0'..'9' (' ' '0'..'9' '0'..'9' ':''0'..'9' '0'..'9' (':''0'..'9' '0'..'9')?)?
;
WS
: (' '|'\t' | '\n' | '\r' | '\f')+ { $channel = HIDDEN; };
Now if I try to parse an invalid expression like "= true NOT true", the graphical test tool of the Eclipse plugin throws a NoViableAltException: line 1:6 no viable alternative at input 'NOT', which is correct and expected.
Now if I try to parse the expression in a Java program, nothing happens. The program
String expression = "=true NOT false";
CharStream input = new ANTLRStringStream(expression);
SXLGrammarLexer lexer = new SXLGrammarLexer(input);
TokenStream tokenStream = new CommonTokenStream(lexer);
SXLGrammarParser parser = new SXLGrammarParser(tokenStream);
CommonTree tree = (CommonTree) parser.rule().getTree();
System.out.println(tree.toStringTree());
System.out.println(parser.getNumberOfSyntaxErrors());
would output:
true
0
That means the AST created by the parser consists of one node and ignores the rest. I'd like to handle syntax errors in my application, but that's not possible if the generated parser doesn't find any errors.
I also tried to alter the parser by overriding the displayRecognitionError() method with something like this:
public void displayRecognitionError(String[] tokenNames,
RecognitionException e) {
String msg = getErrorMessage(e, tokenNames);
throw new RuntimeException("Error at position "+e.index+" " + msg);
}
but displayRecognitionError never gets called.
If I try something like "=1+", an error gets displayed. I guess there's something wrong with my grammar, but why does the Eclipse plugin throw that error while the generated parser does not?
If you want rule to consume the entire token stream, you have to specify where you expect the end of your input. Like this:
rule
: ('='|':')! expression EOF
;
Without the EOF, your parser reads the true as a boolean and ignores the rest.
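With the EOF added, the same driver code from the question should now report the error instead of silently stopping after true (a sketch reusing the classes from the question; the exact message depends on your ANTLR 3 version):
String expression = "=true NOT false";
SXLGrammarLexer lexer = new SXLGrammarLexer(new ANTLRStringStream(expression));
SXLGrammarParser parser = new SXLGrammarParser(new CommonTokenStream(lexer));
parser.rule(); // displayRecognitionError (or your override) is now invoked
System.out.println(parser.getNumberOfSyntaxErrors()); // expected to be > 0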