Antlr4 - Parser for multi line file - - java

I'm trying to use antlr4 to parse a ssh command result, but I can not figure out why this code doesn't work, I keep getting an "extraneous input" error.
Here is a sample of the file I'm trying to parse :
system
home[1] HOME-NEW
sp
cpu[1]
cpu[2]
home[2] SECOND-HOME
sp
cpu[1]
cpu[2]
Here is my grammar file :
listAll
: ( system | home | NL)*
;
elements
: (sp | cpu )*
;
home
: 'home[' number ']' value NL elements
;
system
: 'system' NL
;
sp
: 'sp' NL
;
cpu
: 'cpu[' number ']' NL
;
value
: VALUE
;
number
: INT
;
VALUE : STRING+;
STRING: ('a'..'z'|'A'..'Z'| '-' | ' ' | '(' | ')' | '/' | '.' | '[' | ']');
INT : ('0'..'9')+ ;
NL : '\r'? '\n';
WS : (' '|'\t')* {skip();} ;
The entry point is 'listAll'.
Here is the result I get :
(listAll \r\n (system system \r\n) home[1] HOME-NEW \r\n sp \r\n cpu[1] \r\n cpu[2] \r\n[...])
The parsing failed after 'system'. And I get this error :
line 2:1 extraneous input 'home[1] HOME-NEW' expecting {, system', NL, WS}
Does anybody know why this is not working ?
I am a beginner with Antlr, and I'm not sure I really understand how it works !
Thank you all !

You need to combine NL and WS as one WS element and skip it using -> skip (not {skip()})
And since the WS will be skipped automatically, no need to specify it in all the rules.
Also, your STRING had a space (' ') which was causing the error and taking up the next input.
Here is your complete grammar :
listAll : ( system | home )* ;
elements : ( sp | cpu )* ;
home : 'home[' number ']' value elements;
system : 'system' ;
sp : 'sp' ;
cpu : 'cpu[' number ']' ;
value : VALUE ;
number : INT ;
VALUE : STRING+;
STRING : ('a'..'z'|'A'..'Z'| '-' | '(' | ')' | '/' | '.' | '[' | ']') ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Also, I'll suggest you to go through the ANTLR4 Documentation

Related

ANTLR Grammar line 1:6 mismatched input '<EOF>' expecting '.'

I am playing with antlr4 grammar files, and I wanted to write my own jsonpath grammar.
I've comeup with this:
grammar ObjectPath;
objectPath : dnot;
dnot : ROOT expr ('.' expr)
| EOF
;
expr : select #selectExpr
| ID #idExpr
;
select : ID '[]' #selectAll
| ID '[' INT ']' #selectIndex
| ID '[' INT (',' INT)* ']' #selectIndexes
| ID '[' INT ':' INT ']' #selectRange
| ID '[' INT ':]' #selectFrom
| ID '[:' INT ']' #selectUntil
| ID '[-' INT ':]' #selectLast
| ID '[?(' query ')]' #selectQuery
;
query : expr (AND|OR) expr # andOr
| ALL # all
| QPREF ID # prop
| QPREF ID GT INT # gt
| QPREF ID LT INT # lt
| QPREF ID EQ INT # eq
| QPREF ID GTE INT # gte
| QPREF ID LTE INT # lte
;
/** Lexer **/
ROOT : '$.' ;
QPREF : '#.' ;
ID : [a-zA-Z][a-zA-Z0-9]* ;
INT : '0' | [1-9][0-9]* ;
AND : '&&' ;
OR : '||' ;
GT : '>' ;
LT : '<' ;
EQ : '==' ;
GTE : '>=' ;
LTE : '<=' ;
ALL : '*' ;
After running this on a simple expression:
CharStream input = CharStreams.fromString("$.name");
ObjectPathLexer lexer = new ObjectPathLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ObjectPathParser parser = new ObjectPathParser(tokens);
ParseTree parseTree = parser.dnot();
ObjectPathDefaultVisitor visitor = ...
System.out.println(visitor.visit(parseTree));
System.out.println(parseTree.toStringTree(parser));
The output is ok, meaning that the "name" is actually retrieved from the json, but there's a warning I cannot explain:
line 1:6 mismatched input '<EOF>' expecting '.'
I've read that I need to explicitly have an EOF rule added to my starting one (dnot), but this doesn't seem to work.
Any idea what can I do ?
Your input $.name cannot be parsed by your rule:
dnot : ROOT expr ('.' expr)
| EOF
;
$.name produces 2 tokens:
ROOT
ID
But your first alternative, ROOT expr ('.' expr), expects 2 expressions separated by a .. Perhaps you meant to make the second expr optional, like this:
dnot : ROOT expr ('.' expr)*
| EOF
;
And the EOF is generally added at the end of your start rule, to force the parser to consume all tokens. As you did it now, the parser successfully parsed ROOT expr, but then failed to parse further, and produces the warning you saw (expecting '.').
Since objectPath seems to be your start rule, I think this is what you want to do:
objectPath : dnot EOF;
dnot : ROOT expr ('.' expr)?
;
Also, tokens like these [], '[?(', etc look suspicious. I'm not really familiar with Object Path, but by glueing these chars to each other, input like this [ ] ([ and ] separated by a space) will not be matched by []. So if foo[ ] is valid, I'd write it like this instead:
select : ID '[' ']' #selectAll
| ...
and skip spaces in the lexer:
SPACES : [ \t\r\n]+ -> skip;

ANTLR "no viable alternative at input" Error for parsing SAS code If Then Else

I am new to ANTLR and working on a parser to parse SAS code which mainly comprises of if then else if statements. I have created the following grammar to parse the code but I am getting error in Intellij when I tried to run using sample application.
Grammar created :
grammar SASDTModel;
parse
: if_block+
| score_block
;
//Model
// : If_block+
// | Score_block
// ;
if_block
: (if_statement|if_in_block)
| else_if_statement+
| else_statement
;
if_statement
: IF '(' if_condition ')' THEN Identifier'='Value ';'
| IF Identifier'='Value THEN Identifier'='Value ';'
;
else_if_statement
: ELSEIF '(' if_condition ')' THEN Identifier'='Value ';'
| ELSEIF Identifier'='Value THEN Identifier'='Value ';'
;
if_condition
: Value ComparisionOperators Identifier ComparisionOperators Value
| Value ComparisionOperators Value;
else_statement
: ELSE Identifier'='Value ';'
;
if_in_block
: IF Identifier IN '(' StringArray ')' THEN Identifier'='Value ';'
;
score_block
: Identifier'='Arithmetic_expression ';'
;
Arithmetic_expression:
| ( ArithmeticOperators '(' Value ')' )+
| ( ArithmeticOperators '(' Value ArithmeticOperators Identifier ')' )+
;
WS : ( ' ' | '\t' | '\r' | '\n' )-> channel(HIDDEN);
//WS : [ \t\n\r]+ -> channel(HIDDEN) ;
//WS : (' ' | '\t')+ -> channel(HIDDEN);
//COMMENT : '/*' .*? '*/' -> skip ;
//LINE_COMMENT : '*' ~[\r\n]* -> skip ;
ArithmeticOperators:
| '+'
| '-'
| '*'
| '/'
| '**'
;
ComparisionOperators
: '=='
| '<'
| '>'
| '<='
| '>='
;
IF: 'IF' | 'if' ;
ELSE: 'ELSE' | 'else' ;
ELSEIF: 'ELSE IF' | 'else if' ;
THEN: 'THEN' | 'then';
IN: 'IN' | 'in';
Value : INT
| DOUBLE
| '-'DOUBLE
| '-'INT
| Identifier
|'null';
INT : [0-9];
DOUBLE : INT+ PT INT+
| PT INT+
| INT+
;
PT : '.';
Identifier : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')* ;
StringArray : (('\'')(Value)('\''))+;
Input:
if scored = null then scored = -0.05;
else if ( 0 < scored <= 300 ) then scored = -0.5;
else if ( 300 < scored <= 500 ) then scored = -0.4;
else if ( 500 < scored <= 800 ) then scored = -0.8;
else if ( 800 < scored <= 1000 ) then scored = 0.9;
else if ( scored > 1000 ) then scored = 1.735409628;
else scored = 0;
Error I am getting
line 1:4 no viable alternative at input 'IF scored'
line 1:61 mismatched input '<=' expecting ')'
line 1:112 mismatched input '<=' expecting ')'
line 1:163 mismatched input '<=' expecting ')'
line 1:214 mismatched input '<=' expecting ')'
line 1:276 mismatched input 'scored' expecting Identifier
line 1:303 mismatched input 'scored' expecting Identifier
All the error codes are 1: since I am preprocessing the SAS code and removing any comments and converting into single line.
So after preprocessing the input is converted to following : `
IF scored = null THEN scored = -0.05;ELSE IF ( 0 < scored <= 300 )
THEN scored = -0.5;ELSE IF ( 300 < scored <= 500 ) THEN scored =
-0.4;ELSE IF ( 500 < scored <= 800 ) THEN scored = -0.8;ELSE IF ( 800 < scored <= 1000 ) THEN scored = 0.9;ELSE IF ( scored > 1000 ) THEN
scored = 1.735409628;ELSE scored = 0;
`
Here are a couple of things that might causing problems:
by making StringArray : (('\'')(Value)('\''))+; a lexer rule, you will only match 'foo123mu' (values without spaces). You should make StringArray a parser rule (and then Value should also become a parser rule)
your else If rule: ELSEIF: 'ELSE IF' | 'else if' ; is rather fragile: whenever there are 2 or more spaces between ELSE and IF, your rule will not be matched. You should remove this rule an use the existing ELSE and IF rules in your parser rule(s)
the rules ArithmeticOperators and Arithmetic_expression match empry strings. Lexer rules must never match empty strings (the lever can produce an infinite amount of empty-string tokens)
the lever rule Arithmetic_expression should be a parser rule: whenever lever rules are used to "glue" other tokens to each other, you should "promote" them to parser rules
your naming convention for lexer rules in inconsistent: use either PascalCasse, or UPPER_CASE, not both
as already mentioned, INT : [0-9]; should be INT : [0-9]+; otherwise 4 would be tokenised as an INT and 42 as a DOUBLE
These are just a few of the things I saw while reading your question, so there may be more things incorrect. I suggest you first take the time to learn a bit more ANTLR before trying to write a SAS grammar. Or, better yet, try to find an existing (ANTLR) grammar for this language instead of writing your own.
Here's an existing one you could take a look at: https://github.com/xueqilsj/sas-grammar (no idea how accurate it is)
The syntax of your input is incorrect: == should be used instead of =.
UPDATE:
Also, although the syntax of INT and DOUBLE should work, it would be better expressed like so:
INT : [0-9]+;
DOUBLE : INT PT INT
| PT INT
| INT
;
otherwise, 300 would be identified as a DOUBLE, not as an INT.
UPDATE 2
As #Raven has commented:
INT : [0-9]+;
DOUBLE : INT PT INT
| PT INT
;
I have completed my grammar and resolved all the errors thanks to #Bart, #Seelenvirtuose and #Maurice.
Following is the ANTLR grammar for parsing SAS If Else and simple Assignment statements.
grammar SASDTModel;
parse : block+ EOF;
block
: if_block+ # oneOrMoreIfBlock
| assignment_block+ # assignmentBlocks
;
if_block
: if_statement (else_if_statement)* else_statement?
;
/*nested_if_else_statement
: If if_condition Then Do? ';'? if_statement (else_if_statement)* else_statement? End? ';'?
;*/
if_statement
: If '('? if_condition ')'? Then if_block # nestedIfStatement
| If '('? if_condition ')'? Then expression Equal expression ';' # ifStatement
| If expression In '(' expression_list+ ')' Then expression Equal expression ';' # ifInBlock
;
else_if_statement
: Else If '('? if_condition ')'? Then expression Equal expression ';' # elseIf
| Else If expression In '(' expression_list+ ')' Then expression Equal expression ';' # elseIfInBlock
;
if_condition
: Identifier (Equal|ComparisionOperators) Quote? expression+ Quote? # equalCondition
| expression # expressionCondition
| expression equals_to_null # checkIfNull
| expression op=(And|Or) expression # andOrExpression
;
/*if_range_condition
: expression ComparisionOperators expression ComparisionOperators expression
;*/
else_statement
: Else expression Equal expression ';'
;
assignment_block
: Identifier Equal Identifier '(' function_parameter ')' ';' # functionCall
| Identifier Equal expression expression* ';' # assignValue
;
expression
: Value # value
| Identifier # identifier
| SignedFloat # signedFloat
| '(' expression ')' # expressionBracket
| expression '(' expression_list? ')' # expressionBracketList
| Not expression # notExpression
| expression (Min|Max) expression # minMaxExpression
| expression op=('*'|'/') expression # mulDivideExpression
| expression op=('+'|'-') expression # addSubtractExpression
| expression ('||' | '!!' ) expression # orOperatorExpression
| expression ComparisionOperators expression ComparisionOperators expression # inRangeExpression
| expression ComparisionOperators Quote? expression+ Quote? # ifPlainCondition
| expression (Equal|ComparisionOperators) Quote {_input.get(_input.index() -1).getType() == WS}? Quote # ifSpaceStringCondition
| expression Equal expression # equalExpression
;
expression_list
: Quote? expression+ Quote? Comma? # generalExpressionList
| Quote ({_input.get(_input.index() -1).getType() == WS}?)? Quote Comma? # spaceString
;
function_parameter
: expression+
;
equals_to_null : Equal Pt ;
/*ArithmeticOperators
: '+'
| '-'
| '*'
| '/'
| '**'
;*/
Equal : '=' ;
ComparisionOperators
: '<'
| '>'
| '<='
| '>='
;
And : '&' | 'and';
Or
: '|'
| '!'
;
Not
: '^'
| '~'
;
Min : '><';
Max : '<>';
If
: 'IF'
| 'if'
| 'If'
;
Else
: 'ELSE'
| 'else'
| 'Else';
Then
: 'THEN'
| 'then'
| 'Then'
;
In : 'IN' | 'in';
Do : 'do' | 'Do';
End : 'end' | 'END';
Value
: Int
| DOUBLE
| '-'DOUBLE
| '-'Int
| SignedFloat
| 'null';
Int : [0-9]+;
SignedFloat
: UnaryOperator? UnsignedFloat
;
MUL : '*' ; // assigns token name to '*' used above in grammar
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
DOUBLE
: Int Pt Int
| Pt Int
| Int
;
Pt : '.';
UnaryOperator
: '+'
| '-'
;
UnsignedFloat
: ('0'..'9')+ '.' ('0'..'9')* Exponent?
| '.' ('0'..'9')+ Exponent?
| ('0'..'9')+ Exponent
;
Exponent : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
Comma : ',';
Quote
: '\''
| '"'
;
Identifier : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS : ( ' ' | '\t' | '\r' | '\n' )-> channel(HIDDEN);

ANTLR4-based lexer loses syntax hightlighting during typing on NetBeans

I've coded a simple lexer and parser using ANTLR4 grammars to make a language plugin for NetBeans 7.3 to help team write more quickly our layout files (a mix of XHTML and widgets definitions also in form of XHTML tags but with custom properties, characteristics, and with some differencies against XHTML syntax).
Template file example:
<div style="dyn_layout_panel">
#symbol#
<w_label=label, text="Try to close this window" />
<w_buttonclose=button, text = "CLOSE", on_press=press_close />
<w_buttonterminate=button, text="TERMINATE", on_press=press_terminate />
<w_mydatepicker=datepicker, parent=tab0, ary=[10, "str", /regex/i], start_date=2013-10-05, on_selected=datepicker_selected />
<w_myeditbox=editbox, parent=tab0, validation=USER_REGEX, validation_regex=/^[0-9]+[a-z]*$/i,
validation_msg="User regex don't match editbox contents.", on_keyreturn=tab0_editbox_keyreturn />
<div style="dyn_layout_panel">
$SYMBOL_2$
Some text that make a text node.
</div>
</div>
I use AnltrWorks 2 to write and debug lexer and parser and all seem to be fine, in NetBeans also I don't get any exception and the parser work properly but during editing/typing I lose token colors near the cursor.
Screenshot of problem:
Adding a debug console output for each keystroke I see that the lexer enter in IN_TAG or IN_WIDGET mode correctly, but after a WHITESPACE it returns to the default mode and match te rest of text inside a tag as a TEXT_NODE token.
I know that a lexer can have only one active mode at a time, so because it matches the TEXT_NODE rule when in IN_TAG or IN_WIDGET modes?
Lexer grammar file:
lexer grammar LayoutLexer;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
WS : ( ' '
| '\t'
| EOL
)+? -> channel(HIDDEN)
;
WDG_START_OPEN : '<w_' PROPERTY -> pushMode(IN_WIDGET) ;
WDG_END_OPEN : '</w_' PROPERTY -> pushMode(IN_WIDGET) ;
TAG_START_OPEN : '<' ATTRIBUTE -> pushMode(IN_TAG) ;
TAG_END_OPEN : '</' ATTRIBUTE -> pushMode(IN_TAG) ;
EXT_REF
: ( ('#' REF_NAME '#') | ('$' SYMBOL '$') | ('§' REF_NAME '§') )
;
fragment
REF_NAME
: ( [a-z]+ [0-9a-z_]*? )
;
fragment
EOL : ( '\r\n' | '\n\r' | '\n' )
;
EQUAL
: '='
;
TEXT_NODE
: ( (~('\r'|'\n'|'<'|'#'|'$'|'§'))+ )
;
ERROR
: ( .+? )
;
mode IN_TAG;
TAG_CLOSE : '>' -> popMode ;
TAG_EMPTY_CLOSE : '/>' -> popMode ;
TAG_WS : WS -> type(WS), channel(HIDDEN) ;
TAG_COMMENT : COMMENT -> type(COMMENT), channel(HIDDEN) ;
TAG_EQ : EQUAL -> type(EQUAL) ;
ATTRIBUTE
: ( LITERAL [0-9a-zA-Z_]* )
;
VAL
: ( '"' ( ESC_SEQ | ~('\\'|'"') )*? '"'
| '\'' ( ESC_SEQ | ~('\\'|'\'') )*? '\'' )
;
TAG_ERR : ERROR -> type(ERROR) ;
mode IN_WIDGET;
WDG_CLOSE : '>' -> popMode ;
WDG_EMPTY_CLOSE : '/>' -> popMode ;
WDG_WS : WS -> type(WS), mode(IN_WIDGET), channel(HIDDEN) ;
WDG_COMMENT : COMMENT -> type(COMMENT), channel(HIDDEN) ;
WDG_EQ : EQUAL -> type(EQUAL), pushMode(WDG_ASSIGN) ;
COMMA
: ','
;
fragment
MINUS
: '-'
;
STRING
: ( '"' ( ESC_SEQ | ~('\\'|'"') )*? '"'
| '\'' ( ESC_SEQ | ~('\\'|'\'') )*? '\'' )
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment
HEX_DIGIT
: [0-9a-fA-F]
;
fragment
DIGIT
: [0-9]
;
fragment
HEX_NUMBER
: '0x' HEX_DIGIT+
;
fragment
HTML_NUMBER
: (INT_NUMBER | FLOAT_NUMBER) HTML_UNITS
;
fragment
FLOAT_NUMBER
: MINUS? INT_NUMBER '.' DIGIT+
;
fragment
INT_NUMBER
: MINUS? DIGIT+
;
EVENT_HANDLER
: 'on_' PROPERTY
;
PROPERTY
: ( LITERAL [0-9a-zA-Z_]* )
;
fragment
LITERAL
: ( LITERAL_U | LITERAL_L )
;
fragment
LITERAL_U
: [A-Z]+
;
fragment
LITERAL_L
: [a-z]+
;
WDG_ERR : ERROR -> type(ERROR) ;
mode WDG_ASSIGN;
PHP_REF
: ( LITERAL_L ('_' | LITERAL_L | [0-9])* ) -> popMode
;
VALUE : (WDG_VAL | ARRAY) -> popMode;
ASGN_WS : WS -> type(WS), channel(HIDDEN);
ASGN_COMMA : COMMA -> type(COMMA);
ARY_START
: '['
;
ARY_END
: ']'
;
BIT_OR
: '|'
;
ARRAY
: ARY_START ARY_VALUE (ASGN_COMMA ARY_VALUE)* ARY_END
;
fragment
ARY_VALUE : ASGN_WS? WDG_VAL ASGN_WS? -> type(VALUE);
fragment
WDG_VAL
: (STRING
| UTC_DATE
| HEX_NUMBER
| HTML_NUMBER
| FLOAT_NUMBER
| INT_NUMBER
| BOOLEAN
| BITFIELD
| REGEX
| CSS_CLASS)
;
fragment
HTML_UNITS
: ('%'|'in'|'cm'|'mm'|'em'|'ex'|'pt'|'pc'|'px')
;
fragment
BOOLEAN
: ('true'|'false')
;
fragment
BITFIELD
: SYMBOL (WS? BIT_OR WS? SYMBOL)*
;
SYMBOL
: LITERAL_U [0-9A-Z_]*
;
UTC_DATE
: (DIGIT DIGIT DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT)
;
REGEX
: ('/' ('\\'.|.)*? '/' ('g'|'m'|'i')* )
;
CSS_CLASS
: ( LITERAL_L ('-' | '_' | LITERAL_L | [0-9])* )
;
WDG_ASSIGN_ERR : ERROR -> type(ERROR), popMode;
Parser grammar file:
parser grammar LayoutParser;
options
{
tokenVocab=LayoutLexer;
language=Java;
}
document : (element | TEXT_NODE | EXT_REF)* EOF;
element
locals
[
String currentTag
]
: ( ( html_open_tag (element | TEXT_NODE | EXT_REF)* html_close_tag )
| ( wdg_open_tag (element | TEXT_NODE | EXT_REF)* wdg_close_tag )
| ( html_empty_tag | wdg_empty_tag ) )
;
html_empty_tag
: TAG_START_OPEN (ATTRIBUTE EQUAL VAL)* TAG_EMPTY_CLOSE
;
html_open_tag
: ( tag=TAG_START_OPEN (ATTRIBUTE EQUAL VAL)* TAG_CLOSE )
{$element::currentTag = $tag.text.substring(1);}
;
html_close_tag
: tag=TAG_END_OPEN TAG_CLOSE
{
if (!$element::currentTag.equals($tag.text.substring(2)))
notifyErrorListeners("HTML tag mismatch '" + $element::currentTag + "' - '" + $tag.text.substring(2) + "'");
}
;
wdg_empty_tag
: WDG_START_OPEN EQUAL PHP_REF ( COMMA (wdg_prop | wdg_event) )* WDG_EMPTY_CLOSE
;
wdg_open_tag
: tag=WDG_START_OPEN EQUAL PHP_REF ( COMMA (wdg_prop | wdg_event) )* WDG_CLOSE
{$element::currentTag = $tag.text.substring(1);}
;
wdg_close_tag
: tag=WDG_END_OPEN WDG_CLOSE
{
if (!$element::currentTag.equals($tag.text.substring(2)))
notifyErrorListeners("Widget alias mismatch '" + $element::currentTag + "' - '" + $tag.text + "'");
}
;
wdg_prop
: PROPERTY (EQUAL (ARRAY | VALUE | PHP_REF | UTC_DATE | REGEX | CSS_CLASS))?
;
wdg_event
: EVENT_HANDLER EQUAL PHP_REF
;
Depending on the implementation of syntax highlighting, the IDE may or may not start at the beginning of the document when lexing the input for syntax highlighting. If it does not start at the beginning of the document, then before returning any tokens, you need to ensure that the lexer instance is initialized in the correct mode (both the _mode and _modeStack fields need to be initialized to their correct state at the point where lexing starts).
If your lexer reads or writes any custom fields during lexing, you may need to restore those fields as well.
Examples
GoWorks (NetBeans based, LGPL License). This implementation does not use the lexer facilities in the NetBeans API, but instead implements the functionality at a lower level. For now you can ignore the MarkOccurrences* and SemanticHighlighter classes.
package org.tvl.goworks.editor.go.highlighter
package org.antlr.works.editor.antlr4.highlighting
ANTLR 4 IntelliJ Plugin (IntelliJ IDEA, BSD license).
package org.antlr.intellij.adaptor.lexer
package org.antlr.intellij.plugin (in particular, the SyntaxHighlighter classes)
Additional efficiency notes
Your REF_NAME, VAL, and STRING rules use non-greedy loops that do not need to be non-greedy. In each of these rules, change +? to + and change *? to *.
Your WS and ERROR rules use a non-greedy operator +? which is equivalent to not having a closure at all. The unnecessary use of a non-greedy operator in these cases only serves to slow down your lexer. To preserve the existing behavior, you can remove +? from these rules (replacing with + would change behavior).
Additional functionality notes
ANTLR 4 does not perform any error correction during lexing. If the input does not match a token, then the input simply does not match a token. This issue affects your VAL and STRING tokens in particular, which will not get syntax highlighting prior to adding the closing " or ' character. For syntax highlighting these types of tokens, I prefer to use an additional mode in the lexer, allowing me to produce separate tokens for the escape sequences embedded in the string, as well as syntax highlighting an unterminated string at the end of the line (unless your language allows strings to span multiple lines, in which case you'd stop at the end of the input).
For future references
All problems are related to the wrong implementation I done of NetBeans Lexer<T> class; many tutorials on the web do not take into account that a lexer may have more than one mode and that the lexer state must be backuped and restored between Lexer allocation/releases as mentioned by 280Z28.
This is the code I use to make syntax highlighting consistent:
public class LayoutEditorLexer implements Lexer<LayoutTokenId> {
private LexerRestartInfo<LayoutTokenId> info;
private LayoutLexer lexer;
private class LexerState {
public int Mode = -1;
public IntegerStack Stack = null;
public LexerState(int mode, IntegerStack stack)
{
Mode = mode;
Stack = new IntegerStack(stack);
}
}
public LayoutEditorLexer(LexerRestartInfo<LayoutTokenId> info) {
this.info = info;
AntlrCharStream charStream = new AntlrCharStream(info.input(), "LayoutEditor", false);
lexer = new LayoutLexer(charStream);
lexer.removeErrorListeners();
lexer.addErrorListener(ErrorListener.INSTANCE);
LexerState lexerMode = (LexerState)info.state();
if (lexerMode != null)
{
lexer._mode = lexerMode.Mode;
lexer._modeStack.addAll(lexerMode.Stack);
}
}
#Override
public org.netbeans.api.lexer.Token<LayoutTokenId> nextToken() {
Token token = lexer.nextToken();
int ttype = token.getType();
if (ttype != LayoutLexer.EOF)
{
LayoutTokenId tokenId = LayoutLanguageHierarchy.getToken(ttype);
return info.tokenFactory().createToken(tokenId);
}
return null;
}
#Override
public Object state()
{
// Here many tutorials simply returns null.
return new LexerState(lexer._mode, lexer._modeStack);
}
#Override
public void release()
{
}
}

antlr4 share rules across modes

I want to write a lexer that has multiple modes. But the modes are mostly similar. The only difference is that I reference the same character by a different name. I have it working, the problem is I have to copy the entire lexer change all the names, add types, and add one line for each mode.
Here is the general problem I want to solve. I want a comma to have high precedence outside a '[' ']'. I want it to have a low precedence inside the '[', ']'. To do that I push and pop modes with the '[' and ']'. But since the only thing I am changing is the precedence I have to copy all the rules into each mode and given them different names.
One additional thing, once inside a '[' there would be no way to get back to the high precedence comma. So when the grammar moves into a '{', '}' section the comma reverts back to high precedence.
In order to accomplish this I have the initial default mode plus CONJUNCTION (high precedence) and NON_CONJUNCTION (low precedence). I copy all the rules from the default mode and rename them to C_ or NC_. Then I assign their type back to the type of the default mode.
I would rather accomplish this without coping, renaming, and assigning types to all the rules from the default mode.
Here is my lexer:
lexer grammar DabarLexer;
OPEN_PAREN : '(' -> pushMode(NON_CONJUNCTION) ;
CLOSE_PAREN : ')' -> popMode;
OPEN_BRACKET : '[' -> pushMode(NON_CONJUNCTION) ;
CLOSE_BRACKET : ']' -> popMode ;
OPEN_CURLY : '{' -> pushMode(CONJUNCTION) ;
CLOSE_CURLY : '}' -> popMode ;
SPACE : ' ' ;
HEAVY_COMMA : ',';
ENDLINE : '\n' ;
PERIOD : '.' ;
SINGLE_QUOTE : '\'' ;
DOUBLE_QUOTE : '"' ;
INDENTION : '\t' -> skip;
fragment SYMBOL : HEAVY_COMMA | OPEN_BRACKET | CLOSE_BRACKET | OPEN_PAREN | CLOSE_PAREN | OPEN_CURLY | CLOSE_CURLY | SPACE | ENDLINE | PERIOD | SINGLE_QUOTE | DOUBLE_QUOTE | INDENTION ;
ESCAPE : '\\' SYMBOL ;
fragment NON_SYMBOL : ~[(){}',; \n.\t"\[\]] ;
IDENTIFIER : (NON_SYMBOL | ESCAPE)+ ;
LITERAL : (SINGLE_QUOTE (NON_SYMBOL | ESCAPE)+ SINGLE_QUOTE) | DOUBLE_QUOTE (NON_SYMBOL | ESCAPE)+ DOUBLE_QUOTE ;
mode CONJUNCTION ;
C_HEAVY_COMMA : ',' -> type(HEAVY_COMMA);
C_OPEN_PAREN : '(' -> type(OPEN_PAREN), pushMode(NON_CONJUNCTION) ;
C_CLOSE_PAREN : ')' -> type(CLOSE_PAREN), popMode;
C_OPEN_BRACKET : '[' -> type(OPEN_BRACKET), pushMode(NON_CONJUNCTION) ;
C_CLOSE_BRACKET : ']' -> type(CLOSE_BRACKET), popMode ;
C_OPEN_CURLY : '{' -> type(OPEN_CURLY), pushMode(CONJUNCTION) ;
C_CLOSE_CURLY : '}' -> type(CLOSE_CURLY), popMode ;
C_SPACE : ' ' -> type(SPACE);
C_ENDLINE : '\n' -> type(ENDLINE);
C_PERIOD : '.' -> type(PERIOD);
C_SINGLE_QUOTE : '\'' -> type(SINGLE_QUOTE);
C_DOUBLE_QUOTE : '"' -> type(DOUBLE_QUOTE);
C_INDENTION : '\t' -> type(INDENTION),skip;
fragment C_SYMBOL : ( HEAVY_COMMA | C_OPEN_BRACKET | C_CLOSE_BRACKET | C_OPEN_PAREN | C_CLOSE_PAREN | C_OPEN_CURLY | C_CLOSE_CURLY | C_SPACE | C_ENDLINE | C_PERIOD | C_SINGLE_QUOTE | C_DOUBLE_QUOTE | C_INDENTION ) ;
C_ESCAPE : '\\' C_SYMBOL -> type(ESCAPE);
fragment C_NON_SYMBOL : ~[(){}',; \n.\t"\[\]] ;
C_IDENTIFIER : (C_NON_SYMBOL | C_ESCAPE)+ -> type(IDENTIFIER);
C_LITERAL : ((C_SINGLE_QUOTE (C_NON_SYMBOL | C_ESCAPE)+ C_SINGLE_QUOTE) | C_DOUBLE_QUOTE (C_NON_SYMBOL | C_ESCAPE)+ C_DOUBLE_QUOTE) -> type(LITERAL);
mode NON_CONJUNCTION ;
LIGHT_COMMA : ',' ;
NC_OPEN_PAREN : '(' -> type(OPEN_PAREN), pushMode(NON_CONJUNCTION) ;
NC_CLOSE_PAREN : ')' -> type(CLOSE_PAREN), popMode;
NC_OPEN_BRACKET : '[' -> type(OPEN_BRACKET), pushMode(NON_CONJUNCTION) ;
NC_CLOSE_BRACKET : ']' -> type(CLOSE_BRACKET), popMode ;
NC_OPEN_CURLY : '{' -> type(OPEN_CURLY), pushMode(CONJUNCTION) ;
NC_CLOSE_CURLY : '}' -> type(CLOSE_CURLY), popMode ;
NC_SPACE : ' ' -> type(SPACE);
NC_ENDLINE : '\n' -> type(ENDLINE);
NC_PERIOD : '.' -> type(PERIOD);
NC_SINGLE_QUOTE : '\'' -> type(SINGLE_QUOTE);
NC_DOUBLE_QUOTE : '"' -> type(DOUBLE_QUOTE);
NC_INDENTION : '\t' -> type(INDENTION),skip;
fragment NC_SYMBOL : ( LIGHT_COMMA | NC_OPEN_BRACKET | NC_CLOSE_BRACKET | NC_OPEN_PAREN | NC_CLOSE_PAREN | NC_OPEN_CURLY | NC_CLOSE_CURLY | NC_SPACE | NC_ENDLINE | NC_PERIOD | NC_SINGLE_QUOTE | NC_DOUBLE_QUOTE | NC_INDENTION ) ;
NC_ESCAPE : '\\' NC_SYMBOL -> type(ESCAPE);
fragment NC_NON_SYMBOL : ~[(){}',; \n.\t"\[\]] ;
NC_IDENTIFIER : (NC_NON_SYMBOL | NC_ESCAPE)+ -> type(IDENTIFIER);
NC_LITERAL : ((NC_SINGLE_QUOTE (NC_NON_SYMBOL | NC_ESCAPE)+ NC_SINGLE_QUOTE) | NC_DOUBLE_QUOTE (NC_NON_SYMBOL | NC_ESCAPE)+ NC_DOUBLE_QUOTE) -> type(LITERAL);
Your current solution is very similar to the solution I use. For example, take a look at the TemplateComment mode of the grammar I use for for StringTemplate 4 support in ANTLRWorks 2. One helpful thing I implemented in ANTLR 4 a while back is it won't create duplicate token types for a rule in this form.
// No TemplateComment_NEWLINE token type is created here, because the
// type(NEWLINE) action means this rule produces tokens of a specific type.
TemplateComment_NEWLINE : NEWLINE -> type(NEWLINE), channel(HIDDEN);

ANTLR: Syntax Errors are ignored when running parser programmatically

I am currently creating a more or less simple expression evaluator using ANTLR.
My grammar is straightforward (at least i hope so) and looks like this:
grammar SXLGrammar;
options {
language = Java;
output = AST;
}
tokens {
OR = 'OR';
AND = 'AND';
NOT = 'NOT';
GT = '>'; //greater then
GE = '>='; //greater then or equal
LT = '<'; //lower then
LE = '<='; //lower then or equal
EQ = '=';
NEQ = '!='; //Not equal
PLUS = '+';
MINUS = '-';
MULTIPLY = '*';
DIVISION = '/';
CALL;
}
#header {
package somepackage;
}
#members {
}
#lexer::header {
package rise.spics.sxl;
}
rule
: ('='|':')! expression
;
expression
: booleanOrExpression
;
booleanOrExpression
:
booleanAndExpression ('OR'^ booleanAndExpression)*
;
booleanAndExpression
:
booleanNotExpression ('AND'^ booleanNotExpression)*
;
booleanNotExpression
:
('NOT'^)? booleanAtom
;
booleanAtom
:
| compareExpression
;
compareExpression
:
commonExpression (('<' | '>' | '=' | '<=' | '>=' | '!=' )^ commonExpression)?
;
commonExpression
:
multExpr
(
(
'+'^
| '-'^
)
multExpr
)*
| DATE
;
multExpr
:
atom (('*'|'/')^ atom)*
| '-'^ atom
;
atom
:
INTEGER
| DECIMAL
| BOOLEAN
| ID
| '(' expression ')' -> expression
| functionCall
;
functionCall
:
ID '(' arguments ')' -> ^(CALL ID arguments?)
;
arguments
:
(expression) (','! expression)*
| WS
;
BOOLEAN
:
'true'
| 'false'
;
ID
:
(
'a'..'z'
| 'A'..'Z'
)+
;
INTEGER
:
('0'..'9')+
;
DECIMAL
:
('0'..'9')+ ('.' ('0'..'9')*)?
;
DATE
:
'!' '0'..'9' '0'..'9' '0'..'9' '0'..'9' '-' '0'..'9' '0'..'9' '-' '0'..'9' '0'..'9' (' ' '0'..'9' '0'..'9' ':''0'..'9' '0'..'9' (':''0'..'9' '0'..'9')?)?
;
WS
: (' '|'\t' | '\n' | '\r' | '\f')+ { $channel = HIDDEN; };
Now if i try to parse an invalid Expression like "= true NOT true", the graphical test-tool of the eclipse plugin throws an NoViableAltException: line 1:6 no viable alternative at input 'NOT', which is correct and supposed.
Now if i try to parse the expression in a Java Program, nothing happens. The Program
String expression = "=true NOT false";
CharStream input = new ANTLRStringStream(expression);
SXLGrammarLexer lexer = new SXLGrammarLexer(input);
TokenStream tokenStream = new CommonTokenStream(lexer);
SXLGrammarParser parser = new SXLGrammarParser(tokenStream);
CommonTree tree = (CommonTree) parser.rule().getTree();
System.out.println(tree.toStringTree());
System.out.println(parser.getNumberOfSyntaxErrors());
would output:
true
0
that means, the AST created by the parser exists of one node and ignores the rest. I'd like to handle syntax errors in my application, but its not possible if the generated parser doesn't find any error.
I also tried to alter the parser by overwriting the displayRecognitionError() method with something like this:
public void displayRecognitionError(String[] tokenNames,
RecognitionException e) {
String msg = getErrorMessage(e, tokenNames);
throw new RuntimeException("Error at position "+e.index+" " + msg);
}
but displayRecognitionError gets never called.
If i try something like "=1+", a error gets displayed. I guess theres something wrong with my grammar, but why does the eclipse plugin throw that error while the generated parser does not?
If you want rule to consume the entire token-stream, you have to specify where you expect the end of your input. Like this:
rule
: ('='|':')! expression EOF
;
Without the EOF your parser reads the true as boolean an ignores the rest.

Categories