Treat invalid chars as a single token in ANTLR4 lexer

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:
{
sometext-without-quotes : 42
}
I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.
For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.

The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...
In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
Add a parser rule, unknowns, that matches runs of a catch-all Unknown token. The Unknown lexer rule goes last so that it only matches what no other rule does...
unknowns : Unknown+ ;
...
Unknown : . ;
That post was later edited to cover the case where you use only a lexer and no parser. If you do use a parser, you do not need to override the nextToken method, because the error can be handled much more cleanly in the parser: as far as the lexer is concerned, unknowns are just another token type. The lexer passes them to the parser, which then handles the errors.
When using a parser, I'd normally recognize every token individually in the lexer and emit the errors in the parser, where I can decide whether or not to group them. That way all error handling lives in one place instead of being split between lexer and parser. It also makes the lexer simpler to write and test, since it must recognize all text and never fail on any UTF-8 input.
Some people would likely do it differently, but this has worked for me with hand-written lexers in C. The parser is in charge of determining what is actually valid and how to report errors for it. Another benefit is that the lexer stays fairly generic and can be reused.
For a lexer-only solution...
Check the answer at the link and look for this comment in the answer...
... but I only have a lexer, no parsers ...
The answer states that you override the nextToken method, and goes into some detail on how to do that:
@Override
public Token nextToken() {
and the important part in the code is this...
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
The code that comes after this handles the case where you can match the bad tokens.
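To make the idea concrete, here is a minimal, self-contained sketch of the merging step that such a nextToken override performs. It deliberately does not use the ANTLR runtime: Tok, UNKNOWN, OTHER and UnknownMerger are hypothetical stand-ins, and it works on a finished token list instead of pulling tokens lazily the way a real lexer override would.

```java
import java.util.ArrayList;
import java.util.List;

class UnknownMerger {
    static final int UNKNOWN = 1; // hypothetical type id for the catch-all token
    static final int OTHER = 2;   // hypothetical type id for any valid token

    /** Minimal stand-in for a token: a type id plus the matched text. */
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    /** Collapses each run of adjacent UNKNOWN tokens into a single UNKNOWN token. */
    static List<Tok> merge(List<Tok> tokens) {
        List<Tok> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (Tok t : tokens) {
            if (t.type == UNKNOWN) {
                run.append(t.text);   // keep growing the current run
            } else {
                flush(run, out);      // a valid token ends the run
                out.add(t);
            }
        }
        flush(run, out);              // a run may also end the input
        return out;
    }

    private static void flush(StringBuilder run, List<Tok> out) {
        if (run.length() > 0) {
            out.add(new Tok(UNKNOWN, run.toString()));
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<Tok> in = List.of(new Tok(OTHER, "{"), new Tok(UNKNOWN, "a"),
                new Tok(UNKNOWN, "b"), new Tok(OTHER, ":"), new Tok(UNKNOWN, "c"));
        for (Tok t : merge(in)) {
            System.out.println(t.type + " '" + t.text + "'");
        }
    }
}
```

In the sample input, the adjacent unknown tokens a and b come out as one UNKNOWN token with the text ab, so one error can be reported per run instead of one per character.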

What you could do is use lexer modes. For this you'd have to split the grammar into a parser grammar and a lexer grammar. Let's start with the lexer grammar:
JSONLexer.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
lexer grammar JSONLexer;
STRING
: '"' (ESC | ~ ["\\])* '"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
NUMBER
: '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
;
fragment INT
: '0' | [1-9] [0-9]*
;
// no leading zeros
fragment EXP
: [Ee] [+\-]? INT
;
// \- since - means "range" inside [...]
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';
WS
: [ \t\n\r] + -> skip
;
NON_VALID_STRING : . ->pushMode(MODE_ERR);
mode MODE_ERR;
WS1
: [ \t\n\r] + -> skip
;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[:]+ ->type(NON_VALID_STRING);
Basically I have added some tokens used in the parser part (like LCURL, COL, COMA etc.) and introduced the NON_VALID_STRING token, which matches the first character that no other rule matches. Once this token is detected, I switch the lexer to MODE_ERR. In this mode I go back to the default mode once : is detected (this can be changed and maybe refined, but it serves the purpose here :) ); everything else becomes MY_ERR_TOKEN, to which I assign the NON_VALID_STRING token type.
When I run the Interpret Lexer option in ANTLRWorks on your input, s is of type NON_VALID_STRING, and so is everything else up to the :. So, same type but two different tokens. If you don't want them to be of the same type, simply omit the type call in the lexer grammar.
Here is the parser grammar now
JSONParser.g4
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
parser grammar JSONParser;
options {
tokenVocab=JSONLexer;
}
json
: object
| array
;
object
: LCURL pair (COMA pair)* RCURL
| LCURL RCURL
;
pair
: STRING COL value
;
array
: LBRACK value (COMA value)* RBRACK
| LBRACK RBRACK
;
value
: STRING
| NUMBER
| object
| array
| TRUE
| FALSE
| NULL
;
If you run the test rig (I do it with ANTLRWorks), you'll get a single error.
Also you could accumulate lexer errors by overriding the generated lexer class, but I understood in the question that this is not desired or I didn't understand that part :)

Related

Extract hidden comment content preceding a specific rule or token (Antlr, Java)

I am new to antlr and java so this may be a trivial question (hopefully!). I am using antlr 3.4. I have a grammar for the lexer:
lexer grammar MyLexer;
options {
language = Java;
}
COMMENT:
( '//' ~('\n'|'\r')* '\r'? '\n'
| '/*' .* '*/'
) {$channel=HIDDEN;};
WS: (' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;};
COLLECTION: 'collection';
BRACE_OPEN: '{';
BRACE_CLOSE: '}';
and another for the parser:
parser grammar myParser;
options {
language = Java;
tokenVocab = MyLexer;
}
collection_def
scope {
MyCollection currentCollection;
}
@init {
$collection_def::currentCollection = new MyCollection();
}
@after {
// There should be a comment preceding this rule. How to get the content of that comment into the commentContent variable?
$collection_def::currentCollection.setDescription(commentContent);
...
}
: COLLECTION BRACE_OPEN
...
BRACE_CLOSE;
The lexer sends comments to the hidden channel. But I want the parser to extract the text contained in the comment that precedes a specific rule (or a specific token, since the COLLECTION token only appears in the rule above).
For example, I want this input:
/* Text describing the collection */
collection {
item 1;
item 2;
}
to be parsed to a MyCollection object with its description member variable set to "Text describing the collection".
How can I do this?

The token stream has all the tokens, including those on the hidden channel. Every token that you get from the parser result (e.g. through tree.getToken() if you're using output = AST) knows its position in the token stream (Token.getTokenIndex()). That's the information you need to be able to locate and read the hidden token(s) preceding your token.
All that's left for you to do is get all this info to the place where you need to use it. One possible way to do this is get the tokens list (via CommonTokenStream.getTokens() if you use a CommonTokenStream between lexer and parser) and pass it to whatever method is doing the processing of the comments, or do some post-processing of the result to add the info to it.
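The backwards walk described above could look roughly like this. It is only a sketch under assumptions: Tok is a hypothetical stand-in for the real token class, the channel numbers and type ids are made up for the example, and with the actual runtime you would read the type, channel and text from the tokens in the stream's token list instead.

```java
import java.util.List;

class PrecedingComment {
    // Hypothetical channel numbers and token type ids, chosen only for this sketch.
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99;
    static final int COMMENT = 1, WS = 2, COLLECTION = 3;

    /** Minimal stand-in for a token (not the ANTLR Token interface). */
    static final class Tok {
        final int type;
        final int channel;
        final String text;
        Tok(int type, int channel, String text) {
            this.type = type;
            this.channel = channel;
            this.text = text;
        }
    }

    /**
     * Walks backwards from tokenIndex over hidden tokens and returns the text of
     * the nearest COMMENT, or null if a visible token (or the start) comes first.
     */
    static String commentBefore(List<Tok> tokens, int tokenIndex) {
        for (int i = tokenIndex - 1; i >= 0; i--) {
            Tok t = tokens.get(i);
            if (t.channel != HIDDEN_CHANNEL) {
                return null;          // reached a visible token: no comment in between
            }
            if (t.type == COMMENT) {
                return t.text;
            }
        }
        return null;
    }

    /** Strips the block-comment delimiters and surrounding whitespace, if present. */
    static String stripDelimiters(String comment) {
        String c = comment.trim();
        if (c.length() >= 4 && c.startsWith("/*") && c.endsWith("*/")) {
            c = c.substring(2, c.length() - 2);
        }
        return c.trim();
    }

    public static void main(String[] args) {
        List<Tok> tokens = List.of(
                new Tok(COMMENT, HIDDEN_CHANNEL, "/* Text describing the collection */"),
                new Tok(WS, HIDDEN_CHANNEL, "\n"),
                new Tok(COLLECTION, DEFAULT_CHANNEL, "collection"));
        String raw = commentBefore(tokens, 2); // index 2 is the COLLECTION token
        System.out.println(raw == null ? "(no comment)" : stripDelimiters(raw));
    }
}
```

Stopping at the first visible token keeps a comment that belongs to an earlier construct from being attached to the wrong collection.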

antlr grammar for triple quoted string

I am trying to update an ANTLR grammar that follows the following spec
https://github.com/facebook/graphql/pull/327/files
In logical terms it's defined as
StringValue ::
- `"` StringCharacter* `"`
- `"""` MultiLineStringCharacter* `"""`
StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \u EscapedUnicode
- \ EscapedCharacter
MultiLineStringCharacter ::
- SourceCharacter but not `"""` or `\"""`
- `\"""`
(Note: the above is logical notation, not ANTLR syntax)
I tried the following in ANTLR 4, but it won't recognize more than 1 character inside a triple quoted string
string : triplequotedstring | StringValue ;
triplequotedstring: '"""' triplequotedstringpart? '"""';
triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;
EscapedTripleQuote : '\\"""';
SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';
With these rules it will recognize '"""a"""', but as soon as I add more characters it fails.
E.g. '"""abc"""' won't parse, and the IntelliJ plugin for ANTLR says
line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}
How do I do triple quoted strings in ANTLR with '\"""' escaping?

Some of your parser rules should really be lexer rules. And SourceCharacter should probably be a fragment.
Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
Try something like this instead:
string
: Triplequotedstring
| StringValue
;
Triplequotedstring
: '"""' TriplequotedstringPart*? '"""'
;
StringValue
: '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
;
// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
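The difference between X* | Y* and ( X | Y )* mentioned above is not specific to ANTLR. As a rough analogy, the same behaviour shows up in ordinary Java regular expressions, with the letters a and b standing in for the two alternatives. The class and method names are made up for this sketch.

```java
import java.util.regex.Pattern;

class AlternationVsRepetition {
    /** Accepts a run of only a's or a run of only b's, like X* | Y*. */
    static boolean separateRuns(String s) {
        return Pattern.matches("a*|b*", s);
    }

    /** Accepts any mix of a's and b's, like ( X | Y )*. */
    static boolean mixedRun(String s) {
        return Pattern.matches("(a|b)*", s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aaa", "bbb", "aababbba"}) {
            System.out.println(s + ": separateRuns=" + separateRuns(s)
                    + ", mixedRun=" + mixedRun(s));
        }
    }
}
```

Only the second form accepts the interleaved input aababbba, which is what a string body made of both ordinary characters and escaped quotes needs.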

ANTLR not recognizing special characters tokens

I am writing the grammar rules for complex logic operations and I am stuck with the tokens. My lexer grammar goes like this:
VAR : 'A'..'Z';
WS : [ \t\r]+ -> skip;
NL : '\n';
TRUE : '1';
FALSE : '0';
AND : '∧';
NAND : '⊼';
OR : '∨';
XOR : '⊻';
NOR : '⊽';
IMPLIES : '⇒';
BICOND : '⇔';
NEGATE : '¬';
EQUIV : '≡';
EQ : '=';
LPAR : '(';
RPAR : ')';
As you can see I am using special symbols for every operation (that should be recognized). The problem is that when I test the parser and try to visit the tree, it gives me the following error:
line 1:1 token recognition error at: '⊼'
It gives me the same error using every operator.
I can tell that the problem is related to encoding, because if I replace the symbols with more common ones it visits the tree and gives me the correct result of the operation.
I am using ANTLR in Java.
Thanks in advance!

I just found the solution. As I suspected, it was an encoding problem: I just had to set the encoding tool option (-encoding) so that the grammar file is read as UTF-8.
Thank you anyways!
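For anyone wondering why a wrong encoding produces "token recognition error", here is a small sketch of what happens to one of the operators when UTF-8 bytes are decoded with a different charset. ISO-8859-1 is only an example of a wrong charset; which one was actually in play in the question is not known, and the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    /** Encodes the text as UTF-8, then decodes those bytes with the given charset. */
    static String roundTrip(String text, Charset readAs) {
        return new String(text.getBytes(StandardCharsets.UTF_8), readAs);
    }

    public static void main(String[] args) {
        String nand = "\u22BC"; // the NAND operator from the grammar, written as an escape
        String wrong = roundTrip(nand, StandardCharsets.ISO_8859_1);
        String right = roundTrip(nand, StandardCharsets.UTF_8);
        System.out.println("read as ISO-8859-1: " + wrong.length()
                + " chars, equal to original: " + wrong.equals(nand));
        System.out.println("read as UTF-8: " + right.length()
                + " chars, equal to original: " + right.equals(nand));
    }
}
```

The single operator character is three bytes in UTF-8, so reading it with a one-byte-per-character charset yields three unrelated characters. If the grammar file and the input are not decoded the same way, the literal in the lexer rule and the character in the input no longer match.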

ANTLR4: context-sensitive spaces?

In a grammar I would like to support text values without string delimiters.
The idea is to define things like
a = xxx;
instead of
a ="xxx";
to simplify typing. Besides that, there should be variable definitions
and other kinds of stuff as well.
As a first approach I experimented with this grammar:
grammar SpaceNoSpace;
prog: stat+;
stat:
'somethingelse' ';'
| typed description* content
;
typed:
'something' '-'
| 'anotherthing' '-'
;
description:
'someSortOfDetails' COLON ID HASH
| 'otherSortOfDetails' COLON ID HASH
;
content:
contenttext ';'
;
contenttext:
(~';')*
;
COLON: ':' ;
HASH: '#';
SEMI: ';';
SPACE: ' ';
ID: [a-zA-Z][a-zA-Z0-9]+;
WS : [ \t\n\r]+ -> channel(HIDDEN);
ANY_CHAR : . ;
This works fine for input files like this:
something-someSortOfDetails: aVariableName#
this is the content of this;
anotherthing-someSortOfDetails: aVariableName#
here spaces are accepted as much as you like;
somethingelse;
But modifying the last line to
somethingelse ;
leads to a syntax error:
line 7:15 extraneous input ' ' expecting ';'
This probably reveals that the lexer rule
WS : [ \t\n\r]+ -> channel(HIDDEN);
is not applied, (but the SPACE rule???).
Otherwise, if I delete the SPACE lexer-rule, the space
in "somethingelse ;" is ignored (by lexer-rule WS), so that the parser rule
stat : somethingelse as a consequence is detected correctly.
But as a consequence of the deleted SPACE rule, runs of spaces inside the content text are reduced to a single space,
so "this    here" will be reduced to "this here".
This is not a big problem, but nevertheless it is an
interesting question:
is it possible to implement context-sensitive WS or SPACE
lexer rules:
within the content parser-rule any space should be preserved,
in any other rule spaces should be ignored.
Is it possible to define such a context-sensitive lexer-rule behavior in ANTLR4?

Have you considered Lexer Modes? The section with mode(), pushMode(), popMode is probably interesting for you.
Yet I think that lexer modes are more a problem than a solution. Their purpose is to use (parser) context in the lexer. Consequently one should discard the paradigm of separating lexer and parser - and use a PEG-Parser instead.

Since the SPACE rule comes before the WS rule, the lexer returns a SPACE token to the parser. The ' ' is not being placed on the hidden channel.
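The ANTLR 4 lexer prefers the rule with the longest match and, when two rules match the same length, the one listed first. The following sketch imitates just that selection step with Java regular expressions. It is not how ANTLR is implemented, and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RulePriority {
    /**
     * Returns the name of the rule that wins at the start of the input:
     * longest match first, ties resolved in favour of the earlier rule.
     */
    static String winner(Map<String, Pattern> rulesInOrder, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> e : rulesInOrder.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            // Strictly longer only, so an earlier rule keeps a tie.
            if (m.lookingAt() && m.end() > bestLen) {
                best = e.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    /** The two whitespace rules from the question, in their original order. */
    static Map<String, Pattern> questionRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("SPACE", Pattern.compile(" "));
        rules.put("WS", Pattern.compile("[ \\t\\n\\r]+"));
        return rules;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = questionRules();
        System.out.println("one space:  " + winner(rules, " ;"));
        System.out.println("two spaces: " + winner(rules, "  ;"));
    }
}
```

A single space is a tie that SPACE wins because it is listed first, which is exactly the case in "somethingelse ;". Two or more spaces go to WS because its match is longer.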

Match a single scenario with ANTLR and skip everything else as noise

I defined a simple grammar using an ANTLR V4 Eclipse Plugin. I want to parse a file that contains ColdFusion cfscript code, and find every instance of a property definition. For example:
property name="productTypeID" ormtype="string" length="32" fieldtype="id" generator="uuid" unsavedvalue="" default="";
That is, a property keyword followed by any number of attributes, line terminated with a semicolon.
.g4 file
grammar CFProperty;
property : 'property ' (ATR'='STRING)+EOL; // match keyword property followed by an attribute definition
ATR : [a-zA-Z]+; // match lower and upper-case identifiers name
STRING: '"' .*? '"'; // match any string
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
EOL : ';'; // end of the property line
I put together a simple Java project that uses the generated parser, tree walker etc. to print out the occurrences of those matches.
The input I'm testing this with is:
"property id=\"actionID\" name=\"actionName\" attr=\"actionAttr\" hbMethod=\"HBMethod\"; public function some funtion {//some text} property name=\"actionID\" name=\"actionName\" attr=\"actionAttr\" hbMethod=\"HBMethod\"; \n more noise "
My issue is that this is only matching:
property id="actionID" name="actionName" attr="actionAttr" hbMethod="HBMethod";
And because it doesn't understand everything else to be noise, it doesn't match the second instance of the property definition.
How can I match on multiple instances of the property definition and match on everything else in-between as noise to be skipped?

You can use lexer modes to do what you want: one mode for properties and related tokens, and one mode for noise. The idea behind modes is to switch from one mode (a state) to another depending on the tokens found during lexing.
To do this, you have to split your grammar into two files, the parser in one and the lexer in the other.
Here is the lexer part (named TestLexer.g4 in my case)
lexer grammar TestLexer;
// Normal mode
PROPERTY : 'property';
EQUALS : '=';
ATR : [a-zA-Z]+; // match lower and upper-case identifiers name
STRING: '"' .*? '"'; // match any string
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
EOL : ';' -> pushMode(NOISE); // when ';' is found, go to noise mode, where everything is skipped
mode NOISE;
NOISE_PROPERTY : 'property' -> type(PROPERTY), popMode; // when 'property' is found, emit it as a PROPERTY token and go back to normal mode
ALL : .+? -> skip; // skip everything else
Here is the parser part (named Test.g4 in my case)
grammar Test;
options { tokenVocab=TestLexer; }
root : property+;
property : PROPERTY (ATR EQUALS STRING)+ EOL; // match keyword property followed by an attribute definition
This should do the work :)
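If the mode switching is hard to picture, the same two-state idea can be written by hand. The sketch below is plain Java, not ANTLR, with made-up names. It is also simplified: it does not treat a ';' inside a quoted string specially, which the STRING rule in the grammar does handle.

```java
import java.util.ArrayList;
import java.util.List;

class NoiseModeSketch {
    /**
     * Starts in "normal" mode and collects text up to the next ';', then
     * switches to "noise" mode and skips ahead to the next "property" keyword.
     */
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        boolean noise = false;
        int pos = 0;
        while (pos < input.length()) {
            if (noise) {
                int p = input.indexOf("property", pos);
                if (p < 0) {
                    break;            // only noise is left
                }
                pos = p;              // like popMode: back to normal mode
                noise = false;
            } else {
                int semi = input.indexOf(';', pos);
                if (semi < 0) {
                    break;            // unterminated statement, give up
                }
                found.add(input.substring(pos, semi + 1).trim());
                pos = semi + 1;       // like pushMode(NOISE)
                noise = true;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "property id=\"actionID\"; public function x {} "
                + "property name=\"actionName\"; \n more noise ";
        for (String statement : extract(input)) {
            System.out.println(statement);
        }
    }
}
```

As in the grammar, the input is assumed to start with a property definition, and everything between a ';' and the next property keyword is dropped.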
