Reading and Parsing PIPE using Antlr [duplicate] - java

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to be TITLE: 'x' ; it still fails, this time with the error message "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.

This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string that TITLE matches is also matched by FILEPATH, and FILEPATH is defined before TITLE. So each token that you expect to be a TITLE will be a FILEPATH. This also explains the message "mismatched input 'x' expecting 'x'": the input x was lexed as a FILEPATH token, while the parser expected the TITLE token, which is displayed by its literal 'x'.
There are three hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser-driven lexer you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
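The two selection rules described above (longest match first, then definition order on a tie) can be reproduced outside ANTLR with a small simulation. This is only a sketch built on java.util.regex; the class MaximalMunchSketch and its methods are made up for illustration and are not part of ANTLR or its generated code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of how a lexer picks between rules; not ANTLR code.
final class MaximalMunchSketch {
    private static final Pattern FILEPATH = Pattern.compile("[A-Za-z0-9:\\\\/ _.-]+");
    private static final Pattern NEWLINE = Pattern.compile("\\r?\\n");
    private static final Pattern TITLE = Pattern.compile("[A-Za-z ]+");

    // Rules in the order of the grammar from the question: FILEPATH before TITLE.
    static Map<String, Pattern> originalOrder() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("FILEPATH", FILEPATH);
        rules.put("NEWLINE", NEWLINE);
        rules.put("TITLE", TITLE);
        return rules;
    }

    // Same rules with TITLE moved in front of FILEPATH.
    static Map<String, Pattern> titleFirst() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("TITLE", TITLE);
        rules.put("FILEPATH", FILEPATH);
        rules.put("NEWLINE", NEWLINE);
        return rules;
    }

    // Splits the input into "NAME(text)" entries, trying rules in map order.
    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly greater: on a tie, the rule defined earlier keeps the match.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "c:\\test.txt\nx";
        System.out.println(tokenize(originalOrder(), input));
        System.out.println(tokenize(titleFirst(), input));
    }
}
```

With the original order the final x comes out as FILEPATH; with TITLE first it comes out as TITLE, while c:\test.txt is still a FILEPATH because that match is longer than anything TITLE can offer.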

This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason for me was that I had placed the new keyword after my VARNAME lexer rule, so the lexer tokenized it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Related

Separate definitions of decimal number and word in ANTLR grammar

I'm working on defining a grammar in ANTLR4 which includes words and numbers separately.
Numbers are described:
NUM
: INTEGER+ ('.' INTEGER+)?
;
fragment INTEGER
: ('0' .. '9')
;
and words are described:
WORD
: VALID_CHAR +
;
fragment VALID_CHAR
: ('a' .. 'z') | ('A' .. 'Z')
;
The simplified grammar below describes addition where each operand is either a word or a number (and needs to be defined recursively like this):
expression
: left = expression '+' right = expression #addition
| value = WORD #word
| value = NUM #num
;
The issue is that when I enter 'd3' into the parser, I get a returned instance of a Word 'd'. Similarly, entering 3f returns a Number of value 3. Is there a way to ensure that 'd3' or any similar strings returns an error message from the grammar?
I've looked at the '~' symbol but that seems to be 'everything except', rather than 'only'.
To summarize, I'm looking for a way to ensure that ONLY a series of letters can be parsed to a Word, and contain no other symbols. Currently, the grammar seems to ignore any additional disallowed characters.
Similar to the message received when '3+' is entered:
simpleGrammar::compileUnit:1:2: mismatched input '<EOF>' expecting {WORD, NUM}
At present, the following occurs:
d --> (d) (word) (correct)
22.3 --> (22.3) number (correct)
d3 --> d (word) (incorrect)
22f.4 --> 22 (number) (incorrect)
But ideally the following would happen :
d --> (d) (word) (correct)
22.3 --> (22.3) number (correct)
d3 --> (error)
22f.4 --> (error)
[Revised to response to revised question and comments]
ANTLR will attempt to match what it can in your input stream and then stop once it has reached the longest recognizable input. That means the best ANTLR could do with your input was to recognize a word ('d') and then quit, because it could not match the rest of your input to any of your rules (using the root expression rule).
You can add a rule to tell ANTLR that it needs to consume the entire input, with a top-level rule something like:
root: expression EOF;
With this rule in place you'll get 'mismatched input' at the '3' in 'd3'.
This same rule would give a 'mismatched input' at the 'f' character in '22f.4'.
That should address the specific question you've asked, and, hopefully, is sufficient to meet your needs. The following discussion is reading a bit into your comment, and maybe assuming too much about what you want in the way of error messages.
Your comment (sort of) implies that you'd prefer to see error messages along the lines of "you have a digit in your word" or "you have a letter in your number".
It helps to understand ANTLR's pipeline for processing your input. First it processes your input stream using the Lexer rules (rules beginning with capital letters) to create a stream of tokens.
Your 'd3' input produces a stream of 2 tokens with your current grammar:
WORD ('d')
NUM ('3')
This stream of tokens is what is being matched against in your parser rules (i.e. expression).
'22f.4' results in the stream:
NUM ('22')
WORD ('f')
(I would expect an error here as there is no Lexer rule that matches a stream of characters beginning with a '.')
As soon as ANTLR saw something other than a number (or '.') while matching your NUM rule, it considered what it matched so far to be the contents of the NUM token, put it into the token stream and moved on. (similar with finding a number in a word)
This is standard lexing/parsing behavior.
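To make the token streams above concrete, here is a minimal sketch that mimics the NUM and WORD rules with java.util.regex. It is not ANTLR code; the class TokenSplitSketch is hypothetical, and recording an ERROR entry for an unmatched character is an assumption made for illustration (ANTLR itself reports a token recognition error and moves on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the lexer produced from the NUM and WORD rules.
final class TokenSplitSketch {
    private static final Pattern NUM = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Matcher num = NUM.matcher(input).region(pos, input.length());
            Matcher word = WORD.matcher(input).region(pos, input.length());
            if (num.lookingAt()) {
                // The token ends where the rule stops matching, e.g. at a letter.
                tokens.add("NUM('" + num.group() + "')");
                pos = num.end();
            } else if (word.lookingAt()) {
                tokens.add("WORD('" + word.group() + "')");
                pos = word.end();
            } else {
                // Assumption: record the unmatched character and skip it.
                tokens.add("ERROR('" + input.charAt(pos) + "')");
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d3"));
        System.out.println(tokenize("22f.4"));
    }
}
```

The point of the sketch is that neither input is rejected while lexing: each is simply cut into several valid tokens, so any complaint has to come from the parser.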
You can implement your own ErrorListener, where ANTLR will hand the details of the error it encountered to you and you could word your error message as you see fit, but I think you'll find it tricky to hit what seems to be your target. You would not have enough context in the error handler to know what came immediately before, etc., and even if you did, this would get very complicated very fast.
If you always want some sort of whitespace to occur between NUMs and WORDs, you could do something like defining the following Lexer rule:
BAD_ATOM: (INTEGER|VALID_CHAR|'.')+;
(put it last in the grammar so that the valid streams will match first)
Then when a parser rule errors out on a BAD_ATOM token, you could inspect its text and provide a more specific error message.
Warning: This is a bit unorthodox, and could introduce constraints on what you could allow as you build up your grammar. That said, it's not uncommon to find a "catch-all" Lexer rule at the bottom of a grammar that some people use for better error messages and/or error recovery.
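Inspecting the text of a BAD_ATOM token could look roughly like the following. This is a sketch under assumptions: the class BadAtomMessages, the method describe, and the message wording are all invented here, and the text is assumed to come from a token matched by the BAD_ATOM rule above (so it is non-empty and contains only letters, digits, and periods).

```java
// Hypothetical helper that turns the text of a BAD_ATOM token into a message.
final class BadAtomMessages {

    static String describe(String text) {
        if (text.isEmpty()) {
            return "empty token";
        }
        boolean hasLetter = text.chars().anyMatch(Character::isLetter);
        boolean hasDigit = text.chars().anyMatch(Character::isDigit);
        char first = text.charAt(0);
        // Starts like a WORD but contains at least one digit.
        if (Character.isLetter(first) && hasDigit) {
            return "word '" + text + "' contains a digit";
        }
        // Starts like a NUM but contains at least one letter.
        if (Character.isDigit(first) && hasLetter) {
            return "number '" + text + "' contains a letter";
        }
        return "malformed token '" + text + "'";
    }

    public static void main(String[] args) {
        System.out.println(describe("d3"));
        System.out.println(describe("22f.4"));
    }
}
```

A custom error listener or a visitor could call such a helper with the offending token's text, which keeps the grammar itself free of message logic.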

ANTLR4 parsing subrules

I have a grammar that works fine when parsing in one pass (entire file).
Now I wish to break the parsing up into components. And run the parser on subrules. I ran into an issue I assume others parsing subrules will see with the following rule:
thing : LABEL? THING THINGDATA thingClause?
//{System.out.println("G4 Lexer/parser thing encountered");}
;
...
thingClause : ',' ID ( ',' ID)?
;
When the above rule is parsed from a top level start rule which parses to EOF everything works fine. When parsed as a sub-rule (not parse to EOF) the parser gets upset when there is no thing clause, as it is expecting to see EITHER a "," character or an EOF character.
line 8:0 mismatched input '%' expecting {<EOF>, ','}
When I parse to EOF, the % gets correctly parsed into another "thing" component, because the top level rule looks for:
toprule : thing+
| endOfThingsTokens
;
And endOfThingsTokens occurs before EOF... so I expect this is why the top level rule works.
For parsing the subrule, I want the ANTLR4 parser to accept or ignore the % token and say "OK we aren't seeing a thingClause", then reset the token stream so the next thing object can be parsed by a different instance of the parser.
In this specific case I could change the lexer to pass newlines to the parser, which I currently skip in the lexer grammar. That would require lots of other changes to accept newlines in the token stream which are currently not needed.
Essentially I need some way to make the rule have a "end of record" token. But I was wondering if there was some way to solve this with a semantic predicate rule.
something like:
thing : { if comma before %}? LABEL? THING THINGDATA thingClause?
| LABEL? THING THINGDATA
;
...
thingClause : ',' ID ( ',' ID)?
;
The above predicate pseudo code would hide the optional thingClause? if it won't be satisfied so that the parser would stop after parsing one "thing" without looking for a specific "end of thing" token (i.e. newline).
If I solve this I will post the answer.
The parser will (effectively) look-ahead in the token stream to determine if the current rule can be satisfied. The corresponding tokens are then consumed. If any look-ahead tokens remain unconsumed, the parser looks for another rule against which to consume these and additional look-ahead tokens.
The thingClause? element, when not matched, will result in unconsumed tokens in the parser. Hence the error you are seeing.
The parser look-ahead is data dependent. Meaning that the evaluation of the elements of a rule can easily read into the parser more tokens than the current rule could possibly consume.
While a predicate could help, it will not make the problem deterministic. That is, even if the parser matches the non-predicated alt, it may well have read more tokens into the parser than can be consumed by that alt.
The only way to avoid this non-determinism would be to pre-inject <EOF> tokens into the token stream at the sub-rule boundaries.

Regex for commas and periods allowed

I tried searching for an answer to this question and also reading the Regex Wiki but I couldn't find what I'm looking for exactly.
I have a program that validates a document. (It was written by someone else).
If certain lines or characters don't match the regex then an error is generated. I've noted that a few false errors are always generated and I want to correct this. I believe I have narrowed down the problem to this:
Here is an example:
This error is flagged by the program logic:
ERROR: File header immediate origin name is invalid: CITIBANK, N.A.
Here is the code that causes that error:
if(strLine.substring(63,86).matches("[A-Z,a-z,0-9, ]+")){
    // field is valid, nothing to report
}else{
    JOptionPane.showMessageDialog(null, "ERROR: File header immediate origin name is invalid: "+strLine.substring(63,86));
    errorFound=true;
    fileHeaderErrorFound=true;
    bw.write("ERROR: File header immediate origin name is invalid: "+strLine.substring(63,86));
    bw.newLine();
}
I believe the error is raised at runtime because the text contains a period and a comma. I am unsure how to allow these in the regex.
I have tried using this
if(strLine.substring(63,86).matches("[A-Z,a-z,0-9,,,. ]+")){
and it seemed to work. I just wanted to make sure that this is the correct way, because it doesn't look right.
You're right in your analysis, the match failed because there was a dot in the text that isn't contained in the character class.
However, you can simplify the regex - no need to repeat the commas, they don't have any special meaning inside a class:
if(strLine.substring(63,86).matches("[A-Za-z0-9,. ]+"))
Are you sure that you'll never have to match non-ASCII letters or any other kind of punctuation, though?
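As a quick check of the simplified character class, the following sketch compares it with the pattern from the question on the value that triggered the error. The class OriginNameCheck and its method names are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the two patterns discussed above.
final class OriginNameCheck {
    // Pattern from the question: the commas are literal members of the class.
    private static final Pattern ORIGINAL = Pattern.compile("[A-Z,a-z,0-9, ]+");
    // Simplified pattern from the answer: letters, digits, comma, period, space.
    private static final Pattern SIMPLIFIED = Pattern.compile("[A-Za-z0-9,. ]+");

    static boolean acceptedByOriginal(String field) {
        return ORIGINAL.matcher(field).matches();
    }

    static boolean acceptedBySimplified(String field) {
        return SIMPLIFIED.matcher(field).matches();
    }

    public static void main(String[] args) {
        String name = "CITIBANK, N.A.";
        System.out.println(acceptedByOriginal(name));   // rejected because of the periods
        System.out.println(acceptedBySimplified(name)); // accepted
    }
}
```

Any character outside the class, such as & or an accented letter, is still rejected, which is what the closing question in the answer is hinting at.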
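Letters and digits (a-zA-Z0-9) can be replaced by \w, which denotes 'word' characters; note that \w also matches the underscore.
The period and comma don't need escaping inside a character class and can be used as is. Hence this regex might come in handy (the backslash is doubled because it sits inside a Java string literal, and the space and the + from the original pattern are kept):
"[\\w,. ]+"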
Hope this helps. :)

how should i limit length of an ID token in ANTLR?

This should be fairly simple. I'm working on a lexer grammar using ANTLR, and want to limit the maximum length of variable identifiers to 32 characters. I attempted to accomplish this with this line(following normal regex - syntax):
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'){0,31};
No errors in code generation, but compilation failed due to a line in the generated code that was simply:
0,31
Obviously antlr is taking the section of text between the brackets and placing it in the accept state area along with the print line. I searched the ANTLR site, and I found no example or reference to an equivalent expression. What should the syntax of this expression be?
ANTLR4 does not support the quantifier syntax {a,b}; in ANTLR grammars, braces delimit embedded actions, which is why 0,31 was copied verbatim into the generated code. Moreover, I am not sure it is a good idea to put this constraint in the lexer. Let me explain. A constraint in the lexer affects token recognition. So, if your string is longer than 32 characters, it will not be recognized as an ID token. That is not ideal, because the string may then be recognized as some other token, which will probably lead to a failure in the parsing phase.
A solution is to drop the length constraint from the grammar and deal with it in a Java ANTLR4 Listener or Visitor, for example by throwing an exception or displaying an error when the length is greater than 32 characters.
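The length check itself is plain Java and does not depend on ANTLR. The following is a minimal sketch of what a listener or visitor could delegate to; the class IdLengthCheck, the method findTooLong, and the message format are assumptions for illustration, and the identifier strings are assumed to be the text of the ID tokens collected during the tree walk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validation step, run after lexing instead of inside the lexer.
final class IdLengthCheck {
    static final int MAX_ID_LENGTH = 32;

    // Returns one message per identifier that exceeds the limit.
    static List<String> findTooLong(List<String> identifiers) {
        List<String> errors = new ArrayList<>();
        for (String id : identifiers) {
            if (id.length() > MAX_ID_LENGTH) {
                errors.add("identifier '" + id + "' has " + id.length()
                        + " characters, maximum is " + MAX_ID_LENGTH);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("counter", "x".repeat(33));
        findTooLong(ids).forEach(System.out::println);
    }
}
```

Because the lexer still accepts the over-long name as an ID, the parse succeeds and the error can point at exactly the identifier that is too long.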
EDIT> This question had already been answered here: Range quantifier syntax in ANTLR Regex

Regular expression for splitting JSON text in lines after symbols

I am trying to use a regular expression to have this kind of string
{
"key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}
from
JSONObject.toString()
that is one long line of text in my Android Java app
{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}
I found this regular expression for finding all commas.
/(,)(?=(?:[^"]|"[^"]*")*$)/
Now I need to know:
0- if this is reliable, that is, does what they say.
1- if this works also with commas inside double-quotes.
2- if this takes into account escaped double-quotes.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
5- It has to be used with the multi-line flag to work with multi-line text.
6- It has to work with replaceAll().
The resulting regular expression will be used for replacing each symbol with a two-char sequence made of the symbol itself plus \n character.
The resulting text has to be still JSON text.
Subsequent replace actions will take place also for the other symbols
: [ ] { }
and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).
It's not that simple, but yes, if you want to do this you need to find the structural characters ([, {, ", ', :) and follow each of them with a newline character.
For example:
[ should be replaced with [\n
So the answer to your question is yes: it is reliable and easy to implement, since a single line of code does it all. That's what regex is made for.
0- if this is reliable, that is, does what they say.
Let's break down the expression a little:
(,) is a capturing group that matches a single comma
(?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
(?:...)* would be a non-capturing group that can occur 0 to many times
[^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")
As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".
1- if this works also with commas inside double-quotes.
If the double quote pairs are correctly identified any commas in between would be ignored.
2- if this takes into account escaped double-quotes.
Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
Single quotes aren't allowed by the JSON specification, but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account, and that makes no. 2 even more complex because now there might be an unescaped double quote inside a single-quoted text.
5- It has to be used with the multi-line flag to work with multi-line text.
I'm not entirely sure about that, but adding the multi-line flag shouldn't hurt. You could add it to the expression itself though, i.e. by prepending (?m).
6- It has to work with replaceAll().
In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.
Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.
Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/
In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own you could have a look at the org.json.JSONObject sources ).
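The caveat about escaped double quotes can be seen directly with String#replaceAll. The sketch below applies the regex from the question (without the surrounding slashes) to two small inputs; the class CommaSplitSketch and the method breakAfterCommas are invented for this example.

```java
// Hypothetical demonstration of the comma regex from the question.
final class CommaSplitSketch {
    // A comma followed only by non-quote characters or complete "..." pairs.
    private static final String COMMA_OUTSIDE_QUOTES = "(,)(?=(?:[^\"]|\"[^\"]*\")*$)";

    static String breakAfterCommas(String json) {
        return json.replaceAll(COMMA_OUTSIDE_QUOTES, ",\n");
    }

    public static void main(String[] args) {
        // Simple input: only the structural comma gets a newline.
        System.out.println(breakAfterCommas("{\"a\":\"x,y\",\"b\":1}"));
        // Escaped quote after the comma: the quote count is thrown off and
        // the structural comma is not matched at all.
        System.out.println(breakAfterCommas("{\"a\":1,\"b\":\"x\\\"y\"}"));
    }
}
```

In the second input the text stays on one line even though its comma is structural, which is the kind of silent mismatch a real JSON formatter avoids. If indented output is acceptable, org.json's JSONObject.toString(int) already produces it, although not in the one-symbol-per-line layout shown in the question.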
