I have a grammar that works fine when parsing in one pass (entire file).
Now I wish to break the parsing up into components and run the parser on sub-rules. I ran into an issue, which I assume others parsing sub-rules will see, with the following rule:
thing : LABEL? THING THINGDATA thingClause?
//{System.out.println("G4 Lexer/parser thing encountered");}
;
...
thingClause : ',' ID ( ',' ID)?
;
When the above rule is parsed from a top-level start rule which parses to EOF, everything works fine. When parsed as a sub-rule (not parsing to EOF), the parser gets upset when there is no thingClause, as it is expecting to see either a ',' token or an EOF token.
line 8:0 mismatched input '%' expecting {, ','}
When I parse to EOF, the % gets correctly parsed into another "thing" component, because the top level rule looks for:
toprule : thing+
| endOfThingsTokens
;
And endOfThingsTokens occurs before EOF... so I expect this is why the top level rule works.
For parsing the subrule, I want the ANTLR4 parser to accept or ignore the % token and say "OK we aren't seeing a thingClause", then reset the token stream so the next thing object can be parsed by a different instance of the parser.
In this specific case I could change the lexer to pass newlines to the parser, which I currently skip in the lexer grammar. That would require lots of other changes to accept newlines in the token stream which are currently not needed.
Essentially I need some way to give the rule an "end of record" token. But I was wondering if there is some way to solve this with a semantic predicate rule.
something like:
thing : { if comma before %}? LABEL? THING THINGDATA thingClause?
| LABEL? THING THINGDATA
;
...
thingClause : ',' ID ( ',' ID)?
;
The above predicate pseudocode would hide the optional thingClause? if it won't be satisfied, so that the parser would stop after parsing one "thing" without looking for a specific "end of thing" token (i.e. a newline).
If I solve this I will post the answer.
The parser will (effectively) look ahead in the token stream to determine whether the current rule can be satisfied. The corresponding tokens are then consumed. If any look-ahead tokens remain unconsumed, the parser looks for another rule against which to consume these and additional look-ahead tokens.
The thingClause? element, when not matched, will result in unconsumed tokens in the parser. Hence the error you are seeing.
The parser look-ahead is data dependent, meaning that evaluating the elements of a rule can easily pull more tokens into the parser than the current rule could possibly consume.
While a predicate could help, it will not make the problem deterministic. That is, even if the parser matches the non-predicated alt, it may well have read more tokens into the parser than can be consumed by that alt.
The only way to avoid this non-determinism would be to pre-inject <EOF> tokens into the token stream at the sub-rule boundaries.
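A minimal sketch of that injection idea, assuming a generated parser class named MyParser (hypothetical) and that you have already split the token stream into per-record token lists; ListTokenSource and CommonToken are part of the ANTLR4 runtime:
import org.antlr.v4.runtime.*;
import java.util.ArrayList;
import java.util.List;

// Sketch: parse one record at a time by appending a synthetic EOF token,
// giving the 'thing' rule a hard boundary. MyParser and the record
// splitting are assumptions, not part of the question.
public class SubRuleDriver {
    static void parseOneRecord(List<Token> recordTokens) {
        List<Token> padded = new ArrayList<>(recordTokens);
        padded.add(new CommonToken(Token.EOF, "<EOF>"));
        CommonTokenStream tokens = new CommonTokenStream(new ListTokenSource(padded));
        MyParser parser = new MyParser(tokens);
        parser.thing(); // thingClause? now predicts its exit on the injected EOF
    }
}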
Related
I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
Language processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser; the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with an uppercase first character are lexer rules
the lexer starts at the beginning and tries to find the rule that best matches the current input
a best match is a match of maximum length, i.e. the token that results from appending the next input character to the maximum-length match is not matched by any lexer rule
tokens are generated from matches:
if exactly one rule matches the maximum-length match, the corresponding token is pushed into the token stream
if multiple rules match the maximum-length match, the token defined first in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every match that is matched by TITLE will also be matched by FILEPATH, and FILEPATH is defined before TITLE. So every token that you expect to be a TITLE will come out as a FILEPATH.
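You can watch this happen by dumping the token stream before any parsing takes place; a minimal sketch, assuming the lexer generated from grammar output is called outputLexer:
import org.antlr.v4.runtime.*;

// Sketch: print every token the lexer emits. For this input, the final 'x'
// comes out as FILEPATH rather than TITLE, because FILEPATH is defined
// first and matches it too.
public class TokenDump {
    public static void main(String[] args) {
        outputLexer lexer = new outputLexer(CharStreams.fromString("c:\\test.txt\nx"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill(); // run the lexer over the entire input
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-10s '%s'%n",
                    lexer.getVocabulary().getSymbolicName(t.getType()), t.getText());
        }
    }
}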
A few hints for dealing with that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them in the right order (in your case, this will be sufficient).
if you need a parser-driven lexer, you have to switch to another kind of parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason, in my case, was that I had placed the new keyword after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
I'm working on defining a grammar in ANTLR4 which includes words and numbers separately.
Numbers are described:
NUM
: INTEGER+ ('.' INTEGER+)?
;
fragment INTEGER
: ('0' .. '9')
;
and words are described:
WORD
: VALID_CHAR +
;
fragment VALID_CHAR
: ('a' .. 'z') | ('A' .. 'Z')
;
The simplified grammar below describes the addition between either a word or a letter (and needs to be defined recursively like this):
expression
: left = expression '+' right = expression #addition
| value = WORD #word
| value = NUM #num
;
The issue is that when I enter 'd3' into the parser, I get a returned instance of a Word 'd'. Similarly, entering 3f returns a Number of value 3. Is there a way to ensure that 'd3' or any similar strings returns an error message from the grammar?
I've looked at the '~' symbol but that seems to be 'everything except', rather than 'only'.
To summarize, I'm looking for a way to ensure that ONLY a series of letters can be parsed to a Word, and contain no other symbols. Currently, the grammar seems to ignore any additional disallowed characters.
Similar to the message received when '3+' is entered:
simpleGrammar::compileUnit:1:2: mismatched input '<EOF>' expecting {WORD, NUM}
At present, the following occurs:
d --> (d) (word) (correct)
22.3 --> (22.3) (number) (correct)
d3 --> d (word) (incorrect)
22f.4 --> 22 (number) (incorrect)
But ideally the following would happen :
d --> (d) (word) (correct)
22.3 --> (22.3) (number) (correct)
d3 --> (error)
22f.4 --> (error)
[Revised in response to the revised question and comments]
ANTLR will attempt to match what it can in your input stream and then stop once it has reached the longest recognizable input. That means the best ANTLR could do with your input was to recognize a word ('d') and then quit, because it could not match the rest of your input against any of your rules (starting from the root expression rule).
You can add a rule to tell ANTLR that it needs to consume the entire input, with a top-level rule something like:
root: expression EOF;
With this rule in place you'll get 'mismatched input' at the '3' in 'd3'.
This same rule would give a 'mismatched input' at the 'f' character in '22f.4'.
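For completeness, driving the parser through that rule would look roughly like this (the class names are guesses based on the grammar name in the question's error message):
import org.antlr.v4.runtime.*;

// Sketch: parse 'd3' through the EOF-anchored root rule.
public class Driver {
    public static void main(String[] args) {
        simpleGrammarLexer lexer = new simpleGrammarLexer(CharStreams.fromString("d3"));
        simpleGrammarParser parser = new simpleGrammarParser(new CommonTokenStream(lexer));
        parser.root(); // reports 'mismatched input' at the '3', as described above
    }
}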
That should address the specific question you've asked, and, hopefully, is sufficient to meet your needs. The following discussion is reading a bit into your comment, and maybe assuming too much about what you want in the way of error messages.
Your comment (sort of) implies that you'd prefer to see error messages along the lines of "you have a digit in your word" or "you have a letter in your number".
It helps to understand ANTLR's pipeline for processing your input. First it processes your input stream using the Lexer rules (rules beginning with capital letters) to create a stream of tokens.
Your 'd3' input produces a stream of two tokens with your current grammar:
WORD ('d')
NUM ('3')
This stream of tokens is what is being matched against in your parser rules (i.e. expression).
'22f.4' results in the stream:
NUM ('22')
WORD ('f')
(I would expect an error here as there is no Lexer rule that matches a stream of characters beginning with a '.')
As soon as ANTLR saw something other than a number (or '.') while matching your NUM rule, it considered what it matched so far to be the contents of the NUM token, put it into the token stream and moved on. (similar with finding a number in a word)
This is standard lexing/parsing behavior.
You can implement your own ErrorListener, where ANTLR will hand the details of the error it encountered to you, and you could word your error message as you see fit, but I think you'll find it tricky to hit what seems to be your target. You would not have enough context in the error handler to know what came immediately before, etc., and even if you did, this would get very complicated very fast.
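For reference, hooking in your own listener takes only a few lines; a sketch (the message wording is entirely up to you):
import org.antlr.v4.runtime.*;

// Sketch of a custom error listener: ANTLR hands you the offending token,
// its position and a default message; how you report it is your choice.
class VerboseErrorListener extends BaseErrorListener {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg, RecognitionException e) {
        System.err.printf("line %d:%d %s%n", line, charPositionInLine, msg);
    }
}
Attach it with parser.removeErrorListeners() followed by parser.addErrorListener(new VerboseErrorListener()) so it replaces the default console listener.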
If you always want some sort of whitespace to occur between NUMs and WORDs, you could do something like defining the following lexer rule:
BAD_ATOM: (INTEGER|VALID_CHAR|'.')+;
(put it last in the grammar so that the valid streams will match first)
Then, when a parser rule errors out on a BAD_ATOM token, you could inspect it and provide a more specific error message.
Warning: This is a bit unorthodox, and could introduce constraints on what you could allow as you build up your grammar. That said, it's not uncommon to find a "catch-all" Lexer rule at the bottom of a grammar that some people use for better error messages and/or error recovery.
I have the following grammar:
GUID : GUIDBLOCK GUIDBLOCK '-' GUIDBLOCK '-' GUIDBLOCK '-' GUIDBLOCK '-' GUIDBLOCK GUIDBLOCK GUIDBLOCK;
SELF : 'self(' GUID ')';
fragment
GUIDBLOCK: [A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9];
 
atom : SELF # CurrentGuid ;
This is my visitor
@Override
public String visitCurrentGuid(CalcParser.CurrentGuidContext ctx) {
System.out.println("Guid is : " + ctx.getText());
System.out.println("Guid is : " + ctx.getChild(0));
return ctx.getText();
}
With input "self(5827389b-c8ab-4804-8194-e23fbdd1e370)"
There's only one child, which is the whole input itself: "self(5827389b-c8ab-4804-8194-e23fbdd1e370)"
How should I go about to get the guid part?
From my understanding, if my grammar structure is constructed as an AST, I should be able to print out the tree.
How should I update my grammar?
Thanks
Fragments don't appear at all in the AST - they're basically treated as if you'd written their contents directly in the lexer rule that uses them. So moving code into fragments makes your code easier to read, but does not affect the generated AST at all.
Lexer rules that are used by other lexer rules are also treated as fragments in that context. That is, if a lexer rule uses another lexer rule, it will still produce a single token with no nested structure - just as if you had used a fragment. The fact that it's a lexer rule and not a fragment only makes a difference when the pattern occurs on its own without being part of the larger pattern.
The key is that a lexer rule always produces a single token and tokens have no subtokens. They're the leaves of the AST. Nodes with children are generated from parser rules.
The only parser rule that you have is atom, and atom only invokes one other rule, SELF. So the generated tree will consist of an atom node that contains as its only child a SELF token and, as previously stated, tokens are leaves, so that's the end of the tree.
What you probably want to do to get a useful tree is to make GUIDBLOCK a lexer rule (your only lexer rule, in fact) and turn everything else into parser rules. That'd also mean that you can get rid of atom (possibly renaming SELF to atom if you want).
Then you'll end up with a tree consisting of a self (or atom, if you renamed it) node that contains as its children a 'self(' token, a guid node (which you might want to assign a name for easy access) and a ')' token. The guid node in turn would contain a sequence of GUIDBLOCK and '-' tokens. You can also add blocks+= before every use of GUIDBLOCK to get a list that contains only the GUIDBLOCK tokens without the dashes.
It might also make sense to turn 'self(' into two tokens (i.e. 'self' '(') - especially if you ever want to add a rule to ignore whitespace.
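Assuming you restructure along those lines, say self : 'self' '(' guid ')' ; with a guid rule that uses blocks+=GUIDBLOCK labels (all hypothetical names), the visitor could then reach the guid directly:
import org.antlr.v4.runtime.Token;

// Sketch: with guid as a parser rule, the visitor can access it as a child.
// All rule/context names follow from the hypothetical restructured grammar.
public class GuidVisitor extends CalcBaseVisitor<String> {
    @Override
    public String visitSelf(CalcParser.SelfContext ctx) {
        System.out.println("Guid is: " + ctx.guid().getText()); // the whole guid
        for (Token block : ctx.guid().blocks) {                 // GUIDBLOCK tokens only
            System.out.println("Block: " + block.getText());
        }
        return ctx.guid().getText();
    }
}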
For example, I want to replace any prompt function in an SQL query.
I have used this expression
Query = Query.replaceAll("#prompt\\s*\\(.*?\\)", "(1)");
This expression works in this example
#Prompt('Bill Cycle','A','MIGRATION\BC',,,)
and the output is (1)
but it does not work on this example
#Prompt('Groups','A','[Lookup] Price Group (2)\Nested Group',,,)
the output is (1) \Nested Group',,,) which is not valid
Sadly, as pointed out by Joe C in a comment, what you are trying to do cannot be done with a regular expression for arbitrary-depth parentheses. The reason is that regular expressions are not capable of "counting". You need a stack machine for that, or a context-free language parser.
However, you also suggest that the 'prompted' content is always inside single quotes. I assume the standard Java regexp library below; other regexp libraries might need translation...
"#Prompt\\('[^']*'(\s*,\s*(('[^']*')|([^',)]*)))*\\)"
So, you are searching within prompt for blocks of single-quoted text. The search assumes that each internal bit of content is enclosed in single quotes.
Verify at https://regex101.com/r/nByy0Y/1 (I made a couple of fixes). Note that regex101.com treats a double backslash as intending a literal backslash; there you escape the parenthesis only once (\() so that it matches a literal parenthesis.
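In Java (note the doubled backslashes throughout, since \s on its own is not a valid escape inside a Java string literal), applying it to both sample inputs from the question would look like:
// Sketch: the quoted-argument-aware pattern applied to both examples.
public class PromptReplace {
    public static void main(String[] args) {
        String pattern = "#Prompt\\('[^']*'(\\s*,\\s*(('[^']*')|([^',)]*)))*\\)";
        String q1 = "#Prompt('Bill Cycle','A','MIGRATION\\BC',,,)";
        String q2 = "#Prompt('Groups','A','[Lookup] Price Group (2)\\Nested Group',,,)";
        System.out.println(q1.replaceAll(pattern, "(1)")); // prints (1)
        System.out.println(q2.replaceAll(pattern, "(1)")); // prints (1); the parens stay inside the quotes
    }
}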
Because you are using the lazy quantifier '?', it stops the match at the end of the first ')'. Removing it will let the match run to the end greedily, as such:
#prompt\(.*\)
But if there is concern that the entries may have more parens after the one in question, it will cause problems.
Assuming the additional parens will always be in quotes, you can do this:
#prompt\((('([^'])*',*)*|(.*,*)*)\)
Here it is looking for items wrapped in single quotes OR text without parens, which should capture all of the single-quoted elements, null params, and unquoted text params.
I am trying to use a regular expression to produce this kind of string
{
"key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}
from
JSONObject.toString()
that is one long line of text in my Android Java app
{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}
I found this regular expression for finding all commas.
/(,)(?=(?:[^"]|"[^"]*")*$)/
Now I need to know:
0- if this is reliable, that is, if it does what they say.
1- if this also works with commas inside double quotes.
2- if this takes into account escaped double-quotes.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
5- It has to be used with the multi-line flag to work with multi-line text.
6- It has to work with replaceAll().
The resulting regular expression will be used for replacing each symbol with a two-char sequence made of the symbol itself plus a \n character.
The resulting text has to be still JSON text.
Subsequent replace actions will take place also for the other symbols
: [ ] { }
and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).
It's not quite that simple, but yes, if you want to do it this way, you need to find the characters ([, {, ", ', :) and replace each of them with itself plus a newline character,
like:
[ should get replaced with [\n
The answer to your question is yes, it's reliable and simple to implement; it's just a single line of code doing it all. That's what regexes are made for.
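A sketch of that idea in Java, reusing the look-ahead from the question and extending it to all the structural symbols; it inherits the same caveats about escaped and single quotes discussed in the next answer:
// Sketch: append a newline to every structural symbol that sits outside
// double quotes, using the question's look-ahead trick. Escaped quotes and
// single-quoted strings are NOT handled.
public class JsonBreaker {
    public static void main(String[] args) {
        String json = "{\"key1\":\"value1\",\"arrayKey\":[{\"keyA\":\"valueA\",\"keyC\":[0,1,2]}]}";
        String broken = json.replaceAll(
                "([,:\\[\\]{}])(?=(?:[^\"]|\"[^\"]*\")*$)", "$1\n");
        System.out.println(broken);
    }
}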
0- if this is reliable, that is, does what they say.
Let's break down the expression a little:
(,) is a capturing group that matches a single comma
(?=...) is a positive lookahead, meaning the comma would need to be followed by a match of that group's content
(?:...)* would be a non-capturing group that can occur 0 to many times
[^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")
As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".
1- if this is works also with commas inside double-quotes.
If the double quote pairs are correctly identified any commas in between would be ignored.
2- if this takes into account escaped double-quotes.
Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
Single quotes aren't allowed by the JSON specification, but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account, and that makes no. 2 even more complex because now there might be an unescaped double quote inside a single-quoted text.
5- It has to be used with the multi-line flag to work with multi-line text.
I'm not entirely sure about that, but adding the multi-line flag shouldn't hurt. You could also add it to the expression itself, i.e. by prepending (?m).
6- It has to work with replaceAll().
In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.
Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.
Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/
In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own, you could have a look at the org.json.JSONObject sources).
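For instance, org.json (which is where JSONObject.toString() comes from on Android) can already pretty-print on its own via the indented toString overload:
import org.json.JSONObject;

// org.json's toString(indentFactor) inserts line breaks and indents each
// nesting level by the given number of spaces, no regex needed.
public class PrettyPrint {
    public static void main(String[] args) throws Exception {
        JSONObject obj = new JSONObject(
                "{\"key1\":\"value1\",\"arrayKey\":[{\"keyA\":\"valueA\",\"keyC\":[0,1,2]}]}");
        System.out.println(obj.toString(2));
    }
}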