Separate definitions of decimal number and word in ANTLR grammar - java

I'm working on defining a grammar in ANTLR4 which includes words and numbers separately.
Numbers are described:
NUM
: INTEGER+ ('.' INTEGER+)?
;
fragment INTEGER
: ('0' .. '9')
;
and words are described:
WORD
: VALID_CHAR+
;
fragment VALID_CHAR
: ('a' .. 'z') | ('A' .. 'Z')
;
The simplified grammar below describes the addition between either a word or a letter (and needs to be defined recursively like this):
expression
: left = expression '+' right = expression #addition
| value = WORD #word
| value = NUM #num
;
The issue is that when I enter 'd3' into the parser, I get back a Word instance containing 'd'. Similarly, entering '3f' returns a Number of value 3. Is there a way to ensure that 'd3' or any similar string returns an error message from the grammar?
I've looked at the '~' symbol but that seems to be 'everything except', rather than 'only'.
To summarize, I'm looking for a way to ensure that ONLY a series of letters can be parsed to a Word, and contain no other symbols. Currently, the grammar seems to ignore any additional disallowed characters.
Similar to the message received when '3+' is entered:
simpleGrammar::compileUnit:1:2: mismatched input '<EOF>' expecting {WORD, NUM}
At present, the following occurs:
d --> (d) (word) (correct)
22.3 --> (22.3) (number) (correct)
d3 --> d (word) (incorrect)
22f.4 --> 22 (number) (incorrect)
But ideally the following would happen :
d --> (d) (word) (correct)
22.3 --> (22.3) (number) (correct)
d3 --> (error)
22f.4 --> (error)

[Revised in response to the revised question and comments]
ANTLR will attempt to match what it can in your input stream and then stop once it's reached the longest recognizable input. That means the best ANTLR could do with your input was to recognize a word ('d') and then quit, because it couldn't match the rest of your input against any of your rules (using the root expression rule).
You can add a rule to tell ANTLR that it needs to consume the entire input, with a top-level rule something like:
root: expression EOF;
With this rule in place you'll get 'mismatched input' at the '3' in 'd3'.
This same rule would give a 'mismatched input' at the 'f' character in '22f.4'.
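Putting the pieces together, a minimal sketch of the complete grammar with such an entry rule (the grammar name is taken from your error message; the root rule name is my own):
grammar simpleGrammar;
root
: expression EOF
;
expression
: left = expression '+' right = expression #addition
| value = WORD #word
| value = NUM #num
;
NUM
: INTEGER+ ('.' INTEGER+)?
;
WORD
: VALID_CHAR+
;
fragment INTEGER
: ('0' .. '9')
;
fragment VALID_CHAR
: ('a' .. 'z') | ('A' .. 'Z')
;
With root as the entry point, 'd3' now fails at the '3' instead of silently succeeding on 'd'.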
That should address the specific question you've asked, and, hopefully, is sufficient to meet your needs. The following discussion is reading a bit into your comment, and maybe assuming too much about what you want in the way of error messages.
Your comment (sort of) implies that you'd prefer to see error messages along the lines of "you have a digit in your word" or "you have a letter in your number".
It helps to understand ANTLR's pipeline for processing your input. First it processes your input stream using the Lexer rules (rules beginning with capital letters) to create a stream of tokens.
Your 'd3' input produces a stream of 2 tokens with your current grammar:
WORD ('d')
NUM ('3')
This stream of tokens is what is being matched against in your parser rules (i.e. expression).
'22f.4' results in the stream:
NUM ('22')
WORD ('f')
(I would expect an error here as there is no Lexer rule that matches a stream of characters beginning with a '.')
As soon as ANTLR saw something other than a digit (or '.') while matching your NUM rule, it considered what it had matched so far to be the contents of the NUM token, put it into the token stream, and moved on (and similarly when it found a digit while matching a word).
This is standard lexing/parsing behavior.
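If you want to see the token stream for yourself, a small sketch (it assumes an ANTLR4-generated lexer class named simpleGrammarLexer; use whatever class your grammar generates):
import org.antlr.v4.runtime.*;

public class TokenDump {
    public static void main(String[] args) {
        simpleGrammarLexer lexer = new simpleGrammarLexer(CharStreams.fromString("d3"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        // Print each token's type name and text, e.g. WORD ('d') then NUM ('3')
        for (Token t : tokens.getTokens()) {
            System.out.println(lexer.getVocabulary().getSymbolicName(t.getType())
                    + " ('" + t.getText() + "')");
        }
    }
}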
You can implement your own ErrorListener, where ANTLR will hand you the details of the error it encountered, and you can word your error message as you see fit. But I think you'll find it tricky to hit what seems to be your target: you would not have enough context in the error handler to know what came immediately before, etc., and even if you did, this would get very complicated very fast.
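For reference, the basic shape of such a listener (a sketch; the class name and message format are my own):
import org.antlr.v4.runtime.*;

public class MyErrorListener extends BaseErrorListener {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg, RecognitionException e) {
        // Reword or rethrow as you see fit; this version just fails fast.
        throw new IllegalStateException(
                "line " + line + ":" + charPositionInLine + " " + msg);
    }
}
Attach it with parser.removeErrorListeners() followed by parser.addErrorListener(new MyErrorListener()).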
IF you always want some sort of whitespace to occur between NUMs and WORDs, you could do something like defining the following Lexer rules:
BAD_ATOM: (INTEGER|VALID_CHAR|'.')+;
(put it last in the grammar so that the valid streams will match first)
Then when a parser rule errors out on a BAD_ATOM token, you could inspect it and provide a more specific error message.
Warning: This is a bit unorthodox, and could introduce constraints on what you could allow as you build up your grammar. That said, it's not uncommon to find a "catch-all" Lexer rule at the bottom of a grammar that some people use for better error messages and/or error recovery.
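For concreteness, a sketch of where such a rule would sit relative to your existing lexer rules:
NUM : INTEGER+ ('.' INTEGER+)? ;
WORD : VALID_CHAR+ ;
// Last, so NUM and WORD win ties on equal-length matches; for 'd3',
// BAD_ATOM's match is longer than WORD's 'd', so BAD_ATOM wins.
BAD_ATOM : (INTEGER | VALID_CHAR | '.')+ ;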

Related

Reading and Parsing PIPE using Antlr [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser; the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every match that is matched by TITLE will also be matched by FILEPATH, and FILEPATH is defined before TITLE: so each token that you expect to be a TITLE will instead be a FILEPATH.
There are a few hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient; see the sketch after this list).
if you need a parser-driven lexer you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
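For this grammar, the reordering fix would look like the sketch below. Note the trade-off: a path consisting only of letters and spaces would now lex as TITLE.
grammar output;
test: FILEPATH NEWLINE TITLE ;
// TITLE first, so it wins ties with FILEPATH on equal-length matches like 'x'
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;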
This was not directly the OP's problem, but for those who get the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason was that I had placed the new keyword after my VARNAME lexer rule, so it was tokenized as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
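In other words (a sketch with made-up rule names):
// Keywords must be defined before the catch-all identifier rule,
// or an equal-length match will be tokenized as VARNAME.
RETURN  : 'return' ;
VARNAME : [a-zA-Z_] [a-zA-Z0-9_]* ;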

ANTLR4 parsing subrules

I have a grammar that works fine when parsing in one pass (entire file).
Now I wish to break the parsing up into components. And run the parser on subrules. I ran into an issue I assume others parsing subrules will see with the following rule:
thing : LABEL? THING THINGDATA thingClause?
//{System.out.println("G4 Lexer/parser thing encountered");}
;
...
thingClause : ',' ID ( ',' ID)?
;
When the above rule is parsed from a top-level start rule which parses to EOF, everything works fine. When it is parsed as a sub-rule (not parsing to EOF), the parser gets upset when there is no thingClause, as it is expecting to see EITHER a ',' character or an EOF token.
line 8:0 mismatched input '%' expecting {, ','}
When I parse to EOF, the % gets correctly parsed into another "thing" component, because the top level rule looks for:
toprule : thing+
| endOfThingsTokens
;
And endOfThingsTokens occurs before EOF... so I expect this is why the top level rule works.
For parsing the subrule, I want the ANTLR4 parser to accept or ignore the % token and say "OK we aren't seeing a thingClause", then reset the token stream so the next thing object can be parsed by a different instance of the parser.
In this specific case I could change the lexer to pass newlines to the parser, which I currently skip in the lexer grammar. That would require lots of other changes to accept newlines in the token stream which are currently not needed.
Essentially I need some way to give the rule an "end of record" token. But I was wondering if there was some way to solve this with a semantic predicate rule.
something like:
thing : { if comma before %}? LABEL? THING THINGDATA thingClause?
| LABEL? THING THINGDATA
;
...
thingClause : ',' ID ( ',' ID)?
;
The above predicate pseudo code would hide the optional thingClause? if it won't be satisfied so that the parser would stop after parsing one "thing" without looking for a specific "end of thing" token (i.e. newline).
If I solve this I will post the answer.
The parser will (effectively) look-ahead in the token stream to determine if the current rule can be satisfied. The corresponding tokens are then consumed. If any look-ahead tokens remain unconsumed, the parser looks for another rule against which to consume these and additional look-ahead tokens.
The thingClause? element, when not matched, will result in unconsumed tokens in the parser. Hence the error you are seeing.
The parser look-ahead is data dependent. Meaning that the evaluation of the elements of a rule can easily read into the parser more tokens than the current rule could possibly consume.
While a predicate could help, it will not make the problem deterministic. That is, even if the parser matches the non-predicated alt, it may well have read more tokens into the parser than can be consumed by that alt.
The only way to avoid this non-determinism would be to pre-inject <EOF> tokens into the token stream at the sub-rule boundaries.
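One way to approximate that injection (a sketch; the parser class and context names are assumptions based on the rule names above) is to slice the token list per record and parse each slice through a ListTokenSource, which supplies an EOF once its list is exhausted:
import org.antlr.v4.runtime.*;
import java.util.List;

public class SubRuleParsing {
    // MyParser stands in for whatever parser class ANTLR generated.
    static MyParser.ThingContext parseOneThing(List<Token> recordTokens) {
        ListTokenSource source = new ListTokenSource(recordTokens);
        CommonTokenStream stream = new CommonTokenStream(source);
        MyParser parser = new MyParser(stream);
        // The look-ahead now sees EOF at the record boundary instead of '%'.
        return parser.thing();
    }
}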

defining rule for identifiers in ANTLR

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
The problem is that the lexer recognizes 2 as an int literal and a as an ID, which is completely logical given the grammar I've written. But I don't want 2a to be recognized this way; instead I want an error to be displayed, since identifiers cannot begin with anything other than a letter. I'm really new to this compiler course... what should be done here?
It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
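One simple way to act on that hint (a sketch; the INVALID_ID rule and the fragment definitions are my own additions) is to lex a digit followed by identifier characters as a single error token, relying on the longest-match rule:
ID         : Letter (Letter | Digit | '_')* ;
TOK_INTLIT : [0-9]+ ;
// '2a' is longer than '2', so INVALID_ID wins the longest-match contest;
// a plain '25' still lexes as TOK_INTLIT because that rule is defined first.
INVALID_ID : [0-9] [a-zA-Z0-9_]+ ;
fragment Letter : [a-zA-Z] ;
fragment Digit  : [0-9] ;
You can then report any INVALID_ID token as a lexical error instead of letting the parser quietly see two tokens.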

Parsing regular expressions based on a context free grammar

Good evening, Stack Overflow.
I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:
Grammar
Basically, the language consists of two base statements:
( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) //Output: var2 = 150
Now, I'm pretty sure about what I should do in order to interpret a statement: 1) lexical analysis to turn a statement into a sequence of tokens, 2) syntax analysis to build a symbol table (a HashMap with the variables and their values) and a syntax tree (to perform the GET statements), then 3) an in-order visit of the tree to get the results I want.
I'd like some advice on the parsing method to read the source file. Considering the parser should ignore any whitespace, tabulation or newline, is it possible to use a Java Pattern to get a general statement I want to analyze? Is there a good way to read a statement weirdly formatted (and possibly more complex) like this
(
SET var
25
)
without confusing the parser with the open and closed parentheses?
For example
Scanner scan; //scanner reading the source file
String pattern = "..." //ideal pattern I've found to represent an expression
while(scan.hasNext(pattern))
Interpreter.computeStatement(scan.next(pattern));
would it be a viable option for this problem?
Solution proposed by Ira Baxter:
Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.
Maybe this will help: stackoverflow.com/a/2336769/120163
In the end, I understood thanks to Ira Baxter that this context free grammar can't be parsed with RegExp and I used the concepts of S-Expressions to build up the interpreter, whose source code you can find here. If you have any question about it (mainly because the comments aren't translated in english, even though I think the code is pretty clear), just message me or comment here.
Basically what I do is:
Parse every character and tokenize it (e.g. '(' becomes an OPEN_PAR, "SET" becomes a STATEMENT_SET, and a lone letter like 'b' is parsed as a VARIABLE); see the tokenizer sketch after this list
Then, I use the token list to do a syntactic analysis, which checks the patterns occurring in the token list against the grammar
If there's an expression inside the statement, I recursively check for any expression inside an expression, throwing an exception and skipping to the next correct statement if needed
At the end of analysing each statement, I evaluate it as the specification requires
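The tokenizing step can be as small as the following sketch (the class name is my own; it simply splits on parentheses and whitespace, which also handles the weirdly formatted statement above):
import java.util.ArrayList;
import java.util.List;

public class SExprTokenizer {
    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                               // skip spaces, tabs, newlines
            } else if (c == '(' || c == ')') {
                tokens.add(String.valueOf(c));     // parentheses are single tokens
                i++;
            } else {                               // a word (SET, GET, MUL, var) or number
                int start = i;
                while (i < source.length()
                        && !Character.isWhitespace(source.charAt(i))
                        && source.charAt(i) != '(' && source.charAt(i) != ')') {
                    i++;
                }
                tokens.add(source.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("(\n  SET var\n  25\n)"));  // [(, SET, var, 25, )]
    }
}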

How to check the ranges of numbers in ANTLR 3?

I know this might end up being language specific, so a Java or Python solution would be acceptable.
Given the grammar:
MONTH : DIGIT DIGIT ;
DIGIT : ('0'..'9') ;
I want a check constraint on MONTH to ensure the value is between 01 and 12. Where do I start looking, and how do I specify this constraint as a rule?
You can embed custom code by wrapping { and } around it. So you could do something like:
MONTH
 : DIGIT DIGIT
   {
     int month = Integer.parseInt(getText());
     // do your check here, for example:
     if (month < 1 || month > 12) {
       throw new RuntimeException("invalid month: " + getText());
     }
   }
 ;
As you can see, I called getText() to get a hold of the matched text of the token.
Note that I assumed you're referencing this MONTH rule from another lexer rule. If you're going to throw an exception when month < 1 or month > 12, then whenever your source contains an illegal month value, none of the parser rules will ever be matched. Although lexer and parser rules can be mixed in one .g grammar file, the input source is first tokenized based on the lexer rules, and only once that has happened are the parser rules matched.
You can use this free online utility Regex_For_Range to generate a regular expression for any continuous integer range. For the values 01-12 (with allowed leading 0's) the utility gives:
0*([1-9]|1[0-2])
From here you can see that if you want to constrain this to just the 2-digit strings '01' through '12', then adjust this to read:
0[1-9]|1[0-2]
For days 01-31 we get:
0*([1-9]|[12][0-9]|3[01])
And for the years 2000-2099 the expression is simply:
20[0-9]{2}
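If you'd rather express the 01-12 constraint in the grammar itself instead of in embedded code, the two-digit regex above translates directly into lexer alternatives (a sketch in ANTLR 3 syntax):
MONTH
 : '0' '1'..'9'   // 01-09
 | '1' '0'..'2'   // 10-12
 ;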
