defining rule for identifiers in ANTLR

defining rule for identifiers in ANTLR - java

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?

It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.

Related

Separate definitions of decimal number and word in ANTLR grammar

I'm working on defining a grammar in ANTLR4 which includes words and numbers separately.
Numbers are described:
NUM
: INTEGER+ ('.' INTEGER+)?
;
fragment INTEGER
: ('0' .. '9')
;
and words are described:
WORD
: VALID_CHAR +
;
fragment VALID_CHAR
: ('a' .. 'z') | ('A' .. 'Z')
;
The simplified grammar below describes the addition between either a word or a letter (and needs to be defined recursively like this):
expression
: left = expression '+' right = expression #addition
| value = WORD #word
| value = NUM #num
;
The issue is that when I enter 'd3' into the parser, I get a returned instance of a Word 'd'. Similarly, entering 3f returns a Number of value 3. Is there a way to ensure that 'd3' or any similar strings returns an error message from the grammar?
I've looked at the '~' symbol but that seems to be 'everything except', rather than 'only'.
To summarize, I'm looking for a way to ensure that ONLY a series of letters can be parsed to a Word, and contain no other symbols. Currently, the grammar seems to ignore any additional disallowed characters.
Similar to the message received when '3+' is entered:
simpleGrammar::compileUnit:1:2: mismatched input '<EOF>' expecting {WORD, NUM}
At present, the following occurs:
d --> (d) (word) (correct)
22.3 --> (22.2) number (correct)
d3 --> d (word) (incorrect)
22f.4 --> 22 (number) (incorrect)
But ideally the following would happen :
d --> (d) (word) (correct)
22.3 --> (22.2) number (correct)
d3 --> (error)
22f.4 --> (error)

[Revised to response to revised question and comments]
ANTLR will attempt to match what it can in your input stream in your input stream and then stop once it's reached the longest recognizable input. That means, the best ANTLR could do with your input was to recognize a word ('d') and then it quite, because it could match the rest of your input to any of your rules (using the root expression rule)
You can add a rule to tell ANTLR that it needs to consume to entire input, with a top-level rule something like:
root: expression EOF;
With this rule in place you'll get 'mismatched input' at the '3' in 'd3'.
This same rule would give a 'mismatched input' at the 'f' character in '22f.4'.
That should address the specific question you've asked, and, hopefully, is sufficient to meet your needs. The following discussion is reading a bit into your comment, and maybe assuming too much about what you want in the way of error messages.
Your comment (sort of) implies that you'd prefer to see error messages along the lines of "you have a digit in your word", or "you have a letter in you number"
It helps to understand ANTLR's pipeline for processing your input. First it processes your input stream using the Lexer rules (rules beginning with capital letters) to create a stream of tokens.
Your 'd3' input produces a stream of 2 tokens with your current grammar;
WORD ('d')
NUM ('3')
This stream of tokens is what is being matched against in your parser rules (i.e. expression).
'22f.4' results in the stream:
NUM ('22')
WORD ('f')
(I would expect an error here as there is no Lexer rule that matches a stream of characters beginning with a '.')
As soon as ANTLR saw something other than a number (or '.') while matching your NUM rule, it considered what it matched so far to be the contents of the NUM token, put it into the token stream and moved on. (similar with finding a number in a word)
This is standard lexing/parsing behavior.
You can implement your own ErrorListener where ANTLR will hand the details of the error it encountered to you and you could word you error message as you see fit, but I think you'll find it tricky to hit what it seems your target is. You would not have enough context in the error handler to know what came immediately before, etc., and even if you did, this would get very complicated very fast.
IF you always want some sort of whitespace to occur between NUMs and WORDs, you could do something like defining the following Lexer rules:
BAD_ATOM: (INTEGER|VALID_CHAR|'.')+;
(put it last in the grammar so that the valid streams will match first)
Then when a parser rule errors out with a BAD_ATOM rule, you could inspect it and provide an more specific error message.
Warning: This is a bit unorthodox, and could introduce constraints on what you could allow as you build up your grammar. That said, it's not uncommon to find a "catch-all" Lexer rule at the bottom of a grammar that some people use for better error messages and/or error recovery.

antlr4 grammar string with number

I have a problem with antlr4 grammar in java.
I would like to have a lexer value, that is able to parse all of the following inputs:
Only letters
Letters and numbers
Only numbers
My code looks like this:
parser rule:
new_string: NEW_STRING+;
lexer rule:
NEW_DIGIT: [0-9]+;
STRING_CHAR : ~[;\r\n"'];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ | STRING_CHAR+ NEW_DIGIT+);
I know there must be an obvious solution, but I have been trying to find one, and I can't seem to figure out a way.
Thank you in advance!

Since the first two lexer rules are not fragments, they can (and will) be matched if the input contains just digits, or ~[;\r\n"'] (since if equally long sequence of input can be matched, first lexer rule wins).
In fact, STRING_CHAR can match anything that NEW_STRING can, so the latter will never be used.
You need to:
make sure STRING_CHAR does not match digits
make NEW_DIGIT and STRING_CHAR fragments
check the asterisks - almost everything is allowed to repeat in your lexer, it doesn't make sense at first look ( but you need to adjust that according to your requirements that we do not know)
Like this:
fragment NEW_DIGIT: [0-9];
fragment STRING_CHAR : ~[;\r\n"'0-9];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ (NEW_DIGIT+)?);

Distinguishing between lexical error and semantic error

What is the difference between these two errors, lexical and semantic?
int d = "orange";
inw d = 4;
Would the first one be a semantic error? Since you can't assign a literal to an int? As for the second one the individual tokens are messed up so it would be lexical? That is my thought process, I could be wrong but I'd like to understand this a little more.

There are really three commonly recognized levels of interpretation: lexical, syntactic and semantic. Lexical analysis turns a string of characters into tokens, syntactic builds the tokens into valid statements in the language and semantic interprets those statements correctly to perform some algorithm.
Your first error is semantic: while all the tokens are legal it's not legal in Java to assign a string constant to a integer variable.
Your second error could be classified as lexical (as the string "inw" is not a valid keyword) or as syntactic ("inw" could be the name of a variable but it's not legal syntax to have a variable name in that context).
A semantic error can also be something that is legal in the language but does not represent the intended algorithm. For example: "1" + n is perfectly valid code but if it is intending to do an arithmetic addition then it has a semantic error. Some semantic errors can be picked up by modern compilers but ones such as these depend on the intention of the programmer.
See the answers to whats-the-difference-between-syntax-and-semantics for more details.

Why can't an identifier begin with a digit in java

I can not think anything other than "string of digits would be a valid identifier as well as a valid number."
Is there any other explanation other than this one?

Because that would make telling number literals from symbols names a serious PITA.
For example with a digit being valid for the first character a variables of the names 0xdeadbeef or 0xc00lcafe were valid. But that could be interpreted as a hexadecimal number as well. By limiting the first character of a symbol to be a non-digit, ambiguities of that kind are avoided.

If it could then this assignment would be possible
int 33 = 44; // oh oh
then how would the JVM distinguish between a numeric literal and a variable?

It's to keep the rules simple for the compiler as well as for the programmer.
An identifier could be defined as any alphanumeric sequence that can not be interpreted as a number, but you would get into situations where the compiler would interpret the code differently from what you expect.
Example:
double 1e = 9;
double x = 1e-4;
The result in x would not be 5 but 0.0001 as 1e-4 is a number in scientific notation and not interpreted as 1e minus 4.

This is done in Java and in many other languages so that a parser could classify a terminal symbol uniquely regardless of its surrounding context. Technically, it is entirely possible to allow identifiers that look like numbers or even like keywords: for example, it is possible to write a parser that lifts the restriction on identifiers, allowing you to write something like this:
int 123 = 321; // 123 is an identifier in this imaginary compiler
The compiler knows enough to "understand" that whatever comes after the type name must be a variable name, so 123 is an identifier, and so it could treat this as a valid declaration. However, this would create more ambiguities down the road, because 123 becomes in invalid number "shadowed" by your new "identifier".
In the end, the rule works both ways: it helps compiler designers write simpler compilers, and it also helps programmers write readable code.
Note that there were attempts in the past to build compilers that are not particularly picky about names of identifiers - for example
int a real int = 3
would declare an identifier with spaces (i.e. "a real int" is a single identifier). This did not help readability, though, so modern compilers abandoned the trend.

How can I simplify token prediction DFA?

Lexer DFA results in "code too large" error
I'm trying to parse Java Server Pages using ANTLR 3.
Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.
In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the and lexer allowed it to compile.
However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.
What should I do about it?
Are there common mistakes that result in complex DFAs?
Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?
Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.
Background
ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition(). This method contains a switch statement with a case for each state in the DFA. Within each case, there is a series of if statements, one for each transition from the state. The condition of each if statement tests an input character to see if it matches the transition.
These character-testing conditions can be very complex. They normally have the following form:
int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
…
case 13 :
if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
s = 24; /* If the character matches, move to the next state. */
else if …
A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)
I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.

Grammars that are very large (many different tokens) have that problem, unfortunately (SQL grammars suffer from this too).
Sometimes this can be fixed by making certain lexer rules fragments opposed to "full" lexer rules that produce tokens and/or re-arranging the way characters are matched inside the rules, but by looking at the way you already tried yourself, I doubt there can gained much in your case. However, if you're willing to post your lexer grammar here on SO, I, or someone else, might see something that could be changed.
In general, this problem is fixed by splitting the lexer grammar into 2 or more separate lexer grammars and then importing those in one "master" grammar. In ANTLR terms, these are called composite grammars. See this ANTLR Wiki page about them: http://www.antlr.org/wiki/display/ANTLR3/Composite+Grammars
EDIT
As #Gunther rightfully mentioned in the comment beneath the OP, see the Q&A: Why my antlr lexer java class is "code too large"? where a small change (the removal of a certain predicate) caused this "code too large"-error to disappear.

Well, actually it is not always easy to make a composite grammar. In many cases this AntTask helps to fix this problem (it must be run every time after recompiling a grammar, but this process is not so boring).
Unfortunately, even this magic script doesn't help in some complex cases. Compiler can begin to complaining about too large blocks of DFA transitions (static String[] fields).
I found an easy way to solve it, by moving (using IDE refactoring features) such fields to another class with arbitrarily generated name. It always helps when moving just one or more fields in such way.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.