This should be fairly simple.
I'm working on a lexer grammar using ANTLR, and I want to limit the maximum length of variable identifiers to 30 characters. I attempted to accomplish this with the following line (normal regex syntax, apart from ANTLR's quoted character ranges):
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'){0,29} {System.out.println("IDENTIFIER FOUND.");}
;
No errors in code generation, but compilation failed due to a line in the generated code that was simply:
0,29
Evidently ANTLR treats the text between the braces as an embedded action and emits it verbatim into the generated code, right alongside the print statement. I searched the ANTLR site and found no example of, or reference to, an equivalent expression.
What should the syntax of this expression be?
ANTLR does not support the {m,n} quantifier syntax. ANTLR sees the {} of your quantifier and can't tell them apart from the {} that surround your actions.
Workarounds:
Enforce the limit semantically: let the lexer gather an ID of unlimited size, then complain about it or truncate it in your action code or later in the compiler (see the sketch after the example below).
Create the quantification rules manually.
Here is an example of manual quantification that limits IDs to 8 characters (note that SUBID should be a fragment rule so it does not produce tokens of its own):
fragment SUBID : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')
    ;
ID : ('a'..'z'|'A'..'Z')
     (SUBID (SUBID (SUBID (SUBID (SUBID (SUBID SUBID?)?)?)?)?)?)?
    ;
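And for workaround #1, a minimal sketch of the semantic check (the error reporting here is illustrative; plug in whatever mechanism your compiler uses):
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
     {
       // Semantic limit: always accept the token, but flag identifiers over 30 chars.
       if (getText().length() > 30) {
         System.err.println("Identifier too long: " + getText());
       }
     }
   ;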
Personally, I'd go with the semantic solution (#1). There is very little reason these days to limit identifier length in a language, and even less reason to cause a syntax error (an early abort of the compile) when such a rule is violated.
Related
This should be fairly simple. I'm working on a lexer grammar using ANTLR, and I want to limit the maximum length of variable identifiers to 32 characters. I attempted to accomplish this with the following line (normal regex syntax):
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'){0,31};
No errors in code generation, but compilation failed due to a line in the generated code that was simply:
0,31
Evidently ANTLR treats the text between the braces as an embedded action and emits it verbatim into the generated code. I searched the ANTLR site and found no example of, or reference to, an equivalent expression. What should the syntax of this expression be?
ANTLR4 cannot deal with the quantifier syntax {a,b}. Moreover, I'm not sure it is a good idea to put this constraint in the lexer in the first place. Let me explain: a constraint in the lexer affects token recognition itself, so if your string is longer than 32 characters, it will not be recognized as an ID token at all. That can cause the string to be recognized as some other token instead, which will probably make the parsing phase fail.
A solution is to leave the length constraint out of the grammar and handle it in a Java ANTLR4 listener or visitor instead, throwing an exception or displaying an error when the length exceeds 32 characters.
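For instance, a minimal sketch of such a listener (the grammar name "MyLang" and the rule name "ident" are hypothetical; substitute the classes ANTLR generates from your own grammar):
// Hypothetical: a grammar named MyLang with an "ident" rule containing an ID token.
public class IdentifierLengthChecker extends MyLangBaseListener {
    @Override
    public void enterIdent(MyLangParser.IdentContext ctx) {
        String name = ctx.ID().getText();
        if (name.length() > 32) {
            // Report the violation instead of failing the lexing phase.
            System.err.println("line " + ctx.getStart().getLine()
                    + ": identifier '" + name + "' exceeds 32 characters");
        }
    }
}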
EDIT: This question has already been answered here: Range quantifier syntax in ANTLR Regex
I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is:
void main() {
    int 2a;
}
The problem is that the lexer recognizes 2 as an int literal and a as an ID, which is completely logical given the grammar I've written. But I don't want 2a to be tokenized this way; instead, I want an error to be reported, since identifiers cannot begin with anything other than a letter... I'm really new to this compiler course... what should be done here?
It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
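For instance, here is one way to act on that hint in an ANTLR4 lexer (a sketch only; the rule name and error message are made up):
// Added alongside the existing ID and TOK_INTLIT rules. Because the lexer
// prefers the longest match, "2a" is consumed whole by this rule instead of
// being split into TOK_INTLIT "2" followed by ID "a".
INVALID_ID : [0-9]+ [a-zA-Z_] [a-zA-Z0-9_]*
             { System.err.println("error: identifier cannot start with a digit: " + getText()); }
           ;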
Good evening, Stack Overflow.
I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:
[Grammar image not included]
Basically, the language consists of two base statements:
( SET var 25 )            // Output: var = 25
( GET ( MUL var 5 ) )     // Output: 125
( SET var2 ( MUL 30 5 ) ) // Output: var2 = 150
Now, I'm pretty sure about what I should do in order to interpret a statement:
1) lexical analysis to turn a statement into a sequence of tokens;
2) syntax analysis to build a symbol table (a HashMap with the variables and their values) and a syntax tree (to evaluate the GET statements);
3) an in-order visit of the tree to get the results I want.
I'd like some advice on how to read the source file. Considering that the parser should ignore any whitespace, tab, or newline, is it possible to use a Java Pattern to capture a whole statement for analysis? Is there a good way to read a weirdly formatted (and possibly more complex) statement like this
(
SET var
25
)
without confusing the parser with the open and closed parentheses?
For example
Scanner scan;            // scanner reading the source file
String pattern = "...";  // ideal pattern I've found to represent an expression
while (scan.hasNext(pattern))
    Interpreter.computeStatement(scan.next(pattern));
would it be a viable option for this problem?
Solution proposed by Ira Baxter:
Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.
Maybe this will help: stackoverflow.com/a/2336769/120163
In the end, I understood thanks to Ira Baxter that this context-free grammar can't be parsed with regular expressions, and I used the concept of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.
Basically, what I do is:
Parse every character and tokenize it (e.g. '(' becomes OPEN_PAR, "SET" becomes STATEMENT_SET, and a single letter like 'b' is tokenized as a VARIABLE); a rough sketch of this step follows below.
Then I use the resulting token list for the syntactic analysis, checking the patterns occurring inside the token list against the grammar.
If there's an expression inside the statement, I check recursively for any expression nested within it, throwing an exception and skipping to the next well-formed statement if needed.
After analysing each single statement, I evaluate it as the specification requires.
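As a rough illustration of the tokenizing step (the token types and class names here are invented for the sketch, not taken from the actual source):
import java.util.ArrayList;
import java.util.List;

// Token types invented for this sketch; the real interpreter may differ.
enum TokenType { OPEN_PAR, CLOSE_PAR, STATEMENT_SET, STATEMENT_GET, OPERATOR_MUL, NUMBER, VARIABLE }

final class Token {
    final TokenType type;
    final String text;
    Token(TokenType type, String text) { this.type = type; this.text = text; }
}

final class Tokenizer {
    // Pad the parentheses with spaces, then split on whitespace runs; this makes
    // the weirdly formatted multi-line statements tokenize exactly like the
    // single-line ones, since newlines and tabs are just more whitespace.
    static List<Token> tokenize(String source) {
        List<Token> tokens = new ArrayList<>();
        for (String word : source.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            switch (word) {
                case "(":   tokens.add(new Token(TokenType.OPEN_PAR, word));      break;
                case ")":   tokens.add(new Token(TokenType.CLOSE_PAR, word));     break;
                case "SET": tokens.add(new Token(TokenType.STATEMENT_SET, word)); break;
                case "GET": tokens.add(new Token(TokenType.STATEMENT_GET, word)); break;
                case "MUL": tokens.add(new Token(TokenType.OPERATOR_MUL, word));  break;
                default:
                    boolean numeric = word.chars().allMatch(Character::isDigit);
                    tokens.add(new Token(numeric ? TokenType.NUMBER : TokenType.VARIABLE, word));
            }
        }
        return tokens;
    }
}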
Searching for solutions to my problem, I found this question, which suggests composite grammars to get rid of the "code too large" error. The problem is that I'm already using grammar imports, and when I further extend one of the imported grammars, the root parser grammar triggers the error. Apparently, the problem lies in the many token and DFA definitions that ANTLR generates after analyzing the whole grammar. Is there a way (or what is the suggested way) to get rid of this problem? Is it scalable, i.e. does it not depend on the parts changed by the workaround being small enough?
EDIT: To make this clear (the linked question didn't): the "code too large" error is a compiler error on the generated parser code, caused, to my understanding, by some generated method exceeding the bytecode size limit in the Java specification. In my case, it's the static initializer of the root parser class, which contains tons of DFA lookahead variables, all of which produce code in the initializer. Ideally, ANTLR should be able to split that up when the grammar is too big, or when the user tells it to. Is there such an option?
(I have to admit, the asker of the linked question had an... interesting rule that caused his grammar to bloat up, and the error may be my own fault here, too. But since this could happen in any large grammar through no fault of the grammar's author, I see this as a valid, non-grammar-specific ANTLR question.)
EDIT END
My grammar parses "Magic the Gathering" rules text and is available here (git). The problem specifically appears when exchanging line 33 for 34-36 in this file. I use Maven and antlr3-maven-plugin for building, so ideally, the workaround is doable using the plugin, but if it's not, that's a smaller problem than the one I have now...
Thanks a lot, and I hope I haven't overlooked any obvious documentation that would help me.
The fragment keyword can only be used before lexer rules, not before parser rules, as I see you doing (I only looked at ObjectExpressions.g). First change that in all your grammars. It's unfortunate that ANTLR does not produce an error when you try it, but believe me: it's wrong, and it might be causing (a part of) your problem(s).
Also, your rule from line 34-36:
qualities
: qualities0
| qualities0 (COMMA qualities0)+ -> qualities0+
| qualities0 (Or qualities0)+ -> ^(Or qualities0+)
;
should be rewritten as:
qualities
: qualities0 (COMMA qualities0)* -> qualities0+
| qualities0 (Or qualities0)+ -> ^(Or qualities0+)
;
EDIT
So, Ideally, ANTLR should be able to split that up in the case that the grammar is too big/the user tells ANTLR to do it. Is there such an option?
No, there is no such option unfortunately. You'll have to divide the grammar into (even more) smaller ones.
Lexer DFA results in "code too large" error
I'm trying to parse Java Server Pages using ANTLR 3.
Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.
In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the lexer and allowed it to compile.
However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.
What should I do about it?
Are there common mistakes that result in complex DFAs?
Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?
Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.
Background
ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition(). This method contains a switch statement with a case for each state in the DFA. Within each case, there is a series of if statements, one for each transition from the state. The condition of each if statement tests an input character to see if it matches the transition.
These character-testing conditions can be very complex. They normally have the following form:
int ch = … ; /* "ch" is the next character in the input stream. */
switch (s) { /* "s" is the current state. */
    …
    case 13:
        if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
            s = 24; /* If the character matches, move to the next state. */
        else if …
A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)
I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.
Grammars that are very large (many different tokens) have that problem, unfortunately (SQL grammars suffer from this too).
Sometimes this can be fixed by making certain lexer rules fragments, as opposed to "full" lexer rules that produce tokens, and/or by rearranging the way characters are matched inside the rules; but judging by what you have already tried yourself, I doubt much can be gained in your case. However, if you're willing to post your lexer grammar here on SO, I, or someone else, might see something that could be changed.
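To illustrate the fragment idea (the rule names here are invented):
// Helper rules marked as fragments never produce tokens of their own;
// only Name below emits a token, which keeps the token set small.
fragment Digit  : '0'..'9';
fragment Letter : 'a'..'z' | 'A'..'Z';
Name            : Letter (Letter | Digit)*;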
In general, this problem is fixed by splitting the lexer grammar into 2 or more separate lexer grammars and then importing those in one "master" grammar. In ANTLR terms, these are called composite grammars. See this ANTLR Wiki page about them: http://www.antlr.org/wiki/display/ANTLR3/Composite+Grammars
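A minimal sketch of that setup (grammar names invented; see the wiki page above for the full mechanics):
// --- BaseLexer.g: one part of the split lexer ---
lexer grammar BaseLexer;

Name : ('a'..'z' | 'A'..'Z')+;

// --- MasterLexer.g: the "master" grammar importing the parts ---
lexer grammar MasterLexer;

import BaseLexer;

Space : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; };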
EDIT
As @Gunther rightly mentioned in the comment beneath the OP, see this Q&A: Why my antlr lexer java class is "code too large"?, where a small change (the removal of a certain predicate) made the "code too large" error disappear.
Well, actually, it is not always easy to make a composite grammar. In many cases this AntTask helps fix the problem (it must be run every time after regenerating a grammar, but the process is not too tedious).
Unfortunately, even this magic script doesn't help in some complex cases. The compiler can start complaining about too-large blocks of DFA transitions (static String[] fields).
I found an easy way to solve that: move such fields (using your IDE's refactoring features) to another, arbitrarily named class. Moving just one or a few fields this way has always been enough.
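Schematically, the refactoring looks like this (class and field names are invented; the real fields are the generated DFA transition tables):
// Before the refactoring, the generated lexer initializes the huge
// transition tables itself, which bloats its static initializer:
//
//     static final String[] DFA21_transitionS = { /* thousands of entries */ };
//
// After moving the field, the initialization bytecode is compiled into the
// helper class's static initializer instead, keeping each class under the limit.
class MyLexerTables {
    static final String[] DFA21_transitionS = { /* ...entries moved here... */ };
}

public class MyLexer {
    static final String[] DFA21_transitionS = MyLexerTables.DFA21_transitionS;
    // ...rest of the generated lexer, unchanged...
}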