how should i limit length of an ID token in ANTLR? - java

This should be fairly simple. I'm working on a lexer grammar using ANTLR, and want to limit the maximum length of variable identifiers to 32 characters. I attempted to accomplish this with this line(following normal regex - syntax):
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'){0,31};
No errors in code generation, but compilation failed due to a line in the generated code that was simply:
0,31
Obviously antlr is taking the section of text between the brackets and placing it in the accept state area along with the print line. I searched the ANTLR site, and I found no example or reference to an equivalent expression. What should the syntax of this expression be?

ANTLR4 is not able to deal with the quantifier syntax {a,b}, moreover, I don't know if it is great to set this constraint in the lexer. I explain myself. The constraint you add in the lexer is responsible for the token recognition. So, if your string is more than 32 char, then the token will not be recognized as an ID token. That seems not so great because it can lead your string to be recognized as another token and will probably lead to a failure fom the parsing phase.
A solution is to avoid this length constraint and deal with it in a Java ANTLR4 Listener or Visitor for example, throwing an exception/displaying an error...etc when the length is greater than 32 char.
EDIT> This question had already been answered here: Range quantifier syntax in ANTLR Regex

Related

Reading and Parsing PIPE using Antlr [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Regex for commas and periods allowed

I tried searching for an answer to this question and also reading the Regex Wiki but I couldn't find what I'm looking for exactly.
I have a program that validates a document. (It was written by someone else).
If certain lines or characters don't match the regex then an error is generated. I've noted that a few false errors are always generated and I want to correct this. I believe I have narrowed down the problem to this:
Here is an example:
This error is flagged by the program logic:
ERROR: File header immediate origin name is invalid: CITIBANK, N.A.
Here is the code that causes that error:
if(strLine.substring(63,86).matches("[A-Z,a-z,0-9, ]+")){
}else{
JOptionPane.showMessageDialog(null, "ERROR: File header immediate origin name is invalid: "+strLine.substring(63,86));
errorFound=true;
fileHeaderErrorFound=true;
bw.write("ERROR: File header immediate origin name is invalid: "+strLine.substring(63,86));
bw.newLine();
I believe the reason that the error is called at runtime is because the text contains a period and comma.. I am unsure how to allow these in the regex.
I have tried using this
if(strLine.substring(63,86).matches("[A-Z,a-z,0-9,,,. ]+")){
and it seemed to work I just wanted to make sure that is the correct way because it doesn't look right.
You're right in your analysis, the match failed because there was a dot in the text that isn't contained in the character class.
However, you can simplify the regex - no need to repeat the commas, they don't have any special meaning inside a class:
if(strLine.substring(63,86).matches("[A-Za-z0-9,. ]+"))
Are you sure that you'll never have to match non-ASCII letters or any other kind of punctuation, though?
Alphabets and digits : a-zA-Z0-9 can effectively be replaced by \w denoting 'words'.
The period and comma don't need escaping and can be used as is. Hence this regex might come in handy:
"[\w,.]"
Hope this helps. :)

defining rule for identifiers in ANTLR

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?
It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.

Regular expression, excluding .. in suffix of email addy [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Using a regular expression to validate an email address
This is homework, I've been working on it for a while, I've done lots of reading and feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating/invalidating a list of emails. There are two addresses which are giving me problems, I can't get them both to validate the correct way at the same time. I've gone through a dozen different expressions that work for all the other emails on the list but I can't get those two at the same time.
First, the addresses.
me#example..com - invalid
someone.nothere#1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
#.+\\.[[a-z]0-9]+
And had a second pattern for checking some more invalid addresses and checked the email against both patterns, one checked for validity the other invalidity but my professor said he wanted it all in on expression.
#[[\\w]+\\.[\\w]+]+
or
#[\\w]+\\.[\\w]+
I've tried it written many, many different ways but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do, I want it to match a character class of "character+"."character+"+
The plus sign being at least one. It works for the invalid class when I only allow the character class to repeat one time(and obviously the ip doesn't get matched), but when I allow the character class to repeat itself it matches the second period even thought it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to surrounded by characters.
You need a negative look-ahead :
#\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
test in Perl :
Perl> $_ = 'someone.nothere#1.0.0.127'
someone.nothere#1.0.0.127
Perl> print "OK\n" if /\#\w+\.(?!\.)/
OK
1
Perl> $_ = 'me#example..com'
me#example..com
Perl> print "OK\n" if /\#\w+\.(?!\.)/
Perl>
#([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least on more word character.
I think you want this:
#[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you are doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
The official standard RFC 2822 describes the syntax that valid email addresses with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
More practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

Range quantifier syntax in ANTLR Regex

This should be fairly simple.
I'm working on a lexer grammar using ANTLR, and want to limit the maximum length of variable identifiers to 30 characters. I attempted to accomplish this with this line(following normal regex - except for the '' thing - syntax):
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'){0,29} {System.out.println("IDENTIFIER FOUND.");}
;
No errors in code generation, but compilation failed due to a line in the generated code that was simply:
0,29
Obviously antlr is taking the section of text between the brackets and placing it in the accept state area along with the print line. I searched the ANTLR site, and I found no example or reference to an equivalent expression.
What should the syntax of this expression be?
ANTLR does not support the {m,n} quantifier syntax. ANTLR sees the {} of your quantifier and can't tell them apart from the {} that surround your actions.
Workarounds:
Enforce the limit semantically. Let it gather an unlimited size ID and then complain/truncate it as part of your action code or later in the compiler.
Create the quantification rules manually.
This is an example of a manual rule that limits IDs to 8.
SUBID : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')
;
ID : ('a'..'z'|'A'..'Z')
(SUBID (SUBID (SUBID (SUBID (SUBID (SUBID SUBID?)?)?)?)?)?)?
;
Personally, I'd go with the semantic solution (#1). There is very little reason these days to limit the identifiers in a language, and even less reason to cause a syntax error (early abort of the compile) when such a rule is violated.

Categories