How to check the ranges of numbers in ANTLR 3?

I know this might end up being language specific, so a Java or Python solution would be acceptable.
Given the grammar:
MONTH : DIGIT DIGIT ;
DIGIT : ('0'..'9') ;
I want a check constraint on MONTH to ensure the value is between 01 and 12. Where do I start looking, and how do I specify this constraint as a rule?

You can embed custom code by wrapping { and } around it. So you could do something like:
MONTH
: DIGIT DIGIT
{
int month = Integer.parseInt(getText());
// do your check here
}
;
As you can see, I called getText() to get a hold of the matched text of the token.
Note that I assumed you're referencing this MONTH rule from another lexer rule. If you throw an exception when month < 1 or month > 12, then whenever your source contains an illegal month value, none of the parser rules will ever be matched. Although lexer and parser rules can be mixed in one .g grammar file, the input source is first tokenized based on the lexer rules, and only once that has happened are the parser rules matched.
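One way to avoid throwing from inside the lexer is to let MONTH match any two digits and validate the value afterwards, for example in a parser action or after parsing. The following is only a minimal sketch of such a check; the names MonthCheck and isValidMonth are hypothetical and not part of ANTLR.

```java
// Hypothetical helper, not part of ANTLR: validates the text of a MONTH token.
class MonthCheck {

    /** Returns true if text is exactly two digits with a value from 01 to 12. */
    static boolean isValidMonth(String text) {
        if (text == null || text.length() != 2) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        int month = Integer.parseInt(text);
        return month >= 1 && month <= 12;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"01", "12", "00", "13"}) {
            System.out.println(s + " -> " + isValidMonth(s));
        }
    }
}
```

Inside a grammar action, the argument would be the matched token text (getText() in the lexer rule above).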

You can use this free online utility Regex_For_Range to generate a regular expression for any continuous integer range. For the values 01-12 (with allowed leading 0's) the utility gives:
0*([1-9]|1[0-2])
From here you can see that if you want to constrain this to just the 2-digit strings '01' through '12', then adjust this to read:
0[1-9]|1[0-2]
For days 01-31 we get:
0*([1-9]|[12][0-9]|3[01])
And for the years 2000-2099 the expression is simply:
20[0-9]{2}
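If the check is done in Java rather than in the grammar, these expressions can be used directly with java.util.regex. This is just a sketch; the class name DateFieldPatterns and its helper method are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical holder for the expressions quoted above.
class DateFieldPatterns {
    static final Pattern MONTH = Pattern.compile("0[1-9]|1[0-2]");
    static final Pattern DAY = Pattern.compile("0*([1-9]|[12][0-9]|3[01])");
    static final Pattern YEAR = Pattern.compile("20[0-9]{2}");

    /** True only if the whole string matches the pattern. */
    static boolean matches(Pattern pattern, String text) {
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println("month 07: " + matches(MONTH, "07"));
        System.out.println("month 13: " + matches(MONTH, "13"));
        System.out.println("day 31: " + matches(DAY, "31"));
        System.out.println("year 2100: " + matches(YEAR, "2100"));
    }
}
```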

Related

Separate definitions of decimal number and word in ANTLR grammar

I'm working on defining a grammar in ANTLR4 which includes words and numbers separately.
Numbers are described:
NUM
: INTEGER+ ('.' INTEGER+)?
;
fragment INTEGER
: ('0' .. '9')
;
and words are described:
WORD
: VALID_CHAR +
;
fragment VALID_CHAR
: ('a' .. 'z') | ('A' .. 'Z')
;
The simplified grammar below describes the addition between either a word or a number (and needs to be defined recursively like this):
expression
: left = expression '+' right = expression #addition
| value = WORD #word
| value = NUM #num
;
The issue is that when I enter 'd3' into the parser, I get a returned instance of a Word 'd'. Similarly, entering 3f returns a Number of value 3. Is there a way to ensure that 'd3' or any similar strings returns an error message from the grammar?
I've looked at the '~' symbol but that seems to be 'everything except', rather than 'only'.
To summarize, I'm looking for a way to ensure that ONLY a series of letters can be parsed to a Word, and contain no other symbols. Currently, the grammar seems to ignore any additional disallowed characters.
Similar to the message received when '3+' is entered:
simpleGrammar::compileUnit:1:2: mismatched input '<EOF>' expecting {WORD, NUM}
At present, the following occurs:
d --> (d) (word) (correct)
22.3 --> (22.3) number (correct)
d3 --> d (word) (incorrect)
22f.4 --> 22 (number) (incorrect)
But ideally the following would happen :
d --> (d) (word) (correct)
22.3 --> (22.3) number (correct)
d3 --> (error)
22f.4 --> (error)

[Revised to response to revised question and comments]
ANTLR will attempt to match what it can in your input stream and then stop once it has reached the longest recognizable input. That means the best ANTLR could do with your input was to recognize a word ('d'), and then it quit, because it could not match the rest of your input to any of your rules (using the root expression rule).
You can add a rule to tell ANTLR that it needs to consume the entire input, with a top-level rule something like:
root: expression EOF;
With this rule in place you'll get 'mismatched input' at the '3' in 'd3'.
This same rule would give a 'mismatched input' at the 'f' character in '22f.4'.
That should address the specific question you've asked, and, hopefully, is sufficient to meet your needs. The following discussion is reading a bit into your comment, and maybe assuming too much about what you want in the way of error messages.
Your comment (sort of) implies that you'd prefer to see error messages along the lines of "you have a digit in your word", or "you have a letter in your number".
It helps to understand ANTLR's pipeline for processing your input. First it processes your input stream using the Lexer rules (rules beginning with capital letters) to create a stream of tokens.
Your 'd3' input produces a stream of 2 tokens with your current grammar:
WORD ('d')
NUM ('3')
This stream of tokens is what is being matched against in your parser rules (i.e. expression).
'22f.4' results in the stream:
NUM ('22')
WORD ('f')
(I would expect an error here as there is no Lexer rule that matches a stream of characters beginning with a '.')
As soon as ANTLR saw something other than a number (or '.') while matching your NUM rule, it considered what it matched so far to be the contents of the NUM token, put it into the token stream and moved on. (similar with finding a number in a word)
This is standard lexing/parsing behavior.
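To make the tokenization step concrete, here is a small hand-written Java simulation of the two lexer rules. It is not ANTLR's implementation, only a sketch under the assumption that NUM and WORD are the only lexer rules; the class name LexSketch and the ERROR pseudo-token are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written simulation of the NUM and WORD lexer rules (not ANTLR itself).
class LexSketch {
    private static final Pattern NUM = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    /** Repeatedly takes the longest NUM or WORD starting at the current position. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Matcher num = NUM.matcher(input).region(pos, input.length());
            Matcher word = WORD.matcher(input).region(pos, input.length());
            if (num.lookingAt()) {
                tokens.add("NUM('" + num.group() + "')");
                pos = num.end();
            } else if (word.lookingAt()) {
                tokens.add("WORD('" + word.group() + "')");
                pos = word.end();
            } else {
                // No rule can start with this character; record it and skip one character.
                tokens.add("ERROR('" + input.charAt(pos) + "')");
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d3"));
        System.out.println(tokenize("22f.4"));
    }
}
```

For 'd3' the sketch yields a WORD followed by a NUM, which is exactly the token stream the parser then has to deal with.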
You can implement your own ErrorListener, where ANTLR will hand you the details of the error it encountered, and you could word your error message as you see fit, but I think you'll find it tricky to hit what seems to be your target. You would not have enough context in the error handler to know what came immediately before, etc., and even if you did, this would get very complicated very fast.
If you always want some sort of whitespace to occur between NUMs and WORDs, you could do something like defining the following Lexer rule:
BAD_ATOM: (INTEGER|VALID_CHAR|'.')+;
(put it last in the grammar so that the valid streams will match first)
Then when a parser rule errors out on a BAD_ATOM token, you could inspect it and provide a more specific error message.
Warning: This is a bit unorthodox, and could introduce constraints on what you could allow as you build up your grammar. That said, it's not uncommon to find a "catch-all" Lexer rule at the bottom of a grammar that some people use for better error messages and/or error recovery.
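As a rough illustration of "inspecting" a BAD_ATOM, the text of the offending token could be classified in plain Java before building the message. The class BadAtomMessages, the describe method and the message wording are all hypothetical; how the token text reaches this code (error listener, visitor, etc.) is left open.

```java
// Hypothetical helper that turns the text of a BAD_ATOM token into a message.
class BadAtomMessages {

    static String describe(String text) {
        if (text == null || text.isEmpty()) {
            return "empty token";
        }
        boolean hasLetter = false;
        boolean hasDigit = false;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                hasLetter = true;
            } else if (Character.isDigit(c)) {
                hasDigit = true;
            }
        }
        char first = text.charAt(0);
        if (Character.isLetter(first) && hasDigit) {
            return "word '" + text + "' contains a digit";
        }
        if (Character.isDigit(first) && hasLetter) {
            return "number '" + text + "' contains a letter";
        }
        return "malformed token '" + text + "'";
    }

    public static void main(String[] args) {
        System.out.println(describe("d3"));
        System.out.println(describe("22f.4"));
    }
}
```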

Java Regex First Name Validation

I understand that validating a first name field is highly controversial because there are so many different possibilities. However, I am just learning regex, and to help grasp the concept I have designed some simple validations to make sure I can get the code to do exactly what I want, regardless of whether it conforms to best business logic practices.
I am trying to validate a few things.
The first name is between 1 and 25 characters.
The first name can only start with an a-z (ignore case) character.
After that the first name can contain a-z (ignore case) and [ '-,.].
The first name can only end with an a-z (ignore case) character.
public static boolean firstNameValidation(String name){
boolean valid = name.matches("(?i)(^[a-z]+)[a-z .,-]((?! .,-)$){1,25}$");
System.out.println("Name: " + name + "\nValid: " + valid);
return valid;
}

Try this regex
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))[A-Za-z- '.]{2,}$
Demo

Your expression is almost correct. The following is a modification that satisfies all of the conditions:
valid = name.matches("(?i)(^[a-z])((?![ .,'-]$)[a-z .,'-]){0,24}$");

A regex for the same:
([a-zA-Z][a-zA-Z_',.-]{0,23}[a-zA-Z]?)

Let's change the order of the requirements:
ignore case: "(?i)"
can only start with an a-z character: "(?i)[a-z]"
can only end with an a-z: "(?i)[a-z](.*[a-z])?"
is between 1 and 25 characters: "(?i)[a-z](.{0,23}[a-z])?"
can contain a-z and [ '-,.]: "(?i)[a-z]([- ',.a-z]{0,23}[a-z])?"
the last one should do the job:
valid = name.matches("(?i)[a-z]([- ',.a-z]{0,23}[a-z])?");
Test on RegexPlanet (press java button).
Notes on the points above:
could have used "[a-zA-Z]" instead of "(?i)"
need ? since we want to allow one-character names
23 is the total length minus the first and last characters (25-1-1)
the - must come first (or last) inside [], otherwise it is interpreted as a range separator (assuming you didn't mean the characters between ' and ,)
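A quick way to convince yourself that the final expression behaves as described is to run it against a few names. This is only a sketch; the class name FirstNameCheck and the sample names are made up.

```java
import java.util.regex.Pattern;

// Small harness around the step-by-step expression derived above.
class FirstNameCheck {
    private static final Pattern FIRST_NAME =
            Pattern.compile("(?i)[a-z]([- ',.a-z]{0,23}[a-z])?");

    /** True if the whole name satisfies the four requirements from the question. */
    static boolean isValid(String name) {
        return name != null && FIRST_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"A", "Mary-Ann", "O'Neil", "-Bob", "Bob.", "J. R"};
        for (String s : samples) {
            System.out.println("Name: " + s + "\nValid: " + isValid(s));
        }
    }
}
```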

Try this simplest version:
^[a-zA-Z][a-zA-Z][-',.][a-zA-Z]{1,25}$
Thanks for sharing.

A Unicode-compatible version of @ProPhoto's answer:
^[^- '](?=(?!\p{Lu}?\p{Lu}))(?=(?!\p{Ll}+\p{Lu}))(?=(?!.*\p{Lu}\p{Lu}))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))(\p{L}|[- '.]){2,}$

Why does DecimalFormat ".#" and "0.#" have different results on 23.0?

Why does java.text.DecimalFormat evaluate the following results:
new DecimalFormat("0.#").format(23.0) // result: "23"
new DecimalFormat(".#").format(23.0) // result: "23.0"
I would have expected the result to be 23 in both cases, because special character # omits zeros. How does the leading special character 0 affect the fraction part? (Tried to match/understand it with the BNF given in javadoc, but failed to do so.)

The second format seems to be invalid according to the JavaDoc, but somehow it parses without error anyway.
Pattern:
    PositivePattern
    PositivePattern ; NegativePattern
PositivePattern:
    Prefix(opt) Number Suffix(opt)
NegativePattern:
    Prefix(opt) Number Suffix(opt)
Prefix:
    any Unicode characters except \uFFFE, \uFFFF, and special characters
Suffix:
    any Unicode characters except \uFFFE, \uFFFF, and special characters
Number:
    Integer Exponent(opt)
    Integer . Fraction Exponent(opt)
Integer:
    MinimumInteger
    #
    # Integer
    # , Integer
MinimumInteger:
    0
    0 MinimumInteger
    0 , MinimumInteger
Fraction:
    MinimumFraction(opt) OptionalFraction(opt)
MinimumFraction:
    0 MinimumFraction(opt)
OptionalFraction:
    # OptionalFraction(opt)
Exponent:
    E MinimumExponent
MinimumExponent:
    0 MinimumExponent(opt)
In this case I'd expect the behaviour of the formatter to be undefined. That is, it may produce any old thing and we can't rely on that being consistent or meaningful in any way. So, I don't know why you're getting the 23.0, but you can assume that it's nonsense that you should avoid in your code.
Update:
I've just run a debugger through Java 7's DecimalFormat library. The code not only explicitly says that '.#' is allowed, there is a comment in there (java.text.DecimalFormat:2582-2593) that says it's allowed, and an implementation that allows it (line 2597). This seems to be in violation of the documented BNF for the pattern.
Given that this is not documented behaviour, you really shouldn't rely on it as it's liable to change between versions of Java or even library implementations.

The following source comment explains the rather unintuitive handling of ".#". Lines 3383-3385 in my DecimalFormat.java file (JDK 8) have the following comment:
// Handle patterns with no '0' pattern character. These patterns
// are legal, but must be interpreted. "##.###" -> "#0.###".
// ".###" -> ".0##".
It seems the developers chose to interpret ".#" as ".0" (following the ".###" -> ".0##" rule in the comment), instead of what you expected ("0.#").
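The difference can be reproduced with a few lines of Java. The sketch below pins the symbols to Locale.US so that the decimal separator does not depend on the default locale; the class name DecimalFormatDemo is made up, and the ".0" pattern is added only for comparison.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Compares how the patterns from the question treat the value 23.0.
class DecimalFormatDemo {

    private static DecimalFormat create(String pattern) {
        // Fixed symbols so the output uses '.' regardless of the default locale.
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
    }

    static String format(String pattern, double value) {
        return create(pattern).format(value);
    }

    public static void main(String[] args) {
        for (String pattern : new String[] {"0.#", ".#", ".0"}) {
            DecimalFormat df = create(pattern);
            System.out.println(pattern
                    + " -> " + df.format(23.0)
                    + " (minimum fraction digits: " + df.getMinimumFractionDigits() + ")");
        }
    }
}
```

Printing getMinimumFractionDigits() next to each result shows how the parsed pattern was interpreted, without relying on the source line numbers quoted above.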

defining rule for identifiers in ANTLR

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?

It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
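One possible reading of that hint, sketched in plain Java rather than ANTLR: first grab every run of letters, digits and underscores as a single lexeme, and only then decide whether it is an identifier, an integer literal, or an error. The class name PpNumberSketch, the PUNCT label and the error wording are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the "match greedily first, classify afterwards" idea.
class PpNumberSketch {
    // A lexeme is a whole run of letters/digits/underscores, or one other non-space character.
    private static final Pattern LEXEME = Pattern.compile("[A-Za-z0-9_]+|\\S");

    static String classify(String lexeme) {
        if (lexeme.matches("[A-Za-z][A-Za-z0-9_]*")) {
            return "ID";
        }
        if (lexeme.matches("[0-9]+")) {
            return "TOK_INTLIT";
        }
        if (lexeme.matches("[0-9][A-Za-z0-9_]*")) {
            return "ERROR: invalid number '" + lexeme + "'";
        }
        if (lexeme.matches("[^A-Za-z0-9_]")) {
            return "PUNCT";
        }
        return "ERROR: unexpected '" + lexeme + "'";
    }

    static List<String> scan(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            result.add(classify(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(scan("int 2a;"));
    }
}
```

In an ANTLR grammar the same effect would come from a lexer rule that matches the whole digit-led run, so that 2a never splits into two tokens.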

Parsing regular expressions based on a context free grammar

Good evening, Stack Overflow.
I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:
Grammar
Basically, the language is constituted by 2 base statements
( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) //Output: var2 = 150
Now, I'm pretty sure about what I should do in order to interpret a statement: 1) lexical analysis to turn a statement into a sequence of tokens, 2) syntax analysis to get a symbol table (HashMap with the variables and their values) and a syntax tree (to perform the GET statements), and 3) an inorder visit of the tree to get the results I want.
I'd like some advice on the parsing method to read the source file. Considering the parser should ignore any whitespace, tabulation or newline, is it possible to use a Java Pattern to get a general statement I want to analyze? Is there a good way to read a statement weirdly formatted (and possibly more complex) like this
(
SET var
25
)
without confusing the parser with the open and closed parentheses?
For example
Scanner scan; //scanner reading the source file
String pattern = "..." //ideal pattern I've found to represent an expression
while(scan.hasNext(pattern))
Interpreter.computeStatement(scan.next(pattern));
would it be a viable option for this problem?

Solution proposed by Ira Baxter:
Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.
Maybe this will help: stackoverflow.com/a/2336769/120163

In the end, I understood thanks to Ira Baxter that this context-free grammar can't be parsed with RegExp, and I used the concepts of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.
Basically what I do is:
Parse every character and tokenize it (e.g. '(' is OPEN_PAR, "SET" is STATEMENT_SET, and a random letter like 'b' is parsed as a VARIABLE)
Then I use the resulting token list to do a syntactic analysis, which checks the patterns occurring inside the token list against the grammar
If there's an expression inside the statement, I check recursively for any expression inside an expression, throwing an exception and skipping to the next correct statement if needed
After analysing each statement, I compute it as required by the specification
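The steps above can be sketched in a compact form as follows. This is not the author's linked source code, only a minimal illustration that assumes the three forms visible in the question (SET, GET and MUL with integer operands); the class name SExprInterpreter and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: tokenizer plus recursive-descent evaluation of SET/GET/MUL.
class SExprInterpreter {
    private final Map<String, Integer> variables = new HashMap<>();
    private Deque<String> tokens;

    /** Splits the source into "(", ")" and atoms; any whitespace is ignored. */
    static Deque<String> tokenize(String source) {
        Deque<String> out = new ArrayDeque<>();
        String spaced = source.replace("(", " ( ").replace(")", " ) ").trim();
        for (String t : spaced.split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Executes one statement and returns its output line. */
    String execute(String statement) {
        tokens = tokenize(statement);
        expect("(");
        String keyword = next();
        String result;
        if (keyword.equals("SET")) {
            String name = next();
            int value = expression();
            variables.put(name, value);
            result = name + " = " + value;
        } else if (keyword.equals("GET")) {
            result = String.valueOf(expression());
        } else {
            throw new IllegalArgumentException("Unknown statement: " + keyword);
        }
        expect(")");
        if (!tokens.isEmpty()) {
            throw new IllegalArgumentException("Trailing input: " + tokens.peek());
        }
        return result;
    }

    /** expression := number | variable | "(" "MUL" expression expression ")" */
    private int expression() {
        String t = next();
        if (t.equals("(")) {
            String op = next();
            if (!op.equals("MUL")) {
                throw new IllegalArgumentException("Unknown operator: " + op);
            }
            int left = expression();
            int right = expression();
            expect(")");
            return left * right;
        }
        if (t.matches("-?[0-9]+")) {
            return Integer.parseInt(t);
        }
        Integer value = variables.get(t);
        if (value == null) {
            throw new IllegalArgumentException("Undefined variable: " + t);
        }
        return value;
    }

    private String next() {
        String t = tokens.poll();
        if (t == null) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return t;
    }

    private void expect(String expected) {
        String t = next();
        if (!t.equals(expected)) {
            throw new IllegalArgumentException("Expected " + expected + " but found " + t);
        }
    }
}
```

Because the tokenizer discards all whitespace, a statement spread over several lines is handled the same way as one written on a single line, and nested parentheses are resolved by the recursion in expression().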
