antlr4 grammar string with number

I have a problem with antlr4 grammar in java.
I would like to have a lexer rule that is able to match all of the following inputs:
Only letters
Letters and numbers
Only numbers
My code looks like this:
parser rule:
new_string: NEW_STRING+;
lexer rule:
NEW_DIGIT: [0-9]+;
STRING_CHAR : ~[;\r\n"'];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ | STRING_CHAR+ NEW_DIGIT+);
I know there must be an obvious solution, but I have been trying to find one, and I can't seem to figure out a way.
Thank you in advance!

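Since the first two lexer rules are not fragments, they produce tokens of their own. ANTLR takes the longest match, and if two rules can match an equally long sequence of input, the rule defined first wins. So input that contains just digits becomes NEW_DIGIT, and a single character matching ~[;\r\n"'] becomes STRING_CHAR.
In both cases NEW_STRING matches the same text but loses the tie, so the parser does not get the NEW_STRING token it expects. Note also that STRING_CHAR itself matches digits, so the STRING_CHAR+ alternative can swallow digits too.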
You need to:
make sure STRING_CHAR does not match digits
make NEW_DIGIT and STRING_CHAR fragments
check the repetition operators (+): almost everything is allowed to repeat in your lexer, which doesn't make sense at first look (but you need to adjust that according to your requirements, which we do not know)
Like this:
fragment NEW_DIGIT: [0-9];
fragment STRING_CHAR : ~[;\r\n"'0-9];
NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ (NEW_DIGIT+)?);
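To get a feel for which inputs the corrected NEW_STRING rule accepts as a single token, here is a rough Java regex analogue. It is only a sketch: the class and method names are made up for illustration, and a regex with matches() does not reproduce ANTLR's longest-match and tie-breaking behaviour, it only mirrors the shape of the rule.

```java
import java.util.regex.Pattern;

public class NewStringShape {
    // Regex analogue of: NEW_STRING: (NEW_DIGIT+ | STRING_CHAR+ (NEW_DIGIT+)?);
    // with NEW_DIGIT = [0-9] and STRING_CHAR = ~[;\r\n"'0-9].
    static final Pattern NEW_STRING =
            Pattern.compile("[0-9]+|[^;\\r\\n\"'0-9]+[0-9]*");

    // True if the whole input has the shape of one NEW_STRING token.
    static boolean isSingleToken(String input) {
        return NEW_STRING.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abc123", "123", "123abc", "a;b"}) {
            System.out.println(s + " -> " + isSingleToken(s));
        }
    }
}
```

Input such as 123abc is not one token under this rule; in the grammar it would be covered by the parser rule new_string: NEW_STRING+, which accepts several tokens in a row.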

Java Matcher matches() method to match the entire region against the pattern

I have a pattern (\{!(.*?)\})+ that can be used to validate an expression of format {!someExpression} one or more number of times.
I am performing
Pattern.compile("(\\{!(.*?)\\})+").matcher("{!expression1} {!expression2}").matches() to match the entire region against the pattern.
There is a space between expression1 and expression2.
Expected -> false
Actual -> true
I tried both greedy and lazy quantifiers but am not able to figure out the catch here. Any help is appreciated.

Of course it matches. Your regexp says so. matches() matches the whole string, so you're doing exactly what you are asking. The point is, that regex matches the whole string. Try it in any regex tool.
Specifically, (.*?) will happily match expression1} {!expression2. Why shouldn't it? You said 'non-greedy', which doesn't do anything here unless we're talking about subgroup matching; with matches(), non-greediness cannot change whether the string matches, it only affects, if it matches, how the groups are divided out. Non-greedy does not mean 'magically do what I want you to', however useful that might seem to be. The dot will match } just as well as x.
As a general rule if you're using non-greediness you're doing it wrong. It's not a universal rule; if you really know what you're doing (mostly: That you're modifying how backrefs / group matches / find() ends up spacing it out), it's fine. If you're tossing non-greediness in there as you write your regexp that's usually a sign you misunderstand what you're actually writing down.
Presumably, your intent with the non-greedy operator here is that you do not want it to also consume the } that 'ends' the {!expr} block.
In which case, just ask for that then: "Consume everything that isn't a }":
Pattern.compile("(\\{!([^}]*)\\})+").matcher("{!expression1} {!expression2}").matches()
works great.
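A small sketch that puts the two patterns side by side. The class and method names are invented for the example; the patterns and the input string are the ones from the question and from the fix above.

```java
import java.util.regex.Pattern;

public class BraceExpressionCheck {
    // Original pattern: the lazy dot can still expand across "} {!".
    static final Pattern LAZY = Pattern.compile("(\\{!(.*?)\\})+");
    // Fixed pattern: the body may contain anything except '}'.
    static final Pattern NEGATED = Pattern.compile("(\\{!([^}]*)\\})+");

    static boolean lazyMatches(String s) {
        return LAZY.matcher(s).matches();
    }

    static boolean negatedMatches(String s) {
        return NEGATED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String withSpace = "{!expression1} {!expression2}";
        String adjacent = "{!expression1}{!expression2}";
        System.out.println(lazyMatches(withSpace));    // true
        System.out.println(negatedMatches(withSpace)); // false
        System.out.println(negatedMatches(adjacent));  // true
    }
}
```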
If your intent is instead that expressions can also contain {} symbols and that this is a much more convoluted grammar system, then your question cannot be answered without a full breakdown of what the grammar system entails. Note that many grammars are not 'regular' (that's a specific term that refers to a subset of all imaginable grammars), and a grammar that is not regular cannot be parsed with a regular expression. That's what the 'regular' in regular expression refers to: a class of grammars. Regexes can be used meaningfully on anything that fits a regular grammar. They are useless for anything that isn't, even if it seems like it could work. Thus, if there is a sizable grammar behind this {expr} syntax, it's possible you need an actual full parser for it.
As a simple example, Java the language is not regular and therefore cannot meaningfully be parsed with regexes (that is: whatever aim your regex has, I can write a valid Java file that the compiler understands and your regex won't).

Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.
The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)
The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!
In my code, I have tried a couple of ways to match regex:
if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}
All three suffer from poor performance when the regex mentioned above is added to the list of regexes. Any suggestions to improve regex performance or change the regex itself?
The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.
I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group
(?>(.+?)#011)
Once matched, there's no more backtracking. Done and looking forward for the next group.
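Both suggestions can be sketched in plain Java (the question's Scala code delegates to java.util.regex, so the same patterns apply). The class name, method names and sample line are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlattenedTabFields {
    // Literal separator; quoted so it is never interpreted as regex syntax.
    static final Pattern SEPARATOR = Pattern.compile(Pattern.quote("#011"));
    // Atomic group: once a field plus its separator matched, no backtracking into it.
    static final Pattern ATOMIC_FIELD = Pattern.compile("(?>(.+?)#011)");

    // Simplest approach: split on the separator, keeping trailing empty fields.
    static String[] splitFields(String line) {
        return SEPARATOR.split(line, -1);
    }

    // Regex approach: collects every field that is followed by a separator.
    static List<String> atomicFields(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = ATOMIC_FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "a b#011c#011d";
        System.out.println(String.join("|", splitFields(line))); // a b|c|d
        System.out.println(atomicFields(line));                  // [a b, c]
    }
}
```

Note that the atomic-group version only returns fields that are terminated by #011, so the last field of the line still needs its own (.+) at the end of the full pattern.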
Negating a string
The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:
((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).
An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:
((?: (?!#011). )+)#011
It's all pretty complicated and most probably less performant than simply using the atomic group.
Optimizations
Of my regexes above, the first one is best. However, as Casimir et Hippolyte wrote, there's room for improvement (a factor of 1.8):
( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.
A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.
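Here is how the optimized expression might be used from Java, written without the readability blanks (in Java's COMMENTS mode a # would start a comment, so the blanks are simply dropped instead). The class name, method name and sample input are assumptions for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnrolledFieldPattern {
    // [^#]*+            any run of non-'#' characters, possessively
    // (?:#(?!011)[^#]*)*+  a '#' that does not start "#011", then more non-'#'
    // #011              the flattened tab that ends the field
    static final Pattern FIELD =
            Pattern.compile("([^#]*+(?:#(?!011)[^#]*)*+)#011");

    // Returns every field that is terminated by "#011".
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("a#b#011c d#011")); // [a#b, c d]
    }
}
```

A lone # inside a field (as in a#b) stays part of the field, because only the full #011 sequence ends it.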

defining rule for identifiers in ANTLR

I'm trying to write a grammar in ANTLR, and the rules for recognizing IDs and int literals are written as follows:
ID : Letter(Letter|Digit|'_')*;
TOK_INTLIT : [0-9]+ ;
//this is not the complete grammar btw
and when the input is :
void main(){
int 2a;
}
the problem is, the lexer is recognizing 2 as an int literal and a as an ID, which is completely logical based on the grammar I've written, but I don't want 2a to be recognized this way, instead I want an error to be displayed since identifiers cannot begin with something other than a letter... I'm really new to this compiler course... what should be done here?

It's at least interesting that in C and C++, 2n is an invalid number, not an invalid identifier. That's because the C lexer (or, to be more precise, the preprocessor) is required by the standard to interpret any sequence of digits and letters starting with a digit as a "preprocessor number". Later on, an attempt is made to reinterpret the preprocessor number (if it is still part of the preprocessed code) as one of the many possible numeric syntaxes. 2n isn't, so an error will be generated at that point.
Preprocessor numbers are more complicated than that, but that should be enough of a hint for you to come up with a simple solution for your problem.
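One way to read that hint, sketched as a Java regex rather than as an ANTLR rule: add a third token class for "digits immediately followed by identifier characters" and report it as an error. The group names, the class name and the assumption that Letter means [A-Za-z] are all made up for this illustration; they are not taken from the question's grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexemeClassifier {
    // Assumes Letter = [A-Za-z] and Digit = [0-9] (not shown in the question).
    static final Pattern LEXEME = Pattern.compile(
            "(?<id>[A-Za-z][A-Za-z0-9_]*)"                // ID
            + "|(?<badnum>[0-9]+[A-Za-z_][A-Za-z0-9_]*)"  // e.g. 2a: starts like a number
            + "|(?<intlit>[0-9]+)");                      // TOK_INTLIT

    // Classifies one whole lexeme; BAD_NUMBER is where the error would be reported.
    static String classify(String lexeme) {
        Matcher m = LEXEME.matcher(lexeme);
        if (!m.matches()) return "UNKNOWN";
        if (m.group("id") != null) return "ID";
        if (m.group("badnum") != null) return "BAD_NUMBER";
        return "TOK_INTLIT";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a2", "42", "2a"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

In an ANTLR lexer the same idea would be an extra rule that matches the longer 2a sequence, so that longest-match picks it over TOK_INTLIT followed by ID.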

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern match the string add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is obviously (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manners of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.

Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot actually be part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Keep in mind that a Java group name must start with a letter and may contain only letters and digits, so put a letter in front of the number. Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.
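A tiny demonstration of why plain OR-combining breaks backreferences and how unique group names avoid it. The two input patterns, (a)\1 and (b)\1, and the generated names p1g1 and p2g1 are invented for this sketch; the rewriting itself is done by hand here, not by a parser as described above.

```java
import java.util.regex.Pattern;

public class CombinedBackreferences {
    // Naive combination of "(a)\1" and "(b)\1": in the second alternative
    // (b) is now group 2, but \1 still points at group 1.
    static final Pattern NAIVE =
            Pattern.compile("(?:(a)\\1)|(?:(b)\\1)");

    // Same two patterns after renaming every group to a unique name
    // and rewriting the backreferences accordingly.
    static final Pattern RENAMED =
            Pattern.compile("(?:(?<p1g1>a)\\k<p1g1>)|(?:(?<p2g1>b)\\k<p2g1>)");

    static boolean naiveMatches(String s) {
        return NAIVE.matcher(s).matches();
    }

    static boolean renamedMatches(String s) {
        return RENAMED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(naiveMatches("aa"));   // true
        System.out.println(naiveMatches("bb"));   // false: \1 refers to an unset group
        System.out.println(renamedMatches("aa")); // true
        System.out.println(renamedMatches("bb")); // true
    }
}
```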

Parsing a code block with EBNF expression

I am using CocoR to generate a java-like scanner/parser:
I'm having some trouble creating an EBNF expression to match a code block:
I'm assuming a code block is surrounded by two well-known tokens:
<& and &>
example:
public method(int a, int b) <&
various code
&>
If I define a nonterminal symbol
codeblock = "<&" {ANY} "&>"
then, if the code inside the two symbols contains a '<' character, the generated compiler does not handle it and gives a syntax error.
Any hint?
Edit:
COMPILER JavaLike
CHARACTERS
nonZeroDigit = "123456789".
digit = '0' + nonZeroDigit .
letter = 'A' .. 'Z' + 'a' .. 'z' + '_' + '$'.
TOKENS
ident = letter { letter | digit }.
PRODUCTIONS
JavaLike = {ClassDeclaration}.
ClassDeclaration ="class" ident ["extends" ident] "{" {VarDeclaration} {MethodDeclaration }"}" .
MethodDeclaration ="public" Type ident "("ParamList")" CodeBlock.
Codeblock = "<&" {ANY} "&>".
I have omitted some productions for the sake of simplicity.
This is my actual implementation of the grammar. The main bug is that it fails if the code in the block contains one of the symbols '>' or '&'.

Nick, late to the party here ...
A number of ways to do this:
Define tokens for <& and &> so the lexer knows about them.
You may be able to use a COMMENTS directive
COMMENTS FROM <& TO &> - quoted as CoCo expects.
Or hack NextToken() in your scanner.frame file. Do something like this (pseudo-code):
if (Peek() == CODE_START)
{
    while (NextToken() != CODE_END)
    {
        // eat tokens
    }
}
Or you can override the Read() method in the Buffer and eat at the lowest level.
HTH
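The "eat at the lowest level" idea, reduced to plain string handling. This is not Coco/R code: the class, the constants and the method are hypothetical, and it only shows how the raw text between <& and &> can be skipped without tokenizing it, so that characters like <, > and & inside the block do no harm.

```java
public class CodeBlockSkipper {
    static final String CODE_START = "<&";
    static final String CODE_END = "&>";

    // Returns the raw text between the first "<&" at or after 'from'
    // and the next "&>", or null if no complete block is found.
    static String blockBody(String source, int from) {
        int start = source.indexOf(CODE_START, from);
        if (start < 0) return null;
        int bodyStart = start + CODE_START.length();
        int end = source.indexOf(CODE_END, bodyStart);
        if (end < 0) return null;
        return source.substring(bodyStart, end);
    }

    public static void main(String[] args) {
        String src = "public method(int a, int b) <& if (a < b && b > 0) x(); &>";
        System.out.println("[" + blockBody(src, 0) + "]");
    }
}
```

This sketch does not handle nested blocks or a &> that appears inside a string literal; whether that matters depends on the language being parsed.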

You can expand the ANY term to include <&, &>, and another nonterminal (call it ANY_WITHIN_BLOCK say).
Then you just use
ANY = "<&" | {ANY_WITHIN_BLOCK} | "&>"
codeblock = "<&" {ANY_WITHIN_BLOCK} "&>"
And then the meaning of {ANY} is unchanged if you really need it later.
Okay, I didn't know anything about CocoR and gave you a useless answer, so let's try again.
As I started to say later in the comments, I feel that the real issue is that your grammar might be too loose and not sufficiently well specified.
When I wrote the CFG for the one language I've tried to create, I ended up using a sort of "meet-in-the-middle" approach: I wrote the top-level structure AND the immediate low-level combinations of tokens first, and then worked to make them meet in the mid-level (at about the level of conditionals and control flow, I guess).
You said this language is a bit like Java, so let me just show you the first lines I would write as a first draft to describe its grammar (in pseudocode, sorry. Actually it's like yacc/bison. And here, I'm using your brackets instead of Java's):
/* High-level stuff */
program: classes
classes: main-class inner-classes
inner-classes: inner-classes inner-class
    | /* empty */
main-class: class-modifier "class" identifier class-block
inner-class: "class" identifier class-block
class-block: "<&" class-decls "&>"
class-decls: field-decl
    | method
method: method-signature method-block
method-block: "<&" statements "&>"
statements: statements statement
    | /* empty */
class-modifier: "public"
    | "private"
identifier: /* well, you know */
And at the same time as you do all that, figure out your immediate token combinations, like for example defining "number" as a float or an int and then creating rules for adding/subtracting/etc. them.
I don't know what your approach is so far, but you definitely want to make sure you carefully specify everything and use new rules when you want a specific structure. Don't get ridiculous with creating one-to-one rules, but never be afraid to create a new rule if it helps you organize your thoughts better.
