Regex are "(" and ")" necessary?

Do I only have to use ( and ) if the pattern consists of multiple groups?
So if the pattern is true|false it doesn't matter if I add parenthesis or not, right?
Then again, if the pattern is POINT_PATTERN("\\((\\d+),(\\d+)\\)") it does make a difference because I want to get two different values from it.
Can I write my current patterns, which are:
NUMBER_PATTERN("(?!(0[0-9]))[0-9]+"),
BOOLEAN_PATTERN("(true|false)"),
STRING_PATTERN("(\\w+)"),
INTEGER_PATTERN("/^([+-]?[1-9]\\d*|0)$/"),
as
NUMBER_PATTERN("(?!0[0-9])[0-9]+"),
BOOLEAN_PATTERN("true|false"),
STRING_PATTERN("\\w+"),
INTEGER_PATTERN("^[+-]?[1-9]\\d*|0$");
without any loss?
I am particularly unsure about NUMBER_PATTERN and INTEGER_PATTERN. Is there any other reason why I shouldn't do this (bad coding style, ...)?

Yes, a sequence of characters or character classes is the default, and it is of higher precedence than the OR operator |. So if you don't have anything in front or behind your sequence (clearly displayed in your cases of true|false) then you don't need them.
However, if you want to use e.g. this is true|false for "this is true" or "this is false" then the precedence will fail for "this is false" and you need to group the true|false, for instance using a non-capturing group, e.g. this is (?:true|false).
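Most of your rewritten expressions behave the same as the originals. INTEGER_PATTERN is the exception: in "^[+-]?[1-9]\\d*|0$" the ^ applies only to the first alternative and the $ only to the second, so it reads as "starts with a non-zero integer" or "ends with 0". With matches() the result is still the same, because the whole input has to match anyway, but with find() it is not. Keep the group, "^([+-]?[1-9]\\d*|0)$", if you rely on the anchors. Also note that the slashes in your original "/^...$/" are literal characters in Java, not delimiters.
If you want to check your patterns, put them into an (online) checker that shows the precedence and see whether the resulting "explanation" changes. The various IDE plugins for regexp testing will hopefully also provide a similar tree view for you.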
Beware that you are sometimes using boundary matchers (^ and $) and sometimes you are not. I'd expect those to be either used or not used.
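To make the precedence point concrete, here is a small sketch. The class name and the two helper methods are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Pattern;

public class AlternationPrecedence {
    // Hypothetical helper: true if the regex matches the whole input.
    static boolean full(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without a group, | splits the whole pattern: "this is true" OR "false".
        System.out.println(full("this is true|false", "this is false"));     // false
        System.out.println(full("this is (?:true|false)", "this is false")); // true

        // Ungrouped anchors: ^ belongs to the first alternative, $ to the second.
        System.out.println(anywhere("^[+-]?[1-9]\\d*|0$", "12abc"));   // true
        System.out.println(anywhere("^([+-]?[1-9]\\d*|0)$", "12abc")); // false

        // matches() hides the difference because it demands a full match anyway.
        System.out.println(full("^[+-]?[1-9]\\d*|0$", "12abc"));       // false
    }
}
```

So whether the parentheses around an alternation matter depends on what surrounds it and on whether you call matches() or find().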

Related

Java Matcher matches() method to match the entire region against the pattern

I have a pattern (\{!(.*?)\})+ that can be used to validate an expression of format {!someExpression} one or more number of times.
I am performing
Pattern.compile("(\\{!(.*?)\\})+").matcher("{!expression1} {!expression2}").matches() to match the entire region against the pattern.
There is a space between expression1 and expression2.
Expected -> false
Actual -> true
I tried both greedy and lazy quantifiers but not able to figure out the catch here. Any help is appreciated.

Of course it matches. Your regexp says so. matches() matches the whole string, so you're doing exactly what you are asking. The point is, that regex matches the whole string. Try it in any regex tool.
Specifically, (.*?) will happily match expression1} {!expression2. Why shouldn't it? A dot matches } just as well as x. You asked for 'non-greedy', but with matches() that cannot change whether the input matches; it only affects how a successful match is divided among the groups. Non-greedy does not mean 'magically do what I want', however useful that might seem.
As a general rule if you're using non-greediness you're doing it wrong. It's not a universal rule; if you really know what you're doing (mostly: That you're modifying how backrefs / group matches / find() ends up spacing it out), it's fine. If you're tossing non-greediness in there as you write your regexp that's usually a sign you misunderstand what you're actually writing down.
Presumably, your intent with the non-greedy operator here is that you do not want it to also consume the } that 'ends' the {!expr} block.
In which case, just ask for that then: "Consume everything that isn't a }":
Pattern.compile("(\\{!([^}]*)\\})+").matcher("{!expression1} {!expression2}").matches()
works great.
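A quick way to see the difference, as a sketch (the class and constant names are made up here; the two patterns and the test string are the ones from the question and the fix above):

```java
import java.util.regex.Pattern;

public class BraceBlocks {
    // Pattern from the question: the lazy dot may still run across "} {!".
    static final Pattern LAZY = Pattern.compile("(\\{!(.*?)\\})+");
    // Pattern from the answer: the body may not contain a closing brace.
    static final Pattern NEGATED = Pattern.compile("(\\{!([^}]*)\\})+");

    public static void main(String[] args) {
        String spaced = "{!expression1} {!expression2}";
        String adjacent = "{!expression1}{!expression2}";

        System.out.println(LAZY.matcher(spaced).matches());      // true
        System.out.println(NEGATED.matcher(spaced).matches());   // false
        System.out.println(NEGATED.matcher(adjacent).matches()); // true
    }
}
```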
If your intent is instead that expressions can also contain {} symbols and that this is a much more convoluted grammar, then your question cannot be answered without a full breakdown of what that grammar entails. Note that many grammars are not 'regular' (a specific term that refers to a subset of all imaginable grammars), and input that follows such a grammar cannot be parsed with a regular expression. That's what the 'regular' in regular expression refers to: a class of grammars. Regexes can be used meaningfully on anything that fits a regular grammar. They are useless for anything that doesn't, even if it seems like it could work. Thus, if there is a sizable grammar behind this {expr} syntax, it's possible you need an actual full parser for it.
As a simple example, the Java language is not regular and therefore cannot meaningfully be parsed with regexes (that is: whatever aim your regex has, I can write a valid Java file that the compiler understands but your regex won't handle).

How to get best match using java.util.regex.Pattern

Here is my use case. I have different file processing modules which are invoked based on the file name. So if the filename matches the pattern associated with a certain module, that module will pick up the file.
I have a catch all pattern defined which is used to do default processing, but this pattern should only kick in if I haven't got a better match.
Consider the following scenario
Pattern 1 - Sample_[0-9]*.xls
Pattern 2 - [a-zA-Z]*_[0-9]*.xls
Now given a file "Sample_11", I want Pattern 1 to be applied as it's a better match than Pattern 2; however, the method java.util.regex.Pattern.matcher().matches() just returns true or false.
Is there any way to identify what is the better match?
EDIT:
The patterns are defined outside the system (this is a weird use case), so I cannot order them as suggested by many. In a sense I am looking to infer from the results of matching whether that is the best match or not. Hope this clarifies my question.
Thanks,
Raam

Use the chain of responsibility design pattern. Loop (or iterate down a list) through each regex Pattern from most specific to least specific until you find one that matches. Then do the appropriate processing for that match.

Why is the Boolean not sufficient here? Your logic should be checking a more specific regex (or list of regex) first, going down the code path tied to whatever specific regex matches. It should only go on to the catch all if it found no match for the specific patterns. I think the Boolean should work fine for you unless there is more to your problem that I don't see.
Imagine a Map where the key is the pattern and the value is a custom interface for handling a match (let's call it MatchHandler). Iterate the map and if a pattern matches, invoke that MatchHandler. If no match, check the default pattern and if a match, invoke the default MatchHandler. If you needed ordered processing you could use a LinkedHashMap.
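A minimal sketch of that idea. MatchHandler is the interface name suggested above; everything else (FileDispatcher, register, dispatch, example, the handler bodies) is hypothetical. It also assumes the dot before xls is meant literally, so it is escaped here, and that file names include the extension.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

public class FileDispatcher {
    // Hypothetical callback invoked for the first pattern that matches.
    interface MatchHandler {
        String handle(String fileName);
    }

    // LinkedHashMap iterates in insertion order, so register specific patterns first.
    private final Map<Pattern, MatchHandler> handlers = new LinkedHashMap<>();

    FileDispatcher register(String regex, MatchHandler handler) {
        handlers.put(Pattern.compile(regex), handler);
        return this;
    }

    // Returns the result of the first matching handler, or empty if none matched.
    Optional<String> dispatch(String fileName) {
        for (Map.Entry<Pattern, MatchHandler> entry : handlers.entrySet()) {
            if (entry.getKey().matcher(fileName).matches()) {
                return Optional.of(entry.getValue().handle(fileName));
            }
        }
        return Optional.empty();
    }

    // Example wiring: the catch-all pattern goes last.
    static FileDispatcher example() {
        return new FileDispatcher()
            .register("Sample_[0-9]*\\.xls", name -> "sample:" + name)
            .register("[a-zA-Z]*_[0-9]*\\.xls", name -> "default:" + name);
    }

    public static void main(String[] args) {
        FileDispatcher d = example();
        System.out.println(d.dispatch("Sample_11.xls")); // Optional[sample:Sample_11.xls]
        System.out.println(d.dispatch("Other_7.xls"));   // Optional[default:Other_7.xls]
        System.out.println(d.dispatch("notes.txt"));     // Optional.empty
    }
}
```

This only works when the registration order is under your control, which is exactly the limitation discussed next.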
Now if you won't know the patterns beforehand (and it sounds like that's the case for you), then things get a little more tricky. One possible answer would be to write another regex that counts the occurrences of general matching constructs in the pattern (things like [a-z], *, etc). Patterns with more occurrences of these general matching constructs will be less specific matches. It's not perfect but it could work for what you are doing. Just be sure to do a lot of escaping in this other pattern, since it is looking for regex constructs using regex itself.

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
(1) list of regular expressions
(2) list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern matches the string, add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manner of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in Java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.

Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. That way you can ensure that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot actually be part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.
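Here is a small sketch of why the rewrite is needed and what its result looks like. It does not implement the pattern parser described above; the renamed pattern is written by hand, and the class name, field names and the group names g1/g2 are made up. The unmatched method is just the naive double loop from the question, kept as a baseline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CombineRegexes {
    // Two user patterns, "(a)\1" and "(b)\1", joined naively with |.
    // The second \1 now refers to group 1, which is (a), not (b).
    static final Pattern NAIVE = Pattern.compile("(?:(a)\\1)|(?:(b)\\1)");

    // The same two patterns after renaming every group and reference by hand.
    static final Pattern RENAMED =
        Pattern.compile("(?:(?<g1>a)\\k<g1>)|(?:(?<g2>b)\\k<g2>)");

    // Baseline from the question: keep the strings that no pattern matches.
    static List<String> unmatched(List<Pattern> patterns, List<String> inputs) {
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            boolean hit = false;
            for (Pattern p : patterns) {
                if (p.matcher(input).matches()) {
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                result.add(input);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(NAIVE.matcher("aa").matches());   // true
        System.out.println(NAIVE.matcher("bb").matches());   // false: \1 points at an unset group
        System.out.println(RENAMED.matcher("bb").matches()); // true

        List<Pattern> separate = Arrays.asList(
            Pattern.compile("(a)\\1"), Pattern.compile("(b)\\1"));
        List<String> inputs = Arrays.asList("aa", "bb", "ab");
        System.out.println(unmatched(separate, inputs));                // [ab]
        System.out.println(unmatched(Arrays.asList(RENAMED), inputs));  // [ab]
    }
}
```

Whether the combined pattern is actually faster than the double loop is something you would have to measure; java.util.regex is a backtracking engine and makes no promise of merging alternatives.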

with regex, is using both "is" and "is not" range definitions within the same range possible?

Note: I'm using a 3rd party app that uses regex for searches which has its own flavor but almost always works like java's flavor of regex. Of course this may not matter.
After searching for many different ways of this same question (phrased many ways), I did not see any tutorials, examples, or even mentions of whether it is possible to use both an "is" (positive?) and "is not" (negative?) definition within the same range.
I can't test the example right now in the app to see if my ideas work, because the amount of data being searched is massive and a test would screw up the matches it has already gathered. That is the only reason I'm asking.
Here are examples of what I thought might work but caused tester to act weird:
[\w^\s<>.!?]{2}
[\w|^\s<>.!?]{2}
I would rather have it work the way I think the first one would work (any digit, lower case, or upper case character, or other normal character that is not a space, >, <, period, !, or ?) rather than the second, which only has an or operator.
The regex testers I used gave me different funky results which is what is confusing me.
Also note: I'm using this within a capture group which is followed by a catch everything match which I may or may not be using properly. So if you'd like to include how to follow what I'm attempting with how to properly do that, feel free. I AM MAINLY JUST CURIOUS WHETHER THIS IS POSSIBLE OR NOT, OR WHETHER IT IS AN IMPROPER METHOD.

Why do you need the \w at all?
[^\s<>.!?]{2}
This already matches all alphanumeric characters since they are neither space nor any of the punctuation characters you mentioned.
In general, you can subtract character classes to some degree. For example, to match alphanumerics excluding digits, you can do
[^\W\d]
because [^\W] matches the same as \w, and \d is subtracted from that because it's in a negated character class.
Edit:
Some regex engines (like XPath, .NET and JGSoft) allow flexible character class subtraction like this:
[a-z-[e-g]]
to match any character from the range [a-z], excluding e, f and g. Java does not support this syntax, but it gets the same effect with character class intersection: [a-z&&[^e-g]].
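For Java specifically, both tricks can be tried out with a few lines. Java lacks the -[...] subtraction syntax, but intersecting with a negated class via && gives the same result. This is only a sketch; the class, constant and helper names are made up.

```java
import java.util.regex.Pattern;

public class ClassSubtraction {
    // Word characters minus digits: letters and the underscore.
    static final Pattern WORD_NO_DIGIT = Pattern.compile("[^\\W\\d]");
    // a-z without e, f, g, written as an intersection with a negated class.
    static final Pattern A_TO_Z_WITHOUT_EFG = Pattern.compile("[a-z&&[^e-g]]");

    // Hypothetical helper: does the single character c match the class?
    static boolean is(Pattern p, char c) {
        return p.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(is(WORD_NO_DIGIT, 'x'));      // true
        System.out.println(is(WORD_NO_DIGIT, '_'));      // true
        System.out.println(is(WORD_NO_DIGIT, '7'));      // false
        System.out.println(is(A_TO_Z_WITHOUT_EFG, 'd')); // true
        System.out.println(is(A_TO_Z_WITHOUT_EFG, 'f')); // false
    }
}
```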

Another possibility is to use two ranges and combine them; e.g.
([\w]|[^\s<>.!?]){2}
However, this does bring up the question of what you are actually trying to express here. Because this example (as I've rewritten it) doesn't make a lot of sense.
What it says is "a word character, or any character that is not whitespace or certain punctuation". But the class of characters that are not "whitespace or certain punctuation" ALREADY includes all of the word characters. So, unless you mean something different, the \w is redundant.

From your question, it looks like a no-space regex would match your needs, you can achieve that with:
[\S]{2}

Regex to find variables and ignore methods

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The actual code (the one which executes regex) is written in Java.
For now, I've got something like this:
Matcher matcher = Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
So, when value of "string" is variable*func()*20
printout is:
variable
func
Which is not what I want. The simple negation of ( won't do, because it makes regex catch unnecessary characters or cuts them off, but still functions are captured. For now, I have the following code:
Matcher matcher = Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while (matcher.find()) {
    if (matcher.group(3).isEmpty()) {
        System.out.println(matcher.group(2));
    }
}
It works, the printout is correct, but I don't like the additional check. Any ideas? Please?
EDIT (2011-04-12):
Thank you for all the answers. There were questions about why I would need something like that. And you are right, in the case of bigger, more complicated scripts, the only sane solution would be parsing them. In my case, however, this would be excessive. The scraps of JS I'm working on are intended to be simple formulas, something like (a+b)/2. No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need the list of variables to check whether they can be initialized at this point (and whether they are initialized at all). I realize that all of it can be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped in a bigger script and evaluated in a web browser, so it's more convenient this way.
This may be a bit dirty, but it's assumed that whoever is writing these formulas (probably me, most of the time) knows what they are doing and is able to check that the formulas work correctly.
Anyone who finds this question and wants to do something similar should know the risks/difficulties. I do, at least I hope so ;)

Taking all the sound advice about how regex is not the best tool for the job into consideration is important. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):
Pattern regex = Pattern.compile(
    "\\b          # word boundary\n" +
    "[A-Za-z]     # 1 ASCII letter\n" +
    "\\w*         # 0+ word characters (alnum or _)\n" +
    "\\b          # word boundary\n" +
    "(?!          # Lookahead assertion: Make sure there is no...\n" +
    "  \\s*       # optional whitespace\n" +
    "  \\(        # opening parenthesis\n" +
    ")            # ...at this position in the string",
    Pattern.COMMENTS);
This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1). And of course this matches lots of other stuff (inside strings, comments, etc.)...
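Wired into a loop, the same idea might look like the sketch below. The pattern is the one above without the comments; the class and method names are made up, and the last call shows one of the limitations (a qualifier like Math is reported as a variable).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableFinder {
    // Identifier that is not followed by optional whitespace and "(".
    static final Pattern IDENT_NOT_CALLED =
        Pattern.compile("\\b[A-Za-z]\\w*\\b(?!\\s*\\()");

    // Collects every identifier that is not immediately called.
    static List<String> variables(String formula) {
        List<String> names = new ArrayList<>();
        Matcher m = IDENT_NOT_CALLED.matcher(formula);
        while (m.find()) {
            names.add(m.group()); // whole match, i.e. group(0)
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(variables("variable*func()*20")); // [variable]
        System.out.println(variables("(a+b)/2"));            // [a, b]
        System.out.println(variables("Math.max(a, b2)"));    // [Math, a, b2]
    }
}
```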

If you are rethinking using regex and wondering what else you could do, you could consider using an AST instead to access your source programmatically. This answer shows you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do something similar for JavaScript.

A regex won't cut it in this case because Java isn't regular. Your best bet is to get a parser that understands Java syntax and build onto that. Luckily, ANTLR has a Java 1.6 grammar (and 1.5 grammar).
For your rather limited use case you could probably easily extend the variable assignment rules and get the info you need. It's a bit of a learning curve but this will probably be your best bet for a quick and accurate solution.

It's pretty well established that regex cannot be reliably used to parse structured input. See here for the famous response: RegEx match open tags except XHTML self-contained tags
As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).
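As a rough illustration of that split for the simple formulas in the question, the sketch below uses a regex only to cut the input into tokens and then decides what an identifier means by looking at the token that follows it. The token pattern, the class name and the single "followed by ( means call" rule are assumptions made for this example, not a real JavaScript lexer or parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormulaLexer {
    // Lexing: identifier | number | any other single non-space character.
    static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|\\S");

    static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // "Parsing" reduced to one rule: an identifier followed by "(" is a call.
    static List<String> variables(String source) {
        List<String> toks = tokens(source);
        List<String> vars = new ArrayList<>();
        for (int i = 0; i < toks.size(); i++) {
            boolean isIdent = toks.get(i).matches("[A-Za-z_]\\w*");
            boolean isCall = i + 1 < toks.size() && toks.get(i + 1).equals("(");
            if (isIdent && !isCall) {
                vars.add(toks.get(i));
            }
        }
        return vars;
    }

    public static void main(String[] args) {
        System.out.println(tokens("variable*func()*20"));    // [variable, *, func, (, ), *, 20]
        System.out.println(variables("variable*func()*20")); // [variable]
    }
}
```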
