Regular expression handlers within Java

Hi, I have a series of regular expressions that I am trying to match against an input string. I then want to pass the string to a handler that performs some function on it, based on which regular expression it matched. Is there an elegant way to do this, or is a series of if statements my best option?

You can probably combine your regexes into a single one like this: (regex1)|(regex2)|...|(regexN). Once the combined regex matches, you can query the Matcher object to see which group is non-null (a group that did not take part in the match returns null from group()) and choose the function based on that.
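A minimal sketch of that idea, assuming Java 8 or later. The class name RegexDispatch, the two sample patterns and the handler functions are hypothetical and only there for illustration; named groups are used instead of numbered ones so the lookup does not depend on counting parentheses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDispatch {
    // Hypothetical handlers, keyed by the name of the group that selects them.
    private static final Map<String, Function<String, String>> HANDLERS = new LinkedHashMap<>();
    static {
        HANDLERS.put("number", s -> "number:" + s);
        HANDLERS.put("word", s -> "word:" + s);
    }

    // One alternative per handler; each alternative is wrapped in a named group.
    private static final Pattern COMBINED =
            Pattern.compile("(?<number>\\d+)|(?<word>[a-z]+)");

    static String dispatch(String input) {
        Matcher m = COMBINED.matcher(input);
        if (!m.matches()) {
            return "unhandled:" + input;
        }
        // Exactly one alternative matched; its group is the one that is non-null.
        for (Map.Entry<String, Function<String, String>> e : HANDLERS.entrySet()) {
            if (m.group(e.getKey()) != null) {
                return e.getValue().apply(input);
            }
        }
        return "unhandled:" + input;
    }
}
```

With these sample patterns, RegexDispatch.dispatch("42") goes to the number handler and RegexDispatch.dispatch("abc") to the word handler. If the individual regexes contain capturing groups of their own, named groups (or non-capturing groups inside each alternative) keep the lookup unambiguous.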

Related

I want to capture an alphanumeric group without the underscore

I want to capture an alphanumeric group in a regex such that it does not capture the leading underscore. For example, _reverse(abc) should return reverse(. I am using (?<name>\w+) but it returns _reverse(.
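You can try this (presumably as the pattern for a replaceAll with an empty replacement, so that every matching character is removed):
[^a-zA-Z0-9()\\s+]
The output will be reverse(abc).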
You can specify characters explicitly, e.g.:
[a-zA-Z0-9]+
From what you are showing, I assume you want to strip underscores and content behind the opening parentheses.
Basically, that should work with a regex like this:
"_([a-zA-Z0-9]+\()"
This can be used in conjunction with a Matcher to extract all capturing groups (in this case, [a-zA-Z0-9]+\() and return them.
Note that you can find almost all the help you need with regular expressions on utility sites like RegEx 101 and Regexper, the latter being a nice visualizer that only works with JavaScript-style expressions.
RegEx 101 also contains a regex debugger to help avoid dangerous regular expressions.
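As a rough sketch of the Matcher approach described above (the class and method names are made up for illustration), the capturing group can be read back with group(1):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreStrip {
    // Matches the underscore, then captures the alphanumeric name plus the "(".
    private static final Pattern NAME = Pattern.compile("_([a-zA-Z0-9]+\\()");

    // Returns the captured text, or null when the input contains no match.
    static String extract(String input) {
        Matcher m = NAME.matcher(input);
        return m.find() ? m.group(1) : null;
    }
}
```

For the input from the question, UnderscoreStrip.extract("_reverse(abc)") yields reverse(.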

Regular Expressions match randomly instead of around quotes in Java

I am writing a program in Java that uses regular expressions, and I have run into a problem. I am basically building a programming language and parsing it line by line. Things go wrong when the parser tries to find strings. I have to match tokens in the order identifiers, strings, then integers, but I can't have the identifier pattern match text inside strings. Strings are delimited by double quotes. Here is my expression:
[^"]([^\W][a-zA-Z0-9]+)[^"]
I cannot show my Java code, because it is all over the place, with the way I programmed it. It should just be the expression, and that's it.
It would be helpful if you can explain more what exactly you are trying to match. E.g. give some example texts and what your expression currently outputs for them.
At the moment I think you are trying to match Strings, text that is surrounded by ". For example foofoo"text123"barbar and your desired output is text123.
When you define a regular expression in a Java string literal, you need to escape characters such as " and \. Here is a Java-usable version of the regex you provided:
Pattern pattern = Pattern.compile("[^\"]([^\\W][a-zA-Z0-9]+)[^\"]");
You can then use the Pattern object together with a Matcher object to find your text; see the Javadoc for Pattern.
Here is a Pattern that matches text surrounded by ":
Pattern pattern = Pattern.compile("\"[^\"]*\"");
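To show how such a pattern might be used with a Matcher, here is a small sketch. It adds a capturing group around the text between the quotes, which is a variation on the pattern above and not part of the original answer; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStrings {
    // A double quote, any run of non-quote characters (captured), a closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Collects the contents of every double-quoted string on the line.
    static List<String> findAll(String line) {
        List<String> contents = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            contents.add(m.group(1));
        }
        return contents;
    }
}
```

For the example text foofoo"text123"barbar this returns a list containing only text123. Escaped quotes inside a string literal are not handled by this simple pattern.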

Does the Java regex library optimize for any characters .*?

I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.
Pattern pattern = Pattern.compile(regex);
But suppose I used a .* to specify any number of characters. So it's basically a wildcard.
Pattern pattern = Pattern.compile(".*");
Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am asking because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null, I coalesce it to a .*
In your case, you could just use a possessive quantifier to avoid any backtracking:
.*+
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.
Here is what Cristian Mocanu writes in his article Optimizing regular expressions in Java about a case similar to .*:
Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster than my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).
Some of the hints on Java regex optimization from the same source:
If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().
Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).
Whenever you are using negated character classes to match something other than something else, use possessive quantifiers: instead of [^a]*a use [^a]*+a.
Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!
Beware of a known bug #5050507 (when the regex Pattern class throws a StackOverflowError), if you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.
Instead of lazy dot matching, use a tempered greedy token (e.g. (?:(?!something).)*) or the unrolling-the-loop technique.
Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.
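Since the engine cannot be relied on for this, the wrapper from the question could short-circuit the null case itself. The following is only a sketch under assumptions: RegexFilter is a hypothetical name, and it combines two of the hints above (compile once, re-use the Matcher via reset()). Note that it is not quite equivalent to coalescing null to .*, because .* with matches() rejects input containing line terminators unless Pattern.DOTALL is set, while this wrapper accepts everything.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFilter {
    // null means "no filter": every input is accepted without touching the regex engine.
    private final Matcher matcher;

    RegexFilter(String regex) {
        // Compile once; the Matcher is created on an empty input and reset per call.
        this.matcher = (regex == null) ? null : Pattern.compile(regex).matcher("");
    }

    boolean matches(String input) {
        if (matcher == null) {
            return true;
        }
        // reset(CharSequence) re-targets the existing Matcher instead of allocating a new one.
        return matcher.reset(input).matches();
    }
}
```

A Matcher is not safe for use by multiple threads, so a shared wrapper would need one Matcher per thread or a fresh pattern.matcher(input) per call. Whether the short-circuit is worth it is best settled by measuring on your own inputs.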

Java Regex: how to use OR operation to extract substring

I have a set of strings in the following two formats (mixed together):
1. /c/en/SUBSTRING
2. /c/en/SUBSTRING/some_other_string
I want to use a single Java regex to extract the SUBSTRING from these strings. I know how to do it for the second case: Pattern.compile("^/c/en/(\\w+)/"). Obviously I could use two regexes, one for each case, and then take the result of the one that succeeds, but that is a waste of computation. How can I take the first case into consideration and use a single regex to finish the task?
I tried "^/c/en/(\\w+)[/|$]" and "^/c/en/(\\w+)/|$" but they do not work. Thanks.
Just do:
Pattern.compile("^/c/en/([^/]+)")
and use a Matcher's .find() against the input. The substring will be available in this matcher's .group(1).
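A short sketch of that suggestion, with a hypothetical class name and sample paths. As for the attempts in the question: inside a character class, | and $ are literal characters, so [/|$] does not mean "slash or end of input", and in ^/c/en/(\\w+)/|$ the alternation splits the whole pattern in two. A non-capturing group such as (?:/|$) would express the intended "or", but [^/]+ makes it unnecessary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSegment {
    // Captures everything after "/c/en/" up to the next slash or the end of the input.
    private static final Pattern SEGMENT = Pattern.compile("^/c/en/([^/]+)");

    // Returns the captured segment, or null when the input does not start with /c/en/.
    static String extract(String path) {
        Matcher m = SEGMENT.matcher(path);
        return m.find() ? m.group(1) : null;
    }
}
```

Both PathSegment.extract("/c/en/apple") and PathSegment.extract("/c/en/apple/some_other_string") return apple.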

Regular Expression - Return all matches as a single match

I'm working with a piece of code that applies a regex to a string and returns the first match. I don't have access to modify the code to return all matches, nor do I have the ability to implement alternative code.
I have the following example target string:
usera,userb,,userc,,userd,usere,userf,
This is a list of comma delimited usernames joined from multiple sources, some of which were blank resulting in two commas in some places. I'm trying to write a regex that will return all of the comma delimited usernames except for specific values.
For example, consider the following expression:
[^,]\w{1,},(?<!(userb|userc|userd),)
This results in three matches:
usera,
usere,
userf,
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,' ?
If I could write code in any language this would be trivial, but I'm limited to supplying only the target string and the pattern, and I need a single match that has all items except for the ones I'm omitting. I'm not sure if this is even possible; everything I've ever done with regexes involves processing multiple items in a match collection.
Here is an example in Regex Coach. This image shows that there are the three matches I want, but my requirement is to have the text in a single match, not three separate matches.
EDIT1:
To clarify, this question is specifically about solving the use case using only regular expression syntax. Solving this problem in code is trivial, but solving it using only a regex was the requirement, given that the executing code is part of a 3rd party product that I didn't want to reverse engineer, wrap, or replace.
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,'?
No. Regex matches are consecutive.
A regular expression matches a (sub)string from start to finish. You cannot drop the middle part, this is not how regex engines work. But you can apply the expression again to find another matching substring (incremental search - that's what Regex Coach does). This would result in a match collection.
That being said, you could also just match everything you don't want to keep and remove it, e.g.
,(?=[\s,]+)|(userb|userc|userd)[\s,]*
http://rubular.com/r/LOKOg6IeBa
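Where a replace operation is available, the removal approach could look like the following in Java. This is only an illustration with a hypothetical class name; it does not lift the pattern-only restriction described in the question.

```java
import java.util.regex.Pattern;

class UserListFilter {
    // Matches a comma that is followed by another comma or whitespace,
    // or one of the unwanted names plus any commas/whitespace after it.
    private static final Pattern UNWANTED =
            Pattern.compile(",(?=[\\s,]+)|(userb|userc|userd)[\\s,]*");

    // Deletes every unwanted piece, leaving the remaining names joined together.
    static String strip(String list) {
        return UNWANTED.matcher(list).replaceAll("");
    }
}
```

Applied to the target string from the question, UserListFilter.strip("usera,userb,,userc,,userd,usere,userf,") leaves usera,usere,userf,.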
