Java Regex Engine Crashing - java

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why the error and how can I optimize the Regex for my requirement so as to avoid other peculiar cases.

You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
As for why it is crashing, it's the second * around the group. I'm not sure why you need that, since that sounds like you are asking for :
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.

Try this:
=\\s*(.*)$

Related

Regex to match a fixed sub string in a String

I am trying to write a regular expression to verify the presence of a specific number in a fixed position in a String.
String: 109300300330066611111111100000000017000656052086116020170111Name 1
Number to find: 111111111 (Staring from position 17)
I have written the following regular expression:
^.{16}(?<Ones>111111111)(.*)
My understanding is:
Let first 16 characters be whatever they are
Use the Named Capturing Group to grab the specific word
Let the rest of the characters be whatever they are
I am new to regex, is there any issue with the above approach?
Can it be done in other/better way?
I am using Java 8.
Without more details of why you're doing what you're doing, there's just one possible improvement I can see. You repeated any character 16 times at the beginning of the string rather than writing out 16 .s, which is nice and readable, but then, it would be nice to do the same for the repeated 1s:
^.{16}(?<Ones>1{9})(.*)
Otherwise, the string of 1s is hard to understand without the coder manually counting how many there are in the regex.
If you want to hard-code the ones and you know the starting position and you just wnat to know if it is there, using a regex seems unnecessary. you can use this:
String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
if (s.indexOf("111111111").equals(16) doSomething();
Another possible solution without regex:
if(s.substring(16,25).equals("111111111") doSomething();
Otherwise your regex looks good.

Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.
The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)
The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!
In my code, I have tried a couple of ways to match regex:
if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}
All three suffer from poor performance when the regex mentioned above it added to the list of regex. Any suggestions to improve regex performance or change the regex itself?
The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.
I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.
The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group
(?>(.+?)#011)
Once matched, there's no more backtracking. Done and looking forward for the next group.
Negating a string
The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:
((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).
An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:
((?: (?!#011). )+)#011
It's all pretty complicated and most probably less performant than simply using the atomic group.
Optimizations
Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's a room for improvements (factor 1.8)
( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.
A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.

Regex Pattern Catastrophic backtracking

I have the regex shown below used in one of my old Java systems which is causing backtracking issues lately.
Quite often the backtracking threads cause the CPU of the machine to hit the upper limit and it does not return back until the application is restarted.
Could any one suggest a better way to rewrite this pattern or a tool which would help me to do so?
Pattern:
^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$
Values working:
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]
Catastrophic backtracking values — if anything is wrong in the values (an extra brace added at the end):
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]
Never use * when + is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm pretty sure you don't want to treat [] as a valid input.
One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the |[\p{N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p{N}* constructs get into an endless tug-of-war over every group of digits.
But there's no point trying to fix those problems, because the overall structure is wrong. I think this is what you're looking for:
^\[\p{N}+\](?:,\[\p{N}+\])*$
After it consumes the first token ([1234567]), if the next thing in the string is not a comma or the end of the string, it fails immediately. If it does see a comma, it must go on to match another complete token ([89023432]), or it fails immediately.
That's probably the most important thing to remember when you're creating a regex: if it's going to fail, you want it to fail as quickly as possible. You can use features like atomic groups and possessive quantifiers toward that end, but if you get the structure of the regex right, you rarely need them. Backtracking is not inevitable.

Regular Expression to Match Number of Lines and Characters per Line

I'm trying to make sure that a string contains between 0 and 3 lines, and that for a given line that is present that it contains 0 to 100 characters. It would need to be a valid expression for JavaScript and Java. Like many people doing RegEx I'm copying from various spots on the Internet.
Working backwards I think ^.{0,100}$ gets me the "line contains 0 to 100 characters", but trying to group that as (^.{0,100}$){0,3} doesn't work.
The new line character is probably part of my problem, so I ended up with something like .{0,100}(?:\n.{0,100}){0,2} trying to say "a line of 0 to 100 characters optionally followed by 0 to 2 instances of a new line and 0 to 100 more characters", but that also failed.
Up until now I got those expressions from other people. Using an online test tool I finally monkeyed this together: ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ which appears to work.
So, my question is, am I missing any pitfalls in ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ given what I'm after? Furthermore, even if that does work is it the best expression to use?
I think what you have will work fine. You can make the line break part a little more compact if you want, and you don't need ^ and $ if you are using matches():
String regex = ".{0,100}(?:[\r\n]+.{0,100}){0,2}";
EDIT
After some more thoughts I realized the newline suggestion above will match 4 (or more) lines as long as a couple of them are empty. So, we are back to your suggested example. Oh well, at least the start and end characters can be omitted.
String regex = ".{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}";
I'm not very good at regular expressions but would this work?
^.{0,100}\n?(.{0,100}\n)?.{0,100}?$
Again I'm still new to reg exp, so if there is an error(which is likely) please tell me.

Regex: How not to match a few letters

I have the following string: SEE ATTACHED ADDENDUM TO HUD-1194,520.07
Inside that string is HUD-1 and after that is 194,520.07. What I want is the 194,520.07 part.
I have written the following regular expression to pull that value out:
[^D\-1](?:-|\()?\$?(?:\d{1,3}[ ,]?)*(?:\.\d+)\)?
However, this pulls out: 94,520.07
I know it has something to do with this part: [^D\-1] "eating" to many of the 1's. Any ideas how I can stop it from "eating" 1's after the first one that appears in HUD-1?
UPDATED:
The reason for all the other stuff is I only want to match as well if the value after HUD-1 is a money amount. And the rest of that regex tries to determine all the different ways a money amount could be written
Why not something as simple as:
.*HUD\-1(.*+)
Ok, you need to be more restrictive I see based on your updated question. Try changing [^D\-1] to just (?:HUD\-1)?. For what it's worth, your currency RegEx is vary lax, allowing input like:
001 001 .31412341234123
You might consider not reinventing the wheel there, I'm sure you can find a currency RegEx quickly via Google. Otherwise, I'd also suggest anchoring your RegEx with a $ at the end of it.
this change will make the second match group of the regex include the full number you would like (everything after the first 1), and put the possible HUD-1 in a separate matching group, if present.
(HUD-1)?((?:-|\()?\$?(?:\d{1,3}[ ,]?)*(?:\.\d+)\)?)

Categories