StackOverflow Error while trying to match a large String with regex

I have a String containing natural numbers, with a separator pattern such as (,-) or (,) between them.
With a while condition I make sure that the remaining String contains at least as many numbers as a given window size, e.g. 5.
The condition looks like this:
while(discretizedTs.substring(lagWindowStart).matches("(-?,?\\d+,){5,}")) {
}
Here lagWindowStart jumps to the index of the next number or (-,) pattern.
For small Strings this regular expression works fine (as far as tested).
My problem is that for large Strings (and I normally have to deal with very large Strings) this regular expression causes a StackOverflowError.
This happens, for example, if the String contains more than 17k characters.
Is there a limit on the length of a String that can be matched, or a time limit within which matching must complete?
I do not know how to solve this problem without regular expressions.
I hope you have some ideas.
Thank you
best

The default JVM thread stack size is rather small. You can increase it with the -Xss option, e.g. -Xss1024k, which should help.
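If raising the stack size is not an option, one alternative is to avoid the repeated group altogether. The sketch below is not from the answer above; it assumes the same token shape as the question's regex, (-?,?\d+,), and matches one token at a time in a loop, so the regex engine never has to nest one repetition inside another. The class and method names (LagWindowCheck, hasAtLeast) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LagWindowCheck {
    // One repetition of the group from the question's regex: (-?,?\d+,)
    private static final Pattern TOKEN = Pattern.compile("-?,?\\d+,");

    /**
     * Returns true if text, starting at index start, consists only of
     * tokens and contains at least minTokens of them. This mirrors
     * text.substring(start).matches("(-?,?\\d+,){minTokens,}").
     */
    static boolean hasAtLeast(String text, int start, int minTokens) {
        Matcher m = TOKEN.matcher(text);
        int pos = start;
        int count = 0;
        while (pos < text.length()) {
            // Restrict the matcher to the unread tail and require a token
            // to begin exactly at pos.
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                return false; // something other than a token follows
            }
            pos = m.end();
            count++;
        }
        return count >= minTokens;
    }
}
```

With this, the loop condition from the question could be written as while (LagWindowCheck.hasAtLeast(discretizedTs, lagWindowStart, 5)), which also avoids creating a substring on every iteration.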

Related

Rearranging one string to another in Java

I am trying to find whether a part of given string A can be or can not be rearranged to given string B (Boolean output).
Since the algorithm must be at most O(n), to ease it, I used stringA.retainAll(stringB), so now I know string A and string B consist of the same set of characters and now the whole task smells like regex.
And, reading about regex, I might now have two problems(c).
The question is: do I risk getting O(infinity) by using regex, or is it more efficient to use the Stream API to check whether each character of string A has enough duplicates to cover each character of string B? On top of that, regex syntax is not easy to read or build.
As of now, I can't use sorting (any comparison sort is at least n*log(n)) or hash sets and the like (as they eliminate duplicates in both strings).
Thank you.

You can use a HashMap<Character,Integer> to count the number of occurrences of each character of the first String. That would take linear time.
Then, for each Character of the second String, find if it's in the HashMap and decrement the counter (if it's still positive). This will also take linear time, and if you manage to decrement the counters for all the characters of the second String, you succeed.
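The two passes described above could be sketched as follows. The class and method names (Rearrange, canBuild) are illustrative, and the sketch assumes "rearranged" means that every character of B must be available in A with at least the same multiplicity.

```java
import java.util.HashMap;
import java.util.Map;

final class Rearrange {
    /** True if every character of b can be taken from a, respecting multiplicity. */
    static boolean canBuild(String a, String b) {
        // First pass: count the occurrences of each character in a.
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : a.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // Second pass: consume one occurrence for each character of b.
        for (char c : b.toCharArray()) {
            Integer left = counts.get(c);
            if (left == null || left == 0) {
                return false; // a has no (more) copies of c
            }
            counts.put(c, left - 1);
        }
        return true;
    }
}
```

Both loops touch each character once, so the total work is linear in the combined length of the two strings (assuming constant-time hash map operations).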

Regex to match a fixed sub string in a String

I am trying to write a regular expression to verify the presence of a specific number in a fixed position in a String.
String: 109300300330066611111111100000000017000656052086116020170111Name 1
Number to find: 111111111 (starting from position 17)
I have written the following regular expression:
^.{16}(?<Ones>111111111)(.*)
My understanding is:
Let first 16 characters be whatever they are
Use the Named Capturing Group to grab the specific word
Let the rest of the characters be whatever they are
I am new to regex, is there any issue with the above approach?
Can it be done in other/better way?
I am using Java 8.

Without more details of why you're doing what you're doing, there's just one possible improvement I can see. You repeated any character 16 times at the beginning of the string rather than writing out 16 .s, which is nice and readable, but then, it would be nice to do the same for the repeated 1s:
^.{16}(?<Ones>1{9})(.*)
Otherwise, the string of 1s is hard to understand without the coder manually counting how many there are in the regex.

If you want to hard-code the ones, you know the starting position, and you just want to know whether they are there, using a regex seems unnecessary. You can use this:
String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
if (s.indexOf("111111111") == 16) doSomething();
Another possible solution without regex:
if (s.substring(16, 25).equals("111111111")) doSomething();
Otherwise your regex looks good.
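A further non-regex variant, not mentioned in the answers above, is String.startsWith(String prefix, int toffset), which tests for a substring at a fixed zero-based offset. The helper name hasAt is made up for illustration.

```java
final class FixedPositionCheck {
    /** True if expected occurs in s starting exactly at the zero-based offset. */
    static boolean hasAt(String s, String expected, int offset) {
        // startsWith returns false, rather than throwing, when offset is
        // negative or beyond the end of s.
        return s.startsWith(expected, offset);
    }
}
```

Compared with the two checks above, this does not throw on lines shorter than 25 characters (as substring(16, 25) would), and it is not affected by an earlier occurrence of the digits (indexOf reports only the first occurrence, so the == 16 comparison fails if the same run also appears before position 16).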

Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.
The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)
The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!
In my code, I have tried a couple of ways to match regex:
if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}
All three suffer from poor performance when the regex mentioned above is added to the list of regexes. Any suggestions to improve regex performance or change the regex itself?
The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.
I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group
(?>(.+?)#011)
Once the group has matched, there's no more backtracking into it, and matching moves on to the next group.
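The split suggestion could look roughly like this in Java (the same java.util.regex classes are reachable from Scala). This is a sketch: the names TabFlattenedLog and fields are hypothetical, and the expected number of fields is passed in as a parameter rather than taken from the question's regex.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class TabFlattenedLog {
    // Pattern.quote treats "#011" as a literal separator.
    private static final Pattern SEP = Pattern.compile(Pattern.quote("#011"));

    /**
     * Splits line on "#011". Returns the fields if there are exactly
     * expectedFields of them and none is empty; otherwise an empty list.
     */
    static List<String> fields(String line, int expectedFields) {
        // Limit -1 keeps trailing empty strings so they can be rejected below.
        String[] parts = SEP.split(line, -1);
        if (parts.length != expectedFields) {
            return Collections.emptyList();
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                return Collections.emptyList();
            }
        }
        return Arrays.asList(parts);
    }
}
```

One difference from the original regex: with matches(), a lazy (.+?) can still stretch across a "#011" when a line has more separators than expected, whereas this sketch rejects such lines.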
Negating a string
The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:
((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).
An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:
((?: (?!#011). )+)#011
It's all pretty complicated and most probably less performant than simply using the atomic group.
Optimizations
Of my regexes above, the first one is best. However, as Casimir et Hippolyte wrote, there's room for improvement (by a factor of 1.8):
( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.
A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.
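To try the optimized expression in Java, the blanks added for readability have to be removed first. Pattern.COMMENTS cannot be used as-is here, because in that mode an unescaped # starts a comment. The sketch below wraps the expression in a hypothetical helper, firstField, that returns the text before the first "#011".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstField {
    // The optimized expression from above with the readability blanks removed.
    private static final Pattern FIELD =
            Pattern.compile("([^#]*+(?:#(?!011)[^#]*)*+)#011");

    /** Returns the text before the first "#011" in line, or null if there is none. */
    static String firstField(String line) {
        Matcher m = FIELD.matcher(line);
        // lookingAt anchors the match at the start of the line.
        return m.lookingAt() ? m.group(1) : null;
    }
}
```

A line that starts with "#011" yields an empty string rather than null, which is the empty-sequence caveat mentioned above.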

Does the Java world have a counterpart to Perl's Regexp::Optimizer? [duplicate]

I wrote a Java program that generates a sequence of symbols, like "abcdbcdefbcdbcdefg". What I need is a regex optimizer that can produce "a((bcd){2}ef){2}g".
As the input may contain Unicode escapes, like "a\u0063\u0063\bbd", I prefer a Java version.
The reason I want a "shorter" expression is to save space/memory. The sequence of symbols here could be very long.
In general, finding the "shortest" optimized regex is hard, so I don't need a tool that guarantees the shortest result.

I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg in terms of matching speed is likely to be abcdbcdefbcdbcdefg. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
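As a rough illustration of the compression route, the sketch below round-trips a string through java.util.zip's GZIP streams. The class and method names are made up, and GZIP is only one possible choice of general-purpose compressor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class SymbolCompression {
    /** Compresses the UTF-8 bytes of s with GZIP. */
    static byte[] compress(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reverses compress(). */
    static String decompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For very short inputs the GZIP header overhead can make the result larger than the original, so this only pays off for long, repetitive sequences.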

Regular expressions are not a substitute for compression
Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.

I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.
I can't give you an existing program, but I can tell you how to approach writing one.
There is no canonical minimum regex form and determining the true minimum size regex is NP hard. Certainly your sets are finite, so this may be a simpler problem. I'll have to think about it.
But a good heuristic algorithm would be:
1. Construct a trivial non-deterministic finite automaton (NFA) that accepts all your strings.
2. Convert the NFA to a deterministic finite automaton (DFA) with the subset construction.
3. Minimize the DFA with the standard algorithm.
4. Use the construction from the proof of Kleene's theorem to get to a regex.
Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.

Using regex in Java to SELECTIVELY find a pattern

I have a list of strings (in this case tweets from Twitter). These strings are posted by users, and sometimes reference other specific users. I am using Regular expressions along with Java's String.replaceAll(pattern, replace) method to replace instances of common problems with speech (in this case, repeated consonants), but I need a way to make it ignore any pattern it finds in a username. Username patterns universally match the regex \b#\S+\b
So I want to match y+, but NOT as a member of anything that would match \b#\S+\b
So in everybodyy #everybodyy everybodyy I would match ever(y)bod(yy) #everybodyy ever(y)bod(yy)
Is this possible, and how do I do it?

text.replaceAll("(?i)(?<!\\B#\\S{1,20})y+", "y"); works. Java supports variable-length lookbehind as long as the lookbehind has an explicit maximum length.
Since Twitter usernames have a fixed maximum length, putting a fixed upper bound on the variable-length lookbehind solves the problem.
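Wrapped in a small helper (the names RepeatedY and collapse are made up for illustration), the replacement can be tried on the example from the question:

```java
final class RepeatedY {
    /**
     * Collapses runs of 'y' (any case) to a single lowercase "y", except
     * where the run is preceded by '#' plus 1 to 20 non-space characters.
     */
    static String collapse(String text) {
        // (?<! ... ) is a negative lookbehind; the {1,20} bound keeps its
        // length finite, which Java requires.
        return text.replaceAll("(?i)(?<!\\B#\\S{1,20})y+", "y");
    }
}
```

For the input "everybodyy #everybodyy everybodyy" this yields "everybody #everybodyy everybody". Note that because of (?i) together with the literal replacement "y", an uppercase run such as "YY" is replaced with a lowercase "y".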

try the following:
String regEx = "(\\s+[^#\\s]\\S*y+\\S*)|(^[^#\\s]\\S*y+\\S*)";
