I am trying to parse strings (some can be very long, paragraphs) based on white space (spaces, return keys, tabs). Currently using String.split("\\s++"). In the previous project we are updating, we had simply used StringTokenizer. Using String.split("\\s++") works just fine in all our testing and with all our beta testers.
The minute we release it to a wider set of users, it runs for a while and then soaks up all the server resources. From what I've researched, it appears to be catastrophic backtracking. We get errors like:
....was in progress with java.base@11.0.5/java.util.regex.Pattern$GroupHead.match(Pattern.java:4804)
java.base@11.0.5/java.util.regex.Pattern$Start.match(Pattern.java:3619)
java.base@11.0.5/java.util.regex.Matcher.search(Matcher.java:1729)
java.base@11.0.5/java.util.regex.Matcher.find(Matcher.java:746)
java.base@11.0.5/java.util.regex.Pattern.split(Pattern.java:1264)
java.base@11.0.5/java.lang.String.split(String.java:2317)
Users can type some crazy text. What is the best option to parse strings that could be anywhere from 10 characters to 1000 characters long? I've hit a brick wall. I've been trying different patterns (regex is not my strongest area) for the past four days without long-term success.
The simple solution, if you don't trust the regex, is to use a non-regex-based solution such as Apache Commons StringUtils#split. Alternatively, it's pretty easy to write one yourself.
Keep in mind that the difference between StringTokenizer and a split function is that the tokenizer is lazy. If you only retrieve a subset of the split results, a full split may use more memory. I would only expect this to be a problem with large strings, though.
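For illustration, a minimal hand-rolled splitter might look like the sketch below (illustrative only, not Commons' actual implementation); it walks the string once and cannot backtrack:

import java.util.ArrayList;
import java.util.List;

public final class SimpleSplitter {
    // Splits on runs of whitespace without any regex machinery.
    public static List<String> splitOnWhitespace(String input) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                       // start of the current token, -1 = none open
        for (int i = 0; i < input.length(); i++) {
            if (Character.isWhitespace(input.charAt(i))) {
                if (start >= 0) {             // close the current token
                    tokens.add(input.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;                    // open a new token
            }
        }
        if (start >= 0) {                     // trailing token
            tokens.add(input.substring(start));
        }
        return tokens;
    }
}

With Commons Lang, StringUtils.split(input) with no separator argument splits on whitespace in the same single-pass way.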
Related
This relates to the question Detecting phone numbers using SAPI; I used the grammar from that answer.
But it recognizes numbers with spaces in them. How can I create a grammar that returns numbers without spaces?
Short answer - it would be quite difficult to do that, as SAPI is designed to recognize words, which have spaces in between them.
However, it's pretty straightforward to annotate the grammar so that you can find the start & end position of the phone number within the reco result, at which point you can remove the spaces using string replacement.
Alternatively, it's also pretty straightforward to place the phone number in a tag, at which point you can extract the number from the rules structure. (This might be trickier from Java.)
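The string cleanup itself is then trivial in Java; in this sketch, recoText, start and end are placeholders for whatever your annotated reco result gives you:

// start and end are assumed to come from the grammar annotation in the reco result
String spoken = recoText.substring(start, end);      // e.g. "5 5 5 1 2 3 4"
String phoneNumber = spoken.replaceAll("\\s+", "");  // "5551234"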
I am using the mod_security rules from https://github.com/SpiderLabs/owasp-modsecurity-crs to sanitize user input data. I am seeing CPU spikes and delays when matching the user input against the mod_security rule regular expressions. Overall there are 500+ regular expressions checking for different types of attacks (XSS, bad robots, generic, and SQL). For each request, I go through all parameters and check them against all 500 regular expressions, using Matcher.find. Some parameters end up in what looks like infinite looping, which I tackled using the technique below:
Cancelling a long running regex match?
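Roughly, the technique from that question wraps the input in a CharSequence whose charAt checks for cancellation, so a runaway match can be aborted by interrupting the matching thread; a simplified sketch:

// Wraps the input so a runaway regex match can be aborted by interrupting the thread.
class InterruptibleCharSequence implements CharSequence {
    private final CharSequence inner;

    InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }

    @Override public char charAt(int index) {
        if (Thread.currentThread().isInterrupted()) {
            throw new RuntimeException("Regex match interrupted");
        }
        return inner.charAt(index);
    }

    @Override public int length() { return inner.length(); }

    @Override public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override public String toString() { return inner.toString(); }
}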
Sanitizing a user request takes around ~500 ms and the CPU usage shoots up. I analyzed it with VisualVM (visualvm.java.net) against my test-suite runner.
(CPU profile output: screenshot omitted)
How can I reduce the CPU usage and the load average?
If possible, compile your regexes once and keep them around, rather than repeatedly (implicitly) compiling (especially inside a loop).
See java.util.regex - importance of Pattern.compile()? for more info.
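A minimal sketch of the difference (the pattern and class name here are placeholders, not one of the CRS rules):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleChecker {
    // Compiled once when the class is loaded, reused for every request
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script.*?>", Pattern.CASE_INSENSITIVE);

    public boolean matches(String parameter) {
        // Matcher objects are cheap; it's the Pattern compilation that is expensive
        Matcher m = SCRIPT_TAG.matcher(parameter);
        return m.find();
    }
}

By contrast, calling String.matches(regex) or Pattern.compile(regex) inside the per-request loop recompiles the pattern on every call.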
I suggest you look at this paper:
"Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort"
There are better ways to do the matching you describe. Essentially you take the 500 patterns you want to match and compile them into a single suffix tree, which can very efficiently match an input against all the rules at once.
The paper explains that this approach was described as "Boyer-Moore Approach to Exact Set Matching" by Dan Gusfield.
Boyer-Moore is a well known algorithm for String matching. The paper describes a variation on Boyer-Moore for Set Matching.
I think this is the root of your problem, not the regex performance per se:
For each request , I go through all parameters and check against all these 500 regular expressions
No matter how fast your regex will be, this is still plenty of work. I don't know how many parameters you have, but even if there are only a few, that's still checking thousands of regular expressions per request. That can kill your CPU.
Apart from the obvious things like improving your regex performance by precompiling and/or simplifying them, you can do the following things to reduce the amount of regex checking:
Use positive validation of user input based on the parameter type. E.g. if some parameter must be a simple number, don't waste time checking whether it contains a malicious XML script. Just check whether it matches [0-9]+ (or something similarly simple). If it does, it is OK - skip checking all 500 regexps (see the sketch after these suggestions).
Try to find simple regexps that could eliminate whole classes of attacks - find the common things in your regexps. If, for example, you've got 100 regexps checking for the existence of certain HTML tags, first check whether the content contains at least one HTML tag. If it doesn't, you immediately save checking 100 regexps.
Cache results. Many parameters generated in webapps repeat themselves. Don't check the same content over and over again; just remember the final validation result. Be sure to limit the maximum size of the cache to avoid DoS attacks.
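A rough sketch of how the first and third suggestions might be combined (class and method names like ParameterValidator and passesAllSecurityRules are purely illustrative, and the cache bound is arbitrary):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ParameterValidator {
    private static final Pattern SIMPLE_NUMBER = Pattern.compile("[0-9]+");

    // Small bounded cache of already-validated values (bounded so the cache
    // itself cannot be abused for a DoS attack)
    private final Map<String, Boolean> cache = Collections.synchronizedMap(
            new LinkedHashMap<String, Boolean>(1024, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > 1000;
                }
            });

    public boolean isSafe(String value, boolean mustBeNumeric) {
        // Positive validation: a plain number cannot carry an XSS/SQL payload
        if (mustBeNumeric) {
            return SIMPLE_NUMBER.matcher(value).matches();
        }
        // Cache: identical values are only validated once
        Boolean cached = cache.get(value);
        if (cached != null) {
            return cached;
        }
        boolean safe = passesAllSecurityRules(value); // the expensive 500-regex pass
        cache.put(value, safe);
        return safe;
    }

    private boolean passesAllSecurityRules(String value) {
        // ... iterate over the precompiled CRS patterns here ...
        return true;
    }
}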
Also note that negative-validation is usually easy to bypass. Someone just changes a few chars in their malicious code and your regexps won't match. You'll have to grow your "database" of regexps in order to protect against new attacks. Positive validation (whitelisting) doesn't have this disadvantage and is much more effective.
Avoid expressions with:
Multiline
case insensitive
etc.
Perhaps you can consider grouping regular expressions and applying a given group of regular expressions depending on the user input.
If you have such a big number of regex, you could group (at least some of) them using a trie algorithm (http://en.wikipedia.org/wiki/Trie).
The idea is that if you have, for example, regexes like /abc[0-9-]/, /abde/, /another example/, /.something else/ and /.I run out of ideas/, you can combine them into a single regex
/a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)/
In this way, the matcher has to run only once instead of four times, and you avoid a lot of backtracking, because of how the common starting parts have been written in the regex above.
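In Java that might look roughly like this, using the toy patterns above (the class and method names are illustrative):

import java.util.regex.Pattern;

public class CombinedRules {
    // The four toy patterns above, merged into one alternation with the common
    // prefixes factored out, and compiled a single time
    private static final Pattern COMBINED = Pattern.compile(
            "a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)");

    public static boolean matchesAnyRule(String input) {
        return COMBINED.matcher(input).find(); // one pass instead of four
    }
}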
There must be a subset of problematic regexes among these 500. E.g. a regex like
String s = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB";
Pattern.compile("(A+)+").matcher(s).matches(); // nested quantifier plus the trailing 'B' forces exponential backtracking
will take years to complete.
So in your case I would log all the problematic regexes together with their problematic inputs. Once these are found, you can manually rewrite those few problematic regexes and test them against the originals. Regexes can always be rewritten as simpler and more readable Java functions.
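For the toy pattern above the rewrite is trivial; the lines below just illustrate the idea (they reuse the s from the snippet above):

// "(A+)+" and "A+" accept exactly the same inputs, but only the first one backtracks catastrophically
boolean viaSimplerRegex = s.matches("A+");
// or drop the regex entirely for a check this trivial
boolean viaPlainJava = !s.isEmpty() && s.chars().allMatch(c -> c == 'A');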
Another option, though it would not resolve the problem above, is to use a faster (20x in some cases) but more limited regex library. It is available in Maven Central.
This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply modern slang for "satan" in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]*a[^\s]*e[^\s]*n\b (with * rather than + so that the plain word "faen" still matches). Note that this is the bare regular expression; in Java source you need to double the backslashes: "\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b".
Note also that this isn't perfect, but does handle the situations that you have suggested.
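A quick sketch of how that could be wired up in Java; the test strings are the examples from the question and the class name is just illustrative:

import java.util.regex.Pattern;

public class CurseFilter {
    private static final Pattern BAD_WORD =
            Pattern.compile("\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b");

    public static boolean containsBadWord(String message) {
        return BAD_WORD.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBadWord("String containing faen"));                      // true
        System.out.println(containsBadWord("String containing sofaen"));                    // false
        System.out.println(containsBadWord("Non-letter-censored string with f-a#a-e.n"));   // true
        System.out.println(containsBadWord("Non-letter-censored string with sof-a#a-e.n")); // false
    }
}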
It's a terrible idea to begin with. Do you think your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation you did not prepare for? This is a race you cannot win, and the real loser is usability.
Let's say I have about 1000 sentences that I want to offer as suggestions when the user is typing into a field.
I was thinking about running lucene in memory search and then feeding the results into the suggestions set.
The trigger for running the searches would be space char and exit from the input field.
I intend to use this with GWT so the client with be just getting the results from server.
I don't want to do what Google is doing, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on those. Sort of like when I'm typing the title for a question here on Stack Overflow.
Did anyone do something like this before? Is there already library I could use?
I was working on a similar solution. The paper titled Effective Phrase Prediction was quite helpful for me. You will have to prioritize the suggestions as well.
If you've only got 1000 sentences, you probably don't need a powerful indexer like lucene. I'm not sure whether you want to do "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions. Here are solutions to both:
Assuming that you want to complete the sentence input by the user, then you could put all of your strings into a SortedSet, and use the tailSet method to get a list of strings that are "greater" than the input string (since the string comparator considers a longer string A that starts with string B to be "greater" than B). Then, iterate over the top few entries of the set returned by tailSet to create a set of strings where the first inputString.length() characters match the input string. You can stop iterating as soon as the first inputString.length() characters don't match the input string.
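A minimal sketch of that approach (class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SentenceSuggester {
    private final TreeSet<String> sentences = new TreeSet<>();

    public SentenceSuggester(Iterable<String> corpus) {
        for (String s : corpus) {
            sentences.add(s.toLowerCase()); // normalize case up front
        }
    }

    // Returns up to maxResults sentences that start with the given prefix.
    public List<String> suggest(String prefix, int maxResults) {
        String p = prefix.toLowerCase();
        List<String> results = new ArrayList<>();
        // tailSet gives everything >= p; entries sharing the prefix come first
        for (String candidate : sentences.tailSet(p)) {
            if (results.size() >= maxResults || !candidate.startsWith(p)) {
                break;
            }
            results.add(candidate);
        }
        return results;
    }
}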
If you want to do keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are, and how many unique words there are in the sentences. If this set is small enough, you'll be able to get away with a HashMap<String,Set<String>>, where you mapped keywords to the sentences that contained them. Then you could handle multiword queries by intersecting the sets.
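And a correspondingly small sketch of the keyword-index idea (again, names are illustrative):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KeywordIndex {
    private final Map<String, Set<String>> sentencesByWord = new HashMap<>();

    public void add(String sentence) {
        for (String word : sentence.toLowerCase().split("\\s+")) {
            sentencesByWord.computeIfAbsent(word, k -> new HashSet<>()).add(sentence);
        }
    }

    // Sentences that contain every keyword of the query.
    public Set<String> suggest(String query) {
        Set<String> result = null;
        for (String word : query.toLowerCase().split("\\s+")) {
            Set<String> matches = sentencesByWord.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new HashSet<>(matches); // first keyword
            } else {
                result.retainAll(matches);       // intersect with the remaining keywords
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}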
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). I don't think either solution would scale to hundreds of thousands of suggestions either. Do either of those do what you want? Happy to provide code if you'd like it.
I'm writing a small app that reads some input and does something based on that input.
Currently, when I'm looking for a line that ends with, say, "magic", I use String's endsWith method. It's pretty clear to whoever reads my code what's going on.
Another way to do it is to create a Pattern and try to match a line that ends with "magic". This is also clear, but I personally think this is overkill because the pattern I'm looking for is not complex at all.
When do you think it's worth using regex in Java? If it's a matter of complexity, how would you personally define what's complex enough?
Also, are there times when using Patterns is actually faster than string manipulation?
EDIT: I'm using Java 6.
Basically: if there is a non-regex operation that does what you want in one step, always go for that.
This is not so much about performance, but about a) readability and b) compile-time-safety. Specialized non-regex versions are usually a lot easier to read than regex-versions. And a typo in one of these specialized methods will not compile, while a typo in a Regex will fail miserably at runtime.
Comparing regex-based solutions to non-regex-based solutions
String s = "Magic_Carpet_Ride";
s.startsWith("Magic"); // non-regex
s.matches("Magic.*"); // regex
s.contains("Carpet"); // non-regex
s.matches(".*Carpet.*"); // regex
s.endsWith("Ride"); // non-regex
s.matches(".*Ride"); // regex
In all these cases it's a no-brainer: use the non-regex version.
But when things get a bit more complicated, it depends. I guess I'd still stick with non-regex in the following case, but many wouldn't:
// Test whether a string ends with "magic" in any case,
// followed by optional white space
s.toLowerCase().trim().endsWith("magic"); // non-regex, 3 calls
s.matches(".*(?i:magic)\\s*"); // regex, 1 call, but ugly
And in response to RegexesCanCertainlyBeEasierToReadThanMultipleFunctionCallsToDoTheSameThing:
I still think the non-regex version is more readable, but I would write it like this:
s.toLowerCase()
.trim()
.endsWith("magic");
Makes the whole difference, doesn't it?
You would use Regex when the normal manipulations on the String class are not enough to elegantly get what you need from the String.
A good indicator that this is the case is when you start splitting, then splitting those results, then splitting those results. The code is getting unwieldy. Two lines of Pattern/Regex code can clean this up, neatly wrapped in a method that is unit tested.
Anything that can be done with regex can also be hand-coded.
Use regex if:
Doing it manually is going to take more effort without much benefit.
You can easily come up with a regex for your task.
Don't use regex if:
It's very easy to do it otherwise, as in your example.
The string you're parsing does not lend itself to regex. (it is customary to link to this question)
I think you are best with using endsWith. Unless your requirements change, it's simpler and easier to understand. Might perform faster too.
If there were a bit more complexity, such as wanting to match "magic" and "majik" but not "Magic" or "Majik", or wanting to match "magic" followed by a space and then one word, such as "... magic spoon" but not "... magic soup spoon", then I think RegEx would be the better way to go.
Any complex parsing where you are generating a lot of objects is better done with RegEx when you factor in both the computing power and the brainpower it takes to generate the code for that purpose. If you have a RegEx guru handy, it's almost always worthwhile, as the patterns can easily be tweaked to accommodate business-rule changes without the major loop refactoring that would likely be needed if you used pure Java to do some of the complex things RegEx does.
If your basic line ending is the same every time, such as with "magic", then you are better off using endsWith.
However, if you have a line that has the same base, but can have multiple values, such as:
<string> <number> <string> <string> <number>
where the strings and numbers can be anything, you're better off using RegEx.
Your lines are always ending with a string, but you don't know what that string is.
If it's as simple as endsWith, startsWith or contains, then you should use these functions. If you are processing more "complex" strings and you want to extract information from these strings, then regexp/matchers can be used.
If you have something like "commandToRetrieve someNumericArgs someStringArgs someOptionalArgs" then regexp will ease your task a lot :)
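For instance, a hypothetical command format like that could be picked apart with a single pattern (the format, class name, and sample input here are made up for illustration):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommandParser {
    // command, one numeric argument, one word argument, and optional trailing text
    private static final Pattern COMMAND =
            Pattern.compile("(\\w+)\\s+(\\d+)\\s+(\\w+)(?:\\s+(.*))?");

    public static void main(String[] args) {
        Matcher m = COMMAND.matcher("retrieveUser 42 active verbose");
        if (m.matches()) {
            System.out.println("command = " + m.group(1)); // retrieveUser
            System.out.println("number  = " + m.group(2)); // 42
            System.out.println("string  = " + m.group(3)); // active
            System.out.println("extra   = " + m.group(4)); // verbose
        }
    }
}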
I'd never use regexes in Java if I have an easier way to do it, like in this case the endsWith method. Regexes in Java are as ugly as they get, probably with the only exception of the matches method on String.
Usually avoiding regexes makes your code more readable and easier for other programmers. The opposite is also true: complex regexes can confuse even the most experienced hackers out there.
As for performance concerns: just profile. Especially in Java.
If you are familiar with how regexp works you will soon find that a lot of problems are easily solved by using regexp.
Personally, I reach for Java's String operations if the task is easy, but if you start splitting strings and then doing substring on the results, I'd start thinking in regular expressions.
And again, if you use regular expressions, why stop at lines? By configuring your regexp you can easily read entire files with one regular expression (pass Pattern.DOTALL to Pattern.compile so your regexp doesn't stop at newlines). I'd combine this with Apache Commons IOUtils.toString(), and you have something very powerful for doing quick work.
I would even bring out a regular expression to parse some xml if needed. (For instance in a unit test, where I want to check that some elements are present in the xml).
For instance, from some unit test of mine:
Pattern pattern = Pattern.compile(
"<Monitor caption=\"(.+?)\".*?category=\"(.+?)\".*?>"
+ ".*?<Summary.*?>.+?</Summary>"
+ ".*?<Configuration.*?>(.+?)</Configuration>"
+ ".*?<CfgData.*?>(.+?)</CfgData>", Pattern.DOTALL);
which will match all segments in this xml and pick out some segments that I want to do some sub matching on.
I would suggest using a regular expression when you know the format of an input but you are not necessarily sure on the value (or possible value(s)) of the formatted input.
What I'm saying is: if your input always ends with, in your case, "magic", then String.endsWith() works fine (since you know that every possible input value ends with "magic").
If you have a format, e.g. the RFC 5322 message format, you cannot simply say that every email address ends with .com, so you create a regular expression that conforms to the RFC 5322 standard for verification.
In a nutshell, if you know a format structure of your input data but don't know exactly what values (or possible values) you can receive, use regular expressions for validation.
There's a saying that goes:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (link).
For a simple test, I'd proceed exactly like you've done. If you find that it's getting more complicated, then I'd consider Regular Expressions only if there isn't another way.