Pattern matching : Matching a String with a pattern - java

I was trying to match a pattern within a string. I am running out of ideas how to do this in Java with good time complexity.
No its not a simple regex matching (but loved to be proved wrong)
What I am trying is,
Pattern : "1221" (Means 1 word repeat once, 2nd word repeat twice, last word is same as first word)
Valid Input: "aabbbbbbaa" (aa occurs at the beginning and at end, while middle portion is occupied by bbb repeating twice)
I tried the following approaches but failed miserably
I tried to loop input with the pattern. But that did not solve the problem, although with more loops I can achieve it, but it increases the time complexity exponentially.
Tried recursion and again no use.
What other approaches can I try?
I think Dynamic programming might be the answer, but I am not able to determine the terminating condition.
Any help would be appreciated.

You can use simple regex, e.g.:
^(.+)(.+)\2\1$
It does exactly what u want:

Related

Regex to match a fixed sub string in a String

I am trying to write a regular expression to verify the presence of a specific number in a fixed position in a String.
String: 109300300330066611111111100000000017000656052086116020170111Name 1
Number to find: 111111111 (Staring from position 17)
I have written the following regular expression:
^.{16}(?<Ones>111111111)(.*)
My understanding is:
Let first 16 characters be whatever they are
Use the Named Capturing Group to grab the specific word
Let the rest of the characters be whatever they are
I am new to regex, is there any issue with the above approach?
Can it be done in other/better way?
I am using Java 8.
Without more details of why you're doing what you're doing, there's just one possible improvement I can see. You repeated any character 16 times at the beginning of the string rather than writing out 16 .s, which is nice and readable, but then, it would be nice to do the same for the repeated 1s:
^.{16}(?<Ones>1{9})(.*)
Otherwise, the string of 1s is hard to understand without the coder manually counting how many there are in the regex.
If you want to hard-code the ones and you know the starting position and you just wnat to know if it is there, using a regex seems unnecessary. you can use this:
String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
if (s.indexOf("111111111").equals(16) doSomething();
Another possible solution without regex:
if(s.substring(16,25).equals("111111111") doSomething();
Otherwise your regex looks good.

Regex Pattern Catastrophic backtracking

I have the regex shown below used in one of my old Java systems which is causing backtracking issues lately.
Quite often the backtracking threads cause the CPU of the machine to hit the upper limit and it does not return back until the application is restarted.
Could any one suggest a better way to rewrite this pattern or a tool which would help me to do so?
Pattern:
^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$
Values working:
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]
Catastrophic backtracking values — if anything is wrong in the values (an extra brace added at the end):
[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]
Never use * when + is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm pretty sure you don't want to treat [] as a valid input.
One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the |[\p{N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p{N}* constructs get into an endless tug-of-war over every group of digits.
But there's no point trying to fix those problems, because the overall structure is wrong. I think this is what you're looking for:
^\[\p{N}+\](?:,\[\p{N}+\])*$
After it consumes the first token ([1234567]), if the next thing in the string is not a comma or the end of the string, it fails immediately. If it does see a comma, it must go on to match another complete token ([89023432]), or it fails immediately.
That's probably the most important thing to remember when you're creating a regex: if it's going to fail, you want it to fail as quickly as possible. You can use features like atomic groups and possessive quantifiers toward that end, but if you get the structure of the regex right, you rarely need them. Backtracking is not inevitable.

Regex unordered matches

This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the and part to match in an orderless way, any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\n.*?\bHappy\b|\bJohn\b.*?\bBarbara\n.*?\bSad\b|......
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this is a regex that may help you :
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentionned in other responses, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better peformance.
Using information retrieval techniques, you will first index all your documents/sentences, and then search for your words, in your example you would want to search for "+(+john +barbara) +(sad happy)" [or "(john AND barbarar) AND (sad OR HAPPY)" ]
this approach will consume some time when indexing, however, searching will be much faster then any regex/hashset approach (since you don't need to iterate over all documents...)

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shoudn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
How long is the string you are trying to parse ?
How long and how complicated is the regex you are trying to match ?
Have you tried to break down your regex down to simpler bits ? Adding up the bits one after another will let you see when it breaks and maybe why.
make some RE like [a-zA-Z]* pass it as argument to compile(),here this example allows only characters small & cap.
Read my blogpost on android validation for more info.
I had the same issue and I solved it replacing all the wildchart . with [\s\S]. I really don't know why it worked for me but it did. I come from Javascript world and I know in there that expression is faster for being evaluated.

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why the error and how can I optimize the Regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
As for why it is crashing, it's the second * around the group. I'm not sure why you need that, since that sounds like you are asking for :
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$

Categories