Java reduce dynamic regex pattern to eliminate duplicated qualifiers - java

Yes I know another regex question, MEH! Well this is kind of regex but more pattern recognition which drives regex generation...
Anyway, I'm working on a brain teaser and need to convert a binary representation of a string of characters to some other string representation. i.e. 0 = A|AA|AAA|AAAA+ and 1 = A|AA|AAA|AAAA+|B|BB|BBB|BBBB+ does 1010101010 == AAAAABBBBAAAA? Given a rather large input file.
My solution was to create regex on the fly using the pattern A+ for 0 and (A+|B+) for 1.
The issue is that as I iterate the input which can be pretty large (binary representation is up to 150 chars and the AB notation can be up to 1000 chars), I end up with a large regex pattern that is not performing quick enough for my needs (needs to be able to perform a match on a character string up to 1000 characters in less than 10 seconds)
To speed up the solution I wanted to reduce the size of the generated regex so for the input of a binary representation of 1010101010 I want the regex to be (A+(A+|B+))+ instead of my generated A+(A+|B+)A+(A+|B+)A+(A+|B+)A+(A+|B+)A+(A+|B+)
My thought was that I could detect the repeating pattern and reduce it to just the first sequence that is repeated and then generate the regex string off of that.
Any thoughts?

Instead of trying to make one big pattern that matches the whole file you could go for partial matching using a loop and matcher.find() to iterate through the individual matches of A+B+.
Pattern pattern = Pattern.compile("A+B+");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String part = matcher.group(); // this is the matched part
}
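The repetition-collapsing idea from the question could be sketched roughly as below. This is only an illustration, assuming the mapping 0 -> A+ and 1 -> (A+|B+) stated in the question; the names RepeatedUnitRegex, smallestUnit and toRegex are made up for this example. It emits a counted group {n} rather than +, because + would also accept other repetition counts. Whether the shorter pattern is actually faster has not been measured here.

```java
import java.util.regex.Pattern;

class RepeatedUnitRegex {

    // Shortest prefix of bits that, repeated, reproduces bits exactly.
    static String smallestUnit(String bits) {
        int n = bits.length();
        for (int len = 1; len <= n / 2; len++) {
            if (n % len != 0) continue;
            boolean periodic = true;
            for (int i = len; i < n && periodic; i++) {
                periodic = bits.charAt(i) == bits.charAt(i % len);
            }
            if (periodic) return bits.substring(0, len);
        }
        return bits;
    }

    // Builds the regex for one unit and wraps it in a counted group.
    static String toRegex(String bits) {
        if (bits.isEmpty()) return "";
        String unit = smallestUnit(bits);
        StringBuilder sb = new StringBuilder();
        for (char c : unit.toCharArray()) {
            sb.append(c == '0' ? "A+" : "(?:A+|B+)"); // assumed mapping from the question
        }
        int reps = bits.length() / unit.length();
        return reps == 1 ? sb.toString() : "(?:" + sb + "){" + reps + "}";
    }

    public static void main(String[] args) {
        String regex = toRegex("1010101010");
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "AAAAABBBBAAAA"));
    }
}
```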

Related

How to find a last occurrence of set of characters in string using regex in java?

I need find the last index of set of characters in a string. Consider the set of characters be x,y,z and string as Vereador Luiz Pauly Home then I need index as 18.
So for finding the index I have created a pattern with DOTALL flag and greedy quantifier as (?s).*(x|y|z). When the pattern is applied to that string(multiline), I can find out index from the start group. The code:
int findIndex(String str) {
    int index = -1;
    Pattern p = Pattern.compile("(?s).*(x|y|z)");
    Matcher m = p.matcher(str);
    if (m.find()) {
        index = m.start(1);
    }
    return index;
}
As expected, it returns the correct value if there is a match.
But if there is no match, it takes far too long (17 minutes for 600000 characters) because of the greedy match.
I tried other quantifiers, but couldn't get the desired output. Can anyone suggest a better regex?
PS: I could also traverse the content from the end to find the index, but I hope there is a better way in regex that does the job quickly.
There are a few ways to solve the problem, and the best one depends on the size of the input and the complexity of the pattern:
1. Reverse the input string and possibly the pattern; this might work for non-complex patterns. Unfortunately java.util.regex doesn't allow matching the pattern from right to left.
2. Instead of using a greedy quantifier, simply match the pattern and loop Matcher.find() until the last occurrence is found.
3. Use a different regex engine with better performance, e.g. RE2/J: linear time regular expression matching in Java.
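The Matcher.find() loop mentioned in the list above might look like the following. It is a minimal sketch; the class name LastMatchByFind and the method lastIndexOf are invented for illustration, and the character class [xyz] stands in for the question's (x|y|z).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchByFind {

    // Scans left to right and remembers the start of the most recent match.
    static int lastIndexOf(Pattern p, CharSequence s) {
        Matcher m = p.matcher(s);
        int last = -1;
        while (m.find()) {
            last = m.start();
        }
        return last;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[xyz]");
        System.out.println(lastIndexOf(p, "Vereador Luiz Pauly Home")); // prints 18
    }
}
```

Each call to find() resumes after the previous match, so the input is walked once and a string without any match is rejected without the backtracking that .* causes.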
If option 2 is not efficient enough for your case I'd suggest to try RE2/J:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
Performance issues with the (?s).*(x|y|z) regex come from the fact that .* is the first subpattern: it grabs the whole string first, and then backtracking occurs to find x, y or z. If there is no match, or the match is at the start of the string, and the string is very large, this might take a really long time.
The ([xyz])(?=[^xyz]*$) pattern seems a little bit better: it captures x, y or z and asserts there is no other x, y or z up to the end of the string, but it also is somewhat resource-consuming due to each lookahead check after a match is found.
The fastest regex to get your job done is
^(?:[^xyz]*+([xyz]))+
It matches
^ - start of string
(?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
[^xyz]*+ - any 0 or more chars other than x, y and z matched possessively (no backtracking into the pattern is allowed)
([xyz]) - Group 1: x, y or z.
The Group 1 value and data will belong to the last iteration of the repeated group (as all the preceding data is re-written with each subsequent iteration).
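Applied in Java, the possessive pattern above could be used as in this sketch. The class name PossessiveLastIndex is hypothetical; the pattern itself is the one given in the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveLastIndex {

    private static final Pattern LAST_XYZ = Pattern.compile("^(?:[^xyz]*+([xyz]))+");

    // Group 1 holds the capture of the last successful iteration of the repeated group.
    static int findIndex(String str) {
        Matcher m = LAST_XYZ.matcher(str);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home")); // prints 18
    }
}
```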
StringBuilder has a reverse() method and is also a CharSequence, so it can be searched directly.
Pattern p = Pattern.compile("[xyz]");
StringBuilder sb = new StringBuilder(str).reverse();
Matcher m = p.matcher(sb);
return m.find() ? sb.length() - m.end() : -1;
Unfortunately reversal is costly.
A solution without regex is probably faster.
(BTW surrogate pairs are handled correctly by the reversal.)
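A regex-free variant, as hinted at in both the question's PS and the remark above, could be a plain backward scan. This is a sketch under the assumption that the set of characters is given as a string; LastIndexScan and lastIndexOfAny are names made up for this example.

```java
class LastIndexScan {

    // Walks from the end and returns the first position whose char is in the set.
    static int lastIndexOfAny(String s, String chars) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (chars.indexOf(s.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(lastIndexOfAny("Vereador Luiz Pauly Home", "xyz")); // prints 18
    }
}
```

This works on UTF-16 code units, so it is only suitable as written when the characters searched for are in the Basic Multilingual Plane.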

I need a regex command that isolates all numbers not adjacent to a caret (^)

I am having a lot of trouble figuring Regex command out, and can't seem to find the right combination to fit what I want
Example:
Input: 1x^3+5x^2+6x+2
Output: 1 5 6 2
I need to isolate those values, as they are the coefficients of my polynomial. The input is a String so I figured the best way to do this was by using the .split() function with a custom Regex command.
You can use this regular expression:
(?<!\^)\d+(?!\^)
This uses a negative lookbehind and a negative lookahead to exclude digits adjacent to ^.
Since you want to extract coefficients, it finds one or more digits. Modify the middle part if needed.
You can use it this way in Java, for example:
Matcher m = Pattern.compile("(?<!\\^)\\d+(?!\\^)").matcher("1x^3+5x^2+6x+2");
while (m.find()) {
    System.out.println("Coefficient: " + m.group());
}
EDIT:
If you also want to detect negative coefficients, you can check for an optional - before digits:
(?<!\^)-?\d+(?!\^)
Keep in mind that as you try to capture more complicated patterns, regular expressions become less suitable as you may get lost in a number of cases to cover.

How can I obtain what .* matched in a regular expression?

I have thousands of different regular expressions and they look like this:
^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\/([0-9\.]+)
How do I obtain those substrings that match the .* in the regex? For example, for the above regex, I would get four substrings for four different .*s. In addition, I don't know in advance how many .*s there are, even though I can possibly find out by doing some simple operation on the given regex string, but that would impose more complexity on the program. I process a fairly big amount of data, so really focus on the efficiency here.
Replace the .*s with (.*)s and use matcher.group(n). For instance:
Pattern p = Pattern.compile("1(.*)2(.*)3");
Matcher m = p.matcher("1abc2xyz3");
m.find();
System.out.println(m.group(2));
xyz
Notice how the match of the second (.*) was returned (since m.group(2) was used).
Also, since you mentioned you won't know how many .*s your regex will contain, there is a matcher.groupCount() method you can use, if the only capturing groups in your regex will indeed be (.*)s.
For your own enlightenment, try reading about capturing groups.
How do I get those substrings that match the .* in the regex? For example, for the above regex, I would get four substrings for four different DOT STAR.
Use groups: (.*)
In addition, I don't know in advance how many DOT STARs there are
Build your regex string, then replace .* with (.*):
String myRegex = "your regex here";
myRegex = myRegex.replace(".*","(.*)");
even though I can possibly find out about that by doing some simple operation on the given regex string, but that would impose more complexity on the program
If you don't know how the regex is made and the regex is not built by your application, the only way is to process it after you have it. If you are building the regex, then append (.*) to the regex string instead of appending .*
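Putting the two suggestions together (replace .* with (.*) and iterate up to groupCount()), a sketch could look like this. The class name DotStarCapture and the sample input are invented for illustration. Note that a plain string replace would also rewrite an escaped dot followed by a star, and that any capturing groups already in the regex are returned as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotStarCapture {

    // Turns every ".*" into a capturing group and collects all group values.
    static List<String> captures(String regex, String input) {
        String withGroups = regex.replace(".*", "(.*)");
        Matcher m = Pattern.compile(withGroups).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "^Mozilla.*Android.*AppleWebKit.*Chrome.*OPR\\/([0-9\\.]+)";
        // Hypothetical input, shortened so the captured parts are easy to see.
        String input = "Mozilla-a-Android-b-AppleWebKit-c-Chrome-d-OPR/1.2";
        System.out.println(captures(regex, input));
    }
}
```

If the same regex is applied to many inputs, compiling the rewritten pattern once and reusing it avoids repeating the compilation cost.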

Extract all occurrences of pattern K and check if string matches "K*" in 1 pass

For a given input string and a given pattern K, I want to extract every occurrence of K (or some part of it (using groups)) from the string and check that the entire string matches K* (as in it consists of 0 or more K's with no other characters).
But I would like to do this in a single pass using regular expressions. More specifically, I'm currently finding the pattern using Matcher.find, but this is not strictly required.
How would I do this?
I already found a solution (and posted an answer), but would like to know if there is specific regex or Matcher functionality that addresses / can address this issue, or simply if there are better / different ways of doing it. But, even if not, I still think it's an interesting question.
Example:
Pattern: <[0-9]> (a single digit in <>)
Valid input: <1><2><3>
Invalid inputs:
<1><2>a<3>
<1><2>3
Oh look, a flying monkey!
<1><2><3
Code to do it in 2 passes with matches:
boolean products(String products)
{
    String regex = "(<[0-9]>)";
    Pattern pAll = Pattern.compile(regex + "*");
    if (!pAll.matcher(products).matches())
        return false;
    Pattern p = Pattern.compile(regex);
    Matcher matcher = p.matcher(products);
    while (matcher.find())
        System.out.println(matcher.group());
    return true;
}
1. Defining the problem
Since it is not clear what to output when the whole string does not match pattern K*, I will redefine the problem to make it clear what to output in such case.
Given any pattern K:
Check that the string has the pattern K*.
If the string has pattern K*, then split the string into non-overlapping tokens that match K.
If the string only has a prefix that matches pattern K*, then pick the prefix that is chosen by K*+ [1], and split the prefix into tokens that match K.
[1] I don't know if there is any way to get the longest prefix that matches K. Of course, you can always remove the last character one by one and test against K* until it matches, but that is obviously inefficient.
Unless specified otherwise, whatever I write below follows my problem description above. Note that the 3rd bullet point of the problem is there to resolve the ambiguity about which prefix string to take.
2. Repeated capturing group in .NET
The problem above can be solved if we have the solution to the problem:
Given a pattern (K)*, which is a repeated capturing group, get the captured text for all the repetitions, instead of only the last repetition.
In the case where the string has pattern K*, by matching against ^(K)*$, we can get all tokens that match pattern K.
In the case where the string only has prefix that matches K*, by matching against ^(K)*, we can get all tokens that match pattern K.
This is the case in .NET regex, since it keeps all the captured text for a repeated capturing group.
However, since we are using Java, we don't have access to such feature.
3. Solution in Java
Checking that the string has the pattern K* can always be done with Matcher.matches()/String.matches(), since the engine will do full-blown backtracking on the input string to somehow "unify" K* with the input string. The hard thing is to split the input string into tokens that matches pattern K.
If K* is equivalent to K*+
If the pattern K has the property:
For all strings [2], K* is equivalent to K*+, i.e. how the input string is split up into tokens that match pattern K is the same.
[2] You can define this condition for only the input strings you are operating on, but ensuring this pre-condition is not easy. When you define it for all strings, you only need to analyze your regex to check whether the condition holds or not.
Then a one-pass solution that solves the problem can be constructed. You can repeatedly use Matcher.find() on the pattern \GK, and check that the last match found is right at the end of the string. This is similar to your current solution, except that you do the boundary check with code.
The + after the quantifier * in K*+ makes the quantifier possessive. Possessive quantifier will prevent the engine from backtracking, which means each repetition is always the first possible match for the pattern K. We need this property so that the solution \GK has equivalent meaning, since it will also return the first possible match for the pattern K.
If K* is NOT equivalent to K*+
Without the property above, we need 2 passes to solve the problem. First pass to call Matcher.matches()/String.matches() on the pattern K*. On second pass:
If the string does not match pattern K*, we will repeatedly use Matcher.find() on the pattern \GK until no more match can be found. This can be done due to how we define which prefix string to take when the input string does not match pattern K*.
If the string matches pattern K*, repeatedly using Matcher.find() on the pattern \GK(?=K*$) is one solution. This results in redundant work matching the rest of the input string, though.
Note that this solution is universally applicable for any K. In other words, it also applies for the case where K* is equivalent to K*+ (but we will use the better one-pass solution for that case instead).
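The one-pass \GK approach described above could be sketched as follows for the question's K = <[0-9]>, a pattern for which K* and K*+ split the input the same way. The class name AnchoredTokens and the choice of returning an Optional are assumptions made for this illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredTokens {

    // \G forces each match to start exactly where the previous one ended.
    private static final Pattern K = Pattern.compile("\\G<[0-9]>");

    // Returns the tokens if the whole input is K*, otherwise an empty Optional.
    static Optional<List<String>> tokens(String input) {
        Matcher m = K.matcher(input);
        List<String> out = new ArrayList<>();
        int end = 0;
        while (m.find()) {
            out.add(m.group());
            end = m.end();
        }
        // The boundary check is done in code instead of with $ in the pattern.
        return end == input.length() ? Optional.of(out) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(tokens("<1><2><3>"));
        System.out.println(tokens("<1><2>a<3>"));
    }
}
```

The tokens are buffered in a list so that nothing is reported for an input that later turns out to be invalid.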
Here is an additional answer to the already accepted one. Below is an example code snippet that only goes through the pattern once with m.find(), which is similar to your one pass solution, but will not parse non-matching lines.
import java.util.regex.*;
class test {
    public static void main(String args[]) {
        String t = "<1><2><3>";
        Pattern pat = Pattern.compile("(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*)");
        Matcher m = pat.matcher(t);
        while (m.find()) {
            System.out.println("Matches!");
            System.out.println(m.group());
        }
    }
}
The regex explained:
<\\d> --This is your k pattern as defined above
?= -- positive lookahead (check what is ahead of K)
<\\d>* -- Match k 0 or more times
$ -- End of line
?<= -- positive lookbehind (check what is behind K)
^ -- beginning of line
<\\d>* -- followed by 0 or more Ks
Regular expressions are beautiful things.
Edit: As pointed out to me by @nhahtdh, this is just an implemented version of the answer. In fact the implementation above can be improved with the knowledge in the answer. (<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*) can be changed to \\G<\\d>(?=(<\\d>)*$).
Below is a one-pass solution using Matcher.start and Matcher.end.
boolean products(String products)
{
    String regex = "<[0-9]>";
    Pattern p = Pattern.compile(regex);
    Matcher matcher = p.matcher(products);
    int lastEnd = 0;
    while (matcher.find())
    {
        if (lastEnd != matcher.start())
            return false;
        System.out.println(matcher.group());
        lastEnd = matcher.end();
    }
    if (lastEnd != products.length())
        return false;
    return true;
}
The only disadvantage is that it will print out (or process) all values prior to finding invalid data.
For example, products("<1><2>a<3>"); will print out:
<1>
<2>
before returning false (because up to that point the string is valid).
Either having this happen or having to store all of them temporarily seems to be unavoidable.
String t = "<1><2><3>";
Pattern pat = Pattern.compile("(<\\d>)*");
Matcher m = pat.matcher(t);
if (m.matches()) {
    //String[] tt = t.split("(?<=>)"); // Look behind on '>'
    String[] tt = t.split("(?<=(<\\d>))"); // Look behind on K
}

Using Java regex to validate date from a long string

I'm trying to write a Java routine that can parse out dates from a long string, i.e. given the string:
"Please have the report to me by 6/15, because the shipment comes in on 6/18"
The regex would find both 6/15 and 6/18. I've looked on Stack Overflow and elsewhere, and most examples of a date regex simply verify whether a given string is a date or not, rather than finding dates within a larger amount of text. Ideally, I'd want a regex that could identify all of the main ways people numerically write dates i.e 6/15, 6/15/12, 06/15/12, 15/6/12, 15/06/12, although perhaps it would be best to separate these into different regexes for the purpose of cla. I'm new to regexes (I just started learning about them two days ago) and regexes are still a bit cryptic to me, so I'd appreciate a detailed explanation of any regex suggestions.
If you're not bothering with range checking, this suffices:
(\d{1,2})/(\d{1,2})(?:/(\d{4}|\d{2}))?
To check that you can't do 2/29/2001 but can do 2/29/2000, you really want to do it after the regexp has done its job, or you're going to end up in an asylum.
EDIT: Better yet, for isolating the century, and protecting against things like 2/193 (prompted by Alex's question, even though it's a separate issue):
\b(\d{1,2})/(\d{1,2})(?:/(\d{2})?(\d{2}))?\b
You'd get 4 captures in each match: [month, day, century, year], where century and year could be empty.
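To pull every date out of a longer text with the word-boundary pattern above, a Matcher.find() loop could be used as in this sketch. The class name DateFinder and method findDates are made up for the example; no range checking is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateFinder {

    // Groups: 1 = first number, 2 = second number, 3 = century (optional), 4 = year (optional).
    private static final Pattern DATE =
            Pattern.compile("\\b(\\d{1,2})/(\\d{1,2})(?:/(\\d{2})?(\\d{2}))?\\b");

    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Please have the report to me by 6/15, "
                + "because the shipment comes in on 6/18";
        System.out.println(findDates(text)); // prints [6/15, 6/18]
    }
}
```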
\d{1,2}/\d{1,2}(?:/(?:\d{2}){1,2})?
Here's the breakdown:
\d{1,2} matches 1 or 2 digits
/ followed by a /
\d{1,2} followed by 1 or 2 more digits
(?:/(?:\d{2}){1,2})? followed by an optional slash and a 2 or 4 digit year
From the matches, you'll probably want to parse them with a Java date parser (for example java.text.SimpleDateFormat) instead of trying to put all the validation rules in the regex.
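Range checking after the regex has done its job could, for instance, be delegated to java.time, which rejects impossible dates such as 2/29/2001. This is a sketch that assumes the month, day and full four-digit year have already been extracted as integers; DateCheck and isRealDate are hypothetical names.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

class DateCheck {

    // LocalDate.of throws DateTimeException for values that do not form a real date.
    static boolean isRealDate(int month, int day, int year) {
        try {
            LocalDate.of(year, month, day);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRealDate(2, 29, 2000)); // prints true
        System.out.println(isRealDate(2, 29, 2001)); // prints false
    }
}
```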
You may want to protect against fractions as well, e.g. 1/4th.
This can be done by appending a negative lookahead to your regex: (?!th|rd|nd) which causes the regex to not match if followed by th, rd, or nd.
What exactly is your question? You should read some guide about regex first.
You need a method that returns every match in the String like this:
p is the regex, text is your text.
private LinkedList<String> matches(String p, String text) {
    LinkedList<String> results = new LinkedList<String>();
    Pattern pattern = Pattern.compile(p);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        results.add(matcher.group());
    }
    return results;
}
You can separate each date-pattern with |
If you put a part of your regex into parentheses (...), this part is treated as a "group".
So you can extract single numbers out of the matching string (if you want to).
