Java Regex capture multiple groups with groups containing others


I'm trying to build a regular expression which captures multiple groups, with some of them being contained in others. For instance, let's say I want to capture every 4-gram that follows a 'to' prefix:
input = "I want to run to get back on shape"
expectedOutput = ["run to get back", "get back on shape"]
In that case I would use this regex:
"to((?:[ ][a-zA-Z]+){4})"
But it only captures the first item in expectedOutput (with a space prefix but that's not the point).
This is quite easy to solve without regex, but I'd like to know if it is possible only using regex.

You can use a lookahead to get overlapping matches:
String s = "I want to run to get back on shape";
Pattern pattern = Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1).trim());
}
See IDEONE demo
The regex (?=\bto\b((?:\s*[\p{L}\p{M}]+){4})) checks each location in the string (since it is a zero width assertion) and looks for:
\bto\b - a whole word to
((?:\s*[\p{L}\p{M}]+){4}) - Group 1 capturing 4 occurrences of
\s* zero or more whitespace(s)
[\p{L}\p{M}]+ - one or more letters or diacritics
If you want to allow capturing n-grams of fewer than 4 words, use a {0,4} (or {1,4} to require at least one word) greedy limiting quantifier instead of {4}.
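To make the difference concrete, here is a minimal sketch that runs both the consuming pattern from the question and the lookahead pattern from this answer over the same input. The class and method names (OverlappingNgrams, collect) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlappingNgrams {
    // Pattern from the question: each match consumes its four words.
    static final Pattern CONSUMING =
            Pattern.compile("to((?:[ ][a-zA-Z]+){4})");
    // Pattern from the answer: zero-width lookahead, so matches may overlap.
    static final Pattern LOOKAHEAD =
            Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");

    // Collects group 1 of every match, trimmed of surrounding whitespace.
    static List<String> collect(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I want to run to get back on shape";
        System.out.println(collect(CONSUMING, s)); // [run to get back]
        System.out.println(collect(LOOKAHEAD, s)); // [run to get back, get back on shape]
    }
}
```

The consuming pattern finds only one match because the second "to" has already been swallowed by the first match; the lookahead consumes nothing, so the matcher can try again at the second "to".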

Capturing groups are numbered by the position of their opening parenthesis, counting from left to right:
1 ((A)(B(C))) // first group (contains the second and third groups)
2 (A) // second group
3 (B(C)) // third group (contains the fourth group)
4 (C) // fourth group
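A quick way to check this numbering is to match the literal string "ABC" against ((A)(B(C))) and print every group. This is only a sketch; GroupOrder and groups are illustrative names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupOrder {
    // Returns group(0)..group(groupCount()) for a full match of input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match " + regex);
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Index 0 is the whole match; 1..4 follow the opening parentheses.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]
    }
}
```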

Related

Capture Regex repeating string between slashes in URL

I have following partial URL that can be
/it/xyz/test/param+1/param-2/1234/gfd4
Basically: two letters at the beginning, a slash, another unknown string, and then a series of repeatable strings between slashes.
I need to capture every string (I know a split with the / delimiter would be fine, but I am interested to know how I can extract them with regex). I first came up with this:
^\/([a-zA-Z]{2})\/([a-zA-Z]{1,10})(\/[a-zA-Z1-9\+\-]+)
but it only captures
group1: it
group2: xyz
group3: /test
and of course it ignores the rest of the string.
If I add a * sign at the end, it only captures the last segment:
^\/([a-zA-Z]{2})\/([a-zA-Z]{1,10})(\/[a-zA-Z1-9\+\-]+)*
group1: it
group2: xyz
group3: /gfd4
So, I am obviously missing some fundamentals, so in addition to the proper regex I would like to have an explanation.
I tagged this as Java because the engine that parses the regex is the one in JDK 7. As far as I know, each engine may behave differently.

As mentioned here, this is expected:
With one group in the pattern, you can only get one exact result in that group.
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
I would rather capture the rest of the string in group3 ((\/.*$), as in this demo), then use a split around '/'. Or apply that pattern on the rest of the string:
// In a Java string literal "/" needs no escape and the regex backslash must be doubled.
Pattern p = Pattern.compile("(/[a-zA-Z1-9+\\-]+)");
Matcher m = p.matcher(str);
while (m.find()) {
    String place = m.group(1);
    ...
}
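Put together, the two-step approach might look like the sketch below: one pattern matches the fixed prefix and captures the rest of the path, and a second pattern is applied repeatedly to that rest with find(). The names UrlSegments, PREFIX, SEGMENT and parse are invented for this example, and the segment character class uses 0-9 rather than the 1-9 of the question, which is an assumption about the intended input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlSegments {
    // Group 1: two letters, group 2: next string, group 3: everything after it.
    private static final Pattern PREFIX =
            Pattern.compile("^/([a-zA-Z]{2})/([a-zA-Z]{1,10})(/.*)$");
    // One path segment; applied repeatedly to group 3 of PREFIX.
    private static final Pattern SEGMENT =
            Pattern.compile("/([a-zA-Z0-9+\\-]+)");

    static List<String> parse(String path) {
        List<String> parts = new ArrayList<>();
        Matcher prefix = PREFIX.matcher(path);
        if (!prefix.find()) {
            return parts; // empty list: path does not have the expected shape
        }
        parts.add(prefix.group(1));
        parts.add(prefix.group(2));
        Matcher segment = SEGMENT.matcher(prefix.group(3));
        while (segment.find()) {
            parts.add(segment.group(1)); // each find() yields one segment
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("/it/xyz/test/param+1/param-2/1234/gfd4"));
        // prints [it, xyz, test, param+1, param-2, 1234, gfd4]
    }
}
```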

Java regex backreference for two digits

I am working with a regex and I want to use it on the replaceAll method of the String class in Java.
My regex works fine and groupCount() returns 11. So, when I try to replace my text using backreference pointing to the eleventh group, I am getting the first group with a "1" attached to it, instead of the group eleven.
String regex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(regex, "$1$2$11");
I am expecting to get the following result:
<span style="font-size:11.0pt">675-441-3144;;;78888464#<o:p></o:p></span>
But the $11 backreference is not returning the 11th group, it is returning the first group with a 1 attached to it, and instead I am getting the following result:
<span style="font-size:11.0pt">675-441-3144>1o:p></o:p></span>
Can someone please tell me how to access the eleventh group of my pattern?
Thanks.

Short Answer
The way you access the eleventh group of a match in the replacement is with $11.
Explanation:
As the corresponding Javadoc* states:
The replacement string may contain references to subsequences captured
during the previous match: Each occurrence of ${name} or $g will be
replaced by the result of evaluating the corresponding group(name) or
group(g) respectively. For $g, the first number after the $ is always
treated as part of the group reference. Subsequent numbers are
incorporated into g if they would form a legal group reference.
So generally speaking, as long as you have at least eleven groups, "$11" will evaluate to group(11). However, if you do not have at least eleven groups, "$11" will evaluate to group(1) + "1".
* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads to.
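The rule quoted above can be observed with two toy patterns, one with eleven groups and one with only two. This is a small sketch; the class name TwoDigitBackreference and the helper replaceWithEleventh are illustrative.

```java
public class TwoDigitBackreference {
    // Replaces every match of regex in input with the replacement "$11".
    static String replaceWithEleventh(String input, String regex) {
        return input.replaceAll(regex, "$11");
    }

    public static void main(String[] args) {
        // Eleven groups: "$11" is a legal group reference, so it means group(11).
        String elevenGroups = "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)";
        System.out.println(replaceWithEleventh("abcdefghijk", elevenGroups)); // k

        // Two groups: 11 is not a legal reference, so "$11" is group(1) + "1".
        System.out.println(replaceWithEleventh("ab", "(a)(b)")); // a1
    }
}
```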
Actual Answer
Your regex does not do what you think it does.
Part 1
The Problem
Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.
Group 1:
(>[^<]*?)
Group 2:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Group 11:
([^<]*<)
Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.
First option:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})
Second option:
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Now, given the text string, here is what is going on:
Group 1 executes. It matches the first ">".
Group 2 executes. It evaluates the options of its alternation in order.
The first option of group 2's alternation executes. It matches "675-441-3144".
Group 2's alternation successfully short-circuits upon the match of one of its options.
Group 2 as a whole is now equal to the option that matched, which is "675-441-3144".
The cursor is now positioned immediately after "675-441-3144", which is immediately before ";;;78888464#".
Group 11 executes. It matches everything up through the next "<", which is all of ";;;78888464#<".
Thus, some of the content that you want to be in group 2 is actually in group 11 instead.
The Solution
Do both of the following two things:
Convert the contents of group 2 from
option1|option2
to
option1(option2)?|option2
Change $11 in your replacement pattern to $12.
This will greedily match one or both options, rather than only one option. The replacement pattern has to change because we have added a group.
Part 2
The Problem
Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to because it begins with a mandatory capture group of 6-16 digits, (\d{6,16}), whereas ";;;78888464#" begins with a semicolon.
The Solution
Convert the contents of our original "option 2" from
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?
to
([;,\.]{1,3}\d{3,}#?)?
Part 3
The Problem
We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
The Solution
Do both of the following:
Convert the contents of our new "option 2" from
([;,.]{1,3}\d{3,}#?)?
to
[;,.]{1,3}\d{3,}#?
Change $12 in our replacement string to $10, since we have now removed one group in two locations.
The Final Solution
Putting everything together, our final solution is as follows.
Search regex:
(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)
Replacement regex:
$1$2$10
Java:
final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1$2$10";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);
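One way to check the final regex is to look at the groups directly with a Matcher instead of going through replaceAll. The sketch below reuses the search regex above unchanged; the class name PhoneGroups and the helper firstMatchGroups are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneGroups {
    static final Pattern SEARCH = Pattern.compile(
            "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)");

    // Returns groups 1, 2 and 10 of the first match, or null if nothing matches.
    static String[] firstMatchGroups(String text) {
        Matcher m = SEARCH.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(10) };
    }

    public static void main(String[] args) {
        String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
        String[] g = firstMatchGroups(text);
        System.out.println("group 1:  " + g[0]);
        System.out.println("group 2:  " + g[1]); // the phone number with its suffix
        System.out.println("group 10: " + g[2]);
    }
}
```

With the sample text, group 2 should now hold the complete "675-441-3144;;;78888464#", so a replacement of "$1$2$10" reproduces the matched text unchanged.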

Well, after trying to do it with replaceAll without success, I had to implement the replacement method myself:
public static String parsePhoneNumbers(String html) {
    // Backslashes are doubled so that the regex escapes survive the Java string literal.
    StringBuilder regex = new StringBuilder(120);
    regex.append("(>[^<]*?)(")
        .append("((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?")
        .append("(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?")
        .append("((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})")
        .append("([;,\\.]{1,3}\\d{3,}#?)?)")
        .append(")+([^<]*<)");
    StringBuilder mutableHtml = new StringBuilder(html.length());
    Pattern pattern = Pattern.compile(regex.toString());
    Matcher matcher = pattern.matcher(html);
    int start = 0;
    while (matcher.find()) {
        // Copy the text between the previous match and this one.
        mutableHtml.append(html.substring(start, matcher.start()));
        // Wrap the phone number (group 2) in a tel: link.
        mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
            .append(matcher.group(2)).append("\">").append(matcher.group(2))
            .append("</a>").append(matcher.group(matcher.groupCount()));
        start = matcher.end();
    }
    // Copy whatever follows the last match.
    mutableHtml.append(html.substring(start));
    return mutableHtml.toString();
}

How do I capture the text that is before and after a multiple regex matches in java?

Given a test string of:
I have a 1234 and a 2345 and maybe a 3456 id.
I would like to match all the IDs (the four digit numbers) AND at the same time get 12 characters of their surrounding text (before and after) (if any!)
So the matches should be:
BEFORE MATCH AFTER
Match #1: I have a- 1234 -and a 2345-
Match #2: -1234 and a- 2345 -and maybe a
Match #3: and maybe a- 3456 -id.
This (-) is a space character
Note:
The BEFORE match of Match #1 is not 12 characters long (not many characters at the beginning of the string). Same with the AFTER match of Match #3 (not many characters after the last match)
Can I achieve these matches with a single regex in java?
My best attempt so far is to use a positive look behind and an atomic group (to get the surrounding text) but it fails in the beginning and the end of the string when there are not enough characters (like my note above)
(?<=(.{12}))(\d{4})(?>(.{12}))
This matches only 2345. If I use a small enough value for the quantifiers (2 instead of 12, for example) then I correctly match all IDs.
Here is a link to my regex playground where I was trying my regexes:
http://regex101.com/r/cZ6wG4

When you look at the MatchResult interface (http://docs.oracle.com/javase/7/docs/api/java/util/regex/MatchResult.html) implemented by the Matcher class (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html), you will find the methods start() and end(), which give you the index of the first character of the match and the index just past its last character within the input string. Once you have the indices, you can use some simple math and the substring function to extract the parts you want.
I hope this helps you, because I won't write the entire code for you.
There might be a way to do what you want purely with regex, but I think using the indices and substring is easier (and probably more reliable).
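A minimal sketch of that index-based approach follows. It assumes the IDs are exactly four digits, as in the question; the class name ContextByIndex, the method find and the width parameter are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextByIndex {
    private static final Pattern ID = Pattern.compile("\\d{4}");

    // Returns one {before, match, after} row per ID, with at most
    // `width` characters of context on each side.
    static List<String[]> find(String input, int width) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            // Clamp the window so it never leaves the input string.
            int from = Math.max(0, m.start() - width);
            int to = Math.min(input.length(), m.end() + width);
            rows.add(new String[] {
                input.substring(from, m.start()),
                m.group(),
                input.substring(m.end(), to)
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        String s = "I have a 1234 and a 2345 and maybe a 3456 id.";
        for (String[] row : find(s, 12)) {
            System.out.println("[" + row[0] + "] " + row[1] + " [" + row[2] + "]");
        }
    }
}
```

Because the window is clamped with Math.max and Math.min, the first match simply gets a shorter "before" text and the last match a shorter "after" text, which is the case the lookbehind in the question could not handle.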

You can do it in a single regex:
Pattern regex = Pattern.compile("(?<=^.{0,10000}?(.{0,12}))(\\d+)(?=(.{0,12}))");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    String before = regexMatcher.group(1);
    String match = regexMatcher.group(2);
    String after = regexMatcher.group(3);
}
Explanation:
(?<= # Assert that the following can be matched before current position
^.{0,10000}? # Match as few characters as possible from the start of the string
(.{0,12}) # Match and capture up to 12 chars in group 1
) # End of lookbehind
(\d+) # Match and capture in group 2: Any number
(?= # Assert that the following can be matched here:
(.{0,12}) # Match and capture up to 12 chars in group 3
) # End of lookahead

You don't need a lookbehind or an atomic group for this, but you do need a lookahead:
(.{0,12}?)\b(\d+)\b(?=(.{0,12}))
I'm assuming your IDs are not enclosed in longer words (thus the \b). I used a reluctant quantifier in the leading portion ({0,12}?) to prevent it from consuming more than one ID when they're spaced close to each other, as in:
I have a 1234, 2345 and 1456 id.

How to determine substring OR match?

I have a regex that has an | (or) in it and I would like to determine what part of the or matched in the regex:
Possible Inputs:
-- Input 1 --
Stuff here to keep.
-- First --
all of this below
gets
deleted
-- Input 2 --
Stuff here to keep.
-- Second --
all of this below
gets
deleted
Regex to match part of an incoming input source and determine what part of the | (or) was matched? "-- First --" or "-- Second --"
Pattern PATTERN = Pattern.compile("^(.*?)-+ *(?:First|Second) *-+", Pattern.DOTALL);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
// How can I tell if the regex matched "First" or "Second"?
}
How can I tell which input was matched (First or Second)?

The regular expression does not contain that information. However, you could use some additional groups to figure it out.
Example pattern: (?:(First)|(Second))
On the string First the second capture group will not participate in the match (in Java, group(2) returns null), and with Second the same is true of the first one. A simple inspection of the groups returned to Java will tell you which part of the regex matched.
EDIT: I assumed that First and Second were used as placeholders for the sake of simplicity and actually represent more complex expressions. If you are really looking to find which of two strings was matched, then having a single capture group (like this: (First|Second)) and comparing its content with First will do the job just fine.
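Applied to the pattern from the question, the extra-groups idea could look like the sketch below. Note that the groups shift: group 1 is still the text to keep, group 2 is First and group 3 is Second. The class name WhichAlternative and the method matchedMarker are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhichAlternative {
    // Same pattern as in the question, with each alternative in its own group.
    private static final Pattern PATTERN =
            Pattern.compile("^(.*?)-+ *(?:(First)|(Second)) *-+", Pattern.DOTALL);

    // Returns "First", "Second", or null when the marker line is absent.
    static String matchedMarker(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            return null;
        }
        // A group that did not participate in the match is null.
        return m.group(2) != null ? "First" : "Second";
    }

    public static void main(String[] args) {
        String input1 = "Stuff here to keep.\n-- First --\nall of this below\ngets\ndeleted";
        String input2 = "Stuff here to keep.\n-- Second --\nall of this below\ngets\ndeleted";
        System.out.println(matchedMarker(input1)); // First
        System.out.println(matchedMarker(input2)); // Second
    }
}
```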

Because RegExes are stateless there is no way to tell by using only one regex.
The solution is to use two different RegExes and make a case decision.
However, you can use group() which returns the last match as String.
You can test this for .contains("First").
if(m.group().contains("First")) {
// case 1
} else {
// case 2
}

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: no group 1
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam's suggestion, I now have this regex that matches my string:
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I match the number, and the 2 hours with matcher.group(1); matcher.group(2); matcher.group(3);

The matcher.group() function takes a single integer argument: the capturing group index, starting from 1. Index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by opening parenthesis (which means that groups can be nested). Since there are no parentheses in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
If you want to build one big regular expression that matches everything all at once (which I believe is what you want), then put a set of capturing parentheses around each of the three things that you want to capture, call Matcher.matches() and then Matcher.group(n), where n is 1, 2 and 3 respectively. Of course Matcher.matches() might also return false, in which case the pattern did not match and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc...I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2; the pattern assumes that the hour is always 0-padded to 2 digits)". I would give an example closer to what you are looking for, but the description of the possible input is a little vague.
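For the string in the question, the find() approach described above might be sketched as follows: one pattern picks up the number after 'XX~ ', and a second one is applied repeatedly to collect the times. It assumes the times are always two digits, a colon and two digits; the class name ExtractFields and the method extract are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractFields {
    private static final Pattern NUMBER = Pattern.compile("XX~ (\\d+)");
    private static final Pattern TIME = Pattern.compile("\\d{2}:\\d{2}");

    // Returns the number after "XX~ " (if present) followed by every hh:mm time.
    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher number = NUMBER.matcher(input);
        if (number.find()) {
            out.add(number.group(1)); // group 1: digits only, without the prefix
        }
        Matcher time = TIME.matcher(input);
        while (time.find()) {
            out.add(time.group()); // group() is the whole match of this iteration
        }
        return out;
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(extract(myString)); // [1234, 06:30, 07:45]
    }
}
```

Note that find() has to be called before group(); calling group() on a fresh Matcher throws an IllegalStateException.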
