How to split a string in two using a pivot a number? - java

I'm trying to split a string into two parts using regex, but apparently, the regex is greedy so in the first group it adds a little bit more.
Example of string: "This is a phrase 22ext"
Desired result:
Group 0 = "This is a phrase"
Group 1 = "22"
The "ex"t iss discarded.
I'm using the following Regex (in java):
[^0-9]*([0-9]+).*
It works for Group 1, but in Group 0, it includes "22ext" as well.
How can I avoid it?

Your regex doesn't give the desired output because you didn't add the first part of it in a group, so you only have one group in your regex 1. You can fix that by using:
([^0-9]*)([0-9]+).*
And then you can find your two strings in "Group 1" and "Group 2". Note that "Group 0" is the full match.
Here's a demo.
A better and shorter way is to use the following regex:
(\D*)(\d+)
Which matches any non-numeric characters in the first group (until it reaches the first numeric characters) and then it matches the upcoming numeric characters including all Unicode digits in the second group.
And you can decide whether or not to include the .* at the end.
Try it online.
References:
Difference between [0-9] and \d.
1 "Group 0" is the full match for the entire pattern, so you need to use "Group 1" and "Group 2".

Related

Regex capturing: get only result from second group

I have a following string:
'pp_3', 365]
What comes after pp_ may have different length. What comes after , and is before ] is what I'd like to capture (and only it). Its length varies but it is always a number.
I've come up with (?<=pp_).*,(.*)(?=]). It gives 3', 365 as a full match and in group 1 there is what I want '365'. How do I get only 365 as a full match?
Please let me know if I am unable to explain my doubts. Thanks
Try this:
[^_]*_(\d*)'\s*,\s*(\-?\d+)\s*].
This regex captures 2 groups, that correspond to each of the numbers, the first after pp_ and the second after ', (which may be negative). If you don't want to capture the first one as a group, instead of (\d*), just write (?:\d*).
Try this expression. The second group should be what you're after:
(?<='pp_)(\d*', )(\d*)]
To match the digits only and if you want to make use of a positive lookbehind, you could make use of a quantifier in the lookbehind (which you can specify yourself) which is supported by Java
(?<=pp_[^,]{0,1000}, )\d+(?=])
Explanation
(?<= Positive lookbehind, assert what is on the left is
pp_[^,]{0,1000} Match pp_, match any char except , 0-1000 times
, Match a comma and space
) Close lookbehind
\d+ Match 1+ digits
(?=]) Positive lookahead, assert what is on the right is ]
In Java
String regex = "(?<=pp_[^,]{0,1000}, )\\d+(?=])";
Java demo
You could also use a capturing group instead:
pp_[^,]*, (\d+)]
Regex demo

Java regex backreference for two digits

I am working with a regex and I want to use it on the replaceAll method of the String class in Java.
My regex works fine and groupCount() returns 11. So, when I try to replace my text using backreference pointing to the eleventh group, I am getting the first group with a "1" attached to it, instead of the group eleven.
String regex = "(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>":
String replacement = text.replaceAll(regex, $1$2$11");
I am expecting to get the following result:
<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>
But the $11 backreference is not returning the 11th group, it is returning the first group with a 1 attached to it, and instead I am getting the following result:
<span style="font-size:11.0pt">675-441-3144>1o:p></o:p></span>
Can someone please tell me how to access the eleventh group of my pattern?
Thanks.
Short Answer
The way you access the eleventh group of a match in the replacement is with $11.
Explanation:
As the corresponding Javadoc* states:
The replacement string may contain references to subsequences captured
during the previous match: Each occurrence of ${name} or $g will be
replaced by the result of evaluating the corresponding group(name) or
group(g) respectively. For $g, the first number after the $ is always
treated as part of the group reference. Subsequent numbers are
incorporated into g if they would form a legal group reference.
So generally speaking, as long as have at least eleven groups, then "$11" will evaluate to group(11). However, if you do not have at least eleven groups, then "$11" will evaluate to group(1) + "1".
* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads to.
Actual Answer
Your regex does not do what you think it does.
Part 1
The Problem
Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.
Group 1:
(>[^<]*?)
Group 2:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Group 11:
([^<]*<)
Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.
First option:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})
Second option:
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Now, given the text string, here is what is going on:
Group 1 executes. It matches the first ">".
Group 2 executes. It evaluates the options of its alternation in order.
The first option of group 2's alternation executes. It matches "675-441-3144".
Group 2's alternation successfully short-circuits upon the match of one of its options.
Group 2 as a whole is now equal to the option that matched, which is "675-441-3144".
The cursor is now positioned immediately after "675-441-3144", which is immediately before ";;;78888464#".
Group 11 executes. It matches everything up through the next "<", which is all of ";;;78888464#<".
Thus, some of the content that you want to be in group 2 is actually in group 11 instead.
The Solution
Do both of the following two things:
Convert the contents of group 2 from
option1|option2
to
option1(option2)?|option2
Change $11 in your replacement pattern to $12.
This will greedy match one or both options, rather than only one option. The modification to the replacement pattern is because we have added a group.
Part 2
The Problem
Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to because it begins with a mandatory capture group of 6-10 digits: (\d{6,16}), but ";;;78888464#" begins with a semicolon.
The Solution
Convert the contents of our original "option 2" from
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?
to
([;,\.]{1,3}\d{3,}#?)?
Part 3
The Problem
We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
The Solution
Do both of the following:
Convert the contents of our new "option 2" from
([;,.]{1,3}\d{3,}#?)?
to
[;,.]{1,3}\d{3,}#?
Change $12 in our replacement string to $10, since we have now removed one group in two locations.
The Final Solution
Putting everything together, our final solution is as follows.
Search regex:
(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)
Replacement regex:
$1$2$10
Java:
final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1$2$10";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);
Proof of correctness
Well, after trying to do it with replaceall without success, I had to implement the replacement method by myself:
public static String parsePhoneNumbers(String html){
StringBuilder regex = new StringBuilder(120);
regex.append("(>[^<]*?)(")
.append("((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?")
.append("(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?")
.append("((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})")
.append("([;,\.]{1,3}\d{3,}#?)?)")
.append(")+([^<]*<)");
StringBuilder mutableHtml = new StringBuilder(html.length());
Pattern pattern = Pattern.compile(regex.toString());
Matcher matcher = pattern.matcher(html);
int start = 0;
while(matcher.find()){
mutableHtml.append(html.substring(start, matcher.start()));
mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
.append(matcher.group(2)).append("\">").append(matcher.group(2))
.append("</a>").append(matcher.group(matcher.groupCount()));
start = matcher.end();
}
mutableHtml.append(html.substring(start));
return mutableHtml.toString();
}

Match regex but only replace the first section - Java

I'm trying to take a phone number which can be in the format either +44 or +4 followed by any number of digits or hyphens, and replace the +44 or +4 with +44 or +4 followed by a space.
I believe I need a look around to match the full number but only replace the initial prefix, what I'm trying atm is
^[+]\d[0-9](?:([0-9]+))?
which matches the number (without hyphens) however I thought the lookahead would only match the number and not capture the extra digits however it seems to capture the whole thing.
Can anyone point me in the right direction as to what I've done wrong?
EDIT:
To be clearer my Java code is
Pattern pattern = Pattern.compile("^[+]\\d[0-9](?:([0-9]+))?");
if(pattern.matcher("+441234567890").matches())
String num = pattern.matcher(title).replaceFirst("$0 $1");
Thanks.
If you want to match whole number, but replace only part of it, you should not use positive lookahead, but just gruping, like in:
(^\+\d\d)([\d-]+)?
prefix will be in group 1, and the rest of number in group 2, so to add a space between these parts, just use something like group1 + space + group2.
In your example it should look like this:
Pattern pattern = Pattern.compile("(^\\+\\d\\d)([\\d-]+)?");
if(pattern.matcher("+441234567890").matches()) {
num = pattern.matcher(title).replaceFirst("$1 $2");
}
However this regex will always capture two digits in prefix, if you want to match +44 or +4 you should use:
(^\+(44|4))([\d-]+)?
so if you have more possible prefixes, you need to change this regex also.
You regex didn't work as you expected because (?:([0-9]+))? is a non capturing group, so the fragment matched by this part of regex was not captured, but it was still matched by whole regex. So $0 returned whole regex, and $1 should not return anything.

regex to match a recurring pattern

I am trying to write a regex for java that will match the following string:
number,number,number (it could be this simple or it could have a variable number of numbers, but each number has to have a comma after it there will not be any white space though)
here was my attempt:
[[0-9],[0-9]]+
but it seems to match anything with a number in it
You could try something along the lines of ([0-9]+,)*[0-9]+
This will match:
Only one number, e.g.: 7
Two numbers, e.g.: 7,52
Three numbers, e.g.: 7,52,999
etc.
This will not match:
Things with spaces, e.g.: 7, 52
A list ending with a comma, e.g.: 7, 52,
Many other things out of the scope of this problem.
I think this would work
\d+,(\d+,)+
Note that as you want, that will only capture number followed by a comma
I guess you are starting with a String. Why don't you just use String.split(",") ?
^ means the start of a string and $ means the end. If you don't use those, you could match something in the middle (b matched "abc").
The + works on the element before it. b is an element, [0-9] is an element, and so are groups (things wrapped in parenthesis).
So, the regex you want matches:
The start of the string ^
a number [0-9]
any amount of comas flowed by numbers (,[0-9])+
the end of the string $
or, ^[0-9](,[0-9])+$
Try regex as [\d,]* string representation as [\\d,]* e.g. below:
Pattern p4 = Pattern.compile("[\\d,]*");
Matcher m4 = p4.matcher("12,1212,1212ad,v");
System.out.println(m4.find()); //prints true
System.out.println(m4.group());//prints 12,1212,1212
If you want to match minimum one comma (,) and two numbers e.g. 12,1212 then you may want to use regex as (\d+,)+\d+ with string representation as \\d+,)+\\d+. This regex matches a a region with a number minimum one digit followed by one comma(,) followed by minimum one digit number.

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); (IndexOutOfBoundException no Group 1)
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam suggestion I've now this regex that match my string
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})";
I match the number, and the 2 hours with matcher.group(1); matcher.group(2); matcher.group(3);
The matcher.group() function expects to take a single integer argument: The capturing group index, starting from 1. The index 0 is special, which means "the entire match". A capturing group is created using a pair of parenthesis "(...)". Anything within the parenthesis is captures. Groups are numbered from left to right (again, starting from 1), by opening parenthesis (which means that groups can overlap). Since there are no parenthesis in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
If you want to build one big regular expression that matches everything all at once (which I believe is what you want) then around each of the three sets of things that you want to capture, put a set of capturing parenthesis, use Matcher.match() and then Matcher.group(n) where n is 1, 2 and 3 respectively. Of course Matcher.match() might also return false, in which case the pattern did not match, and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc...I don't know enough about your exact input format, but here is an example.
Lets say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{1,2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2, and the time assumes that hour is always 0-padded out to 2 digits). I would give a closer example to what you are looking for, but the description of the possible input is a little vague.

Categories