Java pattern to find two groups of two letters in `ABC`

I have a pattern defined like this:
private static final Pattern PATTERN = Pattern.compile("[a-zA-Z]{2}");
And in my code I'm doing this:
Matcher matcher = PATTERN.matcher(myString);
and using a while loop to find all matches.
while (matcher.find()){
//do something here
}
If myString is 12345AB3CD45, the matcher finds the two groups of two letters (AB and CD). The problem is that myString is sometimes 12345ABC356, in which case I would like the matcher to find first AB and then BC (it only finds AB).
Am I doing this wrong, is the regex wrong, or does the matcher not work this way?

You can't match the same position several times with a regex, but you can use a trick.
To do that, enclose your pattern in a lookahead and a capture group:
(?=([A-Za-z]{2})). Because a lookahead is zero-width, the overall match consumes no characters, so the matcher advances only one position between matches.
The result you are looking for is in capture group 1.
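
As a minimal sketch of the lookahead trick described above (the class and method names are hypothetical, chosen only for illustration), the overlapping pairs can be collected from capture group 1 like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlappingPairs {
    // The lookahead itself matches the empty string; the two letters
    // it "sees" are stored in capture group 1.
    private static final Pattern PAIR = Pattern.compile("(?=([A-Za-z]{2}))");

    static List<String> find(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            // matcher.group() would be the empty string here
            pairs.add(matcher.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(find("12345ABC356"));  // [AB, BC]
        System.out.println(find("12345AB3CD45")); // [AB, CD]
    }
}
```

Note that group 0 is always empty with this pattern, so only group 1 is useful.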

A fragment of text that was consumed by group 0 (the entire match) can't be reused as part of group 0 in the next match.
12345ABC356
     ^^  - AB was placed in the standard match (group 0)
      ^^ - B can't be reused here as part of the standard match
You can solve this with a look-around mechanism such as look-ahead. Look-arounds are zero-length, so they don't consume the text they match, but you can still place that text in a separate capturing group and read it from there.
So your code can look like
private static final Pattern PATTERN = Pattern.compile("[a-zA-Z](?=([a-zA-Z]))");
//                                                      ^^^^^^^^   ^^^^^^^^^^
//                                                      group 0    group 1
//...
Matcher matcher = PATTERN.matcher(myString);
while (matcher.find()){
String match = matcher.group() + matcher.group(1);
//...
}

Related

How can Java's Matcher.group(int) avoid matching the contents of nested parentheses?

I have a string like
String str = "美国临时申请No.62004615";
And a regex like
String regex = "(((美国|PCT|加拿大){0,1})([\\u4E00-\\u9FA5]{1,8})((NO.|NOS.){1})([\\d]{5,}))";
And other code is
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println("1:"+matcher.group(1)+"\n"
+"2:"+matcher.group(2)+"\n"
+"3:"+matcher.group(3)+"\n"
+"4:"+matcher.group(4)+"\n"
+"5:"+matcher.group(5)+"\n"
+"6:"+matcher.group(6)+"\n"
+"7:"+matcher.group(7));
}
I know parentheses () are used to group regex phrases, and group 1 is the big outer group.
The second group is ((美国|PCT|加拿大){0,1}), to match "美国", "PCT", or "加拿大".
The third group is ([\u4E00-\u9FA5]{1,8}), to match one to eight Chinese characters.
The fourth group is ((NO.|NOS.){1}), to match NO. or NOS.
The fifth group is ([\d]{5,}), to match the number.
But the console output is
1:美国临时申请No.62004615
2:美国
3:美国
4:临时申请
5:No.
6:No.
7:62004615
Group 2 is the same as group 3, and group 5 is the same as group 6.
It seems that group 3 matches the inner parentheses inside the outer ones again. I wonder if there is a way to capture only the outermost parentheses.
The ideal result would be
1:美国临时申请No.62004615
2:美国
3:临时申请
4:No.
5:62004615

It sounds like you want a non-capturing group. From the Pattern documentation:
(?:X) X, as a non-capturing group
So, change this:
(美国|PCT|加拿大)
to this:
(?:美国|PCT|加拿大)
… and then it will not be represented as a group at all in the Matcher.
Some side notes:
{0,1} is the same as writing ?.
{1} does nothing and can be removed entirely.
[\\d] is the same as just \\d.
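
Putting the non-capturing group and the side notes together, a sketch of the full pattern might look like the following. Several things here are assumptions rather than part of the original question: the class and method names are hypothetical, the dots in NO. and NOS. are escaped so that they match a literal period, and the Chinese literals are written as Unicode escapes (\u7F8E\u56FD is 美国, \u52A0\u62FF\u5927 is 加拿大, \u4E34\u65F6\u7533\u8BF7 is 临时申请) only to keep the source file ASCII-safe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatentNumberGroups {
    // Only five capturing groups remain:
    // 1 = whole match, 2 = optional prefix, 3 = Chinese text,
    // 4 = "NO." or "NOS.", 5 = digits.
    // The prefix alternation is wrapped in (?:...) so it gets no number.
    // Assumption: dots are escaped so they match a literal '.'.
    private static final Pattern PATTERN = Pattern.compile(
            "(((?:\u7F8E\u56FD|PCT|\u52A0\u62FF\u5927)?)"
                    + "([\\u4E00-\\u9FA5]{1,8})"
                    + "(NO\\.|NOS\\.)"
                    + "(\\d{5,}))",
            Pattern.CASE_INSENSITIVE);

    // Returns groups 1..groupCount(), or an empty array if nothing matches.
    static String[] groups(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            result[i - 1] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same input string as in the question, spelled with escapes.
        String[] g = groups("\u7F8E\u56FD\u4E34\u65F6\u7533\u8BF7No.62004615");
        for (int i = 0; i < g.length; i++) {
            System.out.println((i + 1) + ":" + g[i]);
        }
    }
}
```

With the question's input this should yield five groups in the order the asker wanted: the whole match, the prefix, the Chinese text, No., and the number.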

how to exclude "<" in regex match

I have a String which looks like "<name><address> and <Phone_1>". I need to get results like
1) <name>
2) <address>
3) <Phone_1>
I have tried using regex "<(.*)>" but it returns just one result.

The regex you want is
<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>
Which will then spit out the stuff you want in the 3 capture groups. The full code would then look something like this:
Matcher m = Pattern.compile("<([^<>]+?)><([^<>]+?)> and <([^<>]+?)>").matcher(string);
if (m.find()) {
String name = m.group(1);
String address = m.group(2);
String phone = m.group(3);
}

The pattern .* in a regex is greedy. It will match as many characters as possible between the first < it finds and the last possible > it can find. In the case of your string it finds the first <, then looks for as much text as possible until a >, which it will find at the very end of the string.
You want a non-greedy or "lazy" pattern, which will match as few characters as possible: simply <(.+?)>. The question mark after the quantifier is the syntax for non-greedy matching.
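
To make the greedy versus lazy difference concrete, here is a small sketch that runs both patterns against the string from the question. The class name and helper method are hypothetical, for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AngleTags {
    // Collects every full match (group 0) of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "<name><address> and <Phone_1>";
        // Greedy: one match spanning from the first '<' to the last '>'.
        System.out.println(findAll("<(.*)>", input));
        // Lazy: stops at the first '>' each time, giving three matches.
        System.out.println(findAll("<(.+?)>", input));
    }
}
```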

This will work if you have a dynamic number of groups.
Pattern p = Pattern.compile("(<\\w+>)");
Matcher m = p.matcher("<name><address> and <Phone_1>");
while (m.find()) {
System.out.println(m.group());
}

Find characters that match a regex's set

I have a regex w_p[a-z]
It would match input like w_pa, w_pb ... w_pz. I would like to find exactly which character was matched, i.e. a, b or z for the above input. Is this possible with Java regex?

Yes, you need to capture:
final Pattern pattern = Pattern.compile("w_p([a-z])");
final Matcher m = pattern.matcher(input);
if (m.find()) {
// the matched character is in m.group(1)
final String matched = m.group(1);
}

Sure, use regex groups. w_p([a-z]) defines a group for the character you are looking for.
Pattern p = Pattern.compile("w_p([a-z])");
Matcher matcher = p.matcher(input);
if (matcher.find()) {
String character = matcher.group(1);
}
matcher.group(0) contains all that was matched (w_pa or w_pb etc.)
matcher.group(1) contains what was found in the first () pair.
See the documentation for more information.

The regex will be something like this:
w_p([a-z])
This creates a group from which you can get the value.

Find pattern in string with regex -> how to improve my solution

I would like to parse a string and get the "stringIAmLookingFor" part of it, which is surrounded by "_" at the beginning and the end. I'm using a regex to match that and then removing the "_" from the found string. This works, but I'm wondering if there is a more elegant approach to this problem.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w)*_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
String match = m.group();
match = match.replaceAll("_", "");
System.out.println(match);
}

Solution (partial)
Please also check the next section. Don't just read the solution here.
Just modify your code a bit:
String test = "xyz_stringIAmLookingFor_zxy";
// Make the capturing group capture the text in between (\w*)
// A capturing group is enclosed in (pattern), denoting the part of the
// pattern whose text you want to get separately from the main match.
// Note that there is also non-capturing group (?:pattern), whose text
// you don't need to capture.
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// The text is in the capturing group numbered 1
// The numbering is by counting the number of opening
// parentheses that makes up a capturing group, until
// the group that you are interested in.
String match = m.group(1);
System.out.println(match);
}
Matcher.group(), without any argument will return the text matched by the whole regex pattern. Matcher.group(int group) will return the text matched by capturing group with the specified group number.
If you are using Java 7 or later, you can use a named capturing group, which makes the code slightly more readable. The string matched by the capturing group can be accessed with Matcher.group(String name).
String test = "xyz_stringIAmLookingFor_zxy";
// (?<name>pattern) is similar to (pattern), just that you attach
// a name to it
// specialText is not a really good name, please use a more meaningful
// name in your actual code
Pattern p = Pattern.compile("_(?<specialText>\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// Access the text captured by the named capturing group
// using Matcher.group(String name)
String match = m.group("specialText");
System.out.println(match);
}
Problem in pattern
Note that \w also matches _. The pattern you have is ambiguous, and I don't know what your expected output is for the cases where there are more than 2 _ in the string. And do you want to allow underscore _ to be part of the output?
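
The ambiguity can be seen with an input that has more than two underscores. The sketch below compares \w* with a negated character class; the input "a_b_c_d" and the class and method names are hypothetical, picked only to illustrate the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnderscoreAmbiguity {
    // Collects the text of capture group 1 for every match.
    static List<String> captures(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "a_b_c_d";
        // \w also matches '_', so the greedy group runs up to the last '_'.
        System.out.println(captures("_(\\w*)_", input));   // [b_c]
        // [^_] cannot cross an underscore, so the group stops at the next '_'.
        System.out.println(captures("_([^_]*)_", input));  // [b]
    }
}
```

In the second case "c" is not reported because the underscore before it was already consumed by the first match.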

You can define the group you actually want, since you're already using parentheses. You just need to tweak your pattern a bit.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
System.out.println(m.group(1));
}

Use group(1) instead of group(), because group() returns the entire match and not the captured group.
Reference : http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)

"xyz_stringIAmLookingFor_zxy".replaceAll("^.*?_(\\w*)_.*$", "$1");
will replace the whole string with the group in parentheses

a simpler regex, no group needed:
"(?<=_)[^_]*"
if you want it more strict:
"(?<=_)[^_]+(?=_)"

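The two lookaround patterns above behave differently on the question's input, which a short sketch can show. The class and method names are hypothetical, and the second input "a_b_c_d" is an assumption added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundBetweenUnderscores {
    // Collects every full match (group 0); no capture group is needed.
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "xyz_stringIAmLookingFor_zxy";
        // Only requires a preceding '_', so the trailing "zxy" matches too.
        System.out.println(matches("(?<=_)[^_]*", input));
        // Requires '_' on both sides, so only the middle part matches.
        System.out.println(matches("(?<=_)[^_]+(?=_)", input));
        // Lookarounds consume nothing, so adjacent fields are all found.
        System.out.println(matches("(?<=_)[^_]+(?=_)", "a_b_c_d"));
    }
}
```

Because the underscores are never consumed, the strict pattern reports both "b" and "c" for "a_b_c_d", which a pattern that consumes the delimiters would not.
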
try
String s = "xyz_stringIAmLookingFor_zxy".replaceAll(".*_(\\w*)_.*", "$1");
System.out.println(s);
output
stringIAmLookingFor

java regex: extract text after delimiter?

I am new to regular expressions in Java. I would like to extract a string by using regular expressions.
This is my String: "Hello,World"
I like to extract the text after ",". The result would be "World". I tried this:
final Pattern pattern = Pattern.compile(",(.+?)");
final Matcher matcher = pattern.matcher("Hello,World");
matcher.find();
But what would be the next step?

You don't need Regex for this. You can simply split on comma and get the 2nd element from the array: -
System.out.println("Hello,World".split(",")[1]);
OUTPUT: -
World
But if you want to use a regex, you need to remove the ? from it.
A ? after + makes the quantifier reluctant: it matches as little as possible, so it will match only W and stop there.
You don't need that here; you need to match as much as possible.
So use greedy matching instead.
Here's the code with modified Regex: -
final Pattern pattern = Pattern.compile(",(.+)");
final Matcher matcher = pattern.matcher("Hello,World");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
OUTPUT: -
World
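
The reluctant and greedy quantifiers described above can be compared side by side. This is a minimal sketch with hypothetical class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AfterComma {
    // Returns capture group 1 of the first match, or null if none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Reluctant: .+? is satisfied by a single character.
        System.out.println(firstGroup(",(.+?)", "Hello,World")); // W
        // Greedy: .+ takes everything up to the end of the line.
        System.out.println(firstGroup(",(.+)", "Hello,World"));  // World
    }
}
```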

Extending what you have, you need to remove the ? sign from your pattern to use the greedy matching and then process the matched group:
final Pattern pattern = Pattern.compile(",(.+)"); // removed your '?'
final Matcher matcher = pattern.matcher("Hello,World");
while (matcher.find()) {
String result = matcher.group(1);
// work with result
}
Other answers suggest different approaches to your problem and might offer a better solution for what you need.

System.out.println( "Hello,World".replaceAll(".*,(.*)","$1") ); // output is "World"

You are using a reluctant expression and will only select a single character W, whereas you can use a greedy one and print your matched group content:
final Pattern pattern = Pattern.compile(",(.+)");
final Matcher matcher = pattern.matcher("Hello,World");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
Output:
World
See Regex Pattern doc
