Java Matching multiple tokens with Regex - java

I found a regular expression which matches tokens surrounded with {}, but it only seems to find the first item.
How can the following code be changed so that all of the tokens are found rather than just {World}? Would I need to use a loop?
// The search string
String str = "Hello {World} this {is} a {Tokens} test";
// The Regular expression (Finds {word} tokens)
Pattern pt = Pattern.compile("\\{([^}]*)\\}");
// Match the string with the pattern
Matcher m = pt.matcher(str);
// If results are found
if (m.find()) {
System.out.println(m);
System.out.println(m.groupCount()); // 1
System.out.println(m.group(0)); // {World}
System.out.println(m.group(1)); // World (Get without {})
}

The groupCount() method doesn't return the number of matches, it returns the number of capturing groups in this matcher's pattern. You defined one group in your pattern, hence this method returns 1.
You can find the next match for your pattern by calling find() again; it will attempt to find the next subsequence of the input sequence that matches the pattern. When it returns false, you'll know there are no more matches.
Thus, you should iterate through your matches like this:
while (m.find()) {
System.out.println(m.group(0));
}

Yes, in your code you just do one match, and get the groups captured in that single match.
If you want to get the other matches, you have to continue matching in a loop until find() returns false.
So basically all you need is to replace if with while and you're there.
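The loop described above can be sketched as a small helper that collects every token into a list. The class name TokenFinder and the method findTokens are made up for this illustration; the pattern and the input string are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class TokenFinder {
    // Same pattern as in the question: '{', any run of non-'}' characters, '}'.
    private static final Pattern TOKEN = Pattern.compile("\\{([^}]*)\\}");

    // Returns the text inside every {...} token, in order of appearance.
    static List<String> findTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {          // each call moves on to the next match
            tokens.add(m.group(1)); // group 1 is the token without the braces
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [World, is, Tokens]
        System.out.println(findTokens("Hello {World} this {is} a {Tokens} test"));
    }
}
```

Note that tokens.size() gives the number of matches, whereas m.groupCount() would still be 1 for this pattern.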

Related

Java: Need to extract a number from a string

I have a string containing a number. Something like "Incident #492 - The Title Description".
I need to extract the number from this string.
I tried:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(theString);
String substring = m.group();
but I get an error:
java.lang.IllegalStateException: No match found
What am I doing wrong?
What is the correct expression?
I'm sorry for such a simple question, but I searched a lot and still haven't found how to do this (maybe because it's too late here...)
You are getting this exception because you need to call find() on the matcher before accessing groups:
Matcher m = p.matcher(theString);
while (m.find()) {
String substring =m.group();
System.out.println(substring);
}
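To make the cause of the exception concrete, here is a minimal sketch that calls group() once before and once after find(). The class and method names are invented for this example; only Pattern, Matcher and IllegalStateException come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
class GroupBeforeFind {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns true if group() on a fresh matcher throws IllegalStateException.
    static boolean throwsBeforeFind(String input) {
        Matcher m = DIGITS.matcher(input);
        try {
            m.group(); // no match has been attempted yet
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Returns the first run of digits, or null if the input has none.
    static String firstDigits(String input) {
        Matcher m = DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Incident #492 - The Title Description";
        System.out.println(throwsBeforeFind(s)); // prints true
        System.out.println(firstDigits(s));      // prints 492
    }
}
```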
There are two things wrong here:
The pattern you're using is not ideal for your scenario: \\d+ on its own matches only a run of digits, so with matches() it succeeds only when the whole string is numeric. Also, since it doesn't contain a capturing group, a call to group() is equivalent to calling group(0), which returns the entire match.
You need to be certain that the matcher has a match before you go calling a group.
Let's start with the regex. Right now it is \\d+.
With matches(), that will only ever match a string that consists entirely of digits. What you care about is specifically the number in that string, so you want an expression that:
Doesn't care about what's in front of it
Doesn't care about what's after it
Only matches on one occurrence of numbers, and captures it in a group
To that, you'd use this expression:
.*?(\\d+).*
The last part is to ensure that the matcher can find a match, and that it gets the correct group. That's accomplished by this:
if (m.matches()) {
String substring = m.group(1);
System.out.println(substring);
}
All together now:
Pattern p = Pattern.compile(".*?(\\d+).*");
final String theString = "Incident #492 - The Title Description";
Matcher m = p.matcher(theString);
if (m.matches()) {
String substring = m.group(1);
System.out.println(substring);
}
You need to invoke one of the Matcher methods, like find, matches or lookingAt to actually run the match.
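As a rough comparison of the three methods named above, the following sketch runs find, matches and lookingAt with \\d+ and then uses matches with the wider pattern from the earlier answer. The class MatchModes and its method names are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the three match operations.
class MatchModes {
    private static final Pattern DIGITS = Pattern.compile("\\d+");
    private static final Pattern WRAPPED = Pattern.compile(".*?(\\d+).*");

    // find(): succeeds if the pattern occurs anywhere in the input.
    static boolean findDigits(String s) {
        return DIGITS.matcher(s).find();
    }

    // matches(): succeeds only if the whole input matches the pattern.
    static boolean matchesDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    // lookingAt(): anchored at the start, but need not consume the whole input.
    static boolean lookingAtDigits(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // matches() with the wider pattern; group 1 holds the first number.
    static String numberViaMatches(String s) {
        Matcher m = WRAPPED.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "Incident #492 - The Title Description";
        System.out.println(findDigits(s));       // prints true
        System.out.println(matchesDigits(s));    // prints false
        System.out.println(lookingAtDigits(s));  // prints false
        System.out.println(numberViaMatches(s)); // prints 492
    }
}
```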

Regex to match only letters and numbers

Can you help with this code?
It seems easy, but always fails.
@Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
//Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
assertEquals("P12345678P", ret.toString());
}
Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:
@Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");
while (matches.find()) {
ret.append(matches.group());
}
assertEquals("P12345678P", ret.toString());
}
Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me to suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1 -- it inquires about the pattern, not about matches to it.
First of all, you don't need to add a group, because the entire match can always be accessed as group 0. So instead of
(regex) and group(1)
you can use
regex and group(0)
Next, \\w is already a character class, so you don't need to surround it with another [ ]; that would be like writing [[a-z]], which is the same as [a-z].
Now in your
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
you will iterate over all groups starting from 1, but you will exclude the last one: groups are indexed from 1 to n, so i < n will not include n. You would need to use i <= matches.groupCount() instead.
Also, it looks like you are confusing two things. This loop will not find all matches of the regex in the input. Such a loop is used to iterate over the groups of the regex after a match has been found.
So if the regex were something like (\w(\w))c and your match were abc, then
for (int i = 1; i <= matches.groupCount(); i++)
System.out.println(matches.group(i));
would print
ab
b
because
first group contains two characters (\w(\w)) before c
second group is the one inside first one, right after first character.
But to print them you actually would need to first let regex engine iterate over your input and find() match, or check if entire input matches() regex, otherwise you would get IllegalStateException because regex engine can't know from which match you want to get your groups (there can be many matches of regex in input).
So what you may want to use is something like
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());
Another (and probably simpler) approach is to remove all the characters you don't want from your input. So you could just use replaceAll with a negated character class [^...], like
String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");
which will produce new string in which all characters which are not A-Z0-9 will be removed (replaced with "").
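The difference between iterating over groups and iterating over matches, as discussed in the answers above, can be sketched side by side. The class GroupsVersusMatches and its methods are hypothetical names for this example; the regexes and inputs are the ones used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answers.
class GroupsVersusMatches {
    // Lists the capturing groups (1..groupCount) of the FIRST match only.
    static List<String> groupsOfFirstMatch(String regex, String input) {
        List<String> groups = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) { // a match must exist before groups can be read
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    // Lists the whole text (group 0) of EVERY match in the input.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [ab, b]
        System.out.println(groupsOfFirstMatch("(\\w(\\w))c", "abc"));
        // prints P12345678P
        System.out.println(String.join("", allMatches("[A-Z0-9]", "P-12345678-P")));
    }
}
```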

Find pattern in string with regex -> how to improve my solution

I would like to parse a string and get the "stringIAmLookingFor" part of it, which is surrounded by "_" at the beginning and the end. I'm using a regex to match that and then remove the "_" from the found string. This works, but I'm wondering if there is a more elegant approach to this problem.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w)*_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
String match = m.group();
match = match.replaceAll("_", "");
System.out.println(match);
}
Solution (partial)
Please also check the next section. Don't just read the solution here.
Just modify your code a bit:
String test = "xyz_stringIAmLookingFor_zxy";
// Make the capturing group capture the text in between (\w*)
// A capturing group is enclosed in (pattern), denoting the part of the
// pattern whose text you want to get separately from the main match.
// Note that there is also non-capturing group (?:pattern), whose text
// you don't need to capture.
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// The text is in the capturing group numbered 1
// The numbering is by counting the number of opening
// parentheses that makes up a capturing group, until
// the group that you are interested in.
String match = m.group(1);
System.out.println(match);
}
Matcher.group(), without any argument will return the text matched by the whole regex pattern. Matcher.group(int group) will return the text matched by capturing group with the specified group number.
If you are using Java 7 or later, you can make use of named capturing groups, which make the code slightly more readable. The string matched by the capturing group can be accessed with Matcher.group(String name).
String test = "xyz_stringIAmLookingFor_zxy";
// (?<name>pattern) is similar to (pattern), just that you attach
// a name to it
// specialText is not a really good name, please use a more meaningful
// name in your actual code
Pattern p = Pattern.compile("_(?<specialText>\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// Access the text captured by the named capturing group
// using Matcher.group(String name)
String match = m.group("specialText");
System.out.println(match);
}
Problem in pattern
Note that \w also matches _. The pattern you have is ambiguous, and I don't know what your expected output is for the cases where there are more than 2 _ in the string. And do you want to allow underscore _ to be part of the output?
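To illustrate the ambiguity, the sketch below compares the pattern above with a variant that excludes underscores from the captured text. The input "a_b_c_d", the class name and the [^_]* variant are assumptions chosen for this example, not something the question specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing how \w and [^_] differ around underscores.
class UnderscoreTokens {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String original = "xyz_stringIAmLookingFor_zxy";
        String several = "a_b_c_d"; // assumed input with more than two underscores

        // With exactly two underscores both patterns agree.
        System.out.println(firstGroup("_(\\w*)_", original));   // prints stringIAmLookingFor
        System.out.println(firstGroup("_([^_]*)_", original));  // prints stringIAmLookingFor

        // With more underscores, greedy \w* runs across the inner ones.
        System.out.println(firstGroup("_(\\w*)_", several));    // prints b_c
        System.out.println(firstGroup("_([^_]*)_", several));   // prints b
    }
}
```

Which of the two results is right depends on whether underscores may be part of the token.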
You can define the group you actually want, since you're already using parentheses. You just need to tweak your pattern a bit.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
System.out.println(m.group(1));
}
Use group(1) instead of group(), because group() will get you the entire match and not the capturing group.
Reference : http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)
"xyz_stringIAmLookingFor_zxy".replaceAll(".*_(\\w*)_.*", "$1");
will replace the whole string with the text captured by the group in parentheses. Note that the * has to be inside the group: with (\\w)* the group would only hold the last character.
A simpler regex, no group needed:
"(?<=_)[^_]*"
If you want it more strict:
"(?<=_)[^_]+(?=_)"
Try:
String s = "xyz_stringIAmLookingFor_zxy".replaceAll(".*_(\\w*)_.*", "$1");
System.out.println(s);
output
stringIAmLookingFor

Java Matcher. Return several entries from one sequence

For example, I have the following regexp: \d{2} (2 digits). And when I use
Matcher matcher = Pattern.compile("\\d{2}").matcher("123");
matcher.find();
String result = matcher.group();
In result variable I get only first entry, i.e. 12. But I want to get ALL possible entries, i.e. 12 and 23.
How to achieve this?
You'll need the help of a capture group within a positive lookahead:
Matcher m = Pattern.compile("(?=(\\d{2}))").matcher("1234");
while (m.find()) System.out.println(m.group(1));
prints
12
23
34
That's not how regular expression matching works. The matcher starts at the beginning of the string, and each time it finds a match it continues looking from the character following the end of that match - it will not give you overlapping matches.
If you want to find overlapping matches of an arbitrary regular expression without needing to use lookaheads and capturing groups you can do this by resetting the matcher's "region" after each match
Matcher matcher = Pattern.compile(theRegex).matcher(str);
// prevent ^ and $ from matching the beginning/end of the region when this is
// smaller than the whole string
matcher.useAnchoringBounds(false);
// allow lookaheads/behinds to look outside the current region
matcher.useTransparentBounds(true);
while(matcher.find()) {
System.out.println(matcher.group());
if(matcher.start() < str.length()) {
// start looking again from the character after the _start_ of the previous
// match, instead of the character following the _end_ of the match
matcher.region(matcher.start() + 1, str.length());
}
}
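Both approaches from the answers above can be put into one runnable sketch. The class OverlappingMatches and its method names are invented for this example, and wrapping the caller's regex in (?=(...)) is an assumption about how one might generalize the lookahead answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class collecting overlapping matches in two ways.
class OverlappingMatches {
    // Lookahead approach: the match itself is empty, group 1 holds the text.
    static List<String> viaLookahead(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Region approach: restart the search one character after each match start.
    static List<String> viaRegion(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        m.useAnchoringBounds(false);  // ^ and $ refer to the whole input
        m.useTransparentBounds(true); // lookarounds may see outside the region
        while (m.find()) {
            out.add(m.group());
            if (m.start() >= input.length()) {
                break; // nothing left to search
            }
            m.region(m.start() + 1, input.length());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(viaLookahead("\\d{2}", "123")); // prints [12, 23]
        System.out.println(viaRegion("\\d{2}", "1234"));   // prints [12, 23, 34]
    }
}
```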
Something like this:
^(?=[1-3]{2}$)(?!.*(.).*\1).*$

My Java regex isn't capturing the group

I'm trying to match the username with a regex. Please don't suggest a split.
USERNAME=geo
Here's my code:
String input = "USERNAME=geo";
Pattern pat = Pattern.compile("USERNAME=(\\w+)");
Matcher mat = pat.matcher(input);
if(mat.find()) {
System.out.println(mat.group());
}
Why doesn't it find geo in the group? I noticed that if I use .group(1), it finds the username. However, the group() method returns USERNAME=geo. Why?
Because group() is equivalent to group(0), and group 0 denotes the entire pattern.
From the documentation:
public String group(int group)
Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group()
As you've found out, with your pattern, group(1) gives you what you want.
If you insist on using group(), you'd have to modify the pattern to something like "(?<=USERNAME=)\\w+".
As Matcher.group() javadoc says, it "returns the input subsequence matched by the previous match", and the previous match in your case was "USERNAME=geo" since you've called find().
In contrast, the method group(int) returns specific group. Capturing groups are numbered by counting their opening parentheses from left to right, so the first group would match "geo" in your case.
So VAR.group(int i) will return the i-th capture group of the regex, with 0 being the full match. You need to call .group(1).
For your solution, here's what works:
public static void main(String[] args) {
String input = "USERNAME=geo";
Pattern pat = Pattern.compile("USERNAME=(\\w+)");
Matcher mat = pat.matcher(input);
if(mat.find()) {
System.out.println(mat.group(1));
}
}
Output
geo
Reason
String java.util.regex.Matcher.group(int group)
Returns the input subsequence captured by the given group during the previous match operation.
For a matcher m, input sequence s, and group index g, the expressions m.group(g) and s.substring(m.start(g), m.end(g)) are equivalent.
That's because group is supposed to return the string matching the pattern in its entirety. For getting a group within that string, you need to pass the group number that you want.
See the Matcher Javadoc for details, paraphrased below:
group
public String group()
Returns the input subsequence matched by the previous match.
public String group(int group)
Returns the input subsequence captured by the given group during the previous match operation.
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
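The behaviour quoted from the documentation can be checked with a short sketch. The class UsernameGroups and its method names are made up for this example; the patterns and the input are the ones from the question and the lookbehind answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answers.
class UsernameGroups {
    private static final Pattern WITH_GROUP = Pattern.compile("USERNAME=(\\w+)");
    private static final Pattern LOOKBEHIND = Pattern.compile("(?<=USERNAME=)\\w+");

    // group() is group(0): the text matched by the whole pattern.
    static String wholeMatch(String input) {
        Matcher m = WITH_GROUP.matcher(input);
        return m.find() ? m.group() : null;
    }

    // group(1): the text captured by the first pair of parentheses.
    static String firstGroup(String input) {
        Matcher m = WITH_GROUP.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // With a lookbehind the prefix is not part of the match, so group() suffices.
    static String viaLookbehind(String input) {
        Matcher m = LOOKBEHIND.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Checks that m.group(g) equals s.substring(m.start(g), m.end(g)) for g = 1.
    static boolean substringEquivalent(String input) {
        Matcher m = WITH_GROUP.matcher(input);
        return m.find()
                && m.group(1).equals(input.substring(m.start(1), m.end(1)));
    }

    public static void main(String[] args) {
        String input = "USERNAME=geo";
        System.out.println(wholeMatch(input));          // prints USERNAME=geo
        System.out.println(firstGroup(input));          // prints geo
        System.out.println(viaLookbehind(input));       // prints geo
        System.out.println(substringEquivalent(input)); // prints true
    }
}
```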
