JAVA matchers group [duplicate] - java

This question already has answers here:
Java regex capture not working
(4 answers)
Closed 6 years ago.
I'm building a simple twitter user mention finder using regex.
public static Set<String> getMentionedUsers(List<Tweet> tweets) {
Set<String> mentionedUsers = new TreeSet<>();
String regex = "(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))#([A-Za-z][A-Za-z0-9_]+)";
for(Tweet tweet : tweets){
Matcher matcher = Pattern.compile(regex).matcher(tweet.getText().toLowerCase());
if(matcher.find()) {
mentionedUsers.add(matcher.group(0));
}
}
return mentionedUsers;
}
It fails to find a match when the expression is at the end of the text. For example, for "#glover tell me about #GREG" it returns only "#glover".

You have to keep looping with matcher.find() over a single tweet until it finds no more matches. Currently you check each tweet only once.
(Side note: you should compile the pattern outside of your for loop; even better would be to compile it once outside of the method.)
public static Set<String> getMentionedUsers(List<Tweet> tweets) {
Set<String> mentionedUsers = new TreeSet<>();
String regex = "(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))#([A-Za-z][A-Za-z0-9_]+)";
Pattern p = Pattern.compile(regex);
for(Tweet tweet : tweets){
Matcher matcher = p.matcher(tweet.getText().toLowerCase());
while (matcher.find()) {
mentionedUsers.add(matcher.group(0));
}
}
return mentionedUsers;
}

You are adding matcher.group(0) to your Set. Take a look at the Java docs:
Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
Capturing groups start from 1, so with your regex the name without the leading # is in group(1). See the reference:
Group number
Capturing groups are numbered by counting their opening parentheses
from left to right. In the expression ((A)(B(C))), for example, there
are four such groups:
1 ((A)(B(C)))
2 (A)
3 (B(C))
4 (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each
subsequence of the input sequence that matches such a group is saved.
The captured subsequence may be used later in the expression, via a
back reference, and may also be retrieved from the matcher once the
match operation is complete.
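To make the numbering concrete, here is a minimal sketch that lists group 0 and every numbered group for the ((A)(B(C))) example from the docs. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question or answers.
final class GroupNumberingDemo {

    // Returns group 0 followed by every numbered capturing group of the first match.
    static List<String> groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            // groupCount() does not include group 0, hence the <= bound.
            for (int i = 0; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [ABC, ABC, A, BC, C]
        System.out.println(groupsOfFirstMatch("((A)(B(C)))", "ABC"));
    }
}
```

Group 0 and group 1 are the same here only because the outermost parentheses happen to wrap the whole pattern.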

Related

Find ALL matches of a regex pattern in Java - even overlapping ones [duplicate]

This question already has answers here:
Matcher not finding overlapping words?
(4 answers)
Closed 4 years ago.
I have a String of the form:
1,2,3,4,5,6,7,8,...
I am trying to find all substrings in this string that contain exactly 4 digits. For this I have the regex [0-9],[0-9],[0-9],[0-9]. Unfortunately when I try to match the regex against my String, I never obtain all the substrings, only a part of all the possible substrings. For instance, in the example above I would only get:
1,2,3,4
5,6,7,8
although I expect to get:
1,2,3,4
2,3,4,5
3,4,5,6
...
How would I go about finding all matches corresponding to my regex?
For info, I am using Pattern and Matcher to find the matches:
Pattern pattern = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
Matcher matcher = pattern.matcher(myString);
List<String> matches = new ArrayList<String>();
while (matcher.find())
{
matches.add(matcher.group());
}
By default, successive calls to Matcher.find() start at the end of the previous match.
To search from a specific location, pass a start index to find(int): one character past the start of the previous match.
In your case, probably something like this (it needs a successful find() first, because start() throws IllegalStateException before any match):
while (matcher.find(matcher.start()+1))
This works fine:
Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
public void test(String[] args) throws Exception {
String test = "0,1,2,3,4,5,6,7,8,9";
Matcher m = p.matcher(test);
if(m.find()) {
do {
System.out.println(m.group());
} while(m.find(m.start()+1));
}
}
printing
0,1,2,3
1,2,3,4
...
If you are looking for a pure regex based solution then you may use this lookahead based regex for overlapping matches:
(?=((?:[0-9],){3}[0-9]))
Note that your matches are available in captured group #1
Code:
final String regex = "(?=((?:[0-9],){3}[0-9]))";
final String string = "0,1,2,3,4,5,6,7,8,9";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
0,1,2,3
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
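If you need this for more than one pattern, the lookahead trick can be wrapped in a small helper. This is only a sketch: the class and method names are hypothetical, and it assumes the caller reads the match from group 1 of the wrapper (any groups inside the supplied regex shift by one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class OverlappingMatches {

    // Wraps the regex in a lookahead containing one capturing group.
    // The lookahead itself matches the empty string, so the matcher
    // advances one character per find() and every start position is tried.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [1,2,3,4, 2,3,4,5, 3,4,5,6]
        System.out.println(findAll("(?:[0-9],){3}[0-9]", "1,2,3,4,5,6"));
    }
}
```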
Here is some sample code without regex (which does not seem necessary here; I would also expect regex to be slower in this case). As written, it only works as long as each element is exactly one character long.
String s = "a,b,c,d,e,f,g,h";
// Each window is 7 characters ("x,x,x,x"); step by 2 to skip one element and its comma.
for (int i = 0; i + 7 <= s.length(); i += 2) {
System.out.println(s.substring(i, i + 7));
}
Output for this string:
a,b,c,d
b,c,d,e
c,d,e,f
d,e,f,g
e,f,g,h
As @OldCurmudgeon pointed out, find() by default starts looking from the end of the previous match. To position it right after the first matched element, make the first matched region a capturing group and use its end index:
Pattern pattern = Pattern.compile("(\\d,)\\d,\\d,\\d");
Matcher matcher = pattern.matcher("1,2,3,4,5,6,7,8,9");
List<String> matches = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
start = matcher.end(1);
matches.add(matcher.group());
}
System.out.println(matches);
results in
[1,2,3,4, 2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9]
This approach also works when each element is longer than one digit, provided each \d in the pattern is replaced with \d+.
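For multi-digit elements, a sketch of the same idea might look like the following. It assumes every \d becomes \d+; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the answer's loop for multi-digit numbers.
final class SlidingNumberWindows {

    private static final Pattern FOUR_NUMBERS =
            Pattern.compile("(\\d+,)\\d+,\\d+,\\d+");

    static List<String> windowsOfFour(String input) {
        Matcher matcher = FOUR_NUMBERS.matcher(input);
        List<String> matches = new ArrayList<>();
        int start = 0;
        // find(int) resets the matcher and searches from the given index.
        while (matcher.find(start)) {
            matches.add(matcher.group());
            // Resume right after the first number and its comma.
            start = matcher.end(1);
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [10,20,30,40, 20,30,40,50]
        System.out.println(windowsOfFour("10,20,30,40,50"));
    }
}
```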

java split by bracket and keep the delimiter - RegEx [duplicate]

This question already has answers here:
How do I split a string in Java?
(39 answers)
Closed 6 years ago.
I am trying to split a string using a regex, with the closing bracket as the delimiter, and I have to keep the bracket.
i/p String: (GROUP=test1)(GROUP=test2)(GROUP=test3)(GROUP=test4)
needed o/p:
(GROUP=test1)
(GROUP=test2)
(GROUP=test3)
(GROUP=test4)
I am using the Java regex "\([^)]*?\)" and it throws an error. Below is the code I am using; the error is thrown when I try to get the group.
Pattern splitDelRegex = Pattern.compile("\\([^)]*?\\)");
Matcher regexMatcher = splitDelRegex.matcher("(GROUP=test1)(GROUP=test2) (GROUP=test3)(GROUP=test4)");
List<String> matcherList = new ArrayList<String>();
while(regexMatcher.find()){
String perm = regexMatcher.group(1);
matcherList.add(perm);
}
Any help is appreciated. Thanks.
You simply forgot to put capturing parentheses around the entire regex. You are not capturing anything at all, so there is no group 1 to read. Just wrap the whole regex in one extra pair of parentheses:
Pattern splitDelRegex = Pattern.compile("(\\([^)]*?\\))");
I tested this in Eclipse and got your desired output.
You could use
str.split("\\)")
(the closing parenthesis has to be escaped, because split takes a regex). That returns an array of strings that you know are lacking the closing parenthesis, so you can add it back afterwards. That seems much easier and less error-prone to me.
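If you do want to go the split route without re-adding the bracket by hand, one option is to split on a zero-width lookbehind, so the closing parenthesis stays attached to each piece. This is a sketch of a swapped-in technique rather than what the answer above does; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper showing split() with a zero-width lookbehind.
final class SplitKeepingBracket {

    // Splits at every position that is immediately preceded by ')'.
    // Nothing is consumed by the lookbehind, so the bracket is kept.
    static String[] splitAfterClosingBracket(String input) {
        return input.split("(?<=\\))");
    }

    public static void main(String[] args) {
        String input = "(GROUP=test1)(GROUP=test2)(GROUP=test3)(GROUP=test4)";
        // prints [(GROUP=test1), (GROUP=test2), (GROUP=test3), (GROUP=test4)]
        System.out.println(Arrays.toString(splitAfterClosingBracket(input)));
    }
}
```

The trailing empty string that would follow the last bracket is dropped, because split with no limit removes trailing empty strings.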
You could try changing this line :
String perm = regexMatcher.group(1);
To this :
String perm = regexMatcher.group();
That way you read the entire last match (group 0), which needs no capturing parentheses in the regex.
I'm not sure why you need to split the string at all. You can capture each of the bracketed groups with a regex.
Try this regex: (\\([a-zA-Z0-9=]*\\)). It has a capturing group () that looks for text that starts with a literal \\(, contains [a-zA-Z0-9=] zero or more times (*), and ends with a literal \\). This is a pretty loose regex; you could tighten the match if the text inside the brackets is predictable.
String input = "(GROUP=test1)(GROUP=test2)(GROUP=test3)(GROUP=test4)";
String regex = "(\\([a-zA-Z0-9=]*\\))";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while(matcher.find()) { // find the next match
System.out.println(matcher.group()); // print the match
}
Output:
(GROUP=test1)
(GROUP=test2)
(GROUP=test3)
(GROUP=test4)

Java Regular expressions issue - Can't match two strings in the same line [duplicate]

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 8 years ago.
I'm experiencing some problems with Java regular expressions.
I have a program that reads through an HTML file and replaces any string inside the #VR# characters, e.g. #VR#Test1 2 3 4#VR#
However, my issue is that if the line contains more than one string surrounded by #VR#, it does not match them separately. It matches the leftmost #VR# with the rightmost #VR# in the line and thus takes whatever is in between.
For example:
#VR#Google#VR#
My code would match
URL-GOES-HERE#VR#" target="_blank" style="color:#f4f3f1; text-decoration:none;" title="ContactUs">#VR#Google
Here is my Java code. Would appreciate if you could help me to solve this:
Pattern p = Pattern.compile("#VR#.*#VR#");
Matcher m;
Scanner scanner = new Scanner(htmlContent);
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
m = p.matcher(line);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String match_found = m.group().replaceAll("#VR#", "");
System.out.println("group: " + match_found);
}
}
I tried replacing m.group() with m.group(0) and m.group(1) but nothing. Also m.groupCount() always returns zero, even if there are two matches as in my example above.
Thanks, your help will be very much appreciated.
Your problem is that .* is "greedy"; it will try to match as long a substring as possible while still letting the overall expression match. So, for example, in #VR# 1 #VR# 2 #VR# 3 #VR#, it will match 1 #VR# 2 #VR# 3.
The simplest fix is to make it "non-greedy" (matching as little as possible while still letting the expression match), by changing the * to *?:
Pattern p = Pattern.compile("#VR#.*?#VR#");
"Also m.groupCount() always returns zero, even if there are two matches as in my example above."
That's because m.groupCount() returns the number of capture groups in the underlying pattern (parenthesized subexpressions, whose corresponding matched substrings are retrieved using m.group(1), m.group(2), and so on), not the number of matches. In your case, your pattern has no capture groups, so m.groupCount() returns 0.
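A small side-by-side sketch may help. It adds one capture group around the content so that group(1) and groupCount() have something to report; the class name, method name, and sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo comparing the greedy and the reluctant quantifier.
final class GreedyVersusReluctant {

    // Collects group 1 of every match of the given regex in the line.
    static List<String> contents(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "#VR#one#VR# and #VR#two#VR#";
        // Greedy: prints [one#VR# and #VR#two]
        System.out.println(contents("#VR#(.*)#VR#", line));
        // Reluctant: prints [one, two]
        System.out.println(contents("#VR#(.*?)#VR#", line));
        // One pair of capturing parentheses in the pattern: prints 1
        System.out.println(Pattern.compile("#VR#(.*?)#VR#").matcher(line).groupCount());
    }
}
```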
You can try the regular expression:
#VR#(((?!#VR#).)+)#VR#
Demo:
private static final Pattern REGEX_PATTERN =
Pattern.compile("#VR#(((?!#VR#).)+)#VR#");
public static void main(String[] args) {
String input = "#VR#Google#VR# ";
System.out.println(
REGEX_PATTERN.matcher(input).replaceAll("$1")
); // prints "Google "
}

Java regex for matching multiple keys in a string

Consider an input string like
Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5
and the regular expression
\b(TWO|FOUR)=([^ ]*)\b
Using this regular expression, the following code can extract the 2 specific key-value pairs out of the 5 total ones (i.e., only some predefined key-value pairs should be extracted).
public static void main(String[] args) throws Exception {
String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
}
}
More specifically, the main() method above prints
TWO = 2
FOUR = 4
but every time find() is invoked, the whole regular expression is evaluated for the part of the string remaining after the latest match, left to right.
Also, if the keys are not mutually distinct (or, if a regular expression with overlapping matches was used in the place of each key), there will be multiple matches. For instance, if the regex becomes
\b(O.*?|T.*?)=([^ ]*)\b
the above method yields
ONE = 1
TWO = 2
THREE = 3
If the regex was not fully re-evaluated but each alternative part was somehow examined once (or, if an appropriately modified regex was used), the output would have been
ONE = 1
TWO = 2
So, two questions:
Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?
Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings
With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \b(TWO|FOUR|SEVEN)=([^ ]*)\b
You can use it like this:
Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
String theMatch = regexMatcher.group();
String FoundToken = regexMatcher.group(1);
int EndPosition = regexMatcher.end();
}
You could then:
Test the value contained by FoundToken
Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
Using EndPosition, apply that regex to the remainder of the string (for example with find(EndPosition)).
Discussion
This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
It also serves your goal of avoiding duplicates.
Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.
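Put together, the steps above could be sketched roughly as follows. This is one possible reading of the approach, not a tuned implementation: the class and method names are hypothetical, and it assumes the keys are literal strings, so each one is passed through Pattern.quote.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of the "rebuild the regex after each hit" idea.
final class FirstOccurrencePerKey {

    static Map<String, String> extract(String input, Set<String> keys) {
        Set<String> remaining = new LinkedHashSet<>(keys);
        Map<String, String> found = new LinkedHashMap<>();
        int position = 0;
        while (!remaining.isEmpty()) {
            // Build the alternation from the keys that have not matched yet.
            String alternation = remaining.stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            Matcher m = Pattern
                    .compile("\\b(" + alternation + ")=([^ ]*)\\b")
                    .matcher(input);
            // Continue from the end of the previous match.
            if (!m.find(position)) {
                break;
            }
            found.put(m.group(1), m.group(2));
            remaining.remove(m.group(1)); // this key is never tested again
            position = m.end();
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 "
                + "comes before FOUR=4 and FIVE=5";
        // prints {TWO=2, FOUR=4}
        System.out.println(extract(input, Set.of("TWO", "FOUR", "SEVEN")));
    }
}
```

Recompiling the pattern on every hit has a cost of its own, which fits the caveat that this is not faster in the simple case.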

My Java regex isn't capturing the group

I'm trying to match the username with a regex. Please don't suggest a split.
USERNAME=geo
Here's my code:
String input = "USERNAME=geo";
Pattern pat = Pattern.compile("USERNAME=(\\w+)");
Matcher mat = pat.matcher(input);
if(mat.find()) {
System.out.println(mat.group());
}
Why doesn't it find geo in the group? I noticed that if I use group(1), it finds the username. However, group() returns USERNAME=geo. Why?
Because group() is equivalent to group(0), and group 0 denotes the entire pattern.
From the documentation:
public String group(int group)
Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group()
As you've found out, with your pattern, group(1) gives you what you want.
If you insist on using group(), you'd have to modify the pattern to something like "(?<=USERNAME=)\\w+".
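For comparison, here is a short sketch of both variants side by side. The class and method names are invented for the example; the two patterns are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: capturing group versus lookbehind.
final class UsernameExtraction {

    // The prefix is part of the match, so the name must be read from group 1.
    static String viaCapturingGroup(String input) {
        Matcher m = Pattern.compile("USERNAME=(\\w+)").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // The lookbehind checks the prefix without consuming it,
    // so the whole match (group 0) is already just the name.
    static String viaLookbehind(String input) {
        Matcher m = Pattern.compile("(?<=USERNAME=)\\w+").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(viaCapturingGroup("USERNAME=geo")); // prints geo
        System.out.println(viaLookbehind("USERNAME=geo"));     // prints geo
    }
}
```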
As Matcher.group() javadoc says, it "returns the input subsequence matched by the previous match", and the previous match in your case was "USERNAME=geo" since you've called find().
In contrast, the method group(int) returns specific group. Capturing groups are numbered by counting their opening parentheses from left to right, so the first group would match "geo" in your case.
So VAR.group(int i) will return the i-th capture group of the regex, with 0 being the full match. You need to call .group(1).
For your solution, here's what works:
public static void main(String[] args) {
String input = "USERNAME=geo";
Pattern pat = Pattern.compile("USERNAME=(\\w+)");
Matcher mat = pat.matcher(input);
if(mat.find()) {
System.out.println(mat.group(1));
}
}
Output
geo
Reason
String java.util.regex.Matcher.group(int group)
Returns the input subsequence captured by the given group during the previous match operation.
For a matcher m, input sequence s, and group index g, the expressions m.group(g) and s.substring(m.start(g), m.end(g)) are equivalent.
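The equivalence quoted above can be checked directly. This is a minimal sketch with hypothetical names; it assumes the requested group actually took part in the match (otherwise group(g) is null and start(g) is -1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check of m.group(g) against s.substring(m.start(g), m.end(g)).
final class GroupSubstringEquivalence {

    static boolean groupEqualsSubstring(String regex, String s, int g) {
        Matcher m = Pattern.compile(regex).matcher(s);
        if (!m.find()) {
            return false; // nothing matched, so there is nothing to compare
        }
        return m.group(g).equals(s.substring(m.start(g), m.end(g)));
    }

    public static void main(String[] args) {
        // Group 0 spans "USERNAME=geo", group 1 spans "geo"; both print true.
        System.out.println(groupEqualsSubstring("USERNAME=(\\w+)", "USERNAME=geo", 0));
        System.out.println(groupEqualsSubstring("USERNAME=(\\w+)", "USERNAME=geo", 1));
    }
}
```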
That's because group is supposed to return the string matching the pattern in its entirety. For getting a group within that string, you need to pass the group number that you want.
See here for details, paraphrased below:
group
public String group()
Returns the input subsequence matched by the previous match.
public String group(int group)
Returns the input subsequence captured by the given group during the previous match operation.
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
