I have a String of the form:
1,2,3,4,5,6,7,8,...
I am trying to find all substrings in this string that contain exactly 4 digits. For this I have the regex [0-9],[0-9],[0-9],[0-9]. Unfortunately when I try to match the regex against my String, I never obtain all the substrings, only a part of all the possible substrings. For instance, in the example above I would only get:
1,2,3,4
5,6,7,8
although I expect to get:
1,2,3,4
2,3,4,5
3,4,5,6
...
How would I go about finding all matches corresponding to my regex?
For info, I am using Pattern and Matcher to find the matches:
Pattern pattern = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
Matcher matcher = pattern.matcher(myString);
List<String> matches = new ArrayList<String>();
while (matcher.find())
{
    matches.add(matcher.group());
}
By default, each successive call to Matcher.find() starts at the end of the previous match.
To search from a specific position instead, pass a start index to find(int): one character past the start of the previous match.
In your case, once a first match has been found (start() throws before that), probably something like:
while (matcher.find(matcher.start()+1))
This works fine:
Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
public void test(String[] args) throws Exception {
    String test = "0,1,2,3,4,5,6,7,8,9";
    Matcher m = p.matcher(test);
    if (m.find()) {
        do {
            System.out.println(m.group());
        } while (m.find(m.start() + 1));
    }
}
printing
0,1,2,3
1,2,3,4
...
If you are looking for a pure regex based solution then you may use this lookahead based regex for overlapping matches:
(?=((?:[0-9],){3}[0-9]))
Note that your matches are available in capturing group 1.
Code:
final String regex = "(?=((?:[0-9],){3}[0-9]))";
final String string = "0,1,2,3,4,5,6,7,8,9";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
0,1,2,3
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
Some sample code without regex (since a regex does not seem useful to me here). I would also assume a regex to be slower in this case. However, it only works as written as long as each number is exactly one character long.
String s = "a,b,c,d,e,f,g,h";
// A window of 4 items spans 7 characters; step by 2 to skip one item and its comma.
for (int i = 0; i <= s.length() - 7; i += 2) {
    System.out.println(s.substring(i, i + 7));
}
Output for this string:
a,b,c,d
b,c,d,e
c,d,e,f
d,e,f,g
e,f,g,h
As @OldCurmudgeon pointed out, find() by default starts looking from the end of the previous match. To position it right after the first matched element, make that element a capturing group and use its end index:
Pattern pattern = Pattern.compile("(\\d,)\\d,\\d,\\d");
Matcher matcher = pattern.matcher("1,2,3,4,5,6,7,8,9");
List<String> matches = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
start = matcher.end(1);
matches.add(matcher.group());
}
System.out.println(matches);
results in
[1,2,3,4, 2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9]
This approach also works when the matched elements are longer than one digit (with \\d+ in place of \\d).
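To illustrate the multi-digit case, here is a minimal sketch of the same capturing-group technique with \\d+ in place of \\d. The class name MultiDigitWindows, the helper windows and the sample input are made up for this example; they are not from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiDigitWindows {
    // Hypothetical helper: collects overlapping windows of four comma-separated numbers.
    static List<String> windows(String input) {
        // Group 1 is the first number plus its comma; its end marks where the next window starts.
        Pattern pattern = Pattern.compile("(\\d+,)\\d+,\\d+,\\d+");
        Matcher matcher = pattern.matcher(input);
        List<String> matches = new ArrayList<>();
        int start = 0;
        while (matcher.find(start)) {
            matches.add(matcher.group());
            start = matcher.end(1); // resume right after the first element of this match
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [10,20,30,40, 20,30,40,50, 30,40,50,60]
        System.out.println(windows("10,20,30,40,50,60"));
    }
}
```

Because the next search always starts right after a comma, a window can never begin in the middle of a number.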
I have this code to find this pattern inside a text: 201409250200131738007947036000 - 1
final String patternStr = "(\\d{30} - \\d{1})";
final Pattern p = Pattern.compile(patternStr);
final Matcher m = p.matcher(page);
if (m.matches()) {
System.out.println("SUCCESS");
}
But for some strange reason it doesn't work in Java. Can somebody help me find the error, please?
The reason is that the matches method checks for the entire given string to match the regex.
For example, if your string is 123456123412345612341234561234 - 8, it will match; if it is "my number 123456123412345612341234561234 - 8 is inside other text", it won't.
Use the find method to accomplish your task:
if (m.find()) {
System.out.println("SUCCESS");
}
It will search inside the given string instead of attempting to match the entire string.
From the documentation for Matcher, matches:
Attempts to match the entire region against the pattern.
As opposed to find which:
Attempts to find the next subsequence of the input sequence that matches the pattern.
So use matches to match an entire String against a pattern, use find to locate a pattern inside a String.
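A small sketch of the difference, using the pattern from the question. The class name, the helper names and the surrounding text in the sample string are assumptions made for the example.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    static final Pattern NUMBER = Pattern.compile("\\d{30} - \\d");

    // true only if the WHOLE string is a 30-digit number, " - ", and one digit
    static boolean wholeStringMatches(String s) {
        return NUMBER.matcher(s).matches();
    }

    // true if the pattern occurs anywhere inside the string
    static boolean containsMatch(String s) {
        return NUMBER.matcher(s).find();
    }

    public static void main(String[] args) {
        String bare = "201409250200131738007947036000 - 1";
        String embedded = "some text " + bare + " more text";

        System.out.println(wholeStringMatches(bare));     // true
        System.out.println(wholeStringMatches(embedded)); // false
        System.out.println(containsMatch(bare));          // true
        System.out.println(containsMatch(embedded));      // true
    }
}
```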
Try:
final String patternStr = "\\d{30}+\\s-\\s\\d";
final Pattern p = Pattern.compile(patternStr);
final Matcher m = p.matcher(page);
while (m.find()) {
    System.out.printf("FOUND A MATCH: %s%n", m.group());
}
I edited your pattern slightly to make it more robust. This will print each match that it finds.
I've run into strange behavior of java.util.regex.Matcher.
Let's consider an example:
Pattern p = Pattern.compile("\\d*");
String s = "a1b";
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.start()+" "+m.end());
}
It produces output:
0 0
1 2
2 2
3 3
I can understand all lines except the last. The matcher produces an extra match (3,3) at the very end of the string.
But the javadoc for method start() says:
start() Returns the start index of the previous match.
The same case for dot-star pattern:
Pattern p = Pattern.compile(".*");
String s = "a1b";
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.start()+" "+m.end());
}
Output:
0 3
3 3
But if I specify line boundaries:
Pattern p = Pattern.compile("^.*$");
The output will be "right":
0 3
Can someone explain the reason for this behavior?
The pattern "\\d*" matches 0 or more digits. The same goes for ".*": it matches 0 or more occurrences of any character except a newline.
The last match that you get is the empty string at the end of your string, after "b". The empty string satisfies the pattern \\d*. If you change the pattern to \\d+, you'll get the expected result.
Similarly, the pattern .* matches everything from the first character to the last. Thus it first matches "a1b". After that, the cursor is after b: "a1b|". Now matcher.find() runs again and finds a zero-length string at the cursor, which satisfies the pattern .*, so it counts as a match.
The reason why it gives expected output with "^.*$" is that the last empty string doesn't satisfy the ^ anchor. It is not at the beginning of the string, so it fails to match.
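To see the effect side by side, here is a minimal sketch that lists the start/end pairs for each pattern discussed above. The class name and the spans helper are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Hypothetical helper: returns "start end" for every match of regex in input.
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start() + " " + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("\\d*", "a1b"));  // [0 0, 1 2, 2 2, 3 3]
        System.out.println(spans("\\d+", "a1b"));  // [1 2]
        System.out.println(spans(".*", "a1b"));    // [0 3, 3 3]
        System.out.println(spans("^.*$", "a1b"));  // [0 3]
    }
}
```

With a + quantifier the pattern can no longer match the empty string, so the zero-length matches disappear.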
I would like to parse a string and get the "stringIAmLookingFor" part of it, which is surrounded by "_" at the beginning and the end. I'm using a regex to match that and then removing the "_" from the found string. This works, but I'm wondering if there is a more elegant approach to this problem?
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w)*_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
String match = m.group();
match = match.replaceAll("_", "");
System.out.println(match);
}
Solution (partial)
Please also check the next section. Don't just read the solution here.
Just modify your code a bit:
String test = "xyz_stringIAmLookingFor_zxy";
// Make the capturing group capture the text in between (\w*)
// A capturing group is enclosed in (pattern), denoting the part of the
// pattern whose text you want to get separately from the main match.
// Note that there is also non-capturing group (?:pattern), whose text
// you don't need to capture.
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// The text is in the capturing group numbered 1
// The numbering is by counting the number of opening
// parentheses that makes up a capturing group, until
// the group that you are interested in.
String match = m.group(1);
System.out.println(match);
}
Matcher.group(), without any argument will return the text matched by the whole regex pattern. Matcher.group(int group) will return the text matched by capturing group with the specified group number.
If you are using Java 7 or later, you can make use of named capturing groups, which make the code slightly more readable. The string matched by the capturing group can be accessed with Matcher.group(String name).
String test = "xyz_stringIAmLookingFor_zxy";
// (?<name>pattern) is similar to (pattern), just that you attach
// a name to it
// specialText is not a really good name, please use a more meaningful
// name in your actual code
Pattern p = Pattern.compile("_(?<specialText>\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
// Access the text captured by the named capturing group
// using Matcher.group(String name)
String match = m.group("specialText");
System.out.println(match);
}
Problem in pattern
Note that \w also matches _. The pattern you have is ambiguous, and I don't know what your expected output is for the cases where there are more than 2 _ in the string. And do you want to allow underscore _ to be part of the output?
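A short sketch of the ambiguity, assuming a sample input with more than two underscores. The input "a_b_c_d", the class name and the firstGroup helper are made up for this example; the character class [^_] is one possible way to exclude underscores from the capture, not necessarily what the asker wants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreAmbiguity {
    // Hypothetical helper: text of capturing group 1 for the first match, or null if none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a_b_c_d";
        // \w also matches '_', so the greedy group swallows the middle underscore.
        System.out.println(firstGroup("_(\\w*)_", input));   // b_c
        // [^_] cannot match '_', so the group stops at the next underscore.
        System.out.println(firstGroup("_([^_]*)_", input));  // b
    }
}
```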
You can define the group you actually want, since you're already using parentheses. You just need to tweak your pattern a bit.
String test = "xyz_stringIAmLookingFor_zxy";
Pattern p = Pattern.compile("_(\\w*)_");
Matcher m = p.matcher(test);
while (m.find()) { // find next match
System.out.println(m.group(1));
}
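Use group(1) instead of group(), because group() returns the entire match and not just the text of the capturing group.
Reference : http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)
"xyz_stringIAmLookingFor_zxy".replaceAll("_(\\w*)_", "$1");
will replace the matched part (underscores included) with the text of the group in parentheses, giving xyzstringIAmLookingForzxy. The * has to be inside the parentheses; with (\\w)* the group holds only the last character.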
A simpler regex, no group needed:
"(?<=_)[^_]*"
If you want it stricter:
"(?<=_)[^_]+(?=_)"
Try:
String s = "xyz_stringIAmLookingFor_zxy".replaceAll(".*_(\\w*)_.*", "$1");
System.out.println(s);
output
stringIAmLookingFor
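To compare the one-liner suggestions above on the question's input, here is a rough sketch. The class name, the findAll and viaReplaceAll helpers and the "noUnderscores" input are assumptions added for the comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractBetweenUnderscores {
    // Hypothetical helper: every full match of regex in input.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // The replaceAll variant: the pattern spans the whole string and keeps only group 1.
    static String viaReplaceAll(String input) {
        return input.replaceAll(".*_(\\w*)_.*", "$1");
    }

    public static void main(String[] args) {
        String test = "xyz_stringIAmLookingFor_zxy";
        // Lookbehind only: the text after the last underscore matches too.
        System.out.println(findAll("(?<=_)[^_]*", test));       // [stringIAmLookingFor, zxy]
        // Lookbehind and lookahead: only text enclosed by underscores.
        System.out.println(findAll("(?<=_)[^_]+(?=_)", test));  // [stringIAmLookingFor]
        System.out.println(viaReplaceAll(test));                // stringIAmLookingFor
        // Without a match, replaceAll returns the input unchanged.
        System.out.println(viaReplaceAll("noUnderscores"));     // noUnderscores
    }
}
```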
For example, I have the following regexp: \d{2} (2 digits). When I use
Matcher matcher = Pattern.compile("\\d{2}").matcher("123");
matcher.find();
String result = matcher.group();
In the result variable I get only the first match, i.e. 12. But I want to get ALL possible matches, i.e. 12 and 23.
How can I achieve this?
You'll need the help of a capture group within a positive lookahead:
Matcher m = Pattern.compile("(?=(\\d{2}))").matcher("1234");
while (m.find()) System.out.println(m.group(1));
prints
12
23
34
That's not how regular expression matching works. The matcher starts at the beginning of the string, and each time it finds a match it continues looking from the character following the end of that match - it will not give you overlapping matches.
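A minimal sketch contrasting the default behavior with the lookahead trick on the same input. The class name and the findAll helper (including its group parameter) are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapComparison {
    // Hypothetical helper: collects the given group of every match of regex in input.
    static List<String> findAll(String regex, int group, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Plain pattern: each match consumes its characters, so matches cannot overlap.
        System.out.println(findAll("\\d{2}", 0, "1234"));        // [12, 34]
        // Lookahead: the match itself is empty, so the search advances one character at a time.
        System.out.println(findAll("(?=(\\d{2}))", 1, "1234"));  // [12, 23, 34]
    }
}
```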
If you want to find overlapping matches of an arbitrary regular expression without using lookaheads and capturing groups, you can do this by resetting the matcher's "region" after each match:
Matcher matcher = Pattern.compile(theRegex).matcher(str);
// prevent ^ and $ from matching the beginning/end of the region when this is
// smaller than the whole string
matcher.useAnchoringBounds(false);
// allow lookaheads/behinds to look outside the current region
matcher.useTransparentBounds(true);
while (matcher.find()) {
    System.out.println(matcher.group());
    if (matcher.start() < str.length()) {
        // start looking again from the character after the _start_ of the previous
        // match, instead of the character following the _end_ of the match
        matcher.region(matcher.start() + 1, str.length());
    }
}
Something like this:
^(?=[1-3]{2}$)(?!.*(.).*\1).*$