Reluctant Quantifiers in Pattern Matching - java

I have the following program:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PatternMatching {
    public static void main(String[] args) {
        String pattern = "a??";
        Pattern pattern1 = Pattern.compile(pattern);
        String findAgainst = "a";
        Matcher matcher = pattern1.matcher(findAgainst);
        int count = 0;
        while (matcher.find()) {
            count++;
            System.out.println(matcher.group(0) + ".start=" + matcher.start() + ".end=" + matcher.end());
        }
        System.out.println(count);
    }
}
which prints the following output:
.start=0.end=0
.start=1.end=1
2
instead of the output I expected:
.start=0.end=0
a.start=0.end=1
.start=1.end=1
3
When I run the program with the pattern "b??", the output is
.start=0.end=0
.start=1.end=1
2
which is correct. What is the reason for the unexpected output with "a??", even though it is a reluctant quantifier?

From what I see, the cause is how the Java regex engine handles a zero-length match: when the previous match was empty, the next find() starts one character past it, so the matcher cannot return the same empty match forever.
Because a?? is reluctant, it first tries to match zero characters, and that succeeds at index 0. Since that match is empty, the next search starts at index 1, which is after the a. The match of a at 0..1 is therefore skipped and never reported.
If you use a greedy version - a? - the output will be different:
a.start=0.end=1
.start=1.end=1
2
This happens because the first a is consumed, so the search index is already after a, and the engine can then match the empty string at the end of the input.
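To see both behaviours side by side, here is a minimal sketch that records every match together with its start and end index. The class name ZeroLengthMatchDemo and the helper matches are made up for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class ZeroLengthMatchDemo {
    // Records each match as 'text'[start,end) so that empty matches stay visible.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("'" + m.group() + "'[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Reluctant: prefers the empty match at 0, then the search resumes at 1.
        System.out.println("a??: " + matches("a??", "a")); // a??: [''[0,0), ''[1,1)]
        // Greedy: consumes the a first, then matches empty at the end.
        System.out.println("a? : " + matches("a?", "a"));  // a? : ['a'[0,1), ''[1,1)]
    }
}
```

In both cases there are exactly two matches; the reluctant version simply never reports the a, because each of its matches is empty.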

Related

Find ALL matches of a regex pattern in Java - even overlapping ones [duplicate]

I have a String of the form:
1,2,3,4,5,6,7,8,...
I am trying to find all substrings in this string that contain exactly 4 digits. For this I have the regex [0-9],[0-9],[0-9],[0-9]. Unfortunately when I try to match the regex against my String, I never obtain all the substrings, only a part of all the possible substrings. For instance, in the example above I would only get:
1,2,3,4
5,6,7,8
although I expect to get:
1,2,3,4
2,3,4,5
3,4,5,6
...
How would I go about finding all matches corresponding to my regex?
For info, I am using Pattern and Matcher to find the matches:
Pattern pattern = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
Matcher matcher = pattern.matcher(myString);
List<String> matches = new ArrayList<String>();
while (matcher.find())
{
    matches.add(matcher.group());
}
By default, successive calls to Matcher.find() start at the end of the previous match.
To search from a specific location, pass a start index to find(int): one character past the start of the previous match.
In your case, probably something like the following (call find() once first, because start() throws IllegalStateException until a match has been found):
while (matcher.find(matcher.start()+1))
This works fine:
Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
public void test(String[] args) throws Exception {
    String test = "0,1,2,3,4,5,6,7,8,9";
    Matcher m = p.matcher(test);
    if (m.find()) {
        do {
            System.out.println(m.group());
            // Restart the search one character past the start of the previous match.
        } while (m.find(m.start() + 1));
    }
}
printing
0,1,2,3
1,2,3,4
...
If you are looking for a pure regex based solution then you may use this lookahead based regex for overlapping matches:
(?=((?:[0-9],){3}[0-9]))
Note that your matches are available in captured group #1
Code:
final String regex = "(?=((?:[0-9],){3}[0-9]))";
final String string = "0,1,2,3,4,5,6,7,8,9";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    // The lookahead itself matches nothing; the text is in group 1.
    System.out.println(matcher.group(1));
}
output:
0,1,2,3
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
Here is some sample code without regex (a regex does not seem necessary to me here, and I would assume it to be slower in this case). It only works as written as long as each number is exactly one character long.
String s = "a,b,c,d,e,f,g,h";
// Each window is 4 one-character items plus 3 commas = 7 characters; a step of 2 skips one item and its comma.
for (int i = 0; i + 7 <= s.length(); i += 2) {
    System.out.println(s.substring(i, i + 7));
}
Output for this string:
a,b,c,d
b,c,d,e
c,d,e,f
d,e,f,g
e,f,g,h
As @OldCurmudgeon pointed out, find() by default starts looking from the end of the previous match. To position it right after the first matched element, make the first matched element a capturing group and use its end index:
Pattern pattern = Pattern.compile("(\\d,)\\d,\\d,\\d");
Matcher matcher = pattern.matcher("1,2,3,4,5,6,7,8,9");
List<String> matches = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
    // Resume right after the first element (and its comma) of this match.
    start = matcher.end(1);
    matches.add(matcher.group());
}
System.out.println(matches);
results in
[1,2,3,4, 2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9]
This approach would also work if your matching region is longer than one digit.

Regex - How to discard match

How can I get a regular expression to discard a part of the match?
public class main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(?<=b)([xyz])(?:a*?)c");
        String string = "abyaacbxaaac";
        Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
The output here is:
yaac
xaaac
I'd like it to output only y and x when I run System.out.println(matcher.group());
i.e. discarding what is matched by (?:a*?)c.
P.S.
I know I can use matcher.group(1) to get x and y on their own, but I'd like the entire match to be only x and y, without having to access specific groups.
You can use lookarounds in your regex so that the match contains only the part you need:
(?<=b)[xyz](?=a*c)
(?=a*c) is a positive lookahead to assert that we have 0 or more a followed by a c ahead. This is a zero width assertion so your match will still be one of [xyz] characters.
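As a quick check of the difference, the sketch below runs both patterns against the string from the question and collects the whole match (group 0) each time. LookaroundMatchDemo and wholeMatches are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class LookaroundMatchDemo {
    // Returns the entire match (group 0) of every occurrence.
    static List<String> wholeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "abyaacbxaaac";
        // Non-capturing group: the a's and the c are still consumed by the match.
        System.out.println(wholeMatches("(?<=b)([xyz])(?:a*?)c", input)); // [yaac, xaaac]
        // Lookahead: the a's and the c are only checked, not consumed.
        System.out.println(wholeMatches("(?<=b)[xyz](?=a*c)", input));    // [y, x]
    }
}
```

A non-capturing group only stops the text from being stored in a numbered group; it is still part of the overall match, which is why a lookahead is needed here.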

Matcher cannot recognize the second group of regular expression in java

I have a problem when using Matcher to find a symbol from a group in a regular expression: it does not recognize the second group. Maybe the code below makes it clear:
public void set(String n) {
    String pat = "(\\d+)[!##$%^&*()_+-=}]";
    Pattern r;
    r = Pattern.compile(pat);
    System.out.println(r);
    Matcher m;
    m = r.matcher(n);
    if (m.find()) {
        JOptionPane.showMessageDialog(null,
                "Not a correct form", "ERROR_NAME_MATCH", 0);
    } else {
        name = n;
    }
}
After running the code, the first group is recognized but the second one, [!##$%^&*()_+-=}], is not. I'm fairly sure that the expression is correct, since I've checked it with 'RegexBuddy'. There must be a problem with concatenating two or more groups in one line.
Thank you for your help.
Your regex - (\d+)[!##$%^&*()_+=}-] - matches a sequence of 1+ digits followed by a symbol from the specified set.
You want to test a string and return true if a single character from the specified set is present in the string.
So, just move \d into the character class, and certainly move the - to the end of this class, because +-= inside a character class is a range from + to = rather than three literal characters:
String pat = "[\\d!##$%^&*()_+=}-]";
^^^
If you need to match a digit or special char, use
String pat = "\\d|[!##$%^&*()_+=}-]";
If you need both irrespective of the order:
String pat = "^(?=\\D*\\d)(?=[^!##$%^&*()_+=}-]*[!##$%^&*()_+=}-])";
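To illustrate why the position of the hyphen matters, here is a small sketch using a cut-down character class that contains only +, - and =. The class name HyphenRangeDemo and the helper found are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class HyphenRangeDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "+-=" is a range from '+' (U+002B) to '=' (U+003D); digits and '.' fall inside it.
        System.out.println(found("[+-=]", "5")); // true
        System.out.println(found("[+-=]", ".")); // true
        // With the hyphen last, the class contains exactly '+', '=' and '-'.
        System.out.println(found("[+=-]", "5")); // false
        System.out.println(found("[+=-]", "-")); // true
    }
}
```

So with the hyphen in the middle the class silently accepts more characters than intended, which can hide the real problem when testing.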

Find total number of occurrences of a substring

Suppose I want to find the total number of occurrences of the following kind of substring:
any substring that starts with 1, followed by any number (0 or more) of 0's, and then followed by 1.
I formed a regular expression for it: 1[0]*1
Then I used the Pattern and Matcher classes of Java to do the rest of the work.
import java.util.regex.*;
class P_m
{
    public static void main(String[] args)
    {
        int s = 0;
        Pattern p = Pattern.compile("1[0]*1");
        Matcher matcher = p.matcher("1000010101");
        while (matcher.find())
            ++s;
        System.out.println(s);
    }
}
The problem is that when two consecutive substrings overlap, the above code outputs a count lower than the actual number of occurrences. For example, the output of the above code is 2, whereas it should be 3. Can I modify the above code to return the correct output?
Use a positive lookahead:
"10*(?=1)"
This matches the same pattern as you described (starts with 1, followed by zero or more 0, followed by 1), but the difference is that the final 1 is not included in the match. This way, that last 1 is not "consumed" by the match, and it can participate in further matches, effectively allowing the overlap that you asked for.
Pattern p = Pattern.compile("10*(?=1)");
Matcher matcher = p.matcher("1000010101");
int s = 0;
while (matcher.find()) ++s;
System.out.println(s);
Outputs 3 as you wanted.

Regex to replace a repeating string pattern

I need to replace a repeated pattern within a word with its basic repeating unit. For example,
I have the string "TATATATA" and I want to replace it with "TA". I would probably also only replace when there are more than 2 repetitions, to avoid changing normal words.
I am trying to do it in Java with the replaceAll method.
I think you want this (works for any length of the repeated string):
String result = source.replaceAll("(.+)\\1+", "$1");
Or alternatively, to prioritize shorter matches:
String result = source.replaceAll("(.+?)\\1+", "$1");
It first matches a group of characters, and then that same text one or more times again (using a back-reference within the pattern itself). I tried it and it seems to do the trick.
Example
String source = "HEY HEY duuuuuuude what'''s up? Trololololo yeye .0.0.0";
System.out.println(source.replaceAll("(.+?)\\1+", "$1"));
// HEY dude what's up? Trolo ye .0
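The difference between the two variants shows up on the string from the question. A minimal sketch, where RepeatCollapseDemo, collapseGreedy and collapseLazy are names chosen only for this illustration:

```java
// Hypothetical demo class, not part of the original answer.
class RepeatCollapseDemo {
    // Greedy unit: (.+) grabs the longest text that is still followed by a copy of itself.
    static String collapseGreedy(String s) {
        return s.replaceAll("(.+)\\1+", "$1");
    }

    // Reluctant unit: (.+?) settles for the shortest text that repeats.
    static String collapseLazy(String s) {
        return s.replaceAll("(.+?)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseGreedy("TATATATA")); // TATA
        System.out.println(collapseLazy("TATATATA"));   // TA
    }
}
```

With the greedy version the unit becomes "TATA" (half of the input), so the reluctant version is probably the one that gives "TA" as asked.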
You would be better off using a precompiled Pattern here rather than String.replaceAll(). For instance:
private static final Pattern PATTERN
    = Pattern.compile("\\b([A-Z]{2,}?)\\1+\\b");
//...
final Matcher m = PATTERN.matcher(input);
ret = m.replaceAll("$1");
edit: example:
public static void main(final String... args)
{
    System.out.println("TATATA GHRGHRGHRGHR"
        .replaceAll("\\b([A-Za-z]{2,}?)\\1+\\b", "$1"));
}
This prints:
TA GHR
Since you asked for a regex solution:
(\\w)(\\w)(\\1\\2){2,}
(\\w)(\\w) matches any pair of consecutive word characters ((.)(.) would match any pair of consecutive characters of any type) and stores them in capturing groups 1 and 2. (\\1\\2) matches that same pair again immediately afterward, and {2,} requires it to repeat two or more times, so the pair has to occur at least three times in a row in total ({2,10} would require between two and ten repeats).
String s = "hello TATATATA world";
Pattern p = Pattern.compile("(\\w)(\\w)(\\1\\2){2,}");
Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group());
//prints "TATATATA"
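To go from finding the run to actually replacing it with its unit, one option is to use the two captured characters as the replacement. This is only a sketch; RepeatedPairDemo and collapsePairs are hypothetical names, and the replacement string "$1$2" is my assumption about what the asker wants.

```java
// Hypothetical demo class, not part of the original answer.
class RepeatedPairDemo {
    // Replaces a two-character unit that occurs three or more times in a row
    // with a single copy of that unit ($1 and $2 are the two captured characters).
    static String collapsePairs(String s) {
        return s.replaceAll("(\\w)(\\w)(\\1\\2){2,}", "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("hello TATATATA world papa")); // hello TA world papa
        System.out.println(collapsePairs("TATA"));                      // TATA
    }
}
```

Words such as "papa" or "TATA", where the pair occurs only twice, are left alone, which matches the wish to avoid replacing normal words.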
