Regexp grouping and replaceAll with .* in Java duplicates the replacement

I have a problem using regular expressions in Java. The example code below prints ABC_012_suffix_suffix, but I expected it to print ABC_012_suffix.
Pattern regexp = Pattern.compile("(.*)");
Matcher matcher = regexp.matcher("ABC_012");
String result = matcher.replaceAll("$1_suffix");
System.out.println(result);
I understand that replaceAll replaces every match. The question is: why does the group (.*) match twice on my string ABC_012 in Java?

Pattern regexp = Pattern.compile(".*");
Matcher matcher = regexp.matcher("ABC_012");
matcher.matches();
System.out.println(matcher.group(0));
System.out.println(matcher.replaceAll("$0_suffix"));
Same happens here, the output is:
ABC_012
ABC_012_suffix_suffix
The reason is hidden in the replaceAll method: it tries to find all subsequences that match the pattern:
while (matcher.find()) {
    System.out.printf("Start: %s, End: %s%n", matcher.start(), matcher.end());
}
This will result in:
Start: 0, End: 7
Start: 7, End: 7
Perhaps surprisingly, the matcher finds two subsequences: "ABC_012" and then the empty string "" at the end of the input. replaceAll appends "_suffix" to both of them:
"ABC_012" + "_suffix" + "" + "_suffix"

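After .* has matched the whole input, the next search starts at the end of the string, where .* still matches the empty string (zero characters is still a match). Try (.+) or (^.*$) instead; both work as expected, because .+ needs at least one character and ^ only matches at the start of the input.
At regexinfo, the star is defined as follows: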
*(star) - Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.
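The suggested patterns can be compared with the original in a small, self-contained check. This is only a sketch: the class name SuffixDemo and the helper addSuffix are made up for illustration, and the only library call used is String.replaceAll.

```java
final class SuffixDemo {
    // Hypothetical helper: applies a regex with the same replacement as in the question.
    static String addSuffix(String regex, String input) {
        return input.replaceAll(regex, "$1_suffix");
    }

    public static void main(String[] args) {
        String input = "ABC_012";
        // (.*) matches the whole input and then the empty string at the end.
        System.out.println(addSuffix("(.*)", input));   // ABC_012_suffix_suffix
        // (.+) needs at least one character, so the empty match at the end is rejected.
        System.out.println(addSuffix("(.+)", input));   // ABC_012_suffix
        // ^ only matches at the start of the input, so there is no second match.
        System.out.println(addSuffix("(^.*$)", input)); // ABC_012_suffix
    }
}
```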

If you just want to append "_suffix" to your input, why not simply do this?
String result = "ABC_012" + "_suffix";

Related

How do I find the first unescaped quote in Java?

For example, I have the following String:
String str = "te\\\"st\"";
and I must find the index of the first unescaped " (one that is not preceded by \).
In the example the right index is 9.
Is there a regex or any other way to solve this problem?
I have the following code
Pattern pattern = Pattern.compile(HERE A REGEX);
Matcher matcher = pattern.matcher(json);
if(matcher.find()) {
System.out.println(matcher.start());
}
but I don't know what kind of regexp to use.
You may try this regex to find the first unescaped quote:
^[^"\\]*?(?:[\\]{2})*(")
It is intended to capture the first unescaped quote in group 1 (\1 or $1).
To get the index of that quote, ask the matcher for the start of group 1:
matcher.start(1)
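If a regex turns out to be awkward here, a plain scan is an alternative. The sketch below is not the regex from the answer above; as far as I can tell that pattern cannot get past an escaped character that appears before the quote. The scan assumes that a backslash always escapes the single character that follows it, and the names QuoteFinder and firstUnescapedQuote are hypothetical. Indices refer to the runtime value of the string, so they can differ from positions counted in the source literal.

```java
final class QuoteFinder {
    // Hypothetical helper: index of the first quote not escaped by a backslash, or -1.
    static int firstUnescapedQuote(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;          // skip the character escaped by this backslash
            } else if (c == '"') {
                return i;     // a quote that was not skipped is unescaped
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String str = "te\\\"st\"";  // runtime value: te\"st"  (7 characters)
        System.out.println(firstUnescapedQuote(str)); // 6
    }
}
```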

Java regex to match after start of previous match [duplicate]

How can I extract overlapping matches from an input using String.split()?
For example, if trying to find matches to "aba":
String input = "abababa";
String[] parts = input.split(???);
Expected output:
[aba, aba, aba]
String#split will not give you overlapping matches, because each part of the string ends up in exactly one element of the resulting array, never in two.
You should use the Pattern and Matcher classes here.
You can use this regex:
Pattern pattern = Pattern.compile("(?=(aba))");
Then use Matcher#find to iterate over all the overlapping matches, and print group(1) for each one.
The regex above matches every empty string that is followed by aba, and the capturing group inside the lookahead records the aba itself. Since a lookahead is a zero-width assertion, it does not consume the text it matches, so you get all the overlapping matches.
String input = "abababa";
String patternToFind = "aba";
Pattern pattern = Pattern.compile("(?=" + patternToFind + ")");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println(patternToFind + " found at index: " + matcher.start());
}
Output:
aba found at index: 0
aba found at index: 2
aba found at index: 4
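To get the list asked for in the question, the lookahead can be combined with the capturing group and the results collected. This is a minimal sketch under a few assumptions: the search text is treated as a literal (hence Pattern.quote), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlapDemo {
    // Hypothetical helper: returns every (possibly overlapping) occurrence of a literal string.
    static List<String> overlappingMatches(String input, String literal) {
        // The capturing group sits inside the lookahead, so nothing is consumed.
        Pattern pattern = Pattern.compile("(?=(" + Pattern.quote(literal) + "))");
        Matcher matcher = pattern.matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(overlappingMatches("abababa", "aba")); // [aba, aba, aba]
    }
}
```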
I would use indexOf (here text is the input and find is the string to look for):
for (int i = text.indexOf(find); i >= 0; i = text.indexOf(find, i + 1))
    System.out.println(find + " found at " + i);
This is not a correct use of split(). From the javadocs:
Splits this string around matches of the given regular expression.
Seems to me that you are not trying to split the string but to find all matches of your regular expression in the string. For this you would have to use a Matcher, and some extra code that loops on the Matcher to find all matches and then creates the array.

Remove occurrences of a given character sequence at the beginning of a string using Java Regex

I have a string that begins with one or more occurrences of the sequence "Re:". The "Re:" can appear in any combination, e.g. Re<any number of spaces>:, re:, re<any number of spaces>:, RE:, RE<any number of spaces>:, etc.
Sample string: Re: Re : Re : re : RE: This is a Re: sample string.
I want to define a Java regular expression that will identify and strip off all occurrences of Re:, but only the ones at the beginning of the string and not the ones occurring within the string.
So the output should look like This is a Re: sample string.
Here is what I have tried:
String REGEX = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)";
String INPUT = title;
String REPLACE = "";
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(INPUT);
while (m.find()) {
    m.appendReplacement(sb, REPLACE);
}
m.appendTail(sb);
I am using \p{Z} to match whitespace (I found this somewhere on this forum, as Java regex apparently does not recognize \s).
The problem I am facing with this code is that the search stops after the first match and exits the while loop.
Try something like this replace statement:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Explanation of the regex:
(?i) make it case insensitive
^ anchor to start of string
( start a group (this is the "re:")
\\s* any amount of optional whitespace
re "re"
\\s* optional whitespace
: ":"
\\s* optional whitespace
) end the group (the "re:" string)
+ one or more times
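The breakdown above can be tried on the sample string from the question. The following is just a sketch; ReplyPrefixDemo and stripReplyPrefixes are hypothetical names, and the regex is the one from this answer.

```java
final class ReplyPrefixDemo {
    // Hypothetical helper: removes every leading "Re:" variant, leaving later ones alone.
    static String stripReplyPrefixes(String title) {
        return title.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
    }

    public static void main(String[] args) {
        String title = "Re: Re : Re : re : RE: This is a Re: sample string.";
        System.out.println(stripReplyPrefixes(title)); // This is a Re: sample string.
    }
}
```

Because the whole run of prefixes is consumed by the + in a single match anchored with ^, no loop is needed.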
Your regex:
String regex = "^(Re*\\p{Z}*:?|re*\\p{Z}*:?|\\p{Z}Re*\\p{Z}*:?)"
does not do what you intend. Because e*, \p{Z}* and :? are all optional, it matches strings like:
\p{Z}Reee\p{Z}: or
R\p{Z}
which makes no sense for what you are trying to do. It is also anchored with ^ and has no repetition, so it can match only once, at the start of the string.
You'd better use a regex like the following:
yourString = yourString.replaceAll("(?i)^(\\s*re\\s*:\\s*)+", "");
Or, to make @Doorknob happy, here's another way to achieve this, using a Matcher:
Pattern p = Pattern.compile("(?i)^(\\s*re\\s*:\\s*)+");
Matcher m = p.matcher(yourString);
if (m.find())
    yourString = m.replaceAll("");
(which, as the documentation says, is exactly the same thing as yourString.replaceAll())
(I had the same regex as @Doorknob, but thanks to @jlordo for the replaceAll and @Doorknob for thinking about the (?i) case-insensitivity part ;-) )

Regular expression for a string starting with some string

I have strings of this form: (notice)Any_other_string (note that the parentheses are part of the string).
I want to split such a string into 2 parts: (notice) and the rest. I do it as follows:
private static final Pattern p1 = Pattern.compile("(^\\(notice\\))([a-z_A-Z1-9])+");
String content = "(notice)Stack Over_Flow 123";
Matcher m = p1.matcher(content);
System.out.println("Printing");
if (m.find()) {
    System.out.println(m.group(0));
    System.out.println(m.group(1));
}
I expected the result to be (notice) and Stack Over_Flow 123, but instead the result is (notice)Stack and (notice).
I cannot explain this result. Which regex is suitable for my purpose?
Issue 1: group(0) will always return the entire match - this is specified in the javadoc - and the actual capturing groups start from index 1. Simply replace it with the following:
System.out.println(m.group(1));
System.out.println(m.group(2));
Issue 2: Your character class [a-z_A-Z1-9] does not take spaces into account, and its digit range 1-9 leaves out 0. I suggest using the dot, ., to match unknown characters, or adding \\s (whitespace) and 0 to your character class. Either of the following regexes should work:
(^\\(notice\\))(.+)
(^\\(notice\\))([A-Za-z0-9_\\s]+)
Note that you need the + inside the capturing group, or it will only find the last character of the second part.
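Putting both fixes together, a small check along these lines shows the two groups. It is a sketch only: NoticeDemo and splitNotice are hypothetical names, and the first of the two suggested regexes is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NoticeDemo {
    private static final Pattern NOTICE = Pattern.compile("(^\\(notice\\))(.+)");

    // Hypothetical helper: returns {prefix, rest}, or an empty array if there is no match.
    static String[] splitNotice(String content) {
        Matcher m = NOTICE.matcher(content);
        if (!m.find()) {
            return new String[0];
        }
        // group(0) would be the entire match; the capturing groups start at index 1.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = splitNotice("(notice)Stack Over_Flow 123");
        System.out.println(parts[0]); // (notice)
        System.out.println(parts[1]); // Stack Over_Flow 123
    }
}
```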

Java Matcher. Return several entries from one sequence

For example, I have the following regexp: \d{2} (2 digits). When I use
Matcher matcher = Pattern.compile("\\d{2}").matcher("123");
matcher.find();
String result = matcher.group();
the result variable holds only the first match, i.e. 12. But I want to get ALL possible matches, i.e. 12 and 23.
How can I achieve this?
You'll need the help of a capture group within a positive lookahead:
Matcher m = Pattern.compile("(?=(\\d{2}))").matcher("1234");
while (m.find()) System.out.println(m.group(1));
prints
12
23
34
That's not how regular expression matching works. The matcher starts at the beginning of the string, and each time it finds a match it continues looking from the character following the end of that match - it will not give you overlapping matches.
If you want to find overlapping matches of an arbitrary regular expression without needing lookaheads and capturing groups, you can do so by resetting the matcher's "region" after each match:
Matcher matcher = Pattern.compile(theRegex).matcher(str);
// prevent ^ and $ from matching the beginning/end of the region when this is
// smaller than the whole string
matcher.useAnchoringBounds(false);
// allow lookaheads/behinds to look outside the current region
matcher.useTransparentBounds(true);
while (matcher.find()) {
    System.out.println(matcher.group());
    if (matcher.start() < str.length()) {
        // start looking again from the character after the _start_ of the previous
        // match, instead of the character following the _end_ of the match
        matcher.region(matcher.start() + 1, str.length());
    }
}
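For the input from the question, the region-based loop can be wrapped in a helper that collects the matches. This is a sketch of the same approach, not a different technique; the class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionOverlapDemo {
    // Hypothetical helper: collects overlapping matches by moving the region start forward.
    static List<String> overlappingMatches(String regex, String str) {
        Matcher matcher = Pattern.compile(regex).matcher(str);
        matcher.useAnchoringBounds(false);  // ^ and $ refer to the whole input, not the region
        matcher.useTransparentBounds(true); // lookarounds may see outside the region
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group());
            if (matcher.start() < str.length()) {
                // Restart one character after the start of the match just found.
                matcher.region(matcher.start() + 1, str.length());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(overlappingMatches("\\d{2}", "123")); // [12, 23]
    }
}
```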
Something like this:
^(?=[1-3]{2}$)(?!.*(.).*\1).*$
