Repeat pattern in RegEx - java

I have string parts that match the following pattern:
abcd|(|a|ab|abc)e(fghi|(|f|fg|fgh)jklmn)
The problem is that my whole string is a repeated combination of parts like the above, and the whole string must contain more than 14 such parts.
Can anyone help me improve my regex to handle this?
Thanks
Update
Input examples:
Matched string parts: abcd, abefgjkln, efjkln, ejkln
But the whole string is: abcdabefgjklnefjklnejkln (a combination of the 4 parts above)
There must be more than 15 parts in the whole string. The one above has only 4 parts, so it is invalid.

This will try to match your "parts" at least 15 times in a string.
boolean foundMatch = false;
try {
foundMatch = subjectString.matches("(?:(?:ab(?:cd|efgjkln))|(?:(?:ef?jkln))){15,}");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
If there are at least 15 repetitions of any of the above parts foundMatch will be true, else it will remain false.
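As a rough sketch of how the repetition count behaves, the snippet below reuses the same four example parts and checks strings with different numbers of parts. The class name, the helper names and the `repeat` helper are made up for illustration; only the regex fragment comes from the answer above.

```java
class RepeatedPartsDemo {
    // One "part": abcd, abefgjkln, efjkln or ejkln (the asker's four examples only).
    private static final String PART = "(?:ab(?:cd|efgjkln)|ef?jkln)";

    // True if the whole string consists of at least `min` parts.
    static boolean hasAtLeast(String s, int min) {
        return s.matches(PART + "{" + min + ",}");
    }

    // Hypothetical helper: concatenate `s` with itself `n` times.
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String fourParts = "abcdabefgjklnefjklnejkln";
        System.out.println(hasAtLeast(fourParts, 15));          // false: only 4 parts
        System.out.println(hasAtLeast(repeat("abcd", 15), 15)); // true: 15 parts
    }
}
```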
Breakdown :
"(?:" + // Match the regular expression below
// Match either the regular expression below (attempting the next alternative only if this one fails)
"(?:" + // Match the regular expression below
"ab" + // Match the characters “ab” literally
"(?:" + // Match the regular expression below
// Match either the regular expression below (attempting the next alternative only if this one fails)
"cd" + // Match the characters “cd” literally
"|" + // Or match regular expression number 2 below (the entire group fails if this one fails to match)
"efgjkln" + // Match the characters “efgjkln” literally
")" +
")" +
"|" + // Or match regular expression number 2 below (the entire group fails if this one fails to match)
"(?:" + // Match the regular expression below
"(?:" + // Match the regular expression below
"e" + // Match the character “e” literally
"f" + // Match the character “f” literally
"?" + // Between zero and one times, as many times as possible, giving back as needed (greedy)
"jkln" + // Match the characters “jkln” literally
")" +
")" +
"){15,}" // Between 15 and unlimited times, as many times as possible, giving back as needed (greedy)

What about this:
(?:a(?:b(?:c(?:d)?)?)?ef(?:g(?:h(?:i)?)?)?jklmn){15,}
Explanation: you create a non-capturing group (with (?: ... )), and say that this should be repeated >=15 times, hence the curly braces in the end.

First, it seems that your pattern can be simplified. The alternative a is a prefix of ab, which in turn is a prefix of abc, so wherever abc can match, a can match too. Think about this and change your pattern accordingly; right now it is probably not what you really want.
Second, to repeat part of a pattern use {N}. Note that a quantifier applies only to the element directly before it, so abc{5} means "ab followed by c five times"; to repeat the whole sequence, group it: (?:abc){5} means "abc repeated five times". You can also use {3,} and {3,5}, which mean repeat>=3 and 3<=repeat<=5. Java does not accept {,5}; write {0,5} for repeat<=5.
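A small sketch of these quantifier rules in Java follows. The class and the `compiles` helper are illustrative names, not part of any library; the helper just reports whether `Pattern.compile` accepts a pattern.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class QuantifierDemo {
    // Hypothetical helper: true if Java accepts the pattern syntax.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The quantifier binds to the single element before it: here only 'c'.
        System.out.println("abccccc".matches("abc{5}"));
        // Grouping makes the quantifier apply to the whole sequence.
        System.out.println("abcabcabcabcabc".matches("(?:abc){5}"));
        // An open lower bound must be written as 0 in Java.
        System.out.println(compiles("a{0,5}"));
        System.out.println(compiles("a{,5}"));
    }
}
```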

Related

Java Regex to match comma delimited list of interfaces

I'm trying to write a Java regex to match a comma delimited list of interfaces. Something like:
Runnable, Serializable, List, Map
There can be zero or more entries in the list. A trailing comma is invalid. Space is optional. I came up with the following, which gets me to one or more entries, and then check for empty:
String validName = "[a-zA-Z_][a-zA-Z0-9_]*";
String regex = validName + "\\s*(,\\s*" + validName + ")*";
if (s.matches(regex) || s.trim().isEmpty())
...
But is there a way to include the "zero entries" condition into the regex?
To make a pattern optional, use a group with a ? quantifier set to it:
String regex = "(?:" + validName + "\\s*(?:,\\s*" + validName + ")*)?";
// The outer (?: ... )? wrapper makes the whole list optional
if (s.matches(regex)) {
// ....
}
The ? greedy quantifier matches one or zero occurrences of the pattern it is applied to. (greedy means it prefers to get 1 occurrence rather than 0)
The (?: character sequence opens a non-capturing group. I.E. this is how you use "normal parentheses" for logically grouping sections of regular expressions.
You may add \\s* subpatterns at the start/end of the pattern to allow leading/trailing whitespace.
Try prefixing your regex with "^$|", i.e. "^$|" + regex. ^$ means "nothing between the beginning of the input and the end of the input", so ^$| means "either match nothing or match whatever matches the rest of the regex".
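Putting the optional-group idea together, a minimal sketch could look like this. The class and method names are invented for the example, and allowing leading and trailing whitespace is an assumption based on the note about adding \\s* at both ends.

```java
import java.util.regex.Pattern;

class InterfaceListCheck {
    private static final String NAME = "[a-zA-Z_][a-zA-Z0-9_]*";

    // Optional list: NAME (, NAME)* with optional whitespace around names and commas.
    private static final Pattern LIST = Pattern.compile(
            "\\s*(?:" + NAME + "\\s*(?:,\\s*" + NAME + "\\s*)*)?");

    // True for an empty or blank string, or a well-formed comma-separated list.
    static boolean isValid(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Runnable, Serializable, List, Map"));
        System.out.println(isValid(""));
        System.out.println(isValid("Runnable,"));
    }
}
```

With this shape the separate `s.trim().isEmpty()` check from the question is no longer needed, because zero entries are accepted by the pattern itself.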

Java regular expression boundary match?

I found the following question in one Java test suite
Pattern p = Pattern.compile("[wow]*");
Matcher m = p.matcher("wow its cool");
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
where the output seems to be as follows
0 "wow" 3 "" 4 "" 5 "" 6 "" 7 "" 8 "" 9 "oo" 11 "" 12 ""
Up to the last match it is clear: the pattern [wow]* greedily matches 0 or more 'w' and 'o' characters, while for non-matching characters, including spaces, it results in empty strings. However, after the empty match before the last 'l' (11 ""), the following 12 "" is unclear. The test solution gives no details on this, nor was I able to figure it out definitively from the Javadoc. My best guess is a boundary character, but I would appreciate it if someone could provide an explanation.
The reason that you see this behavior is that your pattern allows empty matches. In other words, if you pass it an empty string, you would see a single match at position zero:
Pattern p = Pattern.compile("[wow]*"); // One of the two 'w's is redundant, but the engine is OK with it
Matcher m = p.matcher(""); // Passing an empty string results in a valid match that is empty
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
This would print 0 "" because an empty string is as good a match as any other match for the expression.
Going back to your example: after each match the engine resumes searching where that match ended, and after an empty match it advances by a single character so that it does not find the same empty match again. "Advancing by one" means that the engine considers the "tail" of the string at the next position. This also happens after the empty match at position 11, i.e. just before the very last character: the engine moves on to position 12, where the "tail" consists of an empty string. This is similar to calling "wow its cool".substring(12): you would get an empty string in that case as well.
The engine considers an empty string a valid input and tries to match it against your expression, as shown in the example above. This produces a match, which your program properly reports.
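To see the empty matches side by side, here is a small sketch that collects every match as "start:text". The class and the `findAll` helper are names made up for this illustration; switching the quantifier from * to + is shown only for contrast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Hypothetical helper: every match rendered as "startIndex:matchedText".
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // '*' allows empty matches, including one at index 12 (end of input).
        System.out.println(findAll("[wow]*", "wow its cool"));
        // '+' requires at least one character, so only "wow" and "oo" remain.
        System.out.println(findAll("[wow]+", "wow its cool"));
    }
}
```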
[wow]* first matches the string wow. Count = 1.
Because of the * (zero or more) after the character class, [wow]* also matches the empty string that exists before any character the class does not match. So it matches the empty string just before the first space. Count = 2.
None of the characters in its is matched by the regex, so it matches the empty string before each of them. Count = 2+3 = 5.
The second space is not matched either, so we get another empty match. Count = 5+1 = 6.
c is not matched, so the regex matches the empty string just before c. Count = 6+1 = 7.
oo is matched by [wow]*, and this is counted as one match. Count = 7+1 = 8.
l is not matched, so there is an empty match just before it. Count = 9.
Finally, it matches the empty string that exists after the last character. Count = 9+1 = 10.
And m.start() prints the starting index of the corresponding match.
The regex is simply matching the pattern against the input, starting at a given offset.
For the last match, the offset of 12 is at the point after the last character of 'cool' - you might think this is the end of the string and therefore cannot be used for matching purposes - but you'd be wrong. For pattern-matching, this is a perfectly valid starting point.
As you state, your regex includes the possibility of zero characters, and indeed this is what is matched after the last character but before the end-of-string marker (usually represented by $ in a regex).
To put it another way, without testing past the end of the last character, it would mean no matches would ever occur relating to the end of the string - but there are many regex constructs that match the end of the string (and you've shown one of them here).

Java regexp grouping and + operator (Obtaining multiple values of a group)

I was wondering whether it is possible to obtain all the matches of a group with a + operator in a Java regular expression.
Example code:
public static void main(String[] args) {
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*(([\\w\\s]+),?\\s*)+.");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Regular expression Match: "+ m.group(0));
System.out.println("Group 1: "+ m.group(1));
System.out.println("Group 2: "+ m.group(2));
}
}
OUTPUT:
Regular expression Match: Start: First match, second match, third match.
Group 1: third match
Group 2: third match
Although group 2 matched 3 times ("First match", "second match", "third match") because of the second + operator in the regex, we can access only the last one via m.group(2).
My question is:
Is there a way to access the other hits of group 2 in that expression, or when a + operator causes multiple matches of a group can only the last one be accessed?
Thanks.
As mentioned in other answers, you can't match n groups using + like this.
However, if you are looking to solve this problem in Java then using a Scanner to break on the delimiters may help:
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*|\\s*,\\s*"); // also swallow the whitespace around each delimiter
Scanner s = new Scanner(input).useDelimiter(p);
while (s.hasNext()) {
System.out.println("Matched: " + s.next());
}
This prints out:
Matched: First match
Matched: second match
Matched: third match.
You asked:
Is there a way to access the other hits of group 2 in that expression, or when a + operator causes multiple matches of a group can only the last one be accessed?
The answer is no: if the same group matches text multiple times, you can only access the last matched text.
There are of course other ways to return multiple matches.
I think this may not be possible with your regular expression.
As per the docs:
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.
Like most other regex flavors, Java doesn't save the intermediate captures of a repeated group. But that feature isn't really as useful as you might think. For example, the .NET flavor provides the CaptureCollection class for that purpose, but you still have to write the code to loop through it. Not that big a deal, but it's usually easier to use multiple matches, as the other responders suggested. Try it with this regex:
"(?:Start:|\\G,)\\s*([\\w\\s]+)"
\G is a kind of anchor that causes the regex to reject any match that doesn't start exactly where the last match ended. If there was no previous match (i.e., this is the first match attempt), it acts like \A and matches only at the very beginning of the string. That's partly why I placed the , in that part of the regex; I think it's safe to assume the string doesn't start with a comma.
Note that the first group is non-capturing; the part you're looking for will always be in group(1).
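A minimal sketch of the multiple-matches approach with \G, using the input from the question, might look like this. The class and method names are placeholders for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Each find() yields one item; \G ties every later match to the end of the previous one.
    private static final Pattern ITEM =
            Pattern.compile("(?:Start:|\\G,)\\s*([\\w\\s]+)");

    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // group(1) is the only capturing group
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(items("Start: First match, second match, third match."));
        // prints [First match, second match, third match]
    }
}
```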

Regex Lookahead and Lookbehinds: followed by this or that

I'm trying to write a regular expression that looks ahead to make sure there is either a whitespace character OR an opening parenthesis after the words I'm searching for.
Also, I want it to look back and make sure the word is preceded by either a non-word character (\W) or nothing at all (i.e. it is at the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.
As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])";
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
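As an illustration of the negative-lookbehind variant, the sketch below returns the start index of each hit, so it is easy to see that only the word itself is matched. The class name, method name and sample text are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordLookaroundDemo {
    // Start indexes of `word` when not preceded by a word character
    // and followed by whitespace or an opening parenthesis.
    static List<Integer> findWord(String word, String text) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start()); // m.group() is exactly the word, nothing around it
        }
        return starts;
    }

    public static void main(String[] args) {
        // "sprint" is skipped because 'p' is preceded by the word character 's'.
        System.out.println(findWord("print", "print(x); sprint (y); print z"));
        // prints [0, 22]
    }
}
```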
If you don't want a group to be captured by the matching, you can use the special construct (?:X)
So, in your case:
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will only have two groups then: group(0) for the whole match (which still includes the surrounding characters) and group(1) for the word you are looking for.

Java regex replaceAll not working

My regex is not working as expected.
Code example:
String widgetCSS = "#widgetpuffimg{width:100%;position:relative;display:block;background:url(/images/small-banner/Dog-Activity-BXP135285s.jpg) no-repeat 50% 0; height:220px;} "
+ "someothertext #widgetpuffimg{width:100%;position:relative;display:block;}";
String newWidgetCSS = widgetCSS.replaceAll("#widgetpuffimg\\{(.*?)\\}","");
I want all occurrences in the string that match the pattern "#widgetpuffimg{anycharacters}" to be replaced by nothing.
This should result in newWidgetCSS = someothertext
Update: After edit of question
I think the regex is working properly according to your requirements if you are escaping your { as mentioned below. The exact output I am getting is " someothertext ".
It has to be newWidgetCSS = widgetCSS.replaceAll("#widgetpuffimg\\{(.*?)\\}","");
You need to use \\{ instead of \{ for escaping { properly.
This should work :
String resultString = subjectString.replaceAll("(?s)\\s*#widgetpuffimg\\{.*?\\}\\s*", "");
Explanation :
"(?s)" + // Turn on "dot matches line breaks" mode for the rest of the pattern
"\\s" + // Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"#widgetpuffimg" + // Match the characters “#widgetpuffimg” literally
"\\{" + // Match the character “{” literally
"." + // Match any single character (including line breaks, because of (?s))
"*?" + // Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
"\\}" + // Match the character “}” literally
"\\s" + // Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
"*" // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
As an added bonus it trims the whitespace.
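A quick way to check the behavior is a small sketch like the one below. The class name, method name and shortened CSS input are illustrative assumptions; the rule body deliberately spans two lines to show what (?s) adds.

```java
class CssBlockStripper {
    // Removes every "#widgetpuffimg{...}" block together with surrounding whitespace.
    static String strip(String css) {
        return css.replaceAll("(?s)\\s*#widgetpuffimg\\{.*?\\}\\s*", "");
    }

    public static void main(String[] args) {
        String css = "#widgetpuffimg{width:100%;\nheight:220px;}\n"
                + "someothertext #widgetpuffimg{display:block;}";
        // (?s) lets .*? cross the line break inside the first block.
        System.out.println("[" + strip(css) + "]"); // prints [someothertext]
    }
}
```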
