How to make capturing group optional? - java

Input
example("This is tes't")
example('This is the tes\"t')
Output should be
This is tes't
This is the tes"t
Code
String text = "example(\"This is tes't\")";
//String text = "$.i18nMessage('This is the tes\"t\')";
final String quoteRegex = "example.*?(\".*?\")?('.*?')?";
Matcher matcher0 = Pattern.compile(quoteRegex).matcher(text);
while (matcher0.find()) {
System.out.println(matcher0.group(1));
System.out.println(matcher0.group(2));
}
I see output as
null
null
Though when I use the regex example.*?(\".*?\") it returns This is tes't, and when I use example.*?('.*?') it returns
This is the tes"t, but when I combine both as example.*?(\".*?\")?('.*?')? it returns null. Why?

The .*?(\".*?\")?('.*?')? subpattern sequence at the end of your regex can match an empty string (the lazy .*? matches zero or more characters, and both groups are made optional with ?). After matching example, the .*? is skipped at first, and is only expanded when the subsequent subpatterns fail to match. However, both optional groups succeed by matching an empty string right before (, so the match ends there and you only have example in matcher0.group(0).
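The behaviour described above can be reproduced with a small sketch. The class name OptionalGroupDemo and the helper describe are hypothetical names used only for illustration; the regex and the input are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalGroupDemo {
    // Hypothetical helper: returns "<whole match>|<group 1>|<group 2>"
    // for the first match, or "no match" if the regex does not match.
    static String describe(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return "no match";
        }
        return m.group(0) + "|" + m.group(1) + "|" + m.group(2);
    }

    public static void main(String[] args) {
        String text = "example(\"This is tes't\")";
        // Both groups are optional, so the match stops right after "example"
        // and neither group participates in it.
        System.out.println(describe("example.*?(\".*?\")?('.*?')?", text));
        // prints example|null|null
    }
}
```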
Use either an alternation that makes group 1 obligatory:
Pattern.compile("example.*?(\".*?\"|'.*?')")
Or a variant with a tempered greedy token that lets you get rid of the alternation:
Pattern.compile("example.*?(([\"'])(?:(?!\\2).)*\\2)")
Or, better, support escape sequences:
Pattern.compile("example.*?(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')")
In all 3 examples, you only need to access Group 1. If only ( can appear between example and " or ', you should replace .*? with \( (written \\( in a Java string literal), since it makes matching safer. That said, matching string literals with a regex (at least with a single regex) is never entirely safe.
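As a rough sketch of the first (alternation) variant in use, the following extracts the quoted argument and strips the surrounding quotes. The class name QuotedArgDemo and the method extract are hypothetical; stripping the quotes with substring is an assumption about what the asker wants, based on the expected output in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedArgDemo {
    // Group 1 is obligatory: it holds either a "..." or a '...' literal.
    private static final Pattern QUOTED =
            Pattern.compile("example.*?(\".*?\"|'.*?')");

    // Hypothetical helper: returns the text between the quotes,
    // or null when the input contains no quoted argument.
    static String extract(String text) {
        Matcher m = QUOTED.matcher(text);
        if (!m.find()) {
            return null;
        }
        String literal = m.group(1);
        // Drop the opening and closing quote character.
        return literal.substring(1, literal.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(extract("example(\"This is tes't\")"));
        System.out.println(extract("example('This is the tes\"t')"));
    }
}
```

Note that this simple variant does not handle escaped quotes inside the literal; the third pattern above is the one meant for that case.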

Related

Regular expression match fails if only whitespace after the - character

I am working on a regular expression where the pattern is:
1.0.0[ - optional description]/1.0.0.0[ - optional description].txt
The [ - optional description] part is of course, optional. So some possible VALID values are
1.0.0/1.0.0.0.txt
1.0.0/1.0.0.0 - xyz.txt
1.0.0 - abc/1.0.0.0 - xyz.txt
1.0.0 - abc/1.0.0.0.txt
To be a little more robust in the pattern matching, I'd like to match zero or more spaces before and after the "-" character. So all these would be valid too.
1.0.0 - abc/1.0.0.0 - xyz.txt
1.0.0-abc/1.0.0.0-xyz.txt
1.0.0 -abc/1.0.0.0- xyz.txt
To do this matching, I have the following regular expression (Java code):
String part1 = "((\\d+.{1}\\d+.{1}\\d+)(\\s*-\\s*(.+))?)";
String part2 = "((\\d+.{1}\\d+.{1}\\d+.{1}\\d+)(\\s*-\\s*(.+))?\\.sql)";
pattern = Pattern.compile(part1+ "/" + part2);
So far this regular expression is working well. But while unit testing I found a case I can't quite figure out yet: the string contains a "-" character surrounded by 1 or more spaces, but there is no description after the "-" character. This would look like:
1.0.0 - /1.0.0.0.txt
1.0.0- /1.0.0.0-xyz.txt
In these cases, I want the pattern match to FAIL. But with my current regular expression the match succeeds. I think what I want is if there is a "-" character surrounded by any number of spaces like " - " then there must also be at least 1 non-space character following it. But I can't quite figure out the regex for this.
Thanks!
Something like,
^\d+\.\d+\.\d+(?:\s*-\s*\w+)?\/\d+\.\d+\.\d+\.\d+(?:\s*-\s*\w+)?\.txt$
Or you can combine the \.\d+ repetitions as
^\d+(?:\.\d+){2}(?:\s*-\s*\w+)?\/\d+(?:\.\d+){3}(?:\s*-\s*\w+)?\.txt$
Changes
.{1} When you want to match something exactly once, there is no need for {1}. It's implicit.
(?:\s*-\s*\w+) Matches zero or more whitespace characters (\s*) followed by -, zero or more whitespace characters again, and then \w+, a description of at least one word character.
The ? at the end of this pattern makes it optional.
The same pattern is repeated at the end to match the second part.
^ Anchors the regex at the start of the string.
$ Anchors the regex at the end of the string. These two are necessary so that nothing else may appear in the string.
Don't group patterns with () unless you need to capture them, since capturing wastes memory. Use (?:..) if you want to group patterns without capturing them.
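The pattern and the changes listed above can be checked against the file names from the question with a short sketch. The class name VersionPathValidator and the method isValid are hypothetical; the pattern is the combined form given above, with the dot before txt escaped.

```java
import java.util.regex.Pattern;

public class VersionPathValidator {
    // Three-part version, optional " - description", slash,
    // four-part version, optional " - description", then ".txt".
    private static final Pattern PATH = Pattern.compile(
            "^\\d+(?:\\.\\d+){2}(?:\\s*-\\s*\\w+)?/\\d+(?:\\.\\d+){3}(?:\\s*-\\s*\\w+)?\\.txt$");

    // Hypothetical helper: true when the whole string is a valid path.
    static boolean isValid(String path) {
        return PATH.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1.0.0/1.0.0.0.txt"));             // true
        System.out.println(isValid("1.0.0 -abc/1.0.0.0- xyz.txt"));   // true
        System.out.println(isValid("1.0.0 - /1.0.0.0.txt"));          // false
        System.out.println(isValid("1.0.0- /1.0.0.0-xyz.txt"));       // false
    }
}
```

Because \w+ needs at least one word character after the dash, a dash followed only by spaces makes the optional group fail, and the slash that must come next then cannot match the leftover space.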
In the group that matches the optional part, you need to replace .+ with \\S+, where \S means any non-whitespace character. This forces the optional part to include at least one non-whitespace character in order to match the pattern:
String part1 = "((\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?)";
String part2 = "((\\d+\\.\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?\\.txt)";
Also note that .{1} (which is the same as just .) matches any character. From the examples, you want to match a literal dot, so it should be replaced with \. (written \\. in a Java string literal).
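The difference between an unescaped and an escaped dot can be seen in a tiny sketch. The class name DotEscapeDemo and the input 1x0x0 are made up for illustration only.

```java
public class DotEscapeDemo {
    // Unescaped dots: each . matches any single character.
    static final String LOOSE = "\\d+.\\d+.\\d+";
    // Escaped dots: each \. matches only a literal dot.
    static final String STRICT = "\\d+\\.\\d+\\.\\d+";

    public static void main(String[] args) {
        System.out.println("1.0.0".matches(LOOSE));   // true
        System.out.println("1x0x0".matches(LOOSE));   // true, probably not intended
        System.out.println("1.0.0".matches(STRICT));  // true
        System.out.println("1x0x0".matches(STRICT));  // false
    }
}
```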
Something like
^\d+\.\d+\.\d+(?:\s*-\s*[^\/\s]+)?\/\d+\.\d+\.\d+\.\d+?(?:\s*-\s*[^.\s]+)?\.\w+$
Check it out here at regex101.

Why won't this string regex match?

I have a string and a simple pattern (a string with a wildcard). When I use the matches function I would expect it to return true for my text, but it returns false.
String text = "test_1_2_3";
String pattern = "test_*";
text.matches(pattern); // this returns false
_* matches the character _ literally zero or more times. Instead you need .*, which matches any character zero or more times:
"test_.*"
pattern = "test_*" means "test" and 0 or more "_"
Your test_* pattern, combined with String#matches (which behaves like Matcher#matches), must match the whole input (i.e. from start to end), so it only accepts input that:
starts with test
is followed by (and ends with) zero or more instances of _ (greedy-quantified here).
Using Matcher#find would return true in this case, since it would match the partial test_.
So, your matches invocation would return true with the given pattern only for inputs such as:
test_
test__
... and so on.
See API.
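The difference between a whole-input match and a partial match can be sketched as follows. The class name MatchesVsFindDemo and the two helper methods are hypothetical names for illustration; Matcher#matches and Matcher#find are the standard java.util.regex methods.

```java
import java.util.regex.Pattern;

public class MatchesVsFindDemo {
    // True only when the regex matches the entire input.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // True when the regex matches any part of the input.
    static boolean partialMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "test_1_2_3";
        System.out.println(wholeMatch("test_*", text));     // false
        System.out.println(partialMatch("test_*", text));   // true
        System.out.println(wholeMatch("test_*", "test__")); // true
        System.out.println(wholeMatch("test_.*", text));    // true
    }
}
```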
Your regexp will match test followed by zero or more '_' characters.
I think you want this:
String text = "test_1_2_3";
String pattern = "test_.*";

Replace multiple capture groups using regexp with java

I have this requirement - for an input string such as the one shown below
8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs
I would like to strip the matched word boundaries (where the matching pair is 8 or & or % etc.), which should result in the following:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
The list of characters used for the pairs can vary, e.g. 8, 9, %, # etc. Only the words that start and end with the same character of that list should be stripped of those characters, with the same character embedded in the word remaining where it is.
Using Java I can use a pattern such as \\b8([^\\s]*)8\\b with the replacement $1 to capture and replace all occurrences of 8...8, but how do I do this for all the types of pairs?
I can provide a pattern such as \\b8([^\\s]*)8\\b|\\b9([^\\s]*)9\\b .. and so on that will match all types of matching pairs (8, 9, ..), but how do I specify a 'variable' replacement group?
E.g. if the match is 9...9, then the replacement should be $2.
I can of course run it through multiple of these, each replacing a specific type of pair, but I am wondering if there is a more elegant way.
Or is there a completely different way of approaching this problem?
Thanks.
You could use the below regex and then replace the matched characters by the characters present inside the group index 2.
(?<!\S)(\S)(\S+)\1(?=\s|$)
OR
(?<!\S)(\S)(\S*)\1(?=\s|$)
Java regex would be,
(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)", "$2"));
Output:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
Explanation:
(?<!\\S) Negative lookbehind, asserts that the match wouldn't be preceded by a non-space character.
(\\S) Captures the first non-space character and stores it into group index 1.
(\\S+) Captures one or more non-space characters.
\\1 Refers to the character inside first captured group.
(?=\\s|$) And the match must be followed by a space or end of the line anchor.
This makes sure that the first character and last character of the string must be the same. If so, then it replaces the whole match by the characters which are present inside the group index 2.
For this specific case, you could modify the above regex as,
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)([89&#%])(\\S+)\\1(?=\\s|$)", "$2"));
(?<![a-zA-Z])[8&#%9](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[8&#%9](?![a-zA-Z])
Try this. Replace with $1 or \1. See demo.
https://regex101.com/r/qB0jV1/15
(?<![a-zA-Z])[^a-zA-Z](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[^a-zA-Z](?![a-zA-Z])
Use this if you have many delimiters.
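Since the question says the list of pair characters can vary, one possible sketch builds the regex from a configurable list of delimiters. It reuses the backreference approach from the answer above ((?<!\S), a captured delimiter, \S+, \1, and a lookahead). The class name PairStripper and the method stripPairs are hypothetical, and quoting each delimiter with Pattern.quote is an added assumption so that regex metacharacters are safe to use as delimiters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PairStripper {
    // Hypothetical helper: strips a leading and trailing delimiter from each
    // whitespace-separated word when both ends carry the same delimiter.
    static String stripPairs(String input, List<String> delimiters) {
        // Quote every delimiter so characters such as $ or * are literal.
        String alternation = delimiters.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Group 1: opening delimiter, group 2: word body, \1: same delimiter again.
        String regex = "(?<!\\S)(" + alternation + ")(\\S+)\\1(?=\\s|$)";
        return input.replaceAll(regex, "$2");
    }

    public static void main(String[] args) {
        String s = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
        System.out.println(stripPairs(s, Arrays.asList("8", "9", "&", "#", "%")));
        // prints This is reallly a test of repl%acing %mul%tiple matched 9pairs
    }
}
```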

what is missing in my java regex?

I want to fetch
http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png
from
url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)
I have tried this code:
String a = "";
Pattern pattern = Pattern.compile("url(.*)");
Matcher matcher = pattern.matcher(imgpath);
if (matcher.find()) {
a = (matcher.group(1));
}
return a;
but a == (http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_639_o_4746_precious_image_1419867529.png)
how can I fine tune it?
Why use a regular expression to begin with?
Given
final String s = "url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)";
If the string is always the same format a simple substring(4,s.length()-1) would be better.
That said, if you insist on a regular expression:
You have to escape the ( as \(, and because the \ itself must be escaped in a Java string literal, it becomes \\(. The same applies to the ).
Then you can get the grouping with url\\((.+)\\).
Learn to use RegEx101.com before coming here, it will point out errors like this immediately.
As you already seem to know, ( and ) represent groups, which means that in the regex
url(.*)
(.*) will place everything after url in group 1, which in case of
url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)
will be
(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)
If you want to exclude ( and ) from the match you need to add their literals to the regex, which means you need to escape them. There are many ways to do it, like adding \ before each of them, or surrounding each of them with [ ].
The other problem with your regex is that .* finds the longest possible match, and since . represents any character (except line separators) it can also include ( and ). To solve this problem you can make the * quantifier reluctant by adding ? after it, so your final regex can be written as the string
"url\\((.*?)\\)"
---------------
url
\\( - ( literal
(.*?) - group 1
\\) - ) literal
or, instead of ., you can use a character class that accepts every character except ), like
"url\\(([^)]*)\\)"
Try this regex:
url\((.*?)\)
The outermost parentheses are escaped so they will be matched literally. The inner parentheses are for capturing a group. The question mark after the .* is to make the match lazy, so the first closing parenthesis found will end the group.
Note that to use this regex in Java, you'll have to additionally escape the backslashes in order to express the above regex as a string literal:
String regex = "url\\((.*?)\\)";
You need to escape the () to match the parenthesis in the string, and then add another set of () around the part you want to pull out in group 1, the actual url. I also changed the part inside the parenthesis to [^)]*, which will match everything until it finds a ). See below:
url\(([^)]*)\)
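Putting the suggestions above together, a minimal sketch of the extraction could look like this. The class name CssUrlDemo and the method extractUrl are hypothetical; returning an empty string when nothing matches mirrors the asker's original code rather than any required behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlDemo {
    // Literal "url(", then group 1 with everything up to the first ")", then ")".
    private static final Pattern CSS_URL = Pattern.compile("url\\(([^)]*)\\)");

    // Hypothetical helper: returns the text inside url(...), or "" if absent.
    static String extractUrl(String value) {
        Matcher m = CSS_URL.matcher(value);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String imgpath = "url(http://d1oiazdc2hzjcz.cloudfront.net/promotions/precious/2x/p_608_o_6288_precious_image_1419866866.png)";
        System.out.println(extractUrl(imgpath));
    }
}
```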

A regex that doesn't match with this character sequence

Here is my Regex, I am trying to search all special characters so that I can escape them.
(\(|\)|\[|\]|\{|\}|\?|\+|\\|\.|\$|\^|\*|\||\!|\&|\-|\#|\#|\%|\_|\"|\:|\<|\>|\/|\;|\'|\`|\~)
My problem is that I don't want to escape some special characters when they come in a sequence
like this: (.*)
So, let's consider an example.
String message = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (,*) &$#%#*(....))(((";
After escaping according to the current regex, what I get is:
Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , \(,\*\) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\(
But I don't want to escape this part, (.*); I want to keep it as it is.
My regex above is only used for searching, so if it simply doesn't match this part, (.*), my problem will be solved.
Can anyone suggest a regex that doesn't escape that part of the string?
See @nhahtdh's answer for how to do this with a regex.
As an alternative, here is a solution which does not use a regex, using Guava's CharMatcher instead:
private static final CharMatcher SPECIAL
    = CharMatcher.anyOf("allspecialcharshere");
private static final String NO_ESCAPE = "(.*)";

public String doEncode(String input)
{
    StringBuilder sb = new StringBuilder(input.length());
    String tmp = input;
    while (!tmp.isEmpty()) {
        // Copy the protected sequence through untouched
        if (tmp.startsWith(NO_ESCAPE)) {
            sb.append(NO_ESCAPE);
            tmp = tmp.substring(NO_ESCAPE.length());
            continue;
        }
        // Otherwise escape the character if it is special
        char c = tmp.charAt(0);
        if (SPECIAL.matches(c))
            sb.append('\\');
        sb.append(c);
        tmp = tmp.substring(1);
    }
    return sb.toString();
}
This answer is to demonstrate the possibility only. Using it in production code is questionable.
It is possible with Java String replaceAll function:
String input = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$#%#*(....))(((";
String output = input.replaceAll("\\G((?:[^()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-]|\\Q(.*)\\E)*+)([()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-])", "$1\\\\$2");
Result:
"Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , (.*) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\("
Another test:
String input = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*) sdf (.*)";
Result:
"(.*) sdfHi test message \<\> \>\>\>\>\>\<\<\<\<f\<f\<,,,,\<\> \<\>(.*) sdf (.*) sdf (.*)"
Explanation
Raw regex:
\G((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+)([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-])
Note that \ is escaped once more when the regex is specified inside the string, and " needs to be escaped. The resulting regex in string can be seen above.
Raw replacement string:
$1\\$2
Since $ has a special meaning in the replacement string and you want to keep it for $2, you need to escape the \ so that \ won't escape the $. And when putting the replacement string in a quoted string literal, you need to double the number of \ to escape each \.
Before we dissect the monster, let's talk about the idea. We consume non-special characters, and the sequence that we don't want to replace, as many times as possible. The next character will either be a special character that does not form the sequence we don't want to replace, or the end of the string (which means that we have found all the characters that need replacing, if any).
Naturally, we can think of any arbitrary string as consisting of many of the following pattern consecutively: [0 or more (non-special character or special pattern not to be replaced)][special character], and the string ends with [0 or more (non-special character or special pattern not to be replaced)].
The replaceAll function, when used with a regex without \G, may find matches that are not consecutive, which can cut into the middle of the sequence not to be replaced and mess it up. \G means the boundary of the last match, and can be used to make sure the next match starts from where the last match left off.
\G: Starts from last match
((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+): Capture 0 or more of the non-special characters or the special pattern not to be replaced. Note that I have added the possessive quantifier + after *. This prevents the engine from backtracking when it cannot find the special character that we specify after this.
[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]: Negated character class of special characters.
\Q(.*)\E: Special sequence (.*) not to be replaced, literal quoted by \Q and \E.
([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]): Capture the single special character.
The whole regex will match string with minimum length of 1 (the special character). The first capturing group contains the parts that shouldn't be replaced, and the 2nd capturing group contains the special character that should be replaced.
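The technique can be tried out on a reduced scale with the sketch below. It uses the same \G, possessive quantifier, and \Q(.*)\E structure as the regex above, but with a deliberately shortened set of special characters to keep it readable; that shortened set, the class name KeepSequenceEscaper, and the method escape are assumptions made for illustration.

```java
public class KeepSequenceEscaper {
    // Shortened, illustrative set of special characters, written so that it
    // can be dropped inside a regex character class: ( ) [ ] { } ? + \ . $ ^ * | !
    private static final String SPECIAL = "()\\[\\]{}?+\\\\.$^*|!";

    // \G anchors each match at the end of the previous one.
    // Group 1 possessively consumes non-special characters and whole "(.*)"
    // sequences; group 2 is the single special character to escape.
    private static final String REGEX =
            "\\G((?:[^" + SPECIAL + "]|\\Q(.*)\\E)*+)([" + SPECIAL + "])";

    // Hypothetical helper: escapes special characters but leaves "(.*)" intact.
    static String escape(String input) {
        // Raw replacement is $1\\$2: group 1, a literal backslash, group 2.
        return input.replaceAll(REGEX, "$1\\\\$2");
    }

    public static void main(String[] args) {
        System.out.println(escape("Hi, Mr.Xyz! (.*) end."));
        // prints Hi, Mr\.Xyz\! (.*) end\.
        System.out.println(escape("(1234) (.*)"));
        // prints \(1234\) (.*)
    }
}
```

In the second call the trailing (.*) survives only because of the possessive *+: without it, the engine could give back the protected sequence and match its opening parenthesis as a special character.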
