Java Regex inconsistent groups - java

Please refer to the following question on SO:
Java: Regex not matching
My Regex groups are not consistent. My code looks like:
public class RegexTest {
public static void main(String[] args) {
// final String VALUES_REGEX = "^\\{([0-9a-zA-Z\\-\\_\\.]+)(?:,\\s*([0-9a-zA-Z\\-\\_\\.]*))*\\}$";
final String VALUES_REGEX = "\\{([\\w.-]+)(?:, *([\\w.-]+))*\\}";
final Pattern REGEX_PATTERN = Pattern.compile(VALUES_REGEX);
final String values = "{df1_apx.fhh.irtrs.d.rrr, ffd1-afp.farr.d.rrr.asgd, ffd2-afp.farr.d.rrr.asgd}";
final Matcher matcher = REGEX_PATTERN.matcher(values);
if (null != values && matcher.matches()) {
// for (int index=1; index<=matcher.groupCount(); ++index) {
// System.out.println(matcher.group(index));
// }
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
}
I tried the following combinations:
A) Regex as "^\{([0-9a-zA-Z\-\_\.]+)(?:,\s*([0-9a-zA-Z\-\_\.]*))*\}$" and use groupCount() to iterate. Result:
df1_apx.fhh.irtrs.d.rrr
ffd2-afp.farr.d.rrr.asgd
B) Regex as "^\{([0-9a-zA-Z\-\_\.]+)(?:,\s*([0-9a-zA-Z\-\_\.]*))*\}$" and use matcher.find(). Result: No result.
C) Regex as "\{([\w.-]+)(?:, *([\w.-]+))*\}" and use groupCount() to iterate. Result:
df1_apx.fhh.irtrs.d.rrr
ffd2-afp.farr.d.rrr.asgd
D) Regex as "\{([\w.-]+)(?:, *([\w.-]+))*\}" and use matcher.find(). Result: No results.
I never get consistent groups. Expected result here is:
df1_apx.fhh.irtrs.d.rrr
ffd1-afp.farr.d.rrr.asgd
ffd2-afp.farr.d.rrr.asgd
Please let me know, how can I achieve it.

(?<=[{,])\s*(.*?)(?=,|})
You can simply use this and grab the captures. See demo.
https://regex101.com/r/sJ9gM7/33
When you have a repeated group such as (something)*, only the last captured value is remembered by the regex engine. You won't get all the groups this way.
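A minimal sketch of how the captures could be collected with find(), using the input from the question. The class and method names (LookaroundItems, extractItems) are illustrative and not part of the original answer, and the closing brace is escaped as \\} for clarity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundItems {
    // Same pattern as in the answer, with the closing brace escaped.
    private static final Pattern ITEM =
            Pattern.compile("(?<=[{,])\\s*(.*?)(?=,|\\})");

    // Hypothetical helper: returns group 1 of every successive match.
    static List<String> extractItems(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String values = "{df1_apx.fhh.irtrs.d.rrr, ffd1-afp.farr.d.rrr.asgd, ffd2-afp.farr.d.rrr.asgd}";
        // Prints the three items, one per line.
        extractItems(values).forEach(System.out::println);
    }
}
```

Note that this only extracts the items; it does not check that the string as a whole is well-formed.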

The problem is that you are trying to do two things at the same time:
you want to validate the string format
you want to extract each item (with an unknown number of items)
So it's not possible with the matches method alone, since when a capture group is repeated, each previous capture is overwritten by the last one.
One possible way is to use the find method to obtain each item and the contiguity anchor \G to check the format. \G ensures that the current match immediately follows the previous one or starts at the beginning of the string:
(?:\\G(?!\\A),\\s*|\\A\\{)([\\w.-]+)(}\\z)?
pattern details:
(?: # two possible begins:
\\G(?!\\A),\\s* # contiguous to a previous match
# (but not at the start of the string)
| # OR
\\A\\{ # the start of the string
)
([\\w.-]+) # an item in the capture group 1
(}\\z)? # the optional capture group 2 to check
# that the end of the string has been reached
So to check the format of the string from start to end all you need is to test if the capture group 2 exists for the last match.
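A minimal sketch of that check in Java, assuming the caller wants an empty list when the format is invalid. The class and method names (ContiguousItems, parse) are hypothetical; the pattern is the one given above, with the closing brace escaped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContiguousItems {
    private static final Pattern ITEM = Pattern.compile(
            "(?:\\G(?!\\A),\\s*|\\A\\{)([\\w.-]+)(\\}\\z)?");

    // Hypothetical helper: returns the items, or an empty list when the
    // last match did not reach the closing brace at the end of the string.
    static List<String> parse(String input) {
        List<String> items = new ArrayList<>();
        boolean reachedEnd = false;
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            items.add(m.group(1));
            // Group 2 participates only when "}" is followed by the end of input.
            reachedEnd = m.group(2) != null;
        }
        if (!reachedEnd) {
            items.clear();
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parse("{df1_apx.fhh.irtrs.d.rrr, ffd1-afp.farr.d.rrr.asgd, ffd2-afp.farr.d.rrr.asgd}"));
        System.out.println(parse("{a, b} trailing"));
    }
}
```

Because \G only matches where the previous match ended, any unexpected character stops the chain of matches before group 2 can participate, so the list is cleared.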

Related

Regex detect if entire string is a placeholder

I am trying to write a regex which should detect
"Is the entire string a placeholder".
An example of a valid placeholder here is ${var}
An example of an invalid placeholder here is ${var}-sometext, as the placeholder is just a part of the text
The regex I have currently is ^\$\{(.+)\}$
This works for normal cases.
for example
1
${var}
Regex Matches
Expected ✅
2
${var} txt
Regex Does Not Match
Expected ✅
even works for nested placeholders
3
${var-${nestedVar}}
Regex Matches
Expected ✅
Where this fails is when the string both begins and ends with a placeholder
for example
4
${var1}-txt-${var2}
Regex Matches
NOT Expected ❌
Basically even though the entire string is not a placeholder, the regex treats it as one as it begins with ${ and ends with }
I can try solving it by replacing .+ with something like [^$]+ to exclude dollar, but that will break the nested use case in example 3.
How do I solve this?
EDIT
Adding some code for context
public static final Pattern PATTERN = Pattern.compile("^\\$\\{(.+)\\}$");
Matcher matcher = PATTERN.matcher(placeholder);
boolean isMatch = matcher.find();
From your example, I think you need to avoid the greedy quantifier:
\$\{(.+?)\}
Notice the ? after +, which makes it a reluctant quantifier: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
That should match ${var1}-txt-${var2}
Now, if you use ^ and $ as well, this will fail.
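To make the difference concrete, here is a small sketch comparing the greedy, reluctant and anchored reluctant forms with find(). The class and helper names (GreedyVsReluctant, captures) are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: collects group 1 of every match found by find().
    static List<String> captures(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "${var1}-txt-${var2}";
        // Greedy: one match spanning from the first "${" to the last "}".
        System.out.println(captures("\\$\\{(.+)\\}", s));
        // Reluctant: one match per placeholder.
        System.out.println(captures("\\$\\{(.+?)\\}", s));
        // Anchored: the reluctant quantifier is forced to expand to the end.
        System.out.println(captures("^\\$\\{(.+?)\\}$", s));
    }
}
```

With the anchors in place, the reluctant quantifier still has to grow until the final } is reached, so the whole string matches again, just as with the greedy version.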
Note that you could also use StringSubstitutor from commons-text to perform a similar job (it will handle the parsing, and you may use a Lookup that captures the variable).
Edit for comment: given that Java regex doesn't support recursion, you would have to hard-code part of the recursion here if you wanted to match all 4 of your cases:
\$\{([^{}-]+)(?:|-\$\{([^{}-]+)\})\}
The first part matches a variable, ignoring {} and -. The other part matches either an empty default value or an interpolation.
If you need to catch ${a-${b-${c}}} you would have to add another layer, which you should avoid: writing a complex regex for its own sake will simply be a maintenance headache (even with only one level of recursion the regex above is hard to read).
If you need to handle recursion, I think you have no alternative but to do it yourself in code, as below:
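As a quick check, the one-level pattern can be run against the four examples from the question with matches(). This is only a sketch; the class and method names (OneLevelPlaceholder, isPlaceholder) are made up for the illustration:

```java
import java.util.regex.Pattern;

public class OneLevelPlaceholder {
    // The pattern from the answer: a variable, optionally followed by
    // "-" and exactly one nested placeholder.
    private static final Pattern ONE_LEVEL = Pattern.compile(
            "\\$\\{([^{}-]+)(?:|-\\$\\{([^{}-]+)\\})\\}");

    static boolean isPlaceholder(String s) {
        return ONE_LEVEL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "${var}", "${var} txt", "${var-${nestedVar}}", "${var1}-txt-${var2}"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isPlaceholder(sample));
        }
    }
}
```

A second level of nesting, such as ${a-${b-${c}}}, is rejected by this pattern, which is the limitation described above.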
void parse(String pattern) {
    if (pattern.startsWith("${") && pattern.endsWith("}")) {
        // remove the leading "${" and the trailing "}"
        var content = pattern.substring(2, pattern.length() - 1);
        var n = content.indexOf('-');
        String leftVar = content;
        if (n != -1) {
            leftVar = content.substring(0, n);
            // perform recursion on the default value after '-'
            parse(content.substring(n + 1));
        }
        // return whatever you need
    }
}
Or use something that already exists.
static boolean isPlaceHolder(String s) {
return s.matches("\\$\\{[^}]*\\}");
}
or optimized for several uses:
private static final Pattern PLACE_HOLDER_PATTERN =
Pattern.compile("\\$\\{[^}]*\\}");
static boolean isPlaceHolder(String s) {
return PLACE_HOLDER_PATTERN.matcher(s).matches();
}
matches() matches from beginning to end, so there is no need for ^...$, as opposed to find().
It still is tricky to detect as false: "${x}, ${y}". It would be best when the placeholder is just for a variable, \\w+.
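A short sketch contrasting matches() and find() for this pattern; the class and method names (MatchesVsFind, whole, anywhere) are illustrative:

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    private static final Pattern PLACE_HOLDER =
            Pattern.compile("\\$\\{[^}]*\\}");

    // matches(): the pattern must cover the entire input.
    static boolean whole(String s) {
        return PLACE_HOLDER.matcher(s).matches();
    }

    // find(): the pattern may match any part of the input.
    static boolean anywhere(String s) {
        return PLACE_HOLDER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("${x}"));
        System.out.println(whole("${x}, ${y}"));
        System.out.println(anywhere("${x}, ${y}"));
        System.out.println(whole("${var-${nestedVar}}"));
    }
}
```

One consequence worth noting: because [^}]* cannot span the inner closing brace, the nested example 3 from the question is rejected by this pattern.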
It is not possible to match arbitrarily deep nested structures using regular expressions. The most you can do with a single regex is match a finite number of nested parts, though your pattern will probably be pretty ugly.
Another approach is to apply a simpler pattern many times, until you have an answer. For example:
Replace everything that matches \$\{[^}]*\} (or \$\{.*?\}) with nothing (the empty string)
Repeat until the pattern no longer matches
If the string is now empty, then the value was "valid".
If the string is not empty, then the value is "invalid".
private static final Pattern PATTERN = Pattern.compile("\\$\\{.*?\\}");
public boolean isValid(String value) {
while (true) {
String newValue = PATTERN.matcher(value).replaceAll("");
if (newValue.equals(value))
break;
value = newValue;
}
return value.isEmpty();
}
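Note that both [^}]* and .*? stop at the first }, so with the loop above a nested value such as ${var-${nestedVar}} is reduced to a stray } and reported as invalid. As a non-regex alternative, here is a minimal sketch that tracks nesting depth instead. It assumes that a placeholder always opens with ${ and closes with }, and that an empty ${} should be rejected; the class and method names are hypothetical:

```java
public class PlaceholderCheck {
    // Returns true when the whole string is one (possibly nested) placeholder.
    static boolean isSinglePlaceholder(String s) {
        if (s == null || s.length() < 4 || !s.startsWith("${") || !s.endsWith("}")) {
            return false;
        }
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '$' && i + 1 < s.length() && s.charAt(i + 1) == '{') {
                depth++;
                i++; // skip the '{' that belongs to this opener
            } else if (c == '}') {
                depth--;
                // Unbalanced, or the outer placeholder closed before the end.
                if (depth < 0 || (depth == 0 && i < s.length() - 1)) {
                    return false;
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(isSinglePlaceholder("${var}"));
        System.out.println(isSinglePlaceholder("${var} txt"));
        System.out.println(isSinglePlaceholder("${var-${nestedVar}}"));
        System.out.println(isSinglePlaceholder("${var1}-txt-${var2}"));
    }
}
```

The key check is that the depth returns to zero only at the last character; in ${var1}-txt-${var2} it reaches zero after the first placeholder, so the string is rejected.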

price formatting fails to match commas

I have this test:
@Test
public void succeedsWhenFormatWithTwoCommas(){
String input = "#,###,###.##";
PriceFormatValidator priceFormatValidator = new PriceFormatValidator();
boolean answer = priceFormatValidator.validate(input);
assertTrue(answer);
}
and it fails when it runs this code:
public boolean validate(String input) {
Pattern pattern = Pattern.compile("^#{1,3}(,?#{3})?(\\.#{0,3})?$");
Matcher matcher = pattern.matcher(input);
boolean isValid = matcher.matches();
return isValid;
}
Why is that?
Your ^#{1,3}(,?#{3})?(\\.#{0,3})?$ regex only allows 1 or zero ,### inside because (,?#{3})? matches an optional sequence of one or zero , followed with exactly 3 # symbols.
You need to turn the (,?#{3})? part into (,#{3})* to allow zero or more sequences of , + three # symbols.
Use
"^#{1,3}(,#{3})*(\\.#{0,3})?$"
See the regex demo.
The whole pattern will now match the following:
^ - start of string
#{1,3} - one to three #
(,#{3})* - zero or more ,+3 # symbols sequences
(\\.#{0,3})? - an optional . + 0 to 3 # symbols
$ - end of string.
NOTE: The (\\.#{0,3})? at the end allows a trailing .. If you do not want that, change it to (\\.#{1,3})?.
NOTE 2: If you are not using the captured values (those matched with (...) patterns), it is a good idea to change capturing groups into non-capturing ones (i.e. (...) with (?:...)).
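Putting the corrected pattern and NOTE 2 together, a sketch of the validator with non-capturing groups might look like this. The class name PriceFormatCheck is illustrative; the validate method mirrors the one in the question:

```java
import java.util.regex.Pattern;

public class PriceFormatCheck {
    // Corrected pattern, with the groups made non-capturing.
    private static final Pattern FORMAT =
            Pattern.compile("^#{1,3}(?:,#{3})*(?:\\.#{0,3})?$");

    static boolean validate(String input) {
        return FORMAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(validate("#,###,###.##")); // two commas
        System.out.println(validate("###"));          // no commas
        System.out.println(validate("#,##"));         // incomplete group
    }
}
```

Compiling the pattern once in a static field also avoids recompiling it on every call.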
You can replace your pattern with:
Pattern pattern = Pattern.compile("^#{1,3}(,?#{3}){1,2}(\\.#{0,3})?$");

java regex pattern string format

I am exploring Regular expressions.
Problem statement : Replace String between # and # with the values provided in replacements map.
import java.util.regex.*;
import java.util.*;
public class RegExTest {
public static void main(String args[]){
HashMap<String,String> replacements = new HashMap<String,String>();
replacements.put("OldString1","NewString1");
replacements.put("OldString2","NewString2");
replacements.put("OldString3","NewString3");
String source = "#OldString1##OldString2#_ABCDEF_#OldString3#";
Pattern pattern = Pattern.compile("\\#(.+?)\\#");
//Pattern pattern = Pattern.compile("\\#\\#");
Matcher matcher = pattern.matcher(source);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, "");
buffer.append(replacements.get(matcher.group(1)));
}
matcher.appendTail(buffer);
System.out.println("OLD_String:"+source);
System.out.println("NEW_String:"+buffer.toString());
}
}
Output: (Caters to my requirement, but I do not understand how the group(1) call works)
OLD_String:#OldString1##OldString2#_ABCDEF_#OldString3#
NEW_String:NewString1NewString2_ABCDEF_NewString3
If I change the code as below
Pattern pattern = Pattern.compile("\\#(.+?)\\#");
with
Pattern pattern = Pattern.compile("\\#\\#");
I am getting below error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
I did not understand the difference between
"\\#(.+?)\\#" and "\\#\\#"
Can you explain the difference?
The difference is fairly straightforward - \\#(.+?)\\# will match two hashes with one or more chars between them, while \\#\\# will match two hashes next to each other.
A more powerful question, to my mind, is "what is the difference between \\#(.+?)\\# and \\#.+?\\#?"
In this case, what's different is what is (or isn't) getting captured. Brackets in a regex indicate a capture group - basically, some substring you want to output separately from the overall matched string. In this case, you're capturing the text in between the hashes - the first pattern will capture and output it separately, while the second will not. Try it yourself - asking for matcher.group(1) on the first will return that text, while the second will produce an exception, even though they both match the same text.
.+? tells it to match one or more of anything lazily, that is, as few characters as possible before the next #.
I think \#\# would match ##, so I think the error is because that pattern has no capture group: there is only group 0 (the whole match), no group 1. But I'm not 100% sure on that part.
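The group counts can be checked directly with Matcher.groupCount(). The sketch below also shows the replacement loop written with a single appendReplacement call; the class and method names are illustrative, and leaving unknown keys untouched is an assumption that goes beyond the original code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Number of capturing groups declared in the pattern (group 0 is not counted).
    static int groupsIn(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Replaces every #key# with its mapped value; unknown keys are left as-is.
    static String replaceTokens(String source, Map<String, String> replacements) {
        Matcher m = Pattern.compile("#(.+?)#").matcher(source);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = replacements.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(groupsIn("#(.+?)#")); // one capturing group
        System.out.println(groupsIn("##"));      // no capturing group
        Map<String, String> map = new HashMap<>();
        map.put("OldString1", "NewString1");
        System.out.println(replaceTokens("#OldString1#_ABC", map));
    }
}
```

With a group count of zero, only group(0), the whole match, is available, which is why asking for group(1) throws IndexOutOfBoundsException.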

Multiple matches with delimiter

this is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match the given input, but it can be "chained". So my input should be valid if and only if the whole pattern repeats without anything between the occurrences (except whitespace): either a single match, or multiple matches next to each other with optional whitespace between them.
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method; this would do the trick, but it will accept input like +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple IF conditions and multiple checks for start and end indexes but I'm searching for elegant solution.
EDIT
The regex suggested in the comments below, ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$, will partially do the trick, but if I use it in the code below it returns a different result from the one I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
The solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.
This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that the sign can only occur once.
You could use ^ and $ for that, to match the start and end of the string:
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.
You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
public class RegTest {
public static final Pattern PATTERN = Pattern.compile(
"(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");
@Test
public void testDays() throws Exception {
assertTrue(valid("1 day"));
assertTrue(valid("-1 day"));
assertTrue(valid("+1day-1month"));
assertTrue(valid("+1day -1month"));
assertTrue(valid(" +1day +1month "));
assertFalse(valid("+1day###+1month"));
assertFalse(valid(""));
assertFalse(valid("++1day-1month"));
}
private static boolean valid(String s) {
return PATTERN.matcher(s).matches();
}
}
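The two-step approach mentioned in the question (validate the whole input first, then extract each occurrence with find()) can be sketched as follows. The class and method names are hypothetical, and the sign is restricted to [+-]? as one of the answers recommends:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepParse {
    // Step 1: the whole input must be a chain of sign/multiplier/unit parts.
    private static final Pattern WHOLE = Pattern.compile(
            "\\s*(?:[+-]?\\d+\\s*[a-zA-Z]+\\s*)+");
    // Step 2: a single part, with one capture group per field.
    private static final Pattern PART = Pattern.compile(
            "([+-]?)(\\d+)\\s*([a-zA-Z]+)");

    // Returns {sign, multiplier, unit} triples, or an empty list if invalid.
    static List<String[]> parse(String input) {
        List<String[]> parts = new ArrayList<>();
        if (!WHOLE.matcher(input).matches()) {
            return parts;
        }
        Matcher m = PART.matcher(input);
        while (m.find()) {
            parts.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String[] p : parse("+1day -1month")) {
            System.out.println(String.join(" | ", p));
        }
        System.out.println(parse("+1day###+1month").isEmpty());
    }
}
```

This costs two passes over the input, but it keeps each regex simple and gives access to every occurrence's groups.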
You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while( m.find() ) {
if (m.group(1) == null) {
// if the capture group 1 (or 2 or 3) is null, it means that the second
// branch of the pattern has succeeded (the \z branch) and that the end
// of the string has been reached.
validFormat = true;
} else {
// otherwise, this is not the end of the string and the match result is
// "temporary" stored in the ArrayList 'results'
HashMap<String, String> result = new HashMap<String, String>();
result.put("sign", m.group(1));
result.put("multiplier", m.group(2));
result.put("time_unit", m.group(3));
results.add(result);
}
}
if (validFormat) {
for (HashMap<String, String> item : results) {
System.out.println("sign: " + item.get("sign")
+ "\nmultiplier: " + item.get("multiplier")
+ "\ntime_unit: " + item.get("time_unit") + "\n");
}
} else {
results.clear();
System.out.println("Invalid Format");
}
The \G anchor matches at the start of the string or at the position after the previous match. In this pattern, it ensures that all matches are contiguous. If the end of the string is reached, that proves the string is valid from start to end.

java regex how to match some string that is not some substring

For example, my original string is:
CCC=123
CCC=DDDDD
CCC=EE
CCC=123
CCC=FFFF
I want everything that is not equal to "CCC=123" to be changed to "CCC=AAA"
So the result is:
CCC=123
CCC=AAA
CCC=AAA
CCC=123
CCC=AAA
How to do it in regex?
If I want everything that is equal to "CCC=123" to be changed to "CCC=AAA", it is easy to implement:
(AAA[ \t]*=)(123)
You can use a negative lookahead:
public static void main(String[] args)
{
String foo = "CCC=123 CCC=DDD CCC=EEE";
Pattern p = Pattern.compile("(CCC=(?!123).{3})");
Matcher m = p.matcher(foo);
String result = m.replaceAll("CCC=AAA");
System.out.println(result);
}
output:
CCC=123 CCC=AAA CCC=AAA
Lookaheads are zero-width and non-capturing, which is why you then have to add the .{3} to consume the characters that are to be replaced.
s = s.replaceAll("(?m)^CCC=(?!123$).*$", "CCC=AAA");
(?m) activates MULTILINE mode, which allows ^ and $ to match the beginning and end of lines, respectively. The $ in the lookahead makes sure you don't skip something that matches only partially, like CCC=12345. The $ at the very end isn't really necessary, since the .* will consume the rest of the line in any case, but it helps communicate your intent.
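Applied to the multi-line input from the question, that one-liner can be tried out like this; the class and method names (ReplaceNonMatching, normalize) are illustrative:

```java
public class ReplaceNonMatching {
    // Rewrites every line that starts with CCC= but is not exactly CCC=123.
    static String normalize(String s) {
        return s.replaceAll("(?m)^CCC=(?!123$).*$", "CCC=AAA");
    }

    public static void main(String[] args) {
        String input = "CCC=123\nCCC=DDDDD\nCCC=EE\nCCC=123\nCCC=FFFF";
        System.out.println(normalize(input));
    }
}
```

Lines equal to CCC=123 are left alone, while a partial match such as CCC=12345 is still rewritten because of the $ inside the lookahead.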
