A bug in a regex in JDK 8? - java

I have this reference working Perl script with a regex, copied from a Java snippet that isn't giving the expected results:
my $regex = '^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$';
if ("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A" =~ /$regex/)
{
print "Matches 1=$1 2=$2 3=$3 4=$4\n";
}
This correctly outputs:
Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A
Now the equivalent Java snippet:
private static final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
private static final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(MutableUniqueIdentity.NON_SYSTEM_TYPE_REGEX);
...
final Matcher match = MutableUniqueIdentity.NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);
The uniqueIdentity input is further back in the stack trace (in a unit test) and is this value:
final String id5CompactString = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
NOTE: The regex and uniqueIdentity values were copied to the Perl program from a debug session to assert if a different language comes up with a different result (which it did).
ADDITIONAL NOTE: The reason the non-capture group is there is to allow the third element in the string to be optional, so it has to deal with both of these:
A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A
A-PROD-COMP-00000000-0000-8033-0000-000200354F0A
My unit test fails in Java - the third match group, which should be LOGL, is in fact 0000.
Here is a screenshot of the debugger right after the regex match line above:
You can see that the pattern matches, you can verify that the input parameter (text) and regex are the same as the Perl script, but the result is different!
So my question is: Why does match.groups(3) have a value of 0000 (when it should have a value LOGL) and how does that related back to the regex and the string it is applied to?
In Perl it yields the correct result - LOGL.
Additional info: I have perused this page that highlights the differences between Perl and Java regex engines, and there doesn't appear to be anything applicable.

Replace your regex with the following regex:
^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
This has been moved out----------^
I have moved - out of the non-capturing group.
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(NON_SYSTEM_TYPE_REGEX);
String uniqueIdentity = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
final Matcher match = NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);
if (match.find()) {
System.out.printf("Matches 1=%s 2=%s 3=%s 4=%s%n", match.group(1), match.group(2), match.group(3),
match.group(4));
}
}
}
Output:
Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A
Check the demo at regex101 as well.

Ok I've made it work, but I don't understand why.
The regex needs to be made non-greedy, so instead of:
^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
it needs to be:
^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*?-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
(with the extra ? after the * of the non-capture group)

Related

java regex pattern string format

I am exploring Regular expressions.
Problem statement : Replace String between # and # with the values provided in replacements map.
import java.util.regex.*;
import java.util.*;
public class RegExTest {
public static void main(String args[]){
HashMap<String,String> replacements = new HashMap<String,String>();
replacements.put("OldString1","NewString1");
replacements.put("OldString2","NewString2");
replacements.put("OldString3","NewString3");
String source = "#OldString1##OldString2#_ABCDEF_#OldString3#";
Pattern pattern = Pattern.compile("\\#(.+?)\\#");
//Pattern pattern = Pattern.compile("\\#\\#");
Matcher matcher = pattern.matcher(source);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, "");
buffer.append(replacements.get(matcher.group(1)));
}
matcher.appendTail(buffer);
System.out.println("OLD_String:"+source);
System.out.println("NEW_String:"+buffer.toString());
}
}
Output: ( Caters to my requirement but does not know who group(1) command works)
OLD_String:#OldString1##OldString2#_ABCDEF_#OldString3#
NEW_String:NewString1NewString2_ABCDEF_NewString3
If I change the code as below
Pattern pattern = Pattern.compile("\\#(.+?)\\#");
with
Pattern pattern = Pattern.compile("\\#\\#");
I am getting below error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
I did not understand difference between
"\\#(.+?)\\#" and `"\\#\\#"`
Can you explain the difference?
The difference is fairly straightforward - \\#(.+?)\\# will match two hashes with one or more chars between them, while \\#\\# will match two hashes next to each other.
A more powerful question, to my mind, is "what is the difference between \\#(.+?)\\# and \\#.+?\\#?"
In this case, what's different is what is (or isn't) getting captured. Brackets in a regex indicate a capture group - basically, some substring you want to output separately from the overall matched string. In this case, you're capturing the text in between the hashes - the first pattern will capture and output it separately, while the second will not. Try it yourself - asking for matcher.group(1) on the first will return that text, while the second will produce an exception, even though they both match the same text.
.+? Tells it to match (one or more of) anything lazily (until it sees a #). So as soon as it parses one instance of something, it stops.
I think the \#\# would match ## so i think the error is because it only matches that one ## and then there's only a group 0, no group 1. But not 100% on that part.

Regex to get the string after # sign

I have a string like follows:
#78517700-1f01-11e3-a6b7-3c970e02b4ec, #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec ....
I want to extract the string after #.
I have the current code like follows:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#[^\\s]+");
Matcher m = PATTERN_LOGIN.matcher("#78517700-1f01-11e3-a6b7-3c970e02b4ec , #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec");
while (m.find()) {
String mentionedLogin = m.group();
.......
}
... but m.group() gives me #78517700-1f01-11e3-a6b7-3c970e02b4ec but I wanted 78517700-1f01-11e3-a6b7-3c970e02b4ec
You should use the regex "#([^\\s]+)" and then m.group(1), which returns you what "captured" by the capturing parentheses ().
m.group() or m.group(0) return you the full matching string found by your regex.
I would modify the pattern to omit the at sign:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#([^\\s]+)");
So the first group will be the GUID only
Correct answers are mentioned in other responses. I will add some clarification. Your code is working correctly, as expected.
Your regex means: match string which starts with # and after that follows one or more characters which isn't white space. So if you omit the parentheses you get you full string as expected.
The parentheses as mentioned in other responses are used for marking capturing groups. In layman terms - the regex engine does the matching multiple times for each parenthesis enclosed group, working it's way inside the nested structure.

Find a number of a given number of digits between given separators

What regex/pattern can I use to find the following pattern in a string?
#nnnn:
nnnn can be any 4-digit long number as long as it is sorrounded by a hashtag and a colon.
I have tried the code below:
String string = "#8226:";
if(string.matches( ".*\\d:.*" )) {
System.out.println( "Yes" );
}
It DOES work, but it matches other strings like below:
"This is a string 1234: Hahaha!" // Outputs "Yes"
"Hello 1834: World!!!" // Outputs "Yes"
I want it to only match the pattern at the top of the question.
Can anybody tell me where did I go wrong?
It can be done with Regular Expression
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindPattern {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("#[0-9]{4}:");
String text = "#1233:#3433:abc#3993: #a343:___#8888:ki";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
output is:
#1233:
#3433:
#3993:
#8888:
You have already a pattern: #nnnn:. The only problem is that this is not a java compatible regular expression. Let's convert.
# and : are valid character literals, so let these untouched.
As you probably know (according to your solution), a number is denoted with the \d sequence (note, there are some alternatives, e. g. [0-9], \p{Digit}). Just replace all ns with \d:
#\d\d\d\d:
There are four equal subpatterns here, so we can shorten it with a fixed quantifier:
#\d{4}:
You can now write string.matches("#\\d{4}:"). Note that this is slow because compiles the given regex pattern every time. If this code is called frequently, I would consider using a precompiled Pattern like:
Pattern HASH_NUMBER_COLON_PATTERN = Pattern.compile("#\\d{4}:");
// ...
if (HASH_NUMBER_COLON_PATTERN.matcher(yourString).matches()) {
// ...
}
Even better to use some regular expression builder library, such as regex-builder, JavaVerbalExpressions or RegexBee. These tools can make your intention very clear. RegexBee example:
Pattern HASH_NUMBER_COLON_PATTERN = Bee
.then(Bee.fixedChar('#'))
.then(Bee.intBetween(1000, 9999))
.then(Bee.fixedChar(':'))
.toPattern()

java regex how to match some string that is not some substring

For example, my org string is:
CCC=123
CCC=DDDDD
CCC=EE
CCC=123
CCC=FFFF
I want everything that does not equal to "CCC=123" to be changed to "CCC=AAA"
So the result is:
CCC=123
CCC=AAA
CCC=AAA
CCC=123
CCC=AAA
How to do it in regex?
If I want everything that is equal to "CCC=123" to be changed to "CCC=AAA", it is easy to implement:
(AAA[ \t]*=)(123)
You can use a negative lookahead:
public static void main(String[] args)
{
String foo = "CCC=123 CCC=DDD CCC=EEE";
Pattern p = Pattern.compile("(CCC=(?!123).{3})");
Matcher m = p.matcher(foo);
String result = m.replaceAll("CCC=AAA");
System.out.println(result);
}
output:
CCC=123 CCC=AAA CCC=AAA
These are zero-width, non capturing, which is why you have to then add the .{3} to capture the non-matching characters to be replaced.
s = s.replaceAll("(?m)^CCC=(?!123$).*$", "CCC=AAA");
(?m) activates MULTILINE mode, which allows ^ and $ to match the beginning and and end of lines, respectively. The $ in the lookahead makes sure you don't skip something that matches only partially, like CCC=12345. The $ at the very end isn't really necessary, since the .* will consume the rest of the line in any case, but it helps communicate your intent.

Find ASCII "arrows" in text

I'm trying to find all the occurrences of "Arrows" in text, so in
"<----=====><==->>"
the arrows are:
"<----", "=====>", "<==", "->", ">"
This works:
String[] patterns = {"<=*", "<-*", "=*>", "-*>"};
for (String p : patterns) {
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
}
but this doesn't:
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
No idea why. It often reports "<" instead of "<====" or similar.
What is wrong?
Solution
The following program compiles to one possible solution to the question:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class A {
public static void main( String args[] ) {
String p = "<=+|<-+|=+>|-+>|<|>";
Matcher m = Pattern.compile(p).matcher(args[0]);
while (m.find()) {
System.out.println(m.group());
}
}
}
Run #1:
$ java A "<----=====><<---<==->>==>"
<----
=====>
<
<---
<==
->
>
==>
Run #2:
$ java A "<----=====><=><---<==->>==>"
<----
=====>
<=
>
<---
<==
->
>
==>
Explanation
An asterisk will match zero or more of the preceding characters. A plus (+) will match one or more of the preceding characters. Thus <-* matches < whereas <-+ matches <- and any extended version (such as <--------).
When you match "<=*|<-*|=*>|-*>" against the string "<---", it matches the first part of the pattern, "<=*", because * includes zero or more. Java matching is greedy, but it isn't smart enough to know that there is another possible longer match, it just found the first item that matches.
Your first solution will match everything that you are looking for because you send each pattern into matcher one at a time and they are then given the opportunity to work on the target string individually.
Your second attempt will not work in the same manner because you are putting in single pattern with multiple expressions OR'ed together, and there are precedence rules for the OR'd string, where the leftmost token will be attempted first. If there is a match, no matter how minimal, the get() will return that match and continue on from there.
See Thangalin's response for a solution that will make the second work like the first.
for <======= you need <=+ as the regex. <=* will match zero or more ='s which means it will always match the zero case hence <. The same for the other cases you have. You should read up a bit on regexs. This book is FANTASTIC:
Mastering Regular Expressions
Your provided regex pattern String does work for your example: "<----=====><==->>"
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
However it is broken for some other examples pointed out in the answers such as input string "<-" yields "<", yet strangely "<=" yields "<=" as it should.

Categories