Why does this regex fail to check accurately? - java

I have the following method, which checks a given string against a regex in 3 stages. For some reason, the regex fails to reject some invalid inputs. As far as I can tell, the patterns are correct. Can someone please tell me what I am doing wrong here?
I have the following code:
public class App {
    public static void main(String[] args) {
        String identifier = "urn:abc:de:xyz:234567.1890123";
        if (identifier.matches("^urn:abc:de:xyz:.*")) {
            System.out.println("Match ONE");
            if (identifier.matches("^urn:abc:de:xyz:[0-9]{6,12}.[0-9]{1,7}.*")) {
                System.out.println("Match TWO");
                if (identifier.matches("^urn:abc:de:xyz:[0-9]{6,12}.[a-zA-Z0-9.-_]{1,20}$")) {
                    System.out.println("Match Three");
                }
            }
        }
    }
}
Ideally, this code should generate the output
Match ONE
Match TWO
Match Three
That should happen only when identifier = "urn:abc:de:xyz:234567.1890123.abd12", but the code produces the same output even if the identifier does not match the intended format, for example for the following inputs:
"urn:abc:de:xyz:234567.1890123"
"urn:abc:de:xyz:234567.1890ANC"
"urn:abc:de:xyz:234567.1890123"
"urn:abc:de:xyz:234567.1890ACB.123"
I don't understand why it allows alphanumeric characters after the first . and why it does not care about the characters after the second . either.
I would like my Regex to check that the string has the following format:
1. The string starts with urn:abc:de:xyz:
2. Then it has 6 to 12 digits [0-9] (234567).
3. Then it has a decimal point .
4. Then it has 1 to 7 digits [0-9] (1890123).
5. Then it has another decimal point .
6. Finally, it has 1 to 20 alphanumeric and special characters (ABC123.-_12).
This is a valid string for my regex: urn:abc:de:xyz:234567.1890123.ABC123.-_12
This is an invalid string for my regex as it misses the elements from point 6:
urn:abc:de:xyz:234567.1890123
This is also an invalid string for my regex as it misses the elements from point 4 (it has ABC instead of decimal numbers).
urn:abc:de:xyz:234567.1890ABC.ABC123.-_12

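This part of the regex:
[0-9]{6,12}.[0-9]{1,7} matches 6 to 12 digits, followed by any character, followed by 1 to 7 digits.
An unescaped . matches any character, so to match a literal dot it needs to be escaped. Also, inside the character class, .-_ is read as a range from . to _, which covers characters such as /, :, ; and @ as well, so put the hyphen last (or escape it). Try this:
^urn:abc:de:xyz:[0-9]{6,12}\.[0-9]{1,7}\.[a-zA-Z0-9._-]{1,20}$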
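To check the escaped pattern against the strings from the question, something like the following minimal sketch could be used. The class name UrnCheck and the method isValid are made up for illustration, and the sketch assumes the last segment may contain a literal dot, as the valid example ABC123.-_12 suggests. In a Java string literal every backslash has to be doubled.

```java
import java.util.regex.Pattern;

public class UrnCheck {
    // Hypothetical helper for illustration; not part of the original post.
    // Assumption: the last segment may contain letters, digits, '.', '_' and '-'.
    // The hyphen is placed last in the class so it is a literal, not a range.
    private static final Pattern URN = Pattern.compile(
            "urn:abc:de:xyz:[0-9]{6,12}\\.[0-9]{1,7}\\.[a-zA-Z0-9._-]{1,20}");

    static boolean isValid(String identifier) {
        // matches() requires the entire input to match, so ^ and $ are not needed.
        return URN.matcher(identifier).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "urn:abc:de:xyz:234567.1890123.ABC123.-_12",
            "urn:abc:de:xyz:234567.1890123",
            "urn:abc:de:xyz:234567.1890ABC.ABC123.-_12"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Only the first sample should be accepted; the second lacks the third segment and the third has letters where only digits are allowed.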

This version allows one or more dot-separated groups of word characters and hyphens at the end of the string, as in your examples:
^urn:abc:de:xyz:\d{6,12}\.\d{1,7}(?:\.[\w-]{1,20})+$

Related

Use regex to get 2 specific groups of substring

String s = #Section250342,Main,First/HS/12345/Jack/M,2000 10.00,
#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,
#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,
#Section251234,Main,First/HS/12345/Jack/M,2000 11.00
Wherever /Jack/M appears in the string, I want to pull the section numbers (250342, 251234) and the associated values (10.00, 11.00) using regex.
I tried something like this https://regex101.com/r/4te0Lg/1 but it still does not work:
.Section(\d+(?:\.\d+)?).*/Jack/M
If the only parts of each section that change are the section number, the name of the person and the last value (like in your example) then you can make a pattern very easily by using one of the sections where Jack appears and replacing the numbers you want by capturing groups.
Example:
#Section250342,Main,First/HS/12345/Jack/M,2000 10.00
becomes,
#Section(\d+),Main,First/HS/12345/Jack/M,2000 (\d+\.\d{2})
If the section substring keeps the format but the other parts of it may change then just replace the rest like this:
#Section(\d+),\w+,(?:\w+/)*Jack/M,\d+ (\d+\.\d{2})
I'm assuming that "Main" is a class, "First/HS/..." is a path and that the last value always has 2 and only 2 decimal places.
\d - A digit: [0-9]
\w - A word character: [a-zA-Z_0-9]
+ - one or more times
* - zero or more times
{2} - exactly 2 times
() - a capturing group
(?:) - a non-capturing group
For reference see: https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/regex/Pattern.html
Simple Java example on how to get the values from the capturing groups using java.util.regex.Pattern and java.util.regex.Matcher
import java.util.regex.*;

public class GetMatch {
    public static void main(String[] args) {
        String s = "#Section250342,Main,First/HS/12345/Jack/M,2000 10.00,#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,#Section251234,Main,First/HS/12345/Jack/M,2000 11.00";
        // The dot in the value is escaped so that it matches only a literal '.'
        Pattern p = Pattern.compile("#Section(\\d+),\\w+,(?:\\w+/)*Jack/M,\\d+ (\\d+\\.\\d{2})");
        Matcher m;
        String[] tokens = s.split(",(?=#)"); // split the sections into separate strings
        for (String t : tokens) { // check every string produced by the split
            m = p.matcher(t);
            if (m.matches()) { // if the string matches the pattern, print the capturing groups
                System.out.printf("Section: %s, Value: %s\n", m.group(1), m.group(2));
            }
        }
    }
}
You could use 2 capture groups, and use a tempered greedy token approach to not cross #Section followed by a digit.
#Section(\d+)(?:(?!#Section\d).)*\bJack/M,\d+\h+(\d+(?:\.\d+)?)\b
Explanation
#Section(\d+) Match #Section and capture 1+ digits in group 1
(?:(?!#Section\d).)* Match any character if not directly followed by #Section and a digit
\bJack/M, Match the word Jack and /M,
\d+\h+ Match 1+ digits and 1+ spaces
(\d+(?:\.\d+)?) Capture group 2, match 1+ digits and an optional decimal part
\b A word boundary
In Java:
String regex = "#Section(\\d+)(?:(?!#Section\\d).)*\\bJack/M,\\d+\\h+(\\d+(?:\\.\\d+)?)\\b";
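For completeness, here is a rough sketch of how that pattern might be applied with Matcher.find() in a loop, without splitting the input first. The class name JackSections and the helper extract are hypothetical, and the input is the single-line string used in the other answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JackSections {
    private static final Pattern P = Pattern.compile(
            "#Section(\\d+)(?:(?!#Section\\d).)*\\bJack/M,\\d+\\h+(\\d+(?:\\.\\d+)?)\\b");

    // Hypothetical helper: returns "section=value" for every section that contains Jack/M.
    static List<String> extract(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            // group(1) is the section number, group(2) the trailing value
            out.add(m.group(1) + "=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "#Section250342,Main,First/HS/12345/Jack/M,2000 10.00,"
                + "#Section250322,Main,First/HS/12345/Aaron/N,2000 17.00,"
                + "#Section250399,Main,First/HS/12345/Jimmy/N,2000 12.00,"
                + "#Section251234,Main,First/HS/12345/Jack/M,2000 11.00";
        System.out.println(extract(s));
    }
}
```

Because the tempered token refuses to step over the next #Section, a match that starts in Aaron's or Jimmy's section cannot borrow the Jack/M from a later one.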

Java regex Matcher.find() confusion

I'm an experienced coder but a regex novice running Oracle's JDK 1.8 on Windows 10.
My code:
private static void regex1() {
    Console con = System.console();
    String txt;
    Pattern pat =
        Pattern.compile(con.readLine("Input a regular expression: "));
    while (true) {
        txt = con.readLine("\nInput a string: ");
        if (txt.isEmpty()) {
            break;
        }
        Matcher mch = pat.matcher(txt);
        if (mch.find()) {
            con.printf("That string matches\n");
            for (int grp = 0; grp <= mch.groupCount(); grp++) {
                con.printf(" Group %d matched %s\n",
                    grp, mch.group(grp));
            }
        }
        else {
            con.printf("That string does not match\n");
        }
    }
}
A sample run:
Input a regular expression: ([a-zA-Z]*), ([a-zA-Z]*)
Pattern: '([a-zA-Z]*), ([a-zA-Z]*)'
Input a string: Doe, John
String: 'Doe, John'
That string matches
2 groups
Group 0 matched 'Doe, John'
Group 1 matched 'Doe'
Group 2 matched 'John'
Input a string: Bond, 007
String: 'Bond, 007'
That string matches
2 groups
Group 0 matched 'Bond, '
Group 1 matched 'Bond'
Group 2 matched ''
Input a string: once again, stuff
String: 'once again, stuff'
That string matches
2 groups
Group 0 matched 'again, stuff'
Group 1 matched 'again'
Group 2 matched 'stuff'
Input a string:
The first and third sets seem fine, but the "Bond, 007" response has me stumped.
The expression is a group of one or more alphas followed by a comma and a space followed by another group of one or more alphas.
The find() method seems to be returning true when it stumbles on the "007" and the group that it claims to have matched is a null string.
Am I missing something obvious here or just losing my mind?
TIA
According to the documentation, the find() method:
Attempts to find the next subsequence of the input sequence that matches the pattern.
For the input Bond, 007, your regex matches as follows:
Capture group 0 (the whole match): "Bond, " (including the trailing space)
Capture group 1 (the first ([a-zA-Z]*)): Bond
Capture group 2 (the second ([a-zA-Z]*)): the empty string
I suspect that your confusion comes either from find() not having to match the entire input (if you want that, use matches() instead), or from * being able to match zero occurrences of the part it applies to (as opposed to +, which must match at least once).
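Both points can be tried out with a small sketch like the one below. The class FindVsMatches and its two helpers are invented for this illustration; the patterns are the one from the question and a variant that uses + instead of *.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    // Returns the text of the first partial match, or null if find() fails.
    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns true only if the entire input matches the pattern.
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String star = "([a-zA-Z]*), ([a-zA-Z]*)"; // pattern from the question
        String plus = "([a-zA-Z]+), ([a-zA-Z]+)"; // variant requiring at least one letter
        String input = "Bond, 007";
        System.out.println("find with *:    '" + firstFind(star, input) + "'");
        System.out.println("find with +:    " + firstFind(plus, input));
        System.out.println("matches with *: " + whole(star, input));
    }
}
```

With * the first call reports the partial match "Bond, ", while the + variant finds nothing and matches() rejects the input because 007 is left over.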

Regex to mask multiple phone numbers separated by ~, except the last 4 digits

I am trying to find a regex that masks phone numbers except for the last 4 digits.
Example: phone=9988998888~7654321908~6789054321
Desired output: phone=******8888~******1908~******4321
I tried the regex below, but it masks only the first number:
^(phone)=(\d(?=\d{4}))*
Actual output: phone=******8888~7654321908~6789054321
Use Matcher.replaceAll(Function<MatchResult, String> replacer) (available since Java 9) to replace each digit of the matched prefix with "*".
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneNumberMask {
    public static void main(String[] args) {
        String target = "phone=9988998888~7654321908~6789054321";
        // Group 1: the run of digits that is still followed by 4 more digits
        Pattern pattern = Pattern.compile("(\\d+(?=\\d{4}))");
        Matcher matcher = pattern.matcher(target);
        // Replace every digit of the matched run with '*'
        String result = matcher.replaceAll((matchResult) -> matchResult.group(1).replaceAll("\\d", "*"));
        System.out.println(result);
    }
}
You could use:
\d(?=\d{4})
\d - Any single digit.
(?=\d{4}) - Positive lookahead for 4 digits.
Replace with *.
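In Java this boils down to a single String.replaceAll call. The sketch below wraps it in a made-up class SimpleMask; note that it masks long digit runs anywhere in the text, not only after phone=.

```java
public class SimpleMask {
    // Hypothetical helper: masks every digit that is followed by at least 4 more digits.
    static String mask(String text) {
        return text.replaceAll("\\d(?=\\d{4})", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("phone=9988998888~7654321908~6789054321"));
    }
}
```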
Assuming you only want to mask all numbers in a string that starts with phone= separated with ~, you can use a plain regex solution without a lambda in the replacement with
String masked = text.replaceAll("(\\G(?!^)(?:\\d{4}~)?|^phone=)\\d(?=\\d{4})", "$1*");
Details:
(\G(?!^)(?:\d{4}~)?|^phone=) - Group 1: end of the previous successful match and then an optional sequence of four digits and a ~ or start of string and phone=
\d - a digit
(?=\d{4}) - followed with any four digits.
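A small sketch of the anchored variant in action (the class name AnchoredMask is invented here). Because every match has to continue from phone= or from the previous match, digits elsewhere in the string should stay untouched.

```java
public class AnchoredMask {
    // Hypothetical helper wrapping the \G-based replacement shown above.
    static String mask(String text) {
        return text.replaceAll("(\\G(?!^)(?:\\d{4}~)?|^phone=)\\d(?=\\d{4})", "$1*");
    }

    public static void main(String[] args) {
        // All three numbers follow phone= and are chained with ~
        System.out.println(mask("phone=9988998888~7654321908~6789054321"));
        // Does not start with phone=, so nothing is replaced
        System.out.println(mask("id=1234567890"));
    }
}
```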

Regex to extract two double numbers separated by a hyphen

I have strings like:
some foo text
some foo
1-2
1.00-2.00
3.21-1.23
2.12-2.12
I have to check if the string format contains two numbers separated by hyphen.
How can I do it?
Thanks
A regex for a float is ^[1-9]\d*\.\d+$ or, if the decimals are optional, ^[1-9]\d*(?:\.\d+)?$ (note that [1-9] rejects numbers with a leading zero, such as 0.5).
Repeat it twice with a hyphen in between:
^[1-9]\d*(?:\.\d+)?-[1-9]\d*(?:\.\d+)?$
You can use the regex:
^\d+(\.\d+)?-\d+(\.\d+)?$
Using java you can create a method that checks whether your desired pattern exists or not:
public static boolean returnMatch(String input) {
    // ^ and $ anchor the pattern to the start and end of the input
    Pattern p1 = Pattern.compile("^\\d+(\\.\\d+)?-\\d+(\\.\\d+)?$");
    Matcher m1 = p1.matcher(input);
    return m1.find();
}
Now call it using:
System.out.println(returnMatch("some foo text")); // false
System.out.println(returnMatch("1.00-2.00")); // true
System.out.println(returnMatch("2.12-2.12")); // true
System.out.println(returnMatch("10-20")); // true
Use a simple Regex:
(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)
The decimal part is optional; when present, it must contain at least one digit.
\d is a digit
\d+ is at least one digit
\. matches a dot (.) literally
() is a capturing group
(?:\.\d+)? is a non-capturing group which optionally matches the decimal part
Don't forget the proper escaping in Java: String regex = "(\\d+(?:\\.\\d+)?)-(\\d+(?:\\.\\d+)?)";
In case one or more spaces or other whitespace characters appear between the dash and the numbers, use:
(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)
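If the two numbers are also needed as values, the capturing groups can be read back and parsed. The following is only a sketch: RangeParser and parse are hypothetical names, and it uses matches() on the assumption that the whole string must have the number-hyphen-number format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeParser {
    private static final Pattern RANGE =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*-\\s*(\\d+(?:\\.\\d+)?)");

    // Hypothetical helper: returns {first, second}, or null when the input has another format.
    static double[] parse(String input) {
        Matcher m = RANGE.matcher(input);
        if (!m.matches()) { // the entire input must match
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2))
        };
    }

    public static void main(String[] args) {
        for (String s : new String[] {"some foo text", "1-2", "3.21 - 1.23"}) {
            double[] r = parse(s);
            System.out.println(s + " -> " + (r == null ? "no match" : r[0] + " and " + r[1]));
        }
    }
}
```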

How regex lookaround works when used alone

public class Test {
    public static void main(String[] args){
        Pattern a = Pattern.compile("(?=\\.)|(?<=\\.)");
        Matcher b = a.matcher(".");
        while (b.find()) System.out.print("+");
    }
}
I've been reading the lookaround section on Regular-Expressions.info and trying to figure out how it works, and I'm stuck on this. When I run the code above, the result is ++, which I don't understand: "." is the only token to match the pattern against, and apparently there is nothing behind or ahead of the ".", so how can it match twice?
As the regex engine advances through the input, it considers both characters and positions before and after characters as distinct positions within the input.
Your input has 3 positions:
1. Just before the first character
2. The first character
3. Just after the first character
Position 1 matches (?=\\.).
Position 3 matches (?<=\\.).
Both lookarounds are zero-width, so each match is empty, and find() succeeds twice.
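One way to see this is to print where each match starts instead of a "+". The sketch below does that with a made-up helper starts; the indices are Java string offsets, so the position before the dot is 0 and the position after it is 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundPositions {
    // Hypothetical helper: returns the start index of every match found by find().
    static List<Integer> starts(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start()); // zero-width matches have start() == end()
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(starts("(?=\\.)|(?<=\\.)", "."));
    }
}
```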
