Regular expressions: some groups missing - java

I have the following Java code:
String s2 = "SUM 12 32 42";
Pattern pat1 = Pattern.compile("(PROD)|(SUM)(\\s+(\\d+))+");
Matcher m = pat1.matcher(s2);
System.out.println(m.matches());
System.out.println(m.groupCount());
for (int i = 1; i <= m.groupCount(); ++i) {
System.out.println(m.group(i));
}
which produces:
true
4
null
SUM
42
42
I wonder why there is a null and why 12 and 32 are missing (I expected to find them among the groups).

A repeated group will contain the match of the last substring matching the expression for the group.
It would be nice if the regexp engine would give back all substrings that matched a group. Unfortunately this is not supported:
Regular expression with variable number of groups?
Furthermore, groups are static and numbered like this:
            0
 _______________________
/                       \
(PROD)|(SUM)(\\s+(\\d+))+
\____/ \___/|    \____/|
  1      2  |      4   |
            \__________/
                 3
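The numbering above can be checked with a short sketch. It assumes the pattern and input from the question; the class and method names are made up for illustration. It lists what each numbered group holds after a successful matches() call, so both the null (group 1 took no part in the match) and the "last iteration wins" behaviour of groups 3 and 4 are visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns the text of every capturing group (1..groupCount) after a full match,
    // or an empty list if the input does not match at all.
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i)); // null when group i took no part in the match
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pattern and input from the question.
        List<String> g = groups("(PROD)|(SUM)(\\s+(\\d+))+", "SUM 12 32 42");
        for (int i = 0; i < g.size(); i++) {
            System.out.println("group " + (i + 1) + " = [" + g.get(i) + "]");
        }
    }
}
```

Group 3 holds " 42" including its leading space, because \\s+ sits inside that group; group 4 holds only the digits of the last repetition.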

Group X from this part of your regex:
(\\s+(\\d+))+
     |    |
     +----+--> X
will first match 12, then 32 and finally 42. Each time X's value gets changed, and replaces the previous one. If you want all values, you'll need a Pattern & Matcher.find() approach:
String s = "SUM 12 32 42 PROD 1 2";
Matcher m = Pattern.compile("(PROD|SUM)((\\s+\\d+)+)").matcher(s);
while(m.find()) {
System.out.println("Matched : " + m.group(1));
Matcher values = Pattern.compile("\\d+").matcher(m.group(2));
while(values.find()) {
System.out.println(" : " + values.group());
}
}
which will print:
Matched : SUM
: 12
: 32
: 42
Matched : PROD
: 1
: 2
And you see a null printed because in group 1, there's PROD, which you didn't match.

I wonder what's a null
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
http://download.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html#group%28int%29
The match as a whole succeeded (matches() printed true), but group 1, (PROD), took no part in it, so m.group(1) returns null.

Related

Regex not matching all numbers with delimiters

Need a single combined regex for the following pattern:
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
where the delimiter between digits can be a space ( ), minus sign (-), period (.), backslash (\), or equals sign (=). The condition is that at most one delimiter (of any type) may occur between any two digits.
Valid number - 230.293.217.952.148.4
Valid number - 230.293 217-952.148.4
Invalid number - 230..293.217.952.148.4
Invalid number - 230.293.-217. 952.148.4
A valid input is one where you have 16 digits separated by any/no delimiters as long as there are no two delimiters adjacent to each other.
I have come up with the following regex:
(2[\s=\\.-]*2[\s=\\.-]*2[\s=\\.-]*[1-9][\s=\\.-]*|2[\s=\\.-]*2[\s=\\.-]*[3-9][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*[3-6][\s=\\.-]*[0-9](?:[\s=\\.-]*[0-9]){1}|2[\s=\\.-]*7[\s=\\.-]*[01][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*7[\s=\\.-]*2[\s=\\.-]*0[\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){11}|(5[\s=\\.-]*[1-5][\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){13}
It does not match certain patterns. For example:
2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4
23-02-93-21-79-52-14-84
2 3 0 3 4 5 8 0 9 4 9 3 0 8 2 3
For the same numbers, it matches (as expected) the following patterns:
2302932179521484
230.293.217.952.148.4
2303458094930823
230.345.809.493.082.3
230-345-809-493-082-3
There seems to be an issue with delimiters. Kindly let me know what is wrong with my regex.
For this rule
A valid input is one where you have 16 digits separated by any/no
delimiters as long as there are no two delimiters adjacent to each
other
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
2221 can also be written as 2.2.-2.1
For these rules, it might be easier to write a pattern with 2 capture groups to match the whole string.
Then using some Java code, you can check the value of the capture groups for the ranges.
^((\d[ =\\.-]?\d)[ =\\.-]?\d[ =\\.-]?\d)(?:[ =\\.-]?\d){12}$
The pattern matches:
^ Start of string
( Capture group 1
(\d[ =\\.-]?\d) Capture group 2: match 2 digits with one optional delimiter (space, =, \, . or -) between them
[ =\\.-]?\d[ =\\.-]?\d Match 2 times an optional delimiter followed by a single digit
) Close group 1
(?:[ =\\.-]?\d){12} Repeat 12 times: an optional delimiter followed by a single digit
$ End of string
For example
String strings[] = {
"2221.7.952.148.412.32",
"230.293.217.952.148.4",
"5511111111111111",
"130.293 217-952.148.4",
"30..293.217.952.148.4",
"5..5",
".5.5."
};
String regex = "^((\\d[ =\\\\.-]?\\d)[ =\\\\.-]?\\d[ =\\\\.-]?\\d)(?:[ =\\\\.-]?\\d){12}$";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
int grp1 = Integer.parseInt(matcher.group(1).replaceAll("\\D+", ""));
int grp2 = Integer.parseInt(matcher.group(2).replaceAll("\\D+", ""));
if ((grp1 >= 2221 && grp1 <= 2720) || (grp2 >=51 && grp2 <= 55)) {
System.out.println("Match for " + matcher.group());
}
}
}
Output
Match for 2221.7.952.148.412.32
Match for 230.293.217.952.148.4
Match for 5511111111111111
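As for why the pattern in the question misses the spaced and dashed inputs, here is one possible explanation that the answer above does not state. The third prefix alternative (the one for 2[3-6]xx) is the only one that does not end with the delimiter class, so the fifth digit has to follow the fourth one immediately. The sketch below isolates that alternative and compares it with a hypothetically patched version; the class and constant names are invented for the demonstration.

```java
import java.util.regex.Pattern;

class PrefixAlternativeCheck {
    // Delimiter class exactly as in the question: whitespace, '=', '\', '.', '-'.
    static final String D = "[\\s=\\\\.-]*";

    // Third prefix alternative as written in the question,
    // followed directly by the pattern for the 12 remaining digits.
    static final String AS_WRITTEN =
            "2" + D + "[3-6]" + D + "[0-9](?:" + D + "[0-9]){1}"
            + "[0-9](?:" + D + "[0-9]){11}";

    // Same alternative with the delimiter class added after the fourth digit
    // (a hypothetical patch, not taken from the thread).
    static final String PATCHED =
            "2" + D + "[3-6]" + D + "[0-9](?:" + D + "[0-9]){1}" + D
            + "[0-9](?:" + D + "[0-9]){11}";

    public static void main(String[] args) {
        String spaced = "2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4";
        String dotted = "230.293.217.952.148.4";
        System.out.println(Pattern.matches(AS_WRITTEN, spaced)); // false
        System.out.println(Pattern.matches(AS_WRITTEN, dotted)); // true
        System.out.println(Pattern.matches(PATCHED, spaced));    // true
        System.out.println(Pattern.matches(PATCHED, dotted));    // true
    }
}
```

The dotted input only passes the original alternative because no delimiter happens to fall between its fourth and fifth digits. Note that the patch keeps the question's `*` quantifier, which still accepts adjacent delimiters; the `?` used in the answer's pattern is what enforces the "at most one delimiter" rule.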

Split string in 3 blocks (Text - Number - Other characters) if possible

I need to divide a string into 3 blocks at most
The first block is only for letters
The second only numbers
The third the remainder
Examples:
Karbwqeaf 11D
Jablunkovska 21/2
Tastoor Nstraat 43
Schzelkjedow
Heajsd 3/5/7 m 344
Lasdasdt seavees 3., 729. tasdasd F 2.
ul. Pasydufasdfa 73k/120
I need to split like this:
Block1: Karbwqeaf
Block2: 11
Block3: D
Block1: Jablunkovska
Block2: 21
Block3: /2
Block1: Tastoor Nstraat
Block2: 43
Block3:
Block1: Schzelkjedow
Block2:
Block3:
Block1: Heajsd
Block2: 3
Block3: /5/7 m 344
Block1: Lasdasdt seavees
Block2: 3
Block3: ., 729. tasdasd F 2.
Block1: ul. Pasydufasdfa
Block2: 73
Block3: k/120
Below is my code, but I don't know how to change it so that all my requirements are met. Any ideas?
List<String> AllAddress = Arrays.asList("Karbwqeaf 11D", "Jablunkovska 21/2", "Tastoor Nstraat 43", "Schzelkjedow", "Heajsd 3/5/7 m 344", "Lasdasdt seavees 3., 729. tasdasd F 2.", "ul. Pasydufasdfa 73k/120");
for (String Address : AllAddress) {
String block1 = "";
String block2 = "";
String block3 = "";
Pattern pattern = Pattern.compile("(.+)\\s(\\d)(.*)");
Matcher matcher = pattern.matcher(Address);
if(matcher.matches()) {
block1 = matcher.group(1);
block2 = matcher.group(2);
block3 = matcher.group(3);
System.out.println("block1 = " + block1);
System.out.println("block2 = " + block2);
System.out.println("block3 = " + block3);
}
}
You can use 3 capturing groups, where the second group matches 1 or more digits and the 3rd group matches any character 0+ times.
^([^\d\r\n]+)(?:\h+(\d+)(.*))?$
Explanation
^ Start of string
( Capture group 1
[^\d\r\n]+ Match 1 or more characters other than a digit, carriage return, or newline
) Close group 1
(?: Non capture group
\h+ Match 1+ horizontal whitespace chars
(\d+)(.*) Capture 1 or more digits in group 2 and capture 0 or more times any character in group 3
)? Close the non capture group and make it optional
$ End of string
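A minimal sketch of how this pattern might be wired into Java, assuming Java 8 or later (needed for \h). The class name AddressSplitter and the choice to return empty strings for missing blocks are illustrative assumptions, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressSplitter {
    // Compiled once and reused; \h (horizontal whitespace) needs Java 8 or later.
    private static final Pattern ADDRESS =
            Pattern.compile("^([^\\d\\r\\n]+)(?:\\h+(\\d+)(.*))?$");

    // Returns {block1, block2, block3}; blocks that are absent come back as "".
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return new String[] {"", "", ""}; // e.g. an input that starts with a digit
        }
        // Groups 2 and 3 are null when the optional non-capturing group was skipped.
        String block2 = m.group(2) == null ? "" : m.group(2);
        String block3 = m.group(3) == null ? "" : m.group(3);
        return new String[] {m.group(1), block2, block3};
    }

    public static void main(String[] args) {
        String[] samples = {"Karbwqeaf 11D", "Schzelkjedow", "ul. Pasydufasdfa 73k/120"};
        for (String s : samples) {
            String[] b = split(s);
            System.out.println("Block1: " + b[0] + " | Block2: " + b[1] + " | Block3: " + b[2]);
        }
    }
}
```

Because the digits and the remainder live inside an optional group, an input such as Schzelkjedow still matches and simply leaves blocks 2 and 3 empty.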

Java regex: how to select words starting with a specific letter and is x number of characters long?

This is the code I wrote that selects all names starting from A:
String longString = "Amal Kamal Jamal Amitha Farook Amani Tom Adele George Ariana";
String pattern = "(?i)(\\s|^)[a][A-Za-z]+(\\s|$)";
Pattern checkRegex = Pattern.compile(pattern);
Matcher regexMatcher = checkRegex.matcher(longString);
while (regexMatcher.find()) {
System.out.println(regexMatcher.start() + " : " + regexMatcher.group());
}
Output is as expected
0 : Amal
16 : Amitha
30 : Amani
40 : Adele
53 : Ariana
Now I want to select names that are at least 5 characters long. So the expected output is: Amitha, Adele, Ariana.
When I type this only Ariana is returned. And I can't understand why.
String pattern = "(?i)(\\s|^)[a][A-Za-z]+(\\s|$){5,}";
Output
53 : Ariana
If I put a bracket around the whole expression (to say that this expression should be 5 characters long) Then output is nothing
String pattern = "(?i)((\\s|^)[a][A-Za-z]+(\\s|$)){5,}";
What is the correct way of writing this?
You quantified (\\s|$) while you need to quantify [a-zA-Z]. So, you only match texts that have 5 or more whitespaces or 5 or more ends of string (makes no sense of course) after the words. Also, you need to use {4,} as [a] already matches 1 letter.
Use this regex to fix the issue (although it is not the best one, see below why):
(?i)(\s|^)a[a-z]{4,}(\s|$)
Details
(?i) - case insensitive modifier
(\s|^) - either a whitespace or a start of a string
a - an a or A letter
[a-z]{4,} - any 4 or more ASCII letters
(\s|$) - either a whitespace or an end of a string (note: the whitespace will be consumed, and consecutive matching words will not be handled properly).
You may use "(?i)(?<!\\S)a[a-z]{4,}(?!\\S)" pattern to make sure you are matching a word in between whitespaces or start/end of string positions.
Or, use word boundaries - "(?i)\\ba[a-z]{4,}\\b".
See the Java online demo:
String longString = "Amal Kamal Jamal Amitha Farook Amani Tom Adele George Ariana";
String pattern = "(?i)(?<!\\S)a[a-z]{4,}(?!\\S)";
Pattern checkRegex = Pattern.compile(pattern);
Matcher regexMatcher = checkRegex.matcher(longString);
while (regexMatcher.find()) {
System.out.println(regexMatcher.start() + " : " + regexMatcher.group());
}
Result:
17 : Amitha
31 : Amani
41 : Adele
54 : Ariana
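The note about consumed whitespace can be seen with an input where two qualifying names are adjacent. The sketch below is only an illustration: the sample text and the helper name find are invented, and the three patterns are the ones discussed in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatchDemo {
    // Collects every match of regex in text, trimmed of surrounding whitespace.
    static List<String> find(String regex, String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            words.add(m.group().trim());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Adele Ariana Amitha";
        // Consuming version: the space after Adele is part of the first match,
        // so Ariana has no whitespace left in front of it and is skipped.
        System.out.println(find("(?i)(\\s|^)a[a-z]{4,}(\\s|$)", text));   // [Adele, Amitha]
        // Lookarounds and word boundaries consume nothing, so all three are found.
        System.out.println(find("(?i)(?<!\\S)a[a-z]{4,}(?!\\S)", text)); // [Adele, Ariana, Amitha]
        System.out.println(find("(?i)\\ba[a-z]{4,}\\b", text));          // [Adele, Ariana, Amitha]
    }
}
```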

regular expression for dicom age

I am trying to create regular expression.
There is an age, that can be written in the number of ways:
e.g. for person 64 years old it could be:
064Y
064
64
but for 0 years old it could also be
0Y
0
Could you help me produce the right regular expression for a Java Matcher, so that I can get an Integer after parsing the age string?
Currently I came to the following, which obviously does not cover all the possible cases.
@Test
public void testAgeConverter() throws AppException, IOException {
Pattern pattern = Pattern.compile("0([0-9]+|[1-9]+)[Yy]?");
Matcher m = pattern.matcher("062Y");
String str = "";
if (m.find()) {
for (int i = 1; i <= m.groupCount(); i++) {
str += "\n" + m.group(i);
}
}
System.out.println(str);
}
I will appreciate your help, thank you.
I would try with the following self-contained example:
String[] testCases = {
"064Y", "064", "64", "0Y", "0"
};
int[] expectedResults = {
64, 64, 64, 0, 0
};
//                           ┌ optional leading 0
//                           | ┌ 1 or 2 digits from 0 to 9 (00->99)
//                           | | in group 1
//                           | |           ┌ optional one Y
//                           | |           |    ┌ case insensitive
Pattern p = Pattern.compile("0*([0-9]{1,2})Y?", Pattern.CASE_INSENSITIVE);
// fine-tune the Pattern for centenarians
// (up to 199 years in this ugly draft):
// "0*([0-1]?[0-9]{1,2})Y?"
for (int i = 0; i < testCases.length; i++) {
Matcher m = p.matcher(testCases[i]);
if (m.find()) {
System.out.printf("Found: %s%n", m.group());
int result = Integer.parseInt(m.group(1));
System.out.printf("Expected result is: %d, actual result is: %d", expectedResults[i], result);
System.out.printf("... matched? %b%n", result == expectedResults[i]);
}
}
Output
Found: 064Y
Expected result is: 64, actual result is: 64... matched? true
Found: 064
Expected result is: 64, actual result is: 64... matched? true
Found: 64
Expected result is: 64, actual result is: 64... matched? true
Found: 0Y
Expected result is: 0, actual result is: 0... matched? true
Found: 0
Expected result is: 0, actual result is: 0... matched? true
In any case, you only want the numbers so you could use
[0]*((\d)*)
Note that to make it work in Java you have to escape the backslash so
[0]*((\\d)*)
Then just capture the first matching group, which selects all the digits except the leading zeros. In the case of 0 or 0Y it would select nothing, but you could then check for that easily with
if(result.isEmpty())
val = 0;
You could try something like this: ^0*?(\d+)Y?$. You would then use the regex group to extract the integer value you are after.
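A minimal sketch of that idea wrapped in a helper. The class name, the -1 return value for unparseable input, and the three-digit cap (added so that Integer.parseInt cannot overflow) are assumptions made for the example; the pattern is a close variation of the one above rather than a copy of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DicomAgeParser {
    // Leading zeros are skipped, up to three digits are captured, and a trailing Y or y is optional.
    private static final Pattern AGE =
            Pattern.compile("0*(\\d{1,3})Y?", Pattern.CASE_INSENSITIVE);

    // Returns the age in years, or -1 when the text is not in one of the expected forms.
    static int parseYears(String text) {
        Matcher m = AGE.matcher(text.trim());
        // matches() requires the whole string to fit, so no ^ and $ anchors are needed.
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"064Y", "064", "64", "0Y", "0"}) {
            System.out.println(s + " -> " + parseYears(s));
        }
    }
}
```

For 0 and 0Y the greedy 0* first takes the only digit, then backtracks and hands it to the capturing group, so the group is never empty.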
Why is your expression so complicated? Won't this do...?
Pattern pattern = Pattern.compile("([0-9]+)[Yy]?");
Matcher m = pattern.matcher("062Y");
Integer age = null;
if (m.find()) {
age = Integer.valueOf(m.group(1));
}
System.out.println(age);
You need to be more specific with the regex. The problem can be solved with:
[0-9]+[Yy]?
But this won't help you much; you should try to narrow it down further with unique identifiers around these values.
If you are using matcher.find, it is not necessary to match the [yY] suffix, but leading zeros still have to be skipped; otherwise the first match for 064 is the lone 0. Thus we have:
0*(1[0-9][0-9]|[1-9]?[0-9])
which will find the integers from 0 to 199 and return them in group 1.

Understanding regular expression output [duplicate]

I need help to understand the output of the code below. I am unable to figure out the output for System.out.println(m.start() + m.group());. Please can someone explain it to me?
import java.util.regex.*;
class Regex2 {
public static void main(String[] args) {
Pattern p = Pattern.compile("\\d*");
Matcher m = p.matcher("ab34ef");
boolean b = false;
while(b = m.find()) {
System.out.println(m.start() + m.group());
}
}
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have a large number of empty matches (one at each remaining position in the string) with the exception of 34, which matches the string of digits. Use \\d+ if you want to match digits without also matching empty strings.
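The modified code referred to above is not shown in the thread; the following is a guess at its shape, with invented class and method names. It puts the matched text in parentheses so that empty matches become visible, and runs both \\d* and \\d+ over the same input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Formats every match as "start (text)" so that empty matches show up as "()".
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + " (" + m.group() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d*", "ab34ef")); // [0 (), 1 (), 2 (34), 4 (), 5 (), 6 ()]
        System.out.println(matches("\\d+", "ab34ef")); // [2 (34)]
    }
}
```

In the original code, m.start() + m.group() is string concatenation of an int and a String, which is why the third line of output reads 234: the start index 2 immediately followed by the match 34.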
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
\d* - matches zero or more digits in the expression.
The expression ab34ef has the corresponding indices 012345.
At index 0 there are no digits, so start() prints 0 and group() prints nothing; the same happens at index 1. At index 2 we find a match, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while(matcher.find()) {
System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which will print:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial
