Understanding regular expression output [duplicate] - java

This question already has an answer here:
SCJP6 regex issue
(1 answer)
Closed 7 years ago.
I need help understanding the output of the code below. I cannot work out why System.out.println(m.start() + m.group()); prints what it does. Can someone please explain it to me?
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        boolean b = false;
        while (b = m.find()) {
            System.out.println(m.start() + m.group());
        }
    }
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6

Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have a large number of empty matches (one at each location in the string) with the exception of 34, which matches the run of digits. Use \\d+ if you want to match digits without also matching empty strings.

You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any run of digits, e.g. 34, and also every other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
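To make the empty matches visible, here is a minimal sketch that records the start index, end index and matched text of every match. The class name EmptyMatchDemo and the helper describeMatches are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Hypothetical helper: lists every match as "start-end:[text]".
    // An empty match shows up with start == end and nothing between the brackets.
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":[" + m.group() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0-0:[], 1-1:[], 2-4:[34], 4-4:[], 5-5:[], 6-6:[]]
        System.out.println(describeMatches("\\d*", "ab34ef"));
        // prints [2-4:[34]]
        System.out.println(describeMatches("\\d+", "ab34ef"));
    }
}
```

The last \d* entry, 6-6, is the empty match at the very end of the input, which is why index 6 appears even though the string has only indices 0 to 5.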

\d* - matches zero or more digits in the input.
The input is ab34ef and its corresponding indices are 012345.
At index 0 there is no digit, so start() prints 0 and group() prints nothing (an empty match); at index 1 it prints 1 and nothing. At index 2 we find a match, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while(matcher.find()) {
System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which prints:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial

Related

Regex not matching all numbers with delimiters

Need a single combined regex for the following pattern:
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
where the delimiters between digits can be a space ( ), minus sign (-), period (.), backslash (\), or equals sign (=). The condition is that at most one delimiter (of any type) may occur between any two digits.
Valid number - 230.293.217.952.148.4
Valid number - 230.293 217-952.148.4
Invalid number - 230..293.217.952.148.4
Invalid number - 230.293.-217. 952.148.4
A valid input is one where you have 16 digits separated by any/no delimiters as long as there are no two delimiters adjacent to each other.
I have come up with the following regex:
(2[\s=\\.-]*2[\s=\\.-]*2[\s=\\.-]*[1-9][\s=\\.-]*|2[\s=\\.-]*2[\s=\\.-]*[3-9][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*[3-6][\s=\\.-]*[0-9](?:[\s=\\.-]*[0-9]){1}|2[\s=\\.-]*7[\s=\\.-]*[01][\s=\\.-]*[0-9][\s=\\.-]*|2[\s=\\.-]*7[\s=\\.-]*2[\s=\\.-]*0[\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){11}|(5[\s=\\.-]*[1-5][\s=\\.-]*)[0-9](?:[\s=\\.-]*[0-9]){13}
It does not match certain patterns. For example:
2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4
23-02-93-21-79-52-14-84
2 3 0 3 4 5 8 0 9 4 9 3 0 8 2 3
For the same numbers, it matches (as expected) the following patterns:
2302932179521484
230.293.217.952.148.4
2303458094930823
230.345.809.493.082.3
230-345-809-493-082-3
There seems to be an issue with delimiters. Kindly let me know what is wrong with my regex.
For this rule
A valid input is one where you have 16 digits separated by any/no
delimiters as long as there are no two delimiters adjacent to each
other
Prefix: 2221-2720 , Length: 16
Prefix: 51-55 , Length: 16
2221 can also be written as 2.2.-2.1
For these rules, it might be easier to write a pattern with 2 capture groups to match the whole string.
Then using some Java code, you can check the value of the capture groups for the ranges.
^((\d[ =\\.-]?\d)[ =\\.-]?\d[ =\\.-]?\d)(?:[ =\\.-]?\d){12}$
The pattern matches:
^ Start of string
( Capture group 1
(\d[ =\\.-]?\d) Capture group 2: match 2 digits with an optional delimiter (space, =, \, . or -) between them
[ =\\.-]?\d[ =\\.-]?\d Match, twice, an optional delimiter followed by a single digit
) Close group 1
(?:[ =\\.-]?\d){12} Repeat 12 times: an optional delimiter followed by a single digit
$ End of string
For example
String strings[] = {
"2221.7.952.148.412.32",
"230.293.217.952.148.4",
"5511111111111111",
"130.293 217-952.148.4",
"30..293.217.952.148.4",
"5..5",
".5.5."
};
String regex = "^((\\d[ =\\\\.-]?\\d)[ =\\\\.-]?\\d[ =\\\\.-]?\\d)(?:[ =\\\\.-]?\\d){12}$";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
for (String s : strings) {
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
int grp1 = Integer.parseInt(matcher.group(1).replaceAll("\\D+", ""));
int grp2 = Integer.parseInt(matcher.group(2).replaceAll("\\D+", ""));
if ((grp1 >= 2221 && grp1 <= 2720) || (grp2 >=51 && grp2 <= 55)) {
System.out.println("Match for " + matcher.group());
}
}
}
Output
Match for 2221.7.952.148.412.32
Match for 230.293.217.952.148.4
Match for 5511111111111111
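As a rough sketch, the same two-step check (shape via regex, range via integer comparison) can be wrapped in a reusable method and run against the inputs from the question. The class name PrefixCheck and the method isValid are hypothetical names chosen for this example; the sketch uses matches() instead of find() with ^ and $, which is an assumption about the intent (whole-string validation).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixCheck {
    // 16 digits with at most one delimiter (space = \ . -) between neighbouring digits.
    // Group 1 holds the first four digits, group 2 the first two.
    private static final Pattern SIXTEEN_DIGITS = Pattern.compile(
        "((\\d[ =\\\\.-]?\\d)[ =\\\\.-]?\\d[ =\\\\.-]?\\d)(?:[ =\\\\.-]?\\d){12}");

    // Hypothetical helper: true when the shape fits and the prefix is in 2221-2720 or 51-55.
    static boolean isValid(String input) {
        Matcher m = SIXTEEN_DIGITS.matcher(input);
        if (!m.matches()) { // matches() tests the whole string, so no anchors are needed
            return false;
        }
        int first4 = Integer.parseInt(m.group(1).replaceAll("\\D", ""));
        int first2 = Integer.parseInt(m.group(2).replaceAll("\\D", ""));
        return (first4 >= 2221 && first4 <= 2720) || (first2 >= 51 && first2 <= 55);
    }

    public static void main(String[] args) {
        System.out.println(isValid("2 3 0 2 9 3 2 1 7 9 5 2 1 4 8 4")); // true
        System.out.println(isValid("23-02-93-21-79-52-14-84"));         // true
        System.out.println(isValid("230..293.217.952.148.4"));          // false, two adjacent delimiters
        System.out.println(isValid("130.293 217-952.148.4"));           // false, prefix out of range
    }
}
```

Keeping the range check in Java rather than in the regex avoids the long alternation from the question, where a delimiter class is easy to leave out of one branch.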

Replace dash character between three or more number separate by dash with space in sentence

I want to replace the dashes with a single space wherever three or more numbers (each a single digit, i.e. 0 - 9) are separated by dashes in a sentence. What is a good way to do this?
Sample Input:
4-2-2-1 kim yoong-yun
4 -2 - 2 - 1 and 4 - 5
1-2-3-4-5
1-5
4 - 5
Expected Output:
4 2 2 1 kim yoong-yun
4 2 2 1 and 4 - 5
1 2 3 4 5
1-5 // will not replace
4 - 5 // will not replace
I know I can do this with this complex method:
String sentence = "4-2-3-1";
Pattern pCode = Pattern.compile("\\b(?:\\d ?- ?){2,}\\d");
Matcher mCode = pCode.matcher(sentence);
while (mCode.find()) {
    sentence = mCode.replaceFirst(mCode.group(0).replaceAll(" ?- ?", " "));
    mCode = pCode.matcher(sentence);
}
System.out.print(sentence); // 4 2 3 1
But can I do it in one replace, or is there a simpler solution?
In Java 9+, you may use the Matcher#replaceAll(Function<MatchResult,String> replacer) method:
String sentence = "4-2-3-1";
Pattern pCode = Pattern.compile("\\b\\d(?:\\s?-\\s?\\d){2,}\\b");
Matcher mCode = pCode.matcher(sentence);
String result = mCode.replaceAll(x -> x.group().replace("-", " ") );
System.out.println( result ); // => 4 2 3 1
See the online Java demo. In earlier versions, use
String sentence = "4-2-3-1";
Pattern pCode = Pattern.compile("\\b\\d(?:\\s?-\\s?\\d){2,}\\b");
Matcher mCode = pCode.matcher(sentence);
StringBuffer sb = new StringBuffer();
while (mCode.find()) {
mCode.appendReplacement(sb, mCode.group().replace("-", " "));
}
mCode.appendTail(sb);
System.out.println(sb); // => 4 2 3 1
See this Java demo.
The regex is a bit modified to follow the best practices (quantified parts should be moved as far to the right as possible):
\b\d(?:\s?-\s?\d){2,}\b
See the regex demo. Details:
\b - word boundary
\d - a single digit
(?:\s?-\s?\d){2,} - two or more occurrences of:
\s?-\s? - a hyphen with at most one whitespace character on each side
\d - a single digit
\b - word boundary
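The pattern described above can be tried against the sample inputs from the question with a sketch like the following. It assumes Java 9+ for Matcher#replaceAll(Function). The class name DashToSpace and the method collapse are invented for the example, and the inner replacement uses \s?-\s? instead of a plain "-" (a small variation, assumed here so that the spaced input "4 -2 - 2 - 1" collapses to single spaces as in the expected output).

```java
import java.util.regex.Pattern;

class DashToSpace {
    // A digit followed by two or more "dash + digit" steps, i.e. three or more single digits.
    private static final Pattern RUN = Pattern.compile("\\b\\d(?:\\s?-\\s?\\d){2,}\\b");

    // Hypothetical helper: rewrites only the matched runs, leaving other dashes alone.
    static String collapse(String sentence) {
        return RUN.matcher(sentence)
                  .replaceAll(r -> r.group().replaceAll("\\s?-\\s?", " "));
    }

    public static void main(String[] args) {
        System.out.println(collapse("4-2-2-1 kim yoong-yun"));   // 4 2 2 1 kim yoong-yun
        System.out.println(collapse("4 -2 - 2 - 1 and 4 - 5"));  // 4 2 2 1 and 4 - 5
        System.out.println(collapse("1-2-3-4-5"));               // 1 2 3 4 5
        System.out.println(collapse("1-5"));                     // 1-5
        System.out.println(collapse("4 - 5"));                   // 4 - 5
    }
}
```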
You can use the following function
private static String unDash(String input) {
String[] splitString = input.split("\\s*-\\s*");
if(splitString.length < 3){
return input;
} else {
return String.join(" ", splitString);
}
}
The split uses "\\s*-\\s*", which also trims the whitespace around each '-'. String.join then recombines the split parts using a delimiter, which in our case is " ".
Here’s a 1-liner:
String s2 = s1.matches("(\\d+[ -]+){2,}\\d") ? s1.replaceAll("[ -]+", " ") : s1;
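Your logic is “replace separators if there are three or more numbers”, and this code captures that succinctly. Note that matches() tests the whole string, so the replacement only happens when s1 consists entirely of the number sequence.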
See live demo.

Java Regular expressions issue - Can't match two strings in the same line [duplicate]

This question already has answers here:
What do 'lazy' and 'greedy' mean in the context of regular expressions?
(13 answers)
Closed 8 years ago.
I am experiencing some problems with Java regular expressions.
I have a program that reads through an HTML file and replaces any string inside the #VR# characters, i.e. #VR#Test1 2 3 4#VR#
However, my issue is that if the line contains more than one string surrounded by #VR#, it does not match them separately. It matches the leftmost #VR# with the rightmost #VR# in the line and thus takes whatever is in between.
For example:
#VR#Google#VR#
My code would match
URL-GOES-HERE#VR#" target="_blank" style="color:#f4f3f1; text-decoration:none;" title="ContactUs">#VR#Google
Here is my Java code. Would appreciate if you could help me to solve this:
Pattern p = Pattern.compile("#VR#.*#VR#");
Matcher m;
Scanner scanner = new Scanner(htmlContent);
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
m = p.matcher(line);
StringBuffer sb = new StringBuffer();
while (m.find()) {
String match_found = m.group().replaceAll("#VR#", "");
System.out.println("group: " + match_found);
}
}
I tried replacing m.group() with m.group(0) and m.group(1) but nothing. Also m.groupCount() always returns zero, even if there are two matches as in my example above.
Thanks, your help will be very much appreciated.
Your problem is that .* is "greedy"; it will try to match as long a substring as possible while still letting the overall expression match. So, for example, in #VR# 1 #VR# 2 #VR# 3 #VR#, it will match 1 #VR# 2 #VR# 3.
The simplest fix is to make it "non-greedy" (matching as little as possible while still letting the expression match), by changing the * to *?:
Pattern p = Pattern.compile("#VR#.*?#VR#");
Also m.groupCount() always returns zero, even if there are two matches as in my example above.
That's because m.groupCount() returns the number of capture groups in the underlying pattern (parenthesized subexpressions, whose matched substrings are retrieved using m.group(1), m.group(2) and so on), not the number of matches. In your case, your pattern has no capture groups, so m.groupCount() returns 0.
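A small sketch may help to compare the greedy and non-greedy forms side by side and to show what groupCount() reports once a capture group is present. The class name VrExtractor, the helper extract and the sample line are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VrExtractor {
    // Hypothetical helper: collects capture group 1 of every match found in the line.
    static List<String> extract(String regex, String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "a #VR#one#VR# b #VR#two#VR# c";
        // Greedy: one match spanning from the first #VR# to the last.
        System.out.println(extract("#VR#(.*)#VR#", line));   // [one#VR# b #VR#two]
        // Non-greedy: each delimited string is matched on its own.
        System.out.println(extract("#VR#(.*?)#VR#", line));  // [one, two]
        // groupCount() counts capture groups in the pattern, not matches.
        System.out.println(Pattern.compile("#VR#(.*?)#VR#").matcher(line).groupCount()); // 1
    }
}
```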
You can try the regular expression:
#VR#(((?!#VR#).)+)#VR#
Demo:
private static final Pattern REGEX_PATTERN =
Pattern.compile("#VR#(((?!#VR#).)+)#VR#");
public static void main(String[] args) {
String input = "#VR#Google#VR# ";
System.out.println(
REGEX_PATTERN.matcher(input).replaceAll("$1")
); // prints "Google "
}

Trying to understand this Regex code [duplicate]

This question already has an answer here:
SCJP6 regex issue
(1 answer)
Closed 7 years ago.
I have the following code. As far as I can see, the program should print 0123445. Instead, it prints 01234456.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Regex2 {
public static void main(String[] args) {
Pattern p = Pattern.compile("\\d*");
Matcher m = p.matcher("ab34ef");
boolean b = false;
while(b=m.find()){
System.out.print(m.start() + m.group());
}
System.out.println();
}
}
I think the following should happen-
Since the search pattern is for a \d*,
It finds a hit at position 0, but since the hit is not a digit, it just prints 0
It finds a hit at position 1, but again, not a digit, prints 1
Finds a hit at position 2 and since we are looking for \d*, the hit is 34, and so it prints 234.
Moves to position 4, finds a hit, but since hit is not a digit, it just prints 4.
Moves to position 5, finds a hit, but since hit is not a digit, it just prints 5.
At this point, as far as I can see, it should be done. But for some reason, the program also returns a 6.
I would much appreciate it if someone could explain.
The \d* matches zero(!) or more digits. That's why it returns an empty string as a match at positions 0 and 1; it then matches 34 at position 2 and an empty string again at positions 4 and 5. At that point what is left to match against is an empty string. This empty string also matches \d* (because an empty string contains zero digits), which is why there is another match at position 6.
To contrast this try using \d+ (which matches one or more digits) as the pattern and see what happens then.
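The contrast can be sketched as follows, reusing the question's print logic but collecting the output into a string. The class name ZeroOrMore and the helper concat are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroOrMore {
    // Hypothetical helper: concatenates start index + matched text for every match,
    // mirroring System.out.print(m.start() + m.group()) from the question.
    static String concat(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.start()).append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat("\\d*", "ab34ef")); // 01234456
        System.out.println(concat("\\d+", "ab34ef")); // 234
        // The empty string itself matches \d* but not \d+.
        System.out.println(Pattern.matches("\\d*", "")); // true
        System.out.println(Pattern.matches("\\d+", "")); // false
    }
}
```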

Java Regular expression

I am not very familiar with regular expressions.
I need help with the following regular expressions:
1. A string that starts with an alphabetic word, followed by any letters or numbers, e.g. Abc 20 Jan to 15 Dec
2. A string for a decimal number, e.g. 450,122,224.00
3. A check for whether a string contains any pattern like 'Page 2 of 20'
Thanks.
// 1. String starts with an alpha word, followed by
// any alpha or number, e.g. Abc 20 Jan to 15 Dec
// One or more alpha characters, followed by a space,
// followed by some alphanumeric character, followed by whatever
Pattern p = Pattern.compile("\\p{Alpha}+ \\p{Alnum}.*");
for (String s : new String[] {"Abc 20 Jan to 15 Dec", "hello world", "123 abc"})
System.out.println(s + " matches: " + p.matcher(s).matches());
// 2. String for a decimal number. e.g. 450,122,224.00
p = Pattern.compile(
"\\p{Digit}+(\\.\\p{Digit}+)?|" + // w/o thousand seps.
"\\p{Digit}{1,3}(,\\p{Digit}{3})*\\.\\p{Digit}+"); // w/ thousand seps.
for (String s : new String[] { "450", "122", "224.00", "450,122,224.00", "0.0.3" })
System.out.println(s + " matches: " + p.matcher(s).matches());
// 3. Also to check if String contain any pattern like 'Page 2 of 20'
// "Page" followed by one or more digits, followed by "of"
// followed by one or more digits.
p = Pattern.compile("Page \\p{Digit}+ of \\p{Digit}+");
for (String s : new String[] {"Page 2 of 20", "Page 2 of X"})
System.out.println(s + " matches: " + p.matcher(s).matches());
Output:
Abc 20 Jan to 15 Dec matches: true
hello world matches: true
123 abc matches: false
450 matches: true
122 matches: true
224.00 matches: true
450,122,224.00 matches: true
0.0.3 matches: false
Page 2 of 20 matches: true
Page 2 of X matches: false
1.) /[A-Z][a-z]*(\s([\d]+)|\s([A-Za-z]+))+/
[A-Z][a-z]* being an uppercase word
\s([\d]+) being a number prefixed by a (white)space
\s([A-Za-z]+) being a word prefixed by a (white)space
2.) /(\d{1,3})(,(\d{3}))*(\.(\d{2}))/
(\d{1,3}) being a 1-to-3 digit number
(,(\d{3}))* being 0-or-more three-digit numbers prefixed by a comma
(\.(\d{2})) being a 2-digit decimal part prefixed by a literal dot
3.) /Page (\d+) of (\d+)/
(\d+) being one-or-more digits
When writing this (or any regex) I like to use this tool
1
I am not sure what you mean here. A word at the start, followed by any number of words and numbers?
Try this one:
^[a-zA-Z]+(\s+([a-zA-Z]+|\d+))+
2
Just a decimal number would be
\d+(\.\d+)?
Getting the commas in there:
\d{1,3}(,\d{3})*(\.\d+)?
3
Use
Page \d+ of \d+
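As a rough sketch, these three patterns could be used from Java as shown below. The class name ThreeChecks and the method names are invented for the example. The first two checks use matches() (whole string), while the third uses find(), on the assumption that "contains" in the question means the pattern may appear anywhere in the string.

```java
import java.util.regex.Pattern;

class ThreeChecks {
    private static final Pattern WORDS = Pattern.compile("^[a-zA-Z]+(\\s+([a-zA-Z]+|\\d+))+");
    private static final Pattern DECIMAL = Pattern.compile("\\d{1,3}(,\\d{3})*(\\.\\d+)?");
    private static final Pattern PAGE = Pattern.compile("Page \\d+ of \\d+");

    // 1. Whole string is a word followed by further words or numbers.
    static boolean isWordSequence(String s) {
        return WORDS.matcher(s).matches();
    }

    // 2. Whole string is a decimal number with optional thousands separators.
    static boolean isDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    // 3. String contains "Page N of M" somewhere; find() does not require a full match.
    static boolean containsPageMarker(String s) {
        return PAGE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isWordSequence("Abc 20 Jan to 15 Dec"));          // true
        System.out.println(isWordSequence("123 abc"));                       // false
        System.out.println(isDecimal("450,122,224.00"));                     // true
        System.out.println(isDecimal("0.0.3"));                              // false
        System.out.println(containsPageMarker("Footer: Page 2 of 20 (draft)")); // true
        System.out.println(containsPageMarker("Page 2 of X"));               // false
    }
}
```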
