x? quantifier: Why does a non-x give a "zero-length" match? - java

The quantifier x? means one or zero occurrences of x.
I am posting a test harness for matching the regex with strings for convenience.
I am confused about the regex a? when compared to the string ababaaaab.
The output of the program is:
Enter your regex: a?
Enter your input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex:
I am confused about the b's.
"The regular expression a? is not specifically looking for the letter
"b"; it's merely looking for the presence (or lack thereof) of the
letter "a". If the quantifier allows for a match of "a" zero times,
anything in the input string that's not an "a" will show up as a
zero-length match."
Reference
QUESTION:-
The first line is understandable, and I do understand that the presence of b (or any non-a) is an absence of a, that is, zero occurrences of a, so it should result in a match. But the absence of a (that is, the occurrence of b) lies between indices 1 and 2. So why is the match of the text "" reported between index 1 and index 1 (in other words, why do we get a zero-length match here)? By my reasoning, it should be between indices 1 and 2.
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Enter your regex: foo
 * Enter your input string to search: foo
 * I found the text "foo" starting at index 0 and ending at index 3.
 */
public class RegexTestHarness {
    public static void main(String[] args) {
        // The tutorial version reads from System.console(), which returns null
        // when no console is attached (e.g. in many IDEs), so a Scanner is used.
        // One Scanner is shared by all iterations: creating a new one per
        // iteration can drop input that the previous one had already buffered.
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.print("\nEnter your regex: ");
            if (!scanner.hasNext()) {
                break; // end of input
            }
            Pattern pattern = Pattern.compile(scanner.next());
            System.out.print("\nEnter your input string to search: ");
            if (!scanner.hasNext()) {
                break; // end of input
            }
            Matcher matcher = pattern.matcher(scanner.next());
            boolean found = false;
            while (matcher.find()) {
                System.out.println("I found the text \"" + matcher.group()
                        + "\" starting at index " + matcher.start()
                        + " and ending at index " + matcher.end() + ".");
                found = true;
            }
            if (!found) {
                System.out.println("No match found.");
            }
        }
    }
}

But the absence of a (that is, the occurrence of b) lies between indices 1 and 2. So why is the match of the text "" reported between index 1 and index 1 (in other words, why do we get a zero-length match here)?
The length of the match is the length of the part of the input string that matched the pattern.
Since there was no "a", only an empty string was matched.
Again, the pattern does not match "a sequence of non-a characters"; it matches a (possibly empty) sequence of "a"s up to a total length of one. In this case, the matched sequence was empty.
But the absence of a (that is, the occurrence of b)
The absence of a is not the occurrence of b. The absence of a is found at the position just before the b and consumes no characters, so the match ends at that same position, where the b begins.

The position reported is not the position of a character
The key thing to understand is that the regex engine does not report the position of a character where it found a match.
It reports the position at which the successful match starts. That position is not a character; it is the gap between characters. For instance:
Position 0 is the very beginning of the string. That is where the \A or ^ assertions match.
Position 1 is the position between the first and the second characters.
Position 9 is the position after the last b at the end of ababaaaab. That is where the \Z or $ assertions match.
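To make the "positions are gaps between characters" idea concrete, here is a minimal sketch that lists every match of a? in ababaaaab as start-end:text. The class name MatchPositions and the helper spans are made up for this illustration; only the java.util.regex calls come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original test harness.
class MatchPositions {
    // Returns one "start-end:text" entry per match of regex in input.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // start == end means the match consumed no characters:
        // it sits in the gap in front of the character at that index.
        for (String span : spans("a?", "ababaaaab")) {
            System.out.println(span);
        }
    }
}
```

The entry 1-1: is the empty match in the gap before the first b, and 9-9: is the empty match at the very end of the string, after the last character.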

a? is greedy. In other words, the regex engine proceeds roughly as follows:
foreach position
    if the next char is "a"
        return "a"
    else
        return ""
    end if
end foreach
If you apply this algorithm to your input string, you get the same output as the one you posted.
You could try its non-greedy (or lazy) equivalent, a??. The regex engine would then proceed as follows:
foreach position
    if the empty string matches here (it always does)
        return ""
    else if the next char is "a"
        return "a"
    end if
end foreach
An empty string would thus be found at every index, and no a would be output at all.
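The difference between the greedy and the lazy form can be checked with a small sketch like the one below. The class GreedyVsLazy and its two counting helpers are hypothetical names introduced only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing a? (greedy) with a?? (lazy).
class GreedyVsLazy {
    // Counts every match, including zero-length ones.
    static int countAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts only the matches that consumed at least one character.
    static int countNonEmpty(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            if (m.end() > m.start()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        System.out.println("a?  all=" + countAll("a?", input)
                + " nonEmpty=" + countNonEmpty("a?", input));
        System.out.println("a?? all=" + countAll("a??", input)
                + " nonEmpty=" + countNonEmpty("a??", input));
    }
}
```

Both patterns report a match at each of the ten positions 0 through 9; the greedy form consumes an a wherever one is available, while the lazy form is satisfied with the empty string every time.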

Related

Why does regex="" (an empty pattern) match at every character position?

I have regex="" and a String str="stackoveflow";
I don't understand why it matches at every character position in the string. Can you explain it to me?
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        String str = "stackoveflow";
        Pattern pattern = Pattern.compile("");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            // System.out is used instead of System.console(), which returns
            // null when no console is attached (e.g. in many IDEs).
            System.out.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
        }
    }
}
Output is:
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
I found the text "" starting at index 3 and ending at index 3.
I found the text "" starting at index 4 and ending at index 4.
I found the text "" starting at index 5 and ending at index 5.
I found the text "" starting at index 6 and ending at index 6.
I found the text "" starting at index 7 and ending at index 7.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
I found the text "" starting at index 10 and ending at index 10.
I found the text "" starting at index 11 and ending at index 11.
I found the text "" starting at index 12 and ending at index 12.
Pattern.compile("") matches a string consisting of zero characters. You can find one of those at every position in the string.
Note: if you changed find() to matches(), you should find that there is no match. (With matches() the pattern needs to match the entire input, and the entire input is not a sequence of zero characters.)
Before you edited the question, your pattern was Pattern.compile("e*"). That means zero or more repetitions of the character 'e'. By the logic above, you can "find" one of those at every character position in the input.
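The find()/matches() contrast described above can be tried with a short sketch. FindVsMatches and countFinds are names invented for this example; the input string is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting find() with matches().
class FindVsMatches {
    // Number of times find() succeeds, zero-length matches included.
    static int countFinds(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True only if the whole input matches the pattern.
    static boolean wholeInputMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String str = "stackoveflow"; // 12 characters, so 13 positions (0..12)
        System.out.println(countFinds("", str));          // one empty match per position
        System.out.println(wholeInputMatches("", str));   // the whole input is not empty
        System.out.println(wholeInputMatches("", ""));    // an empty input is
        System.out.println(countFinds("e*", str));        // empty matches plus the single "e"
    }
}
```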

How to match several capturing groups, but results not as expected

I'm trying to learn Java regular expressions. I want to match a pattern with several capturing groups (i.e. j(a(va))) against a string (i.e. this is java. this is ava, this is va). I was expecting the output to be:
I found the text "java" starting at index 8 and ending at index 12.
I found the text "ava" starting at index 21 and ending at index 24.
I found the text "va" starting at index 34 and ending at index 36.
Number of group: 2
However, the IDE only outputs:
I found the text "java" starting at index 8 and ending at index 12.
Number of group: 2
Why is this the case? Is there something I am missing?
Original code:
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.println("\nEnter your regex:");
Pattern pattern = Pattern.compile(br.readLine());
System.out.println("\nEnter input string to search:");
Matcher matcher = pattern.matcher(br.readLine());
boolean found = false;
while (matcher.find()) {
    System.out.format("I found the text"
            + " \"%s\" starting at "
            + "index %d and ending at index %d.%n",
            matcher.group(),
            matcher.start(),
            matcher.end());
    found = true;
    System.out.println("Number of group: " + matcher.groupCount());
}
if (!found) {
    System.out.println("No match found.");
}
After running the code above, I have entered the following input:
Enter your regex:
j(a(va))
Enter input string to search:
this is java. this is ava, this is va
And the IDE outputs:
I found the text "java" starting at index 8 and ending at index 12.
Number of group: 2
Your regexp only matches the whole string java, it doesn't match ava or va. When it matches java, it will set capture group 1 to ava and capture group 2 to va, but it doesn't match those strings on their own. The regexp that would produce the result you want is:
j?(a?(va))
The ? makes the preceding item optional, so it will match the later items without these prefixes.
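As a rough illustration of the point that capture groups are parts of a single match rather than separate matches, the sketch below prints the groups of j(a(va)) and then lists what j?(a?(va)) finds. CaptureGroupsDemo and its helper matches are hypothetical names; the indices printed are whatever the matcher reports for this exact input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the original post.
class CaptureGroupsDemo {
    // Collects "start:text" for every whole-pattern match found in input.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "this is java. this is ava, this is va";

        // One match only; groups 1 and 2 are sub-spans of that same match.
        Matcher m = Pattern.compile("j(a(va))").matcher(input);
        while (m.find()) {
            System.out.println("group(0) = " + m.group());
            System.out.println("group(1) = " + m.group(1));
            System.out.println("group(2) = " + m.group(2));
        }

        // With optional prefixes, the shorter words become matches of their own.
        System.out.println(matches("j?(a?(va))", input));
    }
}
```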
You need the regex (j?(a?(va)))
Pattern p = Pattern.compile("(j?(a?(va)))");
Matcher m = p.matcher("this is java. this is ava, this is va");
while (m.find()) {
    String group = m.group();
    int start = m.start();
    int end = m.end();
    System.out.format("I found the text"
            + " \"%s\" starting at "
            + "index %d and ending at index %d.%n",
            group,
            start,
            end);
}

java regular expression examples for match without length limitation

I am trying to write a regular expression that matches a string starting with the letter "G", whose second character is a digit (0-9), and whose remainder can contain anything and be of any length.
I'm stuck with the following code:
String[] array = { "DA4545", "G121", "G8756942", "N45", "4578", "#45565" };
String regExp = "^[G]\\d[0-9]";
for (int i = 0; i < array.length; i++) {
    if (Pattern.matches(regExp, array[i])) {
        System.out.println(array[i] + " - Successful");
    }
}
output:
G12 - Successful
Why does it not match the third element, "G8756942"?
G - the letter G
[0-9] - a digit
.* - any sequence of characters
So the expression
G[0-9].*
will match a letter G followed by a digit followed by any sequence of characters.
When you write \d it already means [0-9],
so \d[0-9] means exactly two digits.
Better use:
^G\\d*
which matches strings starting with G followed by zero or more digits.
"^[G]\\d[0-9]"
This regex matches "G" followed by a digit (\\d), then another digit.
Use one of these:
"^G\\d"
"^G[0-9]"
Also note that you don't need a character class since it only contains one letter, so it's redundant.
Try this regex; the .* matches any characters after the digit:
^G\\d.*
http://regex101.com/r/uE4tX1/1
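Why does it not match the third element, "G8756942"?
Because your pattern requires G followed by exactly two digits and nothing else: in a Java string literal, \\d is the regex \d (one digit), and [0-9] is a second digit. Solution:
^[G]\\d.*
This regex would be fine; the trailing .* is needed because Pattern.matches must match the whole input.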
"G\\d.*"
Because the matches method tries to match the whole input, you need to add .* at the end of your pattern; you also don't need the anchors.
String[] array = { "DA4545", "G121", "G8756942", "N45", "4578", "#45565" };
String regExp = "G\\d.*";
for (int i = 0; i < array.length; i++) {
    if (Pattern.matches(regExp, array[i])) {
        System.out.println(array[i] + " - Successful");
    }
}
Output:
G121 - Successful
G8756942 - Successful
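The whole-input behaviour of Pattern.matches, as opposed to Matcher.find, is easy to check in isolation. The following is only a sketch; WholeInputMatch and its helper accepted are names made up here, and the array is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class for Pattern.matches (whole input) versus find() (any substring).
class WholeInputMatch {
    // Returns the inputs that Pattern.matches accepts for the given regex.
    static List<String> accepted(String regex, String... inputs) {
        List<String> out = new ArrayList<>();
        for (String input : inputs) {
            if (Pattern.matches(regex, input)) {
                out.add(input);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] array = { "DA4545", "G121", "G8756942", "N45", "4578", "#45565" };

        // G, one digit, then anything: covers the whole input.
        System.out.println(accepted("G\\d.*", array));

        // Without the trailing .*, matches() rejects the longer string ...
        System.out.println(Pattern.matches("G\\d", "G8756942"));
        // ... while find() is satisfied by the leading "G8".
        System.out.println(Pattern.compile("^G\\d").matcher("G8756942").find());
    }
}
```

Which of the two to use is a design choice: matches() with a trailing .* validates the whole string in one call, while find() with a ^ anchor only checks the prefix.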

Matching a sequence of dot-separated digits of variable length with regular expressions

I am parsing text from an Excel spread sheet using Java.
I need to validate whether a sequence of 3 integers is present in the text.
The sequence of integers is:
comma separated inside
whitespace-delimited outside
Integers in the sequence can either have 1 or 2 digits.
This is my attempt:
*((\d|\d\d)[^\w](\d|\d\d)[^\w](\d|\d\d))*
With the * meaning that I can have characters before it, the [\d|\d\d] being a number of either one or two digits, and [^\w] being a non-word character. Is that right?
Valid text: CPI WEIGHTS 05.1.2 : CARPETS & OTHER FLOOR COVERINGS
Invalid text: CPIH INDEX 05.2 : HOUSEHOLD TEXTILES 2005=100
Your last comment actually clarifies the question a bit.
Assuming you are looking for a dot-separated sequence of 1 or 2 digits, externally delimited by whitespace, here's an example:
String ok = "CPI WEIGHTS 05.1.2 : CARPETS & OTHER FLOOR COVERINGS";
String notOk = "CPIH INDEX 05.2 : HOUSEHOLD TEXTILES 2005=100";
Pattern p = Pattern.compile("(\\d{1,2}(\\.|\\s)){3}");
Matcher m = p.matcher(ok);
while (m.find()) {
    System.out.printf("Found: %s%n", m.group());
}
m = p.matcher(notOk);
while (m.find()) {
    System.out.printf("Found: %s%n", m.group());
}
Output
Found: 05.1.2
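The pattern above accepts either a dot or whitespace after each number, so it would also accept three space-separated numbers such as "1 2 3 ". If only dot-separated triples delimited by whitespace should count, a stricter variant could look like the sketch below. This is an alternative under that assumption, not the answerer's pattern; the class DotSequenceCheck, the constant THREE_PART and the method firstSequence are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stricter check: exactly three 1-2 digit numbers joined by dots,
// with whitespace (or the string boundary) on both sides.
class DotSequenceCheck {
    private static final Pattern THREE_PART =
            Pattern.compile("(?<!\\S)\\d{1,2}(?:\\.\\d{1,2}){2}(?!\\S)");

    // Returns the first matching sequence, or null if the text has none.
    static String firstSequence(String text) {
        Matcher m = THREE_PART.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String ok = "CPI WEIGHTS 05.1.2 : CARPETS & OTHER FLOOR COVERINGS";
        String notOk = "CPIH INDEX 05.2 : HOUSEHOLD TEXTILES 2005=100";
        System.out.println(firstSequence(ok));
        System.out.println(firstSequence(notOk));
    }
}
```

The lookarounds (?<!\S) and (?!\S) assert that no non-whitespace character touches the sequence, so the match itself contains only the digits and dots.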

Understanding regular expression output

I need help understanding the output of the code below. I cannot figure out the output of System.out.println(m.start() + m.group());. Can someone please explain it to me?
import java.util.regex.*;

class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        boolean b = false;
        while (b = m.find()) {
            System.out.println(m.start() + m.group());
        }
    }
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have a large number of empty matches (one at each location in the string) with the exception of 34, which matches the string of digits. Use \\d+ if you want to match digits without also matching empty strings.
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
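The line 234 in the question's output is the int m.start() (2) concatenated with the string m.group() ("34"). A minimal sketch that reproduces this for both \d* and \d+ follows; the class DigitRuns and the helper startPlusGroup are names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing \d* with \d+ on the question's input.
class DigitRuns {
    // Builds the same strings the question prints: start index followed by the matched text.
    static List<String> startPlusGroup(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // int + String is string concatenation, so 2 + "34" gives "234".
            out.add(m.start() + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(startPlusGroup("\\d*", "ab34ef")); // empty matches included
        System.out.println(startPlusGroup("\\d+", "ab34ef")); // digits only
    }
}
```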
\d* - matches zero or more digits in the input.
The input is ab34ef, and its characters have the indices 012345.
At index 0 there are no digits, so the match is empty: start() prints 0 and group() prints nothing. The same happens at index 1. At index 2 digits are found, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while (matcher.find()) {
    System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which prints:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial
