Java regular expression boundary match? - java

I found the following question in one Java test suite
Pattern p = Pattern.compile("[wow]*");
Matcher m = p.matcher("wow its cool");
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
where the output seems to be as follows
0 "wow" 3 "" 4 "" 5 "" 6 "" 7 "" 8 "" 9 "oo" 11 "" 12 ""
Up to the last match the output is clear: the pattern [wow]* greedily matches zero or more 'w' and 'o' characters, and at every character that does not match, including spaces, it produces an empty string. However, after the empty match at 11 "" (just before the final 'l'), the following 12 "" is unclear to me. The test solution does not explain it, and I could not work it out from the Javadoc. My best guess is that it has to do with a boundary character, but I would appreciate an explanation.

The reason that you see this behavior is that your pattern allows empty matches. In other words, if you pass it an empty string, you would see a single match at position zero:
Pattern p = Pattern.compile("[wow]*"); // One of the two 'w's is redundant, but the engine is OK with it
Matcher m = p.matcher(""); // Passing an empty string results in a valid match that is empty
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
this would print 0 "" because an empty string is as good a match as any other match for the expression.
Going back to your example: after a non-empty match the engine resumes searching at the end of that match, and after an empty match it moves forward by one character so that it does not find the same empty match again. At each position it considers the "tail" of the string that starts there. After the empty match at position 11 (just before the final 'l'), the engine moves to position 12, where the "tail" is an empty string. This is similar to calling "wow its cool".substring(12): you would get an empty string in that case as well.
The engine considers an empty string a valid input and tries to match it against your expression, as shown in the example above. This produces a match, which your program properly reports.
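To see the effect of allowing empty matches, here is a small sketch that runs both [wo]* and [wo]+ over the same input. The class and method names (EmptyMatchDemo, findAll) are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Returns every match as "start:text", in the order find() reports them.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // '*' allows empty matches, including one at index 12 (the end of the input).
        System.out.println(findAll("[wo]*", "wow its cool"));
        // '+' requires at least one character, so only "wow" and "oo" are reported.
        System.out.println(findAll("[wo]+", "wow its cool"));
        // An empty input still yields one empty match at index 0.
        System.out.println(findAll("[wo]*", ""));
    }
}
```

If the empty matches are not wanted, switching the quantifier from * to + is usually the simplest change.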

[wow]* matches the first wow string. Count = 1.
Because of the * (zero or more) after the character class, [wow]* also matches the empty string that exists before any character the class does not match. So it matches the empty string just before the first space. Count = 2.
its is not matched by the regex, so it matches the empty string before each of its three characters. Count = 2+3 = 5.
The second space is not matched either, so we get another empty match. Count = 5+1 = 6.
c is not matched, so the regex matches the empty string just before the c. Count = 6+1 = 7.
oo is matched by [wow]*, and this is one match. Count = 7+1 = 8.
l is not matched, so there is an empty match before it. Count = 9.
Finally, it matches the empty string after the last character. Count = 9+1 = 10.
And m.start() prints the starting index of each match.

The regex is simply matching the pattern against the input, starting at a given offset.
For the last match, the offset of 12 is at the point after the last character of 'cool' - you might think this is the end of the string and therefore cannot be used for matching purposes - but you'd be wrong. For pattern-matching, this is a perfectly valid starting point.
As you state, your regular expression allows a match of zero characters, and that is what happens after the last character but before the end-of-string marker (usually represented by $ in a regular expression).
To put it another way, without testing past the end of the last character, it would mean no matches would ever occur relating to the end of the string - but there are many regex constructs that match the end of the string (and you've shown one of them here).
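A quick way to check that the offset just past the last character is a valid match position is to search for an end-of-input anchor and print where it was found. This is only a sketch; EndOfInputDemo and firstMatchStart are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndOfInputDemo {
    // Returns the start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "wow its cool";
        System.out.println(input.length());              // 12
        // Both anchors match the empty string at offset 12, after the final 'l'.
        System.out.println(firstMatchStart("$", input));
        System.out.println(firstMatchStart("\\z", input));
    }
}
```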

Related

Can't get rid of the empty first element when I use a regex split on polynomials

here is my code
String a = "X^5+2X^2+3X^3+4X^4";
String exp[]=a.split("(|\\+\\d)[xX]\\^");
for(int i=0;i<exp.length;i++) {
System.out.println("exp: "+exp[i]+" ");
}
I'm trying to get the output 5,2,3,4
but instead I got this:
exp:
exp:5
exp:2
exp:3
exp:4
I don't know where the empty first line comes from, and I cannot find a way to get rid of it. I tried other regexes and also Pattern.compile, but still can't get rid of the first line. When I try the new string "X+X^5+2X^2+3X^3+4X^4", the first line shows exp:X.
I also tried my regex in an online regex tester, and it gave 5,2,3,4, but Eclipse gives an empty line and then 5,2,3,4. I need help figuring this out.
Try to use regex, e.g:
String input = "X^5+2X^2+3X^3+4X^4";
Pattern pattern = Pattern.compile("\\^([0-9]+)");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("exp: " + matcher.group(1));
}
It gives output:
exp: 5
exp: 2
exp: 3
exp: 4
How it works:
Pattern used: \^([0-9]+)
It matches a ^ character followed by one or more digits (note the + sign). The caret (^) is prefixed with a backslash (\) because it has a special meaning in regular expressions (the beginning of a line or string), but in your case you just want to match a literal ^ character.
We wrap the part we want in a group so that we can refer to it later. That means marking it with parentheses, ( and ).
Then we put the pattern into a Java String. In a String literal, the \ character has a special meaning: it starts an escape sequence, e.g. "\n" represents a new line. So when we put the pattern into a String literal, we need to escape the \, and the pattern becomes "\\^([0-9]+)". Note the double \.
Next we iterate through all matches, getting group 1, which is the number. Note that the ^ character is not part of group 1 even though it is part of the pattern, because the parentheses mark only the digits.
The empty first element appears because you are using the split method, which looks for occurrences of the regex and splits the string at those positions. Your string starts with X^, which matches your regex, so the piece before that first match is an empty string.
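If you would rather keep split, one option is to drop the empty pieces afterwards. The sketch below reuses the regex from the question unchanged; the class and method names (SplitLeadingEmptyDemo, exponents) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitLeadingEmptyDemo {
    // Splits on the question's regex and discards empty pieces.
    static List<String> exponents(String polynomial) {
        return Arrays.stream(polynomial.split("(|\\+\\d)[xX]\\^"))
                .filter(piece -> !piece.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String a = "X^5+2X^2+3X^3+4X^4";
        // The raw split keeps an empty first element because the input starts with a match ("X^").
        System.out.println(Arrays.toString(a.split("(|\\+\\d)[xX]\\^"))); // [, 5, 2, 3, 4]
        System.out.println(exponents(a));                                  // [5, 2, 3, 4]
    }
}
```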

Why does a zero-length match always remain at the end of the source string for the Java regex pattern a?

Pattern pattern = Pattern.compile("a?");
Matcher matcher = pattern.matcher("a");
while(matcher.find()){
System.out.println(matcher.start()+"["+matcher.group()+"]"+matcher.end());
}
Output :
0[a]1
1[]1
Why does this give me two matches when the input is a single character?
I noticed that this pattern always gives a zero-length match at the end of the source string.
E.g. when the source is "abab" it gives
0[a]1
1[]1
2[a]3
3[]3
4[]4
The regex special character ? (question mark) means "match the preceding thing zero or one time".
Since you are matching in a while loop (while (matcher.find()) {...) it finds both matches of the expression - one occurrence of "a" (at position 0, the string "a") and zero occurrences of "a" (at position 1, the empty string at the very end).
So here's what your code snippet is matching (start/end indices are denoted by X/Y):
String: " a b a b "
├─┼─┼─┼─┤
Index: 0 1 2 3 4
Match: ╰┬╯ ╰┬╯ ╰- the empty string 4/4 (zero occurrences of "a").
|| |╰- the empty string 3/3 (zero occurrences of "a").
|| ╰ the string "a" 2/3 (one occurrence of "a").
|╰ the empty string 1/1 (zero occurrences of "a").
╰ the string "a" 0/1 (one occurrence of "a").
It doesn't produce empty matches at positions 0/0 or 2/2 because the quantifier is greedy: it consumes the next character (giving 0/1 and 2/3) whenever doing so still yields a match, so the empty alternatives there are skipped. To illustrate, if you were to match the string "bbbb" against the pattern a?, you would get five empty strings: one at the beginning, one at the end, and one between each pair of characters.
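The "bbbb" case can be checked with a few lines of Java. This is a minimal sketch; OptionalMatchDemo and spans are names invented for the example, and each match is reported as start/end in the same notation as the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalMatchDemo {
    // Returns every match of regex in input as "start/end".
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "/" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("a?", "abab")); // [0/1, 1/1, 2/3, 3/3, 4/4]
        System.out.println(spans("a?", "bbbb")); // [0/0, 1/1, 2/2, 3/3, 4/4]
    }
}
```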
Have a look at
http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It explains your case in detail under the section Zero-Length Matches
a? stands for zero or one occurrence of the character a.
The empty string matches the zero-occurrence case.
The matching is also greedy in your case, so it matches the one-occurrence case first, then the zero-occurrence case at the end.
In the abab case, think of it as a[]ba[]b[], where [] denotes an empty match. The matcher does not report an empty match at the beginning or after the first b, because there it can greedily match an a.
Matching the empty space after the last character is not universal.
The Vim editor has this behavior:
Buffer before:
aaaa
~
~
:s/x\?/y/g <- command
Buffer after:
yayayaya
~
~
No x occurs in aaaa, but x? (written x\? in Vim by default) allows an empty match. The pattern matches the empty space at the start of the string and between all the characters, but not past the end.
The exception is if the line is empty. The command will replace a blank line with a single y.
I implemented the Vim-like behavior in my own program:
$ txr -c '#(bind result #(regsub #/x?/ "y" "aaaa"))'
result="yayayaya"
$ txr -c '#(bind result #(regsub #/x?/ "y" ""))'
result="y"
I did this only because Vim is popular and I can point to it as the reference model if any questions come up. But it's a bit of a hack. The logic has a do .. while loop, which allows an incoming empty string to be processed:
do {
/* regex match, extraction, substitution ... */
position++;
} while (position < length(input));
So if the starting position is zero and the input has length zero, we do the loop once, applying the regex to the empty string. But if we process the last character, position reaches the length and the loop terminates without processing the empty string.
Originally, I had the loop test at the top, so it behaved like Vim except for empty input: an empty input was never processed, so regexes that match the empty string did not match it.
The behavior of the Java class you're using might be implemented like this:
while (position <= length(input)) {
/* process regex */
position++;
}
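The two loop shapes above can be compared with a small Java sketch. It only illustrates the idea and is not how java.util.regex is actually implemented; ScanLoopDemo, matchStarts and the tryEndPosition flag are hypothetical names. The sketch also skips past consumed characters after a non-empty match, which the pseudocode above leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScanLoopDemo {
    // Tries the regex at successive positions and records where a match starts.
    // tryEndPosition == false: Vim-like, stop before position == length
    //                          (but still process an empty input once).
    // tryEndPosition == true : also try the position just past the last character.
    static List<Integer> matchStarts(String regex, String input, boolean tryEndPosition) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int position = 0;
        do {
            m.region(position, input.length());
            if (m.lookingAt()) {
                starts.add(m.start());
                // Skip what the match consumed, but always advance at least one step.
                position = Math.max(m.end(), position + 1);
            } else {
                position++;
            }
        } while (tryEndPosition ? position <= input.length() : position < input.length());
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("x?", "aaaa", false)); // [0, 1, 2, 3]
        System.out.println(matchStarts("x?", "aaaa", true));  // [0, 1, 2, 3, 4]
        System.out.println(matchStarts("x?", "", false));     // [0]
    }
}
```

With tryEndPosition set to true, the starts for a? on "abab" come out as 0, 1, 2, 3, 4, which agrees with the output shown in the question.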

Strange behavior in regexes

There was a question about regex, and while trying to answer it I found some other strange things.
String x = "X";
System.out.println(x.replaceAll("X*", "Y"));
This prints YY. Why?
String x = "X";
System.out.println(x.replaceAll("X*?", "Y"));
And this prints YXY.
Why doesn't the reluctant regex match the 'X' character? The input is "nothing"X"nothing", so why does the first regex match only two of those three items, and why does the second regex match only the "nothing"s and not the X?
Let's consider them in turn:
"X".replaceAll("X*", "Y")
There are two matches:
At character position 0, X is matched, and is replaced with Y.
At character position 1, the empty string is matched, and Y gets added to the output.
End result: YY.
"X".replaceAll("X*?", "Y")
There are also two matches:
At character position 0, the empty string is matched, and Y gets added to the output. The character at this position, X, was not consumed by the match, and is therefore copied into the output verbatim.
At character position 1, the empty string is matched, and Y gets added to the output.
End result: YXY.
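The match positions described above can be printed directly. In this sketch each match is shown as a half-open span [start,end); QuantifierDemo and spans are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Lists every match of regex in input as a half-open span "[start,end)".
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add("[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("X*", "X"));   // [[0,1), [1,1)]  greedy: X, then empty at the end
        System.out.println(spans("X*?", "X"));  // [[0,0), [1,1)]  reluctant: two empty matches
        System.out.println("X".replaceAll("X*", "Y"));   // YY
        System.out.println("X".replaceAll("X*?", "Y"));  // YXY
    }
}
```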
The * is a tricky 'quantifier' since it means '0 or more'. Thus, it also matches '0 times X' (i.e. an empty string).
I would use
"X".replaceAll("X+", "Y")
which has the expected behaviour.
In your first example you are using a "greedy" quantifier. A greedy quantifier consumes as much of the input as it can before the first match attempt, so the first match tried is the whole input. Since that matches, the matcher then moves past the X and finds the zero-length match at the end of the string, hence the two matches you see. The greedy matcher never backs off to the zero-length match before the X, because the first match attempt was already successful.
In the second example you are using a "reluctant" quantifier, which does the opposite. It starts at the beginning and consumes one character at a time only if it has to. So the zero-length match before the X is found first, the matcher moves forward by one character (that's why you still see the X in the output), and the next match is the zero-length match after the X.
There is a good tutorial here: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html

Repeat pattern in RegEx

I've got string parts which match the following pattern.
abcd|(|a|ab|abc)e(fghi|(|f|fg|fgh)jklmn)
The problem is that my whole string is a repeated combination of parts like the above, and it must contain more than 14 such parts.
Can anyone help me extend my regex to cover that?
Thanks
Update
Input examples:
Matched string parts : abcd, abefgjkln, efjkln, ejkln
But whole string is : abcdabefgjklnefjklnejkln (Combination of above 4 parts)
There must be at least 15 parts in the whole string. The one above has only 4 parts, so it should not match.
This will try to match your "parts" at least 15 times in a string.
boolean foundMatch = false;
try {
foundMatch = subjectString.matches("(?:(?:ab(?:cd|efgjkln))|(?:(?:ef?jkln))){15,}");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
If there are at least 15 repetitions of any of the above parts foundMatch will be true, else it will remain false.
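As a rough check of that behaviour, the sketch below applies the same regex to the 4-part example string from the question and to longer strings built by repeating it. RepeatedPartsDemo, hasAtLeast15Parts and repeat are hypothetical names used only here.

```java
import java.util.Collections;
import java.util.regex.Pattern;

class RepeatedPartsDemo {
    // Same regex as in the answer: one "part" repeated at least 15 times.
    static final Pattern AT_LEAST_15_PARTS =
            Pattern.compile("(?:(?:ab(?:cd|efgjkln))|(?:(?:ef?jkln))){15,}");

    // matches() requires the whole string to consist of the repeated parts.
    static boolean hasAtLeast15Parts(String subject) {
        return AT_LEAST_15_PARTS.matcher(subject).matches();
    }

    // Concatenates 'times' copies of s.
    static String repeat(String s, int times) {
        return String.join("", Collections.nCopies(times, s));
    }

    public static void main(String[] args) {
        String fourParts = "abcdabefgjklnefjklnejkln"; // abcd, abefgjkln, efjkln, ejkln
        System.out.println(hasAtLeast15Parts(fourParts));            // false: only 4 parts
        System.out.println(hasAtLeast15Parts(repeat(fourParts, 3))); // false: 12 parts
        System.out.println(hasAtLeast15Parts(repeat(fourParts, 4))); // true: 16 parts
    }
}
```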
Breakdown:
"(?:" + // Match the regular expression below
// Match either the regular expression below (attempting the next alternative only if this one fails)
"(?:" + // Match the regular expression below
"ab" + // Match the characters “ab” literally
"(?:" + // Match the regular expression below
// Match either the regular expression below (attempting the next alternative only if this one fails)
"cd" + // Match the characters “cd” literally
"|" + // Or match regular expression number 2 below (the entire group fails if this one fails to match)
"efgjkln" + // Match the characters “efgjkln” literally
")" +
")" +
"|" + // Or match regular expression number 2 below (the entire group fails if this one fails to match)
"(?:" + // Match the regular expression below
"(?:" + // Match the regular expression below
"e" + // Match the character “e” literally
"f" + // Match the character “f” literally
"?" + // Between zero and one times, as many times as possible, giving back as needed (greedy)
"jkln" + // Match the characters “jkln” literally
")" +
")" +
"){15,}" // Between 15 and unlimited times, as many times as possible, giving back as needed (greedy)
What about this:
(?:a(?:b(?:c(?:d)?)?)?ef(?:g(?:h(?:i)?)?)?jklmn){15,}
Explanation: you create a non-capturing group (with (?: ... )) and say that it should be repeated at least 15 times, hence the curly braces at the end.
First, it seems that your pattern can be simplified. Pattern a is a prefix of ab, which is a prefix of abc, so wherever abc matches, a matches too. Think about this and change your pattern accordingly. Right now it is probably not what you really want.
Second, to repeat something in a pattern use {N}, e.g. (abc){5} means "abc repeated five times" (without the group, abc{5} repeats only the c). You can also use {3,} and {3,5}, which mean repeat>=3 and 3<=repeat<=5; for repeat<=5 write {0,5}, since Java's Pattern documents only the forms {n}, {n,} and {n,m}.
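One detail worth checking when using a counted quantifier is what it applies to: it binds to the single preceding element, so a group is needed to repeat a whole sequence. A short sketch, with CountedRepeatDemo as a made-up class name:

```java
class CountedRepeatDemo {
    // Thin wrapper so the checks below read uniformly; String.matches tests the whole input.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Without a group, {5} applies to 'c' only: "ab" followed by five 'c's.
        System.out.println(fullMatch("abc{5}", "abccccc"));           // true
        System.out.println(fullMatch("abc{5}", "abcabcabcabcabc"));   // false
        // With a group, the whole sequence "abc" is repeated five times.
        System.out.println(fullMatch("(abc){5}", "abcabcabcabcabc")); // true
        // {3,} means three or more repetitions.
        System.out.println(fullMatch("(abc){3,}", "abcabc"));         // false
    }
}
```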

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: No group 1
matcher.group(1) throws an exception.
I also don't know how to extract 1234. This number can change, but it always comes after 'XX~ '.
Do you have any idea how to match these strings with regular expressions?
UPDATE
Thanks to Adam's suggestion I now have this regex that matches my string
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I get the number and the two times with matcher.group(1), matcher.group(2) and matcher.group(3).
The matcher.group() function takes a single integer argument: the capturing group index, starting from 1. Index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by their opening parenthesis (which means that groups can nest). Since there are no parentheses in your regular expression, there is no group 1.
The Javadoc for the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can call Matcher.find() repeatedly until it returns false. Calling Matcher.group(0) once on each iteration will then return what matched that time.
If you want to build one big regular expression that matches everything at once (which I believe is what you want), put a set of capturing parentheses around each of the three things that you want to capture, call Matcher.matches(), and then Matcher.group(n) where n is 1, 2 and 3 respectively. Of course Matcher.matches() might return false, in which case the pattern did not match and you can't retrieve any of the groups.
In your example, what you probably want is to match some preceding text, then start a capturing group, match the digits, end the capturing group, and so on. I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2)". The time pattern assumes that the hour is always 0-padded to 2 digits. I would give an example closer to what you are looking for, but the description of the possible input is a little vague.
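Applied to the string from the question, the find() approach could look like the sketch below. It assumes the times are always written as two digits, a colon and two digits, and that the number follows the literal marker "XX~ "; ExtractDemo, times and numberAfterMarker are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractDemo {
    private static final Pattern TIME = Pattern.compile("\\d{2}:\\d{2}");
    private static final Pattern NUMBER_AFTER_MARKER = Pattern.compile("XX~ (\\d+)");

    // Collects every hh:mm occurrence by calling find() until it returns false.
    static List<String> times(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TIME.matcher(input);
        while (m.find()) {
            result.add(m.group()); // group() with no argument is the entire match
        }
        return result;
    }

    // Returns the digits captured in group 1 after "XX~ ", or null if the marker is absent.
    static String numberAfterMarker(String input) {
        Matcher m = NUMBER_AFTER_MARKER.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(numberAfterMarker(myString)); // 1234
        System.out.println(times(myString));             // [06:30, 07:45]
    }
}
```

Note that group() is only called after find() has returned true; calling it before any match attempt throws an IllegalStateException.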
