pattern search using regex in java


public static void main(String args[]) {
Pattern p = Pattern.compile("ab"); // Case 1
// Pattern p = Pattern.compile("bab"); // Case 2 (use this line instead of the Case 1 line)
Matcher m = p.matcher("abababa");
while(m.find()){
System.out.print(m.start());
}
}
With Case 1 the output is 024, as expected. With Case 2 the output is 1, but I expected 13. Is there some special rule in regex that causes this output? If not, why am I getting it?
Help appreciated!
Note: Case 1 and Case 2 are used independently.

The match consumes the input, so the next search starts after the end of the previous match.
For "bab", the position of the matcher's pointer before each search is:
|abababa
abab|aba

For Case 2:
This is because once bab has matched, the characters it consumed (including the b at index 3) are not considered again, so you get only 1.
Input: abababa
Searching for bab finds a match that starts at index 1 and ends at index 3. The next search then starts at index 4 (aba), where bab no longer occurs.
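If the overlapping positions (1 and 3) are what you are after, one common workaround is to wrap the pattern in a lookahead, which matches without consuming input. The following is only a minimal sketch; the class and method names (OverlapDemo, plainStarts, overlappingStarts) are made up for illustration and do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Start indices of ordinary matches: each search resumes after the previous match.
    static List<Integer> plainStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    // Start indices of overlapping matches: the lookahead (?=...) is zero-width,
    // so nothing is consumed and the next search resumes one character later.
    static List<Integer> overlappingStarts(String regex, String input) {
        return plainStarts("(?=" + regex + ")", input);
    }

    public static void main(String[] args) {
        System.out.println(plainStarts("ab", "abababa"));        // [0, 2, 4]
        System.out.println(plainStarts("bab", "abababa"));       // [1]
        System.out.println(overlappingStarts("bab", "abababa")); // [1, 3]
    }
}
```

Note that the lookahead version only gives you the start positions; m.group() is empty for each of these matches.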

Related

Why does this regex fail to check accurately?

I have the following method, which checks a given string against regexes in 3 stages. For some reason the regexes fail to catch some invalid input, although as far as I can tell they are correct. Can someone please tell me what I am doing wrong here?
I have the following code:
public class App {
public static void main(String[] args) {
String identifier = "urn:abc:de:xyz:234567.1890123";
if (identifier.matches("^urn:abc:de:xyz:.*")) {
System.out.println("Match ONE");
if (identifier.matches("^urn:abc:de:xyz:[0-9]{6,12}.[0-9]{1,7}.*")) {
System.out.println("Match TWO");
if (identifier.matches("^urn:abc:de:xyz:[0-9]{6,12}.[a-zA-Z0-9.-_]{1,20}$")) {
System.out.println("Match Three");
}
}
}
}
}
Ideally, this code should generate the output
Match ONE
Match TWO
Match Three
only when the identifier = "urn:abc:de:xyz:234567.1890123.abd12", but it produces the same output even if the identifier does not match the intended format, as with the following inputs:
"urn:abc:de:xyz:234567.1890123"
"urn:abc:de:xyz:234567.1890ANC"
"urn:abc:de:xyz:234567.1890123"
"urn:abc:de:xyz:234567.1890ACB.123"
I don't understand why it allows alphanumeric characters after the first . and why it also does not care about the characters after the second one.
I would like my regex to check that the string has the following format:
1. The string starts with urn:abc:de:xyz:
2. Then come 6 to 12 digits [0-9] (234567).
3. Then comes the decimal point .
4. Then come 1 to 7 digits [0-9] (1890123).
5. Then comes another decimal point .
6. Finally come 1 to 20 alphanumeric or special characters (ABC123.-_12).
This is a valid string for my regex: urn:abc:de:xyz:234567.1890123.ABC123.-_12
This is an invalid string for my regex, as it misses the elements from point 6:
urn:abc:de:xyz:234567.1890123
This is also an invalid string for my regex, as it violates point 4 (it has ABC instead of decimal digits):
urn:abc:de:xyz:234567.1890ABC.ABC123.-_12

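This part of the regex:
[0-9]{6,12}.[0-9]{1,7} matches 6 to 12 digits, followed by any character, followed by 1 to 7 digits.
To match a literal dot, it needs to be escaped (in a Java string literal, write \\. so that the regex engine sees \.). The hyphen inside the character class is escaped as well, because .-_ is otherwise read as a range from . to _. Try this: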
^urn:abc:de:xyz:[0-9]{6,12}\.[0-9]{1,7}\.[a-zA-Z0-9\-_]{1,20}$
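Here is how that could look in Java, as a rough sketch rather than a drop-in fix. The class name UrnCheck and the method isValid are made up for illustration. One assumption to be aware of: the character class below also allows a dot in the last part, because the question's valid example (ABC123.-_12) contains one; drop the dot from the class if that is not wanted.

```java
import java.util.regex.Pattern;

class UrnCheck {
    // The dots between the parts are escaped so they match a literal '.'.
    // The hyphen is placed last in the character class so it is not read as a range.
    // Assumption: '.' is allowed in the final part, as in the question's valid example.
    private static final Pattern URN = Pattern.compile(
            "urn:abc:de:xyz:[0-9]{6,12}\\.[0-9]{1,7}\\.[a-zA-Z0-9._-]{1,20}");

    static boolean isValid(String identifier) {
        // matches() requires the whole string to match, so ^ and $ are not needed.
        return URN.matcher(identifier).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("urn:abc:de:xyz:234567.1890123.ABC123.-_12")); // true
        System.out.println(isValid("urn:abc:de:xyz:234567.1890123"));             // false
        System.out.println(isValid("urn:abc:de:xyz:234567.1890ABC.ABC123.-_12")); // false
    }
}
```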

This variant allows one or more dot-separated alphanumeric groups at the end of the string, as in your examples:
^urn:abc:de:xyz:\d{6,12}\.\d{1,7}(?:\.[\w-]{1,20})+$

Java, Regex, Nested optional groups

I'm trying to capture nested optional groups in Java but it's not working out.
I'm trying to capture a keyword followed by an interval, where a keyword is anything for now, and an interval is just two dates. The interval may be optional, and the two dates may be optional as well. So, the following are valid matches.
word
word [01/01/1900, ]
word [, 01/01/2000]
word [01/01/1900, 01/01/2000]
I want to capture the keyword and both the dates even if they are null.
This is the Java MWE I came up with.
public class Parser {
public static void main(String[] args) {
Parser parser = new Parser();
String s = "word [01/01/1900, 01/01/2000]";
parser.parse(s);
}
public void parse(String s) {
String date = "\\d{2}/\\d{2}/\\d{4}";
String interval = "\\[("+date+")?, ("+date+")?\\]";
String keyword = "(.+)( "+interval+")?";
Pattern p = Pattern.compile(keyword);
Matcher m = p.matcher(s);
if (m.matches()) {
for (int i = 0; i <= m.groupCount(); ++i) {
System.out.println(i + ": " + m.group(i));
}
}
}
}
And this is the output
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
4: null
If interval isn't optional, then it works.
String keyword = "(.+)( "+interval+")";
0: word [01/01/1900, 01/01/2000]
1: word
2: [01/01/1900, 01/01/2000]
3: 01/01/1900
4: 01/01/2000
If interval is a non-capturing group (but still optional), then it doesn't work.
String keyword = "(.+)(?: "+interval+")?";
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
What do I need to do to retrieve back both dates? Thank You.
Edit: Part 2.
Suppose now I want to match repeated keywords, i.e. the regex keyword(, keyword)*. I tried this out, but only the first and the last instance are captured.
For simplicity, suppose I want to match the string a, b, c, d with the regex ([a-z])(?:, ([a-z]))*
However, I can only retrieve the first and the last group.
0: a, b, c, d
1: a
2: d
Why is this so?
Just found out that this cannot be done: a repeated capturing group only keeps its last iteration. See the question "Capture group multiple times".
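A usual workaround for the repeated-keyword case is to call find() in a loop with a pattern for a single keyword, instead of trying to capture every repetition in one match. A minimal sketch, with a made-up class name (RepeatedKeywords) and assuming single lowercase letters as keywords, as in the simplified example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedKeywords {
    // Each find() call yields one keyword, so nothing is overwritten
    // the way a repeated capturing group would be.
    static List<String> keywords(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z]").matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keywords("a, b, c, d")); // [a, b, c, d]
    }
}
```

This does not check that the input as a whole has the form keyword(, keyword)*; if that matters, validate the string with matches() first and then extract the keywords with the loop.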

Change the first part of keyword from (.+) to (.+?).
Without the ?, the + in (.+) is a greedy quantifier. That means it will try to match as much as it can. I don't know all the mechanics of how the regex engine works, but I believe that in your case it first lets (.+) consume all N characters remaining in the source. If the whole regex matches that way, it is done. Otherwise, it backtracks and tries N-1, N-2, etc., until the entire regex matches. I also think it works from left to right: since (.+) is the leftmost part of the pattern, its greediness is satisfied before that of anything to its right, so (.+) takes precedence.
In your case, since (.+) is followed by an optional part, the regex matcher starts by giving (.+) the entire remainder of the string, and it succeeds, because what is left, the empty string, is a fine match for an optional group. That also explains why it works if the interval isn't optional: the empty string no longer matches it, so (.+) has to give characters back.
Adding ? makes it a "reluctant" (or "stingy") quantifier, which works in the opposite direction. It starts with the fewest characters allowed (1 for +), then tries 2, 3, ..., instead of starting with N and going downward. So when it gets up to 4 characters, matching "word", and finds that the rest of the string matches your optional part, it completes and gives the results you were expecting.
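Put together, the reluctant version could look like the sketch below. The class name IntervalParser and the array-returning parse method are my own choices for illustration, and the interval is wrapped in a non-capturing group, so the dates end up in groups 2 and 3 rather than 3 and 4.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntervalParser {
    private static final String DATE = "\\d{2}/\\d{2}/\\d{4}";
    // Group 1: keyword (reluctant), group 2: first date, group 3: second date.
    // The optional interval is a non-capturing group (?: ... )?.
    private static final Pattern KEYWORD = Pattern.compile(
            "(.+?)(?: \\[(" + DATE + ")?, (" + DATE + ")?\\])?");

    // Returns {keyword, firstDate, secondDate}; absent dates are null.
    // Returns null if the input does not match at all.
    static String[] parse(String s) {
        Matcher m = KEYWORD.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] inputs = {
            "word",
            "word [01/01/1900, ]",
            "word [, 01/01/2000]",
            "word [01/01/1900, 01/01/2000]"
        };
        for (String input : inputs) {
            System.out.println(Arrays.toString(parse(input)));
        }
    }
}
```

Because matches() still requires the whole input to be consumed, the reluctant (.+?) only stops early when the rest of the string really is a well-formed interval.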

Regular Expressions (regex) Pattern Matching

Can someone please help me understand how this program calculates the output given below?
import java.util.regex.*;
class Demo{
public static void main(String args[]) {
String name = "abc0x12bxab0X123dpabcq0x3423arcbae0xgfaaagrbcc";
Pattern p = Pattern.compile("[a-c][abc][bca]");
Matcher m = p.matcher(name);
while(m.find()) {
System.out.println(m.start()+"\t"+m.group());
}
}
}
OUTPUT :
0 abc
18 abc
30 cba
38 aaa
43 bcc

It simply searches the string for matches according to the rules specified by "[a-c][abc][bca]".
0 abc --> At position 0 there are three consecutive characters, each of which is a, b or c.
18 abc --> Exactly the same thing, but at position 18.
30 cba --> At position 30 there is again a run of three characters from a, b and c (the order does not matter).
38 aaa --> same as 30
43 bcc --> same as 30
Note that counting starts at 0, so the first letter is at position 0, the second is at position 1, and so on.
For further information about regex and its use, see the Oracle tutorial for regex.

Let's analyze:
"[a-c][abc][bca]"
This pattern looks for groups of 3 letters each.
[a-c] means that the first letter has to be between a and c, so it can be a, b or c.
[abc] means that the second letter has to be one of a, b or c, so it is basically the same as [a-c].
[bca] means that the third letter has to be b, c or a; the order inside the brackets doesn't matter.
Everything you need to know is in the official Java regex tutorial:
http://docs.oracle.com/javase/tutorial/essential/regex/

This pattern basically matches 3-character words where each letter is either a,b, or c.
It then prints out each matching 3-char sequence along with the index at which it was found.
Hope that helps.

It prints the position in the string, counting from 0 instead of 1, at which each match occurs. That is, the first match, "abc", is at position 0, and the second match, "abc", is at string position 18.
Essentially, it matches any 3-character sequence in which every character is 'a', 'b' or 'c'.
The pattern could be written as "[a-c]{3}" and you would get the same result.
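To convince yourself that the two spellings behave the same, you can run both against the string from the question and compare the results. A small sketch; the class name ClassEquivalence and the helper findAll are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassEquivalence {
    // Collects every match as "start:text", mirroring what the question's program prints.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String name = "abc0x12bxab0X123dpabcq0x3423arcbae0xgfaaagrbcc";
        // Both lines print [0:abc, 18:abc, 30:cba, 38:aaa, 43:bcc]
        System.out.println(findAll("[a-c][abc][bca]", name));
        System.out.println(findAll("[a-c]{3}", name));
    }
}
```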

Let's look at your source code, because the regexp itself has already been explained well in the other answers.
//compiles a regexp pattern, kind of make it useable
Pattern p = Pattern.compile("[a-c][abc][bca]");
//creates a matcher for your regexp pattern. This one is used to find this regexp pattern
//in your actual string name.
Matcher m = p.matcher(name);
//loop while the matcher finds a next substring in name that matches your pattern
while(m.find()) {
//print out the index of the found substring and the substring itself
System.out.println(m.start()+"\t"+m.group());
}

How to precisely identify & work greedy or reluctant quantifiers? [duplicate]

Given:
import java.util.regex.*;
class Regex2 {
public static void main (String args[]) {
Pattern p = Pattern.compile(args[0]);
Matcher m = p.matcher (args [1]);
boolean b = false;
while (m. find()) {
System.out.print(m.start() + m.group());
}
}
}
the command line expression is :
java Regex2 "\d*" ab34ef
What is the result?
A. 234
B. 334
C. 2334
D. 0123456
E. 01234456
F. 12334567
G. Compilation fails
The SCJP book explains regex, pattern and matchers so horribly it's unbelievable.
Anyway, I pretty much understand most of the basics and have looked at the Sun/Oracle documentation about greedy and reluctant quantifiers. I understand the concepts but am a bit blurry about a few things:
What exactly is the physical symbol of a "greedy" quantifier? Is it simply a single *, ? or +?
If so, can someone explain in detail how this answer turns out to be E according to the book? When I run it myself I get the answer: 2334!
Here we would be using a greedy quantifier, correct? As I understand it, this would consume the entire string and then backtrack, looking for zero or more digits in a row. So, if greedy, the 'full string' would contain 2 digits in a row and .find() would succeed only once (i.e. m.start = 0, m.group = "ab34ef"), by that definition!
Thanks for the help guys.

These are the matches of \d* against "ab34ef":
index 0: zero-width;
index 1: zero-width;
index 2: "34";
index 4: zero-width;
index 5: zero-width;
index 6: zero-width.
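This explains the book's answer, E: each match prints its start index followed by the matched text, which gives 01234456. If the quantifier were reluctant (\d*?), the match "34" at index 2 would be replaced by two zero-width matches: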
index 2: zero-width;
index 3: zero-width;
The reluctant quantifier grabs as little as allowed to make the entire expression match.
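You can check both cases without going through the command line (where the shell may alter the \d* argument). This is just a sketch with a made-up class name (ZeroWidthDemo); the trace method concatenates the start index and the matched text for every find(), like the exam program does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Appends m.start() followed by m.group() for each match, as in the exam question.
    static String trace(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.start()).append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace("\\d*", "ab34ef"));  // 01234456 (greedy)
        System.out.println(trace("\\d*?", "ab34ef")); // 0123456  (reluctant)
    }
}
```

The final zero-width match at index 6 is the empty string at the very end of the input, which is why the indices run one past the last character.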

Can't retrieve data from matched * group in Java

I'm having trouble figuring out the proper regex.
Here is some sample code:
@Test
public void testFindEasyNaked() {
System.out.println("Naked_find");
String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
System.out.println(nakedPat);
Pattern naked = Pattern.compile(nakedPat, Pattern.MULTILINE + Pattern.DOTALL );
Matcher m = naked.matcher(arg);
if (m.find()) {
System.out.println("found naked");
for (int i = 0; i <= m.groupCount(); i++) {
System.out.printf("%d: %s\n", i, m.group(i));
}
} else {
System.out.println("can't find naked either");
}
System.out.flush();
}
My regex matches the string, but I am not able to pull the repeated pattern.
What I want is to have
seq=\"002\" image=\"D000001\"
pulled out as a group. Here is what the program shows when I execute it.
Naked_find
<(us-patent-grant)((\s*[\S&&[^>]])*)*\s*>(.+?)</\1>
found naked
0: <us-patent-grant seq="002" image="D000001" >foo<name>Fred</name></us-patent-grant>
1: us-patent-grant
2:
3: "
4: foo<name>Fred</name>
The group #4 is fine, but where is the data for #2 and #3, and why is there a double quote in #3?
Thanks
Pat

Even though using an XML parser would be the sounder approach, I think I can explain the error in your regular expression:
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
You try to match the attributes in the part ((\\s*[\\S&&[^>]])*)*. Look at your innermost group: you have \\s* ("zero or more whitespace characters") followed by [\\S&&[^>]] ("one non-whitespace character which is not >"). It means that each iteration of this group matches zero or more spaces followed by a single non-space character.
So this will match any non-space character between "us-patent-grant" and >. Every time the regular expression engine matches the group, it assigns the new value to group 3, so the values matched previously are lost. That's why you end up with the last character of the tag's attributes, which is ".
You can improve it a bit by adding a + after [\\S&&[^>]], so that each iteration matches a complete run of non-space characters, but you would still obtain only the last tag attribute in your group. There is a better and simpler way:
Since your goal is to pull out seq="002" image="D000001" as a group, simply match the sequence of all characters that are not > after "us-patent-grant":
"<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>"
This way, you have the following values in your groups:
Group 1: us-patent-grant
Group 2: seq=\"002\" image=\"D000001\"
Group 3: foo<name>Fred</name>
Here is the test on Regexplanet: http://fiddle.re/ezfd6
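As a quick check in Java, here is a sketch of that simpler pattern in use. The class name TagAttributes and the parse helper are invented for this example. One deliberate deviation from the pattern above: the attribute group is made reluctant ([^>]*?), because the greedy [^>]* would also swallow the space before >, leaving a trailing space in group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAttributes {
    // Group 1: tag name, group 2: raw attribute text, group 3: element content.
    // [^>]*? is reluctant so that the following \s* can absorb trailing whitespace.
    private static final Pattern TAG = Pattern.compile(
            "<(us-patent-grant)\\s*([^>]*?)\\s*>(.+?)</\\1>", Pattern.DOTALL);

    // Returns {tagName, attributes, content}, or null if no such tag is found.
    static String[] parse(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >"
                + "foo<name>Fred</name></us-patent-grant> extra stuff";
        for (String part : parse(arg)) {
            System.out.println(part);
        }
    }
}
```

For anything beyond a one-off extraction like this, an XML parser remains the more robust choice.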
