Regular Expressions (regex) Pattern Matching - java

Can someone please help me understand how this program produces the output given below?
import java.util.regex.*;
class Demo {
    public static void main(String args[]) {
        String name = "abc0x12bxab0X123dpabcq0x3423arcbae0xgfaaagrbcc";
        Pattern p = Pattern.compile("[a-c][abc][bca]");
        Matcher m = p.matcher(name);
        while (m.find()) {
            System.out.println(m.start() + "\t" + m.group());
        }
    }
}
OUTPUT :
0 abc
18 abc
30 cba
38 aaa
43 bcc

It simply searches the String for matches according to the rules specified by "[a-c][abc][bca]": three consecutive characters, each of which is a, b or c.
0 abc --> At position 0, the three characters a, b and c all come from the allowed set.
18 abc --> Exactly the same thing, but at position 18.
30 cba --> At position 30, there is again a run of three characters drawn from a, b and c.
38 aaa --> same as 30 (a letter may repeat)
43 bcc --> same as 30
Notice that counting starts at 0. So the first letter is at position 0, the second is at position 1, and so on...
For further information about regex and its use, see: Oracle Tutorial for Regex

Let's analyze:
"[a-c][abc][bca]"
This pattern looks for groups of 3 letters each.
[a-c] means that the first letter has to be between a and c, so it can be a, b or c.
[abc] means that the second letter has to be one of the letters a, b or c, so it is basically the same as [a-c].
[bca] means that the third letter has to be b, c or a; the order inside the brackets doesn't matter.
Everything you need to know is in the official Java regex tutorial:
http://docs.oracle.com/javase/tutorial/essential/regex/

This pattern basically matches 3-character sequences where each character is either a, b, or c.
It then prints out each matching 3-char sequence along with the index at which it was found.
Hope that helps.

It is printing the position in the string, counting from 0 instead of 1, at which each match starts. That is, the first match, "abc", starts at position 0; the second match, "abc", starts at position 18.
Essentially, it matches any 3-character sequence in which every character is 'a', 'b' or 'c' (not necessarily one of each, which is why "aaa" matches).
The pattern could be written as "[a-c]{3}" and you would get the same result.
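To check the claim that both spellings behave the same, here is a minimal sketch that runs the two patterns over the string from the question and collects each match with its start index. The class name EquivalentPatternDemo and the helper findAll are made up for this illustration; they are not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EquivalentPatternDemo {
    // Hypothetical helper: returns "start:match" for every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String name = "abc0x12bxab0X123dpabcq0x3423arcbae0xgfaaagrbcc";
        // Both lines print [0:abc, 18:abc, 30:cba, 38:aaa, 43:bcc]
        System.out.println(findAll("[a-c][abc][bca]", name));
        System.out.println(findAll("[a-c]{3}", name));
    }
}
```

The three character classes list the same letters in a different order, so collapsing them into one class with a {3} quantifier should not change which substrings match.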

Let's look at your source code, because the regexp itself was already well explained in the other answers.
// compiles the regexp pattern, i.e. turns it into a usable Pattern object
Pattern p = Pattern.compile("[a-c][abc][bca]");
// creates a matcher for your regexp pattern; it is used to find
// the pattern in your actual string name
Matcher m = p.matcher(name);
// loop while the matcher finds another substring in name that matches your pattern
while (m.find()) {
    // print the start index of the found substring and the substring itself
    System.out.println(m.start() + "\t" + m.group());
}

Related

Java regex: how to select words starting with a specific letter and is x number of characters long?

This is the code I wrote that selects all names starting with A:
String longString = "Amal Kamal Jamal Amitha Farook Amani Tom Adele George Ariana";
String pattern = "(?i)(\\s|^)[a][A-Za-z]+(\\s|$)";
Pattern checkRegex = Pattern.compile(pattern);
Matcher regexMatcher = checkRegex.matcher(longString);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.start() + " : " + regexMatcher.group());
}
Output is as expected
0 : Amal
16 : Amitha
30 : Amani
40 : Adele
53 : Ariana
Now I want to select names that are at least 5 characters long. So the expected output is: Amitha, Amani, Adele, Ariana.
When I try this, only Ariana is returned, and I can't understand why.
String pattern = "(?i)(\\s|^)[a][A-Za-z]+(\\s|$){5,}";
Output
53 : Ariana
If I put parentheses around the whole expression (to say that the whole expression should be at least 5 characters long), then there is no output at all:
String pattern = "(?i)((\\s|^)[a][A-Za-z]+(\\s|$)){5,}";
What is the correct way of writing this?
You quantified (\\s|$), while you need to quantify [a-zA-Z]. So you only match words that are followed by 5 or more whitespace characters or 5 or more ends of string (which makes no sense, of course). Also, you need to use {4,}, as [a] already matches 1 letter.
Use this regex to fix the issue (although it is not the best one, see below why):
(?i)(\s|^)a[a-z]{4,}(\s|$)
Details
(?i) - case insensitive modifier
(\s|^) - either a whitespace or a start of a string
a - an a or A letter
[a-z]{4,} - any 4 or more ASCII letters
(\s|$) - either a whitespace or an end of a string (note: the whitespace will be consumed, and consecutive matching words will not be handled properly).
You may use "(?i)(?<!\\S)a[a-z]{4,}(?!\\S)" pattern to make sure you are matching a word in between whitespaces or start/end of string positions.
Or, use word boundaries - "(?i)\\ba[a-z]{4,}\\b".
Java demo:
String longString = "Amal Kamal Jamal Amitha Farook Amani Tom Adele George Ariana";
String pattern = "(?i)(?<!\\S)a[a-z]{4,}(?!\\S)";
Pattern checkRegex = Pattern.compile(pattern);
Matcher regexMatcher = checkRegex.matcher(longString);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.start() + " : " + regexMatcher.group());
}
Result:
17 : Amitha
31 : Amani
41 : Adele
54 : Ariana
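The note about consumed whitespace is easiest to see when two matching names are adjacent. The following is a small sketch under that assumption; the input "Amitha Amani Adele" and the names WordMatchDemo and names are invented for the illustration, and each match is trimmed so the three patterns can be compared directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatchDemo {
    // Hypothetical helper: returns every match of regex in input, trimmed of whitespace.
    static List<String> names(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "Amitha Amani Adele"; // made-up input with adjacent matching names

        // The space after "Amitha" is consumed, so "Amani" has no leading \s left to match.
        System.out.println(names("(?i)(\\s|^)a[a-z]{4,}(\\s|$)", s));   // [Amitha, Adele]

        // Lookarounds only inspect the neighbouring characters without consuming them.
        System.out.println(names("(?i)(?<!\\S)a[a-z]{4,}(?!\\S)", s)); // [Amitha, Amani, Adele]

        // Word boundaries are zero-width as well.
        System.out.println(names("(?i)\\ba[a-z]{4,}\\b", s));          // [Amitha, Amani, Adele]
    }
}
```

The lookaround and word-boundary versions differ on input such as "Amitha," (with trailing punctuation): \b accepts it, while (?!\S) requires whitespace or the end of the string.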

Find total number of occurrences of a substring

Suppose I want to find the total number of occurrences of the following substring:
any substring that starts with 1, followed by any number (0 or more) of 0's, and then followed by 1.
I formed a regular expression for it: 1[0]*1
Then I used the Pattern and Matcher classes of Java to do the rest of the work.
import java.util.regex.*;
class P_m
{
    public static void main(String[] args)
    {
        int s = 0;
        Pattern p = Pattern.compile("1[0]*1");
        Matcher matcher = p.matcher("1000010101");
        while (matcher.find())
            ++s;
        System.out.println(s);
    }
}
But the problem is that when two consecutive occurrences overlap, the above code undercounts. For example, the code above outputs 2, whereas it should be 3. Can I modify the code to return the correct count?
Use a positive lookahead:
"10*(?=1)"
This matches the same pattern as you described (starts with 1, followed by zero or more 0, followed by 1), but the difference is that the final 1 is not included in the match. This way, that last 1 is not "consumed" by the match, and it can participate in further matches, effectively allowing the overlap that you asked for.
Pattern p = Pattern.compile("10*(?=1)");
Matcher matcher = p.matcher("1000010101");
int s = 0;
while (matcher.find()) ++s;
System.out.println(s);
Outputs 3 as you wanted.
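To make the "not consumed" point visible, this sketch prints where each match starts and what text it covers for both patterns. The class name OverlapDemo and the helper findAll are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Hypothetical helper: returns "start:match" for every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "1000010101";
        // The closing 1 is part of each match, so the search resumes after it.
        System.out.println(findAll("10*1", input));     // [0:100001, 7:101]
        // The closing 1 is only looked at, so it can open the next match.
        System.out.println(findAll("10*(?=1)", input)); // [0:10000, 5:10, 7:10]
    }
}
```

The 1 at index 5 closes the first occurrence and opens the second, which is exactly the overlap the plain pattern skips.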

Java Regex to match repeated keywords

I need to filter a document if the caption has the same surname on both sides (e.g., Smith Vs Smith or John Vs John).
I am converting the entire document into a string and validating that string against a regular expression.
Could anyone help me write a regular expression for the above case?
Backreferences.
Example: (\w+) Vs \1
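A minimal Java sketch of the backreference approach follows. The word boundaries (\b) are an addition not present in the one-line example above; they are there so that something like "Smith Vs Smithson" is not reported as a repeat. The class name BackrefDemo and the method sameParty are made up for the illustration.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // \1 must repeat exactly what group 1 captured; \b keeps the repeat from being a prefix.
    private static final Pattern SAME = Pattern.compile("\\b(\\w+) Vs \\1\\b");

    // Hypothetical helper: true if the text contains "X Vs X" for some word X.
    static boolean sameParty(String text) {
        return SAME.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(sameParty("In the matter of Smith Vs Smith, heard today")); // true
        System.out.println(sameParty("Smith Vs Jones"));    // false
        System.out.println(sameParty("Smith Vs Smithson")); // false
    }
}
```

Because find() is used rather than matches(), the caption may appear anywhere inside the document string.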
If I have understood your question correctly: you have a string like "X Vs Y" (where X and Y are two names) and you want to know whether X == Y.
In this case, a simple (\w+) regex can do it:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

String input = "Smith Vs Smith";
// Build the regex
Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(input);
// Store the matches (every word except "Vs") in a list
List<String> str = new ArrayList<String>();
while (m.find()) {
    if (!m.group().equals("Vs")) {
        str.add(m.group());
    }
}
// Test the matches
if (str.size() > 1 && str.get(0).equals(str.get(1)))
    System.out.println(" The Same ");
else
    System.out.println(" Not the Same ");
(\w+).*\1
This means: a word of 1 or more characters, captured as group 1, followed by anything, followed by whatever group 1 matched.
In more detail: it uses grouping (parenthesizing part of the regex) and a reference back to a group defined in the expression (\1 does that here).
Example:
String s = "Stewie is a good guy. Stewie does no bad things";
Matcher m = Pattern.compile("(\\w+).*\\1").matcher(s);
m.find(); // will be true, and m.group(1) is the duplicated word (note the additional Java escaping)

How to match just 1 or 2 chars with regex

I want a regex to match any word of 1 or 2 characters, for example (is, an, or, if, a).
I tried this:
int scount = 0;
String txt = "hello everyone this is just test aa ";
Pattern p2 = Pattern.compile("\\w{1,2}");
Matcher m2 = p2.matcher(txt);
while (m2.find()) {
    scount++;
}
but I got wrong matches.
You probably want to use word boundary anchors:
Pattern p2 = Pattern.compile("\\b\\w{1,2}\\b");
These anchors match at the start/end of alphanumeric "words", that is, in positions before a \w character if there is no \w character before that, or after a \w character if there is no \w character after that.
I think that you should be a bit more descriptive. Your current code leaves scount at 15. That's not nothing.
If you want to count the 1- or 2-letter words, excluding underscores and digits from the count, I think that you would be better off with negative lookarounds:
Pattern.compile("(?i)(?<![a-z])[a-z]{1,2}(?![a-z])");
With the input string hello everyone this is just 1 test aa, you get scount as 2 (is and aa) and not 3 (is, 1, aa), as you would if you were looking for 1 or 2 consecutive \w between word boundaries.
Also, with hello everyone this is just test aa_, you get a count of 1 with \b\w{1,2}\b (is), but 2 (is, aa) with the lookarounds.
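The counts quoted in this thread can be reproduced with a short sketch like the one below. The class name ShortWordCountDemo and the helper count are hypothetical; the input strings are the ones quoted in the question and answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortWordCountDemo {
    static final String PLAIN = "\\w{1,2}";                            // pattern from the question
    static final String BOUNDED = "\\b\\w{1,2}\\b";                    // word-boundary version
    static final String LETTERS = "(?i)(?<![a-z])[a-z]{1,2}(?![a-z])"; // lookaround version

    // Hypothetical helper: number of matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String txt = "hello everyone this is just test aa ";
        System.out.println(count(PLAIN, txt));   // 15: long words are chopped into pieces
        System.out.println(count(BOUNDED, txt)); // 2: is, aa

        String withDigit = "hello everyone this is just 1 test aa";
        System.out.println(count(BOUNDED, withDigit)); // 3: is, 1, aa
        System.out.println(count(LETTERS, withDigit)); // 2: is, aa

        String withUnderscore = "hello everyone this is just test aa_";
        System.out.println(count(BOUNDED, withUnderscore)); // 1: is
        System.out.println(count(LETTERS, withUnderscore)); // 2: is, aa
    }
}
```

Which version is right depends on whether digits and underscores should count as part of a word; \w treats them as word characters, the letter-only lookarounds do not.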

Regular Expression (Java) anomaly - explanation sought

Using Java (1.6), I want to split an input string that consists of a header followed by a number of tokens. Tokens conform to this format: a ! char, a space char, then a 2-char token name (from a constrained list, e.g. C0 or 04) and then 5 digits. I have built a pattern for this, but it fails for one token (CE) unless I remove the requirement for the 5 digits after the token name. The unit test explains this better than I could (see below).
Can anyone help with what's going on with my failing pattern? The input CE token looks OK to me...
Cheers!
@Test
public void testInputSplitAnomaly() {
    Pattern pattern = Pattern.compile("(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]\\d{5}))");
    splitByRegExp(pattern);
}
@Test
public void testInputSplitWorks() {
    Pattern pattern = Pattern.compile("(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]))");
    splitByRegExp(pattern);
}
public void splitByRegExp(Pattern pattern) {
    String input = "& 0000800429! C600080 123456789-! C000026 213 00300! 0400020 A1Y1! Q200002 13! CE00202 01 ! Q600006 020507! C400012 O00511011";
    String[] tokens = pattern.split(input);
    Arrays.sort(tokens);
    System.out.println("-----------------------------");
    for (String token : tokens) {
        System.out.println(token.substring(0, 11));
    }
    assertThat(tokens, Matchers.hasItemInArray(startsWith("! CE")));
    assertThat(tokens.length, is(8));
}
I think that your mistake here is your use of square brackets. Don't forget that these indicate a character class, so [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE] doesn't do what you expect it to.
What it does do is the following:
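In Java, a [ inside a character class opens a nested class that is unioned with the enclosing one, so the whole bracket expression is a single character class matching exactly one of: |, 0, 2, 3, 4, 5, 6, 8, 9, B, C, E or Q.
With \d{5} appended, the pattern therefore requires "! ", one character from that class, and then five digits. "! C600080" passes (C followed by 60008), but "! CE00202" fails because the E after the C is not a digit. Without \d{5}, "! C" alone is enough to match, which is why the second test passes.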
What you are probably after is (?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)
This doesn't make any sense:
[04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]
I believe you want:
(?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)
Square brackets are only used for character classes, not general grouping. Use (?:...) or (...) for general grouping (the latter also captures).
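As a rough check of the grouped version, the sketch below splits the input from the question with the alternation wrapped in (?:...) and, for comparison, with the original bracketed pattern. The class name TokenSplitDemo and the helper split are invented for this example, and the JUnit/Hamcrest assertions from the question are left out so that it runs on its own.

```java
import java.util.regex.Pattern;

class TokenSplitDemo {
    static final String INPUT = "& 0000800429! C600080 123456789-! C000026 213 00300! 0400020 A1Y1! Q200002 13! CE00202 01 ! Q600006 020507! C400012 O00511011";

    // Pattern from the question: the brackets form one character class.
    static final String ORIGINAL = "(?=(! [04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE]\\d{5}))";

    // Grouped alternation as suggested in the answers, followed by the 5 digits.
    static final String FIXED = "(?=! (?:04|C0|Q2|Q6|C4|B[2-6]|Q[8-9]|C6|CE)\\d{5})";

    // Hypothetical helper: splits input at every position where the lookahead matches.
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        System.out.println(split(ORIGINAL, INPUT).length + " tokens with the bracketed pattern");
        String[] tokens = split(FIXED, INPUT);
        System.out.println(tokens.length + " tokens with the grouped pattern:");
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

The lookahead is zero-width, so each token keeps its leading "! " and the header stays as the first element.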
