Capturing groups and Pattern split method in regular expression

How can I understand the output of the code below? Its first four print statements deal with capturing groups in Java regular expressions, and the rest deals with the Pattern split method. I consulted a few documents to make sense of the output (shown in the picture) but could not figure out exactly how the code works or why it produces this output.
Java Code
import java.util.*;
import java.util.regex.*;
import java.lang.*;
import java.io.*;
/* Name of the class has to be "Main" only if the class is public. */
public class Codechef
{
public static void main(String[] args) {
//Capturing Group in Regular Expression
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
// using pattern split method
Pattern pattern = Pattern.compile("\\W");
String[] words = pattern.split("one#two#three:four$five");
System.out.println(words);
for (String s : words) {
System.out.println("Split using Pattern.split(): " + s);
}
}
}
Queries
For the capturing groups, I cannot figure out what the use of ‘\1’ or ‘\2’ is here. How do these evaluate to true or false?
For the Pattern split method, I wish to know how the string split happens. How does this split method work differently from a normal String split method?

The first console print lines...
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
use the matches() method, which always returns a boolean (true or false) and is mostly used for String validation of one sort or another. The first two examples use the same regular expression, "(\\w\\d)\\1". Run against the two supplied strings ("a2a2" and "a2b2") through matches(), it returns true and then false, in that order.
The real key here is knowing what that particular regular expression is supposed to validate. The expression above contains one capturing group, which is denoted by the parentheses. The \\w matches any single word character: a-z, A-Z, 0-9 or _ (the underscore character). The \\d matches a single digit from 0 to 9.
Note: The metacharacters are really written as \w and \d, but because the escape character (\) must itself be escaped inside a Java String literal, you have to add an additional backslash.
The \1 is a backreference: it matches the same text that was most recently matched by the 1st capturing group. Since only one capturing group is specified, 1 is the only useful value here. A value of 0 does not work as a backreference, because in Java's regex syntax \0 introduces an octal escape rather than referring to a group. A value greater than 1 refers to a group that does not exist, so there is no captured text for it to repeat and the expression cannot match.
Bottom line, the expression works through the supplied string like this:
Is the first character (\\w) an upper or lower case letter A to Z, an underscore, or a digit from 0 to 9? If it isn't, there is no match and boolean false is returned, but if it is, then...
Is the second character (\\d) a digit from 0 to 9? If it isn't, boolean false is returned, but if it is, then...
Are the remaining characters exactly the same as those first two (including letter case if a-z or A-Z are used)? If they are not identical, or if there are more than two remaining characters, boolean false is returned. If they are identical, boolean true is returned.
Basically, the expression merely validates that the supplied String is a word character and a digit followed by an exact repeat of those two characters. This is why the second console print:
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
returns a boolean false, b2 is not the same as a2 whereas in the first console print:
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
the Last Two characters a2 do indeed match the First Two characters a2 and therefore boolean true is returned.
You will now notice that in the other two console prints:
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
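the regular expression used contains 2 capturing groups (two sets of parentheses). The same sort of matching applies here, but against two groups instead of one: (AB) captures AB as group 1 and (B\\d) captures B2 as group 2, so \\2\\1 requires the text B2 followed by AB. In the last example the text after the second group is B3, not B2, so \\2 fails and false is returned.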
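To see what each group actually captured, a small sketch like the following may help. The class and method names (BackreferenceDemo, groupsOf) are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Hypothetical helper: returns the text captured by each group
    // (index 0 is the whole match), or null when the pattern does not
    // match the entire input.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Group 1 = "AB", group 2 = "B2"; \2\1 then demands "B2" followed by "AB".
        System.out.println(Arrays.toString(groupsOf("(AB)(B\\d)\\2\\1", "ABB2B2AB")));
        // prints [ABB2B2AB, AB, B2]

        // Group 2 captures "B2", but the text that follows is "B3", so \2 fails.
        System.out.println(Arrays.toString(groupsOf("(AB)(B\\d)\\2\\1", "ABB2B3AB")));
        // prints null
    }
}
```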
If you want to see how these Regular Expressions play out and get explanations on what the expressions mean then use Regular Expression Tester at regex101.com. This is also a good Regular Expressions resource.
Pattern.split():
In this case, using the Pattern.split() method is a little overkill in my opinion, since String.split() also accepts regular expressions, but it does have its purpose in other areas. Nevertheless, it is a good example of how it can be used. The split() method is used here to break up the supplied String wherever the regular expression compiled into the Pattern matches, which in this case is "\\W" (otherwise: \W). The \W (uppercase W) means 'match any non-word character', that is, any character other than a-z, A-Z, 0-9 or _. This expression is basically the opposite of "\w" (with the lowercase w). The characters #, : and $ contained within the supplied String (and yes, a comma, semicolon, exclamation mark, etc. would count too):
"one#two#three:four$five"
are considered non-word characters and therefore the split is carried out on any one of them resulting in a String Array containing:
[one, two, three, four, five]
The very same thing can be accomplished using the String.split() method, since this method also accepts a regular expression:
String[] s = "one#two#three:four$five".split("\\W");
or even:
String[] s = "one#two#three:four$five".split("[#:$]");
or even:
String[] s = "one#two#three:four$five".split("#|:|\\$");
// Outside a character class the $ character is a reserved RegEx symbol
// and therefore needs to be escaped.
or on and on and on...
Yup... "\\W" is easier since it covers all non-word characters. ;)
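As a rough side-by-side, here is a sketch of both approaches on the string from the question. The class and method names (SplitDemo, splitWithPattern, splitWithString) are invented for this example. It also uses Arrays.toString, because printing a String[] directly, as System.out.println(words) does in the question, shows only the array's type and hash code rather than its elements.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once; handy when the same delimiter is applied to many strings.
    static final Pattern NON_WORD = Pattern.compile("\\W");

    static String[] splitWithPattern(String input) {
        return NON_WORD.split(input);
    }

    static String[] splitWithString(String input) {
        // String.split takes the regex as a String on every call.
        return input.split("\\W");
    }

    public static void main(String[] args) {
        String input = "one#two#three:four$five";
        System.out.println(Arrays.toString(splitWithPattern(input)));
        // prints [one, two, three, four, five]
        System.out.println(Arrays.toString(splitWithString(input)));
        // prints [one, two, three, four, five]
    }
}
```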

If I talk about Capturing Groups, I cannot figure out what the usage of ‘\1’ or ‘\2’ is here. How are these evaluating to true or false?
Answer:
\\1 repeats the first captured group (i.e. a2 captured by (\\w\\d))
\\2 repeats the second captured group (i.e. B2 captured by (B\\d))
The actual name for those combinations is backreferences:
The section of the input string matching the capturing group(s) is saved in memory for later recall via backreference. A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled.
If I talk about the Pattern split method, I wish to know how the string split is happening. How does this split method work differently than a normal string split method?
Answer:
The split() method in the Pattern class splits a text into an array of Strings, using the regular expression (the pattern) as the delimiter.
Rather than splitting a string on a fixed string or character, you provide a regex, which is much more powerful and flexible.

Related

How to use regular expressions on an index of a String of array in Java

I am basically trying to find regular expression for a text "TC XX" where XX can be any two digit number. My piece of code is:
boolean b = DocArray[RTArrayIndex].matches("/TC \\d{2}/");
where DocArray - an array of string which is basically derived from another string separated by \t
RTArrayIndex - current index of the DocArray array.
Regular Expression - /TC \\d{2}/
The value of string at the current index is "TC 10", but still the value of "b" I am getting is false.
Another index of the array contains the string, "Refer Logs of TC 10" too, but again the value of "b" is false.

You have a few problems. First, your regex contains some "/" characters, which it is attempting to match. If you remove both of those, you will have a slightly better regex.
boolean b = DocArray[RTArrayIndex].matches("TC \\d{2}");
The regex above should match your first example, but not your second. You need to account for leading and trailing characters. You can do this by using ".*": "." is a placeholder for any character at all, and "*" means it can be seen any number of times. If you add ".*" to the beginning and end of your pattern, any string that contains the substring "TC \d\d" will match your regex.
boolean b = DocArray[RTArrayIndex].matches(".*TC \\d{2}.*");
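A quick way to check the three variants against both sample values is a sketch like this one. TcMatchDemo and its method names are hypothetical and stand in for the DocArray lookup from the question.

```java
class TcMatchDemo {
    // The original attempt: the slashes are treated as literal characters.
    static boolean withSlashes(String s) {
        return s.matches("/TC \\d{2}/");
    }

    // True only when the whole string is "TC " followed by two digits.
    static boolean wholeStringIsTc(String s) {
        return s.matches("TC \\d{2}");
    }

    // True when "TC " plus two digits appears anywhere in the string.
    static boolean containsTc(String s) {
        return s.matches(".*TC \\d{2}.*");
    }

    public static void main(String[] args) {
        String[] samples = {"TC 10", "Refer Logs of TC 10"};
        for (String s : samples) {
            System.out.println(s + " -> " + withSlashes(s) + ", "
                    + wholeStringIsTc(s) + ", " + containsTc(s));
        }
        // prints:
        // TC 10 -> false, true, true
        // Refer Logs of TC 10 -> false, false, true
    }
}
```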

Remove the slashes at the beginning and the end of your regular expression, like this:
TC \\d{2}
This works for your first example. If you want all strings containing TC 10, you need to add something at the beginning and the end, such as .* (which means 'anything').
The final regular expression should be:
.*TC \\d{2}.*

How to make Java String split greedy with lookahead?

Code is basically:
String[] result = "T&&T&T".split("(?=\\w|&+)");
I was expecting the lookahead to be greedy but instead it is returning the array:
T, &, &, T, &, T
What I am aiming for is:
T, &&, T, &, T
Is this possible for split and lookahead?
I have tried the following split regex values but the result is still not greedy for the ampersand:
"(?=\\w|&&?)"
"(?=\\w|&{1,2})"

It is already greedy, but I think you are misunderstanding how your split is working. The problem is that you are thinking of the characters but not the space between them (this is one of the places where regexes can get away from you).
You are asking to split at the places in the string where the next character is either a word character or a series of ampersands. In your string, let's mark the places that satisfy that:
T|&|&|T|&|T
In the space between the first T and the first ampersand, the next character is an ampersand (matches (?=&) which is valid in your regex), the space between the two ampersands also matches for this same reason. The space between the ampersands and the second T also matches (matches (?=\w)), and so on.
The split function will test each space in the string to determine if it is a candidate for a split position. To do what you want, you have to be careful about using the lookahead, so that it doesn't allow splits in the middle of a string of ampersands.
There are multiple ways you may overcome this; Wiktor Stribiżew provides a suggestion that works in his comment.
Usually using a look-behind to check that you are not repeating an undesired character will work, or if possible you can use a look-behind to identify the matching places, and a look-ahead to avoid the undesired repetitions. For example, if we wish to split at all characters keeping repeated characters together, you could do (?<=(.))(?!\\1) which splits your example as T, &&, T, &, T.
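The look-behind plus negative look-ahead idea can be tried out with a sketch such as the one below. RunSplitDemo and splitIntoRuns are names invented here; the regex is the one quoted above.

```java
import java.util.Arrays;

class RunSplitDemo {
    // Splits after a character (captured as group 1 inside the look-behind)
    // whenever the next character is not that same character, so runs of
    // repeated characters stay together.
    static String[] splitIntoRuns(String input) {
        return input.split("(?<=(.))(?!\\1)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoRuns("T&&T&T")));
        // prints [T, &&, T, &, T]
    }
}
```

The position at the very end of the string also satisfies the pattern, but split() with the default limit discards the trailing empty string that this produces.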

Lookarounds cannot be greedy or reluctant; they just check whether the adjoining text to the left (lookbehind) or to the right (lookahead) matches the lookaround subpattern. If there is a match, and the lookaround is positive, the empty location is matched. If the lookaround is not anchored, each location in the string is tested against the pattern in the lookaround, even the beginning and end. Take your (?=\w|&&?):
Since the lookaround is a zero-width assertion and it does not consume characters, all locations (before each character and at the end) are tested. Thus, you get matches between each character.
The (?=\w|&&?) checks the first location before T: it gets matched with \w, so this location is matched. Then comes the next location, after the first T and before the &. It is matched as it is followed with &&. Then the regex engine goes on to check the location between the first & and the second &. It is matched as there is a & after it. This way, we match up to the end. The end location is not matched as it is not followed with & or a word character.
You may restrict the pattern inside a lookaround with another lookaround to avoid matching specific locations inside the input string.
(?=\w|(?<!&)&)
      ^^^^^^
The (?<!&)& pattern will match a & that is not preceded with another &. See the regex demo.
IDEONE demo:
String[] result = "T&&T&T".split("(?=\\w|(?<!&)&)");
System.out.println(Arrays.toString(result));
// => [T, &&, T, &, T]
The lookaround solution is a generic one. In the current case, you can "shorten" the pattern to \b, which matches all locations between a non-word and a word character, and also at the start/end of the string if there is a word character there (so it will also find a match at the end of this string, though Java String#split removes trailing empty elements from the resulting array). This won't work if the alternatives (like \w and & in your regex) belong to the same type (say, both are word characters).
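To make the "every location is tested" point concrete, the following sketch lists the indices at which a zero-width pattern matches. LookaheadPositions and matchPositions are hypothetical names; the two patterns are the ones discussed in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadPositions {
    // Collects the start index of every match of regex in input.
    // For a pure lookaround these are the candidate split locations.
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String input = "T&&T&T";
        // Original lookahead: matches before every character, including index 2,
        // which sits between the two ampersands.
        System.out.println(matchPositions("(?=\\w|&&?)", input));
        // prints [0, 1, 2, 3, 4, 5]

        // Restricted lookahead: index 2 is gone because that & is preceded by &.
        System.out.println(matchPositions("(?=\\w|(?<!&)&)", input));
        // prints [0, 1, 3, 4, 5]
    }
}
```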

How about this:
"(?=\\w)|(?<=\\w)"
or allowing repeat of T:
"(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w)"
or the best form here

It looks like you want to split between different chars, so generally:
String[] parts = input.split("(?<=T)(?=&)|(?<=&)(?=T)");
But in this case, you can split on word boundaries except at start/end:
String[] parts = input.split("(?<=.)\\b(?=.)"); // \\b, because "\b" in a Java string literal is a backspace character

Regex for First word and last word of a string separates with

I'm trying to write a regex for the following rules but can't get it right:
The string has 4 words separated by dots (.).
The first word matches a given one (HELLO for example).
The second and third words can contain any character except the dot itself (.).
The last word matches a given one again (csv for example).
So:
HELLO.something.Somethi#gElse.csv should match.
something.HELLO.?.csv shouldn't match.
HELLO.something...csv shouldn't match.
HELLO.something.somethingelse.notcsv shouldn't match
I can do it with split(.) and then check for individual words, but I'm trying to get it working with Regex and Pattern class.
Any help would be really appreciated.

This is relatively straightforward, as long as you understand character classes. A regex with square brackets [xyz] matches any character from the list {x, y, z}; a regex [^xyz] matches any character except {x, y, z}.
Now you can construct your expression:
^HELLO\.[^.]+\.[^.]+\.csv$
+ means "one or more of the preceding expression"; \. means "dot itself". ^ means "the beginning of the string"; $ means "the end of the string". These anchors prevent regex from matching
blahblahHELLO.world.world.csvblahblah
A common goal for writing regular expressions like that is to capture some content, for example, the string between the first and the second dot, and the string between the second and the third dot. Use capturing groups to bring the content of these strings into your Java program:
^HELLO\.([^.]+)\.([^.]+)\.csv$
Each pair of parentheses defines a capturing group, indexed from 1 (group at index zero represents the capture of the entire expression). Once you obtain a match object from the pattern, you can query it for the groups, and extract the corresponding strings.
Note that backslashes in a regex need to be doubled when the regex is written as a Java string literal.
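For instance, a minimal sketch of pulling the two middle words out through the capturing groups might look like this. CsvNameDemo and middleWords are names made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvNameDemo {
    // Same expression as above, with the backslashes doubled for Java.
    static final Pattern NAME = Pattern.compile("^HELLO\\.([^.]+)\\.([^.]+)\\.csv$");

    // Hypothetical helper: returns the second and third words,
    // or null when the name does not have the required shape.
    static String[] middleWords(String name) {
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(middleWords("HELLO.something.Somethi#gElse.csv")));
        // prints [something, Somethi#gElse]
        System.out.println(Arrays.toString(middleWords("HELLO.something...csv")));
        // prints null
    }
}
```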

(^HELLO\.[^.]+\.[^.]+\.csv$)
Here is the same regex with token explanation on regex101.

Java regular expression: A-Z and - or _, but only once

I've only dabbled in regular expressions and was wondering if someone could help me make a Java regex, which matches a string with these qualities:
It is 1-14 characters long
It consists only of A-Z, a-z and the characters _ or -
The symbols - and _ may appear at most once (counted together) and not at the start
It should match
Hello-Again
ThisIsValid
AlsoThis_
but not
-notvalid
Not-Allowed-This
Nor-This_thing
VeryVeryLongStringIndeed
I've tried the following regex string
[a-zA-Z^\\-_]+[\\-_]?[a-zA-Z^\\-_]*
and it seems to work. However, I'm not sure how to do the total character limiting part with this approach. I've also tried
[[a-zA-Z]+[\\-_]?[a-zA-Z]*]{1,14}
but it matches (for example) abc-cde_aa which it shouldn't.

This ought to work:
(?![_-])(?!(?:.*[_-]){2,})[A-Za-z_-]{1,14}
The regex is quite complex, so let me try to explain it.
(?![_-]) is a negative lookahead. From the start of the string, it asserts that the first character is not _ or -. The negative lookahead "peeks" ahead from the current position and checks that what follows doesn't match [_-], which is a character group containing _ and -.
(?!(?:.*[_-]){2,}) is another negative lookahead, this time against (?:.*[_-]){2,}, which is a non-capturing group repeated at least two times. The group is .*[_-], that is, any number of characters followed by the same character group as before. So we don't want to see some characters followed by _ or - more than once.
[A-Za-z_-]{1,14} is the simple bit. It just says: characters from the group [A-Za-z_-], between 1 and 14 times.
The second part of the pattern is the most tricky, but is a very common trick. If you want to see a character A repeated at some point in the pattern at least X times you want to see the pattern .*A at least X times because you must have
zzzzAzzzzAzzzzA....
You don't care what else is there. So what you arrive at is (.*A){X,}. Now, you don't need to capture the group - this just slows down the engine. So we make the group non-capturing - (?:.*A){X,}.
What you have is that you only want to see the pattern once, so you want not to find the pattern repeated two or more times. Hence it slots into a negative lookahead.
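The (?:.*A){X,} trick on its own can be sketched as a small helper. RepeatCountDemo and containsAtLeast are hypothetical names, and the sketch assumes times is at least 1 and that the input has no line breaks (since . does not match them by default).

```java
import java.util.regex.Pattern;

class RepeatCountDemo {
    // Hypothetical helper: true when literal occurs at least `times` times in input.
    // Builds (?:.*A){X,}.* with A quoted so that it is taken literally.
    static boolean containsAtLeast(String input, String literal, int times) {
        String regex = "(?:.*" + Pattern.quote(literal) + "){" + times + ",}.*";
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(containsAtLeast("zzzzAzzzzAzzzzA", "A", 3)); // prints true
        System.out.println(containsAtLeast("zzzzAzzzzA", "A", 3));      // prints false
    }
}
```

Wrapping the same group in a negative lookahead, as the answer does, turns "at least two" into "fewer than two".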
Here is a testcase:
public static void main(String[] args) {
final String pattern = "(?![_-])(?!(?:.*[_-]){2,})[A-Za-z_-]{1,14}";
final String[] tests = {
"Hello-Again",
"ThisIsValid",
"AlsoThis_",
"_NotThis_",
"-notvalid",
"Not-Allow-This",
"Nor-This_thing",
"VeryVeryLongStringIndeed",
};
for (final String test : tests) {
System.out.println(test.matches(pattern));
}
}
Output:
true
true
true
false
false
false
false
false
Things to note:
the character - is special inside character groups. It must go at the start or end of a group (or be escaped), otherwise it specifies a range
lookaround is tricky and often counter-intuitive. It checks for matches without consuming, allowing you to test multiple conditions on the same data.
the repetition quantifier {} is very useful. It has 3 forms. {X} is repeated exactly X times. {X,} is repeated at least X times. And {X,Y} is repeated between X and Y times.

To check if a string is in the form XXX-XXX, where the -XXX or _XXX part is optional, you can use
[a-zA-Z]+([-_][a-zA-Z]*)?
which is similar to what you already had
[[a-zA-Z]+[\\-_]?[a-zA-Z]*]
but you made a crucial mistake and wrapped it entirely in [...], which makes it a character class, and that is not what you wanted.
To check that the matched part is only 1-14 characters long, you can use the look-ahead mechanism. Just place
(?=.{1,14}$)
at the start of your regex to make sure that the part from the start of the match to the end of the string (represented by $) consists of 1-14 characters of any kind.
So your final regex can look like
String regex = "(?=.{1,14}$)[a-zA-Z]+([-_][a-zA-Z]*)?";
Demo
String [] data = {
"Hello-Again",
"ThisIsValid",
"AlsoThis_",
"-notvalid",
"Not-Allowed-This",
"Nor-This_thing",
"VeryVeryLongStringIndeed",
};
for (String s : data)
System.out.println(s + " : " + s.matches(regex));
Output:
Hello-Again : true
ThisIsValid : true
AlsoThis_ : true
-notvalid : false
Not-Allowed-This : false
Nor-This_thing : false
VeryVeryLongStringIndeed : false

Why does "3.5".matches("[0-9]+") return false?

I use the method String.matches(String regex) to find out whether a string matches a regular expression.
From my point of view, the regular expression regex="[0-9]+" means a String that contains at least one digit between 0 and 9.
But when I debug "3.5".matches("[0-9]+") it returns false.
So what is wrong?

matches determines if the regex matches the whole string. It won't return true if the string contains a match.
To test if the string contains a match to a given regex, use Pattern.compile(regex).matcher(string).find().
(Your regex, [0-9]+, will match any string that contains only digits from 0 to 9, and at least one digit. It doesn't magically match against any real number. If you want something matching any real number, look at e.g. the Javadoc for Double.valueOf(String), which specifies a regex used in validating doubles. That regex allows hexadecimal input, NaNs, and infinities, but it should give you a better idea of what's required.)
Alternately, edit the regex so it directly matches any string containing one or more digits, e.g. .*[0-9]+.* would do the job.
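The difference between the two calls can be seen in a short sketch like this. MatchesVsFind and its method names are invented for illustration; matches() and find() are the standard Matcher methods.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // matches(): the whole string must consist of digits.
    static boolean isAllDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    // find(): it is enough that some part of the string is a run of digits.
    static boolean containsDigits(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("3.5"));    // prints false, the dot is not a digit
        System.out.println(containsDigits("3.5")); // prints true, "3" is found
        System.out.println(isAllDigits("35"));     // prints true
    }
}
```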

If you want to match decimal numbers, your regex needs to be \d*\.?\d+. If you want negatives as well, then \-?\d*\.?\d+.

. is not 0-9 and matches tests the entire string.
