I need help coming up with a regular expression to match if a string has more than one occurrence of a character. I have already validated the lengths of the two strings, and they will always be equal. Here's what I mean: take the strings "aab" and "abb". Both should match the regular expression because they have repeating characters: the "aa" in the first string and the "bb" in the second.
Since you say "aba"-style repetition doesn't count, back-references should make this simple:
(.)\1+
would find runs of the same character repeated back to back. Try it out:
java.util.regex.Pattern.compile("(.)\\1+").matcher("b").find(); // false
java.util.regex.Pattern.compile("(.)\\1+").matcher("bbb").find(); // true
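To make the difference between adjacent and non-adjacent repeats concrete, here is a minimal sketch. The class and method names are made up for illustration, and the second pattern, "(.).*\\1", is only an assumption about what you would want if "aba"-style repetition did count.

```java
import java.util.regex.Pattern;

class RepeatCheck {
    // A character immediately followed by itself, e.g. "aa" in "aab".
    private static final Pattern ADJACENT = Pattern.compile("(.)\\1");
    // A character that shows up again anywhere later, e.g. the two a's in "aba".
    private static final Pattern ANYWHERE = Pattern.compile("(.).*\\1");

    static boolean hasAdjacentRepeat(String s) {
        return ADJACENT.matcher(s).find();
    }

    static boolean hasAnyRepeat(String s) {
        return ANYWHERE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasAdjacentRepeat("aab")); // true
        System.out.println(hasAdjacentRepeat("aba")); // false
        System.out.println(hasAnyRepeat("aba"));      // true
        System.out.println(hasAnyRepeat("abc"));      // false
    }
}
```

Both helpers use find(), so the repeat may sit anywhere in the string.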
If you're checking anagrams maybe a different algorithm would be better.
If you sort your strings (both the original and the candidate), checking for anagrams can be done with a string comparison.
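A rough sketch of the sort-and-compare idea, assuming plain char-by-char comparison is good enough for your input (the class and method names are hypothetical):

```java
import java.util.Arrays;

class AnagramCheck {
    // Returns the characters of s in sorted order.
    static String sortChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Two strings are anagrams if their sorted characters are identical.
    static boolean isAnagram(String a, String b) {
        return a.length() == b.length() && sortChars(a).equals(sortChars(b));
    }

    public static void main(String[] args) {
        System.out.println(isAnagram("aab", "aba")); // true
        System.out.println(isAnagram("aab", "abb")); // false
    }
}
```

No regex is needed for this check; the length test is just a cheap early exit.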
static final String REGEX_MORE_THAN_ONE_OCCURANCE_OF_B = "([b])\\1{1,}";
static final String REGEX_MORE_THAN_ONE_OCCURANCE_OF_B_AS_PREFIX_TO_A = "(b)\\1+([a])";
How can I understand the output of the code below? The code's first four print statements are about capturing groups in Java regular expressions, and the rest of the code is about the Pattern split method. I referred to a few documents to understand the code's output, but could not figure out how exactly it works and why it shows this output.
Java Code
import java.util.*;
import java.util.regex.*;
import java.lang.*;
import java.io.*;
/* Name of the class has to be "Main" only if the class is public. */
public class Codechef
{
    public static void main(String[] args) {
        //Capturing Group in Regular Expression
        System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
        System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
        System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
        System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
        // using pattern split method
        Pattern pattern = Pattern.compile("\\W");
        String[] words = pattern.split("one#two#three:four$five");
        System.out.println(words);
        for (String s : words) {
            System.out.println("Split using Pattern.split(): " + s);
        }
    }
}
Edit-1
Queries
As for capturing groups, I cannot figure out what the use of ‘\1’ or ‘\2’ is here. How do these evaluate to true or false?
As for the Pattern split method, I wish to know how the string split happens. How does this split method work differently from a normal string split method?
The first console print lines...
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
utilize the matches() method, which always returns a boolean (true or false). This method is mostly used for String validation of one sort or another. The first and second examples both use the regular expression "(\\w\\d)\\1". Run that expression against the two supplied strings ("a2a2" and "a2b2") through the matches() method, as they have done, and you will get a boolean true and a false, in that order.
The real key here is knowing what that particular Regular Expression is supposed to validate. The expression above works with only 1 Capturing Group, which is denoted by the parentheses. The \\w matches any single word character, which is a-z, A-Z, 0-9 or _ (the underscore character). The \\d matches a single digit from 0 to 9.
Note: In reality the expression metacharacters are written as \w and \d, but because the Escape Character (\) in Java Strings needs to be escaped itself, you have to add an additional Escape Character.
The \1 is a backreference: it checks whether the same text as most recently matched by the 1st capturing group appears again at this point. Since there is only one capturing group specified, 1 is the only useful value here. A value of 0 does not refer to a capturing group (in Java regular expressions a backslash followed by 0 starts an octal escape), and any value greater than 1 would refer to a capturing group that does not exist, which defeats the purpose here.
Bottom line, the expression walks through the supplied string like this:
Is the first character (\\w) an upper or lower case letter A to Z, an underscore, or a digit from 0 to 9? If it isn't, there is no match and boolean false is returned. If it is, then...
Is the second character (\\d) a digit from 0 to 9? If it isn't, boolean false is returned. If it is, then...
Are the remaining characters exactly the same two characters as the first two (including letter case if a-z or A-Z are used)? If the remaining two characters are not identical to the first two, or there are more than two remaining characters, boolean false is returned. If they are identical, boolean true is returned.
Basically, the expression is merely used to validate that the Last Two characters within the supplied String match the First Two characters of the same supplied String. This is why the second console print:
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
returns a boolean false, b2 is not the same as a2 whereas in the first console print:
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
the Last Two characters a2 do indeed match the First Two characters a2 and therefore boolean true is returned.
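If it helps to see what the group actually captured, here is a small sketch using Matcher instead of Pattern.matches(). The helper name is made up for this example; Matcher.matches() and Matcher.group(int) are the standard java.util.regex calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Returns the text captured by group 1 if the whole input matches, else null.
    static String capturedIfMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capturedIfMatches("(\\w\\d)\\1", "a2a2"));   // a2
        System.out.println(capturedIfMatches("(\\w\\d)\\1", "a2b2"));   // null
        // Extra characters after the repeat make the whole-string match fail.
        System.out.println(capturedIfMatches("(\\w\\d)\\1", "a2a2a2")); // null
    }
}
```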
You will now notice that in the other two console prints:
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
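the Regular Expression used contains 2 Capture Groups (two sets of parentheses). The same sort of matching applies here, but against two capture groups instead of one: after (AB) and (B\\d) have matched, \\2 must repeat the text captured by the second group and \\1 must then repeat the text captured by the first group. In "ABB2B3AB" the second group captures B2 but is followed by B3, so the match fails.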
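As a rough illustration for the two-group expression, this sketch reports what each group captured. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroupsDemo {
    private static final Pattern TWO_GROUPS = Pattern.compile("(AB)(B\\d)\\2\\1");

    // Describes the captured groups, or reports that the input did not match.
    static String describe(String input) {
        Matcher m = TWO_GROUPS.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        return "group1=" + m.group(1) + ", group2=" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("ABB2B2AB")); // group1=AB, group2=B2
        System.out.println(describe("ABB2B3AB")); // no match
    }
}
```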
If you want to see how these Regular Expressions play out and get explanations on what the expressions mean then use Regular Expression Tester at regex101.com. This is also a good Regular Expressions resource.
Pattern.split():
In this case, the use of the Pattern.split() method is a little overkill in my opinion, since String.split() accepts Regular Expressions, but it does have its purpose in other areas. Nevertheless, it is a good example of how it can be used. The .split() method is used here to split the String that was supplied to it wherever the Regular Expression held by the Pattern matches, which in this case is "\\W" (otherwise: \W). The \W (uppercase W) means 'match any non-word character', that is, any character which is not a-z, A-Z, 0-9 or _. This expression is basically the opposite of "\w" (with the lowercase w). The characters #, : and $ contained within the supplied String (and yes, also the comma, semicolon, exclamation mark, etc., had they been there):
"one#two#three:four$five"
are considered non-word characters and therefore the split is carried out on any one of them resulting in a String Array containing:
[one, two, three, four, five]
The very same thing can be accomplished using the String.split() method, since this method allows for a Regular Expression to be applied:
String[] s = "one#two#three:four$five".split("\\W");
or even:
String[] s = "one#two#three:four$five".split("[#:$]");
or even:
String[] s = "one#two#three:four$five".split("#|:|\\$");
// The $ character is a reserved RegEx symbol and therefore
// needs to be escaped.
or on and on and on...
Yup... "\\W" is easier since it covers all non-word characters. ;)
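For comparison, a minimal sketch that runs both variants side by side (class and method names are made up). Arrays.toString() is used because printing a String[] directly, as the question's code does with System.out.println(words), only shows the array's type and hash code.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once; handy when the same pattern is applied to many strings.
    private static final Pattern NON_WORD = Pattern.compile("\\W");

    static String[] splitWords(String input) {
        return NON_WORD.split(input);
    }

    public static void main(String[] args) {
        String input = "one#two#three:four$five";
        String[] viaPattern = splitWords(input);
        String[] viaString = input.split("\\W");
        System.out.println(Arrays.toString(viaPattern));          // [one, two, three, four, five]
        System.out.println(Arrays.equals(viaPattern, viaString)); // true
    }
}
```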
As for capturing groups, I cannot figure out what the use of ‘\1’ or ‘\2’ is here. How do these evaluate to true or false?
Answer:
\\1 repeats the first captured group (i.e. a2 captured by (\\w\\d))
\\2 repeats the second captured group (i.e. B2 captured by (B\\d))
The actual name for those constructs is backreferences:
The section of the input string matching the capturing group(s) is saved in memory for later recall via backreference. A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled.
As for the Pattern split method, I wish to know how the string split happens. How does this split method work differently from a normal string split method?
Answer:
The split() method in the Pattern class splits a text into an array of Strings, using the regular expression (the pattern) as the delimiter.
Rather than splitting on a fixed string or character, here you provide a regex, which is much more powerful and flexible. Note that in Java String.split(String regex) takes a regex as well, so the main difference is that a Pattern can be compiled once and reused.
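One practical consequence of the delimiter being a regex, sketched below with hypothetical helper names: a delimiter such as "." has to be quoted if you mean it literally. Pattern.quote() is the standard way to do that.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class FixedVersusRegexSplit {
    // Treats the delimiter as a regular expression.
    static String[] splitByRegex(String input, String regex) {
        return input.split(regex);
    }

    // Treats the delimiter as literal text by quoting it first.
    static String[] splitByLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // As a regex, "." matches every character, so only empty pieces remain
        // and split() drops them all.
        System.out.println(Arrays.toString(splitByRegex("a.b.c", ".")));   // []
        // Quoted, "." is just a dot.
        System.out.println(Arrays.toString(splitByLiteral("a.b.c", "."))); // [a, b, c]
    }
}
```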
How to repeat every character of a given String in java?
For example:
String s = "Hello";
Becomes:
s = "HHeelllloo";
Use regex!
s = s.replaceAll(".", "$0$0");
OK, so how does this work?
The replaceAll() method takes a regex as the search term, and a dot matches every character. So every character will be replaced.
The replacement term can contain back references to captured groups, which are coded as $n, where n is 1-9. But there's a special implicit group zero that is the entire match, so $0$0 means "the whole match twice".
Overall, in English this means "replace every character with two copies of itself".
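Here is the same idea as a small runnable sketch (class and method names are made up). The second helper uses an explicit group and $1 instead of $0, which should behave the same for this pattern. One caveat: by default the dot does not match line terminators, so a newline is left as it is.

```java
class DoubleChars {
    // "$0" is the whole match, so every matched character is written twice.
    static String doubleEachChar(String s) {
        return s.replaceAll(".", "$0$0");
    }

    // Same result with an explicit capturing group and "$1".
    static String doubleWithGroup(String s) {
        return s.replaceAll("(.)", "$1$1");
    }

    public static void main(String[] args) {
        System.out.println(doubleEachChar("Hello"));  // HHeelllloo
        System.out.println(doubleWithGroup("Hello")); // HHeelllloo
    }
}
```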
I want to split my string on every occurrence of an alphabetic character.
For example:
"s1l1e13" to an array of: ["s1","l1","e13"]
When trying to use this simple split by regex I get some weird results:
testStr = "s1l1e13"
Arrays.toString(testStr.split("(?=[a-z])"))
gives me the array of:
["","s1","l1","e13"]
How can I create the split without the empty array element?
I tried a couple more things:
testStr = "s1"
Arrays.toString(testStr.split("(?=[a-z])"))
does return the correct array: ["s1"]
But when trying to use substring
testStr = "s1l1e13"
Arrays.toString(testStr.substring(1).split("(?=[a-z])"))
I get in return ["1","l1","e13"]
What am I missing?
Your lookahead matches at each position that comes right before a character from a to z, marking the following positions:
 s1 l1 e13
^  ^  ^
So by splitting using just the lookahead, it returns ["", "s1", "l1", "e13"]
You can add a Negative Lookbehind here. It looks behind to check that the position is not the beginning of the string.
String s = "s1l1e13";
String[] parts = s.split("(?<!\\A)(?=[a-z])");
System.out.println(Arrays.toString(parts)); //=> [s1, l1, e13]
Your problem is that (?=[a-z]) means "place before [a-z]" and in your text
s1l1e13
you have 3 such places. I will mark them with |
|s1|l1|e13
so split (unfortunately correctly) produces "" "s1" "l1" "e13" and doesn't automatically remove the leading empty element for you.
To solve this problem you have at least two options:
make sure that there is something before the place you want to split on (so that it is not at the start of your string). You can use for instance (?<=\\d)(?=[a-z]) if you want to split after a digit but before a letter
(PREFERRED SOLUTION) start using Java 8, which automatically removes the empty string at the start of the result array if the regex used in split matched with zero length there (look-arounds are zero-length).
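A minimal sketch of the first option, assuming every token is a lowercase letter followed by digits as in the question (the class and method names are made up). Because the lookbehind requires a digit before the split point, the start of the string never qualifies.

```java
import java.util.Arrays;

class SplitBeforeLetters {
    // Split between a digit and a following lowercase letter; nothing is consumed.
    static String[] splitTokens(String input) {
        return input.split("(?<=\\d)(?=[a-z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitTokens("s1l1e13"))); // [s1, l1, e13]
        System.out.println(Arrays.toString(splitTokens("s1")));      // [s1]
    }
}
```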
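The first match produces "" because the regex only looks ahead for an alphabetic character. This is called a zero-width lookahead, so it doesn't actually consume anything. The "s" at the beginning is a letter, so the position right before it, at the very start of the string, counts as a split point.
If you want the regex to always consume something, you could use ".+(?=[a-z])", but keep in mind that split() throws away whatever the regex consumes, so for this particular task a pattern that also checks what comes before the split point is the safer choice.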
The problem is that the initial "s" counts as an alphabetic character. So, the regex is trying to split at s.
The issue is that there is nothing before the s, so the regex engine shows that there is nothing there by adding an empty string as the first element. (This does not happen at the end of the string, because split() drops trailing empty strings.)
If this is the only string you're splitting, or if every string you split starts with a letter, just truncate the array to omit the first element. Otherwise, you'll probably need to loop through each array as you make it so that you can drop empty elements.
So it seems your matches have the pattern x###, where x is a letter and each # is a digit.
I'd use the following regex:
([a-z][0-9]+)
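One way to use that regex is to collect matches with Matcher.find() instead of splitting at all. This is only a sketch; the class and method names are made up, and the parentheses are dropped because group() without arguments already returns the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFinder {
    private static final Pattern TOKEN = Pattern.compile("[a-z][0-9]+");

    // Collects every letter-plus-digits token, in order of appearance.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("s1l1e13")); // [s1, l1, e13]
    }
}
```

Since nothing is split, there is no empty leading element to worry about.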
I use the method String.matches(String regex) to find out if a string matches a regular expression.
From my point of view the regular expression regex="[0-9]+" means a String that contains at least one digit between 0 and 9.
But when I debug "3.5".matches("[0-9]+") it returns false.
So what is wrong?
matches determines whether the regex matches the whole string. It won't return true if the string merely contains a match.
To test if the string contains a match to a given regex, use Pattern.compile(regex).matcher(string).find().
(Your regex, [0-9]+, will match any string that contains only digits from 0 to 9, and at least one digit. It doesn't magically match against any real number. If you want something matching any real number, look at e.g. the Javadoc for Double.valueOf(String), which specifies a regex used in validating doubles. That regex allows hexadecimal input, NaNs, and infinities, but it should give you a better idea of what's required.)
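A small sketch of the difference, with made-up helper names around the standard String.matches() and Matcher.find() calls:

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // True only if the regex matches the entire input.
    static boolean wholeStringMatches(String regex, String input) {
        return input.matches(regex);
    }

    // True if the regex matches anywhere inside the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeStringMatches("[0-9]+", "3.5")); // false
        System.out.println(containsMatch("[0-9]+", "3.5"));      // true
        System.out.println(wholeStringMatches("[0-9]+", "35"));  // true
    }
}
```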
Alternately, edit the regex so it directly matches any string containing one or more digits, e.g. .*[0-9]+.* would do the job.
If you want to match decimal numbers, your regex needs to be \d*\.?\d+. If you want negatives as well, then \-?\d*\.?\d+. (In a Java string literal each backslash has to be doubled.)
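As a quick check of that pattern, here is a sketch with a hypothetical helper. It only covers the simple decimal form described above, not exponents or other formats that Double.valueOf accepts.

```java
import java.util.regex.Pattern;

class DecimalCheck {
    // Optional minus sign, optional integer part, optional dot, at least one digit.
    private static final Pattern DECIMAL = Pattern.compile("-?\\d*\\.?\\d+");

    static boolean isDecimal(String s) {
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDecimal("3.5"));  // true
        System.out.println(isDecimal("-3.5")); // true
        System.out.println(isDecimal("35"));   // true
        System.out.println(isDecimal("3."));   // false, digits are required after the dot
    }
}
```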
The "." is not in 0-9, and matches tests the entire string.
I have a string of "abc123(" and want to check if it contains one or more chars that are not a number or letter.
"abc123(".matches("[^a-zA-Z0-9]+"); should return true in this case? But it does not! What's wrong?
My test script:
public class NewClass {
    public static void main(String[] args) {
        if ("abc123(".matches("[^a-zA-Z0-9]+")) {
            System.out.println("true");
        }
    }
}
In Java, the expression has to match the entire string, not just part of it.
myString.matches("regex") returns true or false depending on whether the string can be matched entirely by the regular expression. It is important to remember that String.matches() only returns true if the entire string can be matched. In other words: "regex" is applied as if you had written "^regex$" with start and end of string anchors.
Your expression is looking for part of the string, not the whole thing. You can change your expression to .*YOUR_EXPRESSION.* and it will expand to match the entire string.
Rather than checking to see if it contains only letters and numbers, why not check to see if it contains anything other than that? You can use the non-word class (\W) with find(), which looks for a match anywhere in the string. If that returns true then you know the string contains something other than the characters you are looking for,
java.util.regex.Pattern.compile("\\W").matcher("abc123(").find();
If this returns true then there is something other than just word characters in the string. Note that \W treats the underscore as a word character as well.
The expression [^A-Za-z0-9]+ means 'one or more characters that are not letters or digits'. You probably want to replace it with ^[A-Za-z0-9]+$, which means 'only letters or digits'.
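Both directions of the check can be sketched like this (class and method names are made up): find() answers "does it contain something else?", while matches() answers "is it made up of letters and digits only?".

```java
import java.util.regex.Pattern;

class NonAlnumCheck {
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    // True if at least one character is neither a letter nor a digit.
    static boolean containsNonAlnum(String s) {
        return NON_ALNUM.matcher(s).find();
    }

    // True if the whole string consists of one or more letters or digits.
    static boolean onlyAlnum(String s) {
        return s.matches("[A-Za-z0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(containsNonAlnum("abc123(")); // true
        System.out.println(containsNonAlnum("abc123"));  // false
        System.out.println(onlyAlnum("abc123("));        // false
        System.out.println(onlyAlnum("abc123"));         // true
    }
}
```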