Regex for finding between 1 and 3 characters in a string - java

I am trying to write a regex that should return true if [A-Za-z] occurs between 1 and 3 times, but I am not able to do this.
public static void main(String[] args) {
String regex = "(?:([A-Za-z]*){3}).*";
String regex1 = "(?=((([A-Za-z]){1}){1,3})).*";
Pattern pattern = Pattern.compile(regex);
System.out.println(pattern.matcher("AD1CDD").find());
}
Note: for consecutive 3 characters I am able to write it, but what I want to achieve is the occurrence should be between 1 and 3 only for the entire string. If there are 4 characters, it should return false. I have used look-ahead to achieve this

If I understand your question correctly, you want to check if
1 to 3 characters of the range [a-zA-Z] are in the string
Any other character can occur arbitrarily often?
First of all, just counting the characters and not using a regular expression is more efficient, as this is not a regular language problem, but a trivial counting problem. There is nothing wrong with using a for loop for this problem (except that interpreters such as Python and R can be fairly slow).
Nevertheless, you can (ab-) use extended regular expressions:
^([^A-Za-z]*[A-Za-z]){1,3}[^A-Za-z]*$
This is fairly straightforward, once you also model the "other" characters. And that is what you should do to define a pattern: model all accepted strings (i.e. the entire "language"), not only those characters you want to find.
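As a minimal sketch of how the pattern above might be used from Java (the class and method names are illustrative, not part of the original answer): since `Matcher.matches()` already requires the whole input to match, the anchors are kept only for readability.

```java
import java.util.regex.Pattern;

class OneToThreeLettersRegex {
    // Models the whole string: one to three letters, each optionally preceded
    // by non-letters, followed by any trailing non-letters.
    private static final Pattern ONE_TO_THREE =
            Pattern.compile("^([^A-Za-z]*[A-Za-z]){1,3}[^A-Za-z]*$");

    // Hypothetical helper name, chosen for this example only.
    static boolean hasOneToThreeLetters(String s) {
        return ONE_TO_THREE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasOneToThreeLetters("AD1CDD")); // five letters
        System.out.println(hasOneToThreeLetters("1A2b3"));  // two letters
    }
}
```

Because the two character classes are disjoint, the nested quantifier should not cause heavy backtracking here.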
Alternatively, you can "findAll" matches of ([A-Za-z]), and look at the length of the result. This may be more convenient if you also need the actual characters.
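Java has no built-in "findAll", so a rough equivalent is to loop over `Matcher.find()` and count. The sketch below assumes only the count is needed; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterCounter {
    private static final Pattern LETTER = Pattern.compile("[A-Za-z]");

    // Counts every single-letter match in the input.
    static int countLetters(String s) {
        Matcher m = LETTER.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True when the input contains between one and three letters in total.
    static boolean hasOneToThreeLetters(String s) {
        int n = countLetters(s);
        return n >= 1 && n <= 3;
    }
}
```

If the matched characters themselves are needed, `m.group()` could be collected into a list inside the loop instead of incrementing a counter.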
The for loop would look something like this:
public static boolean containsOneToThreeAlphabetic(String str) {
int matched = 0;
for(int i=0; i<str.length(); i++) {
char c = str.charAt(i);
if ((c>='A' && c<='Z') || (c>='a' && c<='z')) matched++;
}
return matched >=1 && matched <= 3;
}
This is straightforward, readable, extensible, and efficient (in compiled languages). You can also add an if (matched>=4) return false; (or break) to stop early.

Please, stop playing with regex, you'll complicate not only your own life, but the life of the people, who have to handle your code in the future. Choose a simpler approach, find all [A-Za-z]+ strings, put them into the list, then check every string, if the length is within 1 and 3 or beyond that.

Regex
/([A-Za-z])(?=(?:.*\1){3})/s
Looking for a char and for 3 repetitions of it. So if it matches there are 4 or more equal chars present.
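Note that this answer reads the question as "the same letter occurring four or more times". A possible Java translation, assuming the `/s` flag maps to `Pattern.DOTALL` and using illustrative names, might look like this:

```java
import java.util.regex.Pattern;

class RepeatedLetterCheck {
    // A letter that is followed, anywhere later in the input, by three more
    // copies of itself. DOTALL lets '.' also cross line terminators.
    private static final Pattern FOUR_OF_A_KIND =
            Pattern.compile("([A-Za-z])(?=(?:.*\\1){3})", Pattern.DOTALL);

    // find() is used because the pattern only describes part of the input.
    static boolean hasLetterRepeatedFourTimes(String s) {
        return FOUR_OF_A_KIND.matcher(s).find();
    }
}
```

A caller would negate the result to accept strings in which no letter occurs more than three times.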


Porting Twemoji regex to extract Unicode emojis in Java

I'm trying to identify the same emojis in a String for extraction that Twemoji would, using Java. A straight up port isn't working for a great deal of emojis - I think I've identified the issue, so I'll give it in an example below:
Suppose we have the emoji 🪔 (code units \ud83e\ude94). In JavaScript regex, this is captured by \ud83e[\ude94-\ude99], which first matches \ud83e and then finds the subsequent \ude94 within the range given inside the brackets. The same expression in Java regex, however, fails to match at all. If I modify the Java pattern to [\ud83e[\ude94-\ude99]], then according to an online engine the second half is captured, but not the first.
My working theory is that Java encounters the brackets and treats everything inside as a single codepoint and when combined with the outside codeunit, thinks it's looking for two codepoints instead of one. Is there an easy way to fix this or the regex pattern to work around it? The obvious fix would be to use something like [\ud83e\ude94-\ud83e\ude99], the actual regex pattern is quite lengthy. I wonder if there might be an easy encoding fix somewhere here as well.
Toy sample below:
public static void main(String[] args) {
String emojiPattern = "\ud83e[\ude94-\ude99]";
String raw = "\ud83e\ude94";
Pattern pattern = Pattern.compile(emojiPattern);
Matcher matcher = pattern.matcher(raw);
System.out.println(matcher.matches());
}
If you're trying to match a single specific codepoint, don't mess with surrogate pairs; refer to it by number:
String emojiPattern = "\\x{1FA94}";
or by name:
String emojiPattern = "\\N{DIYA LAMP}";
If you want to match any codepoint in the block U+1FA94 is in, use the name of the block in a property atom:
String emojiPattern = "\\p{blk=Symbols and Pictographs Extended-A}";
If you switch out any of these three regular expressions your example program will print 'true'.
The problem you're running into is that a UTF-16 surrogate pair is a single codepoint, and the RE engine matches codepoints, not code units; you can't match just the low or high half. For example, the pattern "\ud83e" alone will fail to match too (when used with Matcher#find instead of Matcher#matches, of course). It's all or none.
To do the kind of ranged matching you want, you have to turn away from regular expressions and look at the code units directly. Something like
char[] codeUnits = raw.toCharArray();
for (int i = 0; i < codeUnits.length - 1; i++) {
if (codeUnits[i] == 0xD83E &&
(codeUnits[i + 1] >= 0xDE94 && codeUnits[i + 1] <= 0xDE99)) {
System.out.println("match");
}
}
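If what is really wanted is the range of code points U+1FA94 to U+1FA99 rather than raw code units, one alternative that may be worth trying is a code-point range inside a character class, using the same `\x{...}` notation shown earlier. This is a sketch that is not taken from the original answer; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class CodePointRange {
    // The range is written in code points (U+1FA94..U+1FA99), not surrogate halves.
    private static final Pattern LAMP_RANGE =
            Pattern.compile("[\\x{1FA94}-\\x{1FA99}]");

    // True if any code point of the input falls inside the range.
    static boolean containsCodePointInRange(String s) {
        return LAMP_RANGE.matcher(s).find();
    }
}
```

A long surrogate-based pattern ported from JavaScript would need each `high[low-low]` fragment rewritten this way, which is an assumption about how the port could be approached rather than a tested conversion recipe.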

Java regex - erase characters followed by \b (backspace)

I have a string constructed from user keyboard types, so it might contain '\b' characters (backspaces).
I want to clean the string, so that it will not contain the '\b' characters, as well as the characters they are meant to erase. For instance, the string:
String str = "\bHellow\b world!!!\b\b\b.";
Should be printed as:
Hello world.
I have tried a few things with replaceAll, and what I have now is:
System.out.println(str.replaceAll("^\b+|.\b+", ""));
Which prints:
Hello world!!.
Single '\b' is handled fine, but multiples of it are ignored.
So, can I solve it with Java's regex?
EDIT:
I have seen this answer, but it does not seem to apply to Java's replaceAll.
Maybe I'm missing something with the verbatim string...
It can't be done in one pass unless there is a practical limit on the number of consecutive backspaces (which there isn't), and there is a guarantee (which there isn't) that there are no "extra" backspaces for which there is no preceding character to delete.
This does the job (it's only 2 small lines):
while (str.contains("\b"))
str = str.replaceAll("^\b+|[^\b]\b", "");
This handles the edge case of input like "x\b\by" which has an extra backspace at the start, which should be trimmed once the first one consumes the x, leaving just "y".
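Wrapped in a method, the loop above can be exercised against both the question's input and the edge case just mentioned. The class and method names below are illustrative only.

```java
class BackspaceLoop {
    static String applyBackspaces(String str) {
        // Each pass removes leading backspaces and every
        // "character followed by backspace" pair; repeat until none is left.
        while (str.contains("\b")) {
            str = str.replaceAll("^\b+|[^\b]\b", "");
        }
        return str;
    }

    public static void main(String[] args) {
        System.out.println(applyBackspaces("\bHellow\b world!!!\b\b\b."));
        System.out.println(applyBackspaces("x\b\by"));
    }
}
```

The loop always makes progress: a string that still contains a backspace either starts with one or has a non-backspace character directly before the first one.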
This looks like a job for Stack!
Stack<Character> stack = new Stack<Character>();
// for-each character in the string
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
// push if it's not a backspace
if (c != '\b') {
stack.push(c);
// else pop if possible
} else if (!stack.empty()) {
stack.pop();
}
}
// convert stack to string
StringBuilder builder = new StringBuilder(stack.size());
for (Character c : stack) {
builder.append(c);
}
// print it
System.out.println(builder.toString());
Regex, while nice, isn't well suited to every task. This approach is not as concise as Bohemian's, but it is more efficient. Using a stack is O(n) in every case, while a regex approach like Bohemian's is O(n^2) in the worst case.
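A variation on the same idea, offered only as a sketch: a StringBuilder can play the role of the stack, which avoids boxing each char and the second pass that converts the stack to a string. This swaps the data structure but keeps the push/pop logic; the names are illustrative.

```java
class BackspaceBuilder {
    static String applyBackspaces(String str) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c != '\b') {
                out.append(c);                   // "push" an ordinary character
            } else if (out.length() > 0) {
                out.setLength(out.length() - 1); // "pop" the last kept character
            }
            // A backspace with nothing before it is simply dropped.
        }
        return out.toString();
    }
}
```

Like the stack version, this works on UTF-16 chars, so a backspace after a surrogate pair would remove only half of it.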
The problem you are trying to solve can't be solved with a single regular expression. The language {any_symbol}*{any_symbol}^n{\b}^n (a special case of your input) is not regular. Recognizing it requires storing state somewhere (how many symbols and how many \b characters have been read), which a DFA cannot do, because it cannot know in advance how many consecutive \b characters will follow. All proposed solutions are just regexes for your particular case ("\bHellow\b world!!!\b\b\b.") and can easily be broken by a more complicated test.
The easiest solution for your case is to repeatedly replace, in a loop, each pair of the form {any character except \b}{\b}.
UPD: The solution proposed by @Bohemian seems perfectly correct.
UPD 2:
It seems that Java's regexes can handle not only regular languages but also inputs like {a}^n{b}^n, using recursive lookahead, so in Java it is possible to match those groups with a single regex.
Thanks to @Pshemo for the comments and @Elist for the edits!
If I understand the question correctly, this is the solution to your question:
String str = "\bHellow\b world!!!\b\b\b.";
System.out.println(str.replace(".?\\\b", ""));
This has been a nice riddle. I think you can use a regex to remove the same number of identical repeated characters and \bs (i.e. for your particular input string):
String str = "\bHellow\b world!!!\b\b\b.";
System.out.println(str.replaceAll("^\b+|(?:([^\b])(?=\\1*+(\\2?+\b)))+\\2", ""));
This is an adaptation of How can we match a^n b^n with Java regex?.
See IDEONE demo, where I added .replace("\b","<B>")); to see if there are any \bs left.
Output:
Hello world.
A generic regex-only solution is outside of regex scope... for now.

How to match the first character in a String with a regexp?

I need a regular expression to evaluate if the first character of a word is a lowercase letter or not.
I have this java code: Character.toString(charcter).matches("[a-z?]")
For example if I have those words the result would be:
a13 => true
B54 => false
&32 => false
I want to match only one letter and I don't know if I need to use "?", "." or "{1}" after or inside "[a-z]"
There is a built in way to do this without regexes.
Character.isLowerCase(string.charAt(0))
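A slightly more defensive sketch of the same call, assuming empty or null input should count as "not lowercase" (the wrapper name is made up for this example):

```java
class FirstCharCheck {
    static boolean startsWithLowercase(String word) {
        // charAt(0) throws on an empty string, so guard against that first.
        return word != null
                && !word.isEmpty()
                && Character.isLowerCase(word.charAt(0));
    }
}
```

One difference from `[a-z]` worth keeping in mind: `Character.isLowerCase` also returns true for non-ASCII lowercase letters such as 'é'.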
Please use this for your needs: /^[a-z]/
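In Java the slashes are dropped and the pattern has to be applied with `find()`, because `String.matches` requires the entire string to match (so `"a13".matches("[a-z]")` is false). A minimal sketch with illustrative names:

```java
import java.util.regex.Pattern;

class FirstCharRegex {
    private static final Pattern STARTS_LOWER = Pattern.compile("^[a-z]");

    static boolean startsWithLowercase(String word) {
        // find() searches anywhere in the input; the ^ anchor pins the
        // match to the very first character.
        return STARTS_LOWER.matcher(word).find();
    }
}
```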
You want to match if there's exactly one lowercase letter. As @Luiggi Medonza stated, you really do/should not need Regular Expressions for this, but if you want to use them, you most likely want this pattern:
[a-z]{1}
What ? does is an optional match. You want a strict match of length 1, so you need {1}.
@Ted Hopp mentioned that you don't need the {1}. Since matches() has to match the entire string, and the rest of the word may be empty, your entire match should look like this:
entire_string.matches("^[a-z].*$")
Again, using built-in string methods will be much faster/better to use.
Here I had a similar requirement: the first character of a string should be a letter from a-z or A-Z, and after that the user can type anything, such as numbers or a limited set of symbols.
Solution
public static boolean designationValidate(String n) {
int l = n.length();
if (l >= 4) {
Pattern pattern = Pattern.compile("^[a-zA-Z][a-zA-Z0-9-() ]*$");
Matcher matcher = pattern.matcher(n);
return (matcher.find() && matcher.group().equals(n));
} else
return false;
}
In the above example I am validating that the string is at least 4 characters long and starts with a letter. If you want to allow any other symbols, you can add them to the character class.
The method returns true if the expression matches; otherwise it returns false.
I hope this is helpful to you.

Need help for writing regular expression

I am weak at writing regular expressions, so I'm going to need some help with this one. I need a regular expression that can validate that a string is a set of letters (the letters must be unique) delimited by commas.
Only one character and after that a comma
Examples:
A,E,R
R,A
E,R
Thanks
You can use a repeated group to validate it's a comma separated string.
^[AER](?:,[AER])*$
To also require the characters to be unique (no repeats), you would do something like:
^([AER])(?:,(?!\1)([AER])(?!.*\2))*$
If I understand it correctly, a valid string will be a series (possibly zero long) of two-character patterns, where each pattern is a letter followed by a comma; finally followed at the end by one letter.
Thus:
"^([A-Za-z],)*[A-Za-z]$"
EDIT: Since you've clarified that the letters have to be A, E, or R:
"^([AER],)*[AER]$"
Something like this "^([AER],)*[AER]$"
#Edit: regarding the uniqueness, if you can drop the "last character cannot be a comma" requirement (which can be checked before the regex anyway in constant time) then this should work:
"^(?:([AER],?)(?!.*\\1))*$"
This will match A,E,R, hence you need that check before performing the regex. I do not take responsibility for the performance but since it's only 3 letters anyway...
The above is a java regex obviously, if you want a "pure one" ^(?:([AER],?)(?!.*\1))*$
#Edit2: sorry, missed one thing: this actually requires that check and then you need to add a comma at the end since otherwise it will also match A,E,E. Kind of limited I know.
My own ugly but extensible solution, which will disallow leading and trailing commas, and checks that the characters are unique.
It uses a forward-declared backreference: note how the second capturing group comes after the reference made to it, (?!.*\2). On the first repetition, since the second capturing group hasn't captured anything yet, Java treats any attempt to reference the text matched by the second capturing group as a failure.
^([AER])(?!.*\1)(?:,(?!.*\2)([AER]))*+$
Demo on regex101 (PCRE flavor has the same behavior for this case)
Demo on RegexPlanet
Test cases:
A,E,R
A,R,E
E,R,A
A
R,E
R
E
A,
A,R,
A,A,R
E,A,E
A,E,E
X,R,E
R,A,E,
,A
AA,R,E
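To try the test cases locally, a small harness along these lines could be used. It is only a sketch: the class and method names are illustrative, and it relies on the behavior described above, namely that a backreference to a group that has not captured anything fails in Java.

```java
import java.util.regex.Pattern;

class UniqueLettersRegex {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern UNIQUE_AER =
            Pattern.compile("^([AER])(?!.*\\1)(?:,(?!.*\\2)([AER]))*+$");

    static boolean isValid(String s) {
        return UNIQUE_AER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] cases = {"A,E,R", "A", "R,E", "A,", "A,A,R", "A,E,E", ",A", "AA,R,E"};
        for (String c : cases) {
            System.out.println(c + " -> " + isValid(c));
        }
    }
}
```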
Note: I'm going to answer the original question. That is, I don't care if the elements repeat.
We've had several suggestions for this regex:
^([AER],)*[AER]$
Which does indeed work. However, to match a String, it first has to back up one character because it will find that there is no , at the end. So we switch it for this to increase performance:
^[AER](,[AER])*$
Notice that this will match a correct String the very first time it attempts to. But also note that we don't need to worry about the ( )* backing up at all; it will either match the first time, or it won't match the String at all. So we can further improve performance by using a possessive quantifier:
^[AER](,[AER])*+$
This will take the whole String and attempt to match it. If it fails, then it stops, saving time by not doing useless backing up.
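For reference, a minimal Java sketch of the possessive version (names are illustrative). As stated in the note above, it deliberately does not check for repeated elements.

```java
import java.util.regex.Pattern;

class CommaListRegex {
    // *+ is the possessive form of *: once the group has consumed as many
    // ",X" pairs as it can, the engine does not back up into them.
    private static final Pattern AER_LIST = Pattern.compile("^[AER](,[AER])*+$");

    static boolean isCommaList(String s) {
        return AER_LIST.matcher(s).matches();
    }
}
```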
If I were trying to ensure the String had no repeated elements, I would not use regex; it just complicates things. You end up with less-readable code (sadly, most people don't understand regex) and, oftentimes, slower code. So I would build my own validator:
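import java.util.HashSet;

public static boolean isCommaDelimitedSet(String toValidate, HashSet<Character> toMatch) {
// Letters sit at even indexes and commas at odd ones, so a valid string has odd length.
if (toValidate.length() % 2 == 0) return false;
HashSet<Character> seen = new HashSet<>(); // letters encountered so far
for (int index = 0; index < toValidate.length(); index++) {
if (index % 2 == 0) {
char c = toValidate.charAt(index);
// Reject letters that are not allowed or that repeat (add returns false on a duplicate).
if (!toMatch.contains(c) || !seen.add(c)) return false;
} else {
if (toValidate.charAt(index) != ',') return false;
}
}
return true;
}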
This assumes that you want to be able to pass in a set of characters that are allowed. If you don't want that and have explicit chars you want to match, change the membership test in the if (index % 2 == 0) block to:
char c = toValidate.charAt(index);
if (c != 'A' && c != 'E' && c != 'R' /* and so on */) return false;

Checking on 3 criteria

I am trying to check 3 conditions to validate a car plate number, but I can't seem to check all 3 of them: the length must be between 4 and 7; the first 3 characters must be letters a-z; the fourth character onwards must be digits '0'-'9'.
I also have a problem with the next part of my question. I need to implement a computeCheckDigit method. I have tried to put the argument into an array so that I can follow the step-by-step instructions to compute the check digit.
The steps are:
1. Take the 2nd and 3rd characters and convert them to the numbers that correspond to their position in the alphabet, e.g. A is 1, B is 2.
2. Add 0s to the front of the number part if it has fewer than 4 digits, e.g. for SBA123 the number part becomes 0123.
3. Multiply each of the numbers from steps 1 and 2 by 14, 2, 12, 2, 11, 1 respectively.
4. Sum up the numbers from step 3.
5. Divide the sum from step 4 by 19, take the remainder, and look up the check digit in a table.
Any help to get me started would be great.
Below is my code after the changes.
Kindly point out my mistake.
public static void validateCarPlate(String y)throws InvalidCarPlateException{
String rex = "[a-zA-Z]{3}[0-9]{1,4}";
if(y.matches(rex)){
computeCheckDigit(y);
}else{
throw new InvalidCarPlateException();
}
}
public static void computeCheckDigit(String x){
int [] arr = Integer.parseInt(x);
}
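The check-digit steps listed in the question could be sketched roughly as follows. This is an interpretation, not a confirmed specification: it assumes the six weights apply, in order, to the two letter values followed by the four padded digits, and that the input has already passed the [a-zA-Z]{3}[0-9]{1,4} validation. The lookup table is not given in the question, so the sketch stops at the remainder. The class and method names are hypothetical.

```java
import java.util.Locale;

class CarPlateCheck {
    // Weights from step 3, assumed to apply to: 2nd letter, 3rd letter, 4 digits.
    private static final int[] WEIGHTS = {14, 2, 12, 2, 11, 1};

    // Returns the remainder from step 5; mapping it to a check letter needs
    // the lookup table, which is not part of the question.
    static int computeCheckRemainder(String plate) {
        String p = plate.toUpperCase(Locale.ROOT);
        int[] values = new int[6];
        // Step 1: 2nd and 3rd letters as A=1, B=2, ...
        values[0] = p.charAt(1) - 'A' + 1;
        values[1] = p.charAt(2) - 'A' + 1;
        // Step 2: left-pad the numeric part with zeros to four digits.
        String digits = p.substring(3);
        while (digits.length() < 4) {
            digits = "0" + digits;
        }
        for (int i = 0; i < 4; i++) {
            values[2 + i] = digits.charAt(i) - '0';
        }
        // Steps 3 to 5: weighted sum, then remainder modulo 19.
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * WEIGHTS[i];
        }
        return sum % 19;
    }
}
```

For SBA123 the values are 2, 1, 0, 1, 2, 3, which gives 28 + 2 + 0 + 2 + 22 + 3 = 57, and 57 leaves remainder 0 when divided by 19.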
The use of Regular Expressions would be ideal here. Regular Expressions are funny looking, well constructed strings that represent a Finite State Machine that recognizes certain types of strings as matching a pattern or not matching. Learning about regular expressions will greatly improve your string matching/validation processes.
This is the RegEx you should use: ^[a-zA-Z]{3}[0-9]{1,4}$
Let's break down what this funny looking string means:
^ : This is the start of the string (no characters before it)
[a-zA-Z] : Alphabetic characters
{3} : Exactly 3 of these alphabetic characters
[0-9] : Then numeric characters
{1,4} : Between 1 and 4 of these numeric characters (inclusively)
$ : This is the end of the string (no characters remaining)
An example usage:
String myStr = "abc123";
System.out.println(isValidString(myStr));
public boolean isValidString(String input) {
String regex = "^[a-zA-Z]{3}[0-9]{1,4}$";
if(input==null) { return false; }
return input.trim().matches(regex);
}
You can do this very easily using regex. The expression
^[a-z]{3}[0-9]{1,4}$
would work.
Here is an example
public boolean validatePlate(final String string) {
final Matcher matcher = Pattern.compile("^[a-z]{3}[0-9]{1,4}$").matcher(string);
return matcher.matches();
}
You are always testing the first character of your string (charAt(0)) instead of using the value of the loop counter i that you set up. Also, you have no test for the digits.
You could also look into "String.indexOf()"; it would save you having to loop through (or initialize) an array of chars. Have a string "abcdefg..." (and another "01234...").
You could also look into the methods Character.isLetter() and Character.isDigit() and do away with the arrays and the strings-treated-like-arrays.
As for regular expressions, I always like the old joke: "Say you have a problem, and you decide to solve it with regular expressions. Now you have two problems..." Of course they're useful, but not nearly as much as many people seem to think they are. And not everything that CAN be solved with them SHOULD be solved with them. If you're interested, this is a nice simple regular expression problem to get started with. If you're not, don't feel like your solution is lacking somehow.
