Java regex pattern matching (Irish car registration)


Sorry if this is a dumb question, but it's been driving me mental for the past 5 days.
I'm trying to make a regex pattern to match the Irish car registration example '12-W-1234'
So far this is what I have:
import java.util.ArrayList;
import java.util.List;
public class ValidateDemo {
public static void main(String[] args) {
List<String> input = new ArrayList<String>();
input.add("12-WW-1");
input.add("12-W-223");
input.add("02-WX-431");
input.add("98-zd-4134");
input.add("99-c-7465");
for (String car : input) {
if (car.matches("^(\\d{2}-?\\w*([KK|kk|ww|WW|c|C|ce|CE|cn|CN|cw|CW|d|D|dl|DL|g|G|ke|KE|ky|KY|l|L|ld|LD|lh|LH|lk|LK|lm|LM|ls|LS|mh|MH|mn|MN|mo|MO|oy|OY|so|SO|rn|RN|tn|TN|ts|TS|w|W|wd|WD|wh|WH|wx|WX])-?\\d{1,4})$")) {
System.out.println("Car Template " + car);
}
}
}
}
My problems come up when it checks regs whose county code contains a single letter that is in my pattern, e.g. '12-ZD-1234'.
ZD isn't a valid county ID, but since D is valid the reg is accepted and displayed.
Any help would be great.
I've already done research on a few websites including this and this.
These websites helped, but I'm still having my problems.
By the by, I'm going to convert all inputs to uppercase
to reduce the size of my pattern.
Thanks for the help

Besides the \\w* that others have pointed out, you're misusing character classes ([...]). To actually use alternation (|), take out the square brackets as well:
^(\\d{2}-?(KK|kk|ww|WW|c|C|ce|CE|cn|CN|cw|CW|d|D|dl|DL|g|G|ke|KE|ky|KY|l|L|ld|LD|lh|LH|lk|LK|lm|LM|ls|LS|mh|MH|mn|MN|mo|MO|oy|OY|so|SO|rn|RN|tn|TN|ts|TS|w|W|wd|WD|wh|WH|wx|WX)-?\\d{1,4})$
Here are some examples to show you how character classes actually work:
[abc] matches a single character, either a, b, or c.
[aabbcc] is equivalent to [abc] (duplicates are disregarded).
[|] matches a pipe character, i.e. symbols are allowed.
[KK|kk|ww|WW|c|C|ce|CE ... ] ends up being equivalent to [K|kwWcCeE ... ] because, again, duplicates are disregarded.
You were correct to use the alternation operator (|) to do what you desired, but you didn't need to use character classes.
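To see the difference in practice, here is a small sketch (the class name and the two sample patterns are made up for illustration) that runs the same inputs through a character class and through a group with alternation:

```java
import java.util.regex.Pattern;

public class CharClassVsAlternation {
    // Character class: matches exactly ONE character out of w, x, | or d.
    static final Pattern CHAR_CLASS = Pattern.compile("[wx|d]");
    // Group with alternation: matches the whole string "wx" or the whole string "d".
    static final Pattern ALTERNATION = Pattern.compile("(wx|d)");

    public static void main(String[] args) {
        for (String s : new String[] {"wx", "d", "x", "|"}) {
            System.out.println(s
                    + " class=" + CHAR_CLASS.matcher(s).matches()
                    + " alternation=" + ALTERNATION.matcher(s).matches());
        }
    }
}
```

The character class accepts "x" and "|" on their own but rejects "wx", while the alternation does the opposite.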

You can improve your pattern like this:
^[0-9]{2}-?(?>c[enw]?|C[ENW]?|dl?|DL?|g|G|k[eky]|K[EKY]|l[dhkms]?|L[DHKMS]?|m[hno]|M[HNO]|oy|OY|rn|RN|so|SO|t[ns]|T[NS]|w[dhwx]?|W[DHWX]?)-?[0-9]{1,4}$
And if you don't care about the case of letters:
^(?i)[0-9]{2}-?(?>c[enw]?|dl?|g|k[eky]|l[dhkms]?|m[hno]|oy|rn|so|t[ns]|w[dhwx]?)-?[0-9]{1,4}$
Note that anchors (^ and $) are useful if your string must only contain the car registration number.
Note 2: You can improve it further by putting the most frequent county first in the alternation.
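As a rough sketch of how the case-insensitive variant could be wired into the question's loop: the pattern is compiled once with Pattern.CASE_INSENSITIVE instead of the inline (?i). The class and method names are hypothetical, and the county codes are simply those from the question's list, not checked against any official source.

```java
import java.util.regex.Pattern;

public class IrishPlateCheck {
    // County codes as listed in the question (assumed, not verified).
    private static final Pattern PLATE = Pattern.compile(
            "^[0-9]{2}-?(?>c[enw]?|dl?|g|k[eky]|l[dhkms]?|m[hno]|oy|rn|so|t[ns]|w[dhwx]?)-?[0-9]{1,4}$",
            Pattern.CASE_INSENSITIVE);

    static boolean isValid(String reg) {
        return PLATE.matcher(reg).matches();
    }

    public static void main(String[] args) {
        String[] input = {"12-WW-1", "12-W-223", "02-WX-431", "98-zd-4134", "99-c-7465", "12-ZD-1234"};
        for (String car : input) {
            System.out.println(car + " valid=" + isValid(car));
        }
    }
}
```

With this pattern the two ZD entries are rejected, because no alternative in the group starts with z.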

Since 2013, Irish number plates can also start with three digits: the format is now (year)(1|2)-(county)-(number). If a loose format check is enough, the regex could be simply (\d+-?\w{1,2}-?\d+)
However, the best form of validation is to run this against a vehicle registration API, such as http://ie.carregistrationapi.com/, since this will determine whether the vehicle is actually registered rather than just being in the right format.
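A slightly tighter version of that loose check might look like the sketch below. It assumes the prefix is a two-digit year optionally followed by 1 or 2, as described above, and it does not validate the county code at all; the class and method names are made up.

```java
import java.util.regex.Pattern;

public class LoosePlateFormat {
    // Two-digit year, optional half-year digit (1 or 2), 1-2 letters, then a serial number.
    private static final Pattern FORMAT =
            Pattern.compile("[0-9]{2}[12]?-?[A-Za-z]{1,2}-?[0-9]+");

    static boolean looksLikePlate(String s) {
        return FORMAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePlate("12-W-1234"));
        System.out.println(looksLikePlate("131-D-12345"));
        System.out.println(looksLikePlate("12-1234"));
    }
}
```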

Related

Why is my String array length 3 instead of 2?

I'm trying to understand regex. I wanted to make a String[] using split to show me how many letters are in a given string expression?
import java.util.*;
import java.io.*;
public class Main {
public static String simpleSymbols(String str) {
String result = "";
String[] alpha = str.split("[\\+\\w\\+]");
int alphaLength = alpha.length;
// System.out.print(alphaLength);
String[] charCount = str.split("[a-z]");
int charCountLength = charCount.length;
System.out.println(charCountLength);
}
}
My input string is "+d+=3=+s+". I split the string to count the number of letters in string. The array length should be two but I'm getting three. Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?

So, a few things pop out to me:
First, your regex looks correct. If you're ever worried about how your regex will perform, you can use https://regexr.com/ to check it out. Just put your regex on the top and enter your string in the bottom to see if it is matching correctly
Second, upon close inspection, I see you're using the split function. While it is convenient for quickly splitting strings, you need to be careful as to what you are splitting on. In this case, you're removing all of the strings that you were initially looking at, which would make it impossible to find. If you print it out, you would notice that the following shows (for an input string of +d+=3=+s+):
+
+=3=+
+
Which shows that you accidentally cut out what you were looking to find in the first place. Now, there are several ways of fixing this, depending on what your criteria is.
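Now, if what you wanted was just to separate on all +s and it doesn't matter that you find only what is directly bounded by +s, then split works awesome. Just do str.split("\\+") (the + has to be escaped, because a bare + is a regex quantifier and split would throw a PatternSyntaxException). Since the input starts with a +, the array begins with an empty string, followed by these values (for +d+=3=+s+):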
d
=3=
s
However, you can see that this poses a few problems. First, it doesn't strip out the =3= that we don't want, and second, it does not truly give us values that are surrounded by a +_+ format, where the underscore represents the string/char you're looking for.
Seeing as you're using \\w, you intend to find word characters that are surrounded by +s. However, if you're just looking to find one letter, I would suggest using a class like [a-z] or [a-zA-Z] to be more specific. If you want to find multiple alphabetical characters, you can add a * (0 or more) or a + (1 or more) after the class to dictate what exactly you're looking for.
I won't give you the answer outright, but I'll give you a clue as to what to move towards. Try using a pattern and a matcher to find the regex that you listed above and then if you find a match, make sure to store it somewhere :)
Also, for future reference, you should always start a function name with a lower case, at least in Java. Only constants and class names should start in a capital :)

I am trying to use split to count the number of letters in that string. The array length should be two, but I'm getting three.
The regex passed to split is used as the delimiter and does not appear in the results. In your case, str.split("[a-z]") uses each letter as a delimiter to separate your input string, which produces three substrings: "(+)|d|(+=3=+)|s|(+)".
If you really want to count the letters using split, you would have to split on runs of non-letters with str.split("[^a-z]+"), and even then a delimiter at the start of the input leaves an empty string at the front of the array, so the length is still off by one. I would recommend using java.util.regex.Matcher.find() to find all the letters instead.
Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
Similarly, check the functions in "java.util.regex.Matcher".
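A minimal sketch of the Matcher.find() approach, assuming only lowercase a-z count as letters (the class and method names are invented for the example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterCount {
    private static final Pattern LETTER = Pattern.compile("[a-z]");
    // A single lowercase letter with a literal + on each side.
    private static final Pattern PLUS_LETTER_PLUS = Pattern.compile("\\+[a-z]\\+");

    // Counts how many non-overlapping matches of the pattern occur in the input.
    static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    static int countLetters(String s) {
        return count(LETTER, s);
    }

    static int countSurrounded(String s) {
        return count(PLUS_LETTER_PLUS, s);
    }

    public static void main(String[] args) {
        String input = "+d+=3=+s+";
        System.out.println(countLetters(input));    // letters d and s
        System.out.println(countSurrounded(input)); // +d+ and +s+
    }
}
```

Note that find() does not report overlapping matches, so in a string like "+a+b+" only "+a+" would be counted by the second pattern.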

Java Regex First Name Validation

I understand that validating the first name field is highly controversial due to the fact that there are so many different possibilities. However, I am just learning regex and in an effort to help grasp the concept, I have designed some simple validations to create just try to make sure I am able to make the code do exactly what I want it to, despite whether or not it conforms to best business logic practices.
I am trying to validate a few things.
The first name is between 1 and 25 characters.
The first name can only start with an a-z (ignore case) character.
After that the first name can contain a-z (ignore case) and [ '-,.].
The first name can only end with an a-z (ignore case) character.
public static boolean firstNameValidation(String name){
valid = name.matches("(?i)(^[a-z]+)[a-z .,-]((?! .,-)$){1,25}$");
System.out.println("Name: " + name + "\nValid: " + valid);
return valid;
}

Try this regex
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))[A-Za-z- '.]{2,}$

Your expression is almost correct. The following is a modification that satisfies all of the conditions:
valid = name.matches("(?i)(^[a-z])((?![ .,'-]$)[a-z .,'-]){0,24}$");

A regex for the same:
([a-zA-Z]{1}[a-zA-Z_',.-]{0,23}[a-zA-Z]{0,1})

Let's change the order of the requirements:
ignore case: "(?i)"
can only start with an a-z character: "(?i)[a-z]"
can only end with an a-z: "(?i)[a-z](.*[a-z])?"
is between 1 and 25 characters: "(?i)[a-z](.{0,23}[a-z])?"
can contain a-z and [ '-,.]: "(?i)[a-z]([- ',.a-z]{0,23}[a-z])?"
the last one should do the job:
valid = name.matches("(?i)[a-z]([- ',.a-z]{0,23}[a-z])?")
Test on RegexPlanet (press java button).
Notes for above points
could have used "[a-zA-Z]" instead of "(?i)"
need ? since we want to allow one character names
23 is total length minus the first and the last character (25-1-1)
the - must come first (or last) inside [], else it is interpreted as a range separator (assuming you didn't mean the characters between ' and ,)
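Put together as a runnable sketch (the class name is hypothetical; the pattern is the one built up in the steps above):

```java
import java.util.regex.Pattern;

public class FirstNameCheck {
    // Starts with a letter; optionally up to 23 allowed characters plus a final letter.
    private static final Pattern FIRST_NAME =
            Pattern.compile("(?i)[a-z]([- ',.a-z]{0,23}[a-z])?");

    static boolean firstNameValidation(String name) {
        return FIRST_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"A", "Mary-Jane", "O'Neil", "-Bob", "Bob.", ""}) {
            System.out.println("Name: " + name + "\nValid: " + firstNameValidation(name));
        }
    }
}
```

Because matches() has to consume the whole input, no ^ or $ anchors are needed here.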

Try this simplest version:
^[a-zA-Z][a-zA-Z][-',.][a-zA-Z]{1,25}$
Thanks for sharing.

A Unicode-compatible version of @ProPhoto's answer:
^[^- '](?=(?!\p{Lu}?\p{Lu}))(?=(?!\p{Ll}+\p{Lu}))(?=(?!.*\p{Lu}\p{Lu}))(?=(?!.*[- '][- '.]))(?=(?!.*[.][-'.]))(\p{L}|[- '.]){2,}$

Regex to match if string only contains all characters from a character set, plus an optional one

I ran into a wee problem with Java regex. (I must say in advance, I'm not very experienced in either Java or regex.)
I have a string, and a set of three characters. I want to find out if the string is built from only these characters. Additionally (just to make it even more complicated), two of the characters must be in the string, while the third one is optional.
I do have a solution, my question is rather if anyone can offer anything better/nicer/more elegant, because this makes me cry blood when I look at it...
The set-up
The mandatory characters are: | (pipe) and - (dash).
The string in question should be built from a combination of these. They can be in any order, but both have to be in it.
The optional character is: : (colon).
The string can contain colons, but it does not have to. This is the only other character allowed, apart from the above two.
Any other characters are forbidden.
Expected results
Following strings should work/not work:
"------" = false
"||||" = false
"---|---" = true
"|||-|||" = true
"--|-|--|---|||-" = true
...and...
"----:|--|:::|---::|" = true
":::------:::---:---" = false
"|||:|:::::|" = false
"--:::---|:|---G---n" = false
...etc.
The "ugly" solution
Now, I have a solution that seems to work, based on this stackoverflow answer. The reason I'd like a better one will become obvious when you've recovered from seeing this:
if (string.matches("^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$") || string.matches("^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$")) {
//do funny stuff with a meaningless string
} else {
//don't do funny stuff with a meaningless string
}
Breaking it down
The first regex
"^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$"
checks for all three characters
The next one
"^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$"
check for the two mandatory ones only.
...Yea, I know...
But believe me I tried. Nothing else gave the desired result, but allowed through strings without the mandatory characters, etc.
The question is...
Does anyone know how to do it a simpler / more elegant way?
Bonus question: There is one thing I don't quite get in the regexes above (more than one, but this one bugs me the most):
As far as I understand(?) regular expressions, (?\\|)? should mean that the character | is either contained or not (unless I'm very much mistaken), still in the above setup it seems to enforce that character. This of course suits my purpose, but I cannot understand why it works that way.
So if anyone can explain what I'm missing there, that'd be real great. Besides, I suspect this holds the key to a simpler solution (checking for both mandatory and optional characters in one regex would be ideal).
Thank you all for reading (and suffering ) through my question, and even bigger thanks for those who reply. :)
PS
I did try stuff like ^[\\|\\-(?:\\:)?)]$, but that would not enforce all mandatory characters.

Use a lookahead based regex.
^(?=.*\\|)(?=.*-)[-:|]+$
or
^(?=.*\\|)[-:|]*-[-:|]*$
or
^[-:|]*(?:-:*\\||\\|:*-)[-:|]*$
(?=.*\\|) expects at least one pipe.
(?=.*-) expects at least one hyphen.
[-:|]+ any char from the list one or more times.
$ End of the line.
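As a quick check, here is a sketch that runs the first lookahead pattern against the strings from the question (the class and method names are made up):

```java
import java.util.regex.Pattern;

public class PipeDashCheck {
    // Requires at least one | and one -, and allows only -, : and | overall.
    private static final Pattern VALID = Pattern.compile("^(?=.*\\|)(?=.*-)[-:|]+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "------", "||||", "---|---", "|||-|||", "----:|--|:::|---::|",
            ":::------:::---:---", "|||:|:::::|", "--:::---|:|---G---n"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" = " + isValid(s));
        }
    }
}
```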

Here is a simple answer:
(?=.*\|.*-|.*-.*\|)^([-|:]+)$
This says that the string needs to have a '-' followed by '|', or a '|' followed by a '-', via the look-ahead. Then the string only matches the allowed characters.
Demo: http://fiddle.re/1hnu96

Here is one without lookahead or lookbehind.
^(?:[-:|]*\\|[-:|]*-[-:|]*|[-:|]*-[-:|]*\\|[-:|]*)$
This doesn't scale, so Avinash's solution is to be preferred if your regex engine supports lookahead.

Replacing substrings in String

I am 16 and trying to learn Java. I have a paper that my uncle gave me that has things to do in Java. One of these things is to write and execute a program that will accept an extended message as a string such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it with he (she to he, She to He, she? to he?, she. to he., she' to he' and so on). Can someone help me make a program to accomplish this?
I have this
public static void main(String[] args) {
Scanner keyboard = new Scanner(System.in);
System.out.println("Write Sentence");
String original = keyboard.nextLine();
String changeWord = "he";
String modified = original.replaceAll("she", changeWord);
System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?

The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she". A properly written regular expression allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.

I would recommend this regular expression to solve this. It seems you have to search and replace separately the uppercase S and the lowercase s
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non-word character, i.e. anything other than a letter, digit or underscore (e.g. ', .), and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" put in the resulting string the word he followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace with He'll, but it will not match a pattern like Sherlock because here She is followed by an alphabetic character r
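One variation on this, offered as a sketch rather than as the answer's own method, is to use the word-boundary anchor \\b instead of capturing the following character. That way a she at the very end of the input is also replaced, and words such as ashes are left alone. The class and method names below are hypothetical.

```java
public class SheToHe {
    // Replaces the standalone words "she" and "She"; \\b marks a word boundary.
    static String replaceShe(String text) {
        return text
                .replaceAll("\\bshe\\b", "he")
                .replaceAll("\\bShe\\b", "He");
    }

    public static void main(String[] args) {
        System.out.println(replaceShe("Each time she saw the painting, she was happy"));
        System.out.println(replaceShe("She'll see ashes, said Sherlock. Was it she"));
    }
}
```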

Extract variables from code statement using regex

I'm trying to extract variables from code statements and "if" conditions. I have a regex to do that, but mymatcher.find() doesn't return any matches.
I don't know what is wrong.
here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
public static void main(String[] args) {
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(1) ;
System.out.println("variable:" + find);
}
}
}

You need to remove the ^ and $ anchors that assert positions at the start and end of the string respectively, and use mymatcher.group(0) instead of mymatcher.group(1) because you do not have any capturing groups in your regex:
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(0) ;
System.out.println("variable:" + find);
}
The results are:
variable:x
variable:y
variable:z
variable:n
variable:my5th_integer

Usually processing source code with just a regex simply fails.
If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).
But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.
A first problem is that if you allow arbitrary statements, they often have keywords that look like identifiers. In your specific example, "if" looks like an identifier. So your matcher either has to recognize identifier-like substrings and subtract away known keywords, or the regex itself must express the idea that an identifier has a basic shape but cannot look like any of a specific list of keywords. (The latter is called a subtractive regex, and isn't found in most regex engines.) It looks something like:
[a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )
Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language lexing.
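The first option (match identifier-like substrings, then subtract known keywords) can be sketched in plain Java as below. The keyword set is deliberately incomplete and only there for illustration, and the class and method names are made up; this does nothing about strings, comments or context-dependent keywords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierFilter {
    private static final Pattern IDENT = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
    // Illustrative subset of Java keywords, not the full list.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("if", "else", "class", "while", "for", "return"));

    // Collects identifier-shaped substrings and drops those that are known keywords.
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENT.matcher(code);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(identifiers("if (x > y1) z = x; else w = 0;"));
    }
}
```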
This gets more complex if the "keywords" are not always keywords, that is,
they can be keywords only in certain contexts. The Java "keyword" enum is just that: if you use it in a type context, it is a keyword; otherwise it is an identifier; C# is similar. Now the only way to know
if a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).
Next, identifiers in Java allow a variety of Unicode characters (Latin1, Russian, Chinese, ...) A regexp to recognize this, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
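For the character-set part alone, one option is to skip the hand-written ranges and ask the JDK which code points may start or continue an identifier, via Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. The scanner below is only a sketch with invented names: it still ignores keywords, strings, comments and Unicode escapes, and it would pick "abc" out of a malformed token such as "10abc".

```java
import java.util.ArrayList;
import java.util.List;

public class UnicodeIdentifiers {
    // Scans the input code point by code point and collects identifier-shaped runs.
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < code.length()) {
            int cp = code.codePointAt(i);
            if (Character.isJavaIdentifierStart(cp)) {
                int start = i;
                i += Character.charCount(cp);
                while (i < code.length()) {
                    int next = code.codePointAt(i);
                    if (!Character.isJavaIdentifierPart(next)) {
                        break;
                    }
                    i += Character.charCount(next);
                }
                found.add(code.substring(start, i));
            } else {
                i += Character.charCount(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(identifiers("x=y+z/n-10+my5th_integer+201"));
        // Cyrillic identifier written with Unicode escapes to keep the source ASCII-only.
        System.out.println(identifiers("\u043f\u0440\u0438\u0432\u0435\u0442 = 1"));
    }
}
```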
For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:
a = "x=y+z/n-10+my5th_integer+201";
There is only one identifier here. A similar problem occurs with comments
that contain content that look like statements:
/* Tricky:
a = "x=y+z/n-10+my5th_integer+201";
*/
For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:
\u0061 = \u0062; // means "a=b;"
or nastier:
a\u006bc = 1; // means "akc=1;" not "abc=1;"!
Pushing this, without Unicode character decoding, you might not even
notice a string. The following is a variant of the above:
a = \u0022x=y+z/n-10+my5th_integer+201";
To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.
If you don't care about being right most of the time, you can try your regex. Usually regex-applied-to-source-code-parsing ends badly, partly because of the above problems (e.g, oversimplification).
You are lucky in that you are trying to do this for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... it's turtles all the way down. Consider the C# (version 6) statement:
a = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;
This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.
To extract identifiers from this, you need something that understands the nesting of string elements. Lexers usually don't do recursion (our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.
You have one other issue: do you want to extract just variable names?
What if the identifier represents a method, type, class or package?
You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.
So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things
that look like identifiers.
If you want to do it well (e.g., use this in some production code), the single regex will be a total disaster. You'll spend your life explaining to users what they cannot type, and that never works.
Summary: because of all the complications, processing source code with just a regex usually fails. People keep re-learning this lesson. It is one of the key reasons that lexer generators are widely used in language processing tools.
