Regex for optional leading forward slashes - java

I need to validate shipping container numbers. An industry standard says a valid number is alphanumeric only and exactly 11 characters long, e.g. FBXU8891735
However, there is also a standard industry practice where the first 4 characters can be forward slashes, e.g. ////8891735
I have 2 requirements - firstly to validate the container numbers (eg. matches()) and secondly to clean the container numbers (eg. replaceAll())
System.out.println("MSCU3720090".matches("[a-zA-Z0-9]{11}")); //true - ok
System.out.println("////3720090".matches("[a-zA-Z0-9]{11}")); //false - fail
System.out.println("MSCU3720090".replaceAll("[^a-zA-Z0-9]*", "")); //MSCU3720090 - ok
System.out.println("////3720090".replaceAll("[^a-zA-Z0-9]*", "")); //3720090 - fail
I know that for matches() I can use an alternate eg:
[a-zA-Z0-9]{11}|////[a-zA-Z0-9]{7}
However this seems ugly and I'm not sure how to use it for replaceAll().
Can someone suggest a better regex to satisfy both requirements (or one for each requirement)?
Thanks.

"((?:[a-zA-Z0-9]{4}|/{4})[a-zA-Z0-9]{7})"
Then just examine the contents of capture group 1 for the number.
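A minimal sketch of how that pattern could cover both requirements in Java. The class and method names (ContainerNumbers, isValid, extract) are made up for illustration; only the regex comes from the answer above. It uses matches() for validation and find() plus capture group 1 for cleaning.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex itself is taken from the answer.
final class ContainerNumbers {
    // Either 4 alphanumerics or 4 forward slashes, followed by 7 alphanumerics.
    private static final Pattern CONTAINER =
            Pattern.compile("((?:[a-zA-Z0-9]{4}|/{4})[a-zA-Z0-9]{7})");

    // Validation: the whole input must be a container number.
    static boolean isValid(String input) {
        return CONTAINER.matcher(input).matches();
    }

    // Cleaning: pull the first container number out of surrounding noise.
    static Optional<String> extract(String input) {
        Matcher m = CONTAINER.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Note that find() will also lift the first 11 characters out of a longer alphanumeric run, so if that matters you would probably want to add boundaries around the pattern.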

In case someone wants a proper validation of cargo container numbers per ISO 6346, please refer to my JavaScript class for the purpose, or Patrik Storm's PHP class.


Regex expression for comma and dash separated text of items

I have a Java web application where I get some inputs from the user. Once I get an input I have to parse it, and the parsing depends on what kind of input it is. I decided to use Java's Pattern class for some of the predefined user inputs.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in there ([^,]+), but that would end up requiring a lot of changes to the parsing function, which I would like to avoid. So, in addition to the comma, it should check that each item:
- starts with a letter
- has a minimum of 3 characters (letters combined with numbers)
- can have one dot in it
and that there is a minimum of 1 comma (updated it)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part requires, with the addition that it can contain a dash (at least one).
I've tried multiple online regex generators but didn't get it right. It would be much appreciated if you can help.
Here is what I have so far for a simple range such as B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit1: Valid and Invalid inputs
For a) Enumeration:
A03,B24.1,A25.7 Valid
A03,B24.1 Valid
A03,B24.1-B25.1 - Invalid, because in this case (enumeration) it should not contain a dash
A03 invalid because no comma
A03,B24.1 - Valid
A03 Invalid
For b) Mixed:
everything that an enumeration allows, with the addition that it can have a dash too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or digits: [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a dot and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
The same thing repeated, separated by a comma. There must also be at least one comma, hence the +, i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to @diginoise and @anubhava for pointing out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
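For what it's worth, here is a rough sketch of how the two patterns above might be wired up in Java. The class and method names are invented for the example, and the repeated per-item part is factored into one constant (with {1,} written as the equivalent +); otherwise the regexes are the ones given above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper; the regexes are assembled from the answer's patterns.
final class CodeListPatterns {
    // One item: a letter, at least two more letters/digits, optional ".suffix".
    private static final String ITEM = "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]+)?";

    // (a) Enumeration: items separated by commas, at least one comma.
    static final Pattern ENUMERATION = Pattern.compile(ITEM + "(?:," + ITEM + ")+");

    // (b) Mixed: items separated by a comma or a dash, at least one separator.
    static final Pattern MIXED = Pattern.compile(ITEM + "(?:[,-]" + ITEM + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }
}
```

One thing to be aware of: as written, the mixed pattern also accepts a lone range such as A03-B24 with no comma at all. Whether that is acceptable depends on rules the question does not spell out.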
As I said in the comments, I would chop the input by commas and verify each segment separately. Your domain, ICD-10-CM codes, is very well defined, and I would be very wary of any input that is not valid yet passes the validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least well-being) are at stake. Not to mention astronomical damages and the lawyers ever chasing ambulances. Therefore avoid the easy solution, and implement the bomb-proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Just load them into memory; any code that is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to save space, but verbatim all the codes are under 300 kB on disk, so a few MB at most in memory. That is not a massive price to pay for people not having the left kidney removed instead of the right one.
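A minimal sketch of the set-lookup idea, assuming the full code list has already been loaded from somewhere. The class name and the demo() helper are hypothetical, and the tiny demo set only contains the O09.7x codes mentioned above, not the real list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical validator: a code is valid only if it is in the known set.
final class KnownCodeValidator {
    private final Set<String> knownCodes;

    KnownCodeValidator(Set<String> knownCodes) {
        this.knownCodes = new HashSet<>(knownCodes);
    }

    // Chop the input by commas and check every segment against the set.
    boolean allValid(String input) {
        if (input == null || input.trim().isEmpty()) {
            return false;
        }
        // Limit -1 keeps trailing empty segments, so "O09.70," is rejected.
        for (String segment : input.split(",", -1)) {
            if (!knownCodes.contains(segment.trim())) {
                return false;
            }
        }
        return true;
    }

    // Stand-in for loading the real list; only the codes named in this answer.
    static KnownCodeValidator demo() {
        return new KnownCodeValidator(new HashSet<>(
                Arrays.asList("O09.7", "O09.70", "O09.71", "O09.72", "O09.73")));
    }
}
```

In real use the set would be filled from the complete code list rather than hard-coded, and ranges (the dash case) would need their own handling.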

Regex to validate words do not contain numbers or special characters

I am developing a Java app running on Android. I am trying to pick all words which do not contain any embedded digits or symbols.
The best I have come up with is:
\b[a-zA-Z]+[a-zA-Z]*+\b
Test Data:
this is a test , an0ther gr8 WW##ee one, w1n 1test test1 end
This results in picking the following: this, is, a, test, WW##ee, one, end
I need to eliminate the WW##ee from the results.
You shouldn't use the word boundary meta-character \b here, since it also matches at the position right after WW, where the next character is a hash #. That position is itself a word boundary. So you should take a different approach:
(?<![\S&&[^,]])[a-zA-Z]+(?![\S&&[^,]])
Using character class intersection feature of Java's regex you are able to define punctuation characters that are allowed to follow or precede a word character. Here it is a comma ,.
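Here is a small sketch of that pattern run against the test data from the question. The class and method names are just for illustration; the regex is the one above, with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that collects words made of letters only.
final class PlainWords {
    // [\S&&[^,]] means "any non-whitespace character except a comma".
    // A word must be neither preceded nor followed by such a character.
    private static final Pattern PLAIN_WORD =
            Pattern.compile("(?<![\\S&&[^,]])[a-zA-Z]+(?![\\S&&[^,]])");

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = PLAIN_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

For the test data in the question this should give this, is, a, test, one, end, i.e. WW##ee is no longer picked up. To allow more punctuation next to a word, add the characters to the inner [^,] class.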
You could use look behind and look ahead to check there is no #.
\b(?<!\#)[a-zA-Z]+(?!\#)\b
My solution has evolved a bit as I have gotten additional help with this. So, this is now my best solution but still a bit lacking. I have not been able to accept "as-is" while rejecting "-this-" and a similar case of accept "and/or" while rejecting "/slash/". Also for simplicity I have made the input data single word per line.
^(?:[\p{P}\p{S}])?((?:[\p{L}\p{Pd}'])+)(?:[\p{P}\p{S}])$
as-is is picked valid
-this- is valid but I wish it weren't
and/or is not valid but I wish it would be picked
/slash/ "slash" is picked valid
(test) "test" is picked valid
[test] "test" is picked valid
<test> "test" is picked valid

Limited currency regex

I have found a lot of good currency regular expressions that get very close to what I need. Alas, I am no regex guru and can't seem to edit my current regex to meet requirements.
I need to limit the valid inputs to the format of 'xxx,xxx.xx'. The max allowed amount needs to be '999,999.99' with commas optional. I've been using this regex until now:
^([0-9]{1,3}(,[0-9]{3})*|([0-9]+))(.[0-9]{2})?$
It has been working great except for not being able to make the upper limit '999,999.99'. Thanks for the help!
Update
I've been tinkering and I've managed to come up with this:
/^(?:([0-9]{3}?,?)?[0-9]{3}(?:\.[0-9]?[0-9]?)?)$/
Still testing to see if it works. RegexPlanet isn't passing it with any of the Strings I try, but I'll be going through my app and manually testing.
burning_LEGION's answer accepts some cases I think you probably don't want:
- 999,9
- 9.
I'll assume you want these conditions fulfilled:
- if there is a comma, it is followed by 3 digits
- if there is a point, it is followed by 2 digits
^\d{1,3}(,?\d{3})?(\.\d{2})?$
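A quick way to try that regex in Java, as a sketch. The class and method names are made up; the pattern is the one above with backslashes doubled for the string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper around the regex from this answer.
final class LimitedCurrency {
    // Up to 3 digits, optionally one more group of 3 digits (comma optional),
    // optionally a point followed by exactly 2 digits. Maximum: 999,999.99.
    private static final Pattern AMOUNT =
            Pattern.compile("^\\d{1,3}(,?\\d{3})?(\\.\\d{2})?$");

    static boolean isValid(String input) {
        return AMOUNT.matcher(input).matches();
    }
}
```

With matches() the ^ and $ anchors are redundant, but they are harmless and keep the pattern usable with find() as well.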
Use this regex: ^\d{1,3}(,?\d{1,3}){0,1}(\.\d{0,2})?$

Regex: How not to match a few letters

I have the following string: SEE ATTACHED ADDENDUM TO HUD-1194,520.07
Inside that string is HUD-1 and after that is 194,520.07. What I want is the 194,520.07 part.
I have written the following regular expression to pull that value out:
[^D\-1](?:-|\()?\$?(?:\d{1,3}[ ,]?)*(?:\.\d+)\)?
However, this pulls out: 94,520.07
I know it has something to do with this part: [^D\-1] "eating" too many of the 1's. Any ideas how I can stop it from "eating" 1's after the first one that appears in HUD-1?
UPDATED:
The reason for all the other stuff is I only want to match as well if the value after HUD-1 is a money amount. And the rest of that regex tries to determine all the different ways a money amount could be written
Why not something as simple as:
.*HUD\-1(.*+)
Ok, you need to be more restrictive I see based on your updated question. Try changing [^D\-1] to just (?:HUD\-1)?. For what it's worth, your currency RegEx is very lax, allowing input like:
001 001 .31412341234123
You might consider not reinventing the wheel there, I'm sure you can find a currency RegEx quickly via Google. Otherwise, I'd also suggest anchoring your RegEx with a $ at the end of it.
This change makes the second capture group of the regex hold the full number you want (everything after the first 1), and puts the HUD-1, if present, in a separate capture group:
(HUD-1)?((?:-|\()?\$?(?:\d{1,3}[ ,]?)*(?:\.\d+)\)?)
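As a sketch of how that could be used from Java (the class and method names are invented for the example), find the first match and read capture group 2:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the regex is the one from this answer.
final class HudAmount {
    // Group 1: optional "HUD-1" prefix. Group 2: the money amount.
    private static final Pattern AMOUNT = Pattern.compile(
            "(HUD-1)?((?:-|\\()?\\$?(?:\\d{1,3}[ ,]?)*(?:\\.\\d+)\\)?)");

    static Optional<String> extract(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }
}
```

For the string in the question, group 2 comes out as 194,520.07 because the optional HUD-1 group consumes the first 1 before the amount part starts. The amount part is still as lax as the original, so the earlier remarks about tightening it apply here too.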

Regex for checking > 1 upper, lower, digit, and special char

^.*(?=.{15,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!##$%^&*+=]).*$
This is the regex I am currently using, which checks for 1 of each: upper, lower, digit, and specials of my choosing. How do I make it check for 2 of each of these? I also ask because it seems difficult to write a test case for this, as I do not know whether it only evaluates the first criterion it needs. This is for a password; however, the requirement is that it be in regex form because of the package we are using.
EDIT
Well as it stands in my haste to validate the expression I forgot to validate my string length. Thanks to Ken and Gumbo on helping me with this.
This is the code I am executing:
I do apologize as regex is not my area.
The password I am using is the following string "$$QiouWER1245", the behavior I am experiencing at the moment is that it randomly chooses to pass or fail. Any thoughts on this?
Pattern pattern = Pattern.compile(regEx);
Matcher match = pattern.matcher(password);
while (match.find()) {
    System.out.println(match.group());
}
From what I see if it evaluates to true it will throw the value in password back to me else it is an empty string.
Personally, I think a password policy that forces use of all three character classes is not very helpful. You can get the same degree of randomness by letting people make longer passwords. Users will tend to get frustrated and write passwords down if they have to abide by too many password rules (which make the passwords too difficult to remember). I recommend counting bits of entropy and making sure they're greater than 60 (usually requires a 10-14 character password). Entropy per character would depend roughly on the number of characters, the range of character sets they use, and maybe how often they switch between character sets (I would guess that passwords like HEYthere are more predictable than heYThEre).
Another note: do you plan not to count the symbols to the right of the keyboard (period, comma, angle brackets, etc.)?
If you still have to find groups of two characters, why not just repeat each pattern? For example, make (?=.*\d) into (?=.*\d.*\d).
For your test cases, if you are worried that it would only check the first criterion, then write a test case that makes sure each of the following passwords fails (because one and only one of the criteria is not met in each case). Just for fun I reversed the order of expectation of each character set, though it probably won't make a difference unless someone removes or forgets the ?= at some future date.
!##TESTwithoutnumbers
TESTwithoutsymbols123
&*(testwithoutuppercase456
+_^TESTWITHOUTLOWERCASE3498
I should point out that technically none of these passwords should be acceptable because they use dictionary words, which have about 2 bits of entropy per character instead of something more like 6. However, I realize that it's difficult to write a (maintainable and efficient) regular expression to check for dictionary words.
Try this:
"^(?=(?:\\D*\\d){2})(?=(?:[^a-z]*[a-z]){2})(?=(?:[^A-Z]*[A-Z]){2})(?=(?:[^!##$%^&*+=]*[!##$%^&*+=]){2}).{15,}$"
Here non-capturing groups (?:…) are used to group the conditions and repeat them. I’ve also used the complements of each character class for optimization instead of the universal ..
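A sketch of how this pattern might be exercised in Java, mainly to show that the length requirement matters. The class and method names are hypothetical; the regex string is copied from this answer as it appears here.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
final class PasswordRegexCheck {
    // Each lookahead demands two characters of one class; .{15,} sets the length.
    private static final Pattern POLICY = Pattern.compile(
            "^(?=(?:\\D*\\d){2})(?=(?:[^a-z]*[a-z]){2})(?=(?:[^A-Z]*[A-Z]){2})"
            + "(?=(?:[^!##$%^&*+=]*[!##$%^&*+=]){2}).{15,}$");

    static boolean isAcceptable(String password) {
        return POLICY.matcher(password).matches();
    }
}
```

The password from the question, $$QiouWER1245, has two of everything but is only 13 characters long, so it should be rejected on length alone. Padding it to 15 characters (for example $$QiouWER124567) should make it pass.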
If I understand your question correctly, you want at least 15 characters, and to require at least 2 uppercase characters, at least 2 lowercase characters, at least 2 digits, and at least 2 special characters. In that case you could do it like this:
^.*(?=.{15,})(?=.*\d.*\d)(?=.*[a-z].*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[!##$%^&*+=].*[!##$%^&*+=]).*$
BTW, your original regex had an extra backslash before the \d
I'm not sure that one big regex is the right way to go here. It already looks far too complicated and will be very difficult to change in the future.
My suggestion is to structure the code in the following way:
check that the string has 2 lower case characters
return failure if not found or continue
check that the string has 2 upper case characters
return failure if not found or continue
etc.
This will also allow you to pass out a return code or errors string specifying why the password was not accepted and the code will be much simpler.
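The structure described above could look roughly like this. It is only a sketch: the class name, the messages, and the constants are made up, the minimum length of 15 and the special character set are taken from the regexes elsewhere in this thread, and it only counts ASCII letters and digits.

```java
import java.util.Optional;

// Hypothetical step-by-step checker that reports why a password was rejected.
final class PasswordChecker {
    private static final String SPECIALS = "!#$%^&*+="; // assumed, from the thread's regexes
    private static final int MIN_LENGTH = 15;           // assumed, from the thread's regexes
    private static final int MIN_PER_CLASS = 2;

    // Returns the first problem found, or empty if the password is acceptable.
    static Optional<String> firstProblem(String password) {
        if (password.length() < MIN_LENGTH) {
            return Optional.of("must be at least " + MIN_LENGTH + " characters long");
        }
        int lower = 0, upper = 0, digits = 0, specials = 0;
        for (char c : password.toCharArray()) {
            if (c >= 'a' && c <= 'z') lower++;
            else if (c >= 'A' && c <= 'Z') upper++;
            else if (c >= '0' && c <= '9') digits++;
            else if (SPECIALS.indexOf(c) >= 0) specials++;
        }
        if (lower < MIN_PER_CLASS) return Optional.of("needs at least 2 lowercase letters");
        if (upper < MIN_PER_CLASS) return Optional.of("needs at least 2 uppercase letters");
        if (digits < MIN_PER_CLASS) return Optional.of("needs at least 2 digits");
        if (specials < MIN_PER_CLASS) return Optional.of("needs at least 2 special characters");
        return Optional.empty();
    }
}
```

This does not satisfy the "must be a regex" constraint from the question, but it shows how each failure gets its own message, which also makes the individual rules easy to unit test.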
