Java regex: find sequence of letter-digit combinations, allowing certain symbols - java

I am trying to write a regex that detects tokens in a sentence. Each token must contain both letters and digits, and may optionally contain characters such as , or .
Given the sentence:
M5 x 35mm Full Thread Hexagon Bolts (DIN 933) - PEEK DescriptionThe M5 x 0.035mm, and 6NB7 plus a Go9IuN.
It should find six tokens:
M5, 35mm, M5, 0.035mm, 6NB7, Go9IuN
I have tried the following which does not work:
Pattern alphanum=Pattern.compile("\\b(([A-Za-z].*[0-9])|([0-9].*[A-Za-z]))\\b");
Any suggestions please?
Thanks

You could use a positive lookahead to assert at least 1 digit and then match at least 1 char a-zA-Z.
The .* part of your pattern overmatches, because it matches any char except a newline 0+ times.
\b(?=[a-zA-Z0-9.,]*[0-9])[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]*\b
Explanation
\b Word boundary
(?=[a-zA-Z0-9.,]*[0-9]) Assert at least 1 digit
[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]* Match at least 1 char a-zA-Z
\b Word boundary
Regex demo
In Java
final String regex = "\\b(?=[a-zA-Z0-9.,]*[0-9])[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]*\\b";
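As a quick check, here is a minimal sketch that runs this pattern over the sentence from the question. The class and method names (LookaheadTokens, findTokens) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadTokens {
    // Pattern from the answer: the lookahead requires a digit, the body requires a letter.
    static final Pattern TOKEN = Pattern.compile(
            "\\b(?=[a-zA-Z0-9.,]*[0-9])[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]*\\b");

    // Collects every match of TOKEN in the input, in order of appearance.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "M5 x 35mm Full Thread Hexagon Bolts (DIN 933) - PEEK "
                + "DescriptionThe M5 x 0.035mm, and 6NB7 plus a Go9IuN.";
        // Prints [M5, 35mm, M5, 0.035mm, 6NB7, Go9IuN]
        System.out.println(findTokens(sentence));
    }
}
```

The trailing \b is what drops the final comma or period: a word boundary cannot sit between punctuation and a space, so the engine backtracks to the last letter or digit.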

Perhaps the following regex will do the job
(?=[A-Za-z,.]*\d)(?=[\d,.]*[A-Za-z])[A-Za-z\d,.]{2,}(?<![,.])
It starts with two positive lookaheads which form an and condition.
The first lookahead (?=[A-Za-z,.]*\d) checks if a token contains at least one digit.
The second lookahead (?=[\d,.]*[A-Za-z]) checks if it contains at least one letter.
The actual match [A-Za-z\d,.]{2,} reads at least two letters, digits, , or ..
In the end it checks that the match does not end with those special characters: (?<![,.])
regex101 demo
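A small sketch of how this two-lookahead pattern could be used from Java, again on the sentence from the question. The names TwoLookaheadTokens and findTokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoLookaheadTokens {
    // Pattern from the answer: two lookaheads, a body of 2+ chars, and a trailing lookbehind.
    static final Pattern TOKEN = Pattern.compile(
            "(?=[A-Za-z,.]*\\d)(?=[\\d,.]*[A-Za-z])[A-Za-z\\d,.]{2,}(?<![,.])");

    // Collects every match of TOKEN in the input, in order of appearance.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "M5 x 35mm Full Thread Hexagon Bolts (DIN 933) - PEEK "
                + "DescriptionThe M5 x 0.035mm, and 6NB7 plus a Go9IuN.";
        // Prints [M5, 35mm, M5, 0.035mm, 6NB7, Go9IuN]
        System.out.println(findTokens(sentence));
    }
}
```

Here the lookbehind (?<![,.]) does the trimming: "0.035mm," is first consumed with its comma, then the engine gives the comma back so the match ends on a letter.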

Related

Given string filter out unique element from string using regex

I have this String and I want to extract the number that comes after the long number and the space, so in this case I want to extract 2 and 0.32. The regex below only extracts decimal numbers, but I want to extract both decimals and integers. Is there any way?
String s = "ABB123,ABPP,ADFG0/AA/BHJ.S,392483492389 2,BBBB,YUIO,BUYGH/AA/BHJ.S,3232489880 0.32";
regex = .AA/BHJ.S,\d+ (\d+.?\d+)
https://regex101.com/r/ZqHDQ8/1
The problem is that \d+.?\d+ matches at least two digits. \d+ matches one or more digits, then .? matches any optional char other than line break char, and then again \d+ matches (requires) at least one digit (it matches one or more).
Also, note that all literal dots must be escaped.
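To make the difference concrete, the sketch below compares the number part of the original pattern, \d+.?\d+, with the form \d+(?:\.\d+)? on a few sample inputs. The class name and sample values are chosen only for illustration.

```java
import java.util.regex.Pattern;

class OptionalDotDemo {
    // Number part of the question's pattern: the unescaped dot matches any character.
    static final Pattern LOOSE = Pattern.compile("\\d+.?\\d+");
    // Escaped dot, with the whole fractional part made optional.
    static final Pattern STRICT = Pattern.compile("\\d+(?:\\.\\d+)?");

    public static void main(String[] args) {
        for (String s : new String[] {"2", "12", "0.32", "1x2"}) {
            System.out.println(s
                    + " loose=" + LOOSE.matcher(s).matches()
                    + " strict=" + STRICT.matcher(s).matches());
        }
        // 2    loose=false strict=true   (LOOSE needs at least two digits)
        // 12   loose=true  strict=true
        // 0.32 loose=true  strict=true
        // 1x2  loose=true  strict=false  (the unescaped dot accepted "x")
    }
}
```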
You can use
.AA/BHJ\.S,\d+\s+(\d+(?:\.\d+)?)
See the regex demo.
Details:
. - any one char
AA/BHJ\.S, - a AA/BHJ.S, string
\d+ - one or more digits
\s+ - one or more whitespaces
(\d+(?:\.\d+)?) - Group 1: one or more digits, and then an optional sequence of a dot and one or more digits.
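A minimal Java sketch of reading Group 1 with this pattern, using the string from the question. TrailingNumberExtractor and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingNumberExtractor {
    // Pattern from the answer; Group 1 holds the integer or decimal after the whitespace.
    static final Pattern TRAILING_NUMBER =
            Pattern.compile(".AA/BHJ\\.S,\\d+\\s+(\\d+(?:\\.\\d+)?)");

    // Returns the contents of Group 1 for every match in the input.
    static List<String> extract(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = TRAILING_NUMBER.matcher(input);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String s = "ABB123,ABPP,ADFG0/AA/BHJ.S,392483492389 2,BBBB,YUIO,"
                + "BUYGH/AA/BHJ.S,3232489880 0.32";
        // Prints [2, 0.32]
        System.out.println(extract(s));
    }
}
```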
You could look for anything following /AA/BHJ with a reluctant quantifier, then use a capturing group to look for either digits or one or more digits followed by a decimal separator and other digits.
/AA/BHJ.*?\s+(\d+\.\d+|\d+)
Here is a link to test the regex:
https://regex101.com/r/l5nMrD/1

A period must not appear consecutively in a String Java

I have code that checks whether the user input is valid against a regular expression pattern. The pattern is below (it uses # as the separator). The problem is how to check whether the character . appears consecutively.
[a-z|A-Z|0-9|[.]{1}]+#[a-z|A-Z|0-9]+
I've tried this pattern so far.
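import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        System.out.print("Enter your Email: ");
        String userInput = new Scanner(System.in).nextLine();
        // Pattern as originally tried; it does not yet reject consecutive dots.
        Pattern pat = Pattern.compile("[a-z|A-Z|0-9|[.]{1}]+#[a-z|A-Z|0-9]+");
        Matcher mat = pat.matcher(userInput);
        if (mat.matches()) {
            System.out.print("Valid");
        } else {
            System.out.print("Invalid");
        }
    }
}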
If the input is een..123#asd123, I expect the output to be Invalid, but if the input is een.123#asd123, the output should be Valid.
A character class matches any one of the listed characters. A pipe | inside a character class does not mean OR; it is a literal character, so the class also matches a |.
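The literal pipe is easy to confirm with a short sketch that feeds the pattern from the question some input containing |. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class PipeInClassDemo {
    // Pattern exactly as tried in the question.
    static final String ASKED = "[a-z|A-Z|0-9|[.]{1}]+#[a-z|A-Z|0-9]+";

    // True if the whole input matches the question's pattern.
    static boolean matchesAsked(String input) {
        return Pattern.matches(ASKED, input);
    }

    public static void main(String[] args) {
        // Inside [...] the pipe is just another member of the class.
        System.out.println("|".matches("[a|b]"));            // true
        // So the question's pattern accepts a pipe in the input...
        System.out.println(matchesAsked("een.123#as|d"));    // true
        // ...and it also accepts consecutive dots, which is the reported problem.
        System.out.println(matchesAsked("een..123#asd123")); // true
    }
}
```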
If you don't want to match consecutive dots, you could make use of a character class that does not contain a dot, and then use a quantifier to repeat a grouping structure that does start with a dot.
^[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*#[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*$
That will match
^ Start of string
[a-zA-Z0-9]+ Match 1+ times any character that is listed in the character class
(?:\.[a-zA-Z0-9]+)* Repeat 0+ times a group which starts with a dot and matches 1+ times what is listed in the character class to prevent consecutive dots.
# Match # char
[a-zA-Z0-9]+ Match again 1+ chars
(?:\.[a-zA-Z0-9]+)* Match again repeating group
$ End of string
Regex101 demo
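A minimal sketch of this pattern in Java, checked against the two inputs from the question plus two edge cases. NoConsecutiveDots and isValid are hypothetical names.

```java
import java.util.regex.Pattern;

class NoConsecutiveDots {
    // Pattern from the answer: every dot must be followed by at least one letter or digit.
    static final Pattern VALID = Pattern.compile(
            "^[a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*#[a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("een.123#asd123"));  // true
        System.out.println(isValid("een..123#asd123")); // false: consecutive dots
        System.out.println(isValid(".een#asd"));        // false: leading dot
        System.out.println(isValid("een.#asd"));        // false: dot right before #
    }
}
```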
If you don't want consecutive periods, use [a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*
Explanation
[a-zA-Z0-9]+ Match one or more letters or digits
(?: Start non-capturing group
\. Match exactly one period
[a-zA-Z0-9]+ Match one or more letters or digits
)* End optional repeating group
With this pattern, the value cannot start or end with a period, and cannot have consecutive periods.
Alternatively, use a negative lookahead: (?!.*[.]{2})[a-zA-Z0-9.]+#[a-zA-Z0-9]+
Explanation
(?!.*[.]{2}) Fail match if 2 consecutive periods are found
[a-zA-Z0-9.]+#[a-zA-Z0-9]+ Match normally
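For comparison, a short sketch of the negative-lookahead variant. It rejects consecutive periods but, unlike the grouped pattern, still accepts a leading or trailing period in the first part. The names below are illustrative only.

```java
import java.util.regex.Pattern;

class LookaheadDots {
    // Pattern from the answer: the lookahead fails as soon as ".." occurs anywhere ahead.
    static final Pattern VALID =
            Pattern.compile("(?!.*[.]{2})[a-zA-Z0-9.]+#[a-zA-Z0-9]+");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("een.123#asd123"));  // true
        System.out.println(isValid("een..123#asd123")); // false: consecutive periods
        System.out.println(isValid(".een#asd"));        // true: leading period is allowed here
        System.out.println(isValid("een.#asd"));        // true: trailing period is allowed here
    }
}
```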

Regex validating eight or more char string that must contain at least two non-alphabetic characters

I am validating a password string that must consist of eight characters or more and must contain at least two non-alphabetic (i.e., not A-Za-z) characters using regular expression.
The code I have so far is
Pattern p = Pattern.compile("((?=2.*[^a-z[A-Z]]).{8,})");
Matcher m = p.matcher(pass);
I don't know whether my expression is correct.
I want to validate my password with eight characters or more and must contain at least two non-alphabetic (i.e., not A-Z) characters
You may use
s.matches("(?=(?:[a-zA-Z]*[^a-zA-Z]){2}).{8,}")
See the regex demo.
Another way of writing the same
s.matches("(?=.{8})(?:[a-zA-Z]*[^a-zA-Z]){2}.*")
Explanation
^ - (not necessary in .matches as the method requires a full string match) - start of string
(?= - start of a positive lookahead that requires, immediately to the right of the current location,
(?:[a-zA-Z]*[^a-zA-Z]){2} - a non-capturing group that matches 2 consecutive occurrences of:
[a-zA-Z]* - any 0+ ASCII letters
[^a-zA-Z] - any char other than an ASCII letter
) - end of the lookahead
.{8,} - any 8 or more chars other than line break chars, as many as possible
$ - (not necessary in .matches as the method requires a full string match) - end of string
In the (?=.{8})(?:[a-zA-Z]*[^a-zA-Z]){2}.* pattern, the first lookahead requires at least 8 chars, then at least two non-letter chars are required using the (?:[a-zA-Z]*[^a-zA-Z]){2} pattern, and then .* matches the rest of the string.
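A minimal sketch of the lookahead approach applied to the requirement as stated in the question (eight or more characters, at least two of them non-letters). The lookahead used here counts non-letters, (?=(?:[a-zA-Z]*[^a-zA-Z]){2}); the class name and sample passwords are made up.

```java
import java.util.regex.Pattern;

class PasswordCheck {
    // Lookahead: two occurrences of "optional letters, then one non-letter".
    // Body: at least eight characters in total.
    static final Pattern VALID =
            Pattern.compile("(?=(?:[a-zA-Z]*[^a-zA-Z]){2}).{8,}");

    static boolean isValid(String password) {
        return VALID.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abcdef12")); // true: 8 chars, two non-letters
        System.out.println(isValid("abcdefg1")); // false: only one non-letter
        System.out.println(isValid("abcdefgh")); // false: no non-letters
        System.out.println(isValid("ab1!"));     // false: shorter than 8
        System.out.println(isValid("12345678")); // true: letters are not required
    }
}
```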

Weird behaviour on regexp Matcher

My regexp below is supposed to filter out capital words with a length of 8-10, where 0-2 numbers may appear. It has been working for all of my tests, but for some reason it got stuck on the string below. And n.group(0) only contains an empty string instead of the matched "word".
static final Pattern PATTERN =
Pattern.compile("\\b(?=[A-Z\\d]{9,10}\\b)(?:[A-Z]*\\d){0,2}[A-Z]*\\b");
Matcher n = PATTERN.matcher("foo ID:636152727 bar");
while (n.find()) {
String s = n.group(0);
resultArrayList.add(s);
}
Why does my pattern match ID:636152727?
Some examples that I want to filter out (which is working):
AAAAAAAAAA
1AAAAAAAAA
1AAAAAAAA1
etc...
I don't have a better solution to offer than the one in Ωmega's answer, but I think I can explain what's happening. What it boils down to is that the first \b and the last \b are matching the same spot: right after the colon.
That's the first place where the lookahead can match, since it's followed by nine digits and a word boundary. Then the next part of the regex tries to match two digits (interspersed with any number of uppercase letters) followed by a word boundary, and fails. So it tries to match just one digit (ditto), and fails again. Then it tries matching zero digits (interspersed with zero letters), and it succeeds, without advancing the match position. That position is still a word boundary, so the final \b succeeds as well.
A word boundary is just another zero-width assertion, like lookaheads and lookbehinds. There's no reason why two or more can't be applied at the same spot; you did that on purpose with the first word boundary and the lookahead. Some regex flavors treat it as an error if you apply a quantifier to an assertion (like \b+), but I don't think any of them would catch this problem. This is one of those rare instances where separate start-of-word and end-of-word assertions, like GNU's \< and \> or Tcl's \m and \M, would make a difference.
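The zero-width match described above can be made visible by printing the start and end offsets of each match. This is only a sketch; EmptyMatchDemo and describeMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Pattern from the question.
    static final Pattern PATTERN = Pattern.compile(
            "\\b(?=[A-Z\\d]{9,10}\\b)(?:[A-Z]*\\d){0,2}[A-Z]*\\b");

    // Returns each match as "start-end:text" so zero-width matches show up.
    static List<String> describeMatches(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [7-7:] : one empty match right after the colon,
        // where both \b assertions succeed at the same position.
        System.out.println(describeMatches("foo ID:636152727 bar"));
    }
}
```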
You need to use the anchors ^ and $:
Pattern.compile("^(?=[A-Z\\d]{9,10}$)(?:[A-Z]*\\d){0,2}[A-Z]*$");
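Because ^ and $ anchor the whole input, this version tests one candidate string at a time. The sketch below assumes the candidates are separated by whitespace and splits the text first; that splitting step and the names used are not from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AnchoredWordFilter {
    // Anchored pattern from the answer.
    static final Pattern WORD = Pattern.compile(
            "^(?=[A-Z\\d]{9,10}$)(?:[A-Z]*\\d){0,2}[A-Z]*$");

    // Assumption: candidates are whitespace-separated, so each piece is tested on its own.
    static List<String> filter(String text) {
        List<String> hits = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (WORD.matcher(piece).find()) {
                hits.add(piece);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints [AAAAAAAAAA, 1AAAAAAAA1]
        System.out.println(filter(
                "foo ID:636152727 bar AAAAAAAAAA 1AAAAAAAA1 123AAAAAAA"));
    }
}
```

With the anchors in place, ID:636152727 is rejected outright because the lookahead must cover the entire piece, and 123AAAAAAA is rejected because it contains three digits.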
Use this pattern:
"(?:^|(?<=\\s))(?=[A-Z\\d]{9,10}(?:\\s|$))(?:[A-Z]*\\d){0,2}[A-Z]*(?=\\s|$)"

Regex in JAVA at most one dot

I expect: \b([a-zA-Z]+\.?)\b or \b([a-zA-Z]+\.{0,1})\b to work as at least one letter and at most one dot.
But the matcher finds "ab" with an input of "ab" "ab." and "ab.." and I'm expecting it to do the following:
"ab" is found for input "ab"
"ab." is found for input "ab."
nothing is found for input "ab.."
If I change the regex to work with 0 instead of a dot, e.g. \b([a-zA-Z]+0?)\b, then it works as expected:
"ab" is found for input "ab"
"ab0" is found for input "ab0"
nothing is found for input "ab00"
So, how do I get my regex to work?
The issue is that \b matches between word characters and non-word characters, not between whitespace and non-whitespace as you seem to be trying. The difference between a . and a 0 is that 0 is considered a "word" character, but . isn't.
So what's happening in your examples is this:
Let's take that last string ab.. and see where \b could match:
a b . .
^ x ^ x x
Remember, \b matches between characters. I've shown where \b could match with a ^, and where it can't with an x. Since \b can only match in front of a or right after b, we're limited to just matching ab so long as you have those \b bits in there.
I think you want something like \bab\.?(?!\S). That says "word boundary, then a then b then maybe a single dot where there is NOT a non-space character immediately after."
If I've misunderstood your question, and you do want the expression to find ab. in the string ab.c or find ab in abc you can do \bab\.?(?!\.)
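The sketch below contrasts the original pattern with a generalized form of the suggestion above. The generalization from the literal ab to [a-zA-Z]+ is an assumption made for this example, as are the class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDotDemo {
    // Pattern from the question: the trailing \b cannot match after a dot.
    static final Pattern ORIGINAL = Pattern.compile("\\b([a-zA-Z]+\\.?)\\b");
    // Assumed generalization of \bab\.?(?!\S): no non-space character may follow.
    static final Pattern FIXED = Pattern.compile("\\b[a-zA-Z]+\\.?(?!\\S)");

    // Returns the first match of the pattern in the input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ab", "ab.", "ab.."}) {
            System.out.println(s + " original=" + firstMatch(ORIGINAL, s)
                    + " fixed=" + firstMatch(FIXED, s));
        }
        // ab   original=ab fixed=ab
        // ab.  original=ab fixed=ab.
        // ab.. original=ab fixed=null
    }
}
```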
\b([a-zA-Z]+\.+)\b is "at least one letter followed by at least one dot"
\b([a-zA-Z]+\.{0,1})\b is "at least one letter followed by zero or one dot"
