I expect \b([a-zA-Z]+\.?)\b or \b([a-zA-Z]+\.{0,1})\b to match at least one letter followed by at most one dot.
But the matcher finds "ab" for each of the inputs "ab", "ab." and "ab..". I'm expecting the following instead:
"ab" is found for input "ab"
"ab." is found for input "ab."
nothing is found for input "ab.."
If I change the regex to use 0 instead of a dot, e.g. \b([a-zA-Z]+0?)\b, then it works as expected:
"ab" is found for input "ab"
"ab0" is found for input "ab0"
nothing is found for input "ab00"
So, how do I get my regex to work?
The issue is that \b matches between word characters and non-word characters, not between whitespace and non-whitespace as you seem to be trying. The difference between a . and a 0 is that 0 is considered a "word" character, but . isn't.
To see what's happening in your examples, let's take that last string, ab.., and see where \b could match:
 a b . .
^ x ^ x x
Remember, \b matches between characters. I've shown where \b could match with a ^, and where it can't with an x. Since \b can only match in front of a or right after b, we're limited to just matching ab so long as you have those \b bits in there.
I think you want something like \bab\.?(?!\S). That says "word boundary, then a then b then maybe a single dot where there is NOT a non-space character immediately after."
If I've misunderstood your question, and you do want the expression to find ab. in the string ab.c or find ab in abc you can do \bab\.?(?!\.)
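For what it's worth, here is a minimal Java sketch of the difference, assuming each input is tested with find(). The class and method names are made up for illustration, and the second pattern generalises the ab from the suggestion above to [a-zA-Z]+, which is my assumption about what you want.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotBoundaryDemo {
    // Pattern from the question: the trailing \b can never sit after a dot
    // that is followed by another non-word character or the end of input.
    static final Pattern WITH_BOUNDARY = Pattern.compile("\\b([a-zA-Z]+\\.?)\\b");

    // Lookahead variant (assumption: any run of letters, not just "ab").
    static final Pattern WITH_LOOKAHEAD = Pattern.compile("\\b[a-zA-Z]+\\.?(?!\\S)");

    // Returns the first match, or null when nothing is found.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ab", "ab.", "ab.."}) {
            System.out.println(s + " -> \\b version: " + firstMatch(WITH_BOUNDARY, s)
                    + ", lookahead version: " + firstMatch(WITH_LOOKAHEAD, s));
        }
    }
}
```

With the \b version all three inputs give ab; with the lookahead version they give ab, ab. and null respectively.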
\b([a-zA-Z]+\.+)\b is "at least one letter followed by at least one dot"
\b([a-zA-Z]+\.{0,1})\b is "at least one letter followed by zero or one dot"
I am trying to arrive at a regex to detect tokens from a sentence. These tokens should be a combination of letters and digits (mandatory), with optional chars like , or .
Given the sentence:
M5 x 35mm Full Thread Hexagon Bolts (DIN 933) - PEEK DescriptionThe M5 x 0.035mm, and 6NB7 plus a Go9IuN.
It should find six tokens:
M5, 35mm, M5, 0.035mm, 6NB7, Go9IuN
I have tried the following which does not work:
Pattern alphanum=Pattern.compile("\\b(([A-Za-z].*[0-9])|([0-9].*[A-Za-z]))\\b");
Any suggestions please?
Thanks
You could use a positive lookahead to assert at least one digit, and then match at least one character from a-zA-Z.
The .* part of your pattern overmatches, because it matches any character except a newline zero or more times.
\b(?=[a-zA-Z0-9.,]*[0-9])[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]*\b
Explanation
\b Word boundary
(?=[a-zA-Z0-9.,]*[0-9]) Assert at least 1 digit
[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]* Match at least 1 char a-zA-Z
\b Word boundary
In Java
final String regex = "\\b(?=[a-zA-Z0-9.,]*[0-9])[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]*\\b";
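A small runnable sketch of how that pattern could be applied to the sentence from the question. The class and method names are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFinder {
    // The regex from above: a lookahead demands a digit, the body demands a letter.
    static final Pattern TOKEN = Pattern.compile(
            "\\b(?=[a-zA-Z0-9.,]*[0-9])[a-zA-Z0-9.,]*[a-zA-Z][a-zA-Z0-9.,]*\\b");

    // Collects every match in order of appearance.
    static List<String> tokens(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "M5 x 35mm Full Thread Hexagon Bolts (DIN 933) - PEEK "
                + "DescriptionThe M5 x 0.035mm, and 6NB7 plus a Go9IuN.";
        System.out.println(tokens(s)); // [M5, 35mm, M5, 0.035mm, 6NB7, Go9IuN]
    }
}
```

The trailing \b is what drops the comma after 0.035mm and the final dot after Go9IuN, since there is no word boundary between two non-word characters.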
Perhaps the following regex will do the job
(?=[A-Za-z,.]*\d)(?=[\d,.]*[A-Za-z])[A-Za-z\d,.]{2,}(?<![,.])
It starts with two positive lookaheads, which together form an AND condition.
The first lookahead (?=[A-Za-z,.]*\d) checks that the token contains at least one digit.
The second lookahead (?=[\d,.]*[A-Za-z]) checks that it contains at least one letter.
The actual match [A-Za-z\d,.]{2,} consumes at least two characters, each a letter, a digit, a comma or a dot.
At the end, (?<![,.]) checks that the match does not end with one of those special characters.
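If you want to try this variant from Java, something along these lines should do, again using the sentence from the question. The class and method names are placeholders of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadTokens {
    // Two lookaheads (digit AND letter), then the body, then a lookbehind
    // that forces the engine to give back a trailing comma or dot.
    static final Pattern TOKEN = Pattern.compile(
            "(?=[A-Za-z,.]*\\d)(?=[\\d,.]*[A-Za-z])[A-Za-z\\d,.]{2,}(?<![,.])");

    static List<String> tokens(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "M5 x 35mm Full Thread Hexagon Bolts (DIN 933) - PEEK "
                + "DescriptionThe M5 x 0.035mm, and 6NB7 plus a Go9IuN.";
        System.out.println(tokens(s)); // [M5, 35mm, M5, 0.035mm, 6NB7, Go9IuN]
    }
}
```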
I want to find all the words of length 3 starting with 'l' and ending with 'f'.
Here's my code:
Pattern pt = Pattern.compile("\\bl.+?f{3}\\b");
Matcher mt = pt.matcher("#Java life! Go ahead Java,lyf,fly,luf,loof");
while (mt.find()) {
    System.out.println(mt.group());
}
It prints nothing. I also tried Pattern pt = Pattern.compile("l.+?f{3}"); but I still don't get the expected output.
The output should be:
lyf luf
You can use a word boundary \b, then match for l, a word character \w and then f ending with a word boundary \b.
\bl\wf\b
Explanation
Match a word boundary \b
Match l
Match a word character \w (\w is a shorthand character class; by default in Java it matches the ASCII characters [A-Za-z0-9_])
Match an f
Match a word boundary \b
The regex you need is
\bl\wf\b
Explanation:
Since your word must be three characters long, there can only be one character between l and f, which is why I didn't put a quantifier there.
Your regex is wrong because
f{3} means three f's in a row, not three characters long in total
. matches any character (except line terminators by default), including non-word characters. Use \w instead.
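Here is the corrected pattern dropped into a loop like the one in the question, as a rough sketch. The class and helper names are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeLetterWords {
    // l, exactly one word character, f, with a word boundary on both sides.
    static final Pattern L_ANY_F = Pattern.compile("\\bl\\wf\\b");

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher mt = L_ANY_F.matcher(text);
        while (mt.find()) {
            words.add(mt.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "life" and "loof" are too long, and the l in "fly" is not at a word boundary.
        System.out.println(find("#Java life! Go ahead Java,lyf,fly,luf,loof")); // [lyf, luf]
    }
}
```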
I am new to regex. Going through a tutorial, I found that the regex [...] is described as "Matches any single character in brackets." So I tried:
System.out.println(Pattern.matches("[...]","[l]"));
I also tried escaping brackets
System.out.println(Pattern.matches("[...]","\\[l\\]"));
But it gives me false. I expected true because l is inside brackets.
It would be helpful if anybody could clear up my doubts.
Characters that are inside [ and ] (called a character class) are treated as a set of characters to choose from, except leading ^ which negates the result and - which means range (if it's between two characters). Examples:
[-123] matches -, 1, 2 or 3
[1-3] matches a single digit in the range 1 to 3
[^1-3] matches any character except any of the digits in the range 1 to 3
. (outside a character class) matches any character, except line terminators by default
[.] matches the dot .
If you want to match the string [l] you should change your regex to:
System.out.println(Pattern.matches("...", "[l]"));
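Now it prints true, because each . matches any single character. Note that this pattern would match any other three-character string as well.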
The regex [...] is equivalent to the regexes \. and [.].
The tutorial is a little misleading, it says:
[...] Matches any single character in brackets.
However, what it means is that the regex will match a single character against any of the characters inside the brackets. The ... means "insert characters you want to match here". So you need to replace the ... with the characters that you want to match against.
For example, [AP]M will match against "AM" and "PM".
If your regex is literally [...] then it will match against a literal dot. Note there is no point repeating characters inside the brackets.
The tutorial is saying:
Matches any single character in brackets.
It means you replace ... with the character or characters you want to allow, for example [l]
These will print true:
System.out.println(Pattern.matches("[l]","l"));
System.out.println(Pattern.matches("[.]","."));
System.out.println(Pattern.matches("[.]*","."));
System.out.println(Pattern.matches("[.]*","......"));
System.out.println(Pattern.matches("[.]+","......"));
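To put the points from these answers side by side, here is a small sketch. The class name and the check helper are made up, and using Pattern.quote for the literal string [l] is my suggestion rather than something from the tutorial.

```java
import java.util.regex.Pattern;

class BracketDemo {
    // Thin wrapper so each case reads as: does the whole input match the regex?
    static boolean check(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(check("[...]", "[l]"));     // false: the class only contains a dot
        System.out.println(check("[...]", "."));       // true: one literal dot
        System.out.println(check("[AP]M", "PM"));      // true: A or P, then M
        System.out.println(check("\\[l\\]", "[l]"));   // true: escaped brackets are literals
        System.out.println(check(Pattern.quote("[l]"), "[l]")); // true: whole pattern quoted
    }
}
```

Escaping belongs in the regex (the first argument), not in the input string, which is why the second attempt in the question still printed false.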
Right now I'm learning regular expressions in Java and I have a question about word boundaries. While looking into word boundaries in Java regular expressions, I found \b, which accepts a word bordered by non-word characters, so this regex
\b123\b
will accept the string 123 456 but will reject 456123456. Now I found that input like !$###%123^^%$# or "123" is still accepted by the regex above. Is there a word boundary or pattern that rejects a word bordered by non-alphanumeric characters (other than spaces), as in the examples above?
You want to use \s instead of \b. That will look for a whitespace character rather than a word boundary.
If you want your first example of 123 456 to be a match, however, then you will also need to use anchors to accept 123 at the immediate start or end of the string. This can be accomplished via (\s|^)123(\s|$). The caret ^ matches the start of the string and $ matches the end of the string.
(?<!\S)123(?!\S)
(?<!\S) matches a position that is not preceded by a non-whitespace character. (negative lookbehind)
(?!\S) matches a position that is not followed by a non-whitespace character. (negative lookahead)
I know this seems gratuitously complicated, but that's because \b conceals a lot of complexity. It's equivalent to this:
(?<=\w)(?!\w)|(?=\w)(?<!\w)
...meaning a position that's preceded by a word character and not followed by one, or a position that's followed by a word character and not preceded by one.
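A quick Java comparison of the two approaches on the inputs from the question, as a sketch only. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class BoundaryVsWhitespace {
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b123\\b");
    static final Pattern WHITESPACE_BOUNDARY = Pattern.compile("(?<!\\S)123(?!\\S)");

    // True when the pattern occurs anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"123 456", "456123456", "!$###%123^^%$#", "\"123\""};
        for (String s : inputs) {
            System.out.println(s + " -> \\b: " + found(WORD_BOUNDARY, s)
                    + ", lookarounds: " + found(WHITESPACE_BOUNDARY, s));
        }
    }
}
```

Only the first input is accepted by the lookaround version, while \b also accepts the last two because punctuation and quotes are non-word characters.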
Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"
\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
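As a sanity check, the stricter variant can be run against the examples from the question roughly like this. The class and method names are placeholders, and I'm assuming each candidate string is tested as a whole with matches().

```java
import java.util.regex.Pattern;

class QuotedPhrase {
    // Quote, anything but quotes, one whitespace character, the word, closing quote.
    static final Pattern QUOTED = Pattern.compile("\"[^\"]*\\sSPECIFICWORD\"");

    // True only when the entire candidate matches the pattern.
    static boolean matches(String candidate) {
        return QUOTED.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("\"RANDOMSTUFF SPECIFICWORD\""));                 // true
        System.out.println(matches("RANDOMSTUFF SPECIFICWORD"));                     // false
        System.out.println(matches("\"RANDOMSTUFF NOTTHESPECIFICWORD\""));           // false
        System.out.println(matches("\"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF\"")); // false
    }
}
```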
Also, this link could be useful. Regexes are powerful expressions and are applicable across virtually any language, so it is worth becoming comfortable with them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end makes the pattern match only if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
^.*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based: in many languages (Java included) they match the start and end of the whole string by default, and you have to specify a 'multiline' mode to make them match at the start and end of each line instead.
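In Java specifically, ^ and $ anchor to the whole input unless Pattern.MULTILINE is set. A small sketch of the difference, where the class name and the sample text are my own:

```java
import java.util.regex.Pattern;

class MultilineAnchors {
    static final String REGEX = "^.*\\s+SPECIFICWORD\"$";

    // Default mode: ^ and $ refer to the start and end of the whole input.
    static boolean foundDefault(String text) {
        return Pattern.compile(REGEX).matcher(text).find();
    }

    // MULTILINE mode: ^ and $ also match at the start and end of each line.
    static boolean foundMultiline(String text) {
        return Pattern.compile(REGEX, Pattern.MULTILINE).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "first line\n\"RANDOMSTUFF SPECIFICWORD\"\nlast line";
        System.out.println(foundDefault(text));   // false
        System.out.println(foundMultiline(text)); // true
    }
}
```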
Will this string be matched on a line-by-line basis, or will it be found within a larger text? If it is line by line, you can add anchors to ensure that the pattern matches the whole line.
^(\".*\sSPECIFICWORD\")$
This says: at the start of the line, look for a double quote followed by zero or more arbitrary characters, followed by a single whitespace character, followed by the specific word, followed by a double quote at the end of the line.
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1
Try:
\"[\w\s]*SPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
SPECIFICWORD matches the SPECIFICWORD
\" matches closing quote
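One caveat worth checking: because [\w\s]* can also consume letters directly in front of SPECIFICWORD, this pattern accepts "RANDOMSTUFF NOTTHESPECIFICWORD" too. The sketch below compares it with a variant that inserts \b before the word; that variant and all the names are my own suggestion, not part of the answer above.

```java
import java.util.regex.Pattern;

class WordBoundaryVariant {
    // Pattern as given above.
    static final Pattern LOOSE = Pattern.compile("\"[\\w\\s]*SPECIFICWORD\"");
    // Hypothetical tightening: require a word boundary right before the word.
    static final Pattern STRICT = Pattern.compile("\"[\\w\\s]*\\bSPECIFICWORD\"");

    static boolean loose(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String good = "\"RANDOMSTUFF SPECIFICWORD\"";
        String bad = "\"RANDOMSTUFF NOTTHESPECIFICWORD\"";
        System.out.println(loose(good) + " " + loose(bad));   // true true
        System.out.println(strict(good) + " " + strict(bad)); // true false
    }
}
```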