word boundary that rejects leading/end non-alphanumeric character - java

Right now I'm learning regular expression on Java and I have a question about the word boundaries. So when I looking for word boundaries on Java Regular Expression, I got this \b that accepts word bordered by non-word character so this regex
\b123\b
will accepts this string 123 456 but will rejects 456123456. Now I found that a condition like the word !$###%123^^%$# or "123" still got accepted by the regex above. Is there any word boundaries/pattern that rejects word that bordered by non-alphanumeric (except space) like the example above?

You want to use \s instead of \b. That will look for a whitespace character rather than a word boundary.
If you want your first example of 123 456 to be a match, however, then you will also need to use anchors to accept 123 at the immediate start or end of the string. This can be accomplished via (\s|^)123(\s|$). The carat ^ matches the start of the string and $ matches the end of the string.

(?<!\S)123(?!\S)
(?<!\S) matches a position that is not preceded by a non-whitespace character. (negative lookbehind)
(?!\S) matches a position that is not followed by a non-whitespace character. (negative lookahead)
I know this seems gratuitously complicated, but that's because \b conceals a lot of complexity. It's equivalent to this:
(?<=\w)(?!\w)|(?=\w)(?<!\w)
...meaning a position that's preceded by a word character and not followed by one, or a position that's followed by a word character and not preceded by one.

Related

Regex expression to match on hyphens in words within sentence based on occurrences of hyphen

I am trying to match on hyphens in a word but only if the hyphen occurs in said word say more than once
So in the phrase "Step-By-Step" the hyphens would be matched whereas in the phrase "Coca-Cola", the hyphens would not be matched.
In a full sentence combining phrases "Step-By-Step and Coca-Cola" only the hyphens within "Step-By-Step" would be expected to match.
I have the following expression currently, but this is matching all hyphens separated by non-digit characters regardless of occurences
((?=\D)-(?<=\D))
I can't seem to get the quantifiers to work with this expression, any ideas?
Java Regex Solution:
(?<=-[^\s-]{0,999})-|-(?=[^\s-]*-)
Java RegEx Demo
PCRE Regex Solution:
Here is a way to match all hyphens in a line with more than one hyphen in PCRE:
(?:(?:^|\s)(?=(?:[^\s-]*-){2})|(?!^)\G)[^\s-]*\K-
RegEx Demo
Explanation:
[^\s-]* matches a character that is not a whitespace and not a hyphen
(?=(?:[^\s-]*-){2}) is lookahead to make sure there are at least 2 hyphens in a non-whitespace substring
\G asserts position at the end of the previous match or the start of the string for the first match
\K resets match info
This matches at least two words each followed by hyphen, followed by another word (I'm assuming you don't want to allow hyphen at the very beginning or end, only between words).
(\w+-){2,}\w+

Regex not matching against ampersand

I'm trying to match the following regex:
\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\.?\b
In other words, a word boundary followed by any of the strings above (optionally followed by a period character) followed by a word boundary.
I'm trying to match this in Java, but the ampersand will not match. For example:
Pattern p = Pattern.compile(
"\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
Pattern.CASE_INSENSITIVE);
String result = p.matcher("mr one and mrs.two and three & four").replaceAll(" ");
System.out.println("["+result+"]");
The output of this is: [ one two three & four]
I've also tried this at regex101, and again the ampersand does not match: https://regex101.com/r/klkmwl/1
Escaping the ampersand does not make a difference, and I've tried using the hex escape sequence \x26 instead of ampersand (as suggested in this question). Why is this not matching?
Your regex will match an ampersand if it is located in between word chars, e.g. three&four, see this regex demo. This happens because \b before a non-word char requires a word char to appear immediately before it. Also, as there is a \b after an optional dot, both the dot and ampersand will only match if there is a word char immediately on the left.
You need to re-write the pattern so that the word boundaries are applied to the words rather than symbols:
Pattern p = Pattern.compile(
"(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
Pattern.CASE_INSENSITIVE);
See the regex demo online.
Problem is due to use of word boundaries. There are no word boundaries before or after a non-word character like &.
In place of word boundary you can use lookarounds:
(?<!\w)(?:[jsdm]r|mr?s|miss|messrs|mmes|prof|re|&|and)\.?(?!\w)
Updated RegEx Demo
(?<!\w): Make sure that previous character is not a word character
(?!\w): Make sure that next character is not a word character
Note some tweaks in your regex to make it shorter.

regex capture includes too much

I have a string from which I would like to caputre all after and including colon until (excluding) white space or paranthesis.
Why does the following regex include the paranthesis in the string match?
:(.*?)[\(\)\s] or also :(.+?)[\)\s] (non-greedy) does not work.
Example input: WHERE t.operator_id = :operatorID AND (t.merchant_id = :merchantID) AND t.readerApplication_id = :readerApplicationID AND t.accountType in :accountTypes
Should exctract :operatorID, :merchantID, :readerApplicationID, :accountTypes.
But my regexes extract for the second match :marchantID)
What is wrong and why?
Even if I use an exacter mapping condition in the capture, it does not work: :([a-zA-z0-9_]+?)[\)\(\s]
Put your conditional "followed by space or paren" as a lookahead, so that it sees but doesn't match. Right now you are explicitly matching parentheses with [\(\)\s]:
:(.+?)(?=[\s\(\)])
https://regex101.com/r/im8KWF/1/
Or, use the built-in \b "word boundary", which is also a "zero-width" assertion meaning the same thing*:
:(.+?)\b
https://regex101.com/r/FnnzGM/3/
*Definition of word boundary from regular-expressions.info:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a
word character. After the last character in the string, if the last
character is a word character. Between two characters in the string,
where one is a word character and the other is not a word character.

How to match regex with dollar amounts and phrases in Java?

I have this regex
Pattern pa = Pattern.compile("\\b(\\$|hello|world|foo|blah blargh)\\b");
Matcher m = pa.matcher("$");
boolean b = m.matches();
System.out.println(b);
This prints out false, but I'm not sure why.
Why?
https://coderpad.io/GWFMKYQQ --> coderpad if it helps.
The point is that \b word boundary is ambiguous: when it appears after a word character (i.e. a letter, digit or underscore), the next character must a non-word one or the end of string. When \b stands after a non-word character it requires a word character to appear right after it, also excluding the end of the string.
So, if your intent is to match $ only if it is not enclosed with word characters, use unambiguous (?<!\w) and (?!\w) lookarounds:
Pattern pa = Pattern.compile("(?<!\\w)(\\$|hello|world|foo|blah blargh)(?!\\w)")
(?<!\w) will fail the match if the $ is preceded with a word character, and (?!\w) negative lookahead will fail the match if $ is followed with a word character.
NOTE: If you add (?U) (or Pattern.UNICODE_CHARACTER_CLASS flag), \w and \b will become Unicode aware (it might be important in some cases).
I did a bit of research on this, and it turns out, the \b metacharacter does not like dollar signs. You can match a dollar sign after a space by using the regular expression below:
Pattern.compile("(\\s|^)\\$")
And trimming out the preceding whitespace with another regular expression:
Pattern.compile("\\S+")
Alternatively, since this is Java, and not JavaScript's crap regex engine, you can just use this:
Pattern.compile("(?<=\\s)\\$")

Regex for "* word"

Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"
\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
Also, this link could be useful. Regex's are powerful expressions and are applicable across virtually any language, so it would be a good thing to become comfortable with using them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end will only match if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
^.*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based; most languages allow you to specify a 'multiline' mode that would cause them to match 'start of the string/end of the string' instead of one line at a time.
Will this string be matched as a line by line basis or will it be found within the text? If so, you can add anchors to ensure that it matches the string.
^(\".*\sSPECIFICWPRD\")$
Saying, at the start of the line, look for a double quote followed by zero or more random characters followed by a single whitespace, followed by the specific word, followed by a double quote at the end of the string.
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1
Try:
\"[\w\s]*SPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
SPECIFICWORD matches the SPECIFICWORD
\" matches closing quote

Categories