Regex not matching against ampersand - java

I'm trying to match the following regex:
\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\.?\b
In other words, a word boundary followed by any of the strings above (optionally followed by a period character) followed by a word boundary.
I'm trying to match this in Java, but the ampersand will not match. For example:
Pattern p = Pattern.compile(
"\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
Pattern.CASE_INSENSITIVE);
String result = p.matcher("mr one and mrs.two and three & four").replaceAll(" ");
System.out.println("["+result+"]");
The output of this is: [ one two three & four]
I've also tried this at regex101, and again the ampersand does not match: https://regex101.com/r/klkmwl/1
Escaping the ampersand does not make a difference, and I've tried using the hex escape sequence \x26 instead of ampersand (as suggested in this question). Why is this not matching?

Your regex will match an ampersand if it is located in between word chars, e.g. three&four, see this regex demo. This happens because \b before a non-word char requires a word char to appear immediately before it. Also, as there is a \b after an optional dot, both the dot and the ampersand will only match if there is a word char immediately on the right.
You need to re-write the pattern so that the word boundaries are applied to the words rather than symbols:
Pattern p = Pattern.compile(
"(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
Pattern.CASE_INSENSITIVE);
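To see the effect on the input from the question, here is a minimal sketch that wraps the pattern above in a small helper. The class name TitleStripper and the method strip are made up for illustration; only the pattern comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TitleStripper {
    // \b wraps only the alphabetic alternatives; "&" is matched without boundaries.
    private static final Pattern TITLES = Pattern.compile(
            "(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
            Pattern.CASE_INSENSITIVE);

    // Replaces every title, "and", or "&" (plus an optional trailing dot) with a space.
    static String strip(String input) {
        return TITLES.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("mr one and mrs.two and three & four") + "]");
        System.out.println("[" + strip("a&b") + "]"); // prints "[a b]"
    }
}
```

Note that this version removes "&" wherever it appears, including between word characters; whether that is wanted depends on the data.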

The problem is the use of word boundaries. \b matches only between a word character and a non-word character, so there is no word boundary between & and an adjacent space (or the start or end of the string).
In place of word boundary you can use lookarounds:
(?<!\w)(?:[jsdm]r|mr?s|miss|messrs|mmes|prof|rev|&|and)\.?(?!\w)
(?<!\w): Make sure that previous character is not a word character
(?!\w): Make sure that next character is not a word character
Note some tweaks in your regex to make it shorter.
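As a rough check of the lookaround approach, the sketch below lists what the pattern finds. It uses the unshortened alternation from the question, and the class name LookaroundTitles and the method findAll are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class LookaroundTitles {
    // (?<!\w) and (?!\w) only look at the neighbouring characters,
    // so they behave the same for "&" as for the alphabetic alternatives.
    private static final Pattern TITLES = Pattern.compile(
            "(?<!\\w)(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?(?!\\w)",
            Pattern.CASE_INSENSITIVE);

    // Collects every match in order of appearance.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = TITLES.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("three & four"));    // [&]
        System.out.println(findAll("three&four"));      // []
        System.out.println(findAll("mr. one and two")); // [mr., and]
    }
}
```

Unlike the original \b version, a standalone "&" now matches, while "&" glued between two word characters does not.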

Related

Regex for matching a character later in the string if a certain character is present before

Let's say I have the following string
['json.key']
I want a regex pattern that will match the entire string because it contains the matching closing '] to the opening ['.
But sometimes the [' and '] don't have to exist, and it should be okay too.
jsonKey
But I don't want strings like these to match
['jsonKey
jsonKey']
Because they are missing the matching [' and '].
The current regex pattern I have for this is
(\[')?[\w-]+('])?
But this doesn't quite work because it lets the two last cases pass.
I need a regex pattern for Java and JavaScript code. But they are separate modules, it could be different patterns.
In Java or JavaScript you can use alternation and lookarounds like this:
(?<!\S)(?:\['[\w-]+']|[\w-]+)(?!\S)
RegEx Details:
(?<!\S): Assert that previous char is not a non-whitespace
(?:: Start non-capture group
\['[\w-]+']: Match [', then 1+ of word char or hyphen, then ']
|: OR
[\w-]+: Match 1+ of word char or hyphen
): End non-capture group
(?!\S): Assert that next char is not a non-whitespace
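If the key is the entire input rather than a token inside a longer text, the lookarounds can be replaced by a whole-string match. The following is a minimal Java sketch under that assumption; the class name JsonKeyCheck and the method isValid are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator, assuming the key is the complete input string.
class JsonKeyCheck {
    // Either a fully bracketed key ['key'] or a bare key, never half of each.
    private static final Pattern KEY = Pattern.compile("\\['[\\w-]+']|[\\w-]+");

    static boolean isValid(String s) {
        // matches() succeeds only if the whole input fits the pattern.
        return KEY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("['jsonKey']")); // true
        System.out.println(isValid("jsonKey"));     // true
        System.out.println(isValid("['jsonKey"));   // false
        System.out.println(isValid("jsonKey']"));   // false
    }
}
```

The character class [\w-] does not include a dot, so a key such as ['json.key'] is rejected by this pattern; adding . to the class would be needed if dotted keys should pass.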

Match starting and ending character using Java Matcher class

I want to get the words from a string that start with # and end with a space. I've tried using this Pattern.compile("#\\s*(\\w+)") but it doesn't include characters like ' or :.
I want the solution with only Pattern Matching method.
We can try matching using the pattern (?<=\\s|^)#\\S+, which would match any word starting with #, followed by one or more non-whitespace characters.
String line = "Here is a #hashtag and here is #another has tag.";
String pattern = "(?<=\\s|^)#\\S+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
System.out.println(m.group(0));
}
#hashtag
#another
Note: The above solution has an edge case: it pulls in any punctuation that appears at the end of a hashtag. If you don't want this, we can rephrase the regex to match only certain allowed characters, e.g. letters and numbers. But maybe this is not a concern for you.
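One possible way to handle that edge case while still keeping characters like ' and : inside a hashtag is to require the match to end on a word character. This is only a sketch of that idea; the pattern (?<=\s|^)#\S*\w and the class name HashtagFinder are assumptions, not something from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating one way to drop trailing punctuation.
class HashtagFinder {
    // \S* is greedy, then backtracks until the last character is a word character.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\S*\\w");

    static List<String> find(String line) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [#hashtag, #it's, #end]
        System.out.println(find("Here is a #hashtag, and #it's: fine. #end."));
    }
}
```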
The opposite of \s is \S, so you can use a regex like this:
#\s*(\S+)
Or for Java:
Pattern.compile("#\\s*(\\S+)")
It will capture anything that is not a white space.
If you want to stop on the space character only, and not on any whitespace, change the \S to [^ ].
A ^ at the start of a character class negates the class, so [^ ] matches any character except a space.
Pattern.compile("#\\s*([^ ]+)")

How to match regex with dollar amounts and phrases in Java?

I have this regex
Pattern pa = Pattern.compile("\\b(\\$|hello|world|foo|blah blargh)\\b");
Matcher m = pa.matcher("$");
boolean b = m.matches();
System.out.println(b);
This prints out false, but I'm not sure why.
Why?
https://coderpad.io/GWFMKYQQ --> coderpad if it helps.
The point is that the \b word boundary is context-dependent: when it appears after a word character (i.e. a letter, digit or underscore), the next character must be a non-word one or the end of the string. When \b stands after a non-word character, it requires a word character to appear right after it, which also excludes the end of the string.
So, if your intent is to match $ only if it is not enclosed with word characters, use unambiguous (?<!\w) and (?!\w) lookarounds:
Pattern pa = Pattern.compile("(?<!\\w)(\\$|hello|world|foo|blah blargh)(?!\\w)");
(?<!\w) will fail the match if the $ is preceded with a word character, and (?!\w) negative lookahead will fail the match if $ is followed with a word character.
NOTE: If you add (?U) (or Pattern.UNICODE_CHARACTER_CLASS flag), \w and \b will become Unicode aware (it might be important in some cases).
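The difference between the two patterns can be checked side by side. The sketch below is only an illustration; the class name DollarBoundary and its method names are made up, while both patterns are taken from the question and the answer.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of \b against the (?<!\w) / (?!\w) lookarounds.
class DollarBoundary {
    private static final Pattern WITH_B =
            Pattern.compile("\\b(\\$|hello|world|foo|blah blargh)\\b");
    private static final Pattern WITH_LOOKAROUNDS =
            Pattern.compile("(?<!\\w)(\\$|hello|world|foo|blah blargh)(?!\\w)");

    static boolean matchesWithB(String s) { return WITH_B.matcher(s).matches(); }
    static boolean matchesWithLookarounds(String s) { return WITH_LOOKAROUNDS.matcher(s).matches(); }
    static boolean foundWithB(String s) { return WITH_B.matcher(s).find(); }
    static boolean foundWithLookarounds(String s) { return WITH_LOOKAROUNDS.matcher(s).find(); }

    public static void main(String[] args) {
        System.out.println(matchesWithB("$"));             // false: no word char next to "$"
        System.out.println(matchesWithLookarounds("$"));   // true
        System.out.println(matchesWithB("hello"));         // true
        System.out.println(foundWithB("a$b"));             // true: word chars on both sides
        System.out.println(foundWithLookarounds("a$b"));   // false
    }
}
```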
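I did a bit of research on this, and it turns out that \b does not work next to a dollar sign on its own: $ is a non-word character, so \b only matches beside it when a word character is on the other side. You can match a dollar sign after whitespace (or at the start of the string) by using the regular expression below: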
Pattern.compile("(\\s|^)\\$")
And trimming out the preceding whitespace with another regular expression:
Pattern.compile("\\S+")
Alternatively, since Java's regex engine supports lookbehind, you can assert the whitespace (or the start of the string) without consuming it:
Pattern.compile("(?<=\\s|^)\\$")

Regex - How to recognize a String + white spaces + String

I need to recognize some pattern which goes like this:
[letters][some spaces][letters]
What I have done so far is this:
String regex = "[a-zA-Z]\\s+[a-zA-Z]";
In your requirement you wrote letters (with an s at the end):
[letters][some spaces][letters]
To match that, you need to quantify the character class:
String regex = "[a-zA-Z]+\\s+[a-zA-Z]+";
[a-zA-Z]+ matches one or more letters. Here + is the quantifier, which repeats [a-zA-Z] one or more times.
Whereas if you write [a-zA-Z]\\s+[a-zA-Z], it only matches a single letter before and after the spaces.
If you want the entire string to follow this pattern, you must add anchors to the pattern as well:
String regex = "^[a-zA-Z]+\\s+[a-zA-Z]+$";
^ Anchors the regex at the start of the string.
$ Anchors the regex at the end of the string.
These anchors ensure that the letters [a-zA-Z]+ begin immediately at the start of the string (^), are followed by whitespace and then letters again, and that the second group of letters is immediately followed by the end of the string ($).
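The three variants discussed above can be compared with a short sketch. The class name LettersSpacesLetters and its helper methods are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the unquantified, quantified, and anchored patterns.
class LettersSpacesLetters {
    // Returns the first substring the regex finds, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // True only if the input as a whole is letters, whitespace, letters.
    static boolean wholeStringMatches(String input) {
        return Pattern.compile("^[a-zA-Z]+\\s+[a-zA-Z]+$").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("[a-zA-Z]\\s+[a-zA-Z]", "hello world"));   // o w
        System.out.println(firstMatch("[a-zA-Z]+\\s+[a-zA-Z]+", "hello world")); // hello world
        System.out.println(wholeStringMatches("hello   world"));  // true
        System.out.println(wholeStringMatches("hello world!"));   // false
    }
}
```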

java regex until certain word/text/characters

Please consider the following text :
That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it.
How can I get the following result :
That is, it matches at any position that has a non-word character to the
That is, everything up to the word left.
String result = input.replaceAll("^(.*?)\\bleft.*$", "$1");
^ anchors to the beginning of the string
.*? matches as little as possible of any character
\b matches a word boundary
left matches the string literal "left"
.* matches the remainder of the string
$ anchors to the end of the string
$1 in the replacement stands for capture group 1, i.e. the text matched inside (), so the whole match is replaced by that group
If you want to use any word (not just "left"), be careful to escape it. You can use Pattern.quote(word) to escape the string.
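A minimal sketch of that generalisation follows. The class name TextBefore and the method before are made up; Pattern.quote, Pattern.DOTALL and Matcher.replaceFirst are standard java.util.regex calls. The \b is kept from the pattern above, so the sketch assumes the search word starts with a word character.

```java
import java.util.regex.Pattern;

// Hypothetical helper: returns the text that precedes the first occurrence of a word.
class TextBefore {
    static String before(String input, String word) {
        // Pattern.quote turns the word into a literal, so "." or "(" in it is not special.
        // DOTALL lets "." also cross line breaks in the input.
        Pattern p = Pattern.compile("^(.*?)\\b" + Pattern.quote(word) + ".*$", Pattern.DOTALL);
        // If the word is absent there is no match and the input comes back unchanged.
        return p.matcher(input).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String text = "That is, it matches at any position that has a non-word character "
                + "to the left of it, and a word character to the right of it.";
        System.out.println("[" + before(text, "left") + "]");
        // Without quoting, "a.b" would already match "axb".
        System.out.println("[" + before("axb then a.b", "a.b") + "]"); // [axb then ]
    }
}
```

The result keeps the space before the word; call trim() on it if that is not wanted.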
The answer is actually /(.*)\Wleft\w/, but it won't match anything in
That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it.
because there "left" is followed by a space rather than a word character. Dropping the \W and \w works:
String result = inputString.replaceAll("(.*?)left.*", "$1");
