Matching whole words with special characters with a dynamically built pattern - java

I need to match an exact substring in a string in Java. I've tried with
String pattern = "\\b"+subItem+"\\b";
But it doesn't work if my substring contains non alphanumerical characters.
I want this to work exactly as the "Match whole word only" function in Notepad++.
Could you help?

I suggest either unambiguous word boundaries (which match only if the search string is not immediately preceded or followed by a letter, digit or underscore):
String pattern = "(?<!\\w)"+Pattern.quote(subItem)+"(?!\\w)";
where (?<!\w) matches a location not immediately preceded by a word char and (?!\w) fails if there is a word char immediately after the current position (see this regex demo), or a variation that takes into account the leading/trailing special chars of the potential match:
String pattern = "(?:\\B(?!\\w)|\\b(?=\\w))" + Pattern.quote(subItem) + "(?:(?<=\\w)\\b|(?<!\\w)\\B)";
See the regex demo.
Details:
(?:\B(?!\w)|\b(?=\w)) - either a non-word boundary if the next char is not a word char, or a word boundary if the next char is a word char
Data\[3\] - this is the escaped subItem as shown in the demo (in Java, Pattern.quote("Data[3]") produces the equivalent \QData[3]\E)
(?:(?<=\w)\b|(?<!\w)\B) - either a word boundary if the preceding char is a word char, or a non-word boundary if the preceding char is not a word char.
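To see the patterns above in action, here is a minimal sketch. The class name, the method names and the sample strings are made up for illustration; only the regex fragments come from the answer. It also includes the plain \b version (with the sub-item quoted) to show why it fails when the sub-item ends with a non-word char.

```java
import java.util.regex.Pattern;

public class WholeWordDemo {
    // Plain \b version: fails when subItem starts or ends with a non-word char.
    static boolean containsNaive(String text, String subItem) {
        String pattern = "\\b" + Pattern.quote(subItem) + "\\b";
        return Pattern.compile(pattern).matcher(text).find();
    }

    // Lookaround version: no word char may touch the match on either side.
    static boolean containsWhole(String text, String subItem) {
        String pattern = "(?<!\\w)" + Pattern.quote(subItem) + "(?!\\w)";
        return Pattern.compile(pattern).matcher(text).find();
    }

    // Variation that picks \b or \B depending on the neighbouring chars.
    static boolean containsWholeAdaptive(String text, String subItem) {
        String pattern = "(?:\\B(?!\\w)|\\b(?=\\w))" + Pattern.quote(subItem)
                + "(?:(?<=\\w)\\b|(?<!\\w)\\B)";
        return Pattern.compile(pattern).matcher(text).find();
    }

    public static void main(String[] args) {
        String subItem = "Data[3]";
        System.out.println(containsNaive("x = Data[3] + 1", subItem));         // false
        System.out.println(containsWhole("x = Data[3] + 1", subItem));         // true
        System.out.println(containsWhole("x = MyData[3] + 1", subItem));       // false
        System.out.println(containsWholeAdaptive("x = Data[3] + 1", subItem)); // true
    }
}
```

The naive version returns false because there is no word boundary between "]" and the following space.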

Related

How to check if substring is contained within word Java

I want to write a regex pattern that looks at a string to see if there is a "." followed by letters or numbers or both with no space in between.
Currently I have:
Pattern.matches(".*(\\W+|\\d+|[a-z]+)\\.[a-z]+", testStr)
But this doesn't work if there are numbers or symbols after the ".". Can someone help me find a regex string that will return true for the string:
asdad-asdd/asdcs.pd(210)fsd
Just to reiterate the criteria for a successful match is the string contains any possible combination of letters, numbers, and/or symbols before and after a "."
You can match these strings by replacing the last [a-z]+ (the one after the dot) with [a-z\d\p{Punct}]+:
Pattern.matches(".*(\\W+|\\d+|[a-z]+)\\.[a-z\\d\\p{Punct}]+", testStr)
The [a-z\d\p{Punct}]+ pattern matches lowercase ASCII letters, digits or punctuation. Add A-Z into the brackets if you plan to allow uppercase ASCII letters. See the regex demo.
However, you might also match any non-whitespace chars with \S+:
Pattern.matches(".*(\\W+|\\d+|[a-z]+)\\.\\S+", testStr)
If you do not want to allow another dot:
Pattern.matches(".*(\\W+|\\d+|[a-z]+)\\.[^\\s.]+", testStr)
Here, [^\\s.]+ matches one or more chars other than whitespace and . chars.
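A quick way to compare these variants is to run them against the sample string from the question. This is only a sketch: the class, constant and method names are invented for the example, and the second test string is an assumed negative case.

```java
import java.util.regex.Pattern;

public class DotSuffixDemo {
    // The pattern from the question: only lowercase letters allowed after the dot.
    static final String ORIGINAL = ".*(\\W+|\\d+|[a-z]+)\\.[a-z]+";
    // Any non-whitespace chars after the dot.
    static final String ANY_NON_SPACE = ".*(\\W+|\\d+|[a-z]+)\\.\\S+";
    // Non-whitespace chars other than a dot after the (last) dot.
    static final String NO_DOT_AFTER = ".*(\\W+|\\d+|[a-z]+)\\.[^\\s.]+";

    static boolean test(String regex, String s) {
        // Pattern.matches requires the whole string to match.
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String sample = "asdad-asdd/asdcs.pd(210)fsd";
        System.out.println(test(ORIGINAL, sample));         // false
        System.out.println(test(ANY_NON_SPACE, sample));    // true
        System.out.println(test(NO_DOT_AFTER, sample));     // true
        System.out.println(test(ANY_NON_SPACE, "abc.de f")); // false: space after the dot
    }
}
```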

Regex not matching against ampersand

I'm trying to match the following regex:
\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\.?\b
In other words, a word boundary followed by any of the strings above (optionally followed by a period character) followed by a word boundary.
I'm trying to match this in Java, but the ampersand will not match. For example:
Pattern p = Pattern.compile(
"\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
Pattern.CASE_INSENSITIVE);
String result = p.matcher("mr one and mrs.two and three & four").replaceAll(" ");
System.out.println("["+result+"]");
The output of this is: [ one two three & four]
I've also tried this at regex101, and again the ampersand does not match: https://regex101.com/r/klkmwl/1
Escaping the ampersand does not make a difference, and I've tried using the hex escape sequence \x26 instead of ampersand (as suggested in this question). Why is this not matching?
Your regex will match an ampersand if it is located in between word chars, e.g. three&four, see this regex demo. This happens because \b before a non-word char requires a word char to appear immediately before it. Also, as there is a \b after an optional dot, both the dot and ampersand will only match if there is a word char immediately on the left.
You need to re-write the pattern so that the word boundaries are applied to the words rather than symbols:
Pattern p = Pattern.compile(
"(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
Pattern.CASE_INSENSITIVE);
See the regex demo online.
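The difference between the two patterns is easier to see when the matches are listed instead of replaced. A small sketch, where the class name and the findAll helper are hypothetical and the input is the string from the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleMatchDemo {
    // Pattern from the question: \b is applied to "&" as well.
    static final Pattern ORIGINAL = Pattern.compile(
        "\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
        Pattern.CASE_INSENSITIVE);

    // Rewritten pattern: \b only wraps the words, "&" stands on its own.
    static final Pattern FIXED = Pattern.compile(
        "(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
        Pattern.CASE_INSENSITIVE);

    // Collects every match of p in text, in order.
    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "mr one and mrs.two and three & four";
        System.out.println(findAll(ORIGINAL, text)); // [mr, and, mrs., and]
        System.out.println(findAll(FIXED, text));    // [mr, and, mrs., and, &]
    }
}
```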
The problem is caused by the word boundaries. \b only matches between a word character and a non-word character, so there is no word boundary between a non-word character like & and the spaces around it.
In place of word boundary you can use lookarounds:
(?<!\w)(?:[jsdm]r|mr?s|miss|messrs|mmes|prof|rev|&|and)\.?(?!\w)
Updated RegEx Demo
(?<!\w): Make sure that previous character is not a word character
(?!\w): Make sure that next character is not a word character
Note some tweaks in your regex to make it shorter.

regex capture includes too much

I have a string from which I would like to capture everything after and including a colon, up to (but excluding) whitespace or a parenthesis.
Why does the following regex include the parenthesis in the string match?
:(.*?)[\(\)\s] or also :(.+?)[\)\s] (non-greedy) does not work.
Example input: WHERE t.operator_id = :operatorID AND (t.merchant_id = :merchantID) AND t.readerApplication_id = :readerApplicationID AND t.accountType in :accountTypes
It should extract :operatorID, :merchantID, :readerApplicationID, :accountTypes.
But for the second match my regexes extract :merchantID)
What is wrong and why?
Even if I use a stricter character class in the capture, it does not work: :([a-zA-z0-9_]+?)[\)\(\s]
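Put your conditional "followed by space or paren" in a lookahead, so that it is checked but not consumed. Right now you are explicitly matching the parenthesis or whitespace with [\(\)\s], so it becomes part of the match. (If a parameter can also be the last thing in the string, as :accountTypes is in your example, use (?=[\s\(\)]|$) as the lookahead instead.)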
:(.+?)(?=[\s\(\)])
https://regex101.com/r/im8KWF/1/
Or, use the built-in \b "word boundary", which is also a "zero-width" assertion and has the same effect here, because the parameter names consist only of word characters*:
:(.+?)\b
https://regex101.com/r/FnnzGM/3/
*Definition of word boundary from regular-expressions.info:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
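Since the question is about Java, the lookahead approach could be sketched as follows. The class and method names are hypothetical, and the lookahead here carries an extra |$ alternative (an addition to the pattern shown above) so that a parameter at the very end of the string is found too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedParamDemo {
    // Colon, then as few chars as possible, up to (but not including)
    // whitespace, a parenthesis, or the end of the input.
    private static final Pattern PARAM = Pattern.compile(":(.+?)(?=[\\s()]|$)");

    // Returns every ":name" placeholder found in the query, in order.
    static List<String> extract(String query) {
        List<String> names = new ArrayList<>();
        Matcher m = PARAM.matcher(query);
        while (m.find()) {
            names.add(m.group()); // whole match, colon included
        }
        return names;
    }

    public static void main(String[] args) {
        String query = "WHERE t.operator_id = :operatorID AND (t.merchant_id = :merchantID) "
                + "AND t.readerApplication_id = :readerApplicationID AND t.accountType in :accountTypes";
        System.out.println(extract(query));
        // [:operatorID, :merchantID, :readerApplicationID, :accountTypes]
    }
}
```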

Regex matching entire string

How can I make my string only pass a test if every character in the string is in the regex?
Here is what I have so far:
String w = theApplet.Word.getText().toLowerCase();
if(w.matches(".*[a-z-_]+.*")){
theApplet.words.add(w);
theApplet.str.setText("The word: "+w+" has been added to the list");
}
However, the string is valid even if it contains invalid characters, as long as it contains at least 1 of the characters in the regex.
.* means "match any character zero or more times"
[a-z-_]+ means "match any lowercase character or dash (-) or underscore (_) one or more times".
So the first part is consuming nearly the entire string and the regex is returning true if there is at least one lowercase character/dash/underscore.
Simply remove the .*'s: String.matches() only succeeds when the pattern matches the entire string, so [a-z-_]+ on its own forces every character to be a lowercase letter, dash or underscore.
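A minimal sketch of the check without the .*'s. The class and method names are invented for the example, and the hyphen is moved to the end of the character class, which is just a common way to make it unambiguously literal.

```java
public class WholeStringDemo {
    // matches() must consume the whole string, so every character
    // has to be a lowercase letter, an underscore or a dash.
    static boolean isValidWord(String w) {
        return w.matches("[a-z_-]+");
    }

    public static void main(String[] args) {
        System.out.println(isValidWord("hello-world_x")); // true
        System.out.println(isValidWord("hello world"));   // false: contains a space
        System.out.println(isValidWord("abc!"));          // false: contains '!'
    }
}
```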

java regex until certain word/text/characters

Please consider the following text :
That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it.
How can I get the following result :
That is, it matches at any position that has a non-word character to the
That is, everything up to the word "left".
input.replaceAll("^(.*?)\\bleft.*$", "$1");
^ anchors to the beginning of the string
.*? matches as little as possible of any character
\b matches a word boundary
left matches the string literal "left"
.* matches the remainder of the string
$ anchors to the end of the string
$1 replaces the matched string with group 1 in ()
If you want to use any word (not just "left"), be careful to escape it. You can use Pattern.quote(word) to escape the string.
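Putting the pieces together for an arbitrary word might look like the sketch below. The class and method names are hypothetical. It uses replaceFirst, one of the String methods that interpret their first argument as a regex (String.replace treats it as literal text).

```java
import java.util.regex.Pattern;

public class TextBeforeWord {
    // Returns the part of input that precedes the first whole-word
    // occurrence of word; returns input unchanged if word is absent.
    static String before(String input, String word) {
        String regex = "^(.*?)\\b" + Pattern.quote(word) + ".*$";
        return input.replaceFirst(regex, "$1");
    }

    public static void main(String[] args) {
        String text = "That is, it matches at any position that has a non-word character "
                + "to the left of it, and a word character to the right of it.";
        // Prints the text up to "to the " (note the trailing space).
        System.out.println(before(text, "left"));
    }
}
```

Note that . does not match line terminators by default, so this sketch assumes single-line input.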
The answer is actually /(.*)\Wleft\w/ but it won't match anything in
That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it.
String result = inputString.replaceAll("(.*?)left.*", "$1");
