Regex Word Boundary, Pattern.quote and Parentheses - java

I am writing a function that allows users to search a field of text for search terms that they can enter, and mark them up in some way such as highlighting. What I currently have is:
String text = "This is my (simple) test.";
String searchExpression = "(?i)\\b(" + Pattern.quote(searchTerm) + ")\\b";
String replaceExpression = markupToken + "$1" + markupToken;
String newText = text.replaceAll(searchExpression, replaceExpression);
This works great if the search term is "simple"; however, if the user searches for "(simple)" it will not match. If I remove Pattern.quote or the \b's, this works fine.
Is there a way to modify the searchExpression that it will work in both of these scenarios?

Your regex is failing because \b (word boundary) cannot match before ( and after ) here, since parentheses are not considered word characters.
You can tweak your regex as this:
String searchExpression = "(?i)(?<!\\w(?=\\w))(" + Pattern.quote(searchTerm) +
")(?!(?<=\\w)\\w)";
i.e. use lookarounds on either side, which require that there is no word character before the pattern if the search term starts with a word character, and no word character after it if the search term ends with one.
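A minimal sketch of that expression in use. The class name, the helper name highlight and the "*" markup token are assumptions for illustration, not part of the question; Matcher.quoteReplacement is added so that a token containing $ or \ is not read as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryHighlight {
    // Hypothetical helper: wraps each whole-term occurrence of searchTerm in markupToken.
    static String highlight(String text, String searchTerm, String markupToken) {
        String searchExpression = "(?i)(?<!\\w(?=\\w))(" + Pattern.quote(searchTerm)
                + ")(?!(?<=\\w)\\w)";
        // Protect '$' and '\' in the token from being treated as replacement syntax.
        String token = Matcher.quoteReplacement(markupToken);
        return text.replaceAll(searchExpression, token + "$1" + token);
    }

    public static void main(String[] args) {
        String text = "This is my (simple) test.";
        System.out.println(highlight(text, "simple", "*"));   // This is my (*simple*) test.
        System.out.println(highlight(text, "(simple)", "*")); // This is my *(simple)* test.
        System.out.println(highlight(text, "simp", "*"));     // This is my (simple) test.
    }
}
```

The last call shows that a partial word such as "simp" is still rejected, which is the behaviour the original \b version was meant to provide.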

Related

Replace string for exact match with conditions

Sample string
astabD (tabD) tabD .tabD tabD. (tabD tabD)
I need to replace tabD with something like temp.tabD for every occurrence in the above string except the first and second ones.
For this I tried replaceAll with word boundary
str.replaceAll("\\b" + "tabD" + "\\b", "temp.tabD")
This works except for the second occurrence. I would appreciate any help: '(' and ')' also matter here, and only an occurrence enclosed by both has to be ignored.
You may use
.replaceAll("\\b(?<!\\((?=\\w+\\)))tabD\\b", "")
Or, if tabD comes in from user input:
String s = "astabD (tabD) tabD .tabD tabD. (tabD tabD)";
String word = "tabD";
String wordRx = Pattern.quote(word);
s = s.replaceAll("(?<!\\w)(?<!\\((?=" + wordRx + "\\)))" + wordRx + "(?!\\w)", "");
See the regex demo.
Details
\b - a word boundary ((?<!\w) is an unambiguous left word boundary)
(?<!\((?=\w+\))) - a negative lookbehind that fails the match if right before the current location there is a ( that is followed with 1+ word chars (\w+ is required to match the tabD word) followed with ) (NOTE: If your IDE tells you the + is inside a lookbehind, it is the IDE bug since the + is in a lookahead here and + / * quantifiers are allowed in lookaheads)
tabD - the word to find
\b - a word boundary ((?!\w) is an unambiguous right word boundary)
Java demo:
String s = "astabD (tabD) tabD .tabD tabD. (tabD tabD)";
System.out.println(s.replaceAll("\\b(?<!\\((?=\\w+\\)))tabD\\b", ""));
// => astabD (tabD)  . . ( )
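The demos above replace with an empty string. A sketch of the same user-input pattern with the temp. prefix from the question might look like this; the class name, the helper name addPrefix and the use of Matcher.quoteReplacement are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixUnlessParenthesized {
    // Hypothetical helper: prefixes each standalone `word`, skipping the exact form "(word)".
    static String addPrefix(String s, String word, String prefix) {
        String wordRx = Pattern.quote(word);
        String regex = "(?<!\\w)(?<!\\((?=" + wordRx + "\\)))" + wordRx + "(?!\\w)";
        // quoteReplacement guards against '$' or '\' in the prefix or the word.
        return s.replaceAll(regex, Matcher.quoteReplacement(prefix + word));
    }

    public static void main(String[] args) {
        String s = "astabD (tabD) tabD .tabD tabD. (tabD tabD)";
        System.out.println(addPrefix(s, "tabD", "temp."));
        // => astabD (tabD) temp.tabD .temp.tabD temp.tabD. (temp.tabD temp.tabD)
    }
}
```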

Matching a word with pound (#) symbol in a regex

I have a regexp that checks whether some text contains a word (as a whole word):
String regexp = ".*\\bSOME_WORD_HERE\\b.*";
but this regexp returns false when "SOME_WORD" starts with # (hashtag).
Example, without #
String text = "some text and test word";
String matchingWord = "test";
boolean contains = text.matches(".*\\b" + matchingWord + "\\b.*");
// now contains == true;
But with hashtag `contains` was false. Example:
text = "some text and #test word";
matchingWord = "#test";
contains = text.matches(".*\\b" + matchingWord + "\\b.*");
// contains == false, but I expect true
The \b# pattern matches a # that is preceded with a word character: a letter, digit or underscore.
If you need to match # that is not preceded with a word char, use a negative lookbehind (?<!\w). Similarly, to make sure the trailing \b matches if a non-word char is there, use (?!\w) negative lookahead:
text.matches("(?s).*(?<!\\w)" + matchingWord + "(?!\\w).*");
Using Pattern.quote(matchingWord) is a good idea if your matchingWord can contain special regex metacharacters.
Alternatively, if you plan to match your search words in between whitespace or start/end of string, you can use (?<!\S) as the initial boundary and (?!\S) as the trailing one:
text.matches("(?s).*(?<!\\S)" + matchingWord + "(?!\\S).*");
And one more thing: the .* in the .matches is not the best regex solution. A regex like "(?<!\\S)" + matchingWord + "(?!\\S)" with Matcher#find() will be processed in a much more optimized way, but you will need to initialize the Matcher object for that.
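A sketch of the Matcher#find() variant with whitespace boundaries, combined with Pattern.quote as suggested above. The class name and the helper name containsToken are hypothetical.

```java
import java.util.regex.Pattern;

class TokenSearch {
    // Hypothetical helper: true if `word` occurs delimited by whitespace or the string ends.
    static boolean containsToken(String text, String word) {
        Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(word) + "(?!\\S)");
        // find() scans for the pattern anywhere in the text, so no surrounding .* is needed.
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsToken("some text and #test word", "#test"));    // true
        System.out.println(containsToken("some text and #testing word", "#test")); // false
        System.out.println(containsToken("some text and test word", "test"));      // true
    }
}
```

If the same word is searched for in many texts, compiling the Pattern once and reusing it avoids recompiling on every call.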
If you are looking for words with a leading '#', simply remove the leading '#' from the search word and use the following regex (matches() tests the whole string, so the pattern needs .* on both sides):
text.matches(".*#\\b" + matchingWordWithoutLeadingHash + "\\b.*");

Java Pattern / Matcher not finding word break

I am having trouble with Java Pattern and Matcher. I've included a very simplified example of what I'm trying to do.
I had expected the pattern ".\b" to find the last character of the first word (or "4" in the example), but as I step through the code, m.find() always returns false. What am I missing here?
Why does the following Java code always print out "Not Found"?
Pattern p = Pattern.compile(".\b");
Matcher m = p.matcher("102939384 is a word");
int ixEndWord = 0;
if (m.find()) {
ixEndWord = m.end();
System.out.println("Found: " + ixEndWord);
} else {
System.out.println("Not Found");
}
You need to escape special characters in the regex: ".\\b"
Basically, in a String the backslash has to be escaped. So "\\" becomes the character '\'.
So the String ".\\b" becomes the literal String ".\b", which will be used by the Pattern.
To expand upon AntonH's comment, whenever you want the "\" character to appear in a regular expression, you have to escape it so that it actually appears in the string you are passing in.
As is, ".\b" is the string of a dot . followed by the special backspace character represented by \b, compared to ".\\b", which is the regex .\b.
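The difference between the two string literals can be checked directly. This is a small sketch; the class name and the helper name firstMatchEnd are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns the end index of the first match of regex in input, or -1 if there is none.
    static int firstMatchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "102939384 is a word";
        System.out.println(".\b".length());   // 2: a dot plus one backspace character
        System.out.println(".\\b".length());  // 3: a dot, a backslash and the letter b
        System.out.println(firstMatchEnd(".\b", input));  // -1 (no backspace in the input)
        System.out.println(firstMatchEnd(".\\b", input)); // 9 (just after the "4")
    }
}
```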

How to replace last letter to another letter in java using regular expression

I have seen "," replaced with "." by using ".$"|",$", but this logic does not work with letters.
I need to replace the last letter of every word in the string EXAMPLE_TEST with another letter, using Java.
This is my code:
Pattern replace = Pattern.compile("n$");//here got the real problem
matcher2 = replace.matcher(EXAMPLE_TEST);
EXAMPLE_TEST=matcher2.replaceAll("k");
I also tried "//n$", "\n$", etc.
Please help me find a solution.
input text=>njan ayman
output text=> njak aymak
Instead of the end of string $ anchor, use a word boundary \b
String s = "njan ayman";
s = s.replaceAll("n\\b", "k");
System.out.println(s); //=> "njak aymak"
You can use lookahead and group matching:
String EXAMPLE_TEST = "njan ayman";
String s = EXAMPLE_TEST.replaceAll("(n)(?=\\s|$)", "k");
System.out.println("s = " + s); // prints: s = njak aymak
Explanation:
(n) - the matched word character
(?=\\s|$) - which is followed by a space or at the end of the line (lookahead)
The above is only an example! If you want to replace every comma with a period, the middle line should be changed to:
s = s.replaceAll("(,)(?=\\s|$)", "\\.");
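The \b version and the lookahead version agree on "njan ayman" but differ once punctuation follows a word. A short sketch of the difference, with hypothetical class and method names:

```java
class LastLetterBoundary {
    // Replaces a trailing 'n' wherever a word ends, including before punctuation.
    static String withWordBoundary(String s) {
        return s.replaceAll("n\\b", "k");
    }

    // Replaces a trailing 'n' only when whitespace or the end of input follows.
    static String withWhitespaceLookahead(String s) {
        return s.replaceAll("(n)(?=\\s|$)", "k");
    }

    public static void main(String[] args) {
        String s = "njan, ayman.";
        System.out.println(withWordBoundary(s));        // njak, aymak.
        System.out.println(withWhitespaceLookahead(s)); // njan, ayman.
    }
}
```

Which one is appropriate depends on whether a word followed by a comma or period should count as ending in that letter.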
Here's how I would set it up:
(?=.\b)\w
Which in Java would need to be escaped as following:
(?=.\\b)\\w
It translates to something like "a character (\w) after (?=) any single character (.) at the end of a word (\b)".
String s = "njan ayman aowkdwo wdonwan. wadawd,.. wadwdawd;";
s = s.replaceAll("(?=.\\b)\\w", "");
System.out.println(s); //nja ayma aowkdw wdonwa. wadaw,.. wadwdaw;
This removes the last character of all words, but leaves following non-alphanumeric characters. You can specify only specific characters to remove/replace by changing the . to something else.
However, the other answers are perfectly good and might achieve exactly what you are looking for.
// "n" stands in for the old letter and "k" for the new one
if (word.endsWith("n")) {
word = word.substring(0, word.length() - 1) + "k";
}

When does the following regex match?

I found the following regex in one of the Android Source file:
String regex = "\\s+(?i)src=\"cid(?-i):\\Q" + attachment.mContentId + "\\E\"";
if (string.matches(regex)) {
System.out.println("Matched");
} else {
System.out.println("Not Found");
}
NOTE: attachment.mContentId will basically have values like C4EA83841E79F643970AF3F20725CB04#gmail.com
I made a sample code as below:
String content = "Hello src=\"cid:something#gmail.com\" is present";
String contentId = "something#gmail.com";
String regex = "\\s+(?i)src=\"cid(?-i):\\Q" + contentId + "\\E\"";
if(content.matches(regex))
System.out.println("Present");
else
System.out.println("Not Present");
This always gives "Not Present" as output.
But when I am doing the below:
System.out.println(content.replaceAll(regex, " Replaced Value"));
The output then contains the replaced value. If the regex is "Not Present", how can replaceAll find it and replace it with the new value? Please clear up my confusion.
Can anybody say what kind of content in string will make the control go to the if part?
String regex = "\\s+(?i)src=\"cid(?-i):\\Q" + attachment.mContentId + "\\E\"";
Break it down:
\\s+ - Match 1 or more whitespace characters
(?i) - Turn on case-insensitive matching for the subsequent string
src=\"cid - match src="cid
(?-i) - Turn off case-insensitive matching
: - Obviously a colon
\\Q - Treat all following stuff before \\E as literal characters,
and not control characters. Special regex characters are disabled until \\E
attachment.mContentId - whatever your string is
\\E - End the literal quoting sandwich started by \\Q
\" - End quote
So it will match a string like src="cid:YOUR-STRING-LITERAL"
Or, to use your own example, something like this string will match (there are leading white space characters):
src="cid:C4EA83841E79F643970AF3F20725CB04#gmail.com"
For your update
The problem you're running into is that java.lang.String.matches() does not do what you expect it to.
String.matches() (like Matcher.matches()) only succeeds if the entire string matches the regular expression.
If you use this regex:
String regex = "\\s+(?i)src=\"cid(?-i):\\Q" + attachment.mContentId + "\\E\"";
And this input:
String content = "Hello src=\"cid:something#gmail.com\" is present";
content will never match the regex because the entire string doesn't match the regular expression.
What you want to do is use Matcher.find - this should work for you.
String content = "Hello src=\"cid:something#gmail.com\" is present";
String contentId = "something#gmail.com";
Pattern pattern = Pattern.compile("\\s+(?i)src=\"cid(?-i):\\Q" + contentId + "\\E\"");
Matcher m = pattern.matcher(content);
if(m.find())
System.out.println("Present");
else
System.out.println("Not Present");
IDEone example: https://ideone.com/8RTf0e
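The three behaviours from the question can be put side by side in one sketch. The class name and the helper name buildRegex are assumptions; the regex itself is the one from the question.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    // Hypothetical helper that builds the regex from the question for a given content id.
    static String buildRegex(String contentId) {
        return "\\s+(?i)src=\"cid(?-i):\\Q" + contentId + "\\E\"";
    }

    public static void main(String[] args) {
        String content = "Hello src=\"cid:something#gmail.com\" is present";
        String regex = buildRegex("something#gmail.com");

        // matches() must consume the whole string, so "Hello" and " is present" make it fail.
        System.out.println(content.matches(regex)); // false
        // find() looks for the pattern anywhere in the string.
        System.out.println(Pattern.compile(regex).matcher(content).find()); // true
        // replaceAll() searches like find(), which is why the replacement happened.
        System.out.println(content.replaceAll(regex, " Replaced Value")); // Hello Replaced Value is present
        // matches() succeeds once the pattern is allowed to cover the rest of the string.
        System.out.println(content.matches("(?s).*" + regex + ".*")); // true
    }
}
```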
That regex will match any
src="cid:contentId"
where only contentId has to match case-sensitively.
For instance, given your example contentId (C4EA83841E79F643970AF3F20725CB04#gmail.com), these strings will match:
SrC="CiD:C4EA83841E79F643970AF3F20725CB04#gmail.com"
src="cid:C4EA83841E79F643970AF3F20725CB04#gmail.com"
SRC="CID:C4EA83841E79F643970AF3F20725CB04#gmail.com"
while these will not match:
src="cid:c4Ea83841e79F643970aF3f20725Cb04#GmaiL.com"
src="cid:C4EA83841E79F643970AF3F20725CB04#GMAIL.COM"
Also the contentId part is escaped (\Q ... \E) so that the regex engine will not consider special characters inside it.
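A sketch of the case-sensitivity scope, using a shortened, made-up content id. The class and method names are hypothetical, and Pattern.quote is swapped in here as an equivalent of wrapping the id in \Q ... \E by hand.

```java
import java.util.regex.Pattern;

class CaseScopeDemo {
    // Hypothetical helper: tests a whole candidate string, which must start with whitespace.
    static boolean matchesAttribute(String candidate, String contentId) {
        // (?i) covers src="cid only; (?-i) restores case-sensitive matching for the id.
        String regex = "\\s+(?i)src=\"cid(?-i):" + Pattern.quote(contentId) + "\"";
        return candidate.matches(regex);
    }

    public static void main(String[] args) {
        String id = "AbC#gmail.com"; // made-up id for illustration
        System.out.println(matchesAttribute(" SrC=\"CiD:AbC#gmail.com\"", id)); // true
        System.out.println(matchesAttribute(" src=\"cid:abc#GMAIL.com\"", id)); // false
        System.out.println(matchesAttribute("src=\"cid:AbC#gmail.com\"", id));  // false (no leading whitespace)
    }
}
```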
