Matching a word with pound (#) symbol in a regex - java

I have regexp for check if some text containing word (with ignoring boundary)
String regexp = ".*\\bSOME_WORD_HERE\\b.*";
but this regexp return false when "SOME_WORD" starts with # (hashtag).
Example, without #
String text = "some text and test word";
String matchingWord = "test";
boolean contains = text.matches(".*\\b" + matchingWord + "\\b.*");
// now contains == true;
But with hashtag `contains` was false. Example:
text = "some text and #test word";
matchingWord = "#test";
contains = text.matches(".*\\b" + matchingWord + "\\b.*");
//contains == fasle; but I expect true

The \b# pattern matches a # that is preceded with a word character: a letter, digit or underscore.
If you need to match # that is not preceded with a word char, use a negative lookbehind (?<!\w). Similarly, to make sure the trailing \b matches if a non-word char is there, use (?!\w) negative lookahead:
text.matches("(?s).*(?<!\\w)" + matchingWord + "(?!\\w).*");
Using Pattern.quote(matchingWord) is a good idea if your matchingWord can contain special regex metacharacters.
Alternatively, if you plan to match your search words in between whitespace or start/end of string, you can use (?<!\S) as the initial boundary and (?!\S) as the trailing one
text.matches("(?s).*(?<!\\S)" + matchingWord + "(?!\\S).*");
And one more thing: the .* in the .matches is not the best regex solution. A regex like "(?<!\\S)" + matchingWord + "(?!\\S)" with Matcher#find() will be processed in a much more optimized way, but you will need to initialize the Matcher object for that.

If you are looking for words with leading '#', just simple remove the leading '#' from the searchword and use following regex.
text.matches("#\\b" + matchingWordWithoutLeadingHash + "\\b");

Related

Replace string for exact match with conditions

Sample string
astabD (tabD) tabD .tabD tabD. (tabD tabD)
I need to replace tabD with something like temp.tabD for every of the occurrence in the above string except the first and second one.
For this I tried replaceAll with word boundary
str.replaceAll("\b"+ "tabD" + "\b","temp.tabD"))
Works except for the second occurrence. Would appreciate any help since '(' and ')' are also keywords and only the occurrence of both have to be ignored.
You may use
.replaceAll("\\b(?<!\\((?=\\w+\\)))tabD\\b", "")
Or, if tabD comes in from user input:
String s = "astabD (tabD) tabD .tabD tabD. (tabD tabD)";
String word = "tabD";
String wordRx = Pattern.quote(word);
s = s.replaceAll("(?<!\\w)(?<!\\((?=" + wordRx + "\\)))" + wordRx + "(?!\\w)", "");
See the regex demo.
Details
\b - a word boundary ((?<!\w) is an unambiguous left word boundary)
(?<!\((?=\w+\))) - a negative lookbehind that fails the match if right before the current location there is a ( that is followed with 1+ word chars (\w+ is required to match the tabD word) followed with ) (NOTE: If your IDE tells you the + is inside a lookbehind, it is the IDE bug since the + is in a lookahead here and + / * quantifiers are allowed in lookaheads)
tabD - the word to find
\b - a word boundary ((?!\w) is an unambiguous right word boundary)
Java demo:
String s = "astabD (tabD) tabD .tabD tabD. (tabD tabD)";
System.out.println(s.replaceAll("\\b(?<!\\((?=\\w+\\)))tabD\\b", ""));
// => astabD (tabD) . . ( )

Matching a whole word with leading or trailing special symbols like dollar in a string

I can replace dollar signs by using Matcher.quoteReplacement. I can replace words by adding boundary characters:
from = "\\b" + from + "\\b";
outString = line.replaceAll(from, to);
But I can't seem to combine them to replace words with dollar signs.
Here's an example. I am trying to replace "$temp4" (NOT $temp40) with "register1".
String line = "add, $temp4, $temp40, 42";
String to = "register1";
String from = "$temp4";
String outString;
from = Matcher.quoteReplacement(from);
from = "\\b" + from + "\\b"; //do whole word replacement
outString = line.replaceAll(from, to);
System.out.println(outString);
Outputs
"add, $temp4, $temp40, 42"
How do I get it to replace $temp4 and only $temp4?
Use unambiguous word boundaries, (?<!\w) and (?!\w), instead of \b that are context dependent:
from = "(?<!\\w)" + Pattern.quote(from) + "(?!\\w)";
See the regex demo.
The (?<!\w) is a negative lookbehind that fails the match if there is a non-word char immediately to the left of the current location and (?!\w) is a negative lookahead that fails the match if there is a non-word char immediately to the right of the current location. The Pattern.quote(from) is necessary to escape any special chars in the from variable.
See the Java demo:
String line = "add, $temp4, $temp40, 42";
String to = "register1";
String from = "$temp4";
String outString;
from = "(?<!\\w)" + Pattern.quote(from) + "(?!\\w)";
outString = line.replaceAll(from, to);
System.out.println(outString);
// => add, register1, $temp40, 42
Matcher.quoteReplacement() is for the replacement string (to), not the regex (from). To include a string literal in the regex, use Pattern.quote():
from = Pattern.quote(from);
$ has special meaning in regex (it means “end of input”). To remove any special meaning from characters in your target, wrap it in regex quote/unquote expressions \Q...\E. Also, because $ is not ”word” character, the word boundary won’t wiork, so use look arounds instead:
line = line.replaceAll("(?<!\\S)\\Q" + from + "\\E(?![^ ,])", to);
Normally, Pattern.quote is the way to go to escape characters that may be specially interpreted by the regex engine.
However, the regular expression is still incorrect, because there is no word boundary before the $ in line; space and $ are both non-word characters. You need to place the word boundary after the $ character. There is no need for Pattern.quote here, because you're escaping things yourself.
String from = "\\$\\btemp4\\b";
Or more simply, because you know there is a word boundary between $ and temp4 already:
String from = "\\$temp4\\b";
The from variable can be constructed from the expression to replace. If from has "$temp4", then you can escape the dollar sign and add a word boundary.
from = "\\" + from + "\\b";
Output:
add, register1, $temp40, 42

How to find and skip special characters at the start and end of the word

New to regex and using following code to find if a word contains special characters at the end/start.
String s = "K-factor:";
String regExp = "^[^<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]*$";
Matcher matcher = Pattern.compile(regExp).matcher(s);
while (matcher.find()) {
System.out.println("Start: "+ matcher.start());
System.out.println("End: "+ matcher.end());
System.out.println("Group: "+ matcher.group());
s = s.substring(0, matcher.start());
}
Would like to find if there's any special character(: in this sample code) at the start or end of the string. Trying to skip the character.
Neither compile time error nor output.
Note that your regex matches a whole string that does not contain the chars you defined in the character class. The string in question does not match that pattern since it contains :.
You might consider splitting the pattern into two parts to check for the unwanted chars at the start or end using an alternation group:
String regExp = "^[<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]|[<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]$";
Here, the pattern has a ^<special_char_class>|<special_char_class>$ structure, ^ anchors the match at start, $ anchors the match at the string end, and | is the alternation operator. Note I removed the ^ from the start of the character class to make them positive rather than negated, so that they could match those chars/ranges defined in the class.
Alternatively, since you seem to just match a string if it contains a non-letter at the start/end, you may use a
String regExp = "^\\P{L}|\\P{L}$";
that is Unicode letter aware or - ASCII only:
String regExp = "^\\P{Alpha}|\\P{Alpha}$";

Regex Word Boundary, Pattern.quote and Parentheses

I am writing a function that allows users to search a field of text for search terms that they can enter, and mark them up in some way such as highlighting. What I currently have is:
String text = "This is my (simple) test.";
String searchExpression = "(?i)\\b(" + Pattern.quote(searchTerm) + ")\\b";
String replaceExpression = markupToken + "$1" + markupToken;
String newText = text.replaceAll(searchExpression, replaceExpression);
This works great if search term is "simple"; however, if the user searches for "(simple)" it will not successfully match. If I remove Pattern.quote or the \b's this works fine.
Is there a way to modify the searchExpression that it will work in both of these scenarios?
Your regex is failing because you cannot match \b (word boundary) before and and after ( and ) since these are not considered word characters.
You can tweak your regex as this:
String searchExpression = "(?i)(?<!\\w(?=\\w))(" + Pattern.quote(searchTerm) +
")(?!(?<=\\w)\\w)";
i.e. use lookarounds on either side which means there should not a word character before and after the pattern if search term has a word character at start and end.

Can you help with regular expressions in Java?

I have a bunch of strings which may of may not have random symbols and numbers in them. Some examples are:
contains(reserved[j])){
close();
i++){
letters[20]=word
I want to find any character that is NOT a letter, and replace it with a white space, so the above examples look like:
contains reserved j
close
i
letters word
What is the best way to do this?
It depends what you mean by "not a letter", but assuming you mean that letters are a-z or A-Z then try this:
s = s.replaceAll("[^a-zA-Z]", " ");
If you want to collapse multiple symbols into a single space then add a plus at the end of the regular expression.
s = s.replaceAll("[^a-zA-Z]+", " ");
yourInputString = yourInputString.replaceAll("[^\\p{Alpha}]", " ");
^ denotes "all characters except"
\p{Alpha} denotes all alphabetic characters
See Pattern for details.
I want to find any character that is NOT a letter
That will be [^\p{Alpha}]+. The [] indicate a group. The \p{Alpha} matches any alphabetic character (both uppercase and lowercase, it does basically the same as \p{Upper}\p{Lower} and a-zA-Z. The ^ inside group inverses the matches. The + indicates one-or-many matches in sequence.
and replace it with a white space
That will be " ".
Summarized:
string = string.replaceAll("[^\\p{Alpha}]+", " ");
Also see the java.util.regex.Pattern javadoc for a concise overview of available patterns. You can learn more about regexs at the great site http://regular-expression.info.
Use the regexp /[^a-zA-Z]/ which means, everything that is not in the a-z/A-Z characters
In ruby I would do:
"contains(reserved[j]))".gsub(/[^a-zA-Z]/, " ")
=> "contains reserved j "
In Java should be something like:
import java.util.regex.*;
...
String inputStr = "contains(reserved[j])){";
String patternStr = "[^a-zA-Z]";
String replacementStr = " ";
// Compile regular expression
Pattern pattern = Pattern.compile(patternStr);
// Replace all occurrences of pattern in input
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll(replacementStr);

Categories