String matching in Java

I am struggling with my "dirty word" filter: it also finds partial matches.
For example, if I pass these two arguments, replaceWord("ass", "passing pass passed ass"),
to this method:
private static String replaceWord(String word, String input) {
Pattern legacyPattern = Pattern.compile(word, Pattern.CASE_INSENSITIVE);
Matcher matcher = legacyPattern.matcher(input);
StringBuilder returnString = new StringBuilder();
int index = 0;
while(matcher.find()) {
returnString.append(input.substring(index,matcher.start()));
for(int i = 0; i < word.length() - 1; i++) {
returnString.append('*');
}
returnString.append(word.substring(word.length()-1));
index = matcher.end();
}
if(index < input.length()){
returnString.append(input.substring(index));
}
return returnString.toString();
}
I get p**sing p**s p**sed **s
when I really just want "passing pass passed **s".
Does anyone know how to avoid this partial matching with this method?
Any help would be great, thanks!

This tutorial from Oracle should point you in the right direction.
You want to use a word boundary in your pattern:
Pattern p = Pattern.compile("\\bword\\b", Pattern.CASE_INSENSITIVE);
Note, however, that this is still problematic (as profanity filtering always is). A "non-word character" that defines the boundary is anything not included in [0-9A-Za-z_].
So, for example, the word would not be matched in _ass, because the underscore counts as a word character.
You also have the problem of profanity-derived terms, where the term is prepended to, say, "hole", "wipe", etc.
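To make the word-boundary behaviour concrete, here is a minimal sketch. The class name BoundaryDemo and the letter-only lookaround pattern are illustrative assumptions, not part of the answer above; the lookarounds are just one possible way to treat '_' and digits as separators rather than as part of a word.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Standard word boundary: letters, digits and '_' all count as word characters.
    static final Pattern WORD_BOUNDARY =
            Pattern.compile("\\bass\\b", Pattern.CASE_INSENSITIVE);

    // Assumed alternative: only ASCII letters count as part of a word,
    // so '_' and digits act as separators.
    static final Pattern LETTER_BOUNDARY =
            Pattern.compile("(?<![A-Za-z])ass(?![A-Za-z])", Pattern.CASE_INSENSITIVE);

    // True if the pattern occurs anywhere in the text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found(WORD_BOUNDARY, "passing pass passed")); // false
        System.out.println(found(WORD_BOUNDARY, "what an ass"));         // true
        System.out.println(found(WORD_BOUNDARY, "_ass"));                // false
        System.out.println(found(LETTER_BOUNDARY, "_ass"));              // true
    }
}
```

Neither variant addresses derived terms; it only changes what counts as the edge of a word.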

I'm working on a dirty word filter right now, and the option I chose was Soundex plus some regex.
I first strip out strange characters, keeping only those matched by \w, which is [a-zA-Z_0-9].
Then I use soundex(String) to produce a code that can be compared against the Soundex code of the word I want to test.
String soundExOfDirtyWord = Soundex.soundex(dirtyWord);
String soundExOfTestWord = Soundex.soundex(testWord);
if (soundExOfTestWord.equals(soundExOfDirtyWord)) {
System.out.println("The test word sounds like " + dirtyWord);
}
I just keep a list of dirty words in the program and have SoundEx run through them to check. The algorithm is something worth looking at.

You could also use the replaceAll() method from the Matcher class. It replaces all occurrences of the pattern with the replacement string you specify. Something like this:
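private static String replaceWord(String word, String input) {
    // Quote the word so that any regex metacharacters in it are matched literally.
    Pattern legacyPattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
    Matcher matcher = legacyPattern.matcher(input);
    // Build the mask: a star for every character except the last one.
    StringBuilder replacement = new StringBuilder();
    for (int i = 0; i < word.length() - 1; i++) {
        replacement.append('*');
    }
    replacement.append(word.charAt(word.length() - 1));
    // quoteReplacement keeps '$' and '\' from being read as group references.
    return matcher.replaceAll(Matcher.quoteReplacement(replacement.toString()));
}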

Related

How do I replace the same word but different case in the same sentence separately?

For example, replace "HOW do I replace different how in the same sentence by using Matcher?" with "LOL do I replace different lol in the same sentence?"
If HOW is all caps, replace it with LOL. Otherwise, replace it with lol.
I only know how to find them:
String source = "HOW do I replace different how in the same " +
        "sentence by using Matcher?";
Pattern pattern = Pattern.compile("how", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(source);
while (m.find()) {
    if (m.group().matches("^[A-Z]*$"))
        System.out.println("I am uppercase");
    else
        System.out.println("I am lowercase");
}
But I don't know how to replace them by using matcher and pattern.
Here's one way to achieve your goal (not necessarily the most efficient, but it works and is easy to understand):
String source = "HOW do I replace different how in the same sentence by using Matcher?";
String[] split = source.replaceAll("HOW", "LOL").split(" ");
String newSource = "";
for(int i = 0; i < split.length; i++) {
String at = split[i];
if(at.equalsIgnoreCase("how")) at = "lol";
newSource+= " " + at;
}
newSource = newSource.substring(1); // drop the leading space
// The output string is newSource
Replace all uppercase, then iterate over each word and replace the remaining "how"s with "lol". That substring at the end is simply to remove the extra space.
I came up with a really dumb solution:
String result = source;
result = result.replaceAll(oldWord, newWord);
result = result.replaceAll(oldWord.toUpperCase(),
        newWord.toUpperCase());
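Since the question asks specifically for a Matcher-based replacement, here is a hedged sketch using Matcher.appendReplacement and appendTail. The class and method names (CaseAwareReplace, replace) are made up for illustration, and the rule "all caps in, all caps out, otherwise lowercase" is taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseAwareReplace {
    // Replaces every whole-word, case-insensitive occurrence of `word`.
    // An all-uppercase hit gets the uppercase replacement, any other hit the lowercase one.
    static String replace(String source, String word, String replacement) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = pattern.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hit = m.group();
            boolean allCaps = hit.equals(hit.toUpperCase());
            String chosen = allCaps ? replacement.toUpperCase() : replacement.toLowerCase();
            // quoteReplacement guards against '$' or '\' in the replacement text.
            m.appendReplacement(out, Matcher.quoteReplacement(chosen));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "HOW do I replace different how in the same sentence?";
        // prints: LOL do I replace different lol in the same sentence?
        System.out.println(replace(source, "how", "lol"));
    }
}
```

Unlike the split-based version, this keeps the original whitespace and punctuation untouched.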

Regex replace space and word to toFirstUpper of word

I was trying to use regex to change the following string
String input = "Creation of book orders";
to
String output = "CreationOfBookOrders";
I tried the following expecting to replace the space and word with word.
input.replaceAll("\\s\\w", "(\\w)");
input.replaceAll("\\s\\w", "\\w");
but this replaces the space and the following character with the literal character 'w' instead of the matched character.
I am not able to use WordUtils, StringUtils or similar utility classes. Otherwise I could have capitalized each word with WordUtils.capitalize or a similar method and then removed all the spaces.
How else (preferably using regex) can I get the above output from the input?
I don't think you can do that with String.replaceAll. The only modification you can make in the replacement string is to interpolate groups matched by the regex.
The javadoc for Matcher.replaceAll explains how the replacement string is handled.
You will need to use a loop (or, on Java 9 and later, the Matcher.replaceAll overload that takes a function). Here's a simple version:
StringBuilder sb = new StringBuilder(input);
Pattern pattern = Pattern.compile("\\s\\w");
// Match against the builder itself so positions stay valid as it shrinks.
Matcher matcher = pattern.matcher(sb);
int pos = 0;
while (matcher.find(pos)) {
    // Drop the whitespace character and uppercase the letter that follows it.
    String replacement = matcher.group().substring(1).toUpperCase();
    pos = matcher.start();
    sb.replace(pos, pos + 2, replacement);
    pos += 1;
}
String output = sb.toString();
(This could be done more efficiently, but it is complicated.)
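For readers on Java 9 or later, the same transformation can be sketched with the Matcher.replaceAll overload that accepts a function of the match result. The class name CamelCaseJoin and the choice to swallow runs of whitespace with \s+ are assumptions made for this example.

```java
import java.util.regex.Pattern;

class CamelCaseJoin {
    // One or more whitespace characters followed by a captured word character.
    private static final Pattern SPACE_THEN_CHAR = Pattern.compile("\\s+(\\w)");

    // Removes the whitespace and uppercases the character that followed it.
    // Requires Java 9+ for Matcher.replaceAll(Function<MatchResult, String>).
    static String join(String input) {
        return SPACE_THEN_CHAR.matcher(input)
                .replaceAll(match -> match.group(1).toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(join("Creation of book orders")); // CreationOfBookOrders
    }
}
```

The first word is left as it is, so an input that starts with a lowercase letter keeps it.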

Extract every complete word that contains a certain substring

I'm trying to write a function that extracts each word from a sentence that contains a certain substring e.g. Looking for 'Po' in 'Porky Pork Chop' will return Porky Pork.
I've tested my regex on regexpal but the Java code doesn't seem to work. What am I doing wrong?
private static String foo()
{
String searchTerm = "Pizza";
String text = "Cheese Pizza";
String sPattern = "(?i)\b("+searchTerm+"(.+?)?)\b";
Pattern pattern = Pattern.compile ( sPattern );
Matcher matcher = pattern.matcher ( text );
if(matcher.find ())
{
String result = "-";
for(int i=0;i < matcher.groupCount ();i++)
{
result+= matcher.group ( i ) + " ";
}
return result.trim ();
}else
{
System.out.println("No Luck");
}
}
In Java, to pass the \b word boundary to the regex engine you need to write it as \\b in a string literal. A plain \b in a String literal represents a backspace character.
Judging by your example, you want to return all words that contain your substring. To do this, don't use for(int i=0;i < matcher.groupCount ();i++) but while(matcher.find()), since the group count iterates over all groups in a single match, not over all matches.
If your search term can contain special characters, you should probably use Pattern.quote(searchTerm).
In your code you are trying to find "Pizza" in "Cheese Pizza", so I assume that you also want to find words that are identical to the searched substring. Although your regex will work fine for that, you can change the last part, (.+?)?), to \\w* and also add \\w* at the start if the substring should also be matched in the middle of a word (not only at the start).
So your code can look like this:
private static String foo() {
String searchTerm = "Pizza";
String text = "Cheese Pizza, Other Pizzas";
String sPattern = "(?i)\\b\\w*" + Pattern.quote(searchTerm) + "\\w*\\b";
StringBuilder result = new StringBuilder("-").append(searchTerm).append(": ");
Pattern pattern = Pattern.compile(sPattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
result.append(matcher.group()).append(' ');
}
return result.toString().trim();
}
While the regex approach is certainly a valid method, I find it easier to think through when you split the words up by whitespace. This can be done with String's split method.
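public List<String> doIt(final String inputString, final String term) {
    final List<String> output = new ArrayList<String>();
    // Split on runs of whitespace, then keep every word that contains the term.
    final String[] parts = inputString.split("\\s+");
    for (final String part : parts) {
        // contains() also accepts a match at index 0, which indexOf(term) > 0 would miss.
        if (part.contains(term)) {
            output.add(part);
        }
    }
    return output;
}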
Of course, it is worth noting that this effectively makes two passes through your input String: the first pass finds the whitespace characters to split on, and the second pass looks through each split word for your substring.
If a single pass is necessary, though, the regex path is better.
I find nicholas.hauschild's answer to be the best.
However if you really wanted to use regex, you could do it as such:
String searchTerm = "Pizza";
String text = "Cheese Pizza";
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(searchTerm)
+ "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
}
Output:
Pizza
The pattern should have been
String sPattern = "(?i)\\b("+searchTerm+"(?:.+?)?)\\b";
You want to capture the whole string (pizza). The ?: makes the inner group non-capturing, which ensures you don't capture a part of the string twice.
Try this pattern:
String searchTerm = "Po";
String text = "Porky Pork Chop oPod zzz llPo";
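// Try the longest alternative (letters on both sides) first so that "oPod" is not cut short to "oPo".
Pattern p = Pattern.compile("\\p{Alpha}+" + searchTerm + "\\p{Alpha}+|\\p{Alpha}+" + searchTerm + "|" + searchTerm + "\\p{Alpha}+");
Matcher m = p.matcher(text);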
while(m.find()) {
System.out.println(">> " + m.group());
}
OK, here is a pattern in raw form (not Java syntax; you must double the backslashes yourself):
(?i)\b[a-z]*po[a-z]*\b
And that's all.

Performing multiple string replacements with metacharacter regex patterns

I am trying to perform multiple string replacements using Java's Pattern and Matcher, where the regex pattern may include metacharacters (e.g. \b, (), etc.). For example, for the input string fit i am, I would like to apply the replacements:
\bi\b --> EYE
i --> I
I then followed the coding pattern from two questions (Java Replacing multiple different substring in a string at once, Replacing multiple substrings in Java when replacement text overlaps search text). In both, they create an or'ed search pattern (e.g foo|bar) and a Map of (pattern, replacement), and inside the matcher.find() loop, they look up and apply the replacement.
The problem I am having is that the matcher.group() function does not contain information on matching metacharacters, so I cannot distinguish between i and \bi\b. Please see the code below. What can I do to fix the problem?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.*;
public class ReplacementExample
{
public static void main(String argv[])
{
Map<String, String> replacements = new HashMap<String, String>();
replacements.put("\\bi\\b", "EYE");
replacements.put("i", "I");
String input = "fit i am";
String result = doit(input, replacements);
System.out.printf("%s\n", result);
}
public static String doit(String input, Map<String, String> replacements)
{
String patternString = join(replacements.keySet(), "|");
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
StringBuffer resultStringBuffer = new StringBuffer();
while (matcher.find())
{
System.out.printf("match found: %s at start: %d, end: %d\n",
matcher.group(), matcher.start(), matcher.end());
String matchedPattern = matcher.group();
String replaceWith = replacements.get(matchedPattern);
// Do the replacement here.
matcher.appendReplacement(resultStringBuffer, replaceWith);
}
matcher.appendTail(resultStringBuffer);
return resultStringBuffer.toString();
}
private static String join(Set<String> set, String delimiter)
{
StringBuilder sb = new StringBuilder();
int numElements = set.size();
int i = 0;
for (String s : set)
{
sb.append(Pattern.quote(s));
if (i++ < numElements-1) { sb.append(delimiter); }
}
return sb.toString();
}
}
This prints out:
match found: i at start: 1, end: 2
match found: i at start: 4, end: 5
fIt I am
Ideally, it should be fIt EYE am.
You mistyped one of your regexes:
replacements.put("\\bi\\", "EYE"); //Should be \\bi\\b
replacements.put("i", "I");
You may also want to make sure your regexes cannot both match the same text. There is no guarantee of ordering with a HashMap's keySet(), so it may replace i with I before \\bi\\b is ever tried.
You could use capture groups, without straying too far from your existing design. So instead of using the matched pattern as the key, you look up based on the order within a List.
You would need to change the join method to put parentheses around each of the patterns, something like this:
private static String join(Set<String> set, String delimiter) {
StringBuilder sb = new StringBuilder();
sb.append("(");
int numElements = set.size();
int i = 0;
for (String s : set) {
sb.append(s);
if (i++ < numElements - 1) {
sb.append(")");
sb.append(delimiter);
sb.append("("); }
}
sb.append(")");
return sb.toString();
}
As a side note, the use of Pattern.quote in the original code listing would have caused the match to fail where those metacharacters were present.
Having done this, you would now need to determine which of the capture groups was responsible for the match. For simplicity I'm going to assume that none of the match patterns will themselves contain capture groups, in which case something like this would work, within the matcher while loop:
int index = -1;
for (int j=1;j<=replacements.size();j++){
if (matcher.group(j) != null) {
index = j;
break;
}
}
if (index >= 0) {
System.out.printf("Match on index %d = %s %d %d\n", index, matcher.group(index), matcher.start(index), matcher.end(index));
}
Next, we would like to use the resulting index value to index straight back into the replacements. The original code uses a HashMap, which is not suitable for this; you're going to have to refactor that to use a pair of Lists in some form, one containing the list of match patterns and the other the corresponding list of replacement strings. I won't do that here, but I hope that provides enough detail to create a working solution.
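One way the pieces might fit together is sketched below. It swaps the pair of Lists for a LinkedHashMap, which preserves insertion order, and it keeps the stated assumption that no rule contains capture groups of its own. The class and method names (OrderedReplacements, apply) are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedReplacements {
    // `rules` maps regex -> replacement and must iterate in priority order
    // (for example a LinkedHashMap). Assumes no rule has its own capture groups.
    static String apply(String input, Map<String, String> rules) {
        List<String> replacements = new ArrayList<>(rules.values());
        StringBuilder alternation = new StringBuilder();
        for (String regex : rules.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Wrap each rule in its own group; the regex is NOT quoted.
            alternation.append('(').append(regex).append(')');
        }
        Matcher matcher = Pattern.compile(alternation.toString()).matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // The first non-null group identifies which rule matched.
            for (int g = 1; g <= replacements.size(); g++) {
                if (matcher.group(g) != null) {
                    matcher.appendReplacement(out,
                            Matcher.quoteReplacement(replacements.get(g - 1)));
                    break;
                }
            }
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("\\bi\\b", "EYE"); // more specific rule first
        rules.put("i", "I");
        System.out.println(apply("fit i am", rules)); // fIt EYE am
    }
}
```

Because alternation tries the groups from left to right, putting the more specific rule first is what lets \bi\b win over the plain i.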

How to determine where a regex failed to match using Java APIs

I have tests where I validate the output with a regex. When it fails it reports that output X did not match regex Y.
I would like to add some indication of where in the string the match failed. E.g. what is the farthest the matcher got in the string before backtracking. Matcher.hitEnd() is one case of what I'm looking for, but I want something more general.
Is this possible to do?
If a match fails, then Matcher.hitEnd() tells you whether a longer string could have matched. In addition, you can specify a region in the input sequence that will be searched to find a match. So if you have a string that cannot be matched, you can test its prefixes to see where the match fails:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LastMatch {
private static int indexOfLastMatch(Pattern pattern, String input) {
Matcher matcher = pattern.matcher(input);
for (int i = input.length(); i > 0; --i) {
Matcher region = matcher.region(0, i);
if (region.matches() || region.hitEnd()) {
return i;
}
}
return 0;
}
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
String[] samples = {
"*ABC",
"A1b*",
"AB12uv",
"AB12uv*",
"ABCDabc",
"ABC123X"
};
for (String sample : samples) {
int lastMatch = indexOfLastMatch(pattern, sample);
System.out.println(sample + ": last match at " + lastMatch);
}
}
}
The output of this class is:
*ABC: last match at 0
A1b*: last match at 3
AB12uv: last match at 6
AB12uv*: last match at 6
ABCDabc: last match at 4
ABC123X: last match at 6
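To turn that index into the kind of test-failure hint the question asks for, a small helper could print a caret under the offending position. This is only a sketch built on the same region() and hitEnd() technique; the names MatchFailureReport, longestViablePrefix and describe are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFailureReport {
    // Length of the longest prefix that either matches or could still
    // be extended into a match (same idea as indexOfLastMatch above).
    static int longestViablePrefix(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        for (int i = input.length(); i > 0; --i) {
            matcher.region(0, i);
            if (matcher.matches() || matcher.hitEnd()) {
                return i;
            }
        }
        return 0;
    }

    // Two-line message: the input, then a caret under the first position
    // at which the input stops being a viable prefix.
    static String describe(Pattern pattern, String input) {
        int at = longestViablePrefix(pattern, input);
        StringBuilder sb = new StringBuilder(input).append('\n');
        for (int i = 0; i < at; i++) {
            sb.append(' ');
        }
        return sb.append("^ stopped matching at index ").append(at).toString();
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
        System.out.println(describe(pattern, "A1b*"));
    }
}
```

For an input that matches completely, the caret lands just past the end of the string, so a caller would normally only build this message after matches() has returned false.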
You can take the string, and iterate over it, removing one more char from its end at every iteration, and then check for hitEnd():
int farthestPoint(Pattern pattern, String input) {
for (int i = input.length() - 1; i > 0; i--) {
Matcher matcher = pattern.matcher(input.substring(0, i));
if (!matcher.matches() && matcher.hitEnd()) {
return i;
}
}
return 0;
}
You could use a pair of replaceAll() calls to indicate the positive and negative matches of the input string. Let's say, for example, you want to validate a hex string; the following will indicate the valid and invalid characters of the input string.
String regex = "[0-9A-F]";
String input = "J900ZZAAFZ99X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
// Mark every valid character with '+', then everything else with '-'.
String mask = m.replaceAll("+").replaceAll("[^+]", "-");
System.out.println(input);
System.out.println(mask);
This would print the following, with a + under valid characters and a - under invalid characters.
J900ZZAAFZ99X
-+++--+++-++-
If you want to do it outside of the code, I use Rubular to test regular expressions before putting them in the code.
