RegEx pattern with unusual unicode character and word boundaries

I'm stuck on a problem with RegEx patterns and I hope somebody can explain it to me:
The task is to match object names and remove them from a description that's stored in one of the object's fields. I tried the following expression:
final String description= object.getDescrition();
final Matcher descriptionMatcher=
Pattern.compile("\\b" + object.getName() + "\\b", Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE)
.matcher(description);
All works fine until the code encounters a "registered trademark" symbol appended to the name: String name = "ObjectName®";
If I remove the last word boundary, the name is matched again. What is the reason for this behaviour, and how can I improve this code so that it handles every such special case?
Note: the trademark sign is not separated from the object name via space.

In this case, change your pattern to:
"\\b\\Q" + object.getName() + "\\E(?<=\\b|®)"
If you need to deal with more complex cases, use alternations in lookarounds instead of word boundaries. Example:
"(?<=\\s|^)\\Q" + object.getName() + "\\E(?=\\s|$)"
or
"(?<=\\s|^)" + Pattern.quote(object.getName()) + "(?=\\s|$)"

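The ® character is not considered a word character, so there is no word boundary between it and a following space or the end of the input. The trailing \\b therefore fails and your Pattern does not match.
A quick and dirty solution, if you only have this case, is to accept either a word boundary or a preceding ® at the end of the name:
Pattern.compile("\\b" + Pattern.quote(object.getName()) + "(?:\\b|(?<=®))", Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE)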
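
As a rough illustration of the answers above, the following sketch contrasts the original \\b...\\b pattern with the whitespace-lookaround variant. The class name, method names and sample description are hypothetical and not taken from the question; the flags mirror the ones used by the asker.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are illustrative assumptions.
class TrademarkBoundaryDemo {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // The original approach: word boundaries on both sides of the name.
    static boolean foundWithWordBoundaries(String description, String name) {
        return Pattern.compile("\\b" + Pattern.quote(name) + "\\b", FLAGS)
                .matcher(description)
                .find();
    }

    // Lookaround approach: the name must be delimited by whitespace or the
    // start/end of the input, regardless of which characters it contains.
    static String removeName(String description, String name) {
        return Pattern.compile("(?<=\\s|^)" + Pattern.quote(name) + "(?=\\s|$)", FLAGS)
                .matcher(description)
                .replaceAll("");
    }

    public static void main(String[] args) {
        String description = "The ObjectName® is shipped by default.";
        String name = "ObjectName®";
        // No boundary exists between '®' and the following space, so find() fails.
        System.out.println(foundWithWordBoundaries(description, name));
        // The lookaround version matches and removes the name.
        System.out.println(removeName(description, name));
    }
}
```

Note that removing the name leaves the surrounding spaces in place, so a follow-up clean-up of double spaces may still be needed.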

Related

Java Regex complex ID expression filtering

I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from String representation of the text.
The idea here is to capture IDs as whole words longer than 4 characters and remove them. IDs must consist of both letters and numbers, in any order. They can have optional special symbols like :.- and are generally all uppercase, except for several cases where there might be one and (for now) exactly one lowercase letter in them. IDs can be encountered at any place in the sentence, and there are multiple sentences inside the String. I am also trying to capture the preceding space (if there is one) so there is no double space after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what needs and doesn't need to be caught by the regular expression, as well as display my progress so far. I am using standard java.util.regex package for implementation.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.

My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);
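
To make the behaviour of that expression easier to inspect, here is a small sketch that lists what it matches instead of deleting it. The two lookaheads require at least one digit and at least one letter somewhere in the run of allowed characters, and the final class requires the run to be at least five characters long. The class and method names are hypothetical; the regex is the one from the answer, with the redundant escapes inside the character class dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting what the ID expression captures.
class IdFilterDemo {
    // Characters allowed inside an ID (letters are case-insensitive via (?i)).
    static final String ID_CHARS = "[\\dA-Z().:-]";

    static final Pattern ID = Pattern.compile(
            "(?i)"
            + "(?=" + ID_CHARS + "*\\d)"     // the run must contain a digit
            + "(?=" + ID_CHARS + "*[A-Z])"   // the run must contain a letter
            + ID_CHARS + "{5,}");            // the run must have 5+ characters

    static List<String> findIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(text);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
                + "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
                + "Also not (), A.I., AI, A or a. . ...";
        findIds(testString).forEach(System.out::println);
    }
}
```

Because of (?i), an ordinary lowercase word that happens to contain a digit and is at least five characters long would also be treated as an ID; whether that is acceptable depends on the input text.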

Java regex throws java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence for the letter g

I need to see if a whole word exists in a string. This is how I try to do it:
if(text.matches(".*\\" + word + "\\b.*"))
// do something
It works for most words, but words that start with a g cause an error:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 3
.*\great life\b.*
^
How can I fix this?

The actual reason for the error is that you cannot escape an alphabetical character in a Java regex pattern that does not form a valid escape construct.
See Java regex documentation:
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
I'd use
Matcher m = Pattern.compile("\\b" + word + "\\b").matcher(text);
if (m.find()) {
// A match is found
}
If a word may contain/start/end with special chars, I'd use
Matcher m = Pattern.compile("(?!\\B\\w)" + Pattern.quote(word) + "(?<!\\w\\B)").matcher(text);
if (m.find()) {
// A match is found
}

A \\ followed by any character will be interpreted as a regex escape sequence. E.g. ".*\\geza\\b.*" will try to use the \g escape sequence, ".*\\jani\\b.*" will try to use \j, etc.
Some of these sequences exist, others don't; you can check the Pattern docs for details. What's really troubling is that even the ones that exist probably don't do what you want.
I agree with Thomas Ayoub that you probably need \\b...\\b to find a word. I would go one step further and use Pattern.quote to avoid unintended regex features that might come from word:
String text = "Lorem Ipsum a[asd]a sad";
String word = "a[asd]a";
if (text.matches(".*\\b" + Pattern.quote(word) + "\\b.*")) {
// do something
}
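
The points made in these answers can be checked with a short sketch. It shows that the original pattern fails to compile and that quoting the word, combined with the adaptive boundaries from the earlier answer, finds whole words even when they start or end with special characters. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of any library.
class WholeWordDemo {
    // True if the given regex is rejected by Pattern.compile.
    static boolean failsToCompile(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    // Whole-word search that treats 'word' as a literal. The lookarounds only
    // demand a boundary where the word itself starts or ends with a word character.
    static boolean containsWord(String text, String word) {
        String regex = "(?!\\B\\w)" + Pattern.quote(word) + "(?<!\\w\\B)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // "\\" + "great life" yields the regex \great life, and \g is not a valid escape.
        System.out.println(failsToCompile(".*\\" + "great life" + "\\b.*"));
        System.out.println(containsWord("what a great life it is", "great life"));
        System.out.println(containsWord("Lorem Ipsum a[asd]a sad", "a[asd]a"));
        System.out.println(containsWord("greatness", "great"));
    }
}
```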

Using ".*\\" + word + "\\b.*" with word = great life will generate the string literal ".*\\great life\\b.*", whose value is .*\great life\b.*. The issue is that \g is not a valid escape sequence in a Java regex: the backslash that was meant to start \b ends up escaping the first letter of the word.
You can use the following instead (note the b added after the first \\):
if(text.matches(".*\\b" + word + "\\b.*"))

Regex replace word while preserving spaces/punctuation

I am trying to go through a document and change all instances of a name using regular expressions in Java. My code looks something like this:
Pattern replaceWordPattern = Pattern.compile("(^|\\s)" + replaceWord + "^|\\W");
followed by:
String line = matcher.replaceAll("Alice");
The problem is that this does not preserve the spaces or punctuation or other non-word characters that followed. If I had "Jack jumped" it becomes "Alicejumped". Does anyone know a way to fix this?

\W consumes the space after replaceWord. Replace ^|\\W with the word boundary \\b, which does not consume any characters. Consider doing the same for the first delimiter group, as I suspect you do not want to consume anything there either.
Pattern replaceWordPattern = Pattern.compile("\\b" + replaceWord + "\\b");
If the semantics of word boundaries do not suit you, consider using lookahead and lookbehind constructs, which do not consume input either.
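
A minimal sketch of the word-boundary approach follows. It additionally assumes that the name should be treated as literal text, so it is passed through Pattern.quote, and that the replacement should be inserted verbatim, hence Matcher.quoteReplacement. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for replacing a name without touching its neighbours.
class NameReplaceDemo {
    static String replaceName(String line, String replaceWord, String replacement) {
        // \b is zero-width, so spaces and punctuation around the word stay in the text.
        Pattern replaceWordPattern = Pattern.compile("\\b" + Pattern.quote(replaceWord) + "\\b");
        return replaceWordPattern.matcher(line)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceName("Jack jumped. Then Jackson met Jack!", "Jack", "Alice"));
    }
}
```

In this sketch "Jackson" is left alone because there is no word boundary between "Jack" and "son".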

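You're missing the grouping parentheses on the second delimiter expression, and its anchor should be $ rather than ^:
Pattern replaceWordPattern = Pattern.compile("(^|\\s)" + replaceWord + "($|\\W)");
Both groups still consume the delimiters, so put them back in the replacement, e.g. matcher.replaceAll("$1Alice$2").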

Regex Lookahead and Lookbehinds: followed by this or that

I'm trying to write a regular expression that checks ahead to make sure there is either a whitespace character OR an opening parenthesis after the words I'm searching for.
Also, I want it to look back and make sure the words are preceded by either a non-word character (\W) or nothing at all (i.e. they are at the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.

As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])";
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"

What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
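
To see that the lookarounds leave the surrounding characters out of the match, the following sketch collects the start index of every match of the negative-lookbehind variant. The class name, method name and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the lookaround-based word search.
class LookaroundDemo {
    static List<Integer> matchStarts(String text, String word) {
        // Not preceded by a word character; followed by whitespace or "(".
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])");
        List<Integer> starts = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            // m.group() is exactly the word: lookarounds do not consume input.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "print" inside "sprint(" is skipped because it is preceded by 's'.
        System.out.println(matchStarts("print(x) print y sprint(", "print"));
    }
}
```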

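If you don't want a group to be captured, you can use the non-capturing construct (?:X).
So, in your case (with the stray spaces around | removed, since they would otherwise be matched literally):
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will only have two groups then: group(0) for the whole match and group(1) for the word you are looking for. Note that group(0) still includes the surrounding characters, because non-capturing groups consume input.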

Using regex to remove quote

I saw a good sample, but I cannot adapt it to my problem.
I would like to remove only the " characters that enclose a field from a CSV line like:
" kkl ";"aa bb D";;12 "AA";;"SSS"-;" gg 12";" vv";"sdqs ";
expected result :
kkl ;aa bb D;;12 "AA";;"SSS"-; gg 12; vv;sdqs ;
I use Pattern and Matcher tools

This solution assumes that there is no escaped quote \" in the quoted string
.replaceAll("(?<=^|;)\"([^\"]*?)\"(?=;|$)", "$1")
I assume that you also want to strip off the " in these cases: "sdfkjhksdf", ;;;"dffff"
Another solution uses possessive quantifier, whose effect relies on the assumption that " doesn't appear inside the quoted portion.
.replaceAll("(?<=^|;)(?:\"(.*?)\"){1}+(?=;|$)", "$1")

A small modification to @nhahtdh's regex to keep it from greedily matching past a CSV field boundary:
.replaceAll("(?<=^|;)\"([^;]*)\"(?=;|$)", "$1");
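
As a quick check, the first expression from the answers can be run against the sample line from the question. The sketch below uses a hypothetical class and method name, and the sample is taken as it appears in the question, so any whitespace lost when the question was posted is not reproduced.

```java
// Hypothetical demo class; assumes ';' as separator and no escaped quotes inside fields.
class CsvUnquoteDemo {
    static String stripEnclosingQuotes(String line) {
        // A quote pair is removed only when it spans a whole field, i.e. it is
        // preceded by the line start or ';' and followed by ';' or the line end.
        return line.replaceAll("(?<=^|;)\"([^\"]*?)\"(?=;|$)", "$1");
    }

    public static void main(String[] args) {
        String line = "\" kkl \";\"aa bb D\";;12 \"AA\";;\"SSS\"-;\" gg 12\";\" vv\";\"sdqs \";";
        System.out.println(stripEnclosingQuotes(line));
    }
}
```

Quotes that do not enclose a complete field, as in 12 "AA" and "SSS"-, are left untouched.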
