Regex and lookahead : java

I'm trying to remove punctuation except dots (to keep the sentence structure) from a String with a regex.
Actually, I have no clue how it works; I just wrote this:
public static String removePunctuation(String s){
s = s.replaceAll("(?!.)\\p{Punct}" , " ");
return s;
}
I found that we could use a "negative lookahead" for this kind of problem, but when I run this code, it doesn't erase anything. The negative lookahead seems to cancel out the \p{Punct} part of the regex.

The unescaped dot matches anything (except newlines). You need at least
s = s.replaceAll("(?!\\.)\\p{Punct}" , " ");
but for that sort of thing I'd much rather use a character class (within which the dot is no longer a metacharacter and therefore doesn't need to be escaped):
s = s.replaceAll("[^\\P{Punct}.]" , " ");
Explanation:
[^abc] matches any character that's not an a, b, or c.
[^\P{Punct}] matches any character that's not a non-punctuation character, which is exactly what \p{Punct} matches.
[^\P{Punct}.] therefore matches any character that's a punctuation character except a dot.
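A minimal sketch comparing the two patterns above. The class name, method names, and sample sentence are illustrative and not part of the original question.

```java
final class PunctuationExceptDots {
    // Negative lookahead: match a punctuation character only if it is not a literal dot.
    static String withLookahead(String s) {
        return s.replaceAll("(?!\\.)\\p{Punct}", " ");
    }

    // Negated character class: excludes non-punctuation characters and the dot in one step.
    static String withCharacterClass(String s) {
        return s.replaceAll("[^\\P{Punct}.]", " ");
    }

    public static void main(String[] args) {
        String input = "Hello, world! It works. Really?";
        System.out.println(withLookahead(input));
        System.out.println(withCharacterClass(input));
    }
}
```

Both methods should produce the same result: every punctuation character except the dot is replaced by a space.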

The . character has special meaning in regular expressions. It essentially means 'any character except new lines' (unless the DOTALL flag is specified, in which case it means 'any character'), so your pattern will match 'any punctuation character that is a new line character'; in other words, it never matches anything.
If you want it to mean a literal . character, you need to escape it like this:
s = s.replaceAll("(?!\\.)\\p{Punct}" , " ");
Or wrap it in a character class, like this:
s = s.replaceAll("(?![.])\\p{Punct}" , " ");
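To see the difference, here is a small sketch that runs the original pattern and the character-class variant on the same input. The class name, method names, and test string are made up for illustration.

```java
final class UnescapedDotDemo {
    // Original pattern: the lookahead only succeeds before a line terminator or at
    // the end of input, where \p{Punct} can never match, so nothing is replaced.
    static String original(String s) {
        return s.replaceAll("(?!.)\\p{Punct}", " ");
    }

    // Dot wrapped in a character class: it is now a literal dot.
    static String fixed(String s) {
        return s.replaceAll("(?![.])\\p{Punct}", " ");
    }

    public static void main(String[] args) {
        String input = "a,b.\n!c";
        System.out.println(original(input).equals(input)); // the input comes back unchanged
        System.out.println(fixed(input));
    }
}
```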

Related

How to escape a character in Regex expression in Java

I have a regex expression which removes all non alphanumeric characters. It is working fine for all special characters apart from ^. Below is the regex expression I am using.
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]", "").toUpperCase();
I tried modifying it to
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]\\^", "").toUpperCase();
and
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}\\^]", "").toUpperCase();
But these are also not able to remove this symbol.
Can someone please help me with this.
The first ^ inside [^...] is a negation mark that makes the character class a negated one (matching characters other than what is inside).
The second ^ inside the class is treated as a literal caret, so the negated class excludes it and the caret is never matched. Remove it, and the caret will be matched (and removed) like any other symbol:
"[^\\p{IsAlphabetic}\\p{IsDigit}]"
or even shorter:
"(?U)\\P{Alnum}"
The \P{Alnum} class stands for any character other than an alphanumeric character, where \p{Alnum} is [\p{Alpha}\p{Digit}] (see Java regex reference). When you pass (?U), the UNICODE_CHARACTER_CLASS flag, \p{Alnum} also covers Unicode letters and digits, so \P{Alnum} will not match them. See this IDEONE demo.
Add a + at the end if you want to remove whole chunks of symbols other than \\p{IsAlphabetic} and \\p{IsDigit}.
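A short sketch of the three variants discussed above, applied to a sample string containing a caret. The class name, method names, and sample input are illustrative assumptions.

```java
final class CaretInClassDemo {
    // Pattern from the question: the second ^ is a literal, so carets are kept.
    static String withStrayCaret(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]", "");
    }

    // Stray caret removed: everything that is not a letter or digit is deleted.
    static String fixed(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");
    }

    // Shorter form, with + so that whole runs of symbols are removed per match.
    static String shorter(String s) {
        return s.replaceAll("(?U)\\P{Alnum}+", "");
    }

    public static void main(String[] args) {
        String input = "ab^c-1";
        System.out.println(withStrayCaret(input));
        System.out.println(fixed(input));
        System.out.println(shorter(input));
    }
}
```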
This works as well.
System.out.println("Text 尖酸[刻薄 ^, More _0As text °ÑÑ"".replaceAll("(?U)[^[\\W_]]+", " "));
Output
Text 尖酸 刻薄 More 0As text Ñ Ñ
I'm not sure, but the word class (\w) might be the more comprehensive list of alphanumeric characters.
[\\W_] is a class containing non-word characters and an underscore.
When put into a negated Java class construct,
[^[\\W_]] is a negated class of a union between nothing and
a class containing non-word characters and an underscore.

Regex Lookahead and Lookbehinds: followed by this or that

I'm trying to write a regular expression that checks ahead to make sure there is either a white space character OR an opening parentheses after the words I'm searching for.
Also, I want it to look back and make sure it is preceded by either a non-Word (\W) or nothing at all (i.e. it is the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.
As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])";
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
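A minimal sketch of the negative-lookbehind variant, collecting the start index of every match. The class name, the method findStarts, and the sample text are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundWordDemo {
    // Finds the word when it is not preceded by a word character and is
    // followed by whitespace or "(". Only the word itself is part of the match.
    static List<Integer> findStarts(String word, String text) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])");
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "reprint" and "printer" should be skipped.
        System.out.println(findStarts("print", "print(x); reprint (y); print z; printer"));
    }
}
```

Because lookarounds are zero-width, m.group() here is just the word, which is what the question asked for.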
If you don't want a group to be captured by the matching, you can use the non-capturing construct (?:X).
So, in your case:
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will only have two groups then: group(0) for the whole match (which still includes the surrounding characters) and group(1) for the word you are looking for.
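The following sketch illustrates the difference between the whole match and group 1 when non-capturing groups are used. The class and method names are hypothetical, and the pattern is written without spaces around the alternation so that it does not require literal spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingGroupDemo {
    // Returns {whole match, group 1}, or null if the pattern is not found.
    static String[] groups(String word, String text) {
        Pattern p = Pattern.compile("(?:\\W?)(" + Pattern.quote(word) + ")(?:\\s|\\()");
        Matcher m = p.matcher(text);
        return m.find() ? new String[] { m.group(), m.group(1) } : null;
    }

    public static void main(String[] args) {
        String[] g = groups("foo", "x = foo(1)");
        System.out.println("[" + g[0] + "]"); // whole match, including neighbours
        System.out.println("[" + g[1] + "]"); // only the word
    }
}
```

Non-capturing groups still consume the surrounding characters; they only avoid creating extra group numbers. If the match itself must be the bare word, lookarounds are the better fit.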

A regex that doesn't match with this character sequence

Here is my Regex, I am trying to search all special characters so that I can escape them.
(\(|\)|\[|\]|\{|\}|\?|\+|\\|\.|\$|\^|\*|\||\!|\&|\-|\#|\#|\%|\_|\"|\:|\<|\>|\/|\;|\'|\`|\~)
My problem is that I don't want to escape some special characters when they come in a sequence
like this (.*)
So, let's consider an example.
String message = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (,*) &$#%#*(....))(((";
After escaping according to the current regex, what I get is:
Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , \(,\*\) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\(
But I don't want to escape this part (.*); I want to keep it as it is.
My regex above is only used for searching, so I just don't want it to match this part (.*), and my problem will be solved.
Can anyone suggest a regex that doesn't escape that part of the string?
See nhahtdh's answer for how to do this with a regex.
As an alternative, here is a solution which does not use a regex, using Guava's CharMatcher instead:
private static final CharMatcher SPECIAL
= CharMatcher.anyOf("allspecialcharshere");
private static final String NO_ESCAPE = "(.*)";
public String doEncode(String input)
{
StringBuilder sb = new StringBuilder(input.length());
String tmp = input;
while (!tmp.isEmpty()) {
if (tmp.startsWith(NO_ESCAPE)) {
sb.append(NO_ESCAPE);
tmp = tmp.substring(NO_ESCAPE.length());
continue;
}
char c = tmp.charAt(0);
if (SPECIAL.matches(c))
sb.append('\\');
sb.append(c);
tmp = tmp.substring(1);
}
return sb.toString();
}
This answer is to demonstrate the possibility only. Using it in production code is questionable.
It is possible with Java String replaceAll function:
String input = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$#%#*(....))(((";
String output = input.replaceAll("\\G((?:[^()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-]|\\Q(.*)\\E)*+)([()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-])", "$1\\\\$2");
Result:
"Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , (.*) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\("
Another test:
String input = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*) sdf (.*)";
Result:
"(.*) sdfHi test message \<\> \>\>\>\>\>\<\<\<\<f\<f\<,,,,\<\> \<\>(.*) sdf (.*) sdf (.*)"
Explanation
Raw regex:
\G((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+)([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-])
Note that \ is escaped once more when the regex is specified inside the string, and " needs to be escaped. The resulting regex in string can be seen above.
Raw replacement string:
$1\\$2
Since $ has special meaning in replacement string, and you want to keep it for $2, you need to escape the \ so that \ won't escape the $. And putting the replacement string in quoted string, you need to double up the number of \ to escape the \.
Before we dissect the monster, let's talk about the idea. We consume non-special characters and the sequence that we don't want to replace, as many times as possible. The next character will then either be a special character that is not part of the sequence we don't want to replace, or the end of the string (which means that we have found all the characters that need replacing, if any).
Naturally, we can think of any arbitrary string as consisting of many consecutive repetitions of the following pattern: [0 or more (non-special character or special pattern not to be replaced)][special character], with the string ending in [0 or more (non-special character or special pattern not to be replaced)].
The replaceAll function, when used with a regex without \G, may find matches that are not consecutive, which can cut into the middle of the sequence not to be replaced and mess it up. \G means the boundary of the last match, and can be used to make sure the next match starts from where the last match left off.
\G: Starts from last match
((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+): Capture 0 or more of the non-special characters or the special pattern not to be replaced. Note that I have added the possessive quantifier + after *. This will prevent the engine from backtracking when it cannot find the special character that we specify after this.
[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]: Negated character class of special characters.
\Q(.*)\E: Special sequence (.*) not to be replaced, literal quoted by \Q and \E.
([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]): Capture the single special character.
The whole regex will match string with minimum length of 1 (the special character). The first capturing group contains the parts that shouldn't be replaced, and the 2nd capturing group contains the special character that should be replaced.
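The same technique can be tried on a smaller scale. The sketch below assumes a reduced set of special characters, ( ) . * and !, purely to keep the pattern readable; the class name, method name, and sample input are illustrative.

```java
final class EscapeExceptSequenceDemo {
    // \G anchors each match at the end of the previous one.
    // Group 1: possessively consumes non-special characters and the literal "(.*)".
    // Group 2: the single special character that needs a backslash.
    // The special set here is deliberately reduced to ( ) . * !
    private static final String REGEX =
            "\\G((?:[^().*!]|\\Q(.*)\\E)*+)([().*!])";

    static String escape(String input) {
        // Raw replacement is $1\\$2: keep group 1, add a backslash, keep group 2.
        return input.replaceAll(REGEX, "$1\\\\$2");
    }

    public static void main(String[] args) {
        System.out.println(escape("Hi! (.*) a.b (x)"));
    }
}
```

Text after the last special character is never matched, so replaceAll simply appends it unchanged.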

Match word in String in Java

I'm trying to match Strings that contain the word "#SP" (sans quotes, case insensitive) in Java. However, I'm finding using Regexes very difficult!
Strings I need to match:
"This is a sample #sp string",
"#SP string text...",
"String text #Sp"
Strings I do not want to match:
"Anything with #Spider",
"#Spin #Spoon #SPORK"
Here's what I have so far: http://ideone.com/B7hHkR .Could someone guide me through building my regexp?
I've also tried: "\\w*\\s*#sp\\w*\\s*" to no avail.
Edit: Here's the code from IDEone:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("\\b#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
(edit: positive lookbehind not needed, only matching is done, not replacement)
You are yet another victim of Java's misnamed regex matching methods.
.matches() unfortunately tries to match the whole input, which contradicts the usual definition of "regex matching" (a regex can match anywhere in the input). The method you need to use is .find().
This is a braindead API, and unfortunately Java is not the only language having such misguided method names. Python also pleads guilty.
Also, \\b only matches at word boundaries, and # is not a word character, so there is no word boundary between a space and #. You need to use an alternation detecting either the beginning of input or a whitespace character.
Your code would need to look like this (non fully qualified classes):
Pattern p = Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
You're doing fine, but the \b in front of the # is misleading. \b is a word boundary, but # is already not a word character (i.e. it isn't in the set [0-9A-Za-z_]). Therefore, the space before the # isn't considered a word boundary. Change to:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("(^|\\s)#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
The (^|\s) means: match either ^ OR \s, where ^ means the beginning of your string (e.g. "#SP String"), and \s means a whitespace character.
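As a quick check, this sketch runs the corrected pattern against the sample strings from the question. The class name and the helper containsTag are hypothetical.

```java
import java.util.regex.Pattern;

final class HashTagDemo {
    // Start of input or whitespace, then #SP, then a word boundary.
    private static final Pattern SP =
            Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);

    static boolean containsTag(String s) {
        return SP.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "This is a sample #sp string",
            "#SP string text...",
            "String text #Sp",
            "Anything with #Spider",
            "#Spin #Spoon #SPORK"
        };
        for (String s : samples) {
            System.out.println(containsTag(s) + " <- " + s);
        }
    }
}
```

The first three samples should be reported as matches and the last two should not, because the trailing \b fails when another word character follows "SP".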
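The regular expression "\\w*\\s*#sp\\w*\\s*" will match 0 or more word characters, followed by 0 or more whitespace characters, followed by #sp, followed by 0 or more word characters, followed by 0 or more whitespace characters; the \w* after #sp lets it accept longer words such as #spider. My suggestion is to not use \s* to break words up in your expression; instead, use \b after the tag. Because # is not a word character, \b does not help in front of it, so check for the start of input or whitespace there:
"(^|\\s)#sp\\b"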

How can I remove all leading and trailing punctuation?

I want to remove all the leading and trailing punctuation in a string. How can I do this?
Basically, I want to preserve punctuation in between words, and I need to remove all leading and trailing punctuation.
., #, _, &, /, - are allowed if surrounded by letters
or digits
\' is allowed if preceded by a letter or digit
I tried
Pattern p = Pattern.compile("(^\\p{Punct})|(\\p{Punct}$)");
Matcher m = p.matcher(term);
boolean a = m.find();
if(a)
term=term.replaceAll("(^\\p{Punct})", "");
but it didn't work!!
OK. So basically you want to find some pattern in your string and act if the pattern is matched.
Doing this the naive way would be tedious. The naive solution could involve something like
while (!myString.isEmpty() && ".,;".indexOf(myString.charAt(0)) >= 0) // ...plus every other punctuation character
    myString = myString.substring(1);
If you wanted to do a slightly more complex task, it could even be impossible to do it the way I mentioned.
That's why we use regular expressions. It's a "language" with which you can define a pattern; the computer can then tell whether a string matches that pattern. To learn about regular expressions, just search for them on Google. One of the first links: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
As for your problem, you could try this:
myString.replaceFirst("^[^a-zA-Z]+", "")
The meaning of the regex:
The first ^ means that in this pattern, what comes next has to be at
the start of the string.
The [] define a set of chars. In this case, those are things that are NOT
(the second ^) letters (a-zA-Z), so digits are removed as well.
The + sign means that the thing before it must occur one or more times
in a row.
You can use a similar regex to remove trailing chars.
myString.replaceAll("[^a-zA-Z]+$", "");
the $ means "at the end of the string"
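The two anchored patterns can also be combined into a single call. The sketch below is one possible variant, assuming \p{Punct} (ASCII punctuation) is the set to strip instead of [^a-zA-Z], so that leading and trailing digits survive; the class and method names are hypothetical.

```java
final class TrimPunctuationDemo {
    // Removes a run of punctuation at the start and a run at the end;
    // punctuation between words is left alone.
    static String trim(String s) {
        return s.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trim("...hello, world!!"));
        System.out.println(trim("--a.b--"));
    }
}
```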
You could use a regular expression:
private static final Pattern PATTERN =
Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
public static String trimPunctuation(String s) {
Matcher m = PATTERN.matcher(s);
// Fall back to the input if nothing matched (e.g. when it contains line breaks).
return m.find() ? m.group(1) : s;
}
The boundary matchers ^ and $ ensure the whole input is matched.
A dot . matches any single character.
A star * means "match the preceding thing zero or more times".
The parentheses () define a capturing group whose value is retrieved by calling Matcher.group(1).
The ? in (.*?) means you want the match to be non-greedy, otherwise the trailing punctuation would be included in the group.
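To make the effect of the non-greedy quantifier concrete, this sketch compares the pattern above with a greedy variant. The class name, method names, and sample input are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctantDemo {
    // Returns group 1 of the first match, or null if there is none.
    static String capture(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // (.*?) stops as early as possible, leaving trailing punctuation outside the group.
    static String reluctant(String s) {
        return capture("^\\p{Punct}*(.*?)\\p{Punct}*$", s);
    }

    // (.*) grabs everything up to the end, so trailing punctuation stays in the group.
    static String greedy(String s) {
        return capture("^\\p{Punct}*(.*)\\p{Punct}*$", s);
    }

    public static void main(String[] args) {
        System.out.println(reluctant("!!hi there??"));
        System.out.println(greedy("!!hi there??"));
    }
}
```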
Use this tutorial on patterns. You have to create a regex that matches a string starting with a letter or digit and ending with a letter or digit, and then call inputString.matches("regex").
