Java regular expression to match _all_ whitespace characters - java

I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some; it does not match &nbsp; and similar non-ASCII whitespace. I'm looking for a regular expression which matches all (common) white-space characters which can occur in a Java String.
[Edit]
To clarify: I do not mean the string sequence "&nbsp;"; I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar white-space meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), word joiner (encoded in Unicode 3.2 and above as U+2060), "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) and any other character that can be regarded as white-space.
[Answer]
For my purpose, i.e. catching all whitespace characters, Unicode + traditional, the following expression does the job:
[\p{Z}\s]
The answer is in the comments below but since it is a bit hidden I repeat it here.
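A minimal sketch of the expression above in use. The class and method names are made up for illustration. Note that, as far as the Unicode general categories go, U+2060 and U+FEFF are format characters (Cf) rather than separators (Z), so this expression leaves them alone; they would have to be listed explicitly if you want them removed too.

```java
import java.util.regex.Pattern;

final class AllWhitespaceDemo {
    // \p{Z} covers the Unicode separator categories (Zs, Zl, Zp);
    // \s adds the ASCII whitespace controls such as tab and newline.
    static final Pattern ANY_SPACE = Pattern.compile("[\\p{Z}\\s]");

    // Removes every character matched by ANY_SPACE.
    static String strip(String input) {
        return ANY_SPACE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // no-break space, narrow no-break space, tab and a plain space
        String sample = "a\u00A0b\u202Fc\td e";
        System.out.println(strip(sample)); // prints "abcde"
    }
}
```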

&nbsp; is not a whitespace character, as far as regexps are concerned. You need to either modify the regexp to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or first parse the string contents to get the ASCII or Unicode representation of the data.
You are mixing abstraction levels here.
If, as a careful reread of the question suggests, you are after a way to match all whitespace characters, meaning standard ASCII plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the work.
You should really clarify your question because it has misled a lot of people (the correct answer even picked up some downvotes).

You clarified the question the way I expected: you're actually not looking for the String literal &nbsp;, as many here seem to think, for which the solution is too obvious.
Well, unfortunately, there's no way to match them using regex. Best is to include the particular codepoints in the pattern, for example: "[\\s\\xA0]".
Edit: as it turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, can you please leave a comment on how you found that out? This one is quite useful.

The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.

In case anyone runs into this question again looking for help, I suggest pursuing the following answer: https://stackoverflow.com/a/6255512/1678392
The short version: \\p{javaSpaceChar}
Why: Per the Pattern class, this maps the Character.isSpaceChar method:
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
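A small sketch of what that mapping means in practice, assuming only the JDK. The class and helper names are invented for this example. It contrasts \p{javaSpaceChar} (Character.isSpaceChar) with \p{javaWhitespace} (Character.isWhitespace), since the two disagree on exactly the characters the question is about.

```java
import java.util.regex.Pattern;

final class JavaSpaceCharDemo {
    // Backed by Character.isSpaceChar: Unicode space, line and paragraph separators.
    static final Pattern SPACE_CHAR = Pattern.compile("\\p{javaSpaceChar}");
    // Backed by Character.isWhitespace: ASCII controls plus the breaking Unicode spaces.
    static final Pattern WHITESPACE = Pattern.compile("\\p{javaWhitespace}");

    static boolean isSpaceChar(String s) {
        return SPACE_CHAR.matcher(s).matches();
    }

    static boolean isWhitespace(String s) {
        return WHITESPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSpaceChar("\u00A0"));  // true: no-break space is a space separator
        System.out.println(isWhitespace("\u00A0")); // false: non-breaking spaces are excluded
        System.out.println(isSpaceChar("\t"));      // false: tab is a control character
        System.out.println(isWhitespace("\t"));     // true
    }
}
```

If both sets matter, combining them in one class, as in [\p{javaSpaceChar}\p{javaWhitespace}], is one option.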

Click here for a summary I made of several competing definitions of "whitespace".
You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.

&nbsp; is not white space. It is a character encoding sequence that represents whitespace in HTML. You most likely want to convert HTML encoded text into plain text before running your string match against it. If that is the case, go look up
javax.swing.text.html

The regex characters are the only ones independent of encoding. Here is a list of some characters which - in Unicode - are non-printing:
How many non-printing characters are in common use?

Related

java 8 regular expression for meta characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
Trying to write a regular expression to check if the sentence has metacharacters: "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence I need to identify if metacharacters are present.
Should it be \\\\$ or ^(\\\\$)\\$? What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters? I don't need to replace them, just identify whether the sentence contains these characters.
If you want to know whether a string contains meta characters, you can use something like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly; second, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c; third, matches checks whether the entire string matches the pattern, rather than whether it contains an occurrence, so you’d need something like .*pattern.* to make matches behave like find, if you insist on it.
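The third obstacle is easy to see in a short sketch. The class and method names below are illustrative only, and the character class is deliberately limited to the two metacharacters that occur in the question's sentence.

```java
import java.util.regex.Pattern;

final class MatchesVsFindDemo {
    // Inside a character class, $ and | are plain characters and need no quoting.
    static final Pattern DOLLAR_OR_PIPE = Pattern.compile("[$|]");

    // True only if the whole input is a single $ or | character.
    static boolean wholeStringMatches(String text) {
        return DOLLAR_OR_PIPE.matcher(text).matches();
    }

    // True if a $ or | occurs anywhere in the input.
    static boolean containsMatch(String text) {
        return DOLLAR_OR_PIPE.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "I need to make payment of $50 for the purchase, should i use CASH|CC";
        System.out.println(wholeStringMatches(text));      // false
        System.out.println(containsMatch(text));           // true
        System.out.println(text.matches(".*[$|].*"));      // true, matches forced to act like find
    }
}
```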
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
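Both options can be sketched roughly as follows. The class name, the helper names and the "Note:" prefix are hypothetical and only serve the example.

```java
import java.util.regex.Pattern;

final class LiteralSearchDemo {
    // Treats every character of needle as a plain character.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(needle, Pattern.LITERAL).matcher(haystack).find();
    }

    // Embeds the needle verbatim in a larger pattern (hypothetical "Note:" prefix).
    static Pattern noteFollowedBy(String needle) {
        return Pattern.compile("Note:\\s*" + Pattern.quote(needle));
    }

    public static void main(String[] args) {
        // Interpreted as a regex, "CASH|CC" is an alternation and finds "CC".
        System.out.println(Pattern.compile("CASH|CC").matcher("pay by CC").find()); // true
        // Interpreted literally, the text "CASH|CC" does not occur.
        System.out.println(containsLiteral("pay by CC", "CASH|CC"));                // false
        System.out.println(noteFollowedBy("$50").matcher("Note: $50 due").find());  // true
    }
}
```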
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort; optimizing before you even have actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.
Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions sometimes depend on a syntactical subdefinition, like character classes or specific sequences (lookahead, lookbehind, ...), and are therefore not intrinsically metacharacters per se. Some are only meta characters in a specific context. However, the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. This means that characters will sometimes be matched in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for its own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?<name>X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. The last two rows of the citation I had to leave out, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
    // this is the found meta character...
    String metaCharacter = metaDetector.group(0);
    System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.
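The "while instead of if" variant described above could look roughly like this. The class and method names are chosen for the example; the pattern is the longer one suggested earlier in this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharScanDemo {
    static final Pattern META_CHARS =
            Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");

    // Collects every character matched by META_CHARS, in order of appearance.
    static String findAll(String text) {
        StringBuilder found = new StringBuilder();
        Matcher metaDetector = META_CHARS.matcher(text);
        while (metaDetector.find()) {
            found.append(metaDetector.group());
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String line = "I need to make \\payment{[ of $50 for !!the purc\"hase, "
                + "sh###ould i use CASH|CC.";
        System.out.println(findAll(line)); // prints \{[$!!,|.
    }
}
```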

Pattern matching with special character "-" hyphen isn't working as expected .? [duplicate]

How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters ?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
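The same rules hold in Java's java.util.regex, which the following sketch uses to check each case. The class and method names are invented for illustration, and the last pattern is the one from the question with an unescaped hyphen appended at the end.

```java
final class HyphenClassDemo {
    // The question's class with a trailing, unescaped hyphen added.
    static final String WITH_HYPHEN = "[a-zA-Z0-9!$* \t\r\n-]+";

    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("[-]", "-"));        // true
        System.out.println(matches("[abc-]", "-"));     // true: trailing hyphen is literal
        System.out.println(matches("[-abc]", "-"));     // true: leading hyphen is literal
        System.out.println(matches("[ab-d]", "c"));     // true: b-d is a range
        System.out.println(matches("[ab-d]", "-"));     // false: the hyphen only denotes the range
        System.out.println(matches(WITH_HYPHEN, "a-b $1")); // true
    }
}
```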
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
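To illustrate the difference, here is a sketch in Java, whose regex engine supports the same three properties. The answer itself was written with C♯ in mind, so treat this as an analogous example; the class and method names are made up.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    // Letters, decimal digits, dash punctuation, plus the three listed symbols.
    static final Pattern ALLOWED = Pattern.compile("[\\p{L}\\p{Nd}\\p{Pd}!$*]");

    static boolean isAllowed(String singleChar) {
        return ALLOWED.matcher(singleChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("\u00E9")); // true: e with acute accent is a letter
        System.out.println(isAllowed("\u0663")); // true: Arabic-Indic digit three
        System.out.println(isAllowed("\u2013")); // true: en dash is dash punctuation
        System.out.println(isAllowed("-"));      // true: hyphen-minus is dash punctuation too
        System.out.println(isAllowed("_"));      // false: connector punctuation
    }
}
```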
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all the same. A hyphen between two ranges is considered a literal symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" (without the quotes) to match any type of hyphen. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

Java - Regex Replace All will not replace matched text

Trying to remove a lot of unicodes from a string but having issues with regex in java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code will execute but the text will not be replaced. From looking at other solutions people seem to use only two "\\" but anything less than 4 throws me the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.
Assuming that you would like to replace these "special" characters with an empty String: as I see it, \u2605 and \u2122 fall outside the POSIX character class \p{Print} (printable ASCII). That's why we can replace every character that is not in that class with "". Then the result is what you expect.
Sample would be:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.
In Java, something like your \u2605 is not a literal sequence of six characters, it represents a single unicode character — therefore your pattern "\\\\u[0-9]{4}" will not match it.
Your pattern describes a literal character \ followed by the character u followed by exactly four numeric characters 0 through 9 but what is in your string is the single character from the unicode code point 2605, the "Black Star" character.
This is just as other escape sequences: in the string "some\tmore" there is no character \ and there is no character t ... there is only the single character 0x09, a tab character — because it is an escape sequence known to Java (and other languages) it gets replaced by the character that it represents and the literal \ t are no longer characters in the string.
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed, or you could list the characters you want (if that is a very limited set) and remove the complement of those, such as mystring.replaceAll("[^A-Za-z0-9]", "");
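The distinction between the two situations can be sketched like this. The class and method names are illustrative, and the hex-digit class [0-9a-fA-F]{4} is an assumption on my part (the question used [0-9]+, which would miss escapes containing the letters a to f).

```java
final class UnicodeEscapeDemo {
    // Drops everything that is not printable ASCII (the approach from the other answer).
    static String stripNonPrintableAscii(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    // Drops six-character sequences: backslash, 'u', four hex digits.
    static String stripLiteralEscapes(String s) {
        return s.replaceAll("\\\\u[0-9a-fA-F]{4}", "");
    }

    public static void main(String[] args) {
        // The compiler turns each escape into one character (star, trademark sign).
        String decoded = "\u2605 StatTrak\u2122 Shadow Daggers";
        // Doubled backslashes keep the escapes as literal text, e.g. as read from a file.
        String raw = "\\u2605 StatTrak\\u2122 Shadow Daggers";

        System.out.println(stripNonPrintableAscii(decoded).trim()); // StatTrak Shadow Daggers
        System.out.println(stripLiteralEscapes(raw).trim());        // StatTrak Shadow Daggers
        // The literal-escape pattern finds nothing to remove in the decoded string.
        System.out.println(stripLiteralEscapes(decoded).equals(decoded)); // true
    }
}
```

Both results keep the leading space from the original text, hence the trim() calls.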
I'm an idiot. I was calling the replaceAll on the string but not assigning it as I thought it altered the string anyway.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.

What's the difference between BREAKING_WHITESPACE and WHITESPACE in Guava CharMatcher

In Guava's CharMatcher there are two matchers, BREAKING_WHITESPACE and WHITESPACE, and the definition of BREAKING_WHITESPACE is: a whitespace which can be interpreted as a break between words for formatting purposes
What does it mean?
Can anyone answer this question ?
If you can provide an example for the diff it would be nice
Thx in advance
This is fairly general character stuff, not Guava specific:
http://en.wikipedia.org/wiki/Non-breaking_space#Non-breaking_behavior
Actually the Javadoc says it all:
Determines whether a character is a breaking whitespace (that is, a whitespace which can be interpreted as a break between words for formatting purposes).
You put a non-breaking space in text between two words which must stay on the same line. Often it gets used for numbers with units like 123.456 MPa. In HTML you'd write &nbsp;, ever seen it?
Out of all whitespace chars, there are some non-breaking ones, e.g. U+202F and U+00A0.
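Guava aside, the JDK draws a similar line, which makes for a dependency-free illustration: Character.isSpaceChar accepts all Unicode space separators, while Character.isWhitespace excludes the non-breaking ones. The sketch below uses only these JDK methods; it is not Guava's implementation, the exact character sets may differ from CharMatcher's, and the class and method names are invented.

```java
final class BreakingSpaceDemo {
    // A space separator that the JDK does not count as (breaking) whitespace.
    static boolean isNonBreakingSpace(char c) {
        return Character.isSpaceChar(c) && !Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(isNonBreakingSpace('\u00A0')); // true: no-break space
        System.out.println(isNonBreakingSpace('\u202F')); // true: narrow no-break space
        System.out.println(isNonBreakingSpace(' '));      // false: ordinary space may break
        System.out.println(isNonBreakingSpace('\t'));     // false: tab is not a space separator
    }
}
```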

Compare non-English characters

I have a problem when I try to compare 'Đ' and 'D': I need the comparison to return true, but with the English Locale it returns false, because 'Đ' cannot be replaced with the regex:
"\\p{Block=CombiningDiacriticalMarks}+".
The character class [\u0110D] will match either a Đ or a D (Đ is code point U+0110).
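A sketch of why the diacritics-stripping regex does not help here and how the character class can fill the gap. It assumes the usual approach of NFD normalization followed by removing combining marks (written below with the \p{InCombiningDiacriticalMarks} block syntax); the class and method names are hypothetical. Đ has no canonical decomposition, so normalization leaves it untouched.

```java
import java.text.Normalizer;

final class StrokeDDemo {
    // Decomposes accented letters and removes the combining marks.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Additionally folds capital D with stroke (U+0110) to a plain D.
    static String foldStrokeD(String s) {
        return stripMarks(s).replaceAll("[\u0110D]", "D");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("\u00FCber"));            // uber
        System.out.println(stripMarks("\u0110").equals("D"));   // false: the stroke is not a combining mark
        System.out.println(foldStrokeD("\u0110").equals("D"));  // true
    }
}
```

The lowercase đ (U+0111) would need the same treatment if it can occur in your data.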
Matching Non-English Characters, a Primer
One common example is the word 'über-geek'. How do I match that word whether or not there is an umlaut above the u? Simple: [\u00FCu]ber-geek will match either 'über-geek' or 'uber-geek'.
Depending on your regular expression engine, there are multiple great ways to match locale-specific characters. Buy a book on your specific implementation to discover its wrinkles. By the way, you can find an excellent resource for Unicode-specific regex information at Regular-Expressions.info's Unicode page.
What if I want to match any character? If you have access to the \X character class, it will act as a . in a Unicode context. This means that multiple Unicode code points which combine to form one grapheme will register as one 'character' to the engine.
NOTE: I'm not trying to 'steal' an answer to this one, and I'll delete mine if Ted Hopp moves his out of the comments. I just wanted to make sure that people looking for non-English regex matches can see that this question did indeed get answered.
