Remove escaped unicode string in java with regex - java

I have string like below
"them coming \nLove it \ud83d\ude00"
I want to remove this character "\ud83d\ude00". so it will be
"them coming \nLove it "
How can I achieve this in Java? I have tried code like the one below, but it doesn't work:
payload.toString().replaceAll("\\\\u\\b{4}.", "")
Thanks :)

I think \\\\u\\b{4}. will not work, because in a Java string literal \ud83d is compiled into a single character, not the literal text "backslash, u, d83d", so the regex has nothing to match. To remove this kind of unwanted Unicode character, it is easier to define the characters you accept (do not want to replace), for example all ASCII characters, and match everything else (what you want to replace). Try:
[^\x00-\x7F]+
The range \x00-\x7F covers the Unicode Basic Latin block (ASCII).
String str = "them coming \nLove it \ud83d\ude00";
System.out.println(str.replaceAll("[^\\x00-\\x7F]+", ""));
This prints:
them coming
Love it
However, you will have a problem if the text contains national characters or any other non-ASCII symbols (ś, ą, ♉, ☹, etc.), because they are removed as well.
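If the goal is only to drop emoji such as \ud83d\ude00 while keeping national characters, one possible alternative (a sketch, not part of the answer above) is to filter by code point instead of by ASCII range. The class and method names here are made up for illustration. It removes only supplementary characters (above U+FFFF), so symbols inside the Basic Multilingual Plane such as ☹ stay in the string.

```java
final class SupplementaryFilter {
    // Drops every code point above U+FFFF (most emoji live there) and
    // keeps everything in the Basic Multilingual Plane, including
    // national characters and line breaks.
    static String removeSupplementary(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> !Character.isSupplementaryCodePoint(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input: the string from the question plus one national character.
        String str = "them coming \nLove it \ud83d\ude00 \u015b";
        System.out.println(removeSupplementary(str));
    }
}
```

Working on code points rather than on char values means a surrogate pair is treated as one character, so it is never split in half.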

Related

How to eliminate the special character ^\^# from String in Java

My application is reading a file which contains following data:
MENS HEALTH^\^# P
while actual text should be
MENS HEALTH P
I have already replaced '\u0000', but "^\" still remains in the string. I am not sure what the code for these characters is, so I cannot replace them.
When I open the file in the IntelliJ editor, it is displayed as an FS symbol.
Please suggest how I can eliminate this.
Thanks,
Rather than worry about what characters the junk consists of, remove everything that isn't what you want to keep:
str = str.replaceAll("[^\\w ]+", "");
This deletes any characters that are not word characters or spaces.
You can use a regular expression with String.replaceAll() to replace these characters.
Note that the backslash has a special meaning and needs to be escaped (with another backslash).
"my\\^#String".replaceAll("[\\\\^#]", "");
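Since the editor shows the character as FS, it is probably the ASCII file separator control character (U+001C), which many tools print as ^\. Assuming that is the case, a minimal sketch is to remove all ASCII control characters with the POSIX class \p{Cntrl}. The sample string and the class name are hypothetical; the real file contents may differ.

```java
final class ControlCharStripper {
    // Removes ASCII control characters (U+0000 to U+001F and U+007F),
    // which includes NUL and the FS (file separator) character.
    // Tabs and line breaks are control characters too, so they are removed as well.
    static String stripControls(String input) {
        return input.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample: FS followed by NUL between the words.
        String raw = "MENS HEALTH\u001c\u0000 P";
        System.out.println(stripControls(raw));
    }
}
```

If tabs or line breaks must survive, the character class would need to exclude them explicitly.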

Java - Regex Replace All will not replace matched text

I am trying to remove a lot of Unicode characters from a string but am having issues with regex in Java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code will execute but the text will not be replaced. From looking at other solutions people seem to use only two "\\" but anything less than 4 throws me the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.
I assume you would like to replace these "special" characters with an empty String. \u2605 and \u2122 are not in the POSIX character class \p{Print}, which in Java covers printable ASCII only by default. So we can replace every character outside that class with "", and the result matches your expectation apart from a leading space.
For example:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.
In Java, something like your \u2605 is not a literal sequence of six characters, it represents a single unicode character — therefore your pattern "\\\\u[0-9]{4}" will not match it.
Your pattern describes a literal character \ followed by the character u followed by exactly four numeric characters 0 through 9 but what is in your string is the single character from the unicode code point 2605, the "Black Star" character.
This is just as other escape sequences: in the string "some\tmore" there is no character \ and there is no character t ... there is only the single character 0x09, a tab character — because it is an escape sequence known to Java (and other languages) it gets replaced by the character that it represents and the literal \ t are no longer characters in the string.
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed, or you could list the characters you want (if that is a very limited set) and remove the complement of those, such as mystring.replaceAll("[^A-Za-z0-9]", "");
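The difference between an escape that the compiler resolves and the same six characters as literal text can be checked with a small sketch like the one below. The class and method names are invented for illustration, and the pattern uses four hex digits instead of the question's [0-9]+, which is an assumption about what should count as an escape.

```java
final class EscapeVersusLiteral {
    // Removes the literal text "backslash, u, four hex digits" from a string.
    static String stripLiteralEscapes(String s) {
        return s.replaceAll("\\\\u[0-9a-fA-F]{4}", "");
    }

    public static void main(String[] args) {
        // The compiler turns each escape below into a single character.
        String compiled = "\u2605 StatTrak\u2122 Shadow Daggers";
        // Doubling the backslash keeps the six-character text in the string.
        String literal = "\\u2605 StatTrak\\u2122 Shadow Daggers";

        // Nothing to match: the compiled string holds no backslash at all.
        System.out.println(stripLiteralEscapes(compiled).equals(compiled)); // true
        // Here the pattern finds both literal escapes and removes them.
        System.out.println(stripLiteralEscapes(literal)); // " StatTrak Shadow Daggers"
    }
}
```

Text read from a file or a JSON payload usually arrives in the second form, which is why the same pattern can work on real input and fail on a hard-coded test string.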
I'm an idiot. I was calling the replaceAll on the string but not assigning it as I thought it altered the string anyway.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.

Java Replace Unicode Characters in a String

I have a string which contains multiple unicode characters. I want to identify all these unicode characters, ex: \ uF06C, and replace each one with a backslash and four hex digits without the "u" in it.
Example:
Source String: "add \uF06Cd1 Clause"
Result String: "add \F06Cd1 Clause"
How can I achieve this in Java?
Edit:
Question in link Java Regex - How to replace a pattern or how to is different from this as my question deals with unicode character. Though it has multiple literals, it is considered as one single character by jvm and hence regex won't work.
The correct way to do this is using a regex to match the entire unicode definition and use group-replacement.
The regex to match the unicode-string:
A unicode-character looks like \uABCD, so \u, followed by a 4-character hexnumber string. Matching these can be done using
\\u[A-Fa-f\d]{4}
But there's a problem with this:
In a String like "just some \\uabcd arbitrary text" the \u would still get matched, although its backslash is itself escaped. So we need to make sure the \u is preceded by an even number of \s:
(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
Now as an output, we want a backslash followed by the hexnum-part. This can be done by group-replacement, so let's start by grouping characters. The backslash pairs go into one capturing group wrapped around a repeated non-capturing group, because a repeated capturing group would keep only its last repetition:
(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})
As a replacement we want all backslashes from the group that matches the backslash pairs, followed by a backslash and the hexnum-part of the unicode-literal:
$1\\$3
Now for the actual code:
// needs java.util.regex.Matcher and java.util.regex.Pattern; test is the input string
String pattern = "(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";
Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);
That's a lot of backslashes! Well, there's an issue with Java, regex and backslashes: backslashes need to be escaped both in Java and in regex. So "\\\\" as a pattern string in Java matches one \ as a regex-matched character.
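For reference, here is a self-contained sketch of the same idea that can be compiled and run. It is a variant rather than the exact code above: the run of escaped backslashes is captured in a single group, so the hex part becomes group 2, and the class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeRewriter {
    // Group 1: any run of escaped backslash pairs before the escape.
    // Group 2: the four hex digits of the escape.
    private static final Pattern ESCAPE =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)\\\\u([A-Fa-f\\d]{4})");

    // Rewrites the literal escape text to a backslash plus the hex digits,
    // leaving out the letter u.
    static String dropU(String text) {
        Matcher m = ESCAPE.matcher(text);
        return m.replaceAll("$1\\\\$2");
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as literal text in the string.
        String source = "add \\uF06Cd1 Clause";
        System.out.println(dropU(source));
    }
}
```

The lookbehind plus the paired-backslash group is what keeps an already escaped backslash followed by u from being rewritten.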
EDIT:
On actual strings, the characters need to be filtered out and be replaced by their integer-representation:
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127) {
        sb.append("\\").append(String.format("%04x", (int) c));
    } else {
        sb.append(c);
    }
}
This assumes by "unicode-character" you mean non-ASCII-characters. This code will print any ASCII-character as is and output all other characters as backslash followed by their unicode-code. The definition "unicode-character" is rather vague though, as char in java always represents unicode-characters. This approach preserves any control-chars like "\n", "\r", etc., which is why I chose it over other definitions.
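If the string contains the literal text \u (a backslash followed by u), try the String.replaceAll() method. Both the pattern and the replacement need escaped backslashes:
s = s.replaceAll("\\\\u", "\\\\");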

Remove non printable utf8 characters except controlchars from String

I've got a String containing text, control characters, digits, umlauts (german) and other utf8 characters.
I want to strip all utf8 characters which are not "part of the language". Special characters like (non complete list) ":/\ßä,;\n \t" should all be preserved.
Sadly stackoverflow removes all those characters so I have to append a picture (link).
Any ideas? Help is very appreciated!
PS: If anybody does know a pasting service which does not kill those special characters I would happily upload the strings.. I just wasn't able to find one..
[Edit]: I THINK the regex "\P{Cc}" are all characters I want to PRESERVE. Could this regex be inverted so all characters not matching this regex be returned?
You have already found Unicode character properties.
You can invert the character property, by changing the case of the leading "p"
e.g.
\p{L} matches all letters
\P{L} matches all characters that do not have the property letter.
So if you think \P{Cc} is what you need, then \p{Cc} would match the opposite.
More details on regular-expressions.info
I am quite sure \p{Cc} is close to what you want, but be careful, it does include, e.g. the tab (0x09), the Linefeed (0x0A) and the Carriage return (0x0D).
But you can create your own character class, like this:
[^\P{Cc}\t\r\n]
This class [^...] is a negated character class, so this would match everything that is not "Not control character" (double negation, so it matches control chars), and not tab, CR and LF.
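In Java the class above could be used roughly as follows. This is only a sketch; the sample input and the class name are made up for illustration.

```java
final class ControlExceptWhitespace {
    // Matches control characters (category Cc) except tab, CR and LF.
    // Read the class as: NOT (non-control OR tab OR CR OR LF).
    static String strip(String input) {
        return input.replaceAll("[^\\P{Cc}\\t\\r\\n]", "");
    }

    public static void main(String[] args) {
        // Hypothetical input: BEL and FS mixed with a tab, a line break and umlauts.
        String raw = "a\u0007b\tc\u001c\nd \u00df\u00e4";
        System.out.println(strip(raw));
    }
}
```

Letters such as ß and ä are not control characters, so they are left alone.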
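You can use the following. Note that \p{C} also matches tab and line breaks, and that the result has to be assigned:
your_string = your_string.replaceAll("\\p{C}", "");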

Regex is eating too much stuff

So I recently opened a question and ended up solving it by using a regex. The regex I used essentially ate ALL my non-english characters.
Let me retry this:
I want to eat all non-keyboard characters that may exist in a string
the regex that I'm using is:
[^\\p{L}\\p{N}]
However this turns stuff like
10/10/2012 10:51:25 AM
into
10102012105125AM
Is there some way to easily exclude all alt-code characters from a string with replaceAll and leave keyboard characters like % / \ : and others intact?
Thanks!
You probably want to save only the ASCII characters. The character range [ -~] will achieve that. If you also want whitespace chars, you can add them in: [ -~\s].
System.out.println(input.replaceAll("[^ -~\\s]+", ""));
To remove all non-ASCII characters:
String mystring = <your_input_string>;
mystring = mystring.replaceAll("[^ -~\\s]+", "");
What about \p{Print}? It matches all printable characters, that sounds like exactly what you need.
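The two suggestions differ in how they treat whitespace, which a short comparison can show. This is a sketch with made-up class and method names. By default Java's \p{Print} covers printable ASCII plus the space character, so tabs and line breaks are removed by the second method but kept by the first.

```java
final class AsciiOnly {
    // Keeps printable ASCII (space through tilde) and whitespace.
    static String keepAsciiAndWhitespace(String input) {
        return input.replaceAll("[^ -~\\s]+", "");
    }

    // Keeps only what the POSIX class Print matches: printable ASCII and
    // the space character. Tabs and line breaks are removed.
    static String keepPrintable(String input) {
        return input.replaceAll("\\P{Print}+", "");
    }

    public static void main(String[] args) {
        // Timestamp from the question plus a star, a tab and a percent sign.
        String input = "10/10/2012 10:51:25 AM \u2605\t%";
        System.out.println(keepAsciiAndWhitespace(input));
        System.out.println(keepPrintable(input));
    }
}
```

Both keep the slashes and colons of the timestamp, unlike the original [^\\p{L}\\p{N}] pattern.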
