Remove special Unicode characters from a Java String

I am converting HTML data (data with bullet styling) to a Java String, but I am getting junk values (��, the default Unicode replacement character) in the String. I tried to remove these values using replaceAll(), but it is not working.
Any suggestions about how to remove these Unicode characters from the String?

You can remove all non-ASCII characters with:
s.replaceAll("[^\\p{ASCII}]", "")
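A minimal sketch of that call in context. The class name, method name and sample input are made up for illustration; the bullet (U+2022) and the replacement character (U+FFFD) stand in for whatever the HTML conversion produced.

```java
class AsciiOnly {
    // Removes every character that is not in the ASCII range 0x00-0x7F.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Hypothetical input: a bullet followed by text and two replacement characters,
        // written as Unicode escapes so the source file stays pure ASCII.
        String input = "\u2022 first item \ufffd\ufffd";
        System.out.println("[" + stripNonAscii(input) + "]");
    }
}
```

Note that replaceAll() returns a new String; the original is unchanged, so the result has to be assigned or used.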

Related

Remove escaped unicode string in java with regex

I have a string like the one below:
"them coming \nLove it \ud83d\ude00"
I want to remove the character "\ud83d\ude00", so that it becomes
"them coming \nLove it "
How can I achieve this in Java? I have tried the code below, but it doesn't work:
payload.toString().replaceAll("\\\\u\\b{4}.", "")
Thanks :)

I think \\\\u\\b{4}. will not work, because the regex sees \ud83d as a symbol (�), not as a literal string. So to match this kind of unwanted (for any reason) Unicode character, it is better to exclude the characters you accept (do not want to replace), for example all ASCII characters, and match everything else (what you want to replace). Try:
[^\x00-\x7F]+
The range \x00-\x7F covers the Unicode Basic Latin block.
String str = "them coming \nLove it \ud83d\ude00";
System.out.println(str.replaceAll("[^\\x00-\\x7F]+", ""));
This results in:
them coming
Love it
However, you will have a problem if you use national characters or any other non-ASCII symbols (ś, ą, ♉, ☹, etc.), because they are removed as well.
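If the national characters need to survive, one possible alternative (not part of the answer above, just a sketch) is to remove only supplementary code points, i.e. everything above U+FFFF, which is where most emoji such as \ud83d\ude00 live. The class and method names are hypothetical. Symbols in the Basic Multilingual Plane, such as ♉ or ☹, are not touched by this pattern.

```java
class StripSupplementary {
    // Java regexes match by code point, so a surrogate pair such as
    // U+D83D U+DE00 is seen as the single code point U+1F600.
    static String strip(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        // The emoji and the Polish letter are written as Unicode escapes.
        String str = "them coming \nLove it \ud83d\ude00 \u015b";
        System.out.println(strip(str));
    }
}
```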

How to take a certain set of characters from a string and use those as a unicode value?

[Java] I have this hexadecimal value: 0x6c6c6548. I need to take two characters out at a time, use those two characters to get a Unicode value, and then concatenate the results into a string.
My idea was to take the last two digits using the charAt() method and add them to a string starting with "\u00", but that doesn't work: the compiler treats the backslash as the start of an escape, and if you add another backslash in front of the first, it just prints a backslash and doesn't convert anything to Unicode.
So I need to take the 48 out and somehow convert it to its Unicode value, which is 'H', then do that for all the pairs and put them into one string.
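One way this could be approached, as a sketch only: parse each pair of hex digits as a number with Integer.parseInt(pair, 16) and cast it to char, instead of trying to build a "\u00.." escape at run time (escapes are resolved by the compiler, not at run time). The class and method names are hypothetical, and the input is assumed to be the hex digits without the 0x prefix and with an even length.

```java
class HexPairs {
    // Walks the hex string from the end, two digits at a time,
    // and turns each pair into the character with that code.
    static String decodePairsFromEnd(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = hex.length(); i >= 2; i -= 2) {
            String pair = hex.substring(i - 2, i);
            sb.append((char) Integer.parseInt(pair, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 48 -> 'H', 65 -> 'e', 6c -> 'l', 6c -> 'l'
        System.out.println(decodePairsFromEnd("6c6c6548")); // prints "Hell"
    }
}
```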

Base64 encoding Allowed Characters

I'm using Base64 encoding for encoding user id field in Java.
String abc = new String(Base64.encodeBase64("Actualuseridfield".getBytes()));
I want to know whether the string abc above will contain the character " , : or not?
When the Base64 encoded string is returned in abc, will it ever contain below characters?
" <double quote>
, <comma>
: <colon>

You will not see any commas, colons, or double quotes in a Base64 encoded string. You will see equals signs since they're used to pad the ending content.

If you have a proper encoder for Base64, you will not see special characters except:
[A-Z][a-z][0-9][+/] and the padding character '=' at the end
There is another Base64 character set available which replaces [+/] with [_-], making the encoding URL-safe.
Nevertheless, some specifications (MIME, for example) allow other characters to appear in the encoded data. Base64-encoded data often contains a line break every 76 characters. A decoder of that kind has to skip any character other than the ones mentioned above. The padding characters indicate how many bytes were missing from the final 3-byte group, so that the output is always n*4 characters long.
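To check this for yourself, here is a small sketch. It uses the JDK's own java.util.Base64 (Java 8 and later) rather than the Commons Codec call from the question, and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Alphabet {
    // Standard alphabet: A-Z, a-z, 0-9, '+', '/', with '=' padding at the end.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // MIME variant: same alphabet, but with a line break after every 76 characters.
    static String encodeMime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        String abc = encode("Actualuseridfield");
        System.out.println(abc);
        // No double quote, comma or colon can appear: only the alphabet plus padding.
        System.out.println(abc.matches("[A-Za-z0-9+/]*={0,2}"));
        // 100 bytes encode to more than 76 characters, so the MIME output is wrapped.
        System.out.println(encodeMime(new byte[100]).contains("\r\n"));
    }
}
```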

Transforming "weird" and non-printable characters is kind of the whole point of Base64, so no, you won't see those. More info here: http://email.about.com/cs/standards/a/base64_encoding.htm

Removing all non-word characters in a Cyrillic UTF-8 encoded String

Normally, in order to remove non-word characters from a String the replaceAll method can be used:
String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");
The above returns a cleaned string "somestringwithnonwordssuchas".
However, if the string contains Cyrillic characters they get recognised as non-word, and get removed from the string. It is expected that Cyrillic characters would remain. Hence the question.
What is a proper way to deal with the task of removing non-word characters regardless of the language, assuming that string has UTF-8 encoding?

Try [^\\p{L}]. That should match every Unicode codepoint except for letters.
The Pattern class has a pretty thorough description of the possible character classes. Note that the POSIX character classes are ASCII-only by default and won't help you a lot, you'll need to use the Unicode-specific classes.
Note that there's the UNICODE_CHARACTER_CLASS flag that changes the behavior of the POSIX classes to conform to this section of the Unicode Standard (basically making them equivalent to their closest Unicode-aware equivalents).
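A short sketch comparing the two suggestions. The class and method names are hypothetical, and the sample is the Russian phrase "Privet, mir! 123" written with Unicode escapes so that the source file encoding does not matter. Note the difference: [^\p{L}] also removes digits and underscores, while \W with UNICODE_CHARACTER_CLASS keeps them.

```java
import java.util.regex.Pattern;

class UnicodeWords {
    // Keeps letters of any script and removes everything else.
    static String lettersOnly(String s) {
        return s.replaceAll("[^\\p{L}]", "");
    }

    // With UNICODE_CHARACTER_CLASS, the predefined class for non-word characters
    // follows the Unicode definition, so Cyrillic letters count as word characters.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    static String wordCharsOnly(String s) {
        return NON_WORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440! 123";
        System.out.println(lettersOnly(text));   // letters only, digits removed
        System.out.println(wordCharsOnly(text)); // letters and digits kept
    }
}
```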

How to encode a string to replace all special characters

I have a string which contains special characters, and I have to convert it into a string without any special characters. I tried Base64, but Base64 uses the equals sign (=), which is itself a special character. I want to convert the string into one that contains only alphanumeric characters. I also can't simply remove the special characters; I have to replace them, so that two different input strings stay distinct after the conversion. How can I achieve this? Which encoding will help?

The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)
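A minimal sketch of the hex approach, assuming plain JDK classes only. The class and method names are made up for the example. Every UTF-8 byte becomes two hex digits, so the result is strictly alphanumeric and can be decoded back, which keeps different inputs distinct.

```java
import java.nio.charset.StandardCharsets;

class HexText {
    // Encodes the text as UTF-8 and writes each byte as two lowercase hex digits.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Reverses toHex: reads two hex digits per byte and decodes the bytes as UTF-8.
    static String fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String hex = toHex("a=b");
        System.out.println(hex);          // prints "613d62"
        System.out.println(fromHex(hex)); // prints "a=b"
    }
}
```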

There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...

Commons Codec has a URL-safe version of Base64, which emits - and _ instead of the + and / characters:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])
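If Commons Codec is not available, the JDK offers something similar since Java 8. The sketch below uses java.util.Base64.getUrlEncoder() with withoutPadding(), which drops the trailing = characters; the class and method names are hypothetical. The output can still contain - and _, so it is not strictly alphanumeric.

```java
import java.util.Base64;

class UrlSafeBase64 {
    // URL-safe alphabet ('-' and '_' instead of '+' and '/'), no '=' padding.
    static String encode(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    public static void main(String[] args) {
        // These two bytes would give "+/8=" with the standard encoder.
        byte[] data = {(byte) 0xfb, (byte) 0xff};
        System.out.println(encode(data)); // prints "-_8"
    }
}
```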

The easiest way would be to use a regular expression to match all nonalphanumeric characters and replace them with an empty string.
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding any special characters to the above regex will skip that character.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
/[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
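The snippets above are JavaScript. A rough Java equivalent, as a sketch with a hypothetical class and method name, is String.replaceAll with the same character class. replaceAll already replaces every match, so no g flag is needed, and by default Java's \w and \s cover ASCII word characters and ASCII whitespace only.

```java
class WordAndSpace {
    // Removes every character that is neither a word character nor whitespace.
    static String clean(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        String str = "Hello, user, your password is 543#!$32=!";
        System.out.println(clean(str)); // prints "Hello user your password is 54332"
    }
}
```

Since this removes characters rather than replacing them, two different inputs can end up as the same output.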

If you truly can only use alphanumeric characters, you will have to come up with an escaping scheme that uses one of those characters. For example, use 0 as the escape, and then encode each special character as 0 followed by the 2-character hex encoding of its ASCII code. Use 000 to mean 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e
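A sketch of how this scheme could be implemented, with made-up class and method names. It assumes ASCII-only input and rejects the NUL character, because NUL would otherwise also encode to 000 and collide with the digit 0.

```java
class ZeroEscape {
    // Letters and the digits 1-9 pass through; '0' becomes "000";
    // every other ASCII character becomes '0' plus its two-digit hex code.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == 0 || c > 0x7f) {
                throw new IllegalArgumentException("unsupported character code " + (int) c);
            }
            boolean plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '1' && c <= '9');
            if (c == '0') {
                sb.append("000");
            } else if (plain) {
                sb.append(c);
            } else {
                sb.append('0').append(String.format("%02x", (int) c));
            }
        }
        return sb.toString();
    }

    // Reverses encode: a '0' is always followed by exactly two more characters.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '0') {
                sb.append(c);
                continue;
            }
            String hex = s.substring(i + 1, i + 3);
            sb.append(hex.equals("00") ? '0' : (char) Integer.parseInt(hex, 16));
            i += 2;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("This is my special sentence with a 0.");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```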
