Java Unicode replacement char with hex code

I need to replace all special Unicode chars in a string with an escape char "\" followed by the char's hex value.
For example, this string:
String test = "This is a string with the special è unicode char";
should be replaced with:
"This is a string with the special \E8 unicode char";
where E8 is the hex value of the Unicode code point of the char "è".

I have three problems:
1) How do I find the "special chars"? Should I check whether each char value is > 127?
That depends on what your definition of a "special" character is. Testing for values greater than 127 is testing for non-ASCII characters; it is up to you to decide whether that is what you want.
2) How do I get the hex value as a string?
The Integer.toHexString method can be used for that.
3) Should I use a regex to find them, or loop over every char of the string?
A loop is simpler.
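For example, a quick check of what Integer.toHexString returns for the char in question (note that there is no prefix or padding):
char ch = 'è';                                              // U+00E8
System.out.println(Integer.toHexString(ch));                // e8
System.out.println(Integer.toHexString(ch).toUpperCase());  // E8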

I'm trying this solution with a regex replacement (I don't know whether it is better to loop over each char of the string or to use the regex...):
String test = "This is a string with the special è unicode char";
StringBuilder out = new StringBuilder();
final char ESCAPE_CHAR = '\\';        // escape prefix for the hex value
Pattern pat = Pattern.compile("([^\\x20-\\x7E])");
int offset = 0;
Matcher m = pat.matcher(test);
while (!m.hitEnd()) {
    if (m.find(offset)) {
        out.append(test.substring(offset, m.start()));
        for (int i = 0; i < m.group(1).length(); i++) {
            out.append(ESCAPE_CHAR);
            out.append(Integer.toHexString(m.group(1).charAt(i)).toUpperCase());
        }
        offset = m.end();
    }
}
out.append(test.substring(offset));   // append the tail after the last match
return out.toString();
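For comparison, a sketch of the same replacement written with Matcher.appendReplacement/appendTail, which copies the unmatched tail automatically:
Pattern pat = Pattern.compile("[^\\x20-\\x7E]");
Matcher m = pat.matcher(test);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String hex = Integer.toHexString(m.group().charAt(0)).toUpperCase();
    m.appendReplacement(sb, Matcher.quoteReplacement("\\" + hex));
}
m.appendTail(sb);
return sb.toString();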

If your Unicode characters are all in the range \u0080 - \u00FF, you might use this solution:
public class Convert {
    public static void main(String[] args) {
        String in = "This is a string with the special è unicode char";
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char charAt = in.charAt(i);
            if (charAt > 127) {
                // Appending the pieces directly (instead of String.format) means
                // fewer objects to be garbage collected when there is a high
                // number (e.g. 100,000) of conversions.
                out.append('\\').append(Integer.toHexString(charAt).toUpperCase());
                // out.append(String.format("\\%X", (int) charAt));
            } else {
                out.append(charAt);
            }
        }
        System.out.println(out);
    }
}
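Note that the loop above works char by char, so a supplementary character (outside the BMP) would be escaped as two surrogate values. If that matters, a code point based variant is a possible sketch, assuming the same "\" plus upper-case hex convention:
StringBuilder out = new StringBuilder(in.length());
in.codePoints().forEach(cp -> {
    if (cp > 127) {
        // one escape per code point, even outside the BMP
        out.append('\\').append(Integer.toHexString(cp).toUpperCase());
    } else {
        out.append((char) cp);
    }
});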

Related


Replace special characters in a string with their UTF-8 encoded character java?

I want to convert only the special characters to their UTF-8 equivalent character.
For example given a String: Abcds23#$_ss, it should get converted to Abcds23353695ss.
The following is how I did the above conversion:
The UTF-8 value in hexadecimal for # is 23 and in decimal is 35. The UTF-8 value in hexadecimal for $ is 24 and in decimal is 36. The UTF-8 value in hexadecimal for _ is 5f and in decimal is 95.
I know we have the String.replaceAll(String regex, String replacement) method. But I want to replace specific characters with their specific UTF-8 equivalents.
How do I do the same in java?
I don't know how you define "special characters", but this function should give you an idea:
public static String convert(String str)
{
    StringBuilder buf = new StringBuilder();
    for (int index = 0; index < str.length(); index++)
    {
        char ch = str.charAt(index);
        if (Character.isLetterOrDigit(ch))
            buf.append(ch);
        else
            buf.append(str.codePointAt(index));
    }
    return buf.toString();
}

@Test
public void test()
{
    Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
The following uses Java 8 or above. It checks whether a Unicode code point (symbol) is a letter or digit and pure ASCII (< 128); otherwise it outputs the decimal value of the code point as a string of digits.
static String convert(String str) {
    int[] cps = str.codePoints()
            .flatMap(cp ->
                    Character.isLetterOrDigit(cp) && cp < 128
                            ? IntStream.of(cp)                 // keep the ASCII letter or digit as-is
                            : String.valueOf(cp).codePoints()) // otherwise stream the digits of its decimal value
            .toArray();
    return new String(cps, 0, cps.length);
}
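A quick usage check, matching the test case above (assuming the method is named convert as shown):
System.out.println(convert("Abcds23#$_ss")); // Abcds23353695ss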
String.codePoints() yields an IntStream; flatMap merges the per-code-point IntStreams into a single flattened stream, and toArray collects it into an array, from which we construct a new String. This is entirely Unicode safe.
Note that the conversion is not reversible without delimiters.
On Unicode:
Unicode numbers its symbols, called code points, from 0 upwards into the three-byte range.
To encode them as bytes there are UTF-8 (multi-byte), UTF-16LE and UTF-16BE (two-byte sequences), and UTF-32 (more or less the code points as-is).
Java string constants in a .class file are stored in (modified) UTF-8, and a String is composed of UTF-16 chars; a String can also yield code points, as above. So Java by design uses Unicode for text.

How to remove invalid characters from a string?

I have no idea how to remove invalid characters from a string in Java. I'm trying to remove all the characters that are not numbers, letters, or ( ) [ ] . How can I do this?
Thanks
String foo = "this is a thing with & in it";
foo = foo.replaceAll("[^A-Za-z0-9()\\[\\]]", "");
Javadocs are your friend. Regular expressions are also your friend.
Edit:
That being said, this is only for the Latin alphabet; you can adjust accordingly. \\w can be used to denote a "word" character ([A-Za-z0-9_]) if that works for your case, though note that it includes _.
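For example, the same call written with \\w (this keeps the underscore as well):
foo = foo.replaceAll("[^\\w()\\[\\]]", "");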
Using Guava, and almost certainly more efficient (and more readable) than regexes:
CharMatcher desired = CharMatcher.JAVA_DIGIT
.or(CharMatcher.JAVA_LETTER)
.or(CharMatcher.anyOf("()[]"))
.precomputed(); // optional, may improve performance, YMMV
return desired.retainFrom(string);
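A quick usage sketch with the string from the first answer (assuming desired is built as above):
String string = "this is a thing with & in it";
System.out.println(desired.retainFrom(string)); // thisisathingwithinit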
Try this:
String s = "123abc&^%[]()";
s = s.replaceAll("[^A-Za-z0-9()\\[\\]]", "");
System.out.println(s);
The above will remove the characters "&^%" from the sample string, leaving only "123abc[]()" in s.
public static void main(String[] args) {
    String c = "hjdg$h&jk8^i0ssh6+/?:().,+-#";
    System.out.println(c);
    Pattern pt = Pattern.compile("[^a-zA-Z0-9/?:().,'+/-]");
    // replaceAll is a no-op when nothing matches, so no pre-check is needed
    c = pt.matcher(c).replaceAll("");
    System.out.println(c);
}
Use this code to strip the brackets:
String s = "Test[]";
s = s.replace("[", "");
s = s.replace("]", "");
myString = myString.replaceAll("[^\\w\\[\\]\\(\\)]", "");
The replaceAll method takes a regex as its first parameter and replaces all matches in the string; remember to assign the result, since strings are immutable. This regex matches every character that is not a digit, letter, or underscore (\\w) and not one of the brackets you need (\\[\\]\\(\\)).
You can remove special characters from your String/URL or any request parameter you got from the user side:
public static String removeSpecialCharacters(String inputString) {
    final String[] metaCharacters = {"../", "\\..", "\\~", "~/", "~"};
    String outputString = "";
    for (int i = 0; i < metaCharacters.length; i++) {
        if (inputString.contains(metaCharacters[i])) {
            outputString = inputString.replace(metaCharacters[i], "");
            inputString = outputString;
        } else {
            outputString = inputString;
        }
    }
    return outputString;
}
You can specify the range of characters to keep/remove based on the order of characters in the ASCII table. The regex can use actual characters or character hex codes:
// Example - remove characters outside of the range "space to tilde".
// 1) using characters
someString = someString.replaceAll("[^ -~]", "");
// 2) using hex codes for "space" and "tilde"
someString = someString.replaceAll("[^\\u0020-\\u007E]", "");

How can non-ASCII characters be removed from a string?

I have strings "A função", "Ãugent" in which I need to replace characters like ç, ã, and à with empty strings.
How can I remove those non-ASCII characters from my string?
I have attempted to implement this using the following function, but it is not working properly. One problem is that the unwanted characters are getting replaced by the space character.
public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
    String newsrcdta = null;
    char array[] = tmpsrcdta.toCharArray();
    if (array == null)
        return newsrcdta;
    for (int i = 0; i < array.length; i++) {
        int nVal = (int) array[i];
        // Is character ISO control
        boolean bISO = Character.isISOControl(array[i]);
        // Is ignorable identifier
        boolean bIgnorable = Character.isIdentifierIgnorable(array[i]);
        // Remove tab and other unwanted characters..
        if (nVal == 9 || bISO || bIgnorable)
            array[i] = ' ';
        else if (nVal > 255)
            array[i] = ' ';
    }
    newsrcdta = new String(array);
    return newsrcdta;
}
This will search for and replace all non-ASCII characters:
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
FailedDev's answer is good, but can be improved. If you want to preserve the ASCII equivalents, you need to normalize first:
String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
=> will produce "oau"
That way, characters like "öäü" will be mapped to "oau", which at least preserves some information. Without normalization, the resulting String will be blank.
This would be the Unicode solution
String s = "A função, Ãugent";
String r = s.replaceAll("\\P{InBasic_Latin}", "");
\p{InBasic_Latin} is the Unicode block that contains all characters in the Unicode range U+0000..U+007F (see regular-expression.info)
\P{InBasic_Latin} is the negated \p{InBasic_Latin}
You can try something like this. In Latin-1, the special (accented) letters start at code point 192, so you can skip characters at or above that value:
String name = "A função";
StringBuilder result = new StringBuilder();
for(char val : name.toCharArray()) {
if(val < 192) result.append(val);
}
System.out.println("Result "+result.toString());
[Updated solution]
Normalizer (canonical decomposition, NFD) can be used together with replaceAll to strip the combining marks and replace accented characters with their base characters:
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public final class NormalizeUtils {

    public static String normalizeASCII(final String string) {
        final String normalize = Normalizer.normalize(string, Form.NFD);
        return Pattern.compile("\\p{InCombiningDiacriticalMarks}+")
                      .matcher(normalize)
                      .replaceAll("");
    } ...
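Assuming the class above is completed as shown, a usage sketch: NFD splits each accented letter into a base letter plus a combining mark, and the mark is then stripped:
System.out.println(NormalizeUtils.normalizeASCII("A função")); // Prints "A funcao"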
Or you can use the function below to remove non-ASCII characters from the string; it also shows the internal working:
private static String removeNonASCIIChar(String str) {
    StringBuffer buff = new StringBuffer();
    char chars[] = str.toCharArray();
    for (int i = 0; i < chars.length; i++) {
        if (0 < chars[i] && chars[i] < 127) {
            buff.append(chars[i]);
        }
    }
    return buff.toString();
}
The ASCII table contains 128 codes, with a total of 95 printable characters, of which only 52 characters are letters:
[0-127] ASCII codes
[32-126] printable characters
[48-57] digits [0-9]
[65-90] uppercase letters [A-Z]
[97-122] lowercase letters [a-z]
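A small helper based on the ranges above (a sketch, equivalent to testing [0-9A-Za-z]):
static boolean isAsciiLetterOrDigit(int c) {
    return (c >= 48 && c <= 57)     // digits 0-9
        || (c >= 65 && c <= 90)     // uppercase A-Z
        || (c >= 97 && c <= 122);   // lowercase a-z
}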
You can use the String.codePoints method to get a stream of the int values of the characters of this string and filter out non-ASCII characters:
String str1 = "A função, Ãugent";
String str2 = str1.codePoints()
.filter(ch -> ch < 128)
.mapToObj(Character::toString)
.collect(Collectors.joining());
System.out.println(str2); // A funo, ugent
Or you can explicitly specify character ranges. For example filter out everything except letters:
String str3 = str1.codePoints()
.filter(ch -> ch >= 'A' && ch <= 'Z'
|| ch >= 'a' && ch <= 'z')
.mapToObj(Character::toString)
.collect(Collectors.joining());
System.out.println(str3); // Afunougent
See also: How do I not take Special Characters in my Password Validation (without Regex)?
String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"
or
private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");

public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
    return NON_ASCII_PATTERN.matcher(tmpsrcdta).replaceAll("");
}

public static void main(String[] args) {
    System.out.println(matchAndReplaceNonEnglishChar("A função")); // Prints "A funo"
}
Explanation
The method String.replaceAll(String regex, String replacement) replaces all instances of a given regular expression (regex) with a given replacement string.
Replaces each substring of this string that matches the given regular expression with the given replacement.
Java has the "\p{ASCII}" regular expression construct which matches any ASCII character, and its inverse, "\P{ASCII}", which matches any non-ASCII character. The matched characters can then be replaced with the empty string, effectively removing them from the resulting string.
String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"
The full list of valid regex constructs is documented in the Pattern class.
Note: If you are going to be calling this pattern multiple times within a run, it will be more efficient to use a compiled Pattern directly, rather than String.replaceAll. This way the pattern is compiled only once and reused, rather than each time replaceAll is called:
public class AsciiStripper {
    private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");

    public static String stripNonAscii(String s) {
        return NON_ASCII_PATTERN.matcher(s).replaceAll("");
    }
}
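For example, using the class above:
System.out.println(AsciiStripper.stripNonAscii("A função")); // Prints "A funo"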
An easily readable streams solution that keeps only printable ASCII:
String result = str.chars()
        .filter(c -> isAsciiPrintable((char) c))
        .mapToObj(c -> String.valueOf((char) c))
        .collect(Collectors.joining());

private static boolean isAsciiPrintable(char ch) {
    return ch >= 32 && ch < 127;
}
To convert non-printable characters to "_" instead of dropping them, replace the filter with: .map(c -> isAsciiPrintable((char) c) ? c : '_')
The 32..126 range corresponds to the regex character class [\\x20-\\x7E]; its negation [^\\x20-\\x7E] appears in the comment on the regex solution.
Source for isAsciiPrintable: http://www.java2s.com/Code/Java/Data-Type/ChecksifthestringcontainsonlyASCIIprintablecharacters.htm
CharMatcher.retainFrom can be used, if you're using the Google Guava library:
String s = "A função";
String stripped = CharMatcher.ascii().retainFrom(s);
System.out.println(stripped); // Prints "A funo"

char[] to String sequence mismatching in Java for Unicode characters

I have a method like the one below (please ignore the code optimization issue). This method replaces Unicode characters (Bengali characters):
static String swap(String temp, char c)
{
    Integer length = temp.length();
    char[] charArray = temp.toCharArray();
    for (int u = 0; u < length; u++)
    {
        if (charArray[u] == c)
        {
            char g = charArray[u];
            charArray[u] = charArray[u - 1];
            charArray[u - 1] = g;
        }
    }
    String string2 = new String(charArray);
    return string2;
}
While debugging, I saw the values of charArray in the debugger, and the characters were in the sequenced format I want. But after the statement executes, the value stored in the String variable is mismatched:
I want to display the string as "রেরেরে" but it displays "েরেরের", which is not what I want. Please tell me what I am doing wrong.
Note - I don't know Bengali, but I know a bit (or a lot, depending on whom you ask) about Unicode and how Java supports it. The answer assumes knowledge of the latter and not the former.
Going by the Unicode 6.0 Bengali chart, রে is a combination of the dependent vowel sign ে (0x09C7) and the consonant র (0x09B0) and is represented as a sequence of two characters in the character array.
If you are getting the dependent vowel sign alone, in the resulting character sequence (and hence the string), then your optimization is likely to be kooky, as it appears to assume that Bengali characters in Unicode can be represented as a single Unicode codepoint or a single char variable in Java; this would result in the scenario where a consonant would be replaced by another consonant, but the dependent vowel preceding the consonant would never be replaced.
I think a correct optimization must therefore consider the presence of dependent vowels and compare the following consonant in addition to the vowel, i.e. it must compare two characters in the character array instead of comparing individual characters. This might also imply that your method signature must be changed to allow a char[] to be passed instead of a single char, so that Bengali characters can be replaced with the intended Bengali character, instead of replacing one Unicode code point with another, which is what is being done currently.
The notes in other answers on the ArrayIndexOutofBoundsException is valid. The following example that uses your character replacement algorithm demonstrates that not only is your algorithm incorrect, but it is quite possible for the exception to be thrown:
class CodepointReplacer
{
    public static void main(String[] args)
    {
        String str1 = "রেরেরে";
        /*
         * The following is a linguistically invalid sequence,
         * but Java does not concern itself with linguistic correctness
         * if the String or char sequence has been constructed incorrectly.
         */
        String str2 = "েরেরের";
        /*
         * Replacement character র for our strings.
         * It is not রে as one would anticipate.
         */
        char c = str1.charAt(1);
        optimizeKookily(str1, c);
        optimizeKookily(str2, c);
    }

    private static void optimizeKookily(String temp, char c)
    {
        Integer length = temp.length();
        char[] charArray = temp.toCharArray();
        for (int u = 0; u < length; u++)
        {
            if (charArray[u] == c)
            {
                char g = charArray[u];
                charArray[u] = charArray[u - 1]; // throws exception on second invocation of this method
                charArray[u - 1] = g;
            }
        }
    }
}
A better character replacement strategy would therefore be to use the String.replace (the CharSequence variant) or String.replaceAll functions, assuming that you would know how to use these with Bengali characters.
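For example, a minimal sketch of the CharSequence-based approach (the target syllable here is hypothetical, just to show the mechanics):
String input = "রেরেরে";
CharSequence from = "রে";            // consonant + dependent vowel sign, treated as one unit
CharSequence to = "র";               // hypothetical replacement
String output = input.replace(from, to);
System.out.println(output);          // ররর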
The problem is in:
for (int u = 0; u < length; u++)
{
    if (charArray[u] == c)
    {
        char g = charArray[u];
        charArray[u] = charArray[u - 1];
        charArray[u - 1] = g;
    }
}
When u == 0, charArray[u - 1] accesses index -1. Modify your for loop, or add a check for the case u == 0.
Your code will cause an ArrayIndexOutOfBoundsException: when u == 0, charArray[u - 1] accesses index -1.
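A minimal sketch of the guarded loop, assuming the single-char swap is really what is wanted (the first answer argues it is not for Bengali text):
for (int u = 1; u < length; u++) {   // start at 1 so u - 1 is always a valid index
    if (charArray[u] == c) {
        char g = charArray[u];
        charArray[u] = charArray[u - 1];
        charArray[u - 1] = g;
    }
}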
