How can a UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted in Java?
I have tried something like:
Character.toCodePoint((char) (Integer.parseInt("D0", 16)), (char) (Integer.parseInt("93", 16)));
but it does not convert to a valid code point.
That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).
Update
Here is a slightly more direct version of decoding the string than the one provided by "sstan" in another answer. Of course both versions are good, so use whichever makes you more comfortable, or write your own version.
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j += 3) {
    assert src.charAt(j) == '=';
    bytes[i] = (byte) (Character.digit(src.charAt(j + 1), 16) << 4 |
                       Character.digit(src.charAt(j + 2), 16));
}
String str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
Output
Газета
In UTF-8, a single character is not always encoded with the same number of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes. It is therefore not a trivial matter to map UTF-8 bytes yourself to Java chars, which use the UTF-16 encoding, where each char is one 2-byte code unit. Not to mention that for characters with code points above 0xFFFF you also have to deal with surrogate pairs, which is one more complication that is easy to get wrong.
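To make that concrete, here is a small illustration (assuming java.nio.charset.StandardCharsets is imported; the variable name clef is just for this example):
String clef = new String(Character.toChars(0x1D11E));             // U+1D11E, a single code point
System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
System.out.println(clef.length());                                // 2 chars in UTF-16 (a surrogate pair)
System.out.println(clef.codePointCount(0, clef.length()));        // 1 code point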
All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.
Here is some sample code that shows one way this can be achieved:
// requires: import java.nio.charset.StandardCharsets; import java.util.Arrays;
public static void main(String[] args) throws Exception {
    String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
    // Parse the string into hex string tokens.
    String[] tokens = Arrays.stream(src.split("="))
            .filter(s -> s.length() != 0)
            .toArray(String[]::new);
    // Convert the hex string representations to a byte array.
    byte[] utf8bytes = new byte[tokens.length];
    for (int i = 0; i < utf8bytes.length; i++) {
        utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
    }
    // Convert the UTF-8 bytes to a Java String.
    String str = new String(utf8bytes, StandardCharsets.UTF_8);
    // Display the string plus its individual Unicode code points.
    System.out.println(str);
    str.codePoints().forEach(System.out::println);
}
Output:
Газета
1043
1072
1079
1077
1090
1072
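As an aside, the =XX format in the question is quoted-printable encoding, so if Apache Commons Codec happens to be on the classpath, its QuotedPrintableCodec can do the whole decoding in one call. A minimal sketch, assuming the commons-codec dependency is available:
import org.apache.commons.codec.net.QuotedPrintableCodec;

QuotedPrintableCodec codec = new QuotedPrintableCodec("UTF-8");
// decode(String) throws a checked DecoderException on malformed input
System.out.println(codec.decode("=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0")); // Газета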
Related
I am currently storing a String as an array of bytes. However, when I try to use the following code to convert the bytes back to a String using Charset, I have diamonds at the end:
byte[] testbytes = "abc123".getBytes(); // tried getBytes("UTF-8"/StandardCharsets.UTF_8) too
Charset charset = Charset.forName("UTF-8"); // ISO-8859-1 has no diamonds
CharBuffer charBuffer = charset.decode( ByteBuffer.wrap( Arrays.copyOfRange(testbytes,0,testbytes.length) ) );
System.out.println("converted = " + String.valueOf(charBuffer.array()) );
// returns this - abc123����������
If I set the encoding to ISO-8859-1 instead, it converts fine. I thought it might be the encoding of the source code file but opening that in Notepad++ suggests it is also in UTF-8.
Am I missing something or is this just a problem with Android Studio's Logcat window?
- Edit 1 -
Further testing shows that three-character strings do not have this padding problem. With longer strings, Charset.decode seems to pad the char array out with '\u0000' values up to some break point.
String.valueOf ends up printing the padded characters as diamonds, while creating a new String object removes the padding; however, I would like to avoid using String entirely when converting a byte array to a char array, because the values are sensitive.
- Edit 2 -
It appears the above happens if you call charset.decode() again, so I'm guessing there's a buffer being appended to, though I'm not sure at what point. I tried clearing it with charBuffer.clear(), but the second block of code's output appears to be the same, i.e. 3 chars + 2 spaces + 6 new chars.
String test1 = "123";
byte[] test1b = test1.getBytes();
char[] expected1 = test1.toCharArray();
CharBuffer charBuffer = charset.decode( ByteBuffer.wrap( test1b ) );
char[] actual1 = charBuffer.array(); // size 3, correct
String test2 = "123456";
byte[] test2b = test2.getBytes();
char[] expected2 = test2.toCharArray();
CharBuffer charBuffer2 = charset.decode( ByteBuffer.wrap( test2b ) );
char[] actual2 = charBuffer2.array(); // size 11, padded with '\u0000'
Did you try to use the String constructor that receives an array of bytes?
Like:
byte[] testbytes = "abc123".getBytes(StandardCharsets.UTF_8);
String stringDecoded = new String(testbytes, StandardCharsets.UTF_8);
Maybe it can solve your problem.
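For what it's worth, the padding in the question comes from CharBuffer.array(), which returns the buffer's entire backing array, not just the decoded portion; the valid chars sit between position() and limit(). If String must be avoided because of the sensitive values, a sketch that copies only the decoded region (assuming java.util.Arrays is imported):
CharBuffer charBuffer = charset.decode(ByteBuffer.wrap(testbytes));
char[] chars = new char[charBuffer.remaining()]; // exactly the number of decoded chars
charBuffer.get(chars);                           // no '\u0000' padding is copied
// ... use chars, then scrub the sensitive data when done:
Arrays.fill(chars, '\u0000');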
If a stream of data contains characters with different encodings, is there a way to change the charset encoding after an input stream has been created, or any suggestion on how this could be achieved?
Example to help explain:
// data need to read first 4 characters using UTF-8 and next 4 characters using ISO-8859-2?
String data = "testўёѧẅ";
// use default charset of platform, could pass in a charset
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
    // probably an InputStreamReader, working with chars instead of bytes, would be clearer, but hopefully the idea comes across
    byte[] bytes = new byte[4];
    while (in.read(bytes) != -1) {
        // TODO: change the charset here to UTF-8 then read values
        // TODO: change the charset here to ISO-8859-2 then read values
    }
}
Been looking at decoders, might be the way to go:
What is CharsetDecoder.decode(ByteBuffer, CharBuffer, endOfInput)
Encoding conversion in java
Attempt using the same input stream:
String data = "testўёѧẅ";
InputStream inputStream = new ByteArrayInputStream(data.getBytes());
Reader r = new InputStreamReader(inputStream, "UTF-8");
int intch;
int count = 0;
while ((intch = r.read()) != -1) {
    System.out.println((char) intch);
    if ((++count) == 4) {
        r = new InputStreamReader(inputStream, Charset.forName("ISO-8859-2"));
    }
}
// outputs "test" and not the second part (the first InputStreamReader reads ahead and buffers from the underlying stream, so the bytes for the second part are already consumed before the second reader is created)
Assuming that you know there will be n UTF-8 characters and m ISO 8859-2 characters in your stream (n = 4, m = 4 in your example), you can do this by using two different InputStreamReaders working on the same InputStream:
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
    InputStreamReader inUtf8 = new InputStreamReader(in, StandardCharsets.UTF_8);
    InputStreamReader inIso88592 = new InputStreamReader(in, Charset.forName("ISO-8859-2"));
    // read `n` characters using inUtf8, then read `m` characters using inIso88592
}
Note that you need to read characters, not bytes (i.e. check how many characters have been read so far, since in UTF-8 a single character may be encoded as 1-4 bytes).
A String contains Unicode, so it can combine all language scripts.
String data = "testўёѧẅ";
Internally, a String uses a char array, where char is a UTF-16 code unit. Sometimes a Unicode symbol, a code point, needs to be encoded as two chars (a surrogate pair), so char maps exactly onto code points only for part of Unicode. Here, the following might do:
String d1 = data.substring(0, 4);
byte[] b1 = d1.getBytes(StandardCharsets.UTF_8); // binary data, UTF-8 text
String d2 = data.substring(4);
Charset charset = Charset.forName("ISO-8859-2");
byte[] b2 = d2.getBytes(charset); // binary data, Latin-2 text
The number of bytes does not need to correspond to the number of code points.
Also, é might be one code point (é), or two code points: e followed by a combining acute accent.
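That composed/decomposed difference can be checked and normalized with java.text.Normalizer; a short example (assuming the Normalizer import):
String composed = "\u00E9";    // é as one code point
String decomposed = "e\u0301"; // e plus a combining acute accent: two code points
System.out.println(composed.equals(decomposed)); // false
System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
        .equals(composed));                      // true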
To split text by script or Unicode block:
data.codePoints().forEach(cp -> System.out.printf("%-35s - %-25s - %s%n",
Character.getName(cp),
Character.UnicodeBlock.of(cp),
Character.UnicodeScript.of(cp)));
Name - Unicode block - Script
LATIN SMALL LETTER T - BASIC_LATIN - LATIN
LATIN SMALL LETTER E - BASIC_LATIN - LATIN
LATIN SMALL LETTER S - BASIC_LATIN - LATIN
LATIN SMALL LETTER T - BASIC_LATIN - LATIN
CYRILLIC SMALL LETTER SHORT U - CYRILLIC - CYRILLIC
CYRILLIC SMALL LETTER IO - CYRILLIC - CYRILLIC
CYRILLIC SMALL LETTER LITTLE YUS - CYRILLIC - CYRILLIC
LATIN SMALL LETTER W WITH DIAERESIS - LATIN_EXTENDED_ADDITIONAL - LATIN
I am working on the Matasano CryptoChallenge, and the first one is to create a Hex to Base 64 converter. I honestly don't know how to continue from here. My code:
public class HexToBase64 {
    public static void main(String[] args) {
        // String hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
        String hex = "DA65A";
        convertHexTo64(hex);
    }

    public static String convertHexTo64(String hex) {
        // Convert each letter in the hex string to a 4-digit binary string,
        // building a binary representation of the whole hex string.
        StringBuilder binary = new StringBuilder();
        for (int i = 0; i < hex.length(); i++) {
            int dec = Integer.parseInt(hex.charAt(i) + "", 16);
            StringBuilder bin = new StringBuilder(Integer.toBinaryString(dec));
            while (bin.length() < 4) {
                bin.insert(0, '0');
            }
            binary.append(bin);
        }
        // Now take 6 bits at a time and convert each group to a single b64 digit
        // to create the final b64 representation.
        StringBuilder b64 = new StringBuilder();
        for (int i = 0; i + 6 <= binary.length(); i += 6) {
            String temp = binary.substring(i, i + 6); // was (i, i + 5), which takes only 5 bits
            int dec = Integer.parseInt(temp, 2);      // the bit string is base 2, not base 10
            // convert dec to b64 with the lookup table here, then append to b64
        }
        return b64.toString();
    }
}
So after I separate the binary 6 bits at a time and convert to decimal, how do I map the decimal number to the corresponding digit in b64? Would a Hashmap/Hashtable implementation be efficient?
Also, this algorithm reflects how I would do the conversion by hand. Is there a better way? I am looking for a conversion that takes a reasonable amount of time, so time, and implicitly efficiency, is relevant.
Thank you for your time
EDIT: And the page also mentions that "Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing." What does that mean exactly?
Extracted from this Stack Overflow post, which references Apache Commons Codec
byte[] decodedHex = Hex.decodeHex(hex);
byte[] encodedHexB64 = Base64.encodeBase64(decodedHex);
String hex = "00bc9d2a05ef06c79a6e972f8a36737e";
byte[] decodedHex = org.apache.commons.codec.binary.Hex.decodeHex(hex.toCharArray());
String result = Base64.encodeBase64String(decodedHex);
System.out.println("==> " + result);
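If pulling in Apache Commons is not an option, the JDK alone has been enough since Java 8 (java.util.Base64). This also answers the HashMap question: no map is needed, because a Base64 encoder simply indexes into a fixed 64-character table. A dependency-free sketch using the challenge's input:
String hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
byte[] bytes = new byte[hex.length() / 2]; // assumes an even-length hex string
for (int i = 0; i < bytes.length; i++) {
    bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
}
System.out.println(java.util.Base64.getEncoder().encodeToString(bytes));
// SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t
This also illustrates the "operate on raw bytes" advice from the challenge: hex and Base64 are just two printable spellings of the same byte[]; all real work happens on the bytes.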
I am receiving string text via USB communication in Android, in the form of extended ASCII characters, like:
String receivedText = "5286T11ɬ ªË ¦¿¯¾ ¯¾ ɬ ¨¬°:A011605286 ª¿ª ¾®:12:45 ¸Í®°:(9619441121)ª¿ª:-, ®¹¿¦Í°¾ ¡ ®¹¿¦Í°¾ ª¨À, ¾¦¿µ²À ¸Í, ¾¦¿µ²À ªÂ°Íµ °¿®¾°Í͸:- ¡Í°Éª:-, ¬¾¹°, ¸¾¤¾Í°Â¼ ªÂ°Íµ~";
Now these characters represent a string in Hindi.
I am not sure how to convert this received string into the equivalent Hindi text.
Does anyone know how to convert this into the equivalent Hindi text using Java?
Following is the piece of code I am using to convert a byte array to its hex-string form:
public String byteArrayToByteString(byte[] arayValue, int size) {
    byte ch = 0x00;
    int i = 0;
    if (arayValue == null || arayValue.length <= 0)
        return null;
    String pseudo[] = { "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
            "A", "B", "C", "D", "E", "F" };
    StringBuffer out = new StringBuffer();
    while (i < size) {
        ch = (byte) (arayValue[i] & 0xF0); // strip off the high nibble
        ch = (byte) (ch >>> 4);            // shift the bits down
        ch = (byte) (ch & 0x0F);           // must do this if the high-order bit is on!
        out.append(pseudo[(int) ch]);      // convert the nibble to a String character
        ch = (byte) (arayValue[i] & 0x0F); // strip off the low nibble
        out.append(pseudo[(int) ch]);      // convert the nibble to a String character
        i++;
    }
    String rslt = new String(out);
    return rslt;
}
Let me know if this helps in finding a solution.
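(For reference, an equivalent, more compact version of the hex dump above; String.format prints Byte arguments unsigned, so the nibble juggling is unnecessary:)
public String byteArrayToByteString(byte[] arayValue, int size) {
    StringBuilder out = new StringBuilder(size * 2);
    for (int i = 0; i < size; i++) {
        out.append(String.format("%02X", arayValue[i])); // each byte as two hex digits
    }
    return out.toString();
}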
EDIT:
It's UTF-16 encoded, and the characters in the receivedText string are in the form of extended ASCII for Hindi characters.
New Edit
I have new characters
String value = "?®Á?Ƕ ¡??°¿¯¾";
It says मुकेश and "dangaria" in Hindi. Google Translate does not translate "dangaria" into Hindi, so I cannot provide you the Hindi version of it.
I talked to the person who does the encoding. He said that he removed two hex digits from each input value before encoding, i.e. if \u0905 represents अ in Hindi, then he removed the \u09 prefix and converted the remaining 05 into extended-hexadecimal form.
So the new input string I provided is encoded as per the above explanation, i.e. \u09 has been removed and the rest converted into extended ASCII, then sent to the device over USB.
Let me know if this explanation helps you find a solution.
I've been playing around with this a bit and have an idea of what you might need to do. It looks like the value for receivedText in your posting is encoded in windows-1252 for some reason, probably from being pasted into the post. Providing the raw byte values would be better, to avoid any encoding errors. Regardless, I was able to get that String into the following Unicode Devanagari characters:
5286T11फए ऋभ इडऒठ ऒठ फए उएओ:A011605286 ऋडऋ ठऍ:12:45 चयऍओ:(9619441121)ऋडऋ:-, ऍछडइयओठ ँ ऍछडइयओठ ऋउढ, ठइडगऑढ चय, ठइडगऑढ ऋतओयग ओडऍठओययच:- ँयओफऋ:-, एठछओ, चठअठयओतञ ऋतओयग~
With the following code:
final String receivedText = "5286T11ɬ ªË ¦¿¯¾ ¯¾ ɬ ¨¬°:A011605286 ª¿ª ¾®:12:45 ¸Í®°:(9619441121)ª¿ª:-, ®¹¿¦Í°¾ ¡ ®¹¿¦Í°¾ ª¨À, ¾¦¿µ²À ¸Í, ¾¦¿µ²À ªÂ°Íµ °¿®¾°Í͸:- ¡Í°Éª:-, ¬¾¹°, ¸¾¤¾Í°Â¼ ªÂ°Íµ~";
final Charset fromCharset = Charset.forName("x-ISCII91");
final CharBuffer decoded = fromCharset.decode(ByteBuffer.wrap(receivedText.getBytes("windows-1252")));
final Charset toCharset = Charset.forName("UTF-16");
final byte[] encoded = toCharset.encode(decoded).array();
System.out.println(new String(encoded, toCharset.displayName()));
Whether or not those are the expected characters is something you would need to tell me :)
Also, I'm not sure if the x-ISCII91 character encoding is available in Android.
Generally, for a byte array that you know to be a string value, you can use the following.
Assuming byte[] someBytes:
String stringFromBytes = new String(someBytes, "UTF-16");
You may replace "UTF-16" with the appropriate charset, which you can find after some experimentation. This link detailing Java's supported character encodings may be of help.
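One quick way to run that experiment is to decode the same raw bytes with several candidate charsets and inspect the results; a sketch (receivedBytes is a hypothetical variable holding the bytes read from the USB port):
for (String name : new String[] { "UTF-8", "UTF-16", "UTF-16LE", "UTF-16BE", "windows-1252", "x-ISCII91" }) {
    if (Charset.isSupported(name)) { // not every JVM ships every charset
        System.out.println(name + " -> " + new String(receivedBytes, Charset.forName(name)));
    }
}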
From the details you have provided I would suggest considering the following:
If you're reading a file from a USB drive, Android might have existing frameworks that will help you do this in a more standard way.
If you most certainly need to read in and manipulate the bytes from the USB port directly, make sure that you are familiar with the API/protocol of the data you are reading. It may be that some of the bytes are control messages or something similar that cannot be converted to strings, and you will need to identify exactly where in the byte stream the string begins (and ends).
hindi = new String(receivedText.getBytes(), "UTF-16");
But this does not really look like Hindi... are you sure it is encoded as UTF-16?
Edit:
String charset = "UTF-8";
hindi = new String(hindi.getBytes(Charset.forName(charset)), "UTF-16");
Replace UTF-8 with the actual charset that resulted in your long String.
I have dirty data. Sometimes it contains characters like this. I use this data to make queries like
WHERE a.address IN ('mydatahere')
For this character I get
org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '
How can I filter out characters like this? I use Java.
Thanks.
When I had a problem like this, I used a Perl script to ensure that the data is converted to valid UTF-8, with code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
print Encode::decode('UTF-8', $_);
}
This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, Unicode replacement character).
If you run this script on good UTF-8 input, output should be identical to input.
If you have data in database, it makes sense to use DBI to scan your table(s) and scrub all data using this approach to make sure that everything is valid UTF-8.
This is a Perl one-liner version of the same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
EDIT: Added Java-only solution.
This is an example how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
public class UtfFix {
    public static void main(String[] args) throws InterruptedException, CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
                (byte) 0xD0, (byte) 0x9F, // 'П'
                (byte) 0xD1, (byte) 0x80, // 'р'
                (byte) 0xD0,              // corrupted UTF-8, was 'и'
                (byte) 0xD0, (byte) 0xB2, // 'в'
                (byte) 0xD0, (byte) 0xB5, // 'е'
                (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр?вет
    }
}
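From there, if the goal is to feed the scrubbed value back into the database query, the decoded result can simply be re-encoded (assuming java.nio.charset.StandardCharsets is imported); for example:
String scrubbed = parsed.toString();                      // valid UTF-16, bad sequences now U+FFFD
byte[] clean = scrubbed.getBytes(StandardCharsets.UTF_8); // guaranteed valid UTF-8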
You can encode and then decode it to/from UTF-8:
String label = "look into my eyes 〠.〠";
Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();
System.out.println(label);
output:
look into my eyes ?.?
edit: I think this might only work on Java 6.
You can filter surrogate characters with this regex:
String str = "𠀀"; //U+20000, represented by 2 chars in java (UTF-16 surrogate pair)
str = str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); //0
Once you convert the byte array to a String on the Java machine, you'll get (by default on most machines) a UTF-16 encoded string. The proper solution to get rid of non-UTF-8 characters is with the following code:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
System.out.println(values[i].replaceAll(
//"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented because of capitol letters
"[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
"[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
"[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
, ""));
}
Or, if you want to validate whether some string contains non-UTF-8 characters, you would use Pattern.matches like:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
System.out.println(Pattern.matches(
".*(" +
//"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented because of capitol letters
"[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
"[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
"[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
+ ").*"
, values[i]));
}
To make a whole web app UTF-8 compatible, read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.
Maybe this will help someone, as it helped me.
public static String removeBadChars(String s) {
    if (s == null) return null;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        // skip both halves of each surrogate pair, not just the high surrogate,
        // so no unpaired low surrogates are left behind
        if (Character.isSurrogate(s.charAt(i))) continue;
        sb.append(s.charAt(i));
    }
    return sb.toString();
}
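An equivalent version using the code-point stream (Java 8+) may read more clearly; it drops every supplementary character, i.e. everything that would need a surrogate pair:
public static String removeBadChars(String s) {
    if (s == null) return null;
    return s.codePoints()
            .filter(cp -> cp <= 0xFFFF) // keep only the Basic Multilingual Plane
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();
}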
In PHP - I approach this by only allowing printable data. This really helps in cleaning the data for DB.
It's pre-processing though and sometimes you don't have that luxury.
$str = preg_replace('/[[:^print:]]/', '', $str);
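A rough Java analogue of this printable-only scrub uses the Unicode "other" category; it is an approximation of PHP's [:^print:], not an exact match:
// strips control, format, surrogate, private-use and unassigned code points
String cleaned = str.replaceAll("\\p{C}", "");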