I am trying to port a small library from Java to C#.
I ran into a problem when converting Unicode strings to bytes, which can be demonstrated by the snippets below:
import java.io.*;

public class Test {
    public static void method(String x) {
        System.out.println(x);
        byte[] bytes = x.getBytes();
        for (byte z : bytes) {
            System.out.println(z);
        }
        System.out.println("Array length: " + bytes.length);
    }

    public static void main(String[] args) {
        method("" + (char)0xEE + (char)0x00 + "testowy wydruk");
    }
}
This does three things:
print the string
get its bytes
print that array and its length
I rewrote this snippet in C#:
string x = "" + (char)0xEE + (char)0x00 + "testowy wydruk";
Console.WriteLine(x);
byte[] d = System.Text.Encoding.ASCII.GetBytes(x);
foreach (byte z in d)
{
    Console.WriteLine(z);
}
Console.WriteLine("Array length: " + d.Length);
I don't know why the array has 17 elements in Java but 16 in C#.
The difference is in the first elements of the byte arrays.
Unfortunately, this difference can cause problems later, because the array is sent to another API.
(char)0xEE is î, aka Unicode Character 'LATIN SMALL LETTER I WITH CIRCUMFLEX' (U+00EE), which is encoded to UTF-8 as 0xC3 0xAE, aka -61 -82.
Your Java code doesn't specify which encoding you want the bytes in, so Java apparently converted to UTF-8 for you (the default encoding varies by installation).
You explicitly specified ASCII in the C# code, so the 0xEE character was converted to ?, aka 0x3F, aka 63, since there is no such character in ASCII.
If you change the Java code to use getBytes("ASCII") or getBytes(StandardCharsets.US_ASCII), then you get the same result as in C#.
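For illustration, here is a minimal sketch of that change on the Java side (the class name is just for this example), which reproduces the C# result:

import java.nio.charset.StandardCharsets;

public class AsciiBytesTest {
    public static void main(String[] args) {
        String x = "" + (char)0xEE + (char)0x00 + "testowy wydruk";
        // US-ASCII has no mapping for 0xEE, so it is replaced by '?' (63),
        // matching what Encoding.ASCII.GetBytes produces in C#.
        byte[] bytes = x.getBytes(StandardCharsets.US_ASCII);
        for (byte z : bytes) {
            System.out.println(z);
        }
        System.out.println("Array length: " + bytes.length); // 16, same as the C# array
    }
}

Conversely, if the API you are sending the array to actually expects UTF-8, the cleaner fix is to leave the Java code alone and switch the C# side to System.Text.Encoding.UTF8.GetBytes(x).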
Related
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        char ch = '诶';
        System.out.println((int) ch);
        int c;
        while ((c = System.in.read()) != -1) {
            System.out.println(c);
        }
    }
}
Output:
35830
Here, the value that represents the char 诶 in Unicode is 35830. In binary, that is 10001011 11110110.
When I enter that character in the terminal, I expected to get two bytes, 10001011 and 11110110, so that by combining them again I could obtain the original char.
But what I actually get is:
232
175
182
10
I can see that 10 represents the newline character. But what do the first three numbers mean?
UTF-8 is a multi-byte variable-length encoding.
In order for something reading a stream of bytes to know that there are more bytes to be read in order to finish the current codepoint, there are some values that just cannot occur in a valid UTF-8 byte stream. Basically, certain patterns indicate "hang on, I'm not done".
There's a table which explains it here. A codepoint in the range U+0800 to U+FFFF needs 16 bits to represent it, so its UTF-8 representation consists of 3 bytes:
1st byte 2nd byte 3rd byte
1110xxxx 10xxxxxx 10xxxxxx
You are seeing 232 175 182 because those are the bytes of the UTF-8 encoding.
byte[] bytes = "诶".getBytes(StandardCharsets.UTF_8);
for (byte b : bytes) {
    System.out.println((0xFF & b) + " " + Integer.toString(0xFF & b, 2));
}
Ideone demo
Output:
232 11101000
175 10101111
182 10110110
So the 3 bytes follow the pattern described above.
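To make the bit layout concrete, here is a small sketch (not part of the original answer) that builds the three bytes by hand from the codepoint of 诶 using the pattern above:

public class ManualUtf8 {
    public static void main(String[] args) {
        int cp = '诶'; // 0x8BF6 = 35830

        // Split the 16-bit codepoint into 4 + 6 + 6 bits and add the UTF-8 lead bits.
        int b1 = 0xE0 | (cp >> 12);          // 1110xxxx
        int b2 = 0x80 | ((cp >> 6) & 0x3F);  // 10xxxxxx
        int b3 = 0x80 | (cp & 0x3F);         // 10xxxxxx

        System.out.println(b1 + " " + b2 + " " + b3); // 232 175 182
    }
}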
I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write binary strings into a file.
For example, let's say we have this:
String huff = "00010010100010101000100100";
I've tried converting it to integers by splitting it into chunks of 8 and saving the integer representation, since I can't write individual bits:
for (String str : huff.split("(?<=\\G.{8})")) {
    int val = Integer.parseInt(str, 2);
    out.write(val); // writes to a FileOutputStream
}
The problem is that, in my example, if I try to save "00010010", it converts it to 18 (10010), and I need the leading 0's.
And finally, when I read it back:
int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}
I get:
Code = 10010
instead of:
Code = 00010010
I've also tried converting it to a BitSet and then to byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad these strings inside the loop, before concatenating them.
while ((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
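For completeness, here is a sketch of the full write-and-read-back round trip. It uses in-memory streams instead of the file streams (an assumption, to keep it self-contained) and right-pads the bit string to a multiple of 8; a real encoder would also record how many padding bits were added:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BitStringRoundTrip {
    public static void main(String[] args) throws IOException {
        String huff = "00010010100010101000100100";

        // Right-pad to a multiple of 8 bits so every chunk is a full byte.
        StringBuilder padded = new StringBuilder(huff);
        while (padded.length() % 8 != 0) {
            padded.append('0');
        }

        // Write each 8-bit chunk as one byte.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String chunk : padded.toString().split("(?<=\\G.{8})")) {
            out.write(Integer.parseInt(chunk, 2));
        }

        // Read the bytes back, left-padding each binary string to 8 bits.
        InputStream in = new ByteArrayInputStream(out.toByteArray());
        StringBuilder code = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            String binary = Integer.toBinaryString(b);
            code.append("00000000".substring(binary.length())).append(binary);
        }

        System.out.println(code); // the original bits followed by the padding zeros
    }
}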
You might want to look at the UTF-8 algorithm, since it does exactly what you want. It stores massive amounts of data while discarding zeros, keeping relevant data, and encoding it to take up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");
        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
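A quick sketch (not from the quoted article) confirming that pure ASCII text has an identical byte sequence in UTF-8:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiUtf8Compat {
    public static void main(String[] args) {
        String s = "hello/world";
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // For ASCII-only text the two encodings are byte-for-byte identical.
        System.out.println(Arrays.equals(ascii, utf8)); // true
    }
}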
Binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in UTF-8. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it's a well known and well designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather 00000001 0010100 0101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was using the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as being about reducing the number of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's and 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) easily and encode a value as large as 9,223,372,036,854,775,807.
I'm writing a Simplified DES algorithm to encrypt and subsequently decrypt a string. Suppose I have the initial character '(', which has the binary value 00101000, which I get using the following algorithm:
public void getBinary() throws UnsupportedEncodingException {
    byte[] plaintextBinary = text.getBytes("UTF-8");
    for (byte b : plaintextBinary) {
        int val = b;
        int[] tempBinRep = new int[8];
        for (int i = 0; i < 8; i++) {
            tempBinRep[i] = (val & 128) == 0 ? 0 : 1;
            val <<= 1;
        }
        binaryRepresentations.add(tempBinRep);
    }
}
After I perform the various permutations and shifts, '(' and its binary equivalent are transformed into 10001010 and its ASCII equivalent Š. When I come around to decryption and pass that character through the getBinary() method, I now get the binary string 11000010 followed by another binary string 10001010, which translates into ASCII as x(.
Where is this rogue x coming from?
Edit: The full class can be found here.
You haven't supplied the decrypting code, so we can't know for sure, but I would guess you missed specifying the encoding when populating your String. Java Strings are encoded in UTF-16 internally. Since you're forcing UTF-8 when encrypting, I'm assuming you're doing the same when decrypting. The problem is, when you convert your encrypted bytes to a String for storage, you probably end up with a two-byte character, because 10001010 is 138, which is beyond the 0-127 range of ASCII characters that get represented with a single byte.
So the "x" you're getting is the byte for the code page, followed by the actual character's byte. As suggested in the comments, you'd do better to just store the encrypted bytes as bytes, and not convert them to Strings until they're decrypted.
If I am converting a UTF-8 char to byte, will there ever be a difference in the result of these 3 implementations based on locale, environment, etc.?
byte a = "1".getBytes()[0];
byte b = "1".getBytes(Charset.forName("UTF-8"))[0];
byte c = '1';
Your first line is dependent on the environment, because it will encode the string using the default character encoding of your system, which may or may not be UTF-8.
Your second line will always produce the same result, no matter what the locale or the default character encoding of your system is. It will always use UTF-8 to encode the string.
Note that UTF-8 is a variable-length character encoding. Only the first 128 characters (U+0000 to U+007F) are encoded in one byte; all other characters take up between 2 and 4 bytes.
Your third line assigns a char to a byte, which narrows the char's UTF-16 code unit value down to a byte (Java stores characters as UTF-16 code units). Because characters in the ASCII range have the same numeric value in UTF-16 and in UTF-8, the result is the same as in the second line, but this is not true in general for other characters.
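As a small illustration (not part of the original answer), here is what happens with a non-ASCII character, where the three approaches stop agreeing:

import java.nio.charset.StandardCharsets;

public class NonAsciiDifference {
    public static void main(String[] args) {
        String s = "é"; // U+00E9
        byte utf8First = s.getBytes(StandardCharsets.UTF_8)[0]; // -61 (0xC3), first byte of C3 A9
        byte cast = (byte) 'é';                                 // -23 (0xE9), truncated UTF-16 code unit
        System.out.println(utf8First + " vs " + cast);
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 2 bytes in UTF-8
    }
}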
In principle the question is already answered, but I cannot resist posting a little scribble for those who like to play around with code:
import java.nio.charset.Charset;

public class EncodingTest {

    private static void checkCharacterConversion(String c) {
        byte asUtf8 = c.getBytes(Charset.forName("UTF-8"))[0];
        byte asDefaultEncoding = c.getBytes()[0];
        byte directConversion = (byte) c.charAt(0);
        if (asUtf8 != asDefaultEncoding) {
            System.out.println(String.format(
                    "First char of %s has different result in UTF-8 %d and default encoding %d",
                    c, asUtf8, asDefaultEncoding));
        }
        if (asUtf8 != directConversion) {
            System.out.println(String.format(
                    "First char of %s has different result in UTF-8 %d and direct as byte %d",
                    c, asUtf8, directConversion));
        }
    }

    public static void main(String[] argv) {
        // btw: first time I ever wrote a for loop with a char - feels weird to me
        for (char c = '\0'; c <= '\u007f'; c++) {
            String cc = new String(new char[] {c});
            checkCharacterConversion(cc);
        }
    }
}
If you run this e.g. with:
java -Dfile.encoding="UTF-16LE" EncodingTest
you will get no output.
But of course every single byte (OK, except for the first) will be wrong if you try:
java -Dfile.encoding="UTF-16BE" EncodingTest
because in "big endian" the first byte is always zero for ASCII characters.
That is because in UTF-16 an ASCII character '\u00xy' is represented by two bytes: in UTF-16LE as [xy, 0] and in UTF-16BE as [0, xy].
However, only the first if statement produces any output, so b and c are indeed the same for the first 128 ASCII characters, because in UTF-8 they are encoded as a single byte. This will not be true for any further characters, however; they all have multi-byte representations in UTF-8.
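A tiny check (not part of the answer above) of the byte order just described:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16ByteOrder {
    public static void main(String[] args) {
        // 'A' is U+0041, i.e. \u0041.
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
    }
}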
public static void main(String[] args) throws UnsupportedEncodingException {
    String str = "अ";
    byte[] bytes = str.getBytes("UTF-8");
    for (byte b : bytes) {
        System.out.print(b + "\t");
    }
    String hindi = new String(bytes, "UTF-8");
    System.out.println("\nHindi = " + hindi);
    System.out.println((int) 'अ');
}
OUTPUT:
-32 -92 -123
Hindi = अ
2309
I need an explanation of those three outputs, especially the last one.
Also, I copy-pasted this character अ from a web page. How do I type it manually in the Eclipse IDE? For example, ALT + 65 gives 'A', but ALT + 2309 does not give me 'अ' (I copy-pasted it again).
The first print:
See public byte[] getBytes(Charset charset):
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
The second print:
See public String(byte[] bytes, Charset charset):
Constructs a new String by decoding the specified array of bytes using the specified charset.
The third print:
See this link:
You're printing the character's decimal code point value, which is 2309.
The links provided above should help you to understand the output you're getting in each case.
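As a small aside (not part of the original answer), this sketch ties the decimal value 2309 back to the Unicode escape, which also connects to the typing question and the next answer:

public class CodePointCheck {
    public static void main(String[] args) {
        System.out.println((int) 'अ');                   // 2309 in decimal
        System.out.println(Integer.toHexString(2309));   // 905, i.e. U+0905
        System.out.println('\u0905');                    // अ, typed via its Unicode escape
    }
}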
Ideally, you type the letter's associated Unicode escape "\uNNNN" in Java, in any IDE.
Further, for a single character, the straightforward approach is:
char c = '\uNNNN';
System.out.println((int) c);
Update: A list of Unicode ranges for Devanagari: http://www.unicode.org/charts/PDF/U0900.pdf
To type in Hindi in Notepad, an IDE, etc., search for software which maps keyboard keys to specific Hindi letters and the related grammatical punctuation.