Understanding Java character encodings [duplicate]

What's the difference between
"hello world".getBytes("UTF-8");
and
Charset.forName("UTF-8").encode("hello world").array();
?
In most cases the second snippet produces a byte array with trailing 0-bytes.

Your second snippet uses ByteBuffer.array(), which just returns the array backing the ByteBuffer. That may well be longer than the content written to the ByteBuffer.
Basically, I would use the first approach if you want a byte[] from a String :) You could use other ways of dealing with the ByteBuffer to convert it to a byte[], but given that String.getBytes(Charset) is available and convenient, I'd just use that...
Sample code to retrieve the bytes from a ByteBuffer:
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

ByteBuffer buffer = Charset.forName("UTF-8").encode("hello world");
byte[] array = new byte[buffer.limit()]; // limit() is the number of encoded bytes
buffer.get(array);
System.out.println(array.length); // 11
System.out.println(array[0]); // 104 (encoded 'h')
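For comparison, a minimal sketch of the first approach; with the StandardCharsets constant (available since Java 7) there is no checked UnsupportedEncodingException to handle, and the returned array is exactly as long as the encoded content:
import java.nio.charset.StandardCharsets;

byte[] bytes = "hello world".getBytes(StandardCharsets.UTF_8);
System.out.println(bytes.length); // 11, no trailing 0-bytes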

Related

Is there a quick way to convert a byte array to a string?

In Java, it takes too much time to convert a very large byte array to a string. Is there a better way?
byte [] buf = new byte[100000];
new String(buf).trim();
You could use:
new String(bytes, charset);
Edit: if you want to write this into a file, use Files.write with the bytes. Otherwise you could try a for loop, possibly with multiple threads, but I am not too experienced with Java, and I don't think there is a faster way than going through the whole array, because you need all the elements.
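For the file case, a rough sketch using java.nio.file, reusing the buf array from the question and a placeholder output path:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] buf = new byte[100000];                      // the large array from the question
Files.write(Paths.get("output.bin"), buf);          // writes the bytes directly, no String conversion
String s = new String(buf, StandardCharsets.UTF_8); // only decode if you actually need the text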

Data loss when writing bytes to a file

I'm working on a string compressor for a school assignment.
There's one bug that I can't seem to work out. The compressed data is being written to a file using a FileWriter, represented by a byte array. The compression algorithm returns an input stream, so the data flows as such:
piped input stream
-> input stream reader
-> data stored in char buffer
-> data written to file with file writer.
Now, the bug is that with some very specific strings, the second-to-last byte in the byte array is written incorrectly, and it's always the same bit pattern: "11111100".
Every time it's this bit pattern, and always the second-to-last byte.
Here are some samples from the code:
InputStream compress(InputStream in) {
    //...
    //...
    PipedInputStream pin = new PipedInputStream();
    PipedOutputStream pout = new PipedOutputStream(pin);
    ObjectOutputStream oos = new ObjectOutputStream(pout);
    oos.writeObject(someobject);
    oos.flush();
    DataOutputStream dos = new DataOutputStream(pout);
    dos.writeFloat(/* ... */);
    dos.writeShort(/* ... */);
    dos.write(SomeBytes); // ---Here
    dos.flush();
    dos.close();
    return pin;
}
void write(char[] cbuf, int off, int len) {
    //....
    //....
    InputStreamReader s = new InputStreamReader(
            c.compress(new ByteArrayInputStream(str.getBytes())));
    s.read(charbuffer);
    out.write(charbuffer);
}
A string which triggers it is "hello and good evenin" for example.
I have tried iterating over the byte array and writing the bytes one by one, but it didn't help.
It's also worth noting that when I tried to write to a file using the output stream in the algorithm itself, it worked fine. This design was not my choice, btw.
So I'm not really sure what I'm doing wrong here.
Considering that you're saying:
Now, the bug is that with some very specific strings, the second-to-last byte
in the byte array is written incorrectly, and it's always the same bit
pattern: "11111100".
You are taking a
binary stream (the compressed data)
-> reading it as chars
-> then writing it as chars.
And you are converting bytes to chars without clearly defining the encoding.
I'd say that the problem is that your InputStreamReader is translating some byte sequences in a way that you're not expecting.
Remember that in encodings like UTF-8, two or three bytes may become one single char.
It can't be a coincidence that the very byte pattern you pointed out (11111100) is one of the UTF-8 lead-byte patterns (1111110x). Check the UTF-8 table on Wikipedia and you'll see that this conversion is destructive, since if a byte starts with 1111110x, the next must start with 10xxxxxx.
Meaning that if using utf-8 to convert
bytes1[] -> chars[] -> bytes2[]
in some cases bytes2 will be different from bytes1.
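A quick way to see this for yourself (a minimal sketch, not your exact pipeline): push a byte that is not valid UTF-8 through a bytes -> String -> bytes round trip and compare the results:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] bytes1 = { (byte) 0xFC, 0x41 };                      // 0xFC is the 11111100 pattern
String text   = new String(bytes1, StandardCharsets.UTF_8); // malformed byte becomes U+FFFD
byte[] bytes2 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.equals(bytes1, bytes2));          // false - the original byte is gone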
I recommend changing your code to remove those readers. Or specify a single-byte encoding such as ISO-8859-1, which maps every byte to a character one-to-one, to see whether that prevents the translations.
I solved this by encoding and decoding the bytes with Base64.

How to convert "java.nio.HeapByteBuffer" to String

I have a data structure java.nio.HeapByteBuffer[pos=71098 lim=71102 cap=94870], which I need to convert into a String (in Scala). The conversion might look simple, but whichever approach I tried, I did not get the right conversion. Could you please help me?
Here is my code snippet:
val v : ByteBuffer= map.get("company").get
val utf_str = new String(v, java.nio.charset.StandardCharsets.UTF_8)
println (utf_str)
the output is just "R" ??
I can't see how you can even get that to compile: String has constructors that accept another String or an array, but not a ByteBuffer or any of its parents.
To work with the NIO buffer API you first write to a buffer, then flip it before you read from it. There are lots of good resources online about that, this one for example: http://tutorials.jenkov.com/java-nio/buffers.html
How to read that as a string depends entirely on how the characters are encoded inside the buffer. If they are two bytes per character (as strings are in Java/on the JVM), you can convert your buffer to a character buffer by using asCharBuffer.
So, for example:
val byteBuffer = ByteBuffer.allocate(7).order(ByteOrder.BIG_ENDIAN);
byteBuffer.putChar('H').putChar('i').putChar('!')
byteBuffer.flip()
val charBuffer = byteBuffer.asCharBuffer
assert(charBuffer.toString == "Hi!")
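If instead the buffer holds UTF-8 encoded bytes (which is what your snippet assumes), you can let the Charset do the decoding. A minimal sketch in Java, with a placeholder buffer:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

ByteBuffer buffer = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
String text = StandardCharsets.UTF_8.decode(buffer).toString(); // decodes the remaining bytes
System.out.println(text); // hello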

How to convert byte array in String format to byte array?

I have created a byte array of a file.
FileInputStream fileInputStream = null;
File file = new File("/home/user/Desktop/myfile.pdf");
byte[] bFile = new byte[(int) file.length()];
try {
    fileInputStream = new FileInputStream(file);
    fileInputStream.read(bFile);
    fileInputStream.close();
} catch (Exception e) {
    e.printStackTrace();
}
Now I have one API which expects JSON input, and there I have to put the above byte array in String form. After reading the byte array in String form, I need to convert it back to a byte array again.
So, help me find:
1) How to convert a byte array to a String and then back to the same byte array?
The general problem of byte[] <-> String conversion is easily solved once you know the actual character set (encoding) that has been used to "serialize" a given text to a byte stream, or which is needed by the peer component to accept a given byte stream as text input - see the perfectly valid answers already given on this. I've seen a lot of problems due to a lack of understanding of character sets (and text encoding in general) in enterprise Java projects, even with experienced software developers, so I really suggest diving into this quite interesting topic.
It is generally key to keep the character encoding information as some sort of "meta" information with your binary data if it represents text in some way. Hence the header in, for example, XML files, or even suffixes as parts of file names as is sometimes seen with Apache htdocs contents etc., not to mention filesystem-specific ways to add any kind of metadata to files. Also, when communicating via, say, HTTP, the Content-Type header fields often contain additional charset information to allow for correct interpretation of the actual contents.
However, since in your example you read a PDF file, I'm not sure if you can actually expect pure text data anyway, regardless of any character encoding.
So in this case - depending on the rest of the application you're working on - you may want to transfer binary data within a JSON string. A common way to do so is to convert the binary data to Base64 and, once transferred, recover the binary data from the received Base64 string.
How do I convert a byte array to Base64 in Java?
is a good starting point for such a task.
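A minimal sketch with java.util.Base64 (available since Java 8), assuming bFile holds the file contents from the snippet above:
import java.util.Base64;

String asBase64 = Base64.getEncoder().encodeToString(bFile); // text-safe, can go into the JSON string
byte[] restored = Base64.getDecoder().decode(asBase64);      // byte-for-byte identical to the original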
The String class provides an overloaded constructor for this.
String s = new String(byteArray, "UTF-8");
byteArray = s.getBytes("UTF-8");
Providing an explicit encoding charset is encouraged because different encoding schemes may have different byte representations.
Also, your input stream may not read all the contents in one go. You have to read in a loop until there is nothing left to be read. Read the documentation: read() returns the number of bytes read.
Reads up to b.length bytes of data from this input stream into an
array of bytes. This method blocks until some input is available
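A sketch of such a read loop (on Java 7+, Files.readAllBytes(path) does the same job in one call):
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

static byte[] readFully(File file) throws IOException {
    try (FileInputStream in = new FileInputStream(file)) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) { // read() may return fewer bytes than asked for
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}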
String.getBytes() and String(byte[] bytes) are methods to consider.
Convert byte array to String:
String s = new String(bFile, "ISO-8859-1");
Convert String to byte array:
byte[] bArray = s.getBytes("ISO-8859-1");

How do I avoid the overhead associated with String.getBytes(Charset ch)

String.getBytes(Charset ch) allocates a new buffer on every call, since it returns a new byte[]. Is there a way to avoid this? I'd like to have a reusable byte array and have the strings encoded into this buffer.
You can use the Charset and CharsetEncoder APIs directly, in particular calling encode(CharBuffer, ByteBuffer, boolean). However, I wouldn't expect it to end up being particularly pleasant code.
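A rough sketch of that approach, reusing a single ByteBuffer across calls (buffer sizing and CoderResult/overflow handling are omitted here, so treat it as illustrative only):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
ByteBuffer reusable = ByteBuffer.allocate(1024); // the buffer you keep around

reusable.clear();                                // reset before each encode
encoder.reset();
encoder.encode(CharBuffer.wrap("hello world"), reusable, true);
encoder.flush(reusable);
reusable.flip();                                 // the encoded bytes now sit between position 0 and limit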
If you're like me and don't master ByteBuffer, then to complement Jon's answer: you could also create your own OutputStream implementation wrapping your byte array, and use an OutputStreamWriter to write the String to this custom OutputStream.
You can use
getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)
//Copies characters from this string into the destination character array.
and manage the array by yourself.
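For example (a small sketch with a buffer you manage yourself; note that this gives you chars, not encoded bytes):
char[] dst = new char[64];         // reusable character buffer, sized by you
String s = "hello world";
s.getChars(0, s.length(), dst, 0); // copies the chars into dst without allocating a new array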
