What's the difference between writeUTF and writeChars? - java

What's the difference between writeUTF and writeChars? (methods of ObjectOutputStream)
Furthermore, I have not found a corresponding readChars method in ObjectInputStream.

writeUTF writes the text in (modified) UTF-8 encoding, preceded by its length, so readUTF knows how much data to read from the stream.
writeChars writes the text as a sequence of 2-byte chars with no length prefix. To read it back, we have to call readChar repeatedly, and we need to know how many chars were written.
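A minimal sketch of that asymmetry (the ByteArrayOutputStream plumbing and the manual length field are just for illustration): readUTF recovers the string on its own, while readChar needs an externally supplied count.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public class WriteUtfVsWriteChars {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeUTF("hello");          // length prefix + modified UTF-8 data
                out.writeInt("hello".length()); // writeChars has no prefix, so store a count ourselves
                out.writeChars("hello");        // raw 2-byte chars, no length
            }

            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                String viaUtf = in.readUTF();   // reads its own length prefix
                int count = in.readInt();
                StringBuilder viaChars = new StringBuilder();
                for (int i = 0; i < count; i++) {
                    viaChars.append(in.readChar());
                }
                System.out.println(viaUtf);     // hello
                System.out.println(viaChars);   // hello
            }
        }
    }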

writeChars() uses Unicode values
Writes every character in the string s, to the output stream, in
order, two bytes per character. If s is null, a NullPointerException
is thrown. If s.length is zero, then no characters are written.
Otherwise, the character s[0] is written first, then s[1], and so on;
the last character written is s[s.length-1]. For each character, two
bytes are actually written, high-order byte first, in exactly the
manner of the writeChar method.
writeUTF() uses a slightly-modified version of UTF-8
Writes two bytes of length information to the output stream, followed
by the modified UTF-8 representation of every character in the string
s. If s is null, a NullPointerException is thrown. Each character in
the string s is converted to a group of one, two, or three bytes,
depending on the value of the character.
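To make the two byte layouts visible, here is a small sketch using DataOutputStream, which shares the same DataOutput contract as ObjectOutputStream but writes no serialization header (the hex helper is just for illustration):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ByteLayoutDemo {
        static String hex(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (byte x : b) sb.append(String.format("%02X ", x));
            return sb.toString().trim();
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream utf = new ByteArrayOutputStream();
            new DataOutputStream(utf).writeUTF("AB");
            System.out.println(hex(utf.toByteArray()));   // 00 02 41 42  (2-byte length, then UTF-8 data)

            ByteArrayOutputStream chars = new ByteArrayOutputStream();
            new DataOutputStream(chars).writeChars("AB");
            System.out.println(hex(chars.toByteArray())); // 00 41 00 42  (two 2-byte chars, no length)
        }
    }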

Related

String to Byte Conversion and Back Again Not Returning Same Result (ASCII)

I'm having a few issues converting a string back to the appropriate value after it has been converted to bytes.
The initial string:
"0000000000Y Yã"
Where the 'ã' is just a character value.
The conversion code:
byte[] b = s.getBytes(StandardCharsets.US_ASCII);
However when using to convert it back:
String str = new String(b, StandardCharsets.US_ASCII);
I receive:
"0000000000Y Y?"
Anyone know why this is?
Thanks.
ã is not an ASCII character, so how it is handled is determined by the charset implementation:
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset-
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
For this charset it comes out as '?'
ã is not part of the US_ASCII character set.
The getBytes() method is documented with:
This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement byte array.
(see the documentation)
For US_ASCII, the default replacement byte array seems to be one byte representing the '?' character (ASCII code 0x3F). So this is what gets inserted into the byte array in place of your ã character.
When converting back to String, you get the character corresponding to the replacement byte, being the '?' character.
So, if you convert to bytes, and you want to get back the identical characters, be sure to use a character set that contains every character you intend to use. A safe decision will be UTF-8.
If you need to obey some character encoding (e.g. because some external interface needs that), then Java's replacement strategy makes sense, but of course some characters will get lost.
This is because ã is not an ASCII character. Check an
ASCII table for valid ASCII characters.
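A small round-trip sketch of the behaviour described above, using the string from the question:

    import java.nio.charset.StandardCharsets;

    public class CharsetRoundTrip {
        public static void main(String[] args) {
            String s = "0000000000Y Y\u00e3";   // the 'ã' from the question

            byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);   // ã is unmappable, replaced by '?'
            System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // 0000000000Y Y?

            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);       // every character is representable
            System.out.println(new String(utf8, StandardCharsets.UTF_8));     // 0000000000Y Yã
        }
    }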

Reading special UTF-8 characters from ByteBuffer

How can I read special characters from a ByteBuffer? ByteBuffer.wrap("\u0001".getBytes()).asCharBuffer().get() throws a BufferUnderflowException. As far as I understand, ByteBuffer.getChar() always advances by two bytes, and there's only one byte in this string.
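One way to read such data is to decode the buffer through a charset instead of reinterpreting the raw bytes as UTF-16 code units; a minimal sketch (assuming the bytes are UTF-8):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.StandardCharsets;

    public class ByteBufferDecode {
        public static void main(String[] args) {
            // One byte of UTF-8 data; asCharBuffer() would see zero complete
            // 2-byte UTF-16 code units, so get() would underflow.
            ByteBuffer buf = ByteBuffer.wrap("\u0001".getBytes(StandardCharsets.UTF_8));

            // Decode through the charset instead.
            CharBuffer chars = StandardCharsets.UTF_8.decode(buf);
            System.out.println((int) chars.get());   // 1
        }
    }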

Difference between String.length() and String.getBytes().length

I am beginner and self-learning in Java programming.
So, I want to know about difference between String.length() and String.getBytes().length in Java.
What is more suitable to check the length of the string?
String.length()
String.length() is the number of 16-bit UTF-16 code units needed to represent the string. That is, it is the number of char values used to represent the string, and is thus also equal to toCharArray().length. For most characters used in Western languages this is typically the same as the number of Unicode characters (code points) in the string, but the number of code points will be less than the number of code units if any UTF-16 surrogate pairs are used. Such pairs are needed only to encode characters outside the BMP and are rarely used in most writing (emoji are a common exception).
String.getBytes().length
String.getBytes().length on the other hand is the number of bytes needed to represent your string in the platform's default encoding. For example, if the default encoding was UTF-16 (rare), it would be exactly 2x the value returned by String.length() (since each 16-bit code unit takes 2 bytes to represent). More commonly, your platform encoding will be a multi-byte encoding like UTF-8.
This means the relationship between those two lengths is more complex. For ASCII strings, the two calls will almost always produce the same result (outside of unusual default encodings that don't encode the ASCII subset in 1 byte). For non-ASCII strings, String.getBytes().length is likely to be larger, as it counts the bytes needed to represent the string, while length() counts 2-byte code units.
Which is more suitable?
Usually you'll use String.length() in concert with other string methods that take offsets into the string. E.g., to get the last character, you'd use str.charAt(str.length()-1). You'd only use the getBytes().length if for some reason you were dealing with the array-of-bytes encoding returned by getBytes.
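A short sketch of the difference (UTF-8 is passed explicitly here so the output does not depend on the platform default):

    import java.nio.charset.StandardCharsets;

    public class LengthVsBytes {
        public static void main(String[] args) {
            String ascii = "hello";
            String accented = "h\u00e9llo";      // "héllo"
            String emoji = "hi\uD83D\uDE00";     // "hi" + U+1F600, a surrogate pair

            System.out.println(ascii.length());                                    // 5
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);     // 5

            System.out.println(accented.length());                                 // 5
            System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);  // 6 (é takes 2 bytes)

            System.out.println(emoji.length());                                    // 4 code units
            System.out.println(emoji.codePointCount(0, emoji.length()));           // 3 code points
            System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);     // 6 (the emoji takes 4 bytes)
        }
    }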
The length() method returns the length of the string in characters.
Characters may take more than a single byte. The expression String.getBytes().length returns the length of the string in bytes, using the platform's default character set.
The String.length() method returns the number of chars in the string, while String.getBytes().length returns the number of bytes used to encode those characters.
In memory, Java stores chars in UTF-16, so each char takes 2 bytes; the byte array from getBytes() is produced with the platform's default charset, which may use a different number of bytes per character.
Check this SO answer out.
I hope that it will help :)
In short, String.length() returns the number of characters in the string, while String.getBytes().length returns the number of bytes needed to represent those characters in the specified (or default) encoding.
In many cases, String.length() will have the same value as String.getBytes().length. But when the encoding is UTF-8 and the string contains characters with code points above 127, String.length() will not be the same as String.getBytes().length.
Here is an example which explains how the characters in a string are converted to bytes when calling String.getBytes(). This should give you a sense of the difference between String.length() and String.getBytes().length.

Storing characters as single bytes in java

I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input, stores each of them in a single byte, and prints them out. How do two-byte characters fit into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
Java stores characters as 2 bytes, but for normal ASCII characters the actual data fits in one byte. So as long as you can assume that the input being read is ASCII, this will work fine, since the actual numeric value of the character fits in a single byte.
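A minimal sketch of the safer alternative hinted at above; the UTF-8 choice is an assumption and should match the console's actual encoding:

    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;

    public class EchoWithReader {
        public static void main(String[] args) throws IOException {
            // Instead of casting raw bytes to chars (which silently assumes
            // ISO-8859-1), let a Reader decode the bytes with an explicit charset.
            Reader in = new InputStreamReader(System.in, StandardCharsets.UTF_8);
            int c;
            while ((c = in.read()) != -1) {   // read() returns a decoded char, not a raw byte
                System.out.print((char) c);
            }
        }
    }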

Java - stream of bytes vs. stream of characters?

Title is pretty self-explanatory. In a lot of the JRE javadocs I see the phrases "stream of bytes" and "stream of characters" all over the place.
But aren't they the same thing? Or are they slightly different (e.g. interpreted differently) in Java-land? Thanks in advance.
In Java, a byte is not the same thing as a char. Therefore a byte stream is different from a character stream. Bytes are intended for arbitrary binary data; characters are specifically for data representing the building blocks of strings.
but if a char is only 1 byte in width
Except that it's not.
As per the JLS §4.2.1 a char is a number in the range:
from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
But a byte is a number in the range
from -128 to 127, inclusive
A stream of bytes is just plain bytes, like what you would see if you opened a file in a hex editor.
A character is different from a plain byte. The ASCII encoding uses exactly 1 byte per character, but that is not true for many other encodings; UTF-8, for example, may use from 1 to 4 bytes to encode a single character. A character stream is designed to abstract away the underlying encoding and produce chars in one fixed representation (in Java, char and String use UTF-16).
As a rule of thumb:
When you are dealing with text, you must use a character stream to decode the bytes into characters with the appropriate encoding.
When you are dealing with binary data, or a mix of binary data and text, you must use a byte stream, since anything else doesn't make sense. If a sequence of bytes represents a String in a certain encoding, you can always pick those bytes out and use the String(byte[] bytes, Charset charset) constructor to get the String back.
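A tiny sketch of that last point, using the String(byte[], Charset) constructor:

    import java.nio.charset.StandardCharsets;

    public class BytesVsChars {
        public static void main(String[] args) {
            String text = "h\u00e9llo";   // "héllo"

            // Byte view: just numbers; how many you get depends on the encoding.
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length);          // 6

            // Character view: decode with the same charset to get the String back.
            String decoded = new String(utf8, StandardCharsets.UTF_8);
            System.out.println(decoded.length());     // 5
            System.out.println(decoded.equals(text)); // true
        }
    }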
They are different. char is a 2-byte datatype in Java; byte is a 1-byte datatype.
Edit: char is also an unsigned type, while byte is not.
Generally it is better to talk about streams in terms of the size of their elements rather than what they carry. A stream of bytes is more intuitive than a stream of chars, because a stream of chars makes us double-check that a char really is a single byte and not a Unicode character or anything fancier.
A char is a representation, which can be encoded as one or more bytes, but a byte is always going to be a byte. The whole world will burn when bytes stop being 8 bits.
