I am a beginner and self-learning Java programming.
So, I want to know the difference between String.length() and String.getBytes().length in Java.
Which is more suitable for checking the length of a string?
String.length()
String.length() is the number of 16-bit UTF-16 code units needed to represent the string. That is, it is the number of char values that are used to represent the string and thus also equal to toCharArray().length. For most characters used in western languages this is typically the same as the number of Unicode characters (code points) in the string, but the number of code points will be less than the number of code units if any UTF-16 surrogate pairs are used. Such pairs are needed only to encode characters outside the BMP and are rarely used in most writing (emoji are a common exception).
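For example, a quick sketch (the emoji here is just an arbitrary character outside the BMP):

public class LengthDemo {
    public static void main(String[] args) {
        // "A" is one code unit; the emoji U+1F600 lies outside the BMP,
        // so it needs a surrogate pair and counts as two chars.
        String s = "A\uD83D\uDE00";

        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(s.toCharArray().length);          // 3, same as length()
    }
}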
String.getBytes().length
String.getBytes().length on the other hand is the number of bytes needed to represent your string in the platform's default encoding. For example, if the default encoding was UTF-16 (rare), it would be exactly 2x the value returned by String.length() (since each 16-bit code unit takes 2 bytes to represent). More commonly, your platform encoding will be a multi-byte encoding like UTF-8.
This means the relationship between those two lengths is more complex. For ASCII strings, the two calls will almost always produce the same result (outside of unusual default encodings that don't encode the ASCII subset in 1 byte). For non-ASCII strings, String.getBytes().length is likely to be larger, as it counts the bytes needed to represent the string, while length() counts 2-byte code units.
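A rough illustration, using the explicit UTF-8 overload of getBytes so the result doesn't depend on the platform default:

import java.nio.charset.StandardCharsets;

public class LengthVsBytes {
    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "h\u00E9llo"; // 'é' (U+00E9) needs 2 bytes in UTF-8

        System.out.println(ascii.length());                                   // 5
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 5
        System.out.println(accented.length());                                // 5
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}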
Which is more suitable?
Usually you'll use String.length() in concert with other string methods that take offsets into the string. E.g., to get the last character, you'd use str.charAt(str.length()-1). You'd only use the getBytes().length if for some reason you were dealing with the array-of-bytes encoding returned by getBytes.
The length() method returns the length of the string in characters.
Characters may take more than a single byte. The expression String.getBytes().length returns the length of the string in bytes, using the platform's default character set.
The String.length() method returns the number of characters in the string, while String.getBytes().length returns the number of bytes used to store those characters.
In Java, chars are stored as UTF-16 code units, so it takes 2 bytes to store one char.
Check this SO answer out.
I hope that it will help :)
In short, String.length() returns the number of characters in the string, while String.getBytes().length returns the number of bytes needed to represent the characters in the string with the specified encoding.
In many cases, String.length() will have the same value as String.getBytes().length. But when the encoding is UTF-8 and the string contains characters with code points above 127, String.length() will not be the same as String.getBytes().length.
Here is an example which explains how the characters in a string are converted to bytes when calling String.getBytes(). This should give you a sense of the difference between String.length() and String.getBytes().length.
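A minimal sketch along those lines (not the original linked example):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "A\u00E9"; // 'A' is ASCII, 'é' (U+00E9) is above 127

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(s.length());            // 2 characters
        System.out.println(utf8.length);           // 3 bytes: 0x41, 0xC3, 0xA9
        System.out.println(Arrays.toString(utf8)); // [65, -61, -87] -- bytes are signed
    }
}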
Related
Recently I found some negative bytes hidden in a Java string in my code which were causing a .equals String comparison to fail.
What is the significance of negative byte values in Strings? Can they mean anything? Is there any situation in which a negative byte value in a String could be interpreted as anything? I'm a noob at this encoding business, so if it requires explanation of different encoding schemes please feel free.
A Java string contains characters, BUT once you convert it to bytes you can interpret those bytes in different ways. As an unsigned value, each byte can range from 0 to 255, inclusive. That's 8 bits.
Now, the leftmost bit can be interpreted as a sign bit or as part of the magnitude of the value. If that bit is interpreted as a sign bit, which is what Java's byte type does, then you will have data items ranging from -128 to +127, inclusive.
You didn't post the code you used to print the characters, but if you used logic that interpreted the bytes as signed data items then you will get negative numbers in your output.
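For instance, a tiny sketch of how a byte with its high bit set shows up as negative:

public class SignedByteDemo {
    public static void main(String[] args) {
        // 0xC3 is 195 as an unsigned value, but Java's byte type is signed,
        // so the same bit pattern prints as a negative number.
        byte b = (byte) 0xC3;

        System.out.println(b);        // -61
        System.out.println(b & 0xFF); // 195, the unsigned value recovered
    }
}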
I defined a unicode character as a byte array:
private static final byte[] UNICODE_NEXT_LINE = Charsets.UTF_8.encode("\u0085").array();
At the moment byte array length is 3, is it safe to assume the length of the array is always 3 across platforms?
Thank you
Actually, that particular character encodes to two bytes in UTF-8 (0xC2 0x85), regardless of platform. The length of 3 you are seeing most likely comes from the ByteBuffer returned by encode(): array() gives you the whole backing array, which can have spare capacity beyond the encoded bytes, so buffer.remaining() (or "\u0085".getBytes(StandardCharsets.UTF_8).length) is the number you actually want.
More generally, Unicode characters in UTF-8 can be one byte, two bytes, three bytes or even four bytes long, so no, you can't assume that if you convert any character to UTF-8 it'll come out as a fixed number of bytes.
That particular character encodes to two bytes in UTF-8 (the 3 you measured includes unused capacity in the ByteBuffer's backing array), and other characters will be different again. Unicode characters are anywhere from 1 to 4 bytes long in UTF-8. The 8 in 'UTF-8' just means that it uses 8-bit code units.
The Wikipedia page on UTF-8 provides a pretty good overview of how that works. Basically, the first bits of the first byte tell you how many bytes long that character will be. For instance, if the first bit of the first byte is a 0, as in 01111111, then that means this character is only one byte long (in UTF-8, these are the ASCII characters). If the first bits are 110, as in 11011111, then that tells you that this character will be two bytes long. The chart in the Wikipedia page provides a good illustration of this.
There's also this question, which has some good answers as well.
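To make the distinction concrete, here's a small sketch; the exact size of the backing array is an implementation detail of the JDK's encoder, so treat that second number as illustrative:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class NextLineBytes {
    public static void main(String[] args) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode("\u0085");

        System.out.println(buf.remaining());    // 2: the encoded bytes 0xC2 0x85
        System.out.println(buf.array().length); // may be larger: the backing array's capacity

        // An unambiguous way to get exactly the encoded bytes:
        byte[] bytes = "\u0085".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);       // 2
    }
}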
As the Java doc states:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
But when I have a String (containing just ASCII characters) and convert it to a byte array, every character of the String is stored in one byte, which is less than the 16 bits the Java docs state. How does that work? I could imagine that the Java compiler/interpreter uses just one byte per char for an ASCII character for performance reasons.
Furthermore, what happens if I've got a String with just ASCII characters plus one non-ASCII character and convert it to a byte array? Does every character of the String use 2 bytes now?
Converting characters to bytes and vice versa is done using a character encoding.
The character encoding determines how characters are represented by bytes. For example, ASCII is a character encoding which uses 7 bits per character. Obviously, it can only represent 128 characters, far fewer than the 65,536 possible char values in Java.
Other character encodings are UTF-8 and UTF-16. In fact, a Java char is really a UTF-16 code unit: if you directly cast it to an int, you get the UTF-16 code for the character.
Here's a longer tutorial to character encodings: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
If you call getBytes() on a String, it will use the default character encoding of the system to convert the characters in the string to bytes. It's better to use the version of getBytes() that takes a character set name as an argument, so that you know what character set is used. For example:
byte[] bytes = str.getBytes("UTF-8");
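Since Java 7 you can also pass a Charset constant from java.nio.charset.StandardCharsets, which avoids the checked UnsupportedEncodingException:

byte[] bytes = str.getBytes(StandardCharsets.UTF_8); // no checked exception to handle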
The internal format of a String uses 16 bits per character. When you convert it to a byte array, you use a certain character encoding which is either specified explicitly or the default platform encoding. The encoding may use fewer bits per character.
For example the ASCII encoding will store each character in a byte but it can only represent 128 different characters.
Another often used encoding is UTF-8, which uses a variable number of bytes per character. The first 128 characters (corresponding to the characters available in ASCII) can be stored in one byte each. Characters with code points 128 or higher need two or more bytes.
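To answer the second part of the question directly, a small sketch: in UTF-8 only the non-ASCII character grows, while an explicit UTF-16 encoding spends two bytes on every char.

import java.nio.charset.StandardCharsets;

public class MixedStringDemo {
    public static void main(String[] args) {
        String mixed = "abc\u00E9"; // three ASCII characters plus one 'é'

        // In UTF-8 only the 'é' takes two bytes; the ASCII characters stay one byte each.
        System.out.println(mixed.length());                                   // 4
        System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 5
        // A UTF-16 encoding uses two bytes for every char instead.
        System.out.println(mixed.getBytes(StandardCharsets.UTF_16BE).length); // 8
    }
}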
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
Your platform's default charset is probably UTF-8. Hence, getBytes() will use one byte per character for characters in the ASCII range, and more for characters outside it.
String.getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array". The platform's default charset (Charset.defaultCharset()) is probably UTF-8.
As for the second question, strings aren't actually required to use UTF-16. The way a JVM stores strings internally is irrelevant. The few occurrences of UTF-16 in the JVM spec apply only to chars.
I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input, stores them in a single byte, and prints them out. How do two-byte characters fit into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
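For completeness, a minimal sketch of the more robust approach, assuming the input really is UTF-8:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadChars {
    public static void main(String[] args) throws IOException {
        // A Reader decodes the incoming bytes into chars using an explicit
        // charset, instead of assuming every byte is an ISO-8859-1 character.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        int c;
        while ((c = in.read()) != -1) {
            System.out.print((char) c);
        }
    }
}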
Java stores characters as 2 bytes, but for normal ASCII characters the actual data fits in one byte. So as long as you can assume the data being read is ASCII, this will work fine, since the actual numeric value of each character fits in a single byte.
Title is pretty self-explanatory. In a lot of the JRE javadocs I see the phrases "stream of bytes" and "stream of characters" all over the place.
But aren't they the same thing? Or are they slightly different (e.g. interpreted differently) in Java-land? Thanks in advance.
In Java, a byte is not the same thing as a char. Therefore a byte stream is different from a character stream. Bytes are intended for arbitrary binary data; characters are specifically for data representing the building blocks of strings.
but if a char is only 1 byte in width
Except that it's not.
As per JLS §4.2.1, a char is a number in the range:
from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
But a byte is a number in the range
from -128 to 127, inclusive
A byte stream is just plain bytes, like what you see when you open a file in a hex editor.
A character is different from a plain byte. The ASCII encoding uses exactly 1 byte per character, but that is not true for many other encodings. For example, UTF-8 may use from 1 to 4 bytes to encode a single character. A character stream is designed to abstract away the underlying encoding and produce chars in a single consistent form (in Java, char and String use UTF-16).
As a rule of thumb:
When you are dealing with text, use a character stream so the bytes are decoded into characters with the appropriate encoding.
When you are dealing with binary data, or a mix of binary and text, use a byte stream, since a character stream doesn't make sense there. If a sequence of bytes represents a String in a certain encoding, you can always pick those bytes out and use the String(byte[] bytes, Charset charset) constructor to get the String back, as sketched below.
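A minimal round-trip sketch of that last point:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        String original = "h\u00E9llo";

        // String -> bytes with an explicit charset...
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes)); // [104, -61, -87, 108, 108, 111]

        // ...and back again with the same charset.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original)); // true
    }
}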
They are different. char is a 2-byte datatype in Java; byte is a 1-byte datatype.
Edit: char is also an unsigned type, while byte is not.
Generally it is better to talk about streams in terms of their sizes rather than what they carry. A stream of bytes is more intuitive than a stream of chars, because with a stream of chars we have to double-check that a char is indeed one byte and not a Unicode char or anything fancy.
A char is a representation, which can be represented by one or more bytes, but a byte is always going to be a byte. The whole world will burn if bytes ever stop being 8 bits.