String.getBytes() returns array of Unicode chars - java

I was reading the documentation for getBytes(), and it states that the method returns
the resultant byte array.
But when I ran the following program, I found that it appears to return an array of Unicode values.
public class GetBytesExample {
    public static void main(String args[]) {
        String str = new String("A");
        byte[] array1 = str.getBytes();
        System.out.print("Default Charset encoding:");
        for (byte b : array1) {
            System.out.print(b);
        }
    }
}
The above program prints the output
Default Charset encoding:65
This 65 is the Unicode value of A. My question is: where are the bytes that the return type promises?

There is no PrintStream.print(byte) overload, so the byte needs to be widened to invoke the method.
Per JLS 5.1.2:
19 specific conversions on primitive types are called the widening primitive conversions:
byte to short, int, long, float, or double
...
There's no PrintStream.print(short) overload either.
The next most-specific one is PrintStream.print(int). So that's the one that's invoked, hence you are seeing the numeric value of the byte.
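To see the overload resolution in action, compare printing the same byte directly (widened to int) and after an explicit cast to char. This is a minimal sketch; the value 65 assumes an ASCII-compatible default charset:

```java
import java.nio.charset.StandardCharsets;

public class WideningDemo {
    public static void main(String[] args) {
        byte[] bytes = "A".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.println(b);        // widened, resolves to print(int): 65
            System.out.println((char) b); // cast first, resolves to print(char): A
        }
    }
}
```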

String.getBytes() returns the encoding of the string using the platform encoding. The result depends on the machine you run this on. If the platform encoding is UTF-8, or ASCII, or ISO-8859-1, or a few others, an 'A' will be encoded as 65 (aka 0x41).

This 65 is equivalent to Unicode representation of A
It is also the UTF-8 representation of A.
It is also the ASCII representation of A.
It is also the ISO/IEC 8859-1 representation of A.
It so happens that the encoding of A is the same in a lot of character encodings, and that all of these equal the Unicode code point. This is not a coincidence: it is a result of the history of character set / character encoding standards.
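A quick way to see this overlap (and where it ends) is to encode the same string under several charsets; note that UTF-16BE, unlike the others, produces two bytes with a leading zero:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class CharsetOverlap {
    public static void main(String[] args) {
        for (String cs : new String[] { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE" }) {
            // 'A' is the single byte 65 in the first three; UTF-16BE gives [0, 65]
            System.out.println(cs + ": " + Arrays.toString("A".getBytes(Charset.forName(cs))));
        }
    }
}
```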
My question is that where are the bytes whose return type is expected.
In the byte array, of course :-)
You are (just) misinterpreting them.
When you do this:
for (byte b : array1) {
System.out.print(b);
}
you output a series of bytes as decimal numbers with no spaces between them. This is consistent with the way that Java distinguishes between text / character data and binary data. Bytes are binary. The getBytes() method gives a binary encoding (in some character set) of the text in the string. You are then formatting and printing the binary (one byte at a time) as decimal numbers.
If you want more evidence of this, replace the "A" literal with a literal containing (say) some Chinese characters. Or any Unicode characters greater than \u00ff ... expressed using \u syntax.
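For instance, a single Chinese character encodes to three bytes in UTF-8, which makes it obvious that the bytes are an encoding of the text rather than the char values themselves:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MultiByteDemo {
    public static void main(String[] args) {
        // U+4E2D encodes to three UTF-8 bytes: 0xE4 0xB8 0xAD, i.e. -28 -72 -83 as signed bytes
        byte[] utf8 = "\u4e2d".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8));
    }
}
```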

Related

Most efficient typecasting or converting long or int to 4 char string

My goal is to conserve space in my data store, which only accepts Strings.
Because a String in Java is backed by an array of 16-bit chars, I figure that in theory I should be able to convert my 8-byte long into a 4-char String, as both are represented by 8 bytes. (To be clear, I am not interested in making my long integer human-readable in base 10; I want to store it in as short a String as possible.)
However, almost all the literature I have found on this is about converting to the 8-bit byte type, not the type char.
I could encode as UTF-8. I am concerned this would mean I double the length of the String, as each 8-bit byte is stored as a 16-bit char. This would defeat my whole purpose for compacting my data into a 64-bit medium in the first place.
private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");
new String(ByteBuffer.allocate(8).putLong(value).array(), UTF8_CHARSET);
Is my concern correct that I would be wasting space, and if so, is there a way to not waste space?
char != int
Q: Are there any byte sequences that are not generated by a UTF? How
should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For
example, in UTF-8 every byte of the form 110xxxxx₂ must be followed
by a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂
0xxxxxxx₂> is illegal, and must never be generated. When faced with
this illegal byte sequence while transforming or interpreting, a UTF-8
conformant process must treat the first byte 110xxxxx₂ as an illegal
termination error: for example, either signaling an error, filtering
the byte out, or representing the byte with a marker such as FFFD
(REPLACEMENT CHARACTER). In the latter two cases, it will continue
processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte
sequences as characters, however, it may take error recovery actions.
No conformant process may use irregular byte sequences to encode
out-of-band information.
String != byte[] && char != int
Internally String objects are Unicode and encoded as UTF-16 no matter what their source is.
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set,
and several libraries implement the Unicode standard. The primitive
data type char in the Java programming language is an unsigned 16-bit
integer that can represent a Unicode code point in the range U+0000 to
U+FFFF, or the code units of UTF-16. The various types and classes in
the Java platform that represent character sequences - char[],
implementations of java.lang.CharSequence (such as the String class),
and implementations of java.text.CharacterIterator - are UTF-16
sequences.
String is internally represented by UTF-16
The character encodings like UTF-8 are only for interpreting or converting to/from a byte[].
Even if you write a custom CharsetProvider all that will do is encode/decode a byte[] externally, this will absolutely not change the fact that a String is internally represented by UTF-16, so what you want to do is kind of pointless.
Can't be done
A Unicode code point is conceptually a 32-bit (in practice 21-bit) number; a charset is just an encoding of that number. UTF-8 uses 1, 2, 3 or 4 bytes per code point, for example, and UTF-16 uses 2 or 4 bytes, with surrogate pairs marking that two 16-bit code units together form one character.
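For what it's worth, you can pack a long into four chars at the bit level and recover it in memory. The catch, per the answers above, is that the resulting chars may be unpaired surrogates, so the String is not guaranteed to survive a round trip through any character encoding. A minimal sketch (class and method names are illustrative):

```java
public class LongPacker {
    // Pack the 64 bits of a long into four 16-bit chars.
    static char[] pack(long v) {
        return new char[] {
            (char) (v >>> 48), (char) (v >>> 32),
            (char) (v >>> 16), (char) v
        };
    }

    // Reassemble the long from the four chars.
    static long unpack(char[] c) {
        return ((long) c[0] << 48) | ((long) c[1] << 32)
             | ((long) c[2] << 16) | (long) c[3];
    }

    public static void main(String[] args) {
        long value = 0x1234_D800_5678_9ABCL; // deliberately contains a surrogate code unit (0xD800)
        char[] packed = pack(value);
        System.out.println(unpack(packed) == value); // true: round-trips in memory
        // ... but new String(packed) holds an unpaired surrogate, and encoding
        // that String to bytes in any charset will mangle it.
    }
}
```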

Is a Java char array always a valid UTF-16 (Big Endian) encoding?

Say that I would encode a Java character array (char[]) instance as bytes:
using two bytes for each character
using big endian encoding (storing the most significant 8 bits in the leftmost and the least significant 8 bits in the rightmost byte)
Would this always create a valid UTF-16BE encoding? If not, which code points will result in an invalid encoding?
This question is very much related to this question about the Java char type and this question about the internal representation of Java strings.
No. You can create char instances that contain any 16-bit value you desire---there is nothing that constrains them to be valid UTF-16 code units, nor constrains an array of them to be a valid UTF-16 sequence. Even String does not require that its data be valid UTF-16:
char data[] = {'\uD800', 'b', 'c'}; // Unpaired lead surrogate
String str = new String(data);
The requirements for valid UTF-16 data are set out in Chapter 3 of the Unicode Standard (basically, everything must be a Unicode scalar value, and all surrogates must be correctly paired). You can test if a char array is a valid UTF-16 sequence, and turn it into a sequence of UTF-16BE (or LE) bytes, by using a CharsetEncoder:
CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException
(And similarly using a CharsetDecoder if you have bytes.)
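Putting that together, a small self-contained sketch that detects the unpaired surrogate from the snippet above:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf16Check {
    public static void main(String[] args) {
        char[] data = { '\uD800', 'b', 'c' }; // unpaired lead surrogate
        // A fresh encoder reports malformed input instead of replacing it.
        CharsetEncoder encoder = StandardCharsets.UTF_16BE.newEncoder();
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data));
            System.out.println("valid UTF-16, " + bytes.remaining() + " bytes");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-16: " + e); // MalformedInputException here
        }
    }
}
```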

strange behaviour of java getBytes vs getBytes(charset)

consider the following:
public static void main(String... strings) throws Exception {
    byte[] b = { -30, -128, -94 };

    // section utf-32
    String string1 = new String(b, "UTF-32");
    System.out.println(string1);             // prints ?
    printBytes(string1.getBytes("UTF-32"));  // prints 0 0 -1 -3
    printBytes(string1.getBytes());          // prints 63

    // section utf-8
    String string2 = new String(b, "UTF-8");
    System.out.println(string2);             // prints •
    printBytes(string2.getBytes("UTF-8"));   // prints -30 -128 -94
    printBytes(string2.getBytes());          // prints -107
}

public static void printBytes(byte[] bytes) {
    for (byte b : bytes) {
        System.out.print(b + " ");
    }
    System.out.println();
}
output:
?
0 0 -1 -3
63
•
-30 -128 -94
-107
so I have two questions:
in both sections: why is the output of getBytes() different from that of getBytes(charset), even though I have specifically mentioned the string's charset?
why are both byte outputs of getBytes in the utf-32 section different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
Question 1:
in both sections: why is the output of getBytes() different from that of getBytes(charset), even though I have specifically mentioned the string's charset?
The character set you've specified is used during character encoding of the string to the byte array (i.e. in the method itself only). It's not part of the String instance itself. You are not setting the character set for the string, the character set is not stored.
Java does not have an internal byte encoding of the character set; it uses arrays of char internally. If you call String.getBytes() without specifying a character set, it will use the platform default - e.g. Windows-1252 on Windows machines.
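You can check which default your JVM uses with a one-liner; the result varies by platform (and since Java 18 / JEP 400 it defaults to UTF-8):

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // This is the charset String.getBytes() falls back to when none is given.
        System.out.println(Charset.defaultCharset());
    }
}
```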
Question 2:
why are both byte outputs of getBytes in the utf-32 section different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
You cannot always do this. Not every byte sequence is a valid encoding of characters. When such an array is decoded, the malformed sequences are silently replaced with the replacement character, so the original bytes are lost.
This already happens during String string1 = new String(b, "UTF-32"); and String string2 = new String(b, "UTF-8");.
You can change this behavior using an instance of CharsetDecoder, retrieved using Charset.newDecoder.
If you want to encode a random byte array into a String instance then you should use a hexadecimal or base 64 encoder. You should not use a character decoder for that.
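A minimal sketch of such a lossless round trip using base 64; unlike a charset decode, every byte sequence survives intact:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] original = { -30, -128, -94 };
        // Encode arbitrary bytes to a printable String and decode them back.
        String text = Base64.getEncoder().encodeToString(original);
        byte[] restored = Base64.getDecoder().decode(text);
        System.out.println(text + " -> " + Arrays.equals(original, restored)); // ... -> true
    }
}
```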
Java String / char (16 bits UTF-16!) / Reader / Writer are for Unicode text. So all scripts may be combined in a text.
Java byte (8 bits) / InputStream / OutputStream are for binary data. If that data represents text, one needs to know its encoding to make text out of it.
So a conversion from bytes to text always needs a Charset. Often there is an overloaded method without the charset parameter, which then falls back to the platform default (System.getProperty("file.encoding")), and that can differ on every platform.
Relying on the default is absolutely non-portable if the data crosses platforms.
So you had the misconception that the encoding belonged to the String. This is understandable, seeing that in C/C++ unsigned char and byte were largely interchangeable, and encodings were a nightmare.

Finding out utf-8 value of small Tethe char

There is something confusing about this.
I'm trying to get the UTF-8 value of the small theta character, which should be the byte sequence 225 182 191:
http://en.wikipedia.org/wiki/Theta#Character_Encodings
But:
public static void main(String... args) {
    char c = 'Ɵ';
    System.out.println((byte) c);
}
Prints: -97 (????)
I did change my text encoding scheme on eclipse from MacRoman to UTF-8
The encoding of the text source file has nothing to do with how things are at runtime.
A Java char is a 16-bit wide value. It is always implicitly UTF-16.
When the compiler generates a .class file char literals are transcoded to UTF-16 and stored in an int structure within the class' constant pool. Strings are converted to a modified UTF-8 for compactness reasons.
When either is loaded by the JVM they are represented as UTF-16 values/sequences in memory.
Transcoding the value from UTF-16 to UTF-8:
char c = '\u03B8'; // greek small letter theta θ
for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
    int unsigned = b & 0xFF;
    System.out.append(" ").print(unsigned);
}
FYI: The three-byte decimal sequence 225 182 191 is "modifier letter small theta" and not "greek small letter theta"
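You can confirm that with the same transcoding loop applied to U+1DBF (modifier letter small theta), which is the code point the 225 182 191 sequence actually belongs to:

```java
import java.nio.charset.StandardCharsets;

public class ModifierThetaDemo {
    public static void main(String[] args) {
        String modTheta = new String(Character.toChars(0x1DBF)); // modifier letter small theta
        for (byte b : modTheta.getBytes(StandardCharsets.UTF_8)) {
            System.out.print((b & 0xFF) + " "); // prints 225 182 191
        }
        System.out.println();
    }
}
```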
It should be cast to an int, or alternatively, treated as a String so you can call codePointAt(0):
char c = 'Ɵ';
System.out.println((int) c);
System.out.println("Ɵ".codePointAt(0));

java unicode value of char

When I do Collections.sort(List), it will sort based on String's compareTo() logic, where it compares both strings char by char.
List<String> file1 = new ArrayList<String>();
file1.add("1,7,zz");
file1.add("11,2,xx");
file1.add("331,5,yy");
Collections.sort(file1);
My understanding is that a char holds a Unicode value. I want to know the Unicode values of chars like ',' (comma) etc. How can I find them? Is there any URL that lists these numeric values?
My understanding is char means it specifies the unicode value, I want to know the unicode values of char like ,(comma) etc
Well there's an implicit conversion from char to int, which you can easily print out:
int value = ',';
System.out.println(value); // Prints 44
This is the UTF-16 code unit for the char. (As fge notes, a char in Java is a UTF-16 code unit, not a Unicode character. There are Unicode code points greater than 65535, which are represented as two UTF-16 code units.)
Any url contains the numeric value of these?
Yes - for more information about Unicode, go to the Unicode web site.
Uhm no, a char is not a "Unicode value" (and the term to use is Unicode code point).
A char is a code unit in the UTF-16 encoding. And it so happens that in Unicode's Basic Multilingual Plane (ie, Unicode code points ranging from U+0000 to U+FFFF, for code points defined in this range), yes, there is a 1-to-1 mapping between char and Unicode.
In order to know the numeric value of a code point you can just do:
System.out.println((int) myString.charAt(0));
But this IS NOT THE CASE for code points outside the BMP. For these, one code point translates to two chars. See Character.toChars(). And more generally, all static methods in Character relating to code points. There are quite a few!
This also means that String's .length() is actually misleading, since it returns the number of chars, not the number of code points (let alone user-perceived characters).
Demonstration with one Unicode emoticon (the first in that page):
System.out.println(new String(Character.toChars(0x1f600)).length());
prints 2. Whereas:
final String s = new String(Character.toChars(0x1f600));
System.out.println(s.codePointCount(0, s.length()));
prints 1.
