Double byte character in Java

The code below prints the number of bytes stored for the String below, which contains a double-byte Japanese character. Per my understanding, the output of this program should be 2; however, it comes out as 3. Why is this the case?
String j = "大";
System.out.println(j.getBytes().length);
If this will always be the case, should I assume the following:
1. For a single-byte character, the output of the program will always be 1.
2. For a double-byte character, the output of the program will always be 3.

A UTF-8 character can be anywhere from 1 to 4 bytes long, so your code is printing the correct byte length for the input Japanese character.

I believe the code point for that character is 0x5927, which when represented as UTF-8 is the three bytes E5 A4 A7. (Not all non-ASCII characters take 3 bytes in UTF-8, only those with code points in the range 0x0800 to 0xFFFF.)
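For instance, a small sketch (class name mine) that dumps those UTF-8 bytes:

import java.nio.charset.StandardCharsets;

public class Utf8Dump {
    public static void main(String[] args) {
        String j = "大"; // code point U+5927
        for (byte b : j.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b); // prints: E5 A4 A7
        }
    }
}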

The .getBytes() method uses the default system encoding (on Linux it's usually UTF-8).
Since you mentioned "one-byte" and "two-byte Japanese characters", I guess you want the SJIS encoding. You do it this way:
String j = "大";
System.out.println(j.getBytes("SJIS").length);
prints 2.
As a guideline, never use .getBytes() without specifying an encoding, and never use any other method or class that relies on the default system encoding. Otherwise you'll run your code on a different computer one day and it will stop working.
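A sketch of that guideline applied to the example above; Charset.forName avoids the checked UnsupportedEncodingException that the String-name overload of getBytes throws ("Shift_JIS" is the canonical name of which "SJIS" is an alias):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) {
        String j = "大";
        // Always name the encoding; never rely on the platform default.
        System.out.println(j.getBytes(StandardCharsets.UTF_8).length);       // 3
        System.out.println(j.getBytes(Charset.forName("Shift_JIS")).length); // 2
    }
}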

Related

Store data in byte array in Java

I am trying to convert a string like "password" to hex values and then store them in a long array. The loop works fine until it reaches the value "6F" (the hex value for the 'o' char); then I get a java.lang.NumberFormatException.
String password = "password";
char[] array = password.toCharArray();
long[] data = new long[array.length]; // declaration missing from the original snippet
int index = 0;
for (char c : array) {
    String hex = Integer.toHexString((int) c);
    data[index] = Long.parseLong(hex); // throws NumberFormatException once hex contains a letter, e.g. "6f"
    index++;
}
How can I store the 6F value inside a byte array, as 6F is greater than 1 byte? Please help me with this.
Long.parseLong parses decimal numbers: it turns the string "10" into the number 10. If the input is hex, that is incorrect; the string "10" is supposed to be turned into the number 16. The fix is to use the Long.parseLong(String input, int radix) method. The radix you want is 16, though writing it as 0x10 may be more readable; it's the same thing to the compiler, purely a personal style choice. Thus, Long.parseLong(hex, 0x10) is what you want.
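A minimal illustration of the fix, using the 'o' from the question:

String hex = Integer.toHexString('o'); // 'o' is 111 decimal, so hex is "6f"
long value = Long.parseLong(hex, 16);  // 111; no NumberFormatException this time
System.out.println(value);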
Note that in practice a char holds values from 0 to 65535, which don't fit in a byte. In effect, you must put a marker down that passwords must not contain any characters that aren't ASCII characters (so no umlauts, snowmen, emoji, funny quotes, etc.).
If you fail to check this, Integer.toHexString((int) c) can produce something like 16F or worse (3 to 4 characters), and it can also produce a single character.
More generally, converting from char c to a hex string and then parsing the hex string into a number is completely pointless. It's turning 15 into "F" and then turning "F" back into 15. If you just want to shove a char into a byte, data[index++] = (byte) c; is all you need; that is the only line you need in your for loop.
But, heed this:
This really isn't how you're supposed to do that!
What you're doing is converting character data to a byte array. This is not actually simple: there are only 256 possible byte values, and folks have invented way more characters than that. Literally hundreds of thousands of them.
Thus, to convert characters to bytes or vice versa, you must apply an encoding. Encodings have wildly varying properties. The most commonly used encoding, however, is UTF-8. It can represent every Unicode symbol, and it has the useful property that basic ASCII characters look exactly the same. However, it has the downside that any given character is smeared out into 1, 2, 3, or even 4 bytes, depending on the character. Fortunately, Java has plenty of tools for this, so you don't need to care. What you really want is this:
byte[] data = password.getBytes(StandardCharsets.UTF_8);
That's asking the string to turn itself into a byte array using the UTF-8 encoding. That means "password" turns into the sequence 112 97 115 115 119 111 114 100, which is no doubt what you want, but you can also have as your password, say, außgescheignet ☃, and that works too: it's turned into bytes, and you can get back to your snowman-enabled password:
String in = "außgescheignet ☃";
byte[] data = in.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(data, StandardCharsets.UTF_8);
assert in.equals(andBackAgain); // true
If you stick this in a source file, make sure you save it as UTF-8 in whatever text editor you use, and that javac compiles it that way too (javac has an -encoding parameter to enforce this).
If you think this is going to cause issues on whatever you send this to, and you want to restrict it to what someone with a rather USA-centric view would call 'normal' characters, use the exact same code as showcased here, but with StandardCharsets.US_ASCII instead. Note, though, that password.getBytes(StandardCharsets.US_ASCII) does not throw on non-ASCII input: like every getBytes(Charset) call, it silently substitutes the charset's replacement byte ('?'). To fail fast you need a CharsetEncoder configured to report errors, as sketched below. Throwing an exception early, on a relevant line, is exactly what you want: we just posited in this hypothetical exercise that your infrastructure would not deal with such characters correctly.
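One way to get that fail-fast behavior, sketched with a CharsetEncoder (the class and helper names are mine, not a standard API):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictAscii {
    // Encode to ASCII, throwing instead of substituting '?' for unmappable input.
    static byte[] toAsciiOrThrow(String s) throws CharacterCodingException {
        ByteBuffer bb = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    public static void main(String[] args) throws CharacterCodingException {
        System.out.println(toAsciiOrThrow("password").length); // 8
        toAsciiOrThrow("außgescheignet ☃"); // throws UnmappableCharacterException
    }
}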

Java 8 UTF-16 isn't default charset but UTF-8

I have been doing some coding with String in Java 8 and Java 11, but this question is based on Java 8. I have this little snippet:
final char e = (char)200;//È
I thought that the characters between 0 and 255 [ASCII + extended ASCII] would always fit in a byte, simply because 2^8 = 256, but this seems not to be true: I tried the website https://mothereff.in/byte-counter and it states that the character takes 2 bytes. Can somebody please explain this to me?
Another question: a lot of posts state that Java is UTF-16, but my machine running Windows 7 returns UTF-8 from this snippet:
String csn = Charset.defaultCharset().name();
Is this platform dependent?
I have also tried this snippet:
final List<Charset> charsets = Arrays.asList(StandardCharsets.ISO_8859_1,
        StandardCharsets.US_ASCII, StandardCharsets.UTF_16, StandardCharsets.UTF_8);
charsets.forEach(a -> print(a, "È"));
System.out.println("getBytes");
System.out.println(Arrays.toString("È".getBytes()));
charsets.forEach(a -> System.out.println(a + " " + Arrays.toString(sb.toString().getBytes(a))));

private void print(final Charset set, final CharSequence sb) {
    byte[] array = new byte[4];
    set.newEncoder()
       .encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
    final String buildedString = new String(array, set);
    System.out.println(set + " " + Arrays.toString(array) + " " + buildedString + "<<>>" + buildedString.length());
}
And prints
run:
ISO-8859-1 [-56, 0, 0, 0] È // PERFECT: USING 1 BYTE, WHICH IS -56
US-ASCII [0, 0, 0, 0] // DON'T GET IT; SEE THIS ITEM LATER
UTF-16 [-2, -1, 0, -56] È<<>>1 // WHAT ARE THE -2, -1 BYTES USED FOR? I HAVE TRIED OTHER EXAMPLES AND THEY ALWAYS APPEAR. AM I LOSING TWO BYTES HERE?
UTF-8 [-61, -120, 0, 0] 2 È // SEEMS MY CHARACTER NEEDS TWO BYTES? I THOUGHT CODE=200 WOULD REQUIRE ONLY ONE
getBytes
[-61, -120] // OK, MY UTF-8 REPRESENTATION
ISO-8859-1 [-56] // OK
US-ASCII [63] // OK, BUT WHY DOESN'T ANY BYTE GET ENCODED WHEN I ENCODE IN ASCII ABOVE?
UTF-16 [-2, -1, 0, -56] // AGAIN, WHAT ARE THE -2, -1 LEADING BYTES?
UTF-8 [-61, -120] // OK
I have tried:
System.out.println(new String(new byte[]{-1, -2}, "UTF-16")); // SIMPLY "", AM I WASTING THESE 2 BYTES?
To summarize:
1. Why does UTF-16 always have two leading bytes (new byte[]{-1, -2})? Are they wasted?
2. Why, when I encode "È", do I not get any bytes in the ASCII charset, but when I do "È".getBytes(StandardCharsets.US_ASCII) I get {63}?
3. Java uses UTF-16, but in my case the default is UTF-8; is this platform dependent?
Sorry if this post is confusing.
Environment
Windows 7 64 Bits Netbeans 8.2 with Java 1.8.0_121
First question
For your first question: those bytes are the BOM (byte order mark), and they specify the byte order (whether the least or most significant byte comes first) of a multibyte encoding such as UTF-16.
Second question
Every ASCII character can be encoded as a single byte in UTF-8. But ASCII is not an 8-bit encoding; it uses 7 bits for every character. In fact, all Unicode characters with code points >= 128 require at least two bytes in UTF-8. (The reason is that you need a way to distinguish between 200 and a multibyte code point whose first byte happens to be 200. UTF-8 solves this by using bytes >= 128 to represent multibyte code points.)
'È' is not an ASCII character, so it cannot be represented in ASCII. This explains the second output: 63 is ASCII for the character '?'. Indeed, the Javadoc for the getBytes(Charset) method specifies that unmappable input is mapped to "the default replacement byte array", in this case '?'. On the other hand, to obtain the first ASCII byte array you used the CharsetEncoder directly, which is a lower-level API that does not perform such automatic replacements. (Had you checked the result of the encode method, you would have found it returned a CoderResult instance representing an error.)
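A small sketch contrasting the two behaviors (class name mine):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplaceVsReport {
    public static void main(String[] args) {
        // High-level API: the unmappable 'È' silently becomes '?' (byte 63).
        System.out.println(Arrays.toString("È".getBytes(StandardCharsets.US_ASCII))); // [63]

        // Low-level API: the encoder reports the error instead of replacing.
        CharsetEncoder enc = StandardCharsets.US_ASCII.newEncoder();
        CoderResult result = enc.encode(CharBuffer.wrap("È"), ByteBuffer.allocate(4), true);
        System.out.println(result.isUnmappable()); // true; no byte was written
    }
}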
Third question
Java 8 Strings use UTF-16 internally, but when communicating with other software, different encodings may be expected, such as UTF-8. The Charset.defaultCharset() method returns the default character set of the virtual machine, which depends on the locale and character set of the operating system, not on the encoding used internally by Java strings.
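A quick check of that distinction; the first line's output depends on the OS/locale (UTF-8 by default only since Java 18), while the second does not:

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset()); // e.g. UTF-8 on Linux, windows-1252 on older Windows
        System.out.println("È".length());             // 1 (one UTF-16 code unit, regardless of platform)
    }
}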
Let's back up a bit…
Java's text datatypes use the UTF-16 character encoding of the Unicode character set. (As do VB4/5/6/A/Script, JavaScript, .NET, ….) You can see this in the various operations you do with the string API: indexing, length, ….
Libraries support converting between the text datatypes and byte arrays using various encodings. Some of them are categorized as "extended ASCII", but stating that is a very poor substitute for naming the character encoding actually being used.
Some operating systems allow the user to designate a default character encoding. (Most users don't know or care, though.) Java attempts to pick this up. It is only useful when the program understands that input from the user is in that character encoding, or that output should be. This century, users dealing in text files prefer to use a specific encoding, to communicate it unchanged across systems, and don't appreciate lossy conversions, so they have no use for this concept. From a program's point of view, it is never what you want unless it is exactly what you want.
Where a conversion would be lossy, you have the choice of a replacement character (such a '?'), omitting it, or throwing an exception.
A character encoding is a map between a codepoint (an integer) of a character set and one or more code units, according to the definition of the encoding. A code unit has a fixed size, and the number of code units needed for a codepoint can vary by codepoint.
In libraries, it is not generally useful to have an array of code units, so they take the further step of converting to/from an array of bytes. byte values range from -128 to 127; however, that's just the Java interpretation as two's-complement 8-bit integers. Since the bytes are understood to be encoded text, the values are interpreted according to the rules of the character encoding.
Because some Unicode encodings have code units more than one byte long, byte order becomes important. So, at the byte-array level, there are UTF-16 big-endian and UTF-16 little-endian. When communicating a text file or stream, you send the bytes along with a shared knowledge of the encoding; this "metadata" is required for understanding. So: UTF-16BE or UTF-16LE, for example. To make that a bit easier, Unicode allows some metadata at the beginning of the file or stream to indicate the byte order. It is called the byte order mark (BOM). So the external metadata can share the encoding (say, UTF-16), while the internal metadata shares the byte order. Unicode allows the BOM to be present even when byte order is not relevant, such as in UTF-8. So, if the understanding is that the bytes are text encoded with any Unicode encoding and a BOM is present, then it's a very simple matter to figure out which Unicode encoding it is and what the byte order is, if relevant.
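A sketch that makes the BOM visible, reusing the 'È' from the question (class name mine):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomDemo {
    public static void main(String[] args) {
        // "UTF-16" writes a big-endian BOM (0xFE 0xFF, i.e. -2, -1) before the data;
        // the charsets with an explicit byte order write no BOM at all.
        System.out.println(Arrays.toString("È".getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, -56]
        System.out.println(Arrays.toString("È".getBytes(StandardCharsets.UTF_16BE))); // [0, -56]
        System.out.println(Arrays.toString("È".getBytes(StandardCharsets.UTF_16LE))); // [-56, 0]
    }
}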
1) You are seeing the BOM in some of your Unicode encoding outputs.
2) È is not in the ASCII character set. What would you want to happen in this case? I often prefer an exception.
3) The system you were using, for your account, at the time of your tests, may have had UTF-8 as the default character encoding. Is that important to the way you want and have encoded your text files on that system?

Java string "hello" has 12 bytes when getBytes("UTF-16")?

I expected that, when a Java string is stored as "UTF-16", each character would use 2 bytes, so "hello" should consume 10 bytes. But this code:
String h = "hello";
System.out.println(new String(h.getBytes("UTF-16"), "UTF-16").length());
System.out.println(new String(h.getBytes("UTF-8"), "UTF-8").getBytes("UTF-16").length);
will print "5 12".
My question:
(1) I expected that the first println should get "10" as I mentioned. But why 5?
(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.
I'm using a Mac and my region is Hong Kong. Would you help explain what's happening in the program, and how "5 12" actually comes out?
Thanks a lot!
(1) I expected that the first println should get "10" as I mentioned. But why 5?
You take a 5-character string and encode it as bytes using the UTF-16 encoding.
Then you create a new string by decoding the bytes (correctly) from UTF-16, which gives you a new string consisting of your original 5 characters again; its length() is 5 because length is measured in chars, not bytes.
(2) For the second println, I am trying to getBytes for it first as "UTF-8" then as "UTF-16". I suppose it should also be 10. But actually it's 12.
This part of the code:
new String(h.getBytes("UTF-8"), "UTF-8")
is actually a no-op: it is just a rather expensive way to copy a string. You encode the string to bytes using UTF-8 as the encoding scheme, and then you create a new string by decoding the UTF-8 encoded bytes.
So effectively, you are doing this:
"hello".getBytes("UTF-16").length
The reason for the extra 2 bytes is that the "UTF-16" encoding puts a BOM (byte order mark) as the first 2-byte code unit.
For more information, read the Unicode FAQs on "UTF-8, UTF-16, UTF-32 & BOM".
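As a quick check, and if the BOM overhead is unwanted, a charset with an explicit byte order omits it; a small sketch:

import java.nio.charset.StandardCharsets;

public class Utf16Length {
    public static void main(String[] args) {
        System.out.println("hello".getBytes(StandardCharsets.UTF_16).length);   // 12 (BOM + 5 x 2 bytes)
        System.out.println("hello".getBytes(StandardCharsets.UTF_16BE).length); // 10 (no BOM written)
    }
}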
I expected that the first println should get "10" as I mentioned. But why 5?
You are calling length() on the String, not on the byte[]. So this gives you the length of the String in characters (at least as long as we stay in the Unicode Basic Multilingual Plane; this unfortunately breaks down for characters that need variable-length encoding even in UTF-16, as the sketch at the end of this answer shows).
Once you have a String, it does not matter what encoding was used to create it. length is always given in terms of characters.
If you converted this into a byte[] using UTF-16, you might rightfully have expected 10 (five characters times two bytes each); that it actually ends up being 12 is due to a byte order mark being included.
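A quick sketch of that variable-length caveat: a supplementary character occupies two UTF-16 code units, so length() stops matching the count of user-visible characters:

String clef = "𝄞"; // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
System.out.println(clef.length());                         // 2 (code units)
System.out.println(clef.codePointCount(0, clef.length())); // 1 (code point)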

Why characters after 128 can't be printed

In my project I am trying to convert a binary number to an integer and then convert the integer to a character. But for numbers above 128, only the '?' character is printed. Please help me print characters up to 250. My code is:
class b {
    public static void main(String[] args) {
        String dec1 = "11011001";
        System.out.println(dec1);
        int dec = Integer.parseInt(dec1, 2);
        System.out.println(dec);
        String str = new Character((char) dec).toString();
        System.out.println("decrypted number is " + str);
    }
}
Thank you.
Not all byte values have a printable character associated with them. ASCII does not assign one to every value, and many Unicode code points have no printable form either; the range 0x00 to 0x1F consists entirely of unprintable controls such as DC1, Bell, and Backspace. Unicode reserves the same first 32 characters as non-printable.
Byte values above 127 (0x7F) have different meanings in different encodings, and there are many encodings. Historically, ASCII was the default encoding and there were many extensions to it. These days the standard is Unicode, which exists in several varieties, including UTF-8, UTF-16 (LE, BE and BOM) and UTF-32 (LE, BE and BOM). UTF-8 is common for interchange, especially over the net, and UTF-16 is used internally in many systems.
Depending on the encoding and glyph (the displayed representation), it may take from one to over 16 bytes to represent a single glyph. Emoji are mostly in plane 1, meaning that they require more than 16 bits for their code point (Unicode is a 21-bit code space). Additionally, some glyphs are represented by a sequence of code points; examples are flags, which combine two regional-indicator code points, and emoji joined with "joiners".
In the case of 217 (0xD9): that byte on its own is not legal UTF-8, but 217 as a two-byte (16-bit) integer (0x00D9) is a valid UTF-16 representation of Ù; see the sketch below.
See ASCII and Unicode.
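A small sketch of the point about 217 (class name mine); displaying 'Ù' additionally assumes a console encoding that supports it:

import java.nio.charset.StandardCharsets;

public class Decode217 {
    public static void main(String[] args) {
        // The lone byte 0xD9 is invalid UTF-8, but the 16-bit value 0x00D9
        // decoded as big-endian UTF-16 is the valid character 'Ù'.
        byte[] utf16be = {0x00, (byte) 0xD9};
        System.out.println(new String(utf16be, StandardCharsets.UTF_16BE)); // Ù
        System.out.println((char) 217); // also Ù, since char values are UTF-16 code units
    }
}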
As per your code: first the binary is converted to an Integer, and then you convert the Integer to a Character, which is done by looking up the ASCII value. It returns the character having the same ASCII value as the integer you are converting. Since the values in the ASCII table only go up to 127, you will get a character only for integer values up to 127; for any value of dec greater than 127, you will get the character ?, which is then converted into a String. The first 32 entries are non-printable characters, so you will get some strange symbol for them, but for values of dec in the range 32-126 you will get the character assigned to that particular ASCII value, as per the ASCII table. Since the value 127 is assigned to DEL, you will get a strange symbol for 127 as well.
The issue is that your console's encoding doesn't match the encoding of your Java program's output. I don't know which console you're using, but on Windows you can run this command to see your current encoding:
chcp
The default console encoding for the USA is code page 437, and for Western Europe and Canada it is 850. These encodings contain the 128 ASCII characters plus 128 additional characters that differ from one code page to another. You get nothing beyond the 128 ASCII characters because your Java output's encoding doesn't match the console's encoding; you have to change one of them to match the other.
You can change your console's encoding to UTF-8 by running this command:
chcp 65001
If you're not on Windows, you'll have to search for the equivalent commands for your system. But I believe that on most Linux and Unix-derived systems, you can use the locale command to see the current encoding and the export command to change it.
I receive the following output from your code. I assume that you are running the program in an environment/console that doesn't support the character. You need a console that supports UTF-8, UTF-16 or similar to be able to print all the characters you assign numeric values to.
11011001
217
decrypted number is Ù

Isn't the size of character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);
    System.out.println(new String(cbuff));
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset (**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (though it can still be configured to act the legacy way).
Java stores all its chars internally as two bytes. However, when they become strings etc., the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode; thus, according to the Java Character docs, the max value supported by char is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 in binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.
It uses the platform default charset to decode bytes to characters. If you know your file contains text encoded in a different charset, you can use String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).
In an ASCII text file, each character is just one byte.
It looks like your file contains ASCII characters, which are encoded in just 1 byte. If the text file contained a non-ASCII character, e.g. a 2-byte UTF-8 sequence, you would get just the first byte, not the whole character; see the sketch below.
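A sketch of that truncation, reusing the two-byte 'È' from earlier on this page (class name mine):

import java.nio.charset.StandardCharsets;

public class TruncatedUtf8 {
    public static void main(String[] args) {
        // 'È' is two bytes in UTF-8; decoding only the first byte yields
        // the Unicode replacement character, not half of 'È'.
        byte[] utf8 = "È".getBytes(StandardCharsets.UTF_8); // [-61, -120]
        System.out.println(new String(utf8, 0, 1, StandardCharsets.UTF_8)); // "�" (U+FFFD)
    }
}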
There are some great answers here, but I wanted to point out that the JVM is free to store a char value in any size of space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that java characters are often misused. People don't realize they are writing code that won't properly handle codepoints over 16 bits in length.
Java follows UTF-16, so it allocates 2 bytes per char. When storing a character, it occupies a minimum of 2 bytes and a maximum of 4 bytes (a surrogate pair); there is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded; then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little-endian. For example, the little-endian UTF-16 for A is [65, 0]: when you read the first byte, it returns 65, and after padding with 0 for the second byte, you get A.
