Platform Dependent Encoding issues in Java - java

Noticed this behavior while troubleshooting a file generation issue in a piece of java code that's moved from AIX to LINUX sever.
Charset.defaultCharset();
returns ISO-8859-1 on AIX, UTF-8 on Linux, and windows-1252 on my Windows 7. With that said, I am trying to figure out why on the Linux machine, nlength = 24 (3 bytes per alphanumeric character) whereas on AIX and Windows it is 8.
String inString = "ABC12345";
byte[] ebcdicByte = new byte[inString.length()];
System.out.println("Length:"+inString.getBytes("Cp1047").length);
ebcdicByte = inString.getBytes("Cp1047").);
String ebcdicString = new String( ebcdicByte);
int nlength = ebcdicString.getBytes().length;

You are misunterstanding things.
This is Java.
There are bytes. There are chars. And there is the default encoding.
When translating from bytes to chars, you have to decode.
When translating from chars to bytes, you have to encode.
And of course, apart from very limited charsets you will never have a 1-1 char-byte mapping.
If you see problems with encoding/decoding, the cause is pretty simple: somewhere in your code (with luck, in only one place; if not lucky, in several places) you failed to specify the charset to use when decoding and encoding.
Also note that by default, the encoding/decoding behaviour on failure it to replace unmappable char/byte sequences.
All this to say: a String does not have an encoding. Sure, it is a series of chars and a char is a primitive type; but it could just as well have been a stream of carrier pigeons, the two basic processes remain the same: you need to decode from bytes and you need to encode to bytes; if either part fails you end with meaningless byte sequences/mutant carrier pigeons.

Building on fge's answer...
Your observation is occurring because new String(ebcdicByte) and ebcdicString.getBytes() use the platform's default charset.
ISO-8859-1 and windows-1252 are one-byte charsets. In those charsets, one byte always represents one character. So in AIX and Windows, when you do new String(ebcdicByte), you will always get a String whose character count is identical to your byte array's length. Similarly, converting a String back to bytes will use a one-to-one mapping.
But in UTF-8, one character does not necessarily correspond to one byte. In UTF-8, bytes 0 through 127 are single-byte representations of characters, but all other values are part of a multi-byte sequence.
However, not just any sequence of bytes with their high bit set is a valid UTF-8 sequence. If you give an UTF-8 decoder a sequence of bytes that isn't a properly encoded UTF-8 byte sequence, it is considered malformed. new String will simply map malformed sequences to a special default character, usually "�" ('\ufffd'). That behavior can be changed by explicitly creating your own CharsetDecoder and calling its onMalformedInput method, rather than just relying on new String(byte[]).
So, the ebcdicByte array contains this EBCDIC representation of "ABC12345":
C1 C2 C3 F1 F2 F3 F4 F5
None of those are valid UTF-8 byte sequences, so ebcdicString ends up as "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd" which is "��������".
Your last line of code calls ebcdicString.getBytes(), which again does not specify a character set, which means the default charset will be used. Using UTF-8, "�" gets encoded as three bytes, EF BF BD. Since there are eight of those in ebcdicString, you get 3×8=24 bytes.

You have to specify the charset in the second to last line.
String ebcdicString = new String( ebcdicByte,"Cp1047");
as already pointed out, you always have to specify the charset when encoding/decoding.

Related

Java 8 UTF-16 isn't default charset but UTF-8

I been doing some coding with String in Java8,Java 11 but this question is based on Java 8. I have this little snippet.
final char e = (char)200;//È
I just thought that the characters between 0.255[Ascii+extended Ascii] would always fit in a byte just because 2^8=256 but this seems not to be true i have try on the website https://mothereff.in/byte-counter and states that the character is taking 2 bytes can somebody please explain to me.
Another question in a lot of post states that Java is UTF-16 but in my machine running Windows 7 is returning UTF-8 in this snippet.
String csn = Charset.defaultCharset().name();
Is this platform depent?
Other questions i have try this snippet.
final List<Charset>charsets = Arrays.asList(StandardCharsets.ISO_8859_1,StandardCharsets.US_ASCII,StandardCharsets.UTF_16,StandardCharsets.UTF_8);
charsets.forEach(a->print(a,"È"));
System.out.println("getBytes");
System.out.println(Arrays.toString("È".getBytes()));
charsets.forEach(a->System.out.println(a+" "+Arrays.toString(sb.toString().getBytes(a))));
private void print(final Charset set,final CharSequence sb){
byte[] array = new byte[4];
set.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
final String buildedString = new String(array,set);
System.out.println(set+" "+Arrays.toString(array)+" "+buildedString+"<<>>"+buildedString.length());
}
And prints
run:
ISO-8859-1 [-56, 0, 0, 0] È//PERFECT USING 1 BYTE WHICH IS -56
US-ASCII [0, 0, 0, 0] //DONT GET IT SEE THIS ITEM FOR LATER
UTF-16 [-2, -1, 0, -56] È<<>>1 //WHAT IS -2,-1 BYTE USED FOR? I HAVE TRY WITH OTHER EXAMPLES AND THEY ALWAYS APPEAR AM I LOSING TWO BYTES HERE??
UTF-8 [-61, -120, 0, 0] 2 È //SEEMS TO MY CHARACTER NEEDS TWO BYTES?? I THOUGHT THAT CODE=200 WOULD REQUIRE ONLY ONE
getBytes
[-61, -120]//OK MY UTF-8 REPRESENTATION
ISO-8859-1 [-56]//OK
US-ASCII [63]//OK BUT WHY WHEN I ENCODE IN ASCCI DOESNT GET ANY BYTE ENCODED?
UTF-16 [-2, -1, 0, -56]//AGAIN WHAT ARE -2,-1 IN THE LEADING BYTES?
UTF-8 [-61, -120]//OK
I have try
System.out.println(new String(new byte[]{-1,-2},"UTF-16"));//SIMPLE "" I AM WASTING THIS 2 BYTES??
In resume.
Why UTF-16 always has two leading bytes are they wasted? new byte[]{-1,-2}
Why when i encode "È" i dont get any bytes in ASCCI Charset but when i do È.getBytes(StandardCharsets.US_ASCII) i get {63}?
Java uses UTF-16 but in my case UTF-8 is platform depend??
Sorry if this post is confussing
Environment
Windows 7 64 Bits Netbeans 8.2 with Java 1.8.0_121
First question
For your first question: those bytes are the BOM code and they specify the byte order (whether the least or most significant comes first) of multibyte encoding such as UTF-16.
Second question
Every ASCII character can be encoded as a single byte in UTF-8. But ASCII is not an 8-bit encoding, it uses 7 bits for every character. And in fact, all Unicode character with code points >= 128 require at least two bytes. (The reason is that you need a way to distinguish between 200 and a multibyte code point whose first byte happens to be 200. UTF-8 solves this by using the bytes >= 128 to represent multibyte codepoints.)
'È' is not an ASCII character, so it cannot be represented in ASCII. This explains the second output: 63 is ASCII for the character '?'. And indeed, the Javadoc for the getBytes(Charset) method specifies that unmappable input is mapped to "the default replacement byte array", in this case '?'. On the other hand, to obtain the first ASCII byte array you used the CharsetEncoder directly, which is a more low-level API and does not perform such automatic replacements. (When you would have checked the result of the encode method, you would have found it to have returned a CoderResult instance representing an error.)
Third question
Java 8 Strings use UTF-16 internally, but when communicating with other software, different encodings may be expected, such as UTF-8. The Charset.defaultCharset() method returns the default character set of the virtual machine, which depends on the locale and character set of the operating system, not on the encoding used internally by Java strings.
Let's back up a bit…
Java's text datatypes use the UTF-16 character encoding of the Unicode character set. (As do, VB4/5/6/A/Script, JavaScript, .NET, ….) You can see this in the various operations you do with the string API: indexing, length, ….
Libraries support converting between the text datatypes and byte arrays using various encodings. Some of them are categorized as "Extended ASCII", but stating that is a very poor substitute for naming the character encoding actually being used.
Some operating systems allow the user to designate a default character encoding. (Most users don't know or care, though.) Java attempts to pick this up. It is only useful when the program understands that input from the user is that character encoding or that output should be. This century, users dealing in text files prefer to use a specific encoding, communicate them unchanged across systems, don't appreciate lossy conversions and therefore don't have any use for this concept. From a program's point of view, it is never what you want unless it is exactly what you want.
Where a conversion would be lossy, you have the choice of a replacement character (such a '?'), omitting it, or throwing an exception.
A character encoding is a map between a codepoint (integer) of a character set and one or more code units, according to the definition of the encoding. A code unit is a fixed size and the number of code units needed for a codepoint, might vary by codepoint.
In libraries, it is not generally useful to have an array of code units so they take the further step of converting to/from an array of bytes. byte values do range from -128 to 127, however, that's the Java interpretation as two's complement 8-bit integers. As the bytes are understood to be encoding text, the values would be interpret according to the rules of the character encoding.
Because some Unicode encodings, have code units more than one byte long, byte order becomes important. So, at the byte array level, there is UTF-16 Big Endian and UTF-16 Little Endian. When communicating a text file or stream, you would send the bytes and well as having a shared knowledge of the encoding. This "metadata" is required for understanding. So, UTF-16BE or UTF-16LE, for example. To make that a bit easier, Unicode allows some metadata beginning of the file or stream to indicate the byte order. It is called the byte-order mark (BOM) So, the external metadata can share the encoding (say, UTF-16), while the internal metadata shares the byte order. Unicode allows the BOM to be present even when byte order is not relevant, such as UTF-8. So, if the understanding is that the bytes are text encoded with any Unicode encoding and a BOM is present, then it's a very simple matter to figure out which Unicode encoding it is and what the byte order is, if relavent.
1) You are seeing the BOM in some of your Unicode encoding outputs.
2) È is not in the ASCII character set. What would want to happen in this case? I often prefer an exception.
3) The system you were using, for your account, at the time of your tests, may have had UTF-8 as the default character encoding, Is that important to the way you want and have encoded your text files on that system?

Converting String from One Charset to Another

I am working on converting a string from one charset to another and read many example on it and finally found below code, which looks nice to me and as a newbie to Charset Encoding, I want to know, if it is the right way to do it .
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straight forward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible, again invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs with more complex like accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "Här är några merkkejä"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "Här är några merkkejä"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.

Will String.getBytes("UTF-16") return the same result on all platforms?

I need to create a hash from a String containing users password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call this method with specified encoding, (such as UTF-8) on a platform where this is not the default encoding, the non-ASCII characters get replaced by a default character (if I understand the behaviour of getBytes() correctly) and therefore on such platform, I will get a different byte array, and eventually a different hash.
Since Strings are internally stored in UTF-16, will calling String.getBytes("UTF-16") guarantee me that I get the same byte array on every platform, regardless of its default encoding?
Yes. Not only is it guaranteed to be UTF-16, but the byte order is defined too:
When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
(The BOM isn't relevant when the caller doesn't ask for it, so String.getBytes(...) won't include it.)
So long as you have the same string content - i.e. the same sequence of char values - then you'll get the same bytes on every implementation of Java, barring bugs. (Any such bug would be pretty surprising, given that UTF-16 is probably the simplest encoding to implement in Java...)
The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I'd also expect String.getBytes("UTF-8") to give the same results on every platform.
It is true, java uses Unicode internally so it may combine any script/language. String and char use UTF-16BE but .class files store there String constants in UTF-8. In general it is irrelevant what String does, as there is a conversion to bytes specifying the encoding the bytes have to be in.
If this encoding of the bytes cannot represent some of the Unicode characters, a placeholder character or question mark is given. Also fonts might not have all Unicode characters, 35 MB for a full Unicode font is a normal size. You might then see a square with 2x2 hex codes or so for missing code points. Or on Linux another font might substitute the char.
Hence UTF-8 is a perfect fine choice.
String s = ...;
if (!s.startsWith("\uFEFF")) { // Add a Unicode BOM
s = "\uFEFF" + s;
}
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
Both UTF-16 (in both byte orders) and UTF-8 always are present in the JRE, whereas some Charsets are not. Hence you can use a constant from StandardCharsets not needing to handle any UnsupportedEncodingException.
Above I added a BOM for Windows Notepad esoecially, to recognize UTF-8. It certainly is not good practice. But as a small help here.
There is no disadvantage to UTF16-LE or UTF-16BE. I think UTF-8 is a bit more universally used, as UTF-16 also cannot store all Unicode code points in 16 bits. Text is Asian scripts would be more compressed, but already HTML pages are more compact in UTF-8 because of the HTML tags and other latin script.
For Windows UTF-16LE might be more native.
Problem with placeholders for non-Unicode platforms, especially Windows, might happen.
I just found this:
https://github.com/facebook/conceal/issues/138
which seems to answer negatively your question.
As per Jon Skeet's answer: the specification is clear. But I guess Android/Mac implementations of Dalvik/JVM don't agree.

Isn't the size of character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) {
byte[] cbuff = new byte[1];
fr.read(cbuff,0,1);
System.out.println(new String(cbuff));
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8 and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way)
Java stores all it's "chars" internally as two bytes. However, when they become strings etc, the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and encodes them to characters.
It uses the platform default charset to encode bytes to characters. If you know, your file contains text, that is encoded in a different charset, you can use the String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).
In ASCII text file each character is just one byte
Looks like your file contains ASCII characters, which are encoded in just 1 byte. If text file was containing non-ASCII character, e.g. 2-byte UTF-8, then you get just the first byte, not whole character.
There are some great answers here but I wanted to point out the jvm is free to store a char value in any size space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that java characters are often misused. People don't realize they are writing code that won't properly handle codepoints over 16 bits in length.
Java allocates 2 of 2 bytes for character as it follows UTF-16. It occupies minimum 2 bytes while storing a character, and maximum of 4 bytes. There is no 1 byte or 3 bytes of storage for character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded, then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little endian. For example, the little endian UTF-16 for A is [65, 0]. Then when you read the first byte, it returns 65. After padding with 0 for the second byte, you will get A.

Encoding conversion in java

Is there any free java library which I can use to convert string in one encoding to other encoding, something like iconv? I'm using Java version 1.3.
You don't need a library beyond the standard one - just use Charset. (You can just use the String constructors and getBytes methods, but personally I don't like just working with the names of character encodings. Too much room for typos.)
EDIT: As pointed out in comments, you can still use Charset instances but have the ease of use of the String methods: new String(bytes, charset) and String.getBytes(charset).
See "URL Encoding (or: 'What are those "%20" codes in URLs?')".
CharsetDecoder should be what you are looking for, no ?
Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).
See Charset. That doesn't mean UTF16 is the default charset (i.e.: the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and visa versa.
// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
try {
// Convert a string to ISO-LATIN-1 bytes in a ByteBuffer
// The new ByteBuffer is ready to be read.
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));
// Convert ISO-LATIN-1 bytes in a ByteBuffer to a character ByteBuffer and then to a string.
// The new ByteBuffer is ready to be read.
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
} catch (CharacterCodingException e) {
}
I would just like to add that if the String is originally encoded using the wrong encoding it might be impossible to change it to another encoding without errors.
The question does not state that the conversion here is made from wrong encoding to correct encoding but I personally stumbled to this question just because of this situation so just a heads up for others as well.
This answer in other question gives an explanation why the conversion does not always yield correct results
https://stackoverflow.com/a/2623793/4702806
It is a whole lot easier if you think of unicode as a character set (which it actually is - it is very basically the numbered set of all known characters). You can encode it as UTF-8 (1-3 bytes per character depending) or maybe UTF-16 (2 bytes per character or 4 bytes using surrogate pairs).
Back in the mist of time Java used to use UCS-2 to encode the unicode character set. This could only handle 2 bytes per character and is now obsolete. It was a fairly obvious hack to add surrogate pairs and move up to UTF-16.
A lot of people think they should have used UTF-8 in the first place. When Java was originally written unicode had far more than 65535 characters anyway...

Categories