Our requirement is to send EBCDIC text to a mainframe. We have some Chinese characters, so our data is in UTF-8 format.
So, is there a way to convert the UTF-8 characters to EBCDIC?
Thanks,
Raj Mohan
Assuming your target system is an IBM mainframe or midrange, it has full support for all of the EBCDIC encodings built into its JVM as encodings named CPxxxx, corresponding to the IBM CCSIDs (CP stands for code page). You will need to do the translations on the host side, since the client side will not have the necessary encoding support.
Since Unicode supports every known character (far more than any single- or double-byte EBCDIC code page), you will likely be targeting multiple EBCDIC encodings, so you will likely have to configure those encodings in some way. Try to keep your client Unicode-only (UTF-8, UTF-16, etc.), with the translations being done as data arrives at and/or leaves the host system.
Other than needing to do the translations host-side, the mechanics are the same as any Java translation, e.g. new String(bytes, encoding) and String.getBytes(encoding), and the various NIO and reader/writer classes. There's really no magic: it's no different from translating between, say, ISO 8859-x and Unicode, or any other SBCS (or limited DBCS).
For example:
byte[] ebcdta="Hello World".getBytes("CP037"); // get bytes for EBCDIC codepage 37
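If you want the full round trip (UTF-8 bytes in, EBCDIC bytes out), here is a minimal sketch using the standard charset API. It assumes your JVM ships the IBM037/Cp037 charset (most full JDK installs do); looking the charset up via Charset.forName also avoids the checked UnsupportedEncodingException that String.getBytes(String) would force you to handle.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

Charset cp037 = Charset.forName("Cp037");            // EBCDIC code page 37 (US/Canada)

byte[] utf8Input = "Hello World".getBytes(StandardCharsets.UTF_8); // bytes as they might arrive from the client
String text = new String(utf8Input, StandardCharsets.UTF_8);       // decode UTF-8 -> Unicode String
byte[] ebcdicOutput = text.getBytes(cp037);                         // encode Unicode -> EBCDIC

System.out.printf("first byte: 0x%02X%n", ebcdicOutput[0]);         // 'H' is 0xC8 in CP037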
You can find more information on IBM's documentation website.
EBCDIC has many 8-bit code pages, and many of them are supported by the JVM. Have a look at Charset.availableCharsets().keySet(); the EBCDIC pages are named IBM... (there are aliases such as cp500 for IBM500, as you can see from Charset.forName("IBM500").aliases()).
There are two problems:
if you have characters spread across different EBCDIC code pages, a single code page will not help
I am not sure whether these charsets are available in every VM outside Windows
For the first, have a look at this approach. For the second, try it on the desired target runtime ;-) (the snippet below lists what your JVM actually provides).
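A small sketch for that check: it filters the installed charsets by canonical names starting with "IBM" or "x-IBM", which catches the IBM code pages in general (many, but not all, of them are EBCDIC), and prints their aliases.

import java.nio.charset.Charset;

// Print every installed charset whose canonical name starts with "IBM" or "x-IBM".
// Note: this is a heuristic; it also lists some non-EBCDIC IBM code pages.
Charset.availableCharsets().forEach((name, cs) -> {
    if (name.startsWith("IBM") || name.startsWith("x-IBM")) {
        System.out.println(name + " aliases=" + cs.aliases());
    }
});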
You can always make use of the IBM Toolbox for Java (JTOpen), specifically the com.ibm.as400.access.AS400Text class in the jt400.jar.
It goes as follows:
import com.ibm.as400.access.AS400Text;    // from jt400.jar

int codePageNumber = 420;                 // CCSID of the target EBCDIC code page
String codePage = "CP420";                // matching Java charset name (an alias of IBM420)
String sourceUtfText = "أحمد يوسف صالح";

// AS400Text converts between Unicode Strings and fixed-length EBCDIC byte fields.
AS400Text converter = new AS400Text(sourceUtfText.length(), codePageNumber);
byte[] bytesData = converter.toBytes(sourceUtfText);   // the EBCDIC bytes to send to the host

// Decode the EBCDIC bytes back into a Java (Unicode) String as a sanity check.
// Note: new String(byte[], String) throws the checked UnsupportedEncodingException.
String resultedEbcdicText = new String(bytesData, codePage);
I used code page 420 and its corresponding Java charset name CP420; this code page is used for Arabic text, so you should pick the code page suitable for Chinese text.
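For Chinese you can also do the conversion with the plain JDK charsets instead of AS400Text. The sketch below assumes CCSID 935 (mixed-byte Simplified Chinese EBCDIC, charset name x-IBM935, alias Cp935) is an acceptable target; confirm the real CCSID (935, 1388, ...) with your mainframe team, and note that these extended charsets are not guaranteed on every JVM, hence the isSupported check.

import java.nio.charset.Charset;
import java.util.Arrays;

String name = "x-IBM935";                              // availability varies by JVM vendor
if (!Charset.isSupported(name)) {
    System.out.println(name + " is not installed in this JVM");
} else {
    Charset ebcdicChinese = Charset.forName(name);

    String text = "你好";                               // Unicode text in memory
    byte[] ebcdic = text.getBytes(ebcdicChinese);       // encode to mixed-byte EBCDIC (with shift-out/shift-in bytes)
    String back = new String(ebcdic, ebcdicChinese);    // decode back as a sanity check

    System.out.println(Arrays.toString(ebcdic));
    System.out.println(back.equals(text));              // true only if every character was mappable
}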
For the midrange AS/400 (IBM i these days), the best bet is to use the IBM Toolbox for Java (jt400.jar), which does all these things transparently (perhaps with a slight hint).
Please note that inside Java a character is a 16-bit value, not UTF-8 (that is an encoding, not a character type).
Related
I understand that Java uses UTF-16 for the char data type. But there is a new change in Java 18: the default charset for the standard Java APIs is now UTF-8.
Does this update have any impact on the char data type's encoding format? I also understand that UTF-8 is a variable-width encoding that can take up to 4 bytes per character. What is the size of the char data type after Java 18? And does it still adhere to UTF-16, or has it moved to UTF-8?
No, the change in default character encoding does not affect the internals of char/Character, nor does it affect the internals of String.
I suggest you read the official documentation, JEP 400: UTF-8 by Default. The motivation and the details are explained thoroughly.
The change made in Java 18 affects mainly input/output. So this includes the older APIs and classes for reading and writing files. Some of the newer APIs and classes were already defaulting to UTF-8. JEP 400 seeks to make this default consistent throughout the bundled libraries.
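For example, the readers and writers that silently relied on the default charset are the ones affected; passing the charset explicitly gives identical behaviour before and after JDK 18. A small sketch ("notes.txt" is just a hypothetical file name):

import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// What the running JVM falls back to when no charset is given:
// platform-dependent before Java 18, UTF-8 from Java 18 onward (JEP 400).
System.out.println(Charset.defaultCharset());

try (FileReader implicit = new FileReader("notes.txt")) {   // uses the default charset
    // the behaviour of this reader changed with JDK 18
} catch (IOException e) {
    e.printStackTrace();
}

try (FileReader explicit = new FileReader("notes.txt", StandardCharsets.UTF_8)) { // Java 11+
    // identical behaviour on every JDK release
} catch (IOException e) {
    e.printStackTrace();
}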
One particular issue called out in the JEP regards Java source code files that were saved with a non-UTF-8 encoding and compiled with an earlier JDK. Recompiling on JDK 18 or later may cause problems.
By the way, let me remind you that char/Character has been legacy since Java 5, essentially broken since Java 2. As a 16-bit type, char is physically incapable of representing most characters.
To work with individual characters, use code point integer numbers. Look for codePoint methods added to classes including String, StringBuilder, Character, etc.
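For instance, a code-point based loop handles characters outside the Basic Multilingual Plane correctly, where a char-based loop would see surrogate halves:

String s = "G\uD834\uDD1E";   // "G" followed by U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)

System.out.println(s.length());                       // 3 - counts 16-bit char units
System.out.println(s.codePointCount(0, s.length()));  // 2 - counts actual characters

// Iterate over code points instead of chars.
s.codePoints().forEach(cp ->
        System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));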
I have been working with strings in various programming languages for a long time, and I have never come across a situation where I needed any encoding other than UTF-8.
The question might feel opinion-based, but I don't understand why other encodings should be available.
Wouldn't it just make everyone's life (especially programmers') easier to have one single standard?
I take Java as an example:
A set of the currently available encodings for Java can be found here:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
UTF-8: Advantages and disadvantages
The typical argument is:
Asian languages have many more characters and would require oversized encoding for their languages.
However, in my opinion the pros outweigh the cons (see the size comparison sketch after this list):
UTF-8, in general, is much more powerful due to its compatibility with ASCII
The fact that it is Unicode
The alternatives UTF-16 and UTF-32 are not ASCII-compatible, and UTF-16 is not fixed-length either
Other encodings that are not Unicode are extremely complex
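To make the size argument concrete, here is a small comparison sketch; it assumes the GB18030 charset is installed, which it is on a full JDK.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String english = "Hello World";
String chinese = "你好，世界";

// Compare how many bytes the same text needs under different encodings.
for (Charset cs : new Charset[] {StandardCharsets.UTF_8, StandardCharsets.UTF_16, Charset.forName("GB18030")}) {
    System.out.printf("%-8s english=%d bytes, chinese=%d bytes%n",
            cs.name(), english.getBytes(cs).length, chinese.getBytes(cs).length);
}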
I would take a gander over here: Why don't people use other encodings.
Strings in Java are internally represented as UTF-16. When you build a String you don't have to say which encoding to use for the internal representation (but you do have to pass the encoding if you are building a String from an array of bytes).
The link you provided shows the encodings available for read and write operations; if you want to correctly read a text file encoded in ISO-8859-1 on a platform where the default encoding is UTF-8, you must specify the correct encoding, and your language (Java in this case) must be able to convert automatically from one encoded form to another.
Java manages a lot of encodings and the conversion between them, but internally it represents Strings as UTF-16; you don't have to worry about that. You only need to specify the encoding when transforming a String to a sequence of bytes, or vice versa.
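For instance, reading a Latin-1 file on a machine whose default is UTF-8 just means passing the charset explicitly ("latin1.txt" is a hypothetical file name):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (BufferedReader reader = Files.newBufferedReader(Paths.get("latin1.txt"), StandardCharsets.ISO_8859_1)) {
    reader.lines().forEach(System.out::println);   // decoded as ISO-8859-1, whatever the platform default is
} catch (IOException e) {
    e.printStackTrace();
}

// The only places you name an encoding: String <-> byte[] conversions.
byte[] latin1Bytes = "café".getBytes(StandardCharsets.ISO_8859_1);  // 'é' = one byte (0xE9)
String decoded = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
byte[] utf8Bytes = decoded.getBytes(StandardCharsets.UTF_8);         // 'é' = two bytes (0xC3 0xA9)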
The application I am developing will be used by folks in Western & Eastern Europe as well in the US. I am encoding my input and decoding my output with UTF-8 character set.
My confusion is because when I use the method String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.
Does this mean that if, in the US, my Java application creates an output text file using Cp1252 as the charset encoding and UTF-8 as the charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?
They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.
Actually, by Unicode terminology they're probably most accurately character encoding schemes:
A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
Where a character encoding form is:
Mapping from a character set definition to the actual code units used to represent the data.
Yes, the fact that Unicode defines seven character encoding schemes but only three character encoding forms makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
I think those two things are not directly related.
The Eclipse setting decides how your Eclipse editor will save the text files (typically source code) you create or edit. You can use other editors, and therefore the file may be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.
The Java String(byte[] bytes, String charsetName) constructor is part of your own application logic: it decides how you want to interpret the data you read, either from a file or from the network. A different charsetName (essentially a different character encoding scheme) may give a different interpretation of the same byte array.
A "charset" does implies the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the days, everybody were inventing their own character sets and encoding schemes, and the two were almost 1-to-1 mapping, therefore one name can be used to refer to both character set and encoding scheme.
My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.
So I ask:
Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like to have some way of making sure that this garbage-producing character gets caught somehow before making it into the file that I am writing. How do I safeguard against this?
Thanks in advance.
Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are a sequence of UTF-16 encoded Unicode code points. By default, char values are considered to be big-endian (as any Java primitive type is). You should, however, not use char values directly to write strings to files or memory. You should make use of the character encoding/decoding facilities in the Java API (see below).
UTF-16 is not a major superset of encodings. Actually, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode does define almost any character that you possibly want to use in modern communication.
If you read a file from disk and assume UTF-16, you will quickly run into trouble. Most text files use ASCII or an extension of ASCII that uses all 8 bits of a byte. Examples of these extensions are UTF-8 (which can be used to read any ASCII text) or ISO 8859-1 (Latin-1). Then there are a lot of encodings, e.g. used by Windows, that are extensions of those extensions. UTF-16 is not compatible with ASCII, so it should not be used as the default for most applications.
So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII. You choose UTF-8 (which is a pretty safe bet). There is no way of knowing whether the program that later opens the file will correctly guess that it should use UTF-8. It may try to use Latin-1 or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately there are no smart tricks to make sure this never happens.
Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
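A minimal sketch of the strict approach: make the decoder report bad input instead of silently replacing it, and ask the encoder up front whether the target charset can represent your text at all.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

byte[] input = {(byte) 0xC3, (byte) 0x28};   // an invalid UTF-8 sequence

// Strict decoder: fail loudly instead of silently substituting U+FFFD.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    CharBuffer text = decoder.decode(ByteBuffer.wrap(input));
    System.out.println(text);
} catch (CharacterCodingException e) {
    System.out.println("bad input detected: " + e);
}

// Before writing, check whether the target charset can represent the text.
CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
System.out.println(latin1.canEncode("price"));   // true
System.out.println(latin1.canEncode("価格"));     // false - Japanese is not in Latin-1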
Whenever a conversion between bytes and characters takes place, Java allows you to specify the character encoding to be used. If it is not specified, a machine-dependent default encoding is used. In some encodings, the bit pattern representing a certain character has no similarity to the bit pattern used for the same character in the UTF-16 encoding.
To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.
It depends on the used encoding which characters are representable.
I have a question about Charset.forName(String charsetName). Is there a list of charsetNames I can refer to? For example, for UTF-8, we use "utf8" for the charsetName. What about WINDOWS-1252, GB18030, etc.?
Charset     Description
US-ASCII    Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1  ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8       Eight-bit UCS Transformation Format
UTF-16BE    Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE    Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16      Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
Reference: http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
The set of charsets available in Java is platform dependent; only six are guaranteed, exposed as constants in the StandardCharsets class.
To view all registered charset names, look at the IANA registry. Check the Preferred MIME Name and Aliases columns.
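To check a specific name programmatically, Charset.forName accepts both canonical names and aliases, case-insensitively. A quick sketch; note that windows-1252 and GB18030 live in the extended charsets (the jdk.charsets module) on some platforms, so a stripped-down runtime might not have them:

import java.nio.charset.Charset;

for (String name : new String[] {"UTF-8", "windows-1252", "GB18030"}) {
    Charset cs = Charset.forName(name);                // throws UnsupportedCharsetException if unknown
    System.out.println(cs.name() + " aliases=" + cs.aliases());
}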
To list all character sets installed in your JVM, you might use the following code snippet (Java SE 8 or higher):
SortedMap<String, Charset> map = Charset.availableCharsets();
map.keySet().stream().forEach(System.out::println);
On my system, this lists around 170 character sets.
The Java Charset library is required to accept just a few basic encodings: ASCII, Latin-1 (ISO-8859-1), and a handful of UTF variants that you can see listed in this answer. That's a pretty useless list for any practical purposes, unless your scope is limited to Latin-1. In reality, Java classes can handle a large number of encodings, which you can read about in the Supported Encodings page. Quoting from it:
The java.io.InputStreamReader, java.io.OutputStreamWriter, java.lang.String classes, and classes in the java.nio.charset package can convert between Unicode and a number of other character encodings. The supported encodings vary between different implementations of Java SE 8. The class description for java.nio.charset.Charset lists the encodings that any implementation of Java SE 8 is required to support.
JDK 8 for all platforms (Solaris, Linux, and Microsoft Windows) and JRE 8 for Solaris and Linux support all encodings shown on this page. JRE 8 for Microsoft Windows may be installed as a complete international version or as a European languages version. [...]
The rest of the page consists of an extensive table of encoding names and synonyms, which is what the OP was after all those years ago...