Is UTF to EBCDIC Conversion lossless? - java

We have a process which communicates with an external system via MQ. The external system runs on a mainframe (IBM z/OS), while our process runs on a CentOS Linux platform. So far we have never had any issues.
Recently we started receiving messages from them with non-printable EBCDIC characters embedded in the message. They use these characters as a compressed ID, 8 bytes long. When we receive it, it arrives on our queue encoded in UTF-8 (CCSID 1208).
They need the original 8 bytes back in order to identify our response messages. I'm trying to find a way in Java to convert the ID back from UTF-8 to EBCDIC before sending the response.
I've been playing around with the JTOpen library, using the AS400Text class to do the conversion. Also, the counterparty has sent us a snapshot of the ID in bytes. However, when I compare the bytes after conversion, they are different from the original message.
Has anyone ever encountered this issue? Maybe I'm using the wrong code page?
Thanks for any input you may have.
Bytes from counterparty (positions [5,14]):
00000 F0 40 D9 F0 F3 F0 CB 56--EF 80 04 C9 10 2E C4 D4 |0 R030.....I..DM|
Program output:
UTF String: [R030ôîÕ؜IDMDHP1027W 0510]
EBCDIC String: [R030ôîÃÃÂIDMDHP1027W 0510]
NATIVE CHARSET - HEX: [52303330C3B4C3AEC395C398C29C491006444D44485031303237572030353130]
CP500 CHARSET - HEX: [D9F0F3F066BE66AF663F663F623FC9102EC4D4C4C8D7F1F0F2F7E640F0F5F1F0]
Here is some sample code:
private void readAndPrint(MQMessage mqMessage) throws IOException {
    mqMessage.seek(150);
    byte[] subStringBytes = new byte[32];
    mqMessage.readFully(subStringBytes);

    String msgId = toHexString(mqMessage.messageId).toUpperCase();
    System.out.println("----------------------------------------------------------------");
    System.out.println("MESSAGE_ID: " + msgId);

    String hexString = toHexString(subStringBytes).toUpperCase();
    String subStr = new String(subStringBytes);
    System.out.println("NATIVE CHARSET - HEX: [" + hexString + "] [" + subStr + "]");

    // Transform to EBCDIC
    int codePageNumber = 37;
    String codePage = "CP037";
    AS400Text converter = new AS400Text(subStr.length(), codePageNumber);
    byte[] bytesData = converter.toBytes(subStr);
    String resultedEbcdicText = new String(bytesData, codePage);
    String hexStringEbcdic = toHexString(bytesData).toUpperCase();
    System.out.println("CP500 CHARSET - HEX: [" + hexStringEbcdic + "] [" + resultedEbcdicText + "]");
    System.out.println("----------------------------------------------------------------");
}

If an MQ message has sub-message fields that require different encodings, then that's how you should handle those messages, i.e., as separate message pieces.
But as you describe this, the entire message needs to be received without conversion. The first eight bytes need to be extracted and held separately. The remainder of the message can then have its encoding converted (unless other sub-fields also need to be extracted as binary, unconverted bytes).
For any return message, the opposite steps are needed. The text portion of the message can be converted, and then that sub-string can have the original eight bytes prepended to it. The newly reconstructed message can then be sent back through the queue, again without automatic conversion.
Your partner on the other end is not using the messaging product correctly. (Of course, you probably shouldn't say that out loud.) There should be no part of such a message that cannot automatically survive intact in both directions. Instead of an 8-byte binary field, the ID should be represented as something like a 16-character hex rendering of the 8-byte value, to give one example. In hex, there would be no conversion problem in either direction across the route.
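For illustration, here is a minimal sketch of that split-and-reassemble approach, working on plain byte arrays rather than the MQ API. The assumption that the ID occupies the first eight bytes of the payload, and the IBM037 code page, are examples only, not details taken from the question:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class IdPassthroughSketch {

    // Keep the first 8 bytes untouched; they are binary data, not text.
    static byte[] extractId(byte[] payload) {
        byte[] id = new byte[8];
        System.arraycopy(payload, 0, id, 0, 8);
        return id;
    }

    // Only the remainder of the payload is treated as text (UTF-8 / CCSID 1208 here).
    static String extractText(byte[] payload) {
        return new String(payload, 8, payload.length - 8, StandardCharsets.UTF_8);
    }

    // Rebuild the reply: original ID bytes first, then the reply text in the partner's encoding.
    static byte[] buildReply(byte[] originalId, String replyText) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(originalId);                    // prepend the untouched binary ID
        out.write(replyText.getBytes("IBM037"));  // EBCDIC code page here is an assumption
        return out.toByteArray();
    }
}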

It seems to me that the special 8 bytes are not actually EBCDIC characters but simply 8 bytes of binary data. If that is the case, then I believe, as mentioned in another answer, that you should handle those 8 bytes separately, without letting them be converted to UTF-8 and then back to EBCDIC for further processing.
Depending on the EBCDIC variant you are using, it is quite possible that a byte in EBCDIC does not convert to a meaningful UTF-8 character, and hence you will fail to get the original byte back by converting the UTF-8 character you received to EBCDIC.
A brief search on Google gives several EBCDIC tables (e.g. http://www.simotime.com/asc2ebc1.htm#AscEbcTables). You can see there are lots of values in EBCDIC that have no character assigned. Hence, when they are converted to UTF-8, you cannot assume each of them maps to a distinct Unicode character. Therefore your proposed way of processing is going to be very dangerous and error-prone.
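As a quick, self-contained illustration of why such a round trip is lossy (the byte values are taken from the hex dump in the question; the rest is just a sketch):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTripDemo {
    public static void main(String[] args) {
        // The 8 ID bytes from the counterparty's dump: CB 56 EF 80 04 C9 10 2E
        byte[] original = { (byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80,
                            0x04, (byte) 0xC9, 0x10, 0x2E };

        // Pretend the bytes are UTF-8 text, then encode them back to bytes.
        String asText = new String(original, StandardCharsets.UTF_8);
        byte[] roundTripped = asText.getBytes(StandardCharsets.UTF_8);

        // Invalid UTF-8 sequences were replaced by U+FFFD, so the bytes no longer match.
        System.out.println("lossless = " + Arrays.equals(original, roundTripped));
        System.out.println("original = " + original.length + " bytes, round-tripped = "
                + roundTripped.length + " bytes");
    }
}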

Related

How to determine if a InputStream contains JSON data?

How do I check if the data behind a java.io.InputStream (from File, URL, ..) is of type JSON?
Of course, to be thorough, the best approach would be to load all the data from the stream and try to validate it as JSON (e.g. checking for the closing bracket }). Since the stream source might be very big (a GeoJSON file with a size of 500 MB), this would eventually end in a burning machine.
To avoid this I wrote a small method that only looks at the first character of the InputStream as UTF-8/16/32 and tests whether it is a {, according to RFC 4627 (which is referenced and updated by RFC 7159), to determine its JSON-ness:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
And:
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
The method is:
public static boolean mightBeJSON(InputStream stream) {
    try {
        byte[] bytes = new byte[1];

        // Byte 0: a '{' here covers UTF-8, UTF-16LE and UTF-32LE.
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }

        // Byte 1: a '{' here covers UTF-16BE.
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }

        // Bytes 2 and 3: a '{' at byte 3 covers UTF-32BE.
        stream.read(bytes);
        stream.read(bytes);
        if (bytes[0] == 0x7B) {
            return true;
        }
    } catch (IOException e) {
        // Nothing to do;
    }
    return false;
}
So far my machine is still not burning, BUT:
Is there anything wrong with this approach/implementation?
May there be any problems in some situations?
Anything to improve?
RFC 7159 states:
8. String and Character Issues
8.1 Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The
default encoding is UTF-8, and JSON texts that are encoded in UTF-8
are interoperable in the sense that they will be read successfully by
the maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as UTF-16
and UTF-32).
Implementations MUST NOT add a byte order mark to the beginning of
a JSON text. In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
This doesn't answer your question per se, but I hope it can help with your logic.
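Given that RFC 7159 allows a parser to ignore a byte order mark, one possible refinement of the check, sketched under the assumption that a UTF-8 BOM (EF BB BF) is the only BOM worth tolerating, would be:

import java.io.IOException;
import java.io.InputStream;

public class JsonSniffer {
    // Sketch: read the first four octets, skip a UTF-8 BOM if present,
    // then apply the same '{' positions described by RFC 4627.
    public static boolean mightBeJSON(InputStream stream) throws IOException {
        byte[] head = new byte[4];
        int n = stream.read(head);
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            // A UTF-8 BOM was found, so the first real character follows immediately.
            return n > 3 && head[3] == 0x7B;
        }
        // No BOM: '{' at offset 0 (UTF-8/UTF-16LE/UTF-32LE), 1 (UTF-16BE) or 3 (UTF-32BE).
        return (n > 0 && head[0] == 0x7B)
                || (n > 1 && head[1] == 0x7B)
                || (n > 3 && head[3] == 0x7B);
    }
}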

Convert a java string to an xml that contains valid utf-8 characters

Here is what I was doing -
Take up a document(JSON) from mongodb
Write this key value as an XML
Send this XML to Apache Solr for indexing
Here is how I was doing step #2
Given a key, say "key1", and a value "value1", the step #2 output is
"<"+ key1 + ">" + value1 + "</"+ key1 + ">"
Now when I sent this XML to Solr, I was getting StAX exceptions like -
Invalid UTF-8 start byte 0xb7
Invalid UTF-8 start byte 0xa0
Invalid UTF-8 start byte 0xb0
Invalid UTF-8 start byte 0x96
So here is how I am thinking of fixing it -
key1New = new String(key1.getBytes("UTF-8"), "UTF-8");
value1New = new String(value1.getBytes("UTF-8"), "UTF-8");
Should this work, or should I rather do this -
key1New = new String(key1.getBytes("UTF-8"), "ISO-8859-1");
value1New = new String(value1.getBytes("UTF-8"), "ISO-8859-1");
Java String objects don't have encodings. An encoding, in this context, only makes sense when associated with a byte[]. Try something like this:
byte[] utf8xmlBytes = originalxmlString.getBytes("UTF8");
and send these bytes.
EDIT: Also, consider Jon Skeet's comment. It is usually a good idea to create XML using an API unless you have a very small amount of XML.
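If you go the API route, here is a minimal sketch using the standard StAX writer (the key/value parameters are placeholders, and note that an arbitrary JSON key is not guaranteed to be a legal XML element name):

import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class XmlBuildSketch {
    // Builds <key1>value1</key1> as UTF-8 bytes; the writer escapes the character data.
    static byte[] toXmlBytes(String key, String value) throws XMLStreamException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement(key);   // assumes key is a valid XML element name
        writer.writeCharacters(value);   // special characters are escaped automatically
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
        return out.toByteArray();
    }
}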

Decode of base64 string containing zip file gets 8 character codes wrong in result string

I'm receiving a base64-encoded zip file (in the form of a string) from a SOAP request.
I can decode the string successfully using a stand-alone program, b64dec.exe, but I need to do it in a Java routine. I'm trying to decode it (theZipString) with the Apache commons-codec-1.7.jar routines:
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.StringUtils;
StringUtils.newString(Base64.decodeBase64(theZipString), "ISO-8859-1");
Zip file readers open the resulting file and show the list of content files but the content files have CRC errors.
I compared the result of my Java routine with the result of the b64dec.exe program (using UltraEdit) and found that they are identical except that eight byte values, wherever they appear in the b64dec.exe result, are replaced by 3F ("?") in mine. The values and their ISO-8859-1 character names are A4 ('currency'), A6 ('broken bar'), A8 ('diaeresis'), B4 ('acute accent'), B8 ('cedilla'), BC ('vulgar fraction 1/4'), BD ('vulgar fraction 1/2'), and BE ('vulgar fraction 3/4').
I'm guessing that the StringUtils.newString function is not translating those eight values to the string output. I also tried other character sets (UTF-8 and cp437); their results are similar but worse, with many more 3F ("?") substitutions.
Any suggestions? What character set should I use for the newString function to convert a .zip string? Is the Apache function incapable of this translation? Is there a better way to do this decode?
Thanks!
A zip file is not a string. It's not encoded text. It may contain text files, but that's not the same thing. It's just binary data.
If you treat arbitrary binary data as a string, bad things will happen. Instead, you should use streams or byte arrays. So this is fine:
byte[] zipData = Base64.decodeBase64(theZipString);
... but don't try to convert that to a string. If you write out that byte[] to a file (probably with FileOutputStream or some utility method) it should be fine.
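For example, a minimal sketch that keeps the data binary end to end (the target file name is just a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.codec.binary.Base64;

public class SaveZipSketch {
    static void saveZip(String theZipString) throws IOException {
        // Decode straight to bytes and write them out; no String or charset is involved.
        byte[] zipData = Base64.decodeBase64(theZipString);
        Files.write(Paths.get("attachment.zip"), zipData);
    }
}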

String.getBytes("UTF-32") returns different results on JVM and Dalvik VM

I have a 48 character AES-192 encryption key which I'm using to decrypt an encrypted database.
However, it tells me the key length is invalid, so I logged the results of getBytes().
When I execute:
final String string = "346a23652a46392b4d73257c67317e352e3372482177652c";
final byte[] utf32Bytes = string.getBytes("UTF-32");
System.out.println(utf32Bytes.length);
Using BlueJ on my Mac (Java Virtual Machine), I get 192 as the output.
However, when I use:
Log.d(C.TAG, "Key Length: " + String.valueOf("346a23652a46392b4d73257c67317e352e3372482177652c".getBytes("UTF-32").length));
I get 196 as the output.
Does anybody know why this is happening, and where Dalvik is getting an additional 4 bytes from?
You should specify the endianness on both machines:
final byte[] utf32Bytes = string.getBytes("UTF-32BE");
Note that "UTF-32BE" is a different encoding, not special .getBytes parameter. It has fixed endianess and doesn't need BOM. More info: http://www.unicode.org/faq/utf_bom.html#gen6
Why would you UTF-32 encode a plain hexadecimal number? That's 8x larger than it needs to be. :P
String s = "346a23652a46392b4d73257c67317e352e3372482177652c";
byte[] bytes = new BigInteger(s, 16).toByteArray();
String s2 = new BigInteger(1, bytes).toString(16);
System.out.println("Strings match is "+s.equals(s2)+" length "+bytes.length);
prints
Strings match is true length 24

UTF-8 conversion for text obtained from the internet

ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text
Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application - basically my Java application takes this info from a webpage and gives it to ElasticSearch. ES complains it can't understand £ and it fails. After filtering it through the code below -
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy it to a file in my home directory using bash, it goes in fine. Any pointers will help.
You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it doesn't recognize the illegal 0xA3 sequence and replaces it with the substitution character.
To fix this, you have to construct the string with the encoding the source actually uses, then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
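As a sketch of that idea (this assumes the page really was served as Latin-1; check the HTTP headers or the page's declared charset before hard-coding it):

import java.nio.charset.StandardCharsets;

public class RecodeSketch {
    static byte[] toUtf8(byte[] rawFromWebPage) {
        // Decode with the charset the page was actually delivered in...
        String text = new String(rawFromWebPage, StandardCharsets.ISO_8859_1);
        // ...then encode with the charset ElasticSearch expects.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}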
UTF-8 is easier than one thinks. In a String, everything is Unicode characters.
Bytes/string conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "Cp1252"));
PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));

response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");

String s = "20 \u00A3"; // escaping the pound sign
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252
A String s is a series of characters that are basically independent of any character encoding (OK, not exactly independent, but close enough for our needs here). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR; never rely on the system default encoding, trust me, I have over 10 years of experience dealing with bugs related to wrong default encodings) or the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") on a String, you request that the String be encoded into bytes according to the ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You are creating a string from a byte array as if it had been encoded in UTF-8 (but just above you encoded it in ISO-8859-1; that is your error).
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");
