How does inputstream per byte reading work? - java

I cannot understand how the System.in.read() method works.
Here is the code:
public static void main(String[] args) throws IOException {
    while (true) {
        Integer x = System.in.read();
        System.out.println(Integer.toString(x, 2));
    }
}
I know that the System.in.read() method reads from the input stream one byte at a time.
So when I enter 'A' (U+0041, stored as one byte), the program output is:
1000001 (U+0041)
1010 (NL) - it works as expected.
But when I enter 'Я' (U+042F, stored as two bytes), the output is:
11010000 (byte1)
10101111 (byte2)
1010 (byte3 - NL)
The real binary code for the letter 'Я' (U+042F) is 10000101111.
Why isn't 11010000 10101111 (byte1 + byte2) the binary code for the letter 'Я' (U+042F)?

This will depend on the external process that is sending data to System.in. It could be a command shell, an IDE, or another process.
In the typical case of a command shell, the shell will have a character encoding configured. (chcp on Windows, locale charmap on Linux.)
The character encoding determines how a graphical character or glyph is coded as a number. For example, a Windows machine might use a "code page" of "Windows-1251" and encode "Я" as one byte (0xDF). Or, it could use UTF-8 and encode "Я" as two bytes (0xD0 0xAF), or UTF-16 and use two different bytes (0x04 0x2F).
Your results show that the process sending data to your Java program is using UTF-8 as an encoding. In UTF-8, a two-byte sequence has the bit layout 110xxxxx 10xxxxxx; stripping those prefixes from 11010000 10101111 leaves 10000 101111, i.e. 10000101111, which is exactly U+042F. So the two bytes do encode 'Я', just not as the raw code-point bits concatenated.
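For completeness, here is a minimal sketch (my own illustration, not from the original answer) that wraps System.in in an InputStreamReader with an explicit UTF-8 charset, so the two bytes 0xD0 0xAF come back as the single code point U+042F:

import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadChars {
    public static void main(String[] args) throws IOException {
        // assumes the console really does send UTF-8, as the output above indicates
        InputStreamReader reader = new InputStreamReader(System.in, StandardCharsets.UTF_8);
        int ch;
        while ((ch = reader.read()) != -1) {
            // for 'Я' this prints 10000101111 (U+042F) instead of two separate bytes
            System.out.println(Integer.toString(ch, 2));
        }
    }
}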

Related

Appending extended ascii in strings

I'm trying to use extended ASCII character 179 (looks like a pipe).
Here is how I use it.
String cmd = "";
char pipe = (char) 179;
// cmd = "02|CO|0|101|03|0F"
cmd = "02" + pipe + "CO" + pipe + "0" + pipe + "101" + pipe + "03" + pipe + "0F";
System.out.println("cmd " + cmd);
Output
cmd 02³CO³0³101³03³0F
But the output looks like this. I have read that extended ASCII characters are not always displayed correctly.
Is my code correct and the character simply not displayed properly, or is my code wrong?
I'm not concerned about showing this string to the user; I need to send it to a server.
EDIT
The vendor's API document states that we need to use ASCII 179 (looks like a pipe). The server-side code expects 179 (part of extended ASCII) as the pipe/vertical line, so I cannot use 124 (the ordinary pipe).
EDIT 2
Here is the table for extended ASCII.
On the other hand, this table shows that ASCII 179 is "³". Why are there different interpretations of the same value, and which one should I consider?
EDIT 3
My default charset value is (is this related to my problem?)
System.out.println("Default Charset=" + Charset.defaultCharset());
Default Charset=windows-1252
Thanks!
I have referred to
How to convert a char to a String?
How to print the extended ASCII code in java from integer value
Thanks
Use the below code.
String cmd = "";
char pipe = '\u2502';
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
System.out.println("int value: " + (int)pipe);
Output:
cmd 02│CO│0│101│03│0F
int value: 9474
I am using IntelliJ. This is the output I am getting.
Your code is correct; concatenating String values and char values does what one expects. It's the value of 179 that is wrong. You can google "unicode 179", and you'll find "Unicode Character 'SUPERSCRIPT THREE' (U+00B3)", as one might expect. And, you could simply say "char pipe = '|';" instead of using an integer. Or even better: String pipe = "|"; which also allows you the flexibility to use more than one character :)
In response to the new edits...
May I suggest that you fix this rather low-level problem not at the Java String level, but instead replace the byte that encodes this character just before sending the bytes to the server?
E.g. something like this (untested)
byte[] bytes = cmd.getBytes(); // all ASCII, so this should be safe
for (int i = 0; i < bytes.length; i++) {
    if (bytes[i] == '|') {
        bytes[i] = (byte) 179;
    }
}
// send command bytes to server
// don't forget endline bytes/chars or whatever the protocol might require. good luck :)
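Alternatively, if the vendor's "179" really refers to the OEM/DOS code pages (Cp437/Cp850), where 0xB3 is the box-drawing vertical bar, you could encode the whole command with that code page instead of patching bytes by hand. A minimal sketch, untested and assuming the server expects those single-byte values:

import java.nio.charset.Charset;

public class PipeBytes {
    public static void main(String[] args) {
        char pipe = '\u2502'; // box-drawing light vertical
        String cmd = "02" + pipe + "CO" + pipe + "0" + pipe + "101" + pipe + "03" + pipe + "0F";
        byte[] bytes = cmd.getBytes(Charset.forName("Cp437")); // '\u2502' -> 0xB3 (179) in Cp437
        System.out.printf("byte for the separator: %02X%n", bytes[2]); // prints B3
        // send 'bytes' to the server
    }
}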

Japanese Character Encoding in Base64

I have been asked to fix a bug in our email processing software.
When a message whose subject is encoded in RFC 2047 like this:
=?ISO-2022-JP?B?GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC?=
is received, it is incorrectly decoded: one of the Japanese characters is not rendered properly. It is rendered like this: 配信テスト?日本語 when it should be 配信テスト㈱日本語
(I do not understand Japanese.) Clearly one of the characters, the one which looks like it's in brackets, has not been rendered.
The decoding is carried out by javax.mail.internet.MimeUtility.decodeText()
If I try it with an on-line decoder (the only one I've found is here) it seems to work OK, so I was suspecting a bug in MimeUtility.
So I tried some experiments, in the form of this little program:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

import org.apache.commons.codec.binary.Base64;

public class Encoding {

    private static final Charset CHARSET = Charset.forName("ISO-2022-JP");

    public static void main(String[] args) throws UnsupportedEncodingException {
        String control = "繋がって";
        String subject = "配信テスト㈱日本語";
        String controlBase64 = japaneseToBase64(control);
        System.out.println(controlBase64);
        System.out.println(base64ToJapanese(controlBase64));
        String subjectBase64 = japaneseToBase64(subject);
        System.out.println(subjectBase64);
        System.out.println(base64ToJapanese(subjectBase64));
    }

    private static String japaneseToBase64(String in) {
        return Base64.encodeBase64String(in.getBytes(CHARSET));
    }

    private static String base64ToJapanese(String in) {
        return new String(Base64.decodeBase64(in), CHARSET);
    }
}
(The Base64 and Hex classes are in org.apache.commons.codec)
When I run it, here's the output:
GyRCN1IkLCRDJEYbKEI=
繋がって
GyRCR1s/LiVGJTklSCEpRnxLXDhsGyhC
配信テスト?日本語
The first, shorter Japanese string is a control, and this returns the same as the input, having been converted into Base64 and back again, using Charset ISO-2022-JP. All OK there.
The second Japanese string is the one with the dodgy character. As you see, it returns with a ? instead of the character. The Base64 encoding output is also different from the original subject encoding.
Sorry if this is long, I wanted to be thorough. What's going on, and how can I decode this character correctly?
The bug is not in your software, but the subject string itself is incorrectly encoded. Other software may be able to decode the text by making further assumptions about the content, just as it is often assumed that characters in the range 0x80-0x9f are Cp1252-encoded, although ISO-8859-1 or ISO-8859-15 is specified.
ISO-2022-JP is a multi-charset encoding, using escape sequences to switch between the actually used character set. Your encoded string starts with ESC $ B, indicating that the character set JIS X 0208-1983 is used. The offending character is encoded as 0x2d6a. That code point is not defined in the referred character set, but later added to JIS X 0213:2000, a newer version of the JIS X character set specifications.
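If you have to decode such subjects anyway, the JDK ships a Windows variant of the charset, "x-windows-iso2022jp", whose tables are based on MS932 and therefore include the NEC row-13 characters such as ㈱. An untested sketch, assuming that extended charset is present in your JRE, which decodes the Base64 payload of the original header by hand:

import java.nio.charset.Charset;

import org.apache.commons.codec.binary.Base64;

public class DecodeSubject {
    public static void main(String[] args) {
        // Base64 payload taken from the =?ISO-2022-JP?B?...?= header above
        String b64 = "GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC";
        byte[] raw = Base64.decodeBase64(b64);
        // "x-windows-iso2022jp" is the MS932-based variant (assumption: available in your JDK)
        String decoded = new String(raw, Charset.forName("x-windows-iso2022jp"));
        System.out.println(decoded); // expected: 配信テスト㈱日本語
    }
}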
Try using "MS932" or "Shift-JIS" in your encoding. Means
private static final Charset CHARSET = Charset.forName("MS932");
There are different scripts in Japanese, such as kanji and katakana. Some encodings, like Cp132, will not support some Japanese characters. The problem you face is because of the encoding "ISO-2022-JP" you have used in your code.
ISO-2022-JP uses pairs of bytes, called ku and ten, that index into a 94×94 table of characters. The pair that fails has ku 12 and ten 73, which is not listed in table of valid characters I have (based on JIS X 0208). All of ku=12 seems to be unused.
Wikipedia doesn't list any updates to JIS X 0208, either. Perhaps the sender is using some sort of vendor-defined extension?
ISO-2022-JP is a variable-width encoding, but Java doesn't seem to support the section of the character set that this character lies in (possibly because the escape sequences present in ISO-2022-JP-3 and ISO-2022-JP-2004 are missing from ISO-2022-JP-2, and those newer variants aren't supported). UTF-8, UTF-16 and UTF-32, however, do support all of the characters.
UTF-32:
AAB+SwAAMEwAADBjAAAwZg==
繋がって
AACRTQAAT+EAADDGAAAwuQAAMMgAADIxAABl5QAAZywAAIqe
配信テスト㈱日本語
As an extra tidbit, regardless of whether UTF-32 was used, when the strings were printed as-is they retained their natural encoding and appeared normally.

Is UTF to EBCDIC Conversion lossless?

We have a process which communicates with an external system via MQ. The external system runs on a mainframe machine (IBM z/OS), while we run our process on a CentOS Linux platform. So far we have never had any issues.
Recently we started receiving messages from them with non-printable EBCDIC characters embedded in the message. They use the characters as a compressed ID, 8 bytes long. When we receive it, it arrives on our queue encoded in UTF-8 (CCSID 1208).
They need the original 8 bytes back in order to identify our response messages. I'm trying to find a solution in Java to convert the ID back from UTF-8 to EBCDIC before sending the response.
I've been playing around with the JTOpen library, using the AS400Text class to do the conversion. Also, the counterparty has sent us a snapshot of the ID in bytes. However, when I compare the bytes after conversion, they are different from the original message.
Has anyone ever encountered this issue? Maybe I'm using the wrong code page?
Thanks for any input you may have.
Bytes from counterparty(Positions [5,14]):
00000 F0 40 D9 F0 F3 F0 CB 56--EF 80 04 C9 10 2E C4 D4 |0 R030.....I..DM|
Program output:
UTF String: [R030ôîÕ؜IDMDHP1027W 0510]
EBCDIC String: [R030ôîÃÃÂIDMDHP1027W 0510]
NATIVE CHARSET - HEX: [52303330C3B4C3AEC395C398C29C491006444D44485031303237572030353130]
CP500 CHARSET - HEX: [D9F0F3F066BE66AF663F663F623FC9102EC4D4C4C8D7F1F0F2F7E640F0F5F1F0]
Here is some sample code:
private void readAndPrint(MQMessage mqMessage) throws IOException {
    mqMessage.seek(150);
    byte[] subStringBytes = new byte[32];
    mqMessage.readFully(subStringBytes);

    String msgId = toHexString(mqMessage.messageId).toUpperCase();
    System.out.println("----------------------------------------------------------------");
    System.out.println("MESSAGE_ID: " + msgId);

    String hexString = toHexString(subStringBytes).toUpperCase();
    String subStr = new String(subStringBytes);
    System.out.println("NATIVE CHARSET - HEX: [" + hexString + "] [" + subStr + "]");

    // Transform to EBCDIC
    int codePageNumber = 37;
    String codePage = "CP037";
    AS400Text converter = new AS400Text(subStr.length(), codePageNumber);
    byte[] bytesData = converter.toBytes(subStr);
    String resultedEbcdicText = new String(bytesData, codePage);
    String hexStringEbcdic = toHexString(bytesData).toUpperCase();
    System.out.println("CP500 CHARSET - HEX: [" + hexStringEbcdic + "] [" + resultedEbcdicText + "]");
    System.out.println("----------------------------------------------------------------");
}
If a MQ message has varying sub-message fields that require different encodings, then that's how you should handle those messages, i.e., as separate message pieces.
But as you describe this, the entire message needs to be received without conversion. The first eight bytes need to be extracted and held separately. The remainder of the message can then have its encoding converted (unless other sub-fields also need to be extracted as binary, unconverted bytes).
For any return message, the opposite conversion must be done. The text portion of the message can be converted, and then that sub-string can have the original eight bytes prepended to it. The newly reconstructed message then can be sent back through the queue, again without automatic conversion.
Your partner on the other end is not using the messaging product correctly. (Of course, you probably shouldn't say that out loud.) There should be no part of such a message that cannot automatically survive intact across both directions. Instead of an 8-byte binary field, the ID could, for example, be represented as a 16-character hex rendering of the 8-byte value. In hex, there'd be no conversion problem either way across the route.
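To make the hex idea concrete, here is a minimal sketch (my own illustration; the eight bytes are simply the ones that sit between "R030" and "DM" in your dump) of encoding the binary ID as 16 ASCII characters and restoring it on the other side:

public class IdAsHex {
    public static void main(String[] args) {
        // purely illustrative ID bytes
        byte[] id = { (byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80, 0x04, (byte) 0xC9, 0x10, 0x2E };

        // encode: 8 binary bytes -> 16 ASCII hex chars, which survive any codepage conversion
        StringBuilder hex = new StringBuilder();
        for (byte b : id) {
            hex.append(String.format("%02X", b));
        }
        System.out.println(hex); // CB56EF8004C9102E

        // decode on the receiving side: 16 hex chars -> the original 8 bytes
        byte[] back = new byte[hex.length() / 2];
        for (int i = 0; i < back.length; i++) {
            back[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        System.out.println(java.util.Arrays.equals(id, back)); // true
    }
}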
It seems to me that the special 8 bytes are not actually EBCDIC characters but simply 8 bytes of binary data. If that is the case then I believe, as mentioned in another answer, that you should handle those 8 bytes separately, without letting them be converted to UTF-8 and then back to EBCDIC for further processing.
Depending on the EBCDIC variant you are using, it is quite possible that a byte in EBCDIC does not convert to a meaningful UTF-8 character, and hence you will fail to get the original byte back by converting the UTF-8 character to EBCDIC.
A brief search on Google gives me several EBCDIC tables (e.g. http://www.simotime.com/asc2ebc1.htm#AscEbcTables). You can see there are lots of values in EBCDIC that do not have a character assigned. Hence, when they are converted to UTF-8, you cannot assume each of them will map to a distinct Unicode character. Therefore the proposed way of processing is going to be very dangerous and error-prone.
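A quick way to see the loss for yourself is this small sketch (my own illustration), which round-trips every possible byte value through a UTF-8 String and counts how many survive:

import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    public static void main(String[] args) {
        int survived = 0;
        for (int b = 0; b < 256; b++) {
            byte[] original = { (byte) b };
            // what happens when the binary ID is treated as UTF-8 text (CCSID 1208)
            String asText = new String(original, StandardCharsets.UTF_8);
            byte[] back = asText.getBytes(StandardCharsets.UTF_8);
            if (back.length == 1 && back[0] == original[0]) {
                survived++;
            }
        }
        // only the 128 ASCII values survive; bytes >= 0x80 become U+FFFD and are lost
        System.out.println(survived + " of 256 byte values survive the round trip");
    }
}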

Java App : Unable to read iso-8859-1 encoded file correctly

I have a file which is encoded as ISO-8859-1, and contains characters such as ô.
I am reading this file with java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }
    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}
However, the ô character is always garbled, usually printing as a ?.
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still can not get this working
Interestingly this works on my local pc (xp) but not on my linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
I recommend that to check for the first, you examine the relevant byte in the file. To check for the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good --- what if some character has a byte representation that straddles across two blocks? Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));
char[] buffer = new char[4096]; // character (not byte) buffer
while (true) {
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }

    // write default charset
    System.out.println(Charset.defaultCharset());

    // dump bytes to stdout
    System.out.write(data);

    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôô
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
If you can, try running your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks is the encoding of your output and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
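One way to take the terminal's guesswork partly out of the picture is to print through a PrintStream with an explicit encoding, so the program's output encoding no longer depends on the platform default. A minimal sketch (my own addition, not from the answers above):

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitStdout {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // pick the encoding your terminal actually uses, e.g. "UTF-8" or "ISO-8859-1"
        PrintStream out = new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8");
        out.println("Here's the character: \u00f4");
    }
}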

Why is ¿ displayed different in Windows vs Linux even when using UTF-8?

Why is the following displayed different in Linux vs Windows?
System.out.println(new String("¿".getBytes("UTF-8"), "UTF-8"));
in Windows:
¿
in Linux:
Â¿
System.out.println() outputs the text in the system default encoding, but the console interprets that output according to its own encoding (or "codepage") setting. On your Windows machine the two encodings seem to match, but on the Linux box the output is apparently in UTF-8 while the console is decoding it as a single-byte encoding like ISO-8859-1. Or maybe, as Jon suggested, the source file is being saved as UTF-8 and javac is reading it as something else, a problem that can be avoided by using Unicode escapes.
When you need to output anything other than ASCII text, your best bet is to write it to a file using an appropriate encoding, then read the file with a text editor--consoles are too limited and too system-dependent. By the way, this bit of code:
new String("¿".getBytes("UTF-8"), "UTF-8")
...has no effect on the output. All that does is encode the contents of the string to a byte array and decode it again, reproducing the original string--an expensive no-op. If you want to output text in a particular encoding, you need to use an OutputStreamWriter, like so:
FileOutputStream fos = new FileOutputStream("out.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
osw.write("¿");
osw.close();
Not sure where the problem is exactly, but it's worth noting that Â¿ (0xC2, 0xBF) is what you get when you take 0xBF, the Unicode code point for ¿, encode it with UTF-8, and then display the resulting two bytes in a single-byte encoding.
So it looks like in the Linux case the output is not being displayed as UTF-8, but as a single-byte string.
Check what encoding your linux terminal has.
For gnome-terminal in ubuntu - go to the "Terminal" menu and select "Set Character Encoding".
For putty, Configuration -> Window -> Translation -> UTF-8 (and if that doesn't work, see this post).
Run this code to help determine if it is a compiler or console issue:
public static void main(String[] args) throws Exception {
    String s = "¿";
    printHex(Charset.defaultCharset(), s);

    Charset utf8 = Charset.forName("UTF-8");
    printHex(utf8, s);
}

public static void printHex(Charset encoding, String s)
        throws UnsupportedEncodingException {
    System.out.print(encoding + "\t" + s + "\t");

    byte[] barr = s.getBytes(encoding);
    for (int i = 0; i < barr.length; i++) {
        int n = barr[i] & 0xFF;
        String hex = Integer.toHexString(n);
        if (hex.length() == 1) {
            System.out.print('0');
        }
        System.out.print(hex);
    }

    System.out.println();
}
If the encoded bytes for UTF-8 are different on each platform (it should be c2bf), it is a compiler issue.
If it is a compiler issue, replace "¿" with "\u00bf".
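For instance, a tiny variation on the test above (reusing the same printHex method) that keeps the source file pure ASCII:

String s = "\u00bf"; // identical to "¿", but independent of the source-file encoding
printHex(Charset.defaultCharset(), s);
printHex(Charset.forName("UTF-8"), s);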
It's hard to know exactly which bytes your source code contains, or the string which getBytes() is being called on, due to your editor and compiler encodings.
Can you produce a short but complete program containing only ASCII (and the relevant \uxxxx escaping in the string) which still shows the problem?
I suspect the problem may well be with the console output on either Windows or Linux, but it would be good to get a reproducible program first.
