Java encoding from OS X to Windows

I have made a primitive multi-client chat with a Swing GUI. Everything works fine as long as both people write from the same OS. If one writes from Windows and the other from OS X, the encoding of some special characters goes nuts. (I am from the Czech Republic; we use characters such as š, ě, č, ř, ž...) I have searched for a long time but didn't find anything that would help.
I have input and output defined as:
in = new BufferedReader(new InputStreamReader(soc.getInputStream()));
out = new PrintWriter(new OutputStreamWriter(soc.getOutputStream()));
where soc is the socket used for connecting to the server side.
The sending process is as simple as:
out.println(message);
where message is a String obtained from the JTextArea by calling its .getText() method.
I know why this problem occurs, but I was unable to find any reasonable solution.
Any help will be appreciated.
Thanks

When reading character data from an InputStream or writing it to an OutputStream, it is good practice to always specify the character encoding. Otherwise the platform default encoding is used, which is not the same on all systems (Windows typically defaults to a legacy code page such as windows-1250 for Czech, while OS X defaults to UTF-8):
in = new BufferedReader(new InputStreamReader(soc.getInputStream(), StandardCharsets.UTF_8));
out = new PrintWriter(new OutputStreamWriter(soc.getOutputStream(), StandardCharsets.UTF_8));
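For completeness, here is a minimal self-contained sketch of the client-side setup with the encoding pinned in both directions. The host, port, and the autoflush flag on PrintWriter are assumptions for illustration (without autoflush, println() does not push each message to the socket):
import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ChatClientSetup {
    public static void main(String[] args) throws IOException {
        // Hypothetical host/port; the server must read/write UTF-8 as well.
        try (Socket soc = new Socket("localhost", 1987)) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(soc.getInputStream(), StandardCharsets.UTF_8));
            PrintWriter out = new PrintWriter(
                    new OutputStreamWriter(soc.getOutputStream(), StandardCharsets.UTF_8),
                    true); // autoflush: println() sends each message immediately
            out.println("š, ě, č, ř, ž survive the round trip now");
            System.out.println(in.readLine());
        }
    }
}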

Related

Same unicode character behaves differently in different IDEs

When I read the following unicode string, it reads differently: when I execute the program using NetBeans it works fine, but when I run it using Eclipse or directly from CMD it does not.
After reading, it adds these characters: ƒÂ
The string then becomes Mýxico
The string to be read is Mýxico... I used the CSVReader with an encoding to read it, as follows:
sourceReader = new CSVReader(new FileReader(soureFile));
List<String[]> data = sourceReader.readAll();
Any suggestions?
It sounds like the different editors are using different encodings, for example one is using UTF-8 and one is using something else.
Check that the encoding settings in all of the editors are the same.
We should specify the encoding while reading the file, so the statement above should be changed as follows:
CSVReader targetReader = new CSVReader(new InputStreamReader(
        new FileInputStream(targetFile), "UTF-8"));
List<String[]> data = targetReader.readAll();
targetReader.close();

Get list of processes on Windows in a charset-safe way

This post gives a solution to retrieve the list of running processes under Windows. In essence it does:
String cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
Process p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream());
BufferedReader input = new BufferedReader(isr);
then reads the input.
It looks and works great but I was wondering if there is a possibility that the charset used by tasklist might not be the default charset and that this call could fail?
For example this other question about a different executable shows that it could cause some issues.
If that is the case, is there a way to determine what the appropriate charset would be?
We can break this into two parts:
The Windows part:
From Java you're executing a Windows command, externally to the JVM, in "Windows land". When the Java Runtime class executes a Windows command, it uses the console DLL, so to Windows it appears as if the command is running in a console.
Q: When I run C:\windows\system32\tasklist.exe in a console, what is the character encoding ("code page" in Windows terminology) of the result?
The Windows chcp command with no argument gives the active code page number for the console (e.g. 850 for Multilingual-Latin-1, 1252 for Latin-1). See Windows Microsoft Code Pages, Windows OEM Code Pages, Windows ISO Code Pages.
The default system code page is originally set up according to your system locale (type systeminfo to see this, or look in Control Panel -> Region and Language).
The Windows OS/.NET function GetACP() also gives this info.
The Java part:
Q: How do I decode a Java byte stream from the Windows code page "x" (e.g. 850 or 1252)?
The full mapping between Windows code page numbers and equivalent Java charset names can be derived from Code Page Identifiers (Windows).
However, in practice one of the following prefixes can be added to the code page number to achieve the mapping:
"" (none) for ISO, "IBM" or "x-IBM" for OEM, and "windows-" or "x-windows-" for Microsoft/Windows.
E.g. ISO-8859-1 or IBM850 or windows-1252
Full Solution:
String cmd = System.getenv("windir") + "\\system32\\" + "chcp.com";
Process p = Runtime.getRuntime().exec(cmd);
// Use default charset here - we only want the digits, which are common
// to all of these encodings; skip the text preceding ":"
String windowsCodePage = new Scanner(
        new InputStreamReader(p.getInputStream())).skip(".*:").next();
Charset charset = null;
String[] charsetPrefixes =
        new String[] {"", "windows-", "x-windows-", "IBM", "x-IBM"};
for (String charsetPrefix : charsetPrefixes) {
    try {
        charset = Charset.forName(charsetPrefix + windowsCodePage);
        break;
    } catch (Throwable t) {
        // unknown charset name under this prefix - try the next one
    }
}
// If no match found, use default charset
if (charset == null) charset = Charset.defaultCharset();

cmd = System.getenv("windir") + "\\system32\\" + "tasklist.exe";
p = Runtime.getRuntime().exec(cmd);
InputStreamReader isr = new InputStreamReader(p.getInputStream(), charset);
BufferedReader input = new BufferedReader(isr);

// Debugging output
System.out.println("matched codepage " + windowsCodePage + " to charset name:" +
        charset.name() + " displayName:" + charset.displayName());

String line;
while ((line = input.readLine()) != null) {
    System.out.println(line);
}
Thanks for the Q! - was fun.
Actually, the charset used by tasklist is always different from the system default.
On the other hand, it's quite safe to use the default as long as the output is limited to ASCII. Usually executable modules have only ASCII characters in their names.
So to get the correct Strings, you have to map the (ANSI) Windows code page to the corresponding OEM code page, and pass the latter as the charset to InputStreamReader.
It seems there's no comprehensive mapping between these encodings, but the following mapping can be used:
Map<String, String> ansi2oem = new HashMap<String, String>();
ansi2oem.put("windows-1250", "IBM852");
ansi2oem.put("windows-1251", "IBM866");
ansi2oem.put("windows-1252", "IBM850");
ansi2oem.put("windows-1253", "IBM869");
Charset charset = Charset.defaultCharset();
String streamCharset = ansi2oem.get(charset.name());
if (streamCharset == null) {
    // no known OEM counterpart - fall back to the default charset's name
    streamCharset = charset.name();
}
InputStreamReader isr = new InputStreamReader(p.getInputStream(),
        streamCharset);
This approach worked for me with windows-1251 and IBM866 pair.
To get the current OEM encoding used by Windows, you can use GetOEMCP function. The return value depends on Language for non-Unicode programs setting on Administrative tab in Region and Language control panel. Reboot is required to apply the change.
There are two kinds of encodings on Windows: ANSI and OEM.
The former is used by non-Unicode applications running in GUI mode.
The latter is used by Console applications. Console applications cannot display characters that cannot be represented in the current OEM encoding.
Since tasklist is console mode application, its output is always in the current OEM encoding.
For English systems, the pair is usually Windows-1252 and CP850.
As I am in Russia, my system has the following encodings: Windows-1251 and CP866.
If I capture output of tasklist into a file, the file can't display Cyrillic characters correctly:
I get ЏаЁўҐв instead of Привет (Hi!) when viewed in Notepad.
And µTorrent is displayed as зTorrent.
You cannot change the encoding used by tasklist.
However, it's possible to change the output encoding of cmd: if you pass the /u switch to it, it will output everything in UTF-16 encoding.
cmd /c echo Hi>echo.txt
The size of echo.txt is 4 bytes: two bytes for Hi and two bytes for new line (\r and \n).
cmd /u /c echo Hi>echo.txt
Now the size of echo.txt is 8 bytes: each character is represented with two bytes.
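For illustration, here is a minimal Java sketch that exploits this. Note the hedge: /u only affects the output of cmd's internal commands (such as echo), not of external tools like tasklist, and the assumption here is that the piped output is UTF-16LE without a byte order mark:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CmdUnicodeOutput {
    public static void main(String[] args) throws Exception {
        // /u: cmd writes the output of its internal commands as two-byte characters
        Process p = new ProcessBuilder("cmd", "/u", "/c", "echo", "Привет").start();
        BufferedReader r = new BufferedReader(new InputStreamReader(
                p.getInputStream(), StandardCharsets.UTF_16LE));
        // Prints Привет correctly regardless of the console code page
        System.out.println(r.readLine());
    }
}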
Why not use the Windows API via JNA, instead of spawning processes? Like this:
import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.Tlhelp32;
import com.sun.jna.platform.win32.WinDef;
import com.sun.jna.platform.win32.WinNT;
import com.sun.jna.win32.W32APIOptions;
import com.sun.jna.Native;

public class ListProcesses {
    public static void main(String[] args) {
        Kernel32 kernel32 = (Kernel32) Native.loadLibrary(Kernel32.class, W32APIOptions.UNICODE_OPTIONS);
        Tlhelp32.PROCESSENTRY32.ByReference processEntry = new Tlhelp32.PROCESSENTRY32.ByReference();
        WinNT.HANDLE snapshot = kernel32.CreateToolhelp32Snapshot(Tlhelp32.TH32CS_SNAPPROCESS, new WinDef.DWORD(0));
        try {
            // UNICODE_OPTIONS makes szExeFile UTF-16, so no console code page is involved
            while (kernel32.Process32Next(snapshot, processEntry)) {
                System.out.println(processEntry.th32ProcessID + "\t" + Native.toString(processEntry.szExeFile));
            }
        }
        finally {
            kernel32.CloseHandle(snapshot);
        }
    }
}
I posted a similar answer elsewhere.
There is a much better way to check the running processes, or even to run an OS command from Java: Process and ProcessBuilder.
As for the Charset, you can always query the OS for the supported charsets and obtain an Encoder or Decoder according to your needs.
[Edit]
Let's break it down: there's no way of knowing in which encoding the bytes of a given String are, so your only choice is to get those bytes, shift the byte ordering as necessary (if you're ever in an environment where a process can give you an array of bytes in different orderings, use a ByteBuffer to deal with that), and use the multiple CharsetDecoders available to decode the bytes into reasonable output.
It is overkill and requires you to guess that a given output could be in UTF-8, UTF-16, or any other encoding. But at least you can decode the given output using one of the possible Charsets, and then try to use the processed output for your needs.
Since we're talking about a process run by the same OS the JVM itself is running on, it is quite possible that your output will be in one of the Charsets returned by the availableCharsets() method; a sketch of this trial-decoding approach follows.
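A minimal sketch of that idea, where the candidate list, the sample command, and the strict error handling are all assumptions for illustration: read the raw bytes once, then try candidate charsets with a CharsetDecoder configured to reject malformed input, keeping the first clean decode.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class TrialDecode {
    public static void main(String[] args) throws IOException {
        Process p = new ProcessBuilder("hostname").start(); // hypothetical command
        byte[] raw = readAll(p.getInputStream());
        // Candidate charsets to try, most likely first (an assumption)
        for (String name : new String[] {"UTF-8", "UTF-16LE", "windows-1252"}) {
            try {
                String decoded = Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(raw))
                        .toString();
                System.out.println(name + " decoded cleanly: " + decoded);
                break; // first clean decode wins
            } catch (CharacterCodingException e) {
                // bytes are not valid in this charset - try the next candidate
            }
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) buf.write(chunk, 0, n);
        return buf.toByteArray();
    }
}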

Is BufferedReader ignoring the first line?

I am currently writing a service that should take cleartext commands and then return something according to those commands, also in cleartext.
I have this odd problem with BufferedReader, or it might be telnet that is odd: for some reason the BufferedReader reads the first command, but that command is ignored no matter what I do. I can get around it by sending the first command twice, but that is stretching it a bit, in my opinion.
The code below is in a run() method.
I set out as a PrintWriter and in as a BufferedReader.
The runs variable is true by default.
out = new PrintWriter(handle.getOutputStream(), true);
in = new BufferedReader(new InputStreamReader(handle.getInputStream()));
while (runs) {
    String msg = in.readLine();
    String[] command = msg.split(" ", 3);
    /* do something with the command */
}
So my question is: is BufferedReader ignoring the first line, or is it telnet that is not cooperating with me?
If it is something else, then please enlighten me.
EDIT
I got this debug message:
Debug: ���� ����'������/nick halmark
so I suppose that it is about all the question marks.
I am actually using the latest PuTTY, since I am developing on a Windows box... and as far as I recall, telnet does not exist there by default.
If you are using PuTTY, you need to choose the "Raw" Connection Type.
Microsoft telnet servers like to have some content/protocol negotiation at the beginning, so PuTTY will do this by default as per the RFC 854 spec. That's the garbage that you are reading.
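If switching PuTTY to Raw is not an option, a minimal (assumption-laden) sketch of the alternative is to strip the negotiation on the server side: every negotiation sequence starts with the IAC byte 0xFF, usually followed by a command byte and an option byte, so those triples can be skipped before handing the rest to the reader.
import java.io.IOException;
import java.io.InputStream;

// Sketch: filter out telnet IAC negotiation (RFC 854) from a raw stream.
// Assumes only the common 3-byte IAC <command> <option> sequences appear;
// subnegotiation (IAC SB ... IAC SE) is not handled here.
public class TelnetFilter {
    public static int readFiltered(InputStream in) throws IOException {
        int b;
        while ((b = in.read()) == 0xFF) { // IAC
            in.read(); // command (WILL/WONT/DO/DONT)
            in.read(); // option
        }
        return b; // plain data byte, or -1 at end of stream
    }
}
The service could call readFiltered in place of raw reads before wrapping the stream, or simply require the Raw connection type as suggested above.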

Client-side string encoding java

My team and I have this nasty problem with parsing a string received from our server. The server is pretty simple socket stuff done in Qt; here is the sendData function:
void sendData(QTcpSocket *client, QString response) {
    QString text = response.toUtf8();
    QByteArray block;
    QDataStream out(&block, QIODevice::WriteOnly);
    out << (quint32)0;
    out << text;
    out.device()->seek(0);
    out << (quint32)(block.size() - sizeof(quint32));
    try {
        client->write(block);
    }
    catch(...){...
The client is in Java and is also pretty standard socket stuff. Here is where we are at now, after trying many different ways of decoding the response from the server:
Socket s;
try {
    s = new Socket(URL, 1987);
    PrintWriter output = new PrintWriter(s.getOutputStream(), true);
    InputStreamReader inp = new InputStreamReader(s.getInputStream(), Charset.forName("UTF-8"));
    BufferedReader rd = new BufferedReader(inp);
    String st;
    while ((st = rd.readLine()) != null) {
        System.out.println(st);
    }
}...
If a connection is made with the server, it sends the string "Send Handshake", with the size of the string in bytes sent before it, as seen in the first block of code. This notifies the client that it should send authentication to the server. As of now the string we get from the server looks like this:
������ ��������S��e��n��d�� ��H��a��n��d��s��h��a��k��e
We have used tools such as a string encode/decode tool to try to work out how the string is encoded, but it fails on every configuration.
We are out of ideas as to what encoding this is, if any, or how to fix it.
Any help would be much appreciated.
At a glance, the line where you convert the QString parameter to a UTF-8 QByteArray and then back to a QString seems odd:
QString text = response.toUtf8();
When the QByteArray returned by toUtf8() is assigned to text, I think it is assumed that the QByteArray contains an ASCII (char*) buffer.
I'm pretty sure that QDataStream is intended to be used only within Qt. It provides a platform-independent way of serializing data that is then intended to be deserialized with another QDataStream somewhere else. As you noticed, it includes a lot of extra stuff besides your raw data, and that extra stuff is subject to change with the next Qt version. (This is why the documentation suggests including in your stream the version of QDataStream being used, so it can use the correct deserialization logic.)
In other words, the extra stuff you are seeing is probably metadata included by Qt, and it is not guaranteed to be the same with the next Qt version. From the docs:
QDataStream's binary format has evolved since Qt 1.0, and is likely to
continue evolving to reflect changes done in Qt. When inputting or
outputting complex types, it's very important to make sure that the
same version of the stream (version()) is used for reading and
writing.
If you are going to another language, this isn't practical to use. If it is just text you are passing, use a well-known transport mechanism (JSON, XML, ASCII text, UTF-8, etc.) and bypass the QDataStream altogether.
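That said, if the QDataStream framing must stay on the wire, the Java side can decode it by hand. This is a sketch under two assumptions drawn from Qt's documented default behavior: a serialized QString is a quint32 byte count followed by UTF-16 big-endian data, and the server prepends its own quint32 frame size as in the Qt code above.
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class QStringReader {
    // Reads one [frame size][QString] block produced by the Qt code above.
    public static String readMessage(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int frameSize = in.readInt();               // quint32 written first by the server (unused here)
        long strBytes = in.readInt() & 0xFFFFFFFFL; // QString's own byte count
        if (strBytes == 0xFFFFFFFFL) {
            return null;                            // Qt serializes a null QString this way
        }
        byte[] data = new byte[(int) strBytes];
        in.readFully(data);
        return new String(data, StandardCharsets.UTF_16BE);
    }
}
The interleaved null bytes in the garbled output above ("��S��e��n��d...") are consistent with UTF-16 code units being read as if they were single-byte characters.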

Java App : Unable to read iso-8859-1 encoded file correctly

I have a file which is encoded as ISO-8859-1, and contains characters such as ô.
I am reading this file with Java code, something like:
File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }
    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}
However, the ô character is always garbled, usually printing as a ?.
I have read around the subject (and learnt a little on the way) e.g.
http://www.joelonsoftware.com/articles/Unicode.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
http://www.ingrid.org/java/i18n/utf-16/
but still cannot get this working.
Interestingly, this works on my local PC (XP) but not on my Linux box.
I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:
System.out.println(java.nio.charset.Charset.availableCharsets());
I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.
I recommend that, to check the first possibility, you examine the relevant byte in the file. To check the second, examine the relevant character in the string, printing it out with
System.out.println((int) s.charAt(index));
In both cases the result should be 244 decimal; 0xf4 hex.
See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).
In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.
EDIT: Here's a really easy way to prove whether or not the console will work:
System.out.println("Here's the character: \u00f4");
Parsing the file as fixed-size blocks of bytes is not good: what if some character has a byte representation that straddles two blocks? Use an InputStreamReader with the appropriate character encoding instead:
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myfile.csv"), "ISO-8859-1"));
char[] buffer = new char[4096]; // character (not byte) buffer
while (true) {
    int charCount = br.read(buffer, 0, buffer.length);
    if (charCount == -1) break; // reached end-of-stream
    String s = String.valueOf(buffer, 0, charCount);
    // alternatively, we can append to a StringBuilder
    System.out.println(s);
}
Btw, remember to check that the Unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.
As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).
Consider this code:
public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }

    // write default charset
    System.out.println(Charset.defaultCharset());

    // dump bytes to stdout
    System.out.write(data);

    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}
By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:
UTF-8
?ô
If I switch the terminal's encoding to ISO 8859-1, this is printed:
UTF-8
ôô
In both cases, the same bytes are being emitted by the Java program:
5554 462d 380a f4c3 b40a
The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
If you can, try running your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but the output is garbled by the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks the encoding of your output is and the character encoding of your terminal/console on Linux.
Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.
Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
