Java Charset problem on Linux

Problem: I have a string containing special characters which I convert to bytes and vice versa. The conversion works properly on Windows, but on Linux the special character is not converted properly. The default charset on Linux is UTF-8, as seen with Charset.defaultCharset().displayName().
However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.
How can I make it work with the default UTF-8 charset, without setting the -D option in the Unix environment?
Edit: I use JDK 1.6.0_13.
Edit: code snippet. It works with cs = "ISO-8859-1" or cs = "UTF-8" on Windows, but not on Linux:
String x = "½";
System.out.println(x);
byte[] ba = x.getBytes(Charset.forName(cs));
for (byte b : ba) {
    System.out.println(b);
}
String y = new String(ba, Charset.forName(cs));
System.out.println(y);
~regards
daed

Your characters are probably being corrupted by the compilation process and you're ending up with junk data in your class file.
However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.
The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
In short, don't use -Dfile.encoding=...
String x = "½";
Since U+00BD (½) is represented by different byte values in different encodings:
windows-1252 BD
UTF-8 C2 BD
ISO-8859-1 BD
...you need to tell your compiler what encoding your source file is encoded as:
javac -encoding ISO-8859-1 Foo.java
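As an aside (my addition, not part of the original answer), here is a minimal sketch that prints those byte sequences for yourself; the character is written as a \u escape so the source file stays pure ASCII, and the class name is made up:

import java.nio.charset.Charset;

public class HalfDemo {
    public static void main(String[] args) {
        String half = "\u00BD";   // ½, written as a Unicode escape so the source stays ASCII
        for (String cs : new String[] {"windows-1252", "UTF-8", "ISO-8859-1"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : half.getBytes(Charset.forName(cs))) {
                hex.append(String.format("%02X ", b));   // bytes printed as unsigned hex
            }
            System.out.println(cs + " -> " + hex.toString().trim());
        }
    }
}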
Now we get to this one:
System.out.println(x);
As a PrintStream, this will encode data to the system encoding prior to emitting the byte data. Like this:
System.out.write(x.getBytes(Charset.defaultCharset()));
That may or may not work as you expect on some platforms - the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
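One hedged way around that (my addition, not part of the original answer) is to wrap System.out in a PrintStream with an explicit charset; the terminal still has to be configured for that charset for the glyph to actually show up:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode to UTF-8 explicitly instead of relying on the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u00BD");   // ½
    }
}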

Your problem is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but this is in fact only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames on the local disk file system. And this doesn't fit the problem description you gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/into a pathname/filename? Or were you maybe printing to stdout? Did stdout itself use the proper encoding?
That said, why would you want to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.
(Sorry, this didn't fit in a comment, but I will update this with the answer once you have given more info about the actual functional requirement.)
Update: as per the comments, you basically just need to configure stdout/the console so that it uses the proper encoding to display those characters. On Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used in the Windows cmd console do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack one or the other thing in the registry to add proper fonts. I can't say much about Linux since I don't use it extensively, but it looks like -Dfile.encoding is somehow the way to go there. After all, I think it's better to replace the console with a cross-platform UI tool to display the characters the way you want, for example Swing.
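For completeness, a minimal sketch of that Swing route (my addition; the class name is made up). Swing renders the Unicode string itself, so the console's code page and fonts no longer matter:

import javax.swing.JOptionPane;

public class SwingDemo {
    public static void main(String[] args) {
        // Display the character in a dialog instead of writing it to the console.
        JOptionPane.showMessageDialog(null, "\u00BD");   // ½
    }
}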

You should make the conversion explicitly:
byte[] byteArray = "abcd".getBytes("ISO-8859-1");
String result = new String(byteArray, "ISO-8859-1");
EDIT:
It seems that the problem is the encoding of your Java file. If it works on Windows, try compiling the source files on Linux with javac -encoding ISO-8859-1. This should solve your problem.

Related

Java String some characters not showing [duplicate]

I have a problem with Turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (printed with the NetBeans 6.7 console):
?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?
I don't get the problem.
String test1 = new String(turkish.getBytes());
You're taking the Unicode String including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is you've achieved nothing (except losing any characters that don't fit in the default encoding); whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does because that's still printing a String and not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding is never UTF-8 so the result is the wrong characters.
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
To write Strings to stdout with a different encoding than the default encoding, you'd create an encoder, something like new OutputStreamWriter(System.out, "cp1252"), and send the string content to that.
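A minimal sketch of that approach (my addition, assuming the console really is cp1252; the class name is made up):

import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class Cp1252Console {
    public static void main(String[] args) throws Exception {
        // Encode explicitly to the console's assumed encoding rather than the platform default.
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(System.out, "cp1252"), true);
        out.println("\u00FC\u00E7\u00DC\u00C7");   // üçÜÇ - characters cp1252 does contain
    }
}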
However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.
Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the unicode value of each character.
I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.
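For illustration (my addition, not part of the original answer), here is the Turkish string written entirely with \u escapes, plus a dump of each character's code point as suggested above; the class name is made up:

public class TurkishEscapes {
    public static void main(String[] args) {
        // "ğüşçĞÜŞÇı" as Unicode escapes, so the .java file is plain ASCII
        // and the compiler's -encoding setting cannot mangle it.
        String turkish = "\u011F\u00FC\u015F\u00E7\u011E\u00DC\u015E\u00C7\u0131";
        for (char c : turkish.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);   // print each character's code point
        }
        System.out.println();
    }
}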
Now, what bigger problem are you trying to solve?
You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().
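A small sketch of the explicit-charset alternative (my addition; StandardCharsets is Java 7+, so on the JDK 1.6 mentioned at the top you would use the String-based overloads instead):

import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) {
        String s = "\u011F\u00FC\u015F";   // "ğüş"
        // Always name the charset; the Charset overloads also avoid the
        // checked UnsupportedEncodingException of the String-based ones.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(s));   // true - the round trip is lossless
    }
}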
If you are using the AspectJ compiler, do not forget to set its encoding to UTF-8 too. I struggled for hours to find this.

How to change hyphen sign for question mark in java program?

I'm using these lines of code:
String string = "Some usefull information − don't know what happens with my output";
System.out.println(string);
String str2verify = driver.findElement(By.xpath("//someWellFormXpath")).getText();
Assert.assertEquals(str2verify , "Some usefull information − don't know what happens with my output");
And I'm getting this in my console, so when I want to use the equals function it doesn't work.
Output
Some usefull information ? don't know what happens with my output
expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
java.lang.AssertionError: expected [Some usefull information ? don't know what happens with my outputS] but found [Some usefull information − don't know what happens with my output]
This is the process:
You write some text. In an editor. That is showing strings to you.
You save your file. Files are bytes, not characters, so your editor is applying a charset encoding to do this. Which one? Your editor will know; you didn't mention which one you use, so I can't tell you.
Javac reads your file. Files are bytes, but javac needs characters, so javac is applying a charset encoding to do this. Which one? The platform default, unless you use the -encoding parameter, or the tool you are using to call javac has a way to tell it which -encoding parameter to use.
Javac emits class files. These are byte based so this doesn't require encoding.
Your java JVM runs your class file. As part of running, a string is printed to standard out.
System.out refers to 'standard out'. These things are, on pretty much every OS, a stream of bytes. Meaning, when you send strings there, the JVM first encodes your string using some charset encoding, then it goes to standard out.
Something is connected to the other end of standard out and sees these bytes; it converts the bytes back to a string, also using some encoding.
The characters are sent to the font rendering engine of your OS. Even if the character 'survived' all those conversions back and forth, it is possible your font doesn't have a glyph for it. The character in your string is a real minus sign (U+2212), not the plain ASCII hyphen-minus, so the font also needs a glyph for that specific code point.
Count 'em up - that's something like 6 conversions. They all need to be using the same charset encoding. So, check that your editor and javac agree on what charset encoding your source file is in. Then check that the console that is showing the string is in agreement with standard out (which should be the platform default, whatever that might be). Then check whether the font you use has a glyph for that character.
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
Then write to ps, not System.out - that's how you can explicitly force some charset to be used when writing to output.
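Putting that together, a minimal sketch (my addition; the class name is made up, and the console reading these bytes must itself be set to UTF-8 or you will still see '?'):

import java.io.PrintStream;

public class MinusSignOut {
    public static void main(String[] args) throws Exception {
        // Explicit UTF-8 on the Java side; U+2212 is the minus sign from the question.
        PrintStream ps = new PrintStream(System.out, true, "UTF-8");
        ps.println("minus sign: \u2212");
    }
}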
It turns out that this character (the minus sign, U+2212) doesn't have a representation in the cp1252 charset, so in the end I had to convert all the files in my project to UTF-8 to be able to save this character.
This encoding issue was a pain in the brain.
Thanks for all the suggestions, friends.

ISO-8859-1 character encoding not working in Linux

I have tried the code below on Windows and was able to decode the message. But when I tried the same code on Linux, it does not work.
String message ="ööööö";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
I have verified that the default character set on the Linux platform is UTF-8 (Charset.defaultCharset().name()).
Kindly suggest how to do the same encoding on the Linux platform.
The explanation for this is, almost always, that somewhere bytes are turned into characters or characters into bytes where the encoding is not clearly specified, thus defaulting to the 'platform default' and thus causing different results depending on which platform you run it on.
Except that every place where you turn bytes into chars or chars into bytes in your snippet of code explicitly specifies the encoding.
Or does it?
String message ="ööööö";
Ah, no, you forgot one place: javac itself.
You compile this code. That is where raw bytes (because the compiler is looking at ManmohansSourceFile.java, which is a file, which isn't characters but a bunch of bytes) are converted into characters (because the Java compiler works on characters), and this is done using some encoding. If you don't use the -encoding switch when running javac (or Maven or Gradle runs javac and passes an encoding, which one depends on your pom/Gradle file), then the file is read in using the system encoding, and thus whether the string actually contains those bytes - who knows.
This is most likely the source of your problem.
The fix? Pick one:
Don't put non-ascii in your source files. Note that you can write the unicode symbol "Latin Capital Letter A with Tilde" as \u00C3 in your source file instead of as Ã. Then use \u00B6 for ¶.
String message ="\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
> ööööö
Ensure you specify the right -encoding switch when compiling. So, if the text editor you use to type String message = "ööööö"; is configured for UTF-8, then run javac -encoding UTF-8 ManmohansSourceFile.java.
First of all, I'm not sure exactly what you are expecting...your use of the term "encode" is a bit confusing, but from your comments, it appears that with the input "ööööö", you expect the output "ööööö".
On both Linux and OS X with Java 1.8, I do get that result.
I do not have a Windows machine to try this on.
As @Pshemo indicated, it is possible that your input, since it's hardcoded in the source code as a string, is being represented as UTF-8, not as ISO-8859-1. Actually, this is what I expected, and I was surprised that the code worked as you expected.
Try creating the input bytes explicitly, encoding to ISO-8859-1 (e.g. with getBytes("ISO-8859-1")), rather than relying on how the literal was compiled.
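One way to take the compiler out of the equation entirely (my addition, a sketch using the Java 7+ StandardCharsets class) is to build the mis-decoded input explicitly and then apply the repair from the question:

import java.nio.charset.StandardCharsets;

public class RepairDemo {
    public static void main(String[] args) {
        // The real text, written with \u escapes so the source encoding is irrelevant.
        String original = "\u00F6\u00F6\u00F6\u00F6\u00F6";                    // "ööööö"
        // Simulate the damage: encode as UTF-8, then wrongly decode as ISO-8859-1.
        String mangled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);              // "Ã¶Ã¶Ã¶Ã¶Ã¶"
        // The repair from the question now works deterministically.
        String repaired = new String(mangled.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original));                         // true
    }
}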

Java Unicode characters after \u00FF

I cannot print the Unicode values after \u00FF. Instead I'm getting the '?' character after executing this in Eclipse. Is that expected behaviour?
System.out.println("\u01ff");
By default, Eclipse uses the platform default encoding (which is cp1252 on Windows) to encode characters when saving text-based files and when writing to the standard output stream (as used by System.out). You need to explicitly set it to UTF-8 in order to achieve world domination.
Note that this way you also don't need to use those \uXXXX Unicode escapes anymore to represent those characters in your text-based source files.
Those question marks are caused by the charset used by the output stream not supporting the character supplied to it.
See also:
Unicode - How to get the characters right?
The problem is the encoding used with System.out; by default, it's your platform's native encoding (usually not UTF-8). You can explicitly change the encoding by replacing the stream:
try {
    PrintStream ps = new PrintStream(System.out, true, "UTF-8");
    System.setOut(ps);
} catch (UnsupportedEncodingException error) {
    System.err.println(error);
}
After this (barring font or encoding issues with the underlying environment), all Unicode characters should print correctly on System.out.
EDIT Based on the back-and-forth between me and BalusC on his answer, this isn't enough (or even necessary) to get things working inside Eclipse. There you have two problems to solve: using the correct encoding when writing to System.out and then using the correct encoding when presenting the console output in the Eclipse console view. BalusC's answer addresses both of those issues. Unfortunately, it won't address running outside Eclipse. For that, you need to either use my approach above or set the default run-time encoding. This can be done by using the flag -Dfile.encoding=UTF-8 on the command line or setting the environment variable JAVA_TOOL_OPTIONS to include -Dfile.encoding=UTF-8. If you need to run your code successfully outside Eclipse, that's probably the best approach.
Java can handle those characters just fine, but the output stream will have a specific encoding.
Unless that encoding is something like UTF-8 or UTF-16, it won't be able to encode every Unicode character, and when it encounters a character it can't represent, that character will be replaced with a question mark.
The JVM uses the default file encoding for System.out as well.
On Linux for example, if your $LANG variable is set to de_DE.UTF-8, the value for file.encoding will be derived accordingly, and set to utf-8.
If the JVM cannot derive the setting you want, you can change the file encoding by setting a system property:
java -Dfile.encoding=utf-8 ...
on the command line.
You can do this as well in Eclipse using a Run configuration (context menu - Run as - Run configurations ... - Arguments - VM arguments)
So this works both for the command line and Eclipse, and there is no need to define the encoding explicitly in the source.
If the value is set to
java -Dfile.encoding=iso-8859-1
for example, only a subset of the Unicode characters can be represented, because that character set only supports a limited number of characters. The other ones will turn up as ?.
There are two other things to bear in mind:
The device which receives the stream (a terminal, shell, etc.) must decode it correctly. In GNOME's terminal, for example, you can set the character encoding in a menu.
The font being used by that terminal must contain the graphical representation of the character.
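To see which values the JVM actually picked up, a small diagnostic sketch (my addition; the class name is made up):

import java.nio.charset.Charset;

public class EncodingReport {
    public static void main(String[] args) {
        // What the JVM derived from the environment (e.g. $LANG on Linux) or from -Dfile.encoding.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset() = " + Charset.defaultCharset().name());
    }
}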

String.getBytes("ISO-8859-1") gives me 16-bit characters on OS X

Using Java 6 to get 8-bit characters from a String:
System.out.println(Arrays.toString("öä".getBytes("ISO-8859-1")));
gives me, on Linux: [-10, -28]
but on OS X I get: [63, 63, 63, -89]
I seem to get the same result when using the fancy new NIO CharsetEncoder class. What am I doing wrong? Or is it Apple's fault? :)
I managed to reproduce this problem by saving the source file as UTF-8, then telling the compiler it was really MacRoman:
javac -encoding MacRoman Test.java
I would have thought javac would default to UTF-8 on OSX, but maybe not. Or maybe you're using an IDE and it's defaulting to MacRoman. Whatever the case, you have to make it use UTF-8 instead.
What is the encoding of the source file? 63 is the code for ? which means "character can't be converted to the specified encoding".
So my guess is that you copied the source file to the Mac and that the source file uses an encoding which the Mac java compiler doesn't expect. IIRC, OS X will expect the file to be UTF-8.
Your source file is producing "öä" with combining characters (a base letter followed by a combining diacritic) rather than the single precomposed code points.
Look at this:
System.out.println(Arrays.toString("\u00F6\u00E4".getBytes("ISO-8859-1")));
This will print [-10, -28] as you expect (I don't like to print it this way, but I know that's not the point of your question), because here the Unicode code points are specified, carved in stone, and your text editor is not allowed to "play smart" by combining 'o' and 'a' with diacritic signs.
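If your editor really did save decomposed characters, java.text.Normalizer (Java 6+) can recompose them before encoding. This is an alternative I'm adding, not something the original answer proposed:

import java.text.Normalizer;
import java.util.Arrays;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        // Decomposed form: 'o' and 'a' each followed by U+0308 COMBINING DIAERESIS,
        // which is what a Mac editor may silently produce.
        String decomposed = "o\u0308a\u0308";
        // Recompose to the single code points U+00F6 and U+00E4 before encoding.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(Arrays.toString(composed.getBytes("ISO-8859-1")));   // [-10, -28]
    }
}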
Typically, when you encounter such problems, you probably want to use two OS X Un*x commands to figure out what's going on under the hood: file and hexdump are very convenient in such cases.
You want to run them on your source file and you may want to run them on your class file.
Maybe the character set for the source is not set (and thus different according to system locale)?
Can you run the same compiled class on both systems (not re-compile)?
Bear in mind that there's more than one way to represent characters. Mac OS X uses unicode by default, so your string literal may actually not be represented by two bytes. You need to make sure that you load the string from the appropriate incoming character set; for example, by specifying in the source a \u escape character.
