Java Unicode Characters after u 00ff - java

I cannot print the Unicode values after 00ff; instead I'm getting a '?' character when executing this in Eclipse. Is that expected behaviour?
System.out.println("\u01ff");

By default, Eclipse uses the platform default encoding (which is cp1252 on Windows) to decode characters when saving text-based files and when writing to the standard output stream (as used by System.out). You need to explicitly set it to UTF-8 in order to achieve world domination.
Note that this way you also don't need those \uXXXX Unicode escapes anymore to represent those characters in your text-based source files.
Those question marks are caused because the charset as used by the output stream does not support the character as specified in the input stream.
See also:
Unicode - How to get the characters right?

The problem is the encoding used with System.out; by default, it's your platform's native encoding (usually not UTF-8). You can explicitly change the encoding by replacing the stream:
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

try {
    PrintStream ps = new PrintStream(System.out, true, "UTF-8");
    System.setOut(ps);
} catch (UnsupportedEncodingException error) {
    System.err.println(error);
}
After this (barring font or encoding issues with the underlying environment), all Unicode characters should print correctly on System.out.
EDIT Based on the back-and-forth between me and BalusC on his answer, this isn't enough (or even necessary) to get things working inside Eclipse. There you have two problems to solve: using the correct encoding when writing to System.out, and using the correct encoding when presenting the console output in the Eclipse console view. BalusC's answer addresses both of those issues. Unfortunately, it won't help when running outside Eclipse. For that, you need to either use my approach above or set the default run-time encoding, which can be done by passing the flag -Dfile.encoding=UTF-8 on the command line or by setting the environment variable JAVA_TOOL_OPTIONS to include -Dfile.encoding=UTF-8. If your code needs to run correctly outside Eclipse, that's probably the best approach.

Java can handle those characters just fine, but the output stream will have a specific encoding.
Unless that encoding is something like UTF-8 or UTF-16, it won't be able to encode every character in Unicode. And when it encounters a character it can't represent, that character is replaced with a question mark.
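This replacement behaviour is easy to observe directly with getBytes; a small sketch (the class name is invented for illustration):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // U+01FF is not representable in ISO-8859-1, so the encoder
        // substitutes its replacement byte, '?' (0x3F).
        byte[] latin1 = "\u01ff".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " " + latin1[0]);  // 1 63

        // UTF-8 can represent every Unicode character; U+01FF takes two bytes.
        byte[] utf8 = "\u01ff".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);  // 2
    }
}
```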

The JVM uses the default file encoding for System.out as well.
On Linux, for example, if your $LANG variable is set to de_DE.UTF-8, the value of file.encoding will be derived accordingly and set to UTF-8.
If the JVM cannot derive the setting you want, you can change the file encoding by setting a system property:
java -Dfile.encoding=utf-8 ...
on the command line.
You can do this as well in Eclipse using a Run configuration (context menu - Run as - Run configurations ... - Arguments - VM arguments)
So this works both for the command line and Eclipse, and there is no need to define the encoding explicitly in the source.
If the value is set to
java -Dfile.encoding=iso-8859-1
for example, only a subset of the Unicode characters can be represented, because that character set only supports a limited number of characters. The other ones will turn up as ?.
There are two other things to bear in mind:
The device which receives the stream (a terminal, shell etc.) must decode it correctly. As for GNOME's terminal for example, you can set the character encoding in a menu.
The font being used by that terminal etc. must contain the graphical representation for this character


Java String some characters not showing [duplicate]

I have a problem with turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (Printed with the Netbeans 6.7 console):
?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?
I don't get the problem.
String test1 = new String(turkish.getBytes());
You're taking the Unicode String including the Turkish characters and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is that you've achieved nothing, except losing any characters that don't fit in the default encoding. Whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does, because that still prints a String, not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding is never UTF-8 so the result is the wrong characters.
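The mismatch can be reproduced deterministically by pinning both charsets instead of relying on the platform default; a small sketch (the class name is invented):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String turkish = "\u011f";  // ğ, LATIN SMALL LETTER G WITH BREVE

        // Correct round trip: the same charset on both sides.
        String ok = new String(turkish.getBytes(StandardCharsets.UTF_8),
                               StandardCharsets.UTF_8);
        System.out.println(ok.equals(turkish));  // true

        // What test2 does on a non-UTF-8 platform: encode as UTF-8
        // (two bytes, C4 9F), then decode those bytes with a different
        // charset, yielding one mojibake char per byte.
        String bad = new String(turkish.getBytes(StandardCharsets.UTF_8),
                                StandardCharsets.ISO_8859_1);
        System.out.println(bad.length());  // 2
    }
}
```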
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
To write Strings to stdout with a different encoding than the default, you'd create an encoder, something like new OutputStreamWriter(System.out, "cp1252"), and send the string content to that.
However, in this case it looks like the console is using Windows code page 1252, Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.
Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the unicode value of each character.
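Dumping the Unicode value of each character takes only a few lines; a sketch, with the class and helper names invented:

```java
public class CodePointDump {
    // Prints each char of a string as a U+XXXX escape, so the actual
    // contents are visible regardless of console encoding or fonts.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u011fA"));  // U+011F U+0041
    }
}
```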
I would also advise either restricting your source code to ASCII (using \uXXXX escapes for non-ASCII characters) or explicitly specifying the character encoding when you compile.
Now, what bigger problem are you trying to solve?
You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().
If you are using the AspectJ compiler, do not forget to set its encoding to UTF-8 too. It took me hours to find this.

ISO-8859-1 character encoding not working in Linux

I have tried the code below on Windows and was able to decode the message. But when I tried the same code on Linux, it did not work.
String message ="ööööö";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
I have verified that the default character set on the Linux platform is UTF-8 (Charset.defaultCharset().name()).
Kindly suggest how I can do the same encoding on the Linux platform.
The explanation for this is, almost always, that somewhere bytes are turned into characters, or characters into bytes, where the encoding is not clearly specified, thus defaulting to the platform default, thus causing different results depending on which platform you run on.
Except, every place where you turn bytes into chars or chars into bytes in your snippet of code explicitly specifies the encoding.
Or does it?
String message ="ööööö";
Ah, no, you forgot one place: javac itself.
You compile this code. That is where raw bytes (the compiler is looking at ManmohansSourceFile.java, which is a file, i.e. a bunch of bytes, not characters) are converted into characters (because the Java compiler works on characters), and this is done using some encoding. If you don't use the -encoding switch when running javac (or Maven or Gradle runs javac for you and passes an encoding, which one depends on your pom/Gradle file), the source is read using the system encoding, and whether the string actually contains those characters, who knows.
This is most likely the source of your problem.
The fix? Pick one:
Don't put non-ASCII in your source files. Note that you can write the Unicode symbol "Latin Capital Letter A with Tilde" as \u00C3 in your source file instead of as Ã, and \u00B6 for ¶.
String message ="\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6\u00C3\u00B6";
String encodedMsg = new String(message.getBytes("ISO-8859-1"), "UTF-8");
System.out.println(encodedMsg);
> ööööö
Ensure you specify the right -encoding switch when compiling. So, if the text editor you use to type String message = "¶"; is configured for UTF-8, run javac -encoding UTF-8 manMohansFile.java.
First of all, I'm not sure exactly what you are expecting. Your use of the term "encode" is a bit confusing, but from your comments it appears that with the input "ööööö", you expect the output "ööööö".
On both Linux and OS X with Java 1.8, I do get that result.
I do not have a Windows machine to try this on.
As @Pshemo indicated, it is possible that your input, since it's hardcoded in the source code as a string, is being represented as UTF-8, not as ISO-8859-1. Actually, this is what I expected, and I was surprised that the code worked as you expected.
Try creating the input by encoding to ISO-8859-1 explicitly, e.g. via getBytes("ISO-8859-1").

Store Arabic in String and insert it into database using Java

I am trying to pass an Arabic String into a function that stores it into a database, but the String's characters get converted into '?'.
For example:
String str = new String();
str = "عشب";
System.out.print(str);
the output will be :
"???"
and it is stored like this in the database.
If I insert it into the database directly, it works well.
Make sure your character encoding is utf-8.
The snippet you showed works perfectly as expected.
For example if you are encoding your source files using windows-1252 it won't work.
The problem is that System.out is a PrintStream, which converts the Arabic string into bytes using the default encoding; presumably that encoding cannot handle the Arabic characters. Try
System.out.write(str.getBytes("UTF-8"));
System.out.println();
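To see why a single-byte default encoding must mangle this string, it helps to inspect its UTF-8 form; each Arabic letter occupies two bytes. A sketch (the class name is invented, and the string is written with \u escapes to stay independent of the source file's encoding):

```java
import java.nio.charset.StandardCharsets;

public class ArabicBytesDemo {
    public static void main(String[] args) {
        // The escapes spell عشب: U+0639, U+0634, U+0628.
        String str = "\u0639\u0634\u0628";
        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
        // 3 characters but 6 bytes: each code point in the Arabic block
        // encodes as a two-byte UTF-8 sequence.
        System.out.println(str.length() + " chars, " + utf8.length + " bytes");
    }
}
```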
Many modern operating systems use UTF-8 as default encoding which will support non-latin characters correctly. Windows is not one of those, with ANSI being the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stackoverflow question (see Setting the default Java character encoding?), you'll need to changed the default as follows, for the Java process:
java -Dfile.encoding=UTF-8
Additionally, since you are running in IDE you may need to tell it to display the output in the indicated charset or risk corruption, though that is IDE specific and the exact instructions will depend on your IDE.
One other thing, is if you are reading or writing text files then you should always specify the expected character encoding, otherwise you will risk falling back to the platform default.
You need to set the character set to UTF-8 for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
If you want to do so at the IDE level, you can go to:
Window > Preferences > General > Content Types, and set UTF-8 as the default encoding for all content types.

Why is my Java Charset.defaultCharset() GBK and not Unicode?

Config: Windows 8 English operating system; JDK1.7; Eclipse.
I installed a piece of software written by a Chinese developer, and the GUI shows Chinese characters. But the software displayed ugly square boxes instead. I searched the internet and found a fix: in the Windows 8 control panel, set "language for non-Unicode Programs" to "Chinese".
But a problem arises when writing code in Eclipse. We know Java itself uses two-byte Unicode to store char and String. But when I execute the following code:
import java.nio.charset.Charset;

public class CharSetTest {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset());
        String s = "哈哈";
        byte[] b3 = s.getBytes("UTF-8");
        System.out.println(b3.length);
        System.out.format("%X %X %X\n", b3[0], b3[1], b3[2]);
        System.out.println(new String(b3));
        byte[] b4 = s.getBytes();
        System.out.format("%X %X\n", b4[0], b4[1]);
    }
}
The output is weird:
GBK //default charset is GBK, not Unicode or UTF-8
3 //this is obvious since a Chinese character is encoded into 3 bytes
E5 93 88 //this is corresponding UTF-8 code number
鍝? //something wrong here
B9 FE //I think s.getBytes() should use JAVA's default encode "Unicode", but NOT is this case
Several questions:
1. What is Java's default charset? Is it Unicode? How does the Java default charset interact with programmers? For example, if Java uses Unicode, then a string "abc" cannot be encoded into other charsets, since charsets for Russian, French etc. are different from Unicode; they are totally different encoding methods.
2. What does Charset.defaultCharset() return? Does it return my Windows 8's default charset?
3. How does Charset.defaultCharset() return GBK? I didn't set anything related to the default charset in Windows 8 except "language for non-Unicode Programs" in the control panel.
4. If I declare a String in Java like this: String str = "abc";, I don't know the process of charset/encoding. I first need to type the Java statement on the keyboard. How does the keyboard translate my key presses into Java's Unicode charset? The String str is stored in my .java source code file. What charset is used to store Java source code?
EDIT:
Why do we say "Java uses Unicode to represent char and String"? In my Java program, when should I care about the Unicode side of things?
Usually I only need to care about encoding/decoding with UTF-8, ISO-8859-1, GBK etc., but I never care about the Unicode representation of char and String. So how and when should I use Unicode?
Check the doc: "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system." So no, the default character set is not necessarily Unicode.
In OpenJDK it is determined from the file.encoding property. See also Setting the default Java character encoding?.
The default file.encoding value is fetched (on Windows) using* the GetUserDefaultLCID() function, which corresponds to the setting in the "Regional and language options". That's why Charset.defaultCharset() returns GBK: you set the locale to Chinese.
Although the default character set is OS-dependent, strings in Java are always represented internally as UTF-16 (the compiled class file stores string constants in a modified UTF-8 form in the constant pool).
The encoding of a *.java source code is whatever you specify to the Java compiler, or the OS's default one if not provided. See Java compiler platform file encoding problem.
*: See http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/windows/native/java/lang/java_props_md.c, line 577.
the default character set is the character set that Java will use to convert bytes to chars or Strings (and vice versa) if you don't specify anything else (for example if you create a InputStreamReader and don't pass an explicit charset).
Charset.defaultCharset() returns ... the default char set. What exactly that is is implementation dependent, but usually is just what the OS would use in the same case.
That setting is exactly what your Java installation is using: "Chinese" means that some encoding that handles chinese characters has to be provided and GBK matches that just fine.
The encoding of Java source files can be specified when you compile (using the -encoding parameter). If you don't specify it explicitly, Java will use the platform default encoding (see #1).
What is JAVA default charset?
It's picked up from the default set in your OS. This could be, for example, Windows-1252.
Is it Unicode?
This is not a charset. A charset defines how to encode characters as bytes.
How JAVA default charset interact with programmers?
It's the default used when you don't specify a charset.
For example, if Java uses Unicode, then a string "abc" cannot be encoded into other charsets, since charsets for Russian, French etc. are different from Unicode; they are totally different encoding methods.
Internally Java uses UTF-16, but you don't need to know that. This causes no issues for most languages, except that characters outside the Basic Multilingual Plane (including some Chinese characters) take two char units each, so you have to work with code points.
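The difference between char units and code points shows up for characters outside the Basic Multilingual Plane; a small sketch (the example character is chosen arbitrarily):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+20BB7 is a CJK ideograph outside the BMP; in UTF-16 it is
        // stored as a surrogate pair, so it occupies two char units.
        String s = new StringBuilder().appendCodePoint(0x20BB7).toString();
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1
    }
}
```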
What does Charset.defaultCharset() return?
It does what it appears to do. You can confirm this by reading the javadoc for this method.
Does it return my WIN8's default charset?
Because that is what it is supposed to do. You only have a problem if your OS's character set cannot be mapped into Java or is not correctly mapped into Java. If it is the same, everything is fine.
How does Charset.defaultCharset() return GBK? I didn't set anything related to the default charset in Windows 8 except "language for non-Unicode Programs" in the control panel.
It is this way because Java thinks that is what you set for Windows. To correct it, you must select the correct character set in Windows.
If I declare a String in java like: String str = "abc";, I don't know the process of charset/encoding.
For the purposes of this question, there isn't any encoding involved. There are only characters; they don't need to be encoded into anything, because they already are characters.
How the keyboard translates my key button into Java Unicode charset?
The keyboard doesn't. It only knows which keys you pressed. The OS turns these keys into characters.
The String str is stored in my .java source code file. What is the charset to store java source code?
That is determined by the editor which does the storing. Most likely it will be the OS default again, or if you change it you might make it UTF-8.
I am not sure if this could help. To change the encoding in Eclipse:
--- Project Explorer
--- Right click on the Java file
--- Run As
--- Run Configurations
--- Common (tab)
--- Encoding (on Linux it is set to UTF-8 by default)

Java Charset problem on linux

Problem: I have a string containing special characters which I convert to bytes and vice versa. The conversion works properly on Windows, but on Linux the special character is not converted properly. The default charset on Linux is UTF-8, as seen with Charset.defaultCharset().getDisplayName().
However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.
How can I make it work using the default UTF-8 charset, without setting the -D option in the Unix environment?
Edit: I use jdk1.6.13.
Edit: code snippet.
It works with cs = "ISO-8859-1"; or cs = "UTF-8"; on Windows, but not on Linux:
String x = "½";
System.out.println(x);
byte[] ba = x.getBytes(Charset.forName(cs));
for (byte b : ba) {
    System.out.println(b);
}
String y = new String(ba, Charset.forName(cs));
System.out.println(y);
~regards
daed
Your characters are probably being corrupted by the compilation process and you're ending up with junk data in your class file.
if i run on linux with option -Dfile.encoding=ISO-8859-1 it works properly..
The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
In short, don't use -Dfile.encoding=...
String x = "½";
Since U+00bd (½) will be represented by different values in different encodings:
windows-1252 BD
UTF-8 C2 BD
ISO-8859-1 BD
...you need to tell your compiler what encoding your source file is encoded as:
javac -encoding ISO-8859-1 Foo.java
Now we get to this one:
System.out.println(x);
As a PrintStream, this will encode data to the system encoding prior to emitting the byte data. Like this:
System.out.write(x.getBytes(Charset.defaultCharset()));
That may or may not work as you expect on some platforms - the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
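The per-charset byte values from the table above can be verified directly (a sketch; the class name is invented, and the bytes are masked with 0xFF so they print as unsigned hex):

```java
import java.nio.charset.StandardCharsets;

public class HalfSignDemo {
    public static void main(String[] args) {
        String x = "\u00bd";  // ½, written as an escape to avoid any
                              // dependence on the source file's encoding

        byte[] latin1 = x.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = x.getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 uses the single byte BD; UTF-8 uses two bytes, C2 BD.
        System.out.printf("%X%n", latin1[0] & 0xFF);                   // BD
        System.out.printf("%X %X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);  // C2 BD
    }
}
```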
Your problem is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but this is in fact only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames on the local disk file system. And that doesn't fit the problem description you literally gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/into a pathname/filename or so? Or were you maybe printing to stdout? Did the stdout itself use the proper encoding?
That said, why would you like to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.
(Sorry, this didn't fit in a comment, but I will update this with the answer once you have given more info about the actual functional requirement.)
Update: as per the comments, you basically just need to configure stdout/cmd so that it uses the proper encoding to display those characters. On Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used in the Windows cmd do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack one thing or another in the registry to add proper fonts. I can't say much about Linux as I don't use it extensively, but it looks like -Dfile.encoding is somehow the way to go there. After all, I think it's better to replace cmd with a cross-platform UI tool to display the characters the way you want, for example Swing.
You should make the conversion explicit:
byte[] byteArray = "abcd".getBytes("ISO-8859-1");
new String(byteArray, "ISO-8859-1");
EDIT:
It seems that the problem is the encoding of your Java file. If it works on Windows, try compiling the source files on Linux with javac -encoding ISO-8859-1. This should solve your problem.
