I have a problem with Turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (printed with the NetBeans 6.7 console):
?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?
I don't get the problem.
String test1 = new String(turkish.getBytes());
You're taking the Unicode String, including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is that you've achieved nothing, except losing any characters that don't fit in the default encoding. Whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does, because that is still printing a String, not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On a Mac the default encoding is UTF-8, so this does nothing. On Windows the default encoding is never UTF-8, so the result is the wrong characters.
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
To write Strings to stdout with a different encoding than the default encoding, you'd create a writer with an explicit charset, something like new OutputStreamWriter(System.out, "cp1252"), and send the string content to that.
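A minimal sketch of that approach (assuming the console really does expect cp1252, and ignoring the checked IOException handling):

Writer consoleWriter = new OutputStreamWriter(System.out, "cp1252"); // encodes chars with an explicit charset
consoleWriter.write(turkish);
consoleWriter.flush(); // flush rather than close, so System.out stays usable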
However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.
Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor that takes a byte array, and if you want to examine the contents of a string, print out the Unicode value of each character.
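For example (a sketch using the variable from the question; StandardCharsets requires Java 7+):

byte[] utf8 = turkish.getBytes(StandardCharsets.UTF_8);       // explicit encoding
String roundTrip = new String(utf8, StandardCharsets.UTF_8);  // explicit decoding
for (char c : roundTrip.toCharArray()) {
    System.out.printf("U+%04X ", (int) c);                    // inspect the code units instead of trusting the console
}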
I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.
Now, what bigger problem are you trying to solve?
You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
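For the string in the question, the \u-escaped form would look like this (the literal is then pure ASCII, so the source file encoding no longer matters):

String turkish = "\u011F\u00FC\u015F\u00E7\u011E\u00DC\u015E\u00C7\u0131"; // ğüşçĞÜŞÇı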
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().
If you are using the AspectJ compiler, do not forget to set its encoding to UTF-8 too. I struggled for hours to find this.
Related
I am trying to pass an Arabic String into a function that stores it into a database, but the String's characters are converted into '?'.
For example:
String str = "عشب";
System.out.print(str);
the output will be:
"???"
and it is stored like this in the database.
If I insert into the database directly, it works fine.
Make sure your character encoding is UTF-8.
The snippet you showed works perfectly as expected.
For example, if you are encoding your source files using windows-1252, it won't work.
The problem is that System.out is a PrintStream, which converts the Arabic string into bytes using the default encoding; and the default encoding presumably cannot handle the Arabic characters. Try
System.out.write(str.getBytes("UTF-8"));
System.out.println();
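Alternatively (a sketch; this only helps if the console itself is set to UTF-8), wrap stdout once in a PrintStream with an explicit charset and keep using println:

PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8"); // true = auto-flush
utf8Out.println(str);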
Many modern operating systems use UTF-8 as the default encoding, which supports non-Latin characters correctly. Windows is not one of those, with ANSI being the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stack Overflow question (see Setting the default Java character encoding?), you'll need to change the default for the Java process as follows:
java -Dfile.encoding=UTF-8
Additionally, since you are running in an IDE you may need to tell it to display the output in the indicated charset or risk corruption; that is IDE specific and the exact instructions will depend on your IDE.
One other thing: if you are reading or writing text files, always specify the expected character encoding, otherwise you will risk falling back to the platform default.
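For example (a sketch with a hypothetical file name, using java.nio on Java 7+):

Path file = Paths.get("arabic.txt");   // hypothetical file name
try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
    out.write("عشب");                  // the bytes on disk are unambiguously UTF-8
}
try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
    System.out.println(in.readLine()); // decoded with the same explicit charset
}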
You need to set the character set to UTF-8 for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
If you want to do so at the IDE level, you can go to:
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
Config: Windows 8 English operating system; JDK1.7; Eclipse.
I installed a piece of software written by a Chinese developer, and the GUI uses Chinese characters. But the software displayed them as ugly square boxes. I searched the internet and found a way to fix it: in the Windows 8 Control Panel, set "language for non-Unicode Programs" to "Chinese".
But a problem arises when writing code in Eclipse. We know Java itself uses two-byte Unicode to store char and String. But when I execute the following code:
import java.nio.charset.Charset;

public class CharSetTest {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset());

        String s = "哈";
        byte[] b3 = s.getBytes("UTF-8");
        System.out.println(b3.length);
        System.out.format("%X %X %X\n", b3[0], b3[1], b3[2]);
        System.out.println(new String(b3));

        byte[] b4 = s.getBytes();
        System.out.format("%X %X\n", b4[0], b4[1]);
    }
}
The output is weird:
GBK //default charset is GBK, not Unicode or UTF-8
3 //this is obvious since a Chinese character is encoded into 3 bytes
E5 93 88 //this is corresponding UTF-8 code number
鍝? //something wrong here
B9 FE //I thought s.getBytes() would use Java's default "Unicode" encoding, but not in this case
Several questions:
1. What is Java's default charset? Is it Unicode? How does the default charset affect programmers? For example, if Java used Unicode, then a string "abc" could not be encoded into other charsets, since charsets such as those for Russian or French use completely different encoding methods.
2. What does Charset.defaultCharset() return? Does it return my Windows 8 machine's default charset?
3. How does Charset.defaultCharset() return GBK? I didn't set anything related to the default charset in Windows 8 except "language for non-Unicode Programs" in the Control Panel.
4. If I declare a String in Java like String str = "abc";, I don't understand the charset/encoding process. I first type the Java statement on my keyboard; how does the keyboard translate my key presses into Java's Unicode characters? The String str is stored in my .java source code file; what charset is used to store Java source code?
EDIT:
Why do we say "Java uses Unicode to represent char and String"? In my Java programs, when should I care about Unicode?
Usually I only need to care about encoding/decoding with UTF-8, ISO-8859-1, GBK, etc., but I never care about the Unicode representation of char and String. So how and when should I use Unicode?
Check the doc: "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system." So no, the default character set is not necessarily Unicode.
In OpenJDK it is determined from the file.encoding property. See also Setting the default Java character encoding?.
The default file.encoding value is fetched (on Windows) using* the GetUserDefaultLCID() function, which corresponds to the setting in the "Regional and language options". That's why Charset.defaultCharset() is returning GBK, because you set the locale to Chinese.
Although the default character set is OS-dependent, Java Strings are always represented internally as UTF-16, regardless of that setting.
The encoding of a *.java source file is whatever you specify to the Java compiler, or the OS default if none is specified. See Java compiler platform file encoding problem.
*: See http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/windows/native/java/lang/java_props_md.c, line 577.
1. The default character set is the character set that Java will use to convert bytes to chars or Strings (and vice versa) if you don't specify anything else (for example when you create an InputStreamReader and don't pass an explicit charset); a sketch of the difference follows this list.
2. Charset.defaultCharset() returns ... the default charset. What exactly that is is implementation dependent, but it is usually just whatever the OS would use in the same situation.
3. That setting is exactly what your Java installation is using: "Chinese" means that some encoding that can handle Chinese characters has to be provided, and GBK matches that just fine.
4. The encoding of Java source files can be specified when you compile them (using the -encoding parameter). If you don't specify it explicitly, then Java will use the platform default encoding (see #1).
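A sketch of the difference mentioned in point 1 (the file name is hypothetical):

// relies on the platform default charset - the result differs between machines
Reader implicitReader = new InputStreamReader(new FileInputStream("data.txt"));
// always decodes the bytes as UTF-8, regardless of OS settings
Reader explicitReader = new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8);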
What is Java's default charset?
It's picked up from the default set in your OS; on a Western Windows installation this could be Windows-1252, for example.
Is it Unicode?
No. Unicode itself is not a charset in this sense; a charset defines how to encode characters as bytes.
How does the Java default charset affect programmers?
It's the default used when you don't specify a charset.
For example, if Java used Unicode, then a string "abc" could not be encoded into other charsets, since charsets such as those for Russian or French use completely different encoding methods.
Internally Java uses UTF-16, but you rarely need to know that. It causes no issues for most languages, except that some characters (such as rare CJK ideographs outside the Basic Multilingual Plane) take two char values and are best handled via code points.
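For example, U+20BB7 is a CJK ideograph outside the Basic Multilingual Plane, so it occupies two char values and the code point API gives the truer picture (a small sketch):

String s = "\uD842\uDFB7";                                  // one character, stored as a surrogate pair
System.out.println(s.length());                             // 2 - counts char values, not characters
System.out.println(s.codePointCount(0, s.length()));        // 1
System.out.println(Integer.toHexString(s.codePointAt(0)));  // 20bb7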
What does Charset.defaultCharset() return?
It does what it appears to do. You can confirm this by reading the javadoc for this method.
Does it return my Windows 8 machine's default charset?
Yes; that is what it is supposed to do. You only have a problem if your OS's character set cannot be mapped into Java, or is not mapped correctly. If they are the same, everything is fine.
How does Charset.defaultCharset() return GBK? I didn't set anything related to the default charset in Windows 8 except "language for non-Unicode Programs" in the Control Panel.
It returns GBK because Java picks up that Windows setting. To change what it returns, you must change the corresponding character set setting in Windows.
If I declare a String in Java like String str = "abc";, I don't understand the charset/encoding process.
For the purposes of this question, there isn't any encoding involved. There are only characters; they don't need to be encoded, because they are already characters.
How does the keyboard translate my key presses into Java's Unicode characters?
The keyboard doesn't. It only knows which keys you pressed. The OS turns these keys into characters.
The String str is stored in my .java source code file. What charset is used to store Java source code?
That is determined by the editor that saves the file. Most likely it will be the OS default again, unless you change it to something like UTF-8.
I am not sure if this helps. To change the encoding in Eclipse:
--- Project Explorer
--- Right click on Java file
--- Run As
--- Run Configurations
--- Common (tab)
--- Encoding (on Linux it is set to UTF-8 by default)
I am working with Java and PostgreSQL on Windows. I have some words which include Turkish characters like İ, ş, ö, ç, etc.
In Java I assign the words to a String and try to write it to the database. When I print it in Java, the encoding appears correct and all characters display correctly. However, when writing it to the database, the text gets mangled/scrambled.
I created my database with this command:
CREATE DATABASE dbname ENCODING "UTF-8"
I tried to fix it by converting the Turkish characters into Unicode escapes (İ -> \u0130, ş -> \u015F) and re-encoding like this:
//\u0130leti\u015Fim = İletişim
title = \u0130leti\u015Fim
String mytitle = new String(title.getBytes("ISO-8859-1"), "UTF-8");
And then I tried to write mytitle to the database, but it did not work.
Thanks for your advice.
SOLVED: I realized that it could write Turkish characters to the database after all; the problem was in the response. I added these lines before writing to the response:
String contentType= "text/html;charset=UTF-8";
response.setContentType(contentType);
response.setCharacterEncoding("utf-8");
After adding this, it works now. I hope I explained it clearly.
When you call title.getBytes("ISO-8859-1"), you're promising the Java runtime that the characters in the string can be represented as ISO-8859-1 bytes, which is not actually true for either \u0130 or \u015f.
Therefore the conversion to bytes will already do something unspecified with your Turkish characters -- typically they are simply replaced with '?'.
Next, attempting to interpret whichever bytes you get out of it as UTF-8 even though they're really ISO-8859-1 is then guaranteed to make a complete mess of everything that wasn't ASCII to begin with.
(The repertoire of ISO-8859-1 happens to coincide exactly with the Unicode characters that can be written as \u00XX for some XX.)
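You can see this in a small sketch (getBytes with a Charset argument is specified to substitute the charset's replacement byte, which for ISO-8859-1 is '?'):

byte[] bytes = "\u0130leti\u015Fim".getBytes(StandardCharsets.ISO_8859_1); // "İletişim"
System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));        // prints ?leti?im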
With encoding issues you have several things to check:
Whether your source file is in the encoding you expect it to be.
How client_encoding is set
What the database encoding is
In the case of Java, PgJDBC requires client_encoding to always be UTF-8 and will choke if you set it to something else, so that's not going to be the issue. You've shown that your database is UTF-8 too. So it seems likely that your Java sources aren't in the same encoding the Java compiler and runtime expect them to be in.
By default javac will interpret your source code in the platform default encoding. If you've saved your sources in a different encoding, weird things will happen. Save your sources either:
in the default encoding for your Windows platform;
as Unicode ("UTF-16" or "UCS-2"); or
as UTF-8 with a Byte Order Mark (BOM); note that many programs don't add a BOM for UTF-8.
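Alternatively, tell javac explicitly which encoding the sources are saved in, for example (the file name is just a placeholder):

javac -encoding UTF-8 MyServlet.java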
Then recompile your program. If that doesn't help, you'll need to follow up with more detail, starting with what exactly "it did not work" means, output of SELECTing the data you inserted with Java using psql, etc.
You should create the database like this:
CREATE DATABASE <db name>
WITH OWNER <owner user name>
TEMPLATE template0
ENCODING 'UTF-8'
LC_COLLATE 'tr_TR.UTF-8'
LC_CTYPE 'tr_TR.UTF-8';
Problem: I have a string containing special characters, which I convert to bytes and vice versa. The conversion works properly on Windows, but on Linux the special character is not converted properly. The default charset on Linux is UTF-8, as seen with Charset.defaultCharset().displayName().
However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.
How can I make it work using the default UTF-8 charset, without setting the -D option in the Unix environment?
Edit: I use JDK 1.6.13.
Edit: code snippet. It works with cs = "ISO-8859-1"; or cs = "UTF-8"; on Windows, but not on Linux:
String x = "½";
System.out.println(x);
byte[] ba = x.getBytes(Charset.forName(cs));
for (byte b : ba) {
    System.out.println(b);
}
String y = new String(ba, Charset.forName(cs));
System.out.println(y);
Your characters are probably being corrupted by the compilation process and you're ending up with junk data in your class file.
if I run on Linux with the option -Dfile.encoding=ISO-8859-1 it works properly
The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
In short, don't use -Dfile.encoding=...
String x = "½";
Since U+00bd (½) will be represented by different values in different encodings:
windows-1252 BD
UTF-8 C2 BD
ISO-8859-1 BD
...you need to tell your compiler what encoding your source file is encoded as:
javac -encoding ISO-8859-1 Foo.java
Now we get to this one:
System.out.println(x);
As a PrintStream, this will encode data to the system encoding prior to emitting the byte data. Like this:
System.out.write(x.getBytes(Charset.defaultCharset()));
That may or may not work as you expect on some platforms - the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
Your problem is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but that flag is in fact only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames on the local disk file system. And that doesn't fit the problem description you gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/into a pathname/filename or so? Or were you maybe printing to stdout? Did the stdout itself use the proper encoding?
That said, why would you want to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.
(Sorry, this didn't fit in a comment, but I will update this with the answer once you have given more info about the actual functional requirement.)
Update, as per the comments: you basically just need to configure stdout/cmd so that it uses the proper encoding to display those characters. On Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used by the Windows cmd window do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack the registry to add proper fonts. I can't say much about Linux since I don't use it extensively, but there it looks like -Dfile.encoding is the way to go. All in all, I think it's better to replace cmd with a cross-platform UI tool that displays the characters the way you want, for example one built with Swing.
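For example (65001 is the UTF-8 code page; whether the characters actually show up still depends on the console font):

chcp 65001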
You should do the conversion explicitly:
byte[] byteArray = "abcd".getBytes( "ISO-8859-1" );
new String( byteArray, "ISO-8859-1" );
EDIT:
It seems that the problem is the encoding of your Java source file. If it works on Windows, try compiling the source files on Linux with javac -encoding ISO-8859-1. This should solve your problem.