I am developing a JavaFX application. I need to create a TreeView programmatically, using Persian text for its nodes' names.
The problem is that I see strange characters when I run the application. I have searched the web and similar questions on SO, and based on the answers to one of them I wrote a function to do the encoding:
public static String getUTF(String encodeString) {
return new String(encodeString.getBytes(StandardCharsets.ISO_8859_1),
StandardCharsets.UTF_8);
}
And I use it to convert my string to build the TreeView:
CheckBoxTreeItem<String> userManagement =
new CheckBoxTreeItem<>(GlobalItems.getUTF("کاربران"));
This answer doesn't work properly for some characters:
I still get strange results. If I don't use encoding, I get:
For hard-coded string literals you need to tell the javac compiler to use the same encoding as the Java source file, say UTF-8. Check the IDE / build settings. You can u-escape some Farsi symbols, for example
\u062f for Dal, د. If the escaped characters come through correctly, the compiler is using the wrong encoding.
A String always contains Unicode; there is no need to create new Strings with re-conversion hacks.
When reading text from files, you need to convert the bytes (byte[]/InputStream) to Java text (String/Reader), specifying the encoding of those bytes.
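To see why the re-encoding helper cannot help, here is a minimal sketch. The class name ReencodeHackDemo is made up for illustration; getUTF is copied from the question, and the Persian word is written with Unicode escapes so the result does not depend on the compiler's source encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only getUTF comes from the question.
final class ReencodeHackDemo {

    // The Persian word from the question, spelled with Unicode escapes
    // so this source file stays pure ASCII.
    static final String USERS = "\u06a9\u0627\u0631\u0628\u0631\u0627\u0646";

    // The helper from the question: encode as ISO-8859-1, decode as UTF-8.
    static String getUTF(String encodeString) {
        return new String(encodeString.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The String already holds the right characters (7 UTF-16 code units).
        System.out.println(USERS.length());
        // ISO-8859-1 has no Persian letters, so every letter becomes '?'.
        System.out.println(getUTF(USERS));
    }
}
```

If the escaped string shows up correctly in the TreeView, the String handling is fine and the remaining problem is the encoding javac assumes for the source file.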
Related
I have a problem with Turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (Printed with the Netbeans 6.7 console):
?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?
I don't understand the problem.
String test1 = new String(turkish.getBytes());
You're taking the Unicode String including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is you've achieved nothing (except losing any characters that don't fit in the default encoding); whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does because that's still printing a String and not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding is never UTF-8 so the result is the wrong characters.
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
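The three cases can be reproduced on any machine by naming the charsets explicitly instead of relying on the default. This is only a sketch: the class and method names are invented, and windows-1252 is assumed to be available in the running JDK as a stand-in for a Western European Windows default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class TurkishRoundTripDemo {

    // The string from the question, written with Unicode escapes.
    static final String TURKISH =
            "\u011f\u00fc\u015f\u00e7\u011e\u00dc\u015e\u00c7\u0131";

    // Assumption: stands in for the default charset of the Windows machine.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode with one charset, then decode the bytes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // test1: default in, default out. Lossy for characters cp1252 lacks.
        String test1 = recode(TURKISH, CP1252, CP1252);
        // test2: UTF-8 in, default out. The bytes are misread as cp1252.
        String test2 = recode(TURKISH, StandardCharsets.UTF_8, CP1252);
        // test3: UTF-8 in, UTF-8 out. The String is unchanged.
        String test3 = recode(TURKISH, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println(test1.equals(TURKISH));
        System.out.println(test2.equals(TURKISH));
        System.out.println(test3.equals(TURKISH));
    }
}
```

The comparisons are done on the Strings themselves, so the result does not depend on what the console is able to display.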
To write Strings to stdout with a different encoding than the default encoding, you'd create an encoder, something like new OutputStreamWriter(System.out, "cp1252"), and send the string content to that.
However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.
Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the Unicode value of each character.
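One way to print the Unicode value of each character is sketched below. The class and method names are invented for this example; the output is plain ASCII, so it survives any console encoding.

```java
import java.util.stream.Collectors;

// Hypothetical helper; the name is illustrative only.
final class CodePointDump {

    // Renders every code point of the string as U+XXXX, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // First three characters of the Turkish test string.
        System.out.println(describe("\u011f\u00fc\u015f")); // U+011F U+00FC U+015F
    }
}
```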
I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.
Now, what bigger problem are you trying to solve?
You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().
If you are using the AspectJ compiler, do not forget to set its encoding to UTF-8 too. I struggled for hours to find this.
I am trying to pass an Arabic String into a function that stores it in a database, but the String's characters are converted into '?'.
For example:
String str = new String();
str = "عشب";
System.out.print(str);
The output will be:
"???"
and it is stored like this in the database.
If I insert into the database directly, it works well.
Make sure your character encoding is utf-8.
The snippet you showed works perfectly as expected.
For example if you are encoding your source files using windows-1252 it won't work.
The problem is that System.out is a PrintStream, which converts the Arabic string into bytes using the default encoding, which presumably cannot handle the Arabic characters. Try
System.out.write(str.getBytes("UTF-8"));
System.out.println();
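An alternative to writing raw bytes is to wrap System.out in a PrintStream that names its encoding. The sketch below assumes invented class and method names; whether the characters then appear correctly still depends on the console or IDE interpreting the bytes as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper; names are illustrative only.
final class Utf8Stdout {

    // Wraps any stream in a PrintStream that always encodes as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Shows which bytes are produced, independent of any console.
    static byte[] encodedBytes(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8(buffer);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // The Arabic word from the question, written with Unicode escapes.
        out.println("\u0639\u0634\u0628");
    }
}
```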
Many modern operating systems use UTF-8 as default encoding which will support non-latin characters correctly. Windows is not one of those, with ANSI being the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stack Overflow question (see Setting the default Java character encoding?), you'll need to change the default as follows, for the Java process:
java -Dfile.encoding=UTF-8
Additionally, since you are running in IDE you may need to tell it to display the output in the indicated charset or risk corruption, though that is IDE specific and the exact instructions will depend on your IDE.
One other thing, is if you are reading or writing text files then you should always specify the expected character encoding, otherwise you will risk falling back to the platform default.
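For text files, naming the charset on both the writing and the reading side might look like the following sketch. The class name and the temporary file are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; names are illustrative only.
final class ExplicitCharsetFileIo {

    // Writes one line as UTF-8 to a temp file and reads it back as UTF-8.
    static String writeThenReadLine(String line) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                try (BufferedWriter writer =
                             Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(line);
                }
                try (BufferedReader reader =
                             Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String arabic = "\u0639\u0634\u0628";
        System.out.println(arabic.equals(writeThenReadLine(arabic)));
    }
}
```

Because neither side falls back to the platform default, the round trip gives the same String on every machine.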
You need to set the character set to UTF-8 for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
If you want to do so at IDE level then you can do:
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
I read a list for my Android app from a CSV or TXT file.
If the file is encoded as UTF-8 with Notepad++, I see the list correctly, but I can't search/find strings with .equals.
If the file is encoded by Windows as ANSI, I can't see äöü etc., but now I can find strings.
Now my question: how can I find out what charset my string has?
I compare my first string (from the file) with another string, read in the app with a SearchView.
I "THINK" my SearchView string from the app is ANSI too. How can I change that to UTF-8, in the hope that the comparison then works again?
Android 4.4.2
Thank you
The following doesn't work:
String s = null;
try
{
s = new String(query.getBytes(), "UTF-8");
}
catch (UnsupportedEncodingException e)
{
Log.e("utf8", "conversion", e);
}
Java strings are always encoded as UTF-16, regardless of where the string data comes from.
It is important that you correctly identify the charset of the source data when converting it to a Java string. new String(query.getBytes(), "UTF-8") will work fine only if the byte[] array is actually UTF-8 encoded (note that the zero-argument getBytes() encodes with the platform default charset). You will get an UnsupportedEncodingException only if you specify a charset that Java does not support. If you specify a charset that Java does support but that does not match the data, the String constructor reports no error at all: malformed or unmappable bytes are simply converted to the Unicode U+FFFD replacement character. If you need more control over error handling during the conversion, use the CharsetDecoder class instead, which can be configured to report such input with a MalformedInputException or UnmappableCharacterException.
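A strict decode with CharsetDecoder could be sketched as follows. The class and method names are invented; the decoder calls themselves are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
final class StrictDecode {

    // Decodes the bytes and throws instead of silently substituting U+FFFD.
    static String decode(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // True if the bytes are well-formed in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decode(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xE4 is a-umlaut in ISO-8859-1 but an incomplete sequence in UTF-8.
        byte[] ansiBytes = {(byte) 0xE4};
        System.out.println(isValid(ansiBytes, StandardCharsets.UTF_8));      // false
        System.out.println(isValid(ansiBytes, StandardCharsets.ISO_8859_1)); // true
    }
}
```

A failed strict decode only proves that a charset is wrong. A successful one does not prove it is right, since single-byte charsets such as ISO-8859-1 accept every byte sequence.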
Sometimes UTF-encoded files will have a BOM at the front, so you can check for that. But ANSI files do not use BOMs. If a UTF BOM is not present in the file, then you have to either analyze the raw data and take a guess (which will lead to problems if you guess wrong), or simply ask the user which charset to use.
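Checking for a BOM can be sketched like this. The class and method names are made up, and only the UTF-8 and UTF-16 marks are handled; a null result means no BOM was found and the charset is still unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
final class BomSniffer {

    // Returns the charset announced by a leading BOM, or null if there is none.
    // Note: UTF-32 BOMs are not handled in this sketch.
    static Charset charsetFromBom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(charsetFromBom(withBom));           // UTF-8
        System.out.println(charsetFromBom(new byte[] {'a'}));  // null
    }
}
```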
Always know the charset of your data. If you don't know, ask. Avoid guessing.
I want to display a string in traditional Chinese in my application's GUI.
While debugging, Eclipse showed this string as a mixture of English letters and square boxes.
This is the Java code I used to decode it. I get the string 'str' from a traditional Chinese .mpg stream.
String TRADITIONAL_CHINESE_ENC = "Big5";
byte[] tmp = str.getBytes();
String decodedString=new String(tmp,TRADITIONAL_CHINESE_ENC);
But the result I get in decodedString is also a mixture of letters, square boxes, question marks embedded in diamond-shaped boxes, etc.
This happens only with traditional Chinese. The same code works fine for simplified Chinese, Korean, etc.
What could be wrong in my code when dealing with traditional Chinese?
I am using UTF-8 encoding for eclipse.
I can't see anything wrong with that code.
According to this Wikipedia page, there are 3 common encodings for traditional Chinese characters: Guobiao, UTF-8 and Big5. I suggest you try the two alternatives that you haven't tried, and if that fails try some of the less common alternatives listed.
(It is also possible that the real problem is in the way you are displaying the String ... but the fact that you are displaying Simplified Chinese and Korean correctly suggests that this is not the problem.)
I am using UTF-8 encoding for eclipse.
I don't think that is relevant. The code you showed us doesn't depend on the default character encoding of either the execution platform or the IDE.
I am working with Java and PostgreSQL on Windows. I have some words which include Turkish characters like İ, ş, ö, ç etc.
In Java I assign the words to a string and try to write it to the database. When I print it in Java, all characters display correctly. However, when it is written to the database, the text appears mangled.
I created my database with this command:
CREATE DATABASE dbname ENCODING "UTF-8"
I tried to fix it by writing the Turkish characters as Unicode escapes (İ -> \u0130, ş -> \u015F) and re-encoding the string via ISO-8859-1:
// \u0130leti\u015Fim = İletişim
String title = "\u0130leti\u015Fim";
String mytitle = new String(title.getBytes("ISO-8859-1"), "UTF-8");
And then I tried to write mytitle to database but it did not work.
Thanks for your advice.
SOLVED: I realized that the Turkish characters were written to the database correctly; the problem was in the response. I added these lines before writing to the response:
String contentType= "text/html;charset=UTF-8";
response.setContentType(contentType);
response.setCharacterEncoding("utf-8");
After adding this, it works. I hope I explained it clearly.
When you call title.getBytes("ISO-8859-1"), you're promising the Java runtime that the characters in the string can be represented as ISO-8859-1 bytes, which is not actually true for either \u0130 or \u015f.
Therefore the conversion to bytes will already do something unspecified with your Turkish characters; in practice they are typically replaced with '?'.
Next, attempting to interpret whichever bytes you get out of it as UTF-8 even though they're really ISO-8859-1 is then guaranteed to make a complete mess of everything that wasn't ASCII to begin with.
(The repertoire of ISO-8859-1 happens to coincide exactly with the Unicode characters that can be written as \u00XX for some XX).
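Whether a string fits into ISO-8859-1 can be checked before encoding it. The sketch below uses invented class and method names; questionHack repeats the conversion from the question so its effect is visible.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class Latin1Repertoire {

    // True only if every character of s exists in ISO-8859-1.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    // The conversion from the question: encode as ISO-8859-1, decode as UTF-8.
    static String questionHack(String title) {
        return new String(title.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String title = "\u0130leti\u015Fim";
        // U+0130 and U+015F lie above U+00FF, so they cannot be encoded.
        System.out.println(fitsLatin1(title));
        // Both Turkish letters come back as '?'.
        System.out.println(questionHack(title));
    }
}
```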
With encoding issues you have several things to check:
Whether your source file is in the encoding you expect it to be.
How client_encoding is set
What the database encoding is
In the case of Java, PgJDBC requires client_encoding to always be UTF-8 and will choke if you set it to something else, so that's not going to be the issue. You've shown that your database is UTF-8 too. So it seems likely that your Java sources aren't in the same encoding the Java compiler and runtime expect them to be in.
By default javac will interpret your source code in the platform default encoding. If you've saved your sources in a different encoding, weird things will happen. Save your sources either:
in the default encoding for your Windows platform;
as Unicode ("UTF-16" or "UCS-2"); or
as UTF-8 with a Byte Order Mark (BOM). Many programs don't add a BOM for UTF-8.
Then recompile your program. If that doesn't help, you'll need to follow up with more detail, starting with what exactly "it did not work" means, output of SELECTing the data you inserted with Java using psql, etc.
You should create the database like this:
CREATE DATABASE <db name>
WITH OWNER <owner user name>
TEMPLATE template0
ENCODING 'UTF-8'
LC_COLLATE 'tr_TR.UTF-8'
LC_CTYPE = 'tr_TR.UTF-8';