Not able to decode traditional Chinese using Java

I want to display a string in traditional Chinese in my application GUI.
While debugging, Eclipse showed this string as a mixture of English letters and square boxes.
This is the Java code I used to decode it. The string 'str' comes from a traditional Chinese .mpg stream.
String TRADITIONAL_CHINESE_ENC = "Big5";
byte[] tmp = str.getBytes();
String decodedString = new String(tmp, TRADITIONAL_CHINESE_ENC);
But the result I am getting in decodedString is also a mixture of letters, square boxes, and question marks embedded in diamond-shaped boxes.
This happens only with traditional Chinese. The same code works fine for Simplified Chinese, Korean, etc.
What could be wrong in my code when dealing with traditional Chinese?
I am using UTF-8 encoding for Eclipse.

I can't see anything wrong with that code.
According to this Wikipedia page, there are three common encodings for traditional Chinese characters: Guobiao, UTF-8 and Big5. I suggest you try the two alternatives that you haven't tried, and if that fails, try some of the less common alternatives listed there.
(It is also possible that the real problem is in the way you are displaying the String ... but the fact that you are displaying Simplified Chinese and Korean correctly suggests that this is not the problem.)
I am using UTF-8 encoding for Eclipse.
I don't think that is relevant. The code you showed us doesn't depend on the default character encoding of either the execution platform or the IDE.
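If you want to try the candidates systematically, here is a minimal sketch (the CharsetProbe class and CANDIDATES list are illustrative, not from the original answer); it assumes you can grab the raw bytes directly from the stream:

import java.nio.charset.Charset;

public class CharsetProbe {
    // Candidate encodings for traditional Chinese, per the answer above:
    // Big5, UTF-8, and GB18030 (a Guobiao encoding).
    private static final String[] CANDIDATES = {"Big5", "UTF-8", "GB18030"};

    // Decode the raw stream bytes with each candidate and print the result,
    // so you can see which charset produces readable text.
    public static void probe(byte[] rawBytesFromStream) {
        for (String name : CANDIDATES) {
            String decoded = new String(rawBytesFromStream, Charset.forName(name));
            System.out.println(name + " -> " + decoded);
        }
    }
}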

Related

Thai script seems to lose UTF-8 encoding in java for-each loop

I'm trying to develop an application within Android Studio on Windows 10.
PROBLEM: The following string array of Thai words:
String[] myTHarr = {"มาก","เชี่ยว","แน่","ม่อน","บ้าน","พูด","เลื่อย","เมื่อ","ช่ำ","แร่"};
...when processed by the following for-each loop:
for (String s : myTHarr) {
    // s = มา� before executing any of the below code:
    byte[] utf8EncodedThaiArr = s.getBytes("UTF-8");
    String utf8EncodedThai = new String(utf8EncodedThaiArr); // setting breakpoint here
    // s is still มาà¸� (I want it to be มาก)
    // do stuff
}
results in s = มา� when attempting to process the first word (none of the other words work either, but that's expected given the first fails).
The Thai script appears in the string array correctly (the declaration was copied straight from Android Studio), the file encoding is set to UTF-8 for the Java file (per here), and the File Encoding Settings look like this (per here):
According to the documentation, the String(byte[]) constructor "constructs a new String by decoding the specified array of bytes using the platform's default charset."
I'm guessing that the default character set is not UTF-8, so the solution is to specify the encoding when decoding the array of bytes:
String utf8EncodedThai = new String(utf8EncodedThaiArr, "UTF-8"); //setting breakpoint here
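Equivalently, you can use the charset constant from java.nio.charset (a small variation, not from the original post), which avoids the checked UnsupportedEncodingException of the string-named overload:

import java.nio.charset.StandardCharsets;

byte[] utf8EncodedThaiArr = s.getBytes(StandardCharsets.UTF_8);
String utf8EncodedThai = new String(utf8EncodedThaiArr, StandardCharsets.UTF_8);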
As several commenters pointed out, the problem had to be within my environment. After a bit more searching I found I should have rebuilt the project after changing the encodings (merely switching to UTF-8 and clicking 'Apply'/'OK' wasn't enough). I should note here that my File Encoding settings look like this, for reference:
Once I rebuilt, I started getting the compiler error "unmappable character for encoding cp1252" on the String array containing the Thai. (Side note: some of the Thai characters were fine, while others rendered as � and friends. I would have thought either all of the Thai would work or none of it, so I was surprised to see even common Thai letters such as ก make the compiler choke.)
That error led to this post, in which I tried a few things to set the compiler options to UTF-8. Since my application happens to be a sort of 'pre-process' for an Android app, and is therefore separate from the app itself, I didn't have the luxury of using the compilerOptions attribute the answers in that SO post recommended (though I have since added it to the Gradle build on the Android app side). This led me to setting the environment variable JAVA_TOOL_OPTIONS via PowerShell:
setx JAVA_TOOL_OPTIONS "-Dfile.encoding=UTF8"
Which fixed the issue!
I tried your code with the attached settings, and the code worked fine.

Encoding string doesn't work properly in java

I am developing a JavaFX application. I need to create a TreeView programmatically, using Persian for its nodes' names.
The problem is that I see strange characters when I run the application. I have searched the web and the same questions on SO, and coded a function to do the encoding based on the answers to those questions:
public static String getUTF(String encodeString) {
    return new String(encodeString.getBytes(StandardCharsets.ISO_8859_1),
            StandardCharsets.UTF_8);
}
And I use it to convert my string to build the TreeView:
CheckBoxTreeItem<String> userManagement =
new CheckBoxTreeItem<>(GlobalItems.getUTF("کاربران"));
This answer doesn't work properly for some characters; I still get strange results. If I don't use the encoding function, I get:
For hard-coded string literals you need to tell the javac compiler to use the same encoding as the Java source, say UTF-8. Check the IDE / build settings. As a test, you can u-escape some Farsi symbols, e.g. \u062f for Dal, د. If the escaped characters come through correctly while the literals do not, the compiler is using the wrong encoding.
A String always contains Unicode, so no reconversion hack with new String is needed.
When reading text from files, one needs to convert the bytes (byte[]/InputStream) to Java text (String/Reader), specifying the encoding of those bytes.
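A minimal sketch of that last point (the readLinesUtf8 helper and file name are illustrative, not from the original answer):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Read a text file, decoding its bytes explicitly as UTF-8 instead of
// relying on the platform default charset.
static List<String> readLinesUtf8(String fileName) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            new FileInputStream(fileName), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
    }
    return lines;
}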

Store Arabic in String and insert it into database using Java

I am trying to pass an Arabic String into a function that stores it in a database, but the String's characters are converted into '?'.
For example:
String str = "عشب";
System.out.print(str);
the output will be:
"???"
and it is stored like this in the database.
If I insert into the database directly, it works well.
Make sure your character encoding is UTF-8. The snippet you showed works perfectly as expected.
For example, if your source files are encoded as windows-1252, it won't work.
The problem is that System.out is a PrintStream, which converts the Arabic string into bytes using the default encoding; presumably that encoding cannot handle the Arabic characters. Try:
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
System.out.write(utf8, 0, utf8.length); // this PrintStream overload declares no checked exception
System.out.println();
Many modern operating systems use UTF-8 as the default encoding, which supports non-Latin characters correctly. Windows is not one of them, with ANSI code pages being the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stack Overflow question (see Setting the default Java character encoding?), you'll need to change the default for the Java process as follows:
java -Dfile.encoding=UTF-8
Additionally, since you are running in an IDE, you may need to tell it to display the output in the indicated charset or risk corruption, though that is IDE-specific and the exact instructions will depend on your IDE.
One other thing: if you are reading or writing text files, you should always specify the expected character encoding, otherwise you risk falling back to the platform default.
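As a minimal sketch of that last point (the writeUtf8 helper and file name are illustrative assumptions):

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Write text with an explicit charset: the bytes on disk are UTF-8
// no matter what file.encoding the JVM started with.
static void writeUtf8(String fileName, String text) throws IOException {
    try (Writer out = Files.newBufferedWriter(
            Paths.get(fileName), StandardCharsets.UTF_8)) {
        out.write(text);
    }
}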
You need to set the character set to UTF-8 for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
(Note that encode returns a ByteBuffer holding the UTF-8 bytes.)
If you want to do so at the IDE level (in Eclipse), go to:
Window > Preferences > General > Content Types, and set UTF-8 as the default encoding for all content types.

Java: Advise on Charset Conversion

I have been working on a scenario that does the following:
Get input data in Unicode format; [UTF-8]
Convert to ISO-8559;
Detect & replace unsupported characters for encoding; [Based on user-defined key-value pairs]
My question is, I have been trying to find in-depth information on ISO-8559, with no luck so far. Does anybody happen to know more about it? How different is it from ISO-8859? Any details would be much appreciated.
Secondly, keeping the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I am able to achieve what is needed using character-based replacement, it is obviously time-consuming when the data size is huge [in MBs].
I am sure there must be a better way to do this. Can someone advise me, please?
I assume you want to convert UTF-8 to ISO-8859-1, that is, Western Latin-1. There are many charset tables on the net.
In general, for web browsers and Windows it would be better to convert to windows-1252, which is an extension of Latin-1 redefining the range 0x80 - 0x9F, among other things with the special quotes seen in MS Word. Browsers are de facto capable of interpreting these codes even in a page declared as ISO-8859-1, even on a Mac.
A standard Java conversion like new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252") already does much of the work. You can either write a kind of filter, or afterwards find the ? characters introduced for untranslatable special characters. You could transliterate accented Latin letters that are not in Windows-1252 into plain ASCII letters:
String s = ...;
// Decompose accented letters into base letter + combining accent (NFD),
s = Normalizer.normalize(s, Normalizer.Form.NFD);
// then strip the combining accents, leaving the plain base letters.
return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
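As a hedged sketch of the "filter" idea above, combined with the question's user-defined key-value pairs (the toWindows1252 name and userMap parameter are illustrative; it walks chars, so supplementary characters outside the BMP are not handled):

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Map;

// Replace characters the target charset cannot encode, preferring
// user-defined substitutions and falling back to '?'.
static String toWindows1252(String input, Map<Character, String> userMap) {
    CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (encoder.canEncode(c)) {
            out.append(c);                // directly encodable
        } else if (userMap.containsKey(c)) {
            out.append(userMap.get(c));   // user-defined replacement
        } else {
            out.append('?');              // last-resort fallback
        }
    }
    return out.toString();
}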

How can I enter multiple Unicode strings (including right-to-left reading order) in a Java source file?

I am testing a piece of Java code and need to create an array of strings. These strings are words in different languages, including some, like Arabic, with right-to-left reading order (I don't know if that matters...).
So I need to do something like this:
ArrayList<String> words = ...
words.add(<word-in-english>);
words.add(<word-in-chinese>);
words.add(<word-in-russian>);
words.add(<word-in-arabic>);
What's the best way to put these into my Java code? Is there a way to do it other than using a "\u" escape for every character in a string? Thanks
You can set the encoding of your editor/IDE to UTF-8, and of the java compiler too. For international projects this is becoming more and more of a convention.
Unfortunately, you may need to set your IDE font to a full Unicode font, which might be 35 MB or so. Alternatively, use "\uXXXX" escapes for any missing Chinese glyphs, for instance via native2ascii.
Depending on your sources, you might use separate files per language.
In order for it to work you must do these 2 things:
Save the source file in Unicode format (UTF-8). How to do this is IDE/Text Editor dependent.
Compile the file by specifying the UTF-8 charset. Like this:
javac -encoding utf-8 MyFile.java
As far as I know there is no problem putting any Unicode characters into your Java code, including RTL languages. It depends a little on your IDE, but I believe all modern IDEs support RTL typing; at least Eclipse does.
You have to save your source code using the UTF-8 charset. Again, this depends on your IDE. In Eclipse, right-click the file, open its resource properties, and change the encoding to UTF-8.
Sometimes it is just not convenient to type RTL text in an IDE. In that case, type the text in another program (MS Word, Notepad, etc.) and then copy and paste it into the Java code.
BTW, think about storing Unicode strings in a separate resource file; it is usually more convenient (see the sketch below).
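A minimal sketch of the resource-file idea (words.properties and its key are hypothetical): store the words in a UTF-8 properties file and load it through a Reader, so the decoding does not depend on the platform default charset.

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

Properties words = new Properties();
// Properties.load(Reader) takes characters as-is; the Reader decides
// the byte decoding, here explicitly UTF-8.
try (Reader reader = Files.newBufferedReader(
        Paths.get("words.properties"), StandardCharsets.UTF_8)) {
    words.load(reader);
}
String arabicWord = words.getProperty("arabic"); // hypothetical key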
Shouldn't something like this work:
BufferedReader bufReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file_name), "UTF-16"));
Pay attention to the explicit UTF-16 charset.
