Unicode characters are not getting translated in Java

I wrote the code snippet below and I expect the output to be Açılış Tarih/Saati: but instead I am getting Açılış Tarih/Saati: The code is as follows.
public class ResourceBundleTest {
    public static void main(String[] args) {
        try {
            String turkey = "A\u00e7\u0131l\u0131\u015f Tarih/Saati:";
            System.out.println(new String(turkey.getBytes("UTF-8")));
        } catch (Exception e) {
            System.out.println("hello");
        }
    }
}
How can I fix this?

This code is basically broken:
new String(turkey.getBytes("UTF-8"))
That:
Starts with a string.
Encodes the text as UTF-8.
Decodes that binary data using the platform default encoding.
It's like saving an image as a PNG and then trying to load it as a JPEG.
You can try:
System.out.println(turkey);
... if that doesn't show up as you want it to, the problem is probably just that your console doesn't support those characters. Trying to change the encoding won't help at all - the best it could do (if you were really lucky) is to get back to the original string. More likely (and it seems this is the case with your output) is that the platform default encoding isn't UTF-8, so you're losing data.
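The three steps described above can be sketched as follows. This is a minimal illustration, not the asker's program: the class and helper names are hypothetical, and ISO-8859-1 merely stands in for "a platform default encoding that is not UTF-8".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative, not from the question.
final class MismatchedDecodeDemo {

    // Encodes the text as UTF-8, then decodes the bytes with the given charset.
    static String encodeUtf8ThenDecode(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String turkey = "A\u00e7\u0131l\u0131\u015f Tarih/Saati:";

        // Matching charsets: the round trip is lossless, but also pointless.
        String same = encodeUtf8ThenDecode(turkey, StandardCharsets.UTF_8);
        System.out.println("round trip equal: " + turkey.equals(same));

        // Mismatched charsets (ISO-8859-1 is an assumed stand-in for the default):
        // every two-byte UTF-8 sequence turns into two wrong characters.
        String mangled = encodeUtf8ThenDecode(turkey, StandardCharsets.ISO_8859_1);
        System.out.println("original length: " + turkey.length()
                + ", mangled length: " + mangled.length());
    }
}
```

The output is restricted to ASCII on purpose, so the result does not depend on what the console can display.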

In turkey.getBytes("UTF-8") you explicitly create a byte array containing the String contents encoded in UTF-8. However, according to its documentation, the constructor String(byte[]) decodes this byte sequence using the system's default encoding, which may not be UTF-8.
Why don't you just write System.out.println(turkey), though? What's the point of encoding the String in UTF-8 and then decoding it before printing it?

Related

Unicode not shown in my app when i run on jar file

I built a Java GUI application with some jLabel components that contain Unicode text. When I run the app from the NetBeans IDE the text is displayed correctly, but when I run it from the .jar file the text is garbled.
My code:
try {
    jLabel1.setText(new String("ژمارا ناسنامی".getBytes(), "UTF-8"));
} catch (UnsupportedEncodingException ex) {
    Logger.getLogger(dataEntry.class.getName()).log(Level.SEVERE, null, ex);
}
Output :
Try this only:
jLabel1.setText("ژمارا ناسنامی");
Your error here is that you first encode the String with getBytes(), which uses the platform default encoding, and then decode the resulting bytes as UTF-8. That is incorrect, because it relies on the default encoding, which may not be UTF-8. It is also useless, because a String is already UTF-16 encoded, so it already covers Arabic characters.
Here is the Javadoc of the method String#getBytes() as reminder:
Encodes this String into a sequence of bytes using the platform's
default charset, storing the result into a new byte array. The
behavior of this method when this string cannot be encoded in the
default charset is unspecified. The CharsetEncoder class should be
used when more control over the encoding process is required.
If you want to encode a String properly, you need to use String#getBytes(Charset) or String#getBytes(String) instead. But once again, it is not even needed in this particular case.
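For the cases where bytes really are needed, a minimal sketch of the explicit-charset overloads could look like this. The class and method names are hypothetical, and the label text is written with Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class ExplicitCharsetDemo {

    // Encode with an explicit charset instead of the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First word of the label from the question, as escapes.
        String label = "\u0698\u0645\u0627\u0631\u0627";

        byte[] utf8 = toUtf8(label);
        System.out.println(label.length() + " chars -> " + utf8.length + " UTF-8 bytes");
        System.out.println("round trip equal: " + label.equals(fromUtf8(utf8)));
    }
}
```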

Converting decode utf-8 string to file

I am trying to save an image that I receive from an Android device. The Android side sends a UTF-8 encoded string, and below is the code I am using to save it.
String test = java.net.URLDecoder.decode(image_base64, "UTF-8");
byte[] data = Base64.decodeBase64(test.getBytes());
FileOutputStream stream = null;
try {
    stream = new FileOutputStream("/var/lib/easy-tomcat7/webapps/test/test1.bmp");
    stream.write(data);
    stream.flush();
    test1 += "success";
}
catch (IOException e)
{
    test1 = "failuare";
    e.getMessage();
}
finally
{
    test1 += "finally";
    stream.close();
}
The file is created, but it is corrupted. I have done a lot of research on this but cannot work out why it is happening. Please help me solve this issue.
I assume you are using Base64 from Apache Commons Codec.
Note that you are dealing with multiple different kinds of encodings:
URL encoding
Base64 encoding
UTF-8 character encoding
Those are three totally different things, and you should understand all of them to understand what is happening exactly.
Check how exactly the image is encoded that you get from the Android device. Your code is assuming that you are getting it as URL-encoded Base64 data, using the UTF-8 character set. Is that indeed how the Android device is sending the data? You will have to check that with whoever wrote the Android application.
What does the string image_base64 contain? Is it valid, URL-encoded Base64 data?
You shouldn't call getBytes() on the string before you pass it to Base64.decodeBase64 - that will convert the string into a byte array using the default character encoding of the system you're running it on. Just do this instead:
byte[] data = Base64.decodeBase64(test);
To make matters worse, there are several variants of Base64 encoding (as you can see on the Wikipedia page about Base64). It may be the case that whatever variant the Android app used is different from what the Base64 class is using.
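As a rough sketch of the decoding chain discussed above, here is a version that uses the JDK's own java.util.Base64 (Java 8+) rather than Apache Commons Codec. That substitution is an assumption, as are the class name, the helper name, and the four sample bytes; whether the Android app really sends URL-encoded standard Base64 still has to be confirmed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not the asker's servlet code.
final class UrlEncodedBase64Demo {

    // Step 1: undo the URL encoding. Step 2: undo the Base64 encoding.
    // The Base64 text is passed on as a String; no getBytes() call is involved.
    static byte[] decodeImage(String urlEncodedBase64) {
        try {
            String base64 = URLDecoder.decode(urlEncodedBase64, "UTF-8");
            return Base64.getDecoder().decode(base64);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // "Qk37/w==" with '/' and '=' percent-encoded (made-up sample data).
        String wire = "Qk37%2Fw%3D%3D";
        byte[] data = decodeImage(wire);
        System.out.println(Arrays.toString(data));
    }
}
```

If the sender used the URL-safe Base64 alphabet ('-' and '_' instead of '+' and '/'), Base64.getUrlDecoder() would be the matching decoder. Note also that URLDecoder turns a literal '+' into a space, so Base64 text that was never URL-encoded must not be passed through it.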
Use the encoding also for getBytes()
Base64.decodeBase64(test.getBytes("utf-8"));

How to get UTF-8 conversion for a string

Frédéric is converted to FrÃ©dÃ©ric in Java.
However, I need to pass the proper string to my client.
How can I achieve this in Java?
I tried:
String a = "Frédéric";
String b = new String(a.getBytes(), "UTF-8");
However, string b contains the same value as a.
I expect the string to hold the value Frédéric.
How can I pass this value to the client properly?
If I understand the question correctly, you're looking for a function that will repair strings that have been damaged by others' encoding mistakes?
Here's one that seems to work on the example you gave:
static String fix(String badInput) {
    byte[] bytes = badInput.getBytes(Charset.forName("cp1252"));
    return new String(bytes, Charset.forName("UTF-8"));
}
fix("FrÃ©dÃ©ric").equals("Frédéric") // true
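To see why this repair works, it may help to reproduce the damage first. The sketch below is an illustration with hypothetical names; it assumes the damage came from UTF-8 bytes being decoded as windows-1252, which is what the example string suggests but is not confirmed by the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; damage() models the assumed upstream mistake.
final class MojibakeDemo {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    // The mistake: UTF-8 bytes are decoded as windows-1252.
    static String damage(String good) {
        return new String(good.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // The repair: re-encode as windows-1252 to recover the UTF-8 bytes, then decode them properly.
    static String fix(String bad) {
        return new String(bad.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "Fr\u00e9d\u00e9ric";
        String bad = damage(good);

        // Each e-acute (2 bytes in UTF-8) becomes two characters, so the string grows.
        System.out.println(good.length() + " chars -> " + bad.length() + " chars");
        System.out.println("repaired: " + good.equals(fix(bad)));
    }
}
```

The repair is only reliable when every byte of the damaged text maps to a windows-1252 character; bytes that are undefined in that code page are lost during the first, wrong decode.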
The answer is quite complicated. See http://www.joelonsoftware.com/articles/Unicode.html for basic understanding.
My first suggestion would be to save your Java file with utf-8. Default for Eclipse on Windows would be cp1252 which might be your problem. Hope I could help.
Find your language code here and use that.
String a = new String(yourString.getBytes(), YOUR_ENCODING);
You can also try:
String a = URLEncoder.encode(yourString, HTTP.YOUR_ENCODING);
If System.out.println("Frédéric") shows garbled output on the console, the most likely cause is that the encoding of your source file (which seems to be UTF-8) is not the same as the one the compiler assumes. By default that is the platform encoding, so probably some flavor of ISO-8859. Try compiling your source with javac -encoding UTF-8 (or set the appropriate property of your build environment) and you should be OK.
If you are sending this to some other piece of client software it's most likely an encoding issue on the client-side.

Japanese Character Encoding in Base64

I have been asked to fix a bug in our email processing software.
When a message whose subject is encoded in RFC 2047 like this:
=?ISO-2022-JP?B?GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC?=
is received, it is incorrectly decoded - one of the Japanese characters is not rendered properly. It is rendered like this: 配信テスト?日本語 when it should be 配信テスト㈱日本語
(I do not understand Japanese) - clearly one of the characters, the one that looks like it is in brackets, has not been rendered.
The decoding is carried out by javax.mail.internet.MimeUtility.decodeText()
If I try it with an on-line decoder (the only one I've found is here) it seems to work OK, so I was suspecting a bug in MimeUtility.
So I tried some experiments, in the form of this little program:
public class Encoding {
    private static final Charset CHARSET = Charset.forName("ISO-2022-JP");

    public static void main(String[] args) throws UnsupportedEncodingException {
        String control = "繋がって";
        String subject= "配信テスト㈱日本語";
        String controlBase64 = japaneseToBase64(control);
        System.out.println(controlBase64);
        System.out.println(base64ToJapanese(controlBase64));
        String subjectBase64 = japaneseToBase64(subject);
        System.out.println(subjectBase64);
        System.out.println(base64ToJapanese(subjectBase64));
    }

    private static String japaneseToBase64(String in) {
        return Base64.encodeBase64String(in.getBytes(CHARSET));
    }

    private static String base64ToJapanese(String in) {
        return new String(Base64.decodeBase64(in), CHARSET);
    }
}
(The Base64 and Hex classes are in org.apache.commons.codec)
When I run it, here's the output:
GyRCN1IkLCRDJEYbKEI=
繋がって
GyRCR1s/LiVGJTklSCEpRnxLXDhsGyhC
配信テスト?日本語
The first, shorter Japanese string is a control, and this returns the same as the input, having been converted into Base64 and back again, using Charset ISO-2022-JP. All OK there.
The second Japanese string is the one with the dodgy character. As you see, it returns with a ? instead of the character. The Base64 encoding output is also different from the original subject encoding.
Sorry if this is long, I wanted to be thorough. What's going on, and how can I decode this character correctly?
The bug is not in your software, but the subject string itself is incorrectly encoded. Other software may be able to decode the text by making further assumptions about the content, just as it is often assumed that characters in the range 0x80-0x9f are Cp1252-encoded, although ISO-8859-1 or ISO-8859-15 is specified.
ISO-2022-JP is a multi-charset encoding that uses escape sequences to switch between the character sets actually in use. Your encoded string starts with ESC $ B, indicating that the character set JIS X 0208-1983 is used. The offending character is encoded as 0x2d6a. That code point is not defined in that character set; it was added later in JIS X 0213:2000, a newer version of the JIS X character set specifications.
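The byte-level claims above can be checked without any Japanese charset support by Base64-decoding the payload of the subject header and dumping it as hex. This is a small sketch using java.util.Base64; the class and helper names are hypothetical, and the payload is copied from the question.

```java
import java.util.Base64;

// Hypothetical inspection helper; it only looks at raw bytes.
final class Iso2022JpInspect {

    // Base64 payload of the RFC 2047 subject quoted in the question.
    static final String PAYLOAD = "GyRCR1s/LiVGJTklSC1qRnxLXDhsGyhC";

    static byte[] payloadBytes() {
        return Base64.getDecoder().decode(PAYLOAD);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Skips the 3-byte opening escape, then walks the two-byte characters
    // until the next escape byte (0x1B) and returns the offset of the pair.
    static int indexOfPair(byte[] bytes, int first, int second) {
        for (int i = 3; i + 1 < bytes.length && bytes[i] != 0x1B; i += 2) {
            if (bytes[i] == first && bytes[i + 1] == second) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] bytes = payloadBytes();
        System.out.println(toHex(bytes));

        int at = indexOfPair(bytes, 0x2D, 0x6A);
        // JIS row and cell numbers are the byte values minus 0x20.
        System.out.println("offset " + at + ", row " + (bytes[at] - 0x20)
                + ", cell " + (bytes[at + 1] - 0x20));
    }
}
```

The dump starts with 1B 24 42 (ESC $ B) and ends with 1B 28 42 (ESC ( B), with nine two-byte characters in between, the sixth of which is 2D 6A.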
Try using "MS932" or "Shift-JIS" as your encoding. That is:
private static final Charset CHARSET = Charset.forName("MS932");
Japanese uses several scripts, such as kanji and katakana. Some encodings, like Cp132, do not support some Japanese characters. The problem you face is caused by the "ISO-2022-JP" encoding used in your code.
ISO-2022-JP uses pairs of bytes, called ku and ten, that index into a 94×94 table of characters. The pair that fails has ku 12 and ten 73, which is not listed in table of valid characters I have (based on JIS X 0208). All of ku=12 seems to be unused.
Wikipedia doesn't list any updates to JIS X 0208, either. Perhaps the sender is using some sort of vendor-defined extension?
Despite the fact that ISO-2022-JP is a variable width encoding, it seems as though Java doesn't support the section of the character set that it lies in (possibly as a result of the missing escape sequences in ISO-2022-JP-2 that are present in ISO-2022-JP-3 and ISO-2022-JP-2004 which aren't supported). UTF-8, UTF-16 and UTF-32 do however support all of the characters.
UTF-32:
AAB+SwAAMEwAADBjAAAwZg==
繋がって
AACRTQAAT+EAADDGAAAwuQAAMMgAADIxAABl5QAAZywAAIqe
配信テスト㈱日本語
As an extra tidbit, regardless of whether UTF-32 was used, when the strings were printed as-is they retained their natural encoding and appeared normally.

UTF-8 character encoding in Java

I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.
The original string is
HANDICAP╔ES
which is supposed to be
HANDICAPÉES
Here is a code snippet that shows how I am using the jackcess Database driver to read in the Access MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = this.rowIter.next();
    // convert fields to UTF
    Map<String, Object> rowUTF = new HashMap<String, Object>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }
}
In the code you'll see where I want to convert directly to UTF8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding type when using the jackcess driver.
Thanks,
Cam
New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
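The byte-level reasoning in this analysis can be reproduced in isolation. The sketch below uses hypothetical names and assumes the runtime provides the IBM850 (cp850) charset, which ordinary JDK builds do but which is not guaranteed everywhere. It illustrates how the corruption arises and why the one-line reversal happens to work; it is not a recommended fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; assumes the IBM850 charset is available.
final class Cp850RepairDemo {

    private static final Charset CP850 = Charset.forName("IBM850");

    // The suspected original mistake: ISO-8859-1 bytes decoded as cp850.
    static String corrupt(String good) {
        return new String(good.getBytes(StandardCharsets.ISO_8859_1), CP850);
    }

    // The reversal used in the question: back to bytes via cp850, then decode as ISO-8859-1.
    static String repair(String bad) {
        return new String(bad.getBytes(CP850), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String bad = "HANDICAP\u2554ES"; // box-drawing character in place of E-acute

        // The box-drawing character encodes to the single byte 0xC9 in cp850.
        byte[] raw = "\u2554".getBytes(CP850);
        System.out.println(String.format("cp850 byte: 0x%02X", raw[0] & 0xFF));

        System.out.println("repaired: " + repair(bad).equals("HANDICAP\u00c9ES"));
    }
}
```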
Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:
CHCP 1252
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be a problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("myFile.txt"), "UTF-8");
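A slightly fuller sketch of the same idea, writing through an OutputStreamWriter with an explicit charset and then checking the bytes that ended up on disk, might look like this. The class name, helper name, and use of a temporary file are assumptions made for the example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; writes to a temporary file and cleans up afterwards.
final class Utf8FileDemo {

    // Writes the text as UTF-8 and returns the raw bytes found in the file.
    static byte[] writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
                out.write(text);
            }
            byte[] bytes = Files.readAllBytes(file);
            Files.delete(file);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "HANDICAP\u00c9ES";
        byte[] bytes = writeAndReadBack(text);

        // 11 characters, one of which needs two bytes in UTF-8.
        System.out.println(text.length() + " chars, " + bytes.length + " bytes on disk");
        System.out.println("decodes back: "
                + text.equals(new String(bytes, StandardCharsets.UTF_8)));
    }
}
```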
String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This shows the correct string value. This means that it was originally encoded with ISO-8859-1 and then incorrectly decoded as CP850 (CP1252, a.k.a. Windows ANSI, is also possible, as pointed out in a comment, since É has the same code point there as in ISO-8859-1).
Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between them. You would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:
Align your environment and binary pipelines so that they all use one and the same character encoding.
You can not and should not convert between them. You would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
If you're actually using the "command prompt" as user interface, then you're actually lost. It doesn't support UTF-8. As suggested in the comments and in the article linked in the comments, you need to create a Swing application instead of relying on the restricted command prompt environment.
You can specify the encoding when establishing the connection. This worked perfectly and solved my encoding problem:
DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
Table table = open.getTable("FolderInfo");
Using "ISO-8859-1" helped me deal with the French characters.
