Store Arabic in String and insert it into database using Java

I am trying to pass an Arabic String to a function that stores it in a database, but the String's characters are converted to '?'.
For example:
String str = new String();
str = "عشب";
System.out.print(str);
The output is:
"???"
and that is also how it is stored in the database.
If I insert the text into the database directly, it works fine.

Make sure your character encoding is utf-8.
The snippet you showed works perfectly as expected.
For example if you are encoding your source files using windows-1252 it won't work.

The problem is that System.out is a PrintStream, which converts the Arabic string into bytes using the default encoding, and that encoding presumably cannot represent the Arabic characters. Try
System.out.write(str.getBytes("UTF-8"));
System.out.println();
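To see where the question marks come from without involving a console at all, the following sketch encodes the string with a charset that has no Arabic letters and then decodes it again. The class and method names are made up for illustration, and ISO-8859-1 is only a stand-in for whatever legacy default encoding the platform happens to use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class QuestionMarkDemo {

    // Encodes the text to bytes and decodes it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The Arabic word from the question, written with Unicode escapes.
        String str = "\u0639\u0634\u0628";

        // ISO-8859-1 has no Arabic letters, so each one becomes '?'.
        System.out.println(roundTrip(str, StandardCharsets.ISO_8859_1)); // prints ???

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(str, StandardCharsets.UTF_8).equals(str)); // prints true
    }
}
```

The String itself is never damaged; the loss happens at the moment characters are turned into bytes with an encoding that lacks them.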

Many modern operating systems use UTF-8 as default encoding which will support non-latin characters correctly. Windows is not one of those, with ANSI being the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stack Overflow question (see Setting the default Java character encoding?), you'll need to change the default for the Java process as follows:
java -Dfile.encoding=UTF-8
Additionally, since you are running in an IDE, you may need to tell it to display the output in the indicated charset or risk corruption; the exact instructions depend on your IDE.
One other thing: if you are reading or writing text files, always specify the expected character encoding, otherwise you risk falling back to the platform default.
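A minimal sketch of what "always specify the encoding" can look like for file I/O, assuming UTF-8 is the encoding you want. The class name, helper names and the temporary file are illustrative only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration.
class ExplicitFileEncoding {

    // Writes the text as UTF-8 regardless of the platform default.
    static void writeUtf8(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file back, decoding it as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("arabic", ".txt"); // scratch file for the demo
        try {
            String original = "\u0639\u0634\u0628";
            writeUtf8(file, original);
            System.out.println(readUtf8(file).equals(original)); // prints true
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```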

You need to use the UTF-8 character set for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
If you want to do so at IDE level then you can do:
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
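Charset.forName("UTF-8").encode(myString) returns a ByteBuffer rather than a byte array, so a little extra code is needed to get at the bytes. The following sketch (class and method names are hypothetical) shows one way to do that and prints the resulting UTF-8 bytes in hex.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical demo class for illustration.
class CharsetEncodeDemo {

    // Encodes the text as UTF-8 and copies the result out of the ByteBuffer.
    static byte[] toUtf8(String text) {
        ByteBuffer buffer = Charset.forName("UTF-8").encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8("\u0639\u0634\u0628");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        // Each of the three Arabic letters takes two bytes in UTF-8.
        System.out.println(hex); // prints D8 B9 D8 B4 D8 A8
    }
}
```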

Related

Java String some characters not showing [duplicate]

I have a problem with turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (Printed with the Netbeans 6.7 console):
?ü?ç?Ü?Ç?
ğüşçĞÜŞÇı
?ü?ç?Ü?Ç?
I don't get the problem.
String test1 = new String(turkish.getBytes());
You're taking the Unicode String including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is you've achieved nothing (except losing any characters that don't fit in the default encoding); whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does because that's still printing a String and not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding was not UTF-8 at the time (Java 18 and later default to UTF-8 on all platforms), so the result is the wrong characters.
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
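The three cases can be reproduced on any machine by naming the charsets explicitly instead of relying on the default. This is only a sketch: windows-1252 is assumed here as the Windows default from the question, and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class TurkishRoundTrips {

    // Assumed stand-in for the Windows default encoding in the question.
    static final Charset LEGACY = Charset.forName("windows-1252");

    // The Turkish test string from the question, written with Unicode escapes.
    static final String TURKISH =
            "\u011F\u00FC\u015F\u00E7\u011E\u00DC\u015E\u00C7\u0131";

    // Like test1 on Windows: encode and decode with the legacy default.
    static String test1() {
        return new String(TURKISH.getBytes(LEGACY), LEGACY);
    }

    // Like test2 on Windows: encode as UTF-8, decode with the legacy default.
    static String test2() {
        return new String(TURKISH.getBytes(StandardCharsets.UTF_8), LEGACY);
    }

    // test3: encode and decode with UTF-8, which changes nothing.
    static String test3() {
        return new String(TURKISH.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Five of the nine characters are missing from windows-1252 and become '?'.
        System.out.println(test1().equals("?\u00FC?\u00E7?\u00DC?\u00C7?")); // prints true
        // Every UTF-8 byte is decoded as a separate character: 18 chars of mojibake.
        System.out.println(test2().length()); // prints 18
        System.out.println(test3().equals(TURKISH)); // prints true
    }
}
```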
To write Strings to stdout with a different encoding than the default encoding, you'd create an encoder, something like new OutputStreamWriter(System.out, "cp1252"), and send the string content to that.
However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
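Whether a given encoding contains a character can be checked directly with a CharsetEncoder. The sketch below, with hypothetical class and method names, lists the characters of the test string that windows-1252 cannot represent; it prints code points instead of the characters themselves so the result does not depend on the console.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class for illustration.
class EncodableCheck {

    // Returns the characters of text that the named charset cannot represent.
    static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        String turkish = "\u011F\u00FC\u015F\u00E7\u011E\u00DC\u015E\u00C7\u0131";

        // Prints U+011F, U+015F, U+011E, U+015E and U+0131, one per line.
        for (char c : unencodable(turkish, "windows-1252").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }

        // UTF-8 can encode all of them.
        System.out.println(unencodable(turkish, "UTF-8").isEmpty()); // prints true
    }
}
```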
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.
Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the unicode value of each character.
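Printing the Unicode value of each character can be done with a small helper like the one below. It is a sketch with made-up names; it produces pure ASCII output, so it shows what a String really contains even on a console that cannot display the characters.

```java
// Hypothetical helper class for illustration.
class CodePointDump {

    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u011F\u00FC\u015F")); // prints U+011F U+00FC U+015F
    }
}
```

If the dump shows U+003F where a Turkish letter was expected, the character was already lost before printing; if it shows the right code points, the String is fine and only the output path is at fault.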
I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.
Now, what bigger problem are you trying to solve?
You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().
If you are using the AspectJ compiler, do not forget to set its encoding to UTF-8 too. I struggled for hours to find this.

Java Unicode Characters after u 00ff

I cannot print Unicode characters above \u00ff. Instead, I get a '?' character when I run this in Eclipse. Is that expected behaviour?
System.out.println("\u01ff");
Eclipse by default uses the platform default encoding (cp1252 on Windows) to encode characters when saving text-based files and when writing to the standard output stream (as used by System.out). You need to explicitly set it to UTF-8 in order to achieve world domination.
Note that this way you also no longer need those \uXXXX Unicode escapes to represent such characters in your text-based source files.
Those question marks appear because the charset used by the output stream does not support the character being written.
See also:
Unicode - How to get the characters right?
The problem is the encoding used with System.out; by default, it's your platform's native encoding (usually not UTF-8). You can explicitly change the encoding by replacing the stream:
try {
    // Wrap System.out in a PrintStream that encodes as UTF-8, with autoflush on.
    PrintStream ps = new PrintStream(System.out, true, "UTF-8");
    System.setOut(ps);
} catch (UnsupportedEncodingException error) {
    System.err.println(error);
}
After this (barring font or encoding issues with the underlying environment), all Unicode characters should print correctly on System.out.
EDIT Based on the back-and-forth between me and BalusC on his answer, this isn't enough (or even necessary) to get things working inside Eclipse. There you have two problems to solve: using the correct encoding when writing to System.out and then using the correct encoding when presenting the console output in the Eclipse console view. BalusC's answer addresses both of those issues. Unfortunately, it won't address running outside Eclipse. For that, you need to either use my approach above or set the default run-time encoding. This can be done by using the flag -Dfile.encoding=UTF-8 on the command line or setting the environment variable JAVA_TOOL_OPTIONS to include -Dfile.encoding=UTF-8. To run your code successfully outside Eclipse, then that's probably the best approach.
Java can handle those characters just fine. But the output stream will have a specific encoding.
And unless that encoding is something like UTF-8 or UTF-16, it won't be able to encode every character in Unicode. And when it encounters a character it can't represent, it will be replaced with a question mark.
The JVM uses the default file encoding for System.out as well.
On Linux for example, if your $LANG variable is set to de_DE.UTF-8, the value for file.encoding will be derived accordingly, and set to utf-8.
If the JVM cannot derive the setting you want, you can change the file encoding by setting a system property:
java -Dfile.encoding=utf-8 ...
on the command line.
You can do this as well in Eclipse using a Run configuration (context menu - Run as - Run configurations ... - Arguments - VM arguments)
So this works both for the command line and Eclipse, and there is no need to define the encoding explicitly in the source.
If the value is set to
java -Dfile.encoding=iso-8859-1
for example, only a subset of the Unicode characters can be represented, because that character set only supports a limited number of characters. The other ones will turn up as ?.
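The effect of the stream encoding can be observed without a console by pointing a PrintStream at an in-memory buffer and inspecting the bytes it produces. This is a sketch with hypothetical names; the encoding is passed explicitly here, whereas the text above sets it through file.encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class for illustration.
class PrintStreamEncodingDemo {

    // Prints the text through a PrintStream that uses the named encoding
    // and returns the bytes that would have been sent to the console.
    static byte[] printedBytes(String text, String encoding)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, encoding);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+01FF is not in ISO-8859-1, so the stream writes '?' (0x3F) instead.
        System.out.println(hex(printedBytes("\u01ff", "ISO-8859-1"))); // prints 3F
        // UTF-8 encodes the same character as two bytes.
        System.out.println(hex(printedBytes("\u01ff", "UTF-8"))); // prints C7 BF
    }
}
```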
There are two other things to bear in mind:
The device which receives the stream (a terminal, shell etc.) must decode it correctly. In GNOME's terminal, for example, you can set the character encoding in a menu.
The font used by that terminal must contain a glyph for the character.

Why is my Java Charset.defaultCharset() GBK and not Unicode?

Config: Windows 8 (English), JDK 1.7, Eclipse.
I installed a program written by a Chinese developer whose GUI uses Chinese characters, but the text was displayed as square boxes. I searched the internet and found a fix: in the Windows 8 Control Panel, set "language for non-Unicode programs" to "Chinese".
But a problem arises when writing code in Eclipse. We know Java itself uses two-byte Unicode to store char and String. But when I execute the following code:
import java.util.Arrays;
import java.nio.charset.Charset;

public class CharSetTest {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset());
        String s = "哈哈";
        byte[] b3 = s.getBytes("UTF-8");
        System.out.println(b3.length);
        System.out.format("%X %X %X\n", b3[0], b3[1], b3[2]);
        System.out.println(new String(b3));
        byte[] b4 = s.getBytes();
        // Two arguments, so only two format specifiers.
        System.out.format("%X %X\n", b4[0], b4[1]);
    }
}
The output is weird:
GBK //default charset is GBK, not Unicode or UTF-8
3 //this is obvious since a Chinese character is encoded into 3 bytes
E5 93 88 //this is corresponding UTF-8 code number
鍝? //something wrong here
B9 FE //I think s.getBytes() should use Java's default encoding "Unicode", but that is NOT the case
Several questions:
1. What is Java's default charset? Is it Unicode? How does the Java default charset interact with programmers? For example, if Java uses Unicode, then a string "abc" cannot be encoded into other charsets, such as those for Russian, French etc., since they use totally different encoding methods.
2. What does Charset.defaultCharset() return? Does it return my Windows 8 default charset?
3. How does Charset.defaultCharset() return GBK? I didn't set anything related to the default charset in Windows 8 except "language for non-Unicode programs" in the Control Panel.
4. If I declare a String in Java like this: String str = "abc";, I don't understand the charset/encoding process. I first need to type the Java statement on the keyboard. How does the keyboard translate my key presses into the Java Unicode charset? The String str is stored in my .java source code file. What charset is used to store Java source code?
EDIT:
Why do we say "Java uses Unicode to represent char and String"? In my Java program, when should I care about the Unicode thing?
Usually, I only need to care about encoding/decoding with UTF-8 ISO-8859-1 GBK etc. But I never care about Unicode representation of char and String. So how and when should I use the Unicode?
Check the doc: "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system." So no, the default character set is not necessarily Unicode.
In OpenJDK it is determined from the file.encoding property. See also Setting the default Java character encoding?.
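A quick way to see what a particular JVM has picked up is to print both values side by side. This is a small diagnostic sketch with a made-up class name; the output depends entirely on the platform, JDK version and any -Dfile.encoding setting, so no expected output is shown.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for illustration.
class DefaultCharsetReport {

    // Builds a one-line summary of the JVM's default charset settings.
    static String report() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + " file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it once normally and once with java -Dfile.encoding=UTF-8 shows whether the property actually changes the default on your setup.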
The default file.encoding value is fetched (on Windows) using* the GetUserDefaultLCID() function, which corresponds to the setting in the "Regional and language options". That's why Charset.defaultCharset() is returning GBK, because you set the locale to Chinese.
Although the default character set is OS-dependent, strings inside a running Java program are always exposed as UTF-16 code units (string constants in a compiled class file are stored in a modified form of UTF-8).
The encoding of a *.java source code is whatever you specify to the Java compiler, or the OS's default one if not provided. See Java compiler platform file encoding problem.
*: See http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/windows/native/java/lang/java_props_md.c, line 577.
The default character set is the character set that Java will use to convert bytes to chars or Strings (and vice versa) if you don't specify anything else (for example if you create an InputStreamReader and don't pass an explicit charset).
Charset.defaultCharset() returns ... the default char set. What exactly that is is implementation dependent, but usually is just what the OS would use in the same case.
That setting is exactly what your Java installation is using: "Chinese" means that some encoding that handles Chinese characters has to be provided, and GBK matches that just fine.
The encoding of Java source files can be specified when you compile them (using the -encoding parameter). If you don't specify it explicitly, then Java will use the platform default encoding (see #1).
What is JAVA default charset?
It's picked up from the default set in your OS. This could be windows-1252, for example.
Is it Unicode?
This is not a charset. A charset defines how to encode characters as bytes.
How JAVA default charset interact with programmers?
It's the default used when you don't specify a charset.
For example, if Java uses Unicode, then a string "abc" cannot be encoded into other charsets, such as those for Russian, French etc., since they use totally different encoding methods.
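Internally Java uses UTF-16, but you rarely need to know that. It works for most languages without special care; the exception is characters outside the Basic Multilingual Plane (for example some rarer Chinese characters), which occupy two char values and are best handled through the code point methods.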
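The difference between char values and code points only shows up for characters outside the Basic Multilingual Plane. The sketch below, with hypothetical names, compares a common Chinese character with one from the CJK Extension B block.

```java
// Hypothetical demo class for illustration.
class CodePointVsChar {

    // Number of UTF-16 code units (Java char values) in the string.
    static int chars(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u54C8"; // the character from the question, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // CJK Extension B

        System.out.println(chars(bmp) + " " + codePoints(bmp)); // prints 1 1
        System.out.println(chars(supplementary) + " " + codePoints(supplementary)); // prints 2 1
    }
}
```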
What does Charset.defaultCharset() returns?
It does what it appears to do. You can confirm this by reading the javadoc for this method.
Does it return my WIN8's default charset?
Yes, because that is what it is supposed to do. You only have a problem if your OS's character set cannot be mapped into Java or is not correctly mapped into Java. If it is the same, everything is fine.
How Charset.defaultCharset() return GBK. I didn't set anything in my WIN8 related default charset except the one for "language for non-Unicode Programs" in control panel.
It is this because Java thinks you set this for Windows. To correct this, you must have the correct character set in Windows.
If I declare a String in java like: String str = "abc";, I don't know the process of charset/encoding.
For the purposes of this question, there isn't any encoding involved. There are only characters, and they don't need to be encoded to become characters because they already are characters.
How the keyboard translates my key button into Java Unicode charset?
The keyboard doesn't. It only knows which keys you pressed. The OS turns these keys into characters.
The String str is stored in my .java source code file. What is the charset to store java source code?
That is determined by the editor which does the storing. Most likely it will be the OS default again, or if you change it you might make it UTF-8.
I am not sure if this could help. To Change encoding in Eclipse:
--- Project Explorer
--- Right click on Java file
--- Run As
--- Run Configurations
--- Common (tab)
--- Encoding (on Linux it is set to UTF-8 by default)

Turkish character while writing to database (postgresql)

I am working with Java and PostgreSQL on Windows. I have some words which include Turkish characters like İ, ş, ö, ç etc.
In Java I assign the words to a String and try to write it to the database. When I print it in Java, its encoding appears correct and all characters display correctly. However, when it is written to the database, the text appears to get mangled.
I created my database with this command:
CREATE DATABASE dbname ENCODING "UTF-8"
I tried to fix it by converting Turkish characters into the ISO-8859-1 encoding like (İ -> \u0130 , ş -> \u015F)
// \u0130leti\u015Fim = İletişim
String title = "\u0130leti\u015Fim";
String mytitle = new String(title.getBytes("ISO-8859-1"), "UTF-8");
And then I tried to write mytitle to database but it did not work.
Thanks for your advice.
SOLVED: I realized that it could write Turkish characters to the database, but the problem was in the response. I added these lines before writing to the response.
String contentType= "text/html;charset=UTF-8";
response.setContentType(contentType);
response.setCharacterEncoding("utf-8");
After adding this, it works now. I hope I have explained it clearly.
When you call title.getBytes("ISO-8859-1"), you're promising the Java runtime that the characters in the string can be represented as ISO-8859-1 bytes, which is not actually true for either \u0130 or \u015f.
Therefore the conversion to bytes already does something unspecified with your Turkish characters; in practice each one is typically replaced with '?'.
Next, attempting to interpret whichever bytes you get out of it as UTF-8 even though they're really ISO-8859-1 is then guaranteed to make a complete mess of everything that wasn't ASCII to begin with.
(The repertoire of ISO-8859-1 happens to coincide exactly with the Unicode characters that can be written as \u00XX for some XX.)
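The conversion from the question can be tried in isolation. The sketch below uses the Charset overloads of getBytes and the String constructor, whose replacement behaviour is documented, instead of the charset-name overloads from the question; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class WrongConversionDemo {

    // Repeats the conversion from the question: encode as ISO-8859-1,
    // then decode the resulting bytes as if they were UTF-8.
    static String mangle(String title) {
        byte[] latin1 = title.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String title = "\u0130leti\u015Fim";

        // The two Turkish letters are not in ISO-8859-1 and are lost as '?'.
        System.out.println(mangle(title)); // prints ?leti?im

        // A letter that is in ISO-8859-1 survives the first step, but its single
        // byte is not valid UTF-8, so decoding yields the replacement character.
        System.out.println(mangle("\u00FC").equals("\uFFFD")); // prints true
    }
}
```

Either way the original text cannot be recovered afterwards, which is why re-encoding a String that already holds the right characters does not help.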
With encoding issues you have several things to check:
Whether your source file is in the encoding you expect it to be.
How client_encoding is set
What the database encoding is
In the case of Java, PgJDBC requires client_encoding to always be UTF-8 and will choke if you set it to something else, so that's not going to be the issue. You've shown that your database is UTF-8 too. So it seems likely that your Java sources aren't in the same encoding the Java compiler and runtime expect them to be in.
By default javac will interpret your source code in the platform default encoding. If you've saved your sources in a different encoding, weird things will happen. Either:
save your sources in the default encoding for your Windows platform; or
save them in a known encoding such as UTF-8 (without a byte order mark, which javac does not skip) and pass that encoding to javac with the -encoding option.
Then recompile your program. If that doesn't help, you'll need to follow up with more detail, starting with what exactly "it did not work" means, output of SELECTing the data you inserted with Java using psql, etc.
You should create the database like this:
CREATE DATABASE <db name>
    WITH OWNER <owner user name>
    TEMPLATE template0
    ENCODING 'UTF-8'
    LC_COLLATE 'tr_TR.UTF-8'
    LC_CTYPE 'tr_TR.UTF-8';
