Storing Unicode and special characters in MySQL tables - java

My current requirement is to store Unicode and other special characters, such as double quotes, in MySQL tables. For that purpose, as many have suggested, I use Apache's StringEscapeUtils.escapeJava() method. The problem is that although this method does replace special characters with their Unicode escapes (\uxxxx), the MySQL table stores them as uxxxx rather than \uxxxx. Because of this, StringEscapeUtils.unescapeJava() fails when I try to decode the value fetched from the database (it cannot find the '\').
Here are my questions:
Why is this happening (that is, why does the table drop the '\')?
What is the solution for this?

Don't store Unicode "codepoints" (\uxxxx); store UTF-8 text.
Don't use any special escaping functions. Instead declare that everything is UTF-8 (utf8mb4 in MySQL).
See Best Practice
(If you are being handed \uxxxx sequences, then you are stuck converting them to UTF-8 first. If your real question is how to convert, then ask it that way.)
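A minimal sketch of that approach (the table, column, and credentials here are made up): declare the column as utf8mb4, tell Connector/J to send UTF-8 on the wire, and let a PreparedStatement handle quoting, so nothing needs manual escaping.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Utf8Demo {
    public static void main(String[] args) throws Exception {
        // characterEncoding=UTF-8 makes Connector/J talk utf8mb4 to modern MySQL servers
        String url = "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8";
        try (Connection con = DriverManager.getConnection(url, "user", "pass")) {
            // Assumes: CREATE TABLE messages (body VARCHAR(255))
            //          CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO messages (body) VALUES (?)")) {
                // Raw text, including double quotes and a non-BMP character;
                // the driver transmits it as-is, no escapeJava() needed
                ps.setString(1, "He said \"naïve\" and 😀");
                ps.executeUpdate();
            }
        }
    }
}

With this setup, escapeJava()/unescapeJava() are unnecessary and the double quotes round-trip untouched.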

Related

Hibernate functions lower and upper not working on Polish special characters

I have a problem with the lower and upper functions in JPA (Hibernate). In my application a user should add a new item to the database, but the name should be unique. To achieve that, I need to compare the user-entered string with the strings in the database, ignoring case.
Unfortunately, when I use the Hibernate function to upper-case all data (in order to compare it), everything works fine except for the Polish special characters, which remain unchanged.
This is the code I've used for testing purposes in order to check if it works:
TypedQuery<String> query =
    em.createQuery("SELECT upper(i.name) FROM Item i", String.class);
for (String name : query.getResultList()) {
    System.out.println(name);
}
And that's what I get:
CZYSTY BANDAż
MAłY CHEMIK
MAłY MECHANIK
SPRZęT
ŚPIWóR
ŚRODEK DEZYNFEKUJąCY
ŚRODEK CZYSZCZąCY
All letters should be upper-cased. In the database the first letter of the first word is always capitalized. The problem concerns characters like ą, ę, ż, ź, ó, ł: they should become Ą, Ę, Ż, Ź, Ó, Ł, but Hibernate does not seem to recognize them as characters that differ only in case.
The same thing happens when I use the lower function. Polish characters are not affected at all and remain the same.
I do not know whether this concerns only Polish characters or those from other languages too.
I would be very grateful for any hint in this matter.
EDIT: I'm using Hibernate 5.2.2 Final with SQLite database and driver Xerial 3.8.11.2.
EDIT2: The same happens if I try to achieve this using a native SQL query with Hibernate.
I've already found the solution. It turned out that SQLite doesn't support Unicode collation: its lower and upper functions and its sorting only handle ASCII Latin characters.
There is an extension (the SQLite ICU Extension) that SQLite must be compiled with in order to use Unicode collation (or other collations), but as far as I'm concerned that is not as simple a solution as I would like. I've decided to change the database provider to H2, which supports Unicode collation by default without any modifications, and it works like a charm now :)
So it's not Hibernate's fault, but SQLite's. Thank you very much for your help :)
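If switching databases had not been an option, one workaround would have been to do the case folding on the Java side, where String.toUpperCase(Locale) is fully Unicode-aware; a minimal sketch (the strings are illustrative):

import java.util.Locale;

public class PolishUpperDemo {
    public static void main(String[] args) {
        Locale polish = new Locale("pl", "PL");
        // java.lang.String handles Polish characters: ł -> Ł, ą -> Ą, etc.
        System.out.println("mały chemik".toUpperCase(polish));  // MAŁY CHEMIK
        // Case-insensitive uniqueness check done in Java instead of in SQL
        String candidate = "Mały Chemik";
        String existing = "MAŁY CHEMIK";
        boolean duplicate =
            candidate.toUpperCase(polish).equals(existing.toUpperCase(polish));
        System.out.println(duplicate);  // true
    }
}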

Getting UTF-8 character issue in Oracle and Java

I have a table in Oracle with a column of type NVARCHAR2. In this column I am trying to save Russian characters, but it shows ¿¿¿¿¿¿¿¿¿¿¿ instead. When I try to fetch the same characters from Java, I get the same garbled string.
I have tried NCHAR and VARCHAR2 as well, but the issue is the same in all cases.
Is this a problem with Java or with Oracle? I have tried the same Java code and the same characters against PostgreSQL, and that works fine, so I cannot tell whether the problem is in Oracle or in Java.
In my Oracle database the NLS_NCHAR_CHARACTERSET property value is AL16UTF16.
Any idea how I can retrieve the UTF-8 characters saved in Oracle exactly as they are in Java?
The problem with characters is that you cannot trust your eyes. Maybe the database stores the correct character values but your viewing tool does not understand them. Or maybe the characters get converted somewhere along the way due to language settings.
To find out which character values are actually stored in your table, use the dump function. Example:
select dump(mycolumn),mycolumn from mytable;
This will give you the byte values of your data and you can check whether or not the values in your table are as they should be.
After doing some googling I have resolved my issue. Here is the answer: AL32UTF8 is the Oracle Database character set that is appropriate for XMLType data. It is equivalent to the IANA-registered standard UTF-8 encoding, which supports all valid XML characters.
In other words, set the AL32UTF8 character set when creating the Oracle database.
Here is the link:
http://docs.oracle.com/cd/B19306_01/server.102/b14231/create.htm#i1008322
You need to specify useUnicode=true&characterEncoding=UTF-8 when getting the connection from the database. Also make sure the column supports UTF-8 encoding; go through the Oracle documentation.
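For NVARCHAR2 columns specifically, the JDBC 4.0 N-methods tell the driver to use the national character set (some Oracle driver versions may also need the connection property oracle.jdbc.defaultNChar=true). A minimal sketch, with hypothetical table, column, and connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NVarcharDemo {
    public static void main(String[] args) throws Exception {
        // Assumes: CREATE TABLE people (name NVARCHAR2(100));
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "user", "pass")) {
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO people (name) VALUES (?)")) {
                ps.setNString(1, "Привет");  // setNString marks the bind as NCHAR data
                ps.executeUpdate();
            }
            try (PreparedStatement ps =
                     con.prepareStatement("SELECT name FROM people");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getNString(1));  // should print Привет intact
                }
            }
        }
    }
}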

Replacing Java unicode encodings with actual characters

When I make web queries, accented characters come back as escape strings such as "\u00f3", but I need to replace them with the actual character, like "ó", before making another query.
How can I handle all of these cases without looking for each one individually?
It seems you're handling JSON-formatted data.
Use any of the many freely available JSON libraries to handle this (and other parsing issues) for you instead of trying to do it manually.
The one from JSON.org is pretty widely used, but there are surely others that work just as well.
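A sketch with the JSON.org library (the field name and payload are made up): the parser decodes the \uXXXX escapes for you during parsing.

import org.json.JSONObject;

public class UnescapeDemo {
    public static void main(String[] args) {
        // Raw response body as received from the web query
        String body = "{\"city\":\"C\\u00f3rdoba\"}";
        // JSONObject decodes the \u00f3 escape while parsing
        JSONObject json = new JSONObject(body);
        System.out.println(json.getString("city"));  // Córdoba
    }
}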

Are escape sequences preserved in CLOB?

We are using Java and Oracle for development.
I have a table in an Oracle database which has a CLOB column. Some XYZ application dumps a text file into this column. The text file has multiple rows.
Is it possible that while reading the same CLOB content through a Java application, the escape sequences (newline characters, etc.) get lost?
The reason I ask is that we are going to parse this content line by line, and if the line breaks are lost we would be in trouble. I would have done this analysis myself, but I am on vacation and my team needs urgent help.
I would really appreciate any thoughts/inputs.
You need to ensure that you use one correct and consistent character encoding throughout the whole process. I strongly recommend picking UTF-8 for that; it covers every character of every known human script. Every step that involves handling character data should be instructed to use that same encoding.
In the SQL context, ensure that the DB and table are created with the UTF-8 charset. In the JDBC context, ensure that the JDBC driver uses UTF-8; this is often configurable via the JDBC connection string. In the Java code context, ensure that you use UTF-8 when reading/writing character data from/to streams; you can specify it as the 2nd constructor argument of InputStreamReader and OutputStreamWriter.
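A minimal sketch of the Java-side step, passing an explicit charset to both writer and reader (the file name is made up):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Utf8StreamsDemo {
    public static void main(String[] args) throws Exception {
        // Charset given explicitly, never relying on the platform default
        try (OutputStreamWriter out = new OutputStreamWriter(
                new FileOutputStream("data.txt"), StandardCharsets.UTF_8)) {
            out.write("naïve łódź\n");
        }
        try (InputStreamReader in = new InputStreamReader(
                new FileInputStream("data.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                System.out.print((char) c);
            }
        }
    }
}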
A CLOB stores character data. Carriage returns and line feeds are valid characters, though unprintable ones. As long as your XYZ app is correctly filling your CLOBs, the contents should be just as manageable to you as if they had come from the file.
Depending on the platform and the nature of said "XYZ app," lines could be separated by \r (old Mac OS), \r\n (DOS/Windows), or \n (Unix/Linux), and you should make allowance for this if necessary. This is one aspect where BufferedReader.readLine() is more convenient, as it transparently hides this difference for you, as the sketch below shows.
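A minimal sketch of reading a CLOB line by line through JDBC (the table and column names are hypothetical):

import java.io.BufferedReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClobLinesDemo {
    static void printLines(Connection con) throws Exception {
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT doc FROM documents WHERE id = 1");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                // getCharacterStream streams the CLOB; BufferedReader.readLine()
                // handles \n, \r\n and \r line endings transparently
                try (BufferedReader reader =
                         new BufferedReader(rs.getCharacterStream("doc"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }
}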
I'm not 100% sure what you mean by escape sequences in this context. Within a (for example) Java literal string, "\n" is an escape sequence representing a newline, but once that string is output into something (say, a database), it is no longer an escape sequence; it's an actual newline character.
Anyhow, to your direct question: Java can read text from Oracle CLOBs perfectly fine. Newlines are not lost.

DB2 database using Unicode

I have a problem with DB2 databases that should store Unicode characters. The connection is established using JDBC.
What do I have to do if I would like to insert a unicode string into the database?
INSERT INTO my_table(id, string_field) VALUES(1, N'my unicode string');
or
INSERT INTO my_table(id, string_field) VALUES(1, 'my unicode string');
I don't know whether I have to use the N prefix or not. For most databases out there it works fine, but I am not sure about DB2. I also do not have a DB2 database at hand where I could test these statements. :-(
Thanks a lot!
The documentation on constants (as of DB2 9.7) says this about graphic strings:
A graphic string constant specifies a varying-length graphic string consisting of a sequence of double-byte characters that starts and ends with a single-byte apostrophe ('), and that is preceded by a single-byte G or N. The characters between the apostrophes must represent an even number of bytes, and the length of the graphic string must not exceed 16 336 bytes.
I have never heard of this in the context of DB2. Google tells me that this is more of an MS SQL Server thing. In DB2, and every other decent RDBMS, you only need to ensure that the database uses the UTF-8 charset. You normally specify that in the CREATE statement. Here's the DB2 variant:
CREATE DATABASE my_db USING CODESET UTF-8;
That should be it on the DB2 side. You don't need to change the standard SQL statements for that. You also don't need to worry about Java, as it internally already uses Unicode.
Enclosing the Unicode string constant in N'...' worked for DB2 through a JDBC application.
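That said, through JDBC the question often disappears entirely: bind the value as a parameter instead of embedding it as a literal, and the driver handles the encoding. A minimal sketch reusing the my_table example above:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class Db2InsertDemo {
    static void insert(Connection con, String unicodeText) throws Exception {
        // A bound parameter needs no N'' prefix; the driver sends the
        // string in the connection's (UTF-8) encoding
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO my_table (id, string_field) VALUES (?, ?)")) {
            ps.setInt(1, 1);
            ps.setString(2, unicodeText);
            ps.executeUpdate();
        }
    }
}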
