I have a problem with the lower and upper functions in JPA (Hibernate). In my application a user can add a new item to the database, but its name must be unique. To enforce that, I need to compare the user-entered string with the strings in the database while ignoring case.
Unfortunately, when I use the Hibernate upper function to upper-case all the data for comparison, everything works fine except for Polish special characters, which remain unchanged.
This is the code I used for testing, to check whether it works:
TypedQuery<String> query = em.createQuery("SELECT upper(i.name) FROM Item i", String.class);
for (String name : query.getResultList())
    System.out.println(name);
And that's what I get:
CZYSTY BANDAż
MAłY CHEMIK
MAłY MECHANIK
SPRZęT
ŚPIWóR
ŚRODEK DEZYNFEKUJąCY
ŚRODEK CZYSZCZąCY
All letters should be upper-cased. In the database, the first letter of the first word is always capitalized. The problem concerns characters like ą, ę, ż, ź, ó, ł: they should become Ą, Ę, Ż, Ź, Ó, Ł, but Hibernate does not seem to recognize them as characters that differ only in case.
The same thing happens when I use the lower function: Polish characters are not affected at all and remain the same.
I do not know whether this concerns only Polish characters or characters from other languages too.
I would be very grateful for any hint in this matter.
EDIT: I'm using Hibernate 5.2.2.Final with an SQLite database and the Xerial 3.8.11.2 driver.
EDIT2: The same happens when I use a native SQL query through Hibernate.
I've already found the solution. It turned out that SQLite does not support Unicode case conversion or collation out of the box: its built-in lower and upper functions, as well as sorting, handle only ASCII Latin characters.
There is an extension (the SQLite ICU extension) that SQLite must be compiled with in order to get Unicode case folding and collation, but that is not as simple a solution as I would like. I decided to switch the database provider to H2, which supports Unicode collation by default without any modifications, and it works like a charm now :)
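For reference, here is a minimal sketch of the switch; the persistence unit name, the H2 file path, and the property overrides are assumptions to adapt to your own setup:

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class H2Bootstrap {
    public static void main(String[] args) {
        // Point the (assumed) persistence unit at H2 instead of SQLite.
        Map<String, String> props = new HashMap<>();
        props.put("javax.persistence.jdbc.driver", "org.h2.Driver");
        props.put("javax.persistence.jdbc.url", "jdbc:h2:./data/items");
        props.put("hibernate.dialect", "org.hibernate.dialect.H2Dialect");
        EntityManagerFactory emf =
                Persistence.createEntityManagerFactory("itemsPU", props);
        // JPQL upper()/lower() now fold Polish characters as expected.
    }
}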
So it's not Hibernate's fault but SQLite's. Thank you very much for your help :)
Related
My current requirement is to store Unicode and other special characters, such as double quotes, in MySQL tables. For that purpose, as many have suggested, I used Apache's StringEscapeUtils.escapeJava() method. The problem is that although this method does replace special characters with their Unicode escapes (\uxxxx), the MySQL table stores them as uxxxx and not \uxxxx. Because of this, StringEscapeUtils.unescapeJava() fails when I try to decode the value fetched from the database (it cannot find the '\').
Here are my questions:
Why is this happening (that is, why does the table drop the '\')?
What is the solution?
Don't use Unicode escape sequences (\uxxxx); use UTF-8.
Don't use any special escaping functions. Instead, declare everything to be UTF-8 (utf8mb4 in MySQL). The backslashes vanish because MySQL treats \ as an escape character inside string literals, so an unescaped \uxxxx arrives as uxxxx.
See Best Practice
(If you are being provided \uxxxx strings, then you are stuck with converting them to UTF-8 first. If your real question is how to convert, then ask it that way.)
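For illustration, a minimal JDBC sketch under those settings; the notes table, column, and credentials are made up, and the body column is assumed to be declared with CHARACTER SET utf8mb4:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Utf8Insert {
    public static void main(String[] args) throws Exception {
        // Declare the connection itself as UTF-8; no escaping needed.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useUnicode=true&characterEncoding=UTF-8";
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO notes (body) VALUES (?)")) {
            // Raw Unicode plus a double quote, stored exactly as-is:
            ps.setString(1, "He said: \"żółć\"");
            ps.executeUpdate();
        }
    }
}

The bound parameter also sidesteps the backslash problem entirely, since no string literal is ever parsed.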
I am new to Android and I'm working on a query in SQLite.
My problem is that accented strings are not matched consistently. For example, with these values:
ÁÁÁ
ááá
ÀÀÀ
ààà
aaa
AAA
If I do:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME LIKE '%a%' ORDER BY MOVIE_NAME;
It returns:
AAA
aaa (ignoring the others)
But if I do:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME LIKE '%à%' ORDER BY MOVIE_NAME;
It returns:
ààà (ignoring the title "ÀÀÀ")
I want to search strings in a SQLite DB while ignoring both accents and case. Please help.
Generally, string comparisons in SQL are controlled by column or expression COLLATE rules. In Android, only three collation sequences are pre-defined: BINARY (default), LOCALIZED and UNICODE. None of them is ideal for your use case, and the C API for installing new collation functions is unfortunately not exposed in the Java API.
To work around this:
Add another column to your table, for example MOVIE_NAME_ASCII
Store values into this column with the accent marks removed. You can remove accents by normalizing your strings to Unicode Normal Form D (NFD) and removing non-ASCII code points since NFD represents accented characters roughly as plain ASCII + combining accent markers:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "");
Do your text searches on this ASCII-normalized column, but display data from the original Unicode column.
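A sketch of the matching search side, assuming db is an open SQLiteDatabase and the column names from above:

import java.text.Normalizer;
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public static Cursor searchMovies(SQLiteDatabase db, String userInput) {
    // Normalize the input exactly like the stored ASCII column.
    String ascii = Normalizer.normalize(userInput, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");
    // LIKE is case-insensitive for ASCII, so case is handled as well.
    return db.rawQuery(
            "SELECT * FROM TB_MOVIE WHERE MOVIE_NAME_ASCII LIKE ? "
                    + "ORDER BY MOVIE_NAME",
            new String[]{"%" + ascii + "%"});
}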
In Android's SQLite, LIKE and GLOB ignore both COLLATE LOCALIZED and COLLATE UNICODE (those collations apply only to ORDER BY). However, there is a solution that does not require adding extra columns to your table. As @asat explains in this answer, you can use GLOB with a pattern that replaces each letter with all the available alternatives of that letter. In Java:
public static String addTildeOptions(String searchText) {
    return searchText.toLowerCase()
            .replaceAll("[aáàäâã]", "\\[aáàäâã\\]")
            .replaceAll("[eéèëê]", "\\[eéèëê\\]")
            .replaceAll("[iíìî]", "\\[iíìî\\]")
            .replaceAll("[oóòöôõ]", "\\[oóòöôõ\\]")
            .replaceAll("[uúùüû]", "\\[uúùüû\\]")
            .replace("*", "[*]")
            .replace("?", "[?]");
}
And then (not literally like this, of course):
SELECT * from table WHERE lower(column) GLOB "*addTildeOptions(searchText)*"
This way, for example in Spanish, a user searching for either mas or más will get the search converted into m[aáàäâã]s, returning both results.
It is important to note that GLOB ignores COLLATE NOCASE, which is why everything is converted to lower case both in the function and in the query. Note also that SQLite's lower() function does not work on non-ASCII characters, but those are probably the ones you are already replacing anyway!
The function also escapes both GLOB wildcards, * and ?, by wrapping them in brackets.
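Putting it together on Android might look roughly like this; TB_MOVIE and MOVIE_NAME are the asker's names, and db is assumed to be an open SQLiteDatabase:

String pattern = "*" + addTildeOptions(searchText) + "*";
// lower() folds only ASCII case, but the accented letters were already
// expanded into character classes by addTildeOptions().
Cursor c = db.rawQuery(
        "SELECT * FROM TB_MOVIE WHERE lower(MOVIE_NAME) GLOB ?",
        new String[]{pattern});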
You can use the Android NDK to recompile the SQLite source with the desired ICU (International Components for Unicode) support.
It is explained in Russian here:
http://habrahabr.ru/post/122408/
The process of compiling SQLite with ICU is explained here:
How to compile sqlite with ICU?
Unfortunately, you will end up with different APKs for different CPU architectures.
You need to look at these not as accented characters, but as entirely different characters; you might as well be looking for a, b, or c. That being said, I would try using a regex for it. It would look something like:
SELECT * FROM TB_MOVIE WHERE MOVIE_NAME REGEXP '.*[aAàÀ].*' ORDER BY MOVIE_NAME;
(Note that SQLite does not ship a REGEXP implementation of its own; a regexp() user function has to be supplied to the database before this will run.)
I have a projects table with a project name, and that name may contain special characters, alphanumeric values, or any combination of numbers, words, and special characters.
Now I need to implement keyword search over it, and the search text may contain special characters as well.
So my question is: how can we search for either single or multiple special characters in the database?
I am using MySQL 5.0 with the Java Hibernate API.
This should be possible with some simple sanitization of your query.
For example, a search for \#(%*#$\ becomes:
SELECT * FROM foo WHERE name LIKE "%\\#(\%*#$\\%";
When evaluated, the backslashes act as escapes, so the search matches anything that contains "\#(%*#$\".
In general, anything that's a special character in a string can be escaped with a backslash. This only really becomes tricky with a name such as "\\foo\\bar\\", which properly escaped becomes "\\\\foo\\\\bar\\\\".
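Since you are on Java, a hedged alternative sketch: let a parameterized query handle the quoting, and escape only the LIKE metacharacters yourself (the project table and name column are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public static ResultSet searchProjects(Connection con, String keyword)
        throws SQLException {
    // Escape the backslash first, then the LIKE wildcards % and _.
    String escaped = keyword.replace("\\", "\\\\")
                            .replace("%", "\\%")
                            .replace("_", "\\_");
    PreparedStatement ps = con.prepareStatement(
            "SELECT * FROM project WHERE name LIKE ? ESCAPE '\\\\'");
    ps.setString(1, "%" + escaped + "%");
    return ps.executeQuery();
}

This way the driver handles quote characters, and only the wildcard semantics of LIKE need manual attention.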
A side note: please proofread your posts before finalizing them. It's really discouraging and shows a lack of effort when a question's title contains spelling errors.
I am searching names using a wildcard query. It works fine in general, but when we search for accented (non-ASCII) characters it does not work well.
For example, when a user searches for "Hélè*", nothing is found.
Note that I have already created an analyzer that applies ASCII folding and lowercasing to the name field.
It works fine when I search with query_string, though. Does that mean the wildcard query does not apply the ASCII folding while query_string does?
If so, is there any way to make wildcards work with ASCII folding?
Any help will be greatly appreciated.
Thanks,
Mohsin
Try using a field query with analyze_wildcard set to true, as sketched below.
By default, Elasticsearch doesn't try to analyze the text in wildcard queries; it only lowercases it for some queries. Because of this, your query searches for all terms that start with hélè, and there are no such terms in your index because of the ASCII folding filter.
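With the Elasticsearch Java client that looks roughly like this (a sketch; the name field is an assumption):

import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.QueryStringQueryBuilder;

// Ask query_string to run the wildcard text through the analyzer,
// so the ASCII folding filter turns "Hélè*" into "hele*".
QueryStringQueryBuilder query = QueryBuilders.queryStringQuery("Hélè*")
        .defaultField("name")
        .analyzeWildcard(true);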
In Solr there is a ReversedWildcardFilterFactory, which is applied at index time. When it is configured and the query contains a wildcard character, the query text is not ASCII-folded; otherwise it is folded and searched as ASCII. You can define it after ASCIIFoldingFilterFactory.
I don't know whether something similar exists in Lucene, but you could write your own filter factory using its source code as a reference.
You may also find this document useful.
We are using PrimeFaces 2.2 (with JSF 2.x in a Java EE 5 project), and we are having trouble sorting strings that start with special characters (e.g. İstanbul, Çankaya, Ödemiş...) correctly in PrimeFaces dataTables, even though we are using UTF-8.
The problem is that words starting with special characters are placed after the words starting with Z, whereas, for example, a city name starting with "İ" (i.e. İstanbul) should normally appear between Ibiza and Jacksonville; instead it ends up after Zurich. This expectation is based on the Turkish (tr_TR) locale.
In selectOneMenus, however, the sorting is performed correctly (as desired above).
Any suggestions for a workaround would be greatly appreciated.
EDIT: This issue relates to a Hibernate (HQL) based sort, not a SQL-based sort.
This should help a bit; here the sort key is compared with the binary UTF-8 collation instead of the column's default (note that CONVERT ... USING takes a character set name, so the collation has to be applied separately):
select k from test order by convert(k using utf8) collate utf8_bin
You cannot sort properly unless you know the language sorting order of the words. And if the words are in mixed languages, there is no single correct sorting order; in that case people usually use the sorting order of the language of the majority of the users/audience.
The exact same characters, with the exact same sounds and the exact same Unicode code points, will sort in different places depending on the language and even the country.
Here is the definition of the Unicode collation algorithm: http://unicode.org/reports/tr10/
Locale-aware collation is a tricky business and therefore probably best handled by dedicated libraries; I'd suggest using ICU. I cannot provide details on how to integrate that with the HQL workflow, so I'd probably try to sort things separately if that is an option.
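If the sort can happen in Java after fetching (rather than in the query), the JDK's built-in collator already knows the Turkish rules; a minimal sketch with an illustrative city list:

import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TurkishSort {
    public static void main(String[] args) {
        List<String> cities = new ArrayList<>(Arrays.asList(
                "Zurich", "İstanbul", "Jacksonville", "Ibiza", "Çankaya"));
        // Collator implements Comparator, so it can drive the sort directly.
        Collator turkish = Collator.getInstance(new Locale("tr", "TR"));
        cities.sort(turkish);
        // İstanbul should now land between Ibiza and Jacksonville.
        System.out.println(cities);
    }
}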