JDBC seems to insert a UTF-8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters. This behaviour differs from what MySQL's internal functions do.
Character encoding is a rabbit hole that I've been stuck in for the last week, and in the interest of not generating a hundred obvious answers I'll demonstrate what's happening with a couple of code examples.
MySQL:
[admin@yarekt ~]$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names latin1' | tail -1 | hexdump -C
00000000 81 0a |..|
00000002
[admin@yarekt ~]$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names utf8' | tail -1 | hexdump -C
00000000 c2 81 0a |...|
00000003
This is pretty obvious and works exactly as expected: 0x81 is an undefined latin1 codepoint, represented as \u0081 in UTF-8, i.e. the bytes c2 81 "on disk".
Now the weirdness comes from JDBC. Take this Groovy example:
@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("C281") using utf8) as a;' ) { println "$it.a --" }
The output of this query is two bytes, c2 81, as expected. It's pretty easy to understand what's happening here: the MySQL connection defaults to UTF-8, and the unhexed column is also cast to UTF-8 (without re-encoding; as the source is binary, the data after CONVERT() is still c2 81).
Now consider this case. The connection is still in UTF-8, as is the default with JDBC. We cast our 0x81 byte as latin1, so MySQL should convert it to c2 81, as it did in the bash example above.
@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("81") using latin1) as a;' ) { println "$it.a --" }
Running this with groovy latin1_test.groovy | hexdump -C yields this:
00000000 ef bf bd 0a |....|
00000004
ef bf bd is the UTF-8 encoding of the Unicode replacement character, U+FFFD, which is used when a conversion has failed.
JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters
Yes, this is the default behavior of CharsetDecoder instances, which, when the (byte) input is malformed or unmappable, substitute the offending byte sequence with Unicode's replacement character, U+FFFD.
Examples of methods which use this behavior are all Readers, but also the String constructors which take a byte array as an argument. And this is the reason why you should never use String to store binary data!
The only way to make that an error instead is to grab the raw byte input, create your own decoder, and tell it to fail in that situation...
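For illustration, a minimal self-contained sketch of that approach (class and variable names are mine; the byte values are the c2 81 / 81 examples from the question). With JDBC you would first grab the raw bytes, e.g. via ResultSet.getBytes() instead of getString(), and then decode them yourself:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) {
        // A decoder that throws instead of silently substituting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        byte[] ok  = { (byte) 0xC2, (byte) 0x81 };  // valid UTF-8 for U+0081
        byte[] bad = { (byte) 0x81 };               // malformed as UTF-8

        try {
            // prints 129, i.e. U+0081 decoded successfully
            System.out.println((int) strict.decode(ByteBuffer.wrap(ok)).get());
            // throws MalformedInputException instead of yielding ef bf bd
            strict.decode(ByteBuffer.wrap(bad));
        } catch (CharacterCodingException e) {
            System.out.println("decoding failed: " + e);
        }
    }
}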
Related
I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but in my case it's too late because it's given by a closed-source library.
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�r".
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the three characters B, � and r, and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252"));
This line produces the output "Bï¿½r", which is totally wrong (it is also the same output that would result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in that encoding. Since UTF-8 has many byte sequences that are simply not valid, you will have problems.
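As a hedged illustration of that point (helper and class names are mine, not from the answer): once the wrong decode has substituted U+FFFD, the original bytes are gone, so a quick heuristic for reversibility is to look for the replacement character:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReversibilityCheck {
    // If the wrong decode already substituted U+FFFD, the original
    // bytes are unrecoverable; no re-encoding trick can bring them back.
    static boolean looksReversible(String misDecoded) {
        return misDecoded.indexOf('\uFFFD') < 0;
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // cp1252 bytes wrongly decoded as UTF-8 (the "Bär" case): lossy.
        byte[] ib = { 0x42, (byte) 0xE4, 0x72 };
        System.out.println(looksReversible(new String(ib, StandardCharsets.UTF_8))); // false

        // UTF-8 bytes wrongly decoded as cp1252 ("ü" becomes "Ã¼"): reversible.
        byte[] utf8 = "\u00FC".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksReversible(new String(utf8, cp1252))); // true
    }
}

This also foreshadows the Excel example below: UTF-8 bytes mis-read as cp1252 usually decode to real (if ugly) characters, which is why that direction is often reversible.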
I tried this and it worked for some reason
Code to repair the encoding problem (it doesn't work perfectly, as we'll see shortly):
import java.nio.charset.Charset;

// "input" is the mis-decoded String handed back by the library
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: â€¦Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
The original file was a UTF-8 encoded text file (comma delimited)
That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
The user thought the import was successful because all of the characters in the ASCII range looked okay.
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "Ã¼";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// let's convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
byte[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding-fixing code above kind of works, but fails for the following characters (assuming the only characters used are single-byte characters from Windows-1252):
char utf-8 bytes | string decoded as cp1252 --> as cp1252 bytes
” e2 80 9d | â€� e2 80 3f
Á c3 81 | Ã� c3 3f
Í c3 8d | Ã� c3 3f
Ï c3 8f | Ã� c3 3f
Ð c3 90 | Ã� c3 3f
Ý c3 9d | Ã� c3 3f
It does work for some of the characters, e.g. these:
Þ c3 9e | Ãž c3 9e Þ
ß c3 9f | ÃŸ c3 9f ß
à c3 a0 | Ã  c3 a0 à
á c3 a1 | Ã¡ c3 a1 á
â c3 a2 | Ã¢ c3 a2 â
ã c3 a3 | Ã£ c3 a3 ã
ä c3 a4 | Ã¤ c3 a4 ä
å c3 a5 | Ã¥ c3 a5 å
æ c3 a6 | Ã¦ c3 a6 æ
ç c3 a7 | Ã§ c3 a7 ç
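A small test harness (my own sketch, not part of the original answer) reproduces this table by round-tripping exactly the characters listed above:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // The characters from the table above; the first six are lost,
        // the rest survive. (Save this source file as UTF-8.)
        String sample = "”ÁÍÏÐÝÞßàáâãäåæç";
        for (int i = 0; i < sample.length(); i++) {
            String original = sample.substring(i, i + 1);
            // simulate the bad Excel import: UTF-8 bytes read as cp1252
            String garbled = new String(original.getBytes(StandardCharsets.UTF_8), cp1252);
            // the "fix": cp1252 bytes re-read as UTF-8
            String fixed = new String(garbled.getBytes(cp1252), StandardCharsets.UTF_8);
            System.out.println(original + " -> " + garbled + " -> " + fixed
                    + (fixed.equals(original) ? "  (recovered)" : "  (LOST)"));
        }
    }
}

The failures are exactly the characters whose second UTF-8 byte (81, 8d, 8f, 90 or 9d) is undefined in Windows-1252, so the cp1252 decode already replaced it with U+FFFD.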
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit:
As a commenter said, this won't work. When you convert a Windows-1252 byte array as if it were UTF-8, you are bound to get encoding exceptions. (See here and here.)
You can use this tutorial
The charset you need should be defined in rt.jar (according to this)
I have this variable String var = class.getSomething that contains this url http://www.google.com§°§#[]|£%/^<>. The output that comes out is this: http://www.google.comç°§#[]|£%/^<>. How can I delete that Ã? Thanks!
You could do this; it replaces the character with an empty string, which achieves your purpose.
str = str.replace("Â", "");
With that you replace Â with nothing, getting the result you want.
Use String.replace
var = var.replace("Ã", "");
Specify the charset as UTF-8 to get rid of the unwanted extra chars (note that getBytes() with no argument uses the platform default charset, so this only helps when that default matches the data):
String var = class.getSomething;
var = new String(var.getBytes(),"UTF-8");
Do you really want to delete only that one character, or all invalid characters? Otherwise you can check each character with CharacterUtils.isAsciiPrintable(char ch). Note that, according to RFC 3986, even fewer characters are allowed in URLs (alphanumerics and "-_.+=!*'()~,:;/?$#&%"; see Characters allowed in a URL).
In any case, you have to create a new String object (such as with replace in the answer by Elias MP, or by appending valid characters one by one to a StringBuilder and converting it to a String), since Strings are immutable in Java.
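For illustration, a minimal sketch of that StringBuilder approach (class and method names are mine; the character test mirrors what CharacterUtils.isAsciiPrintable does, i.e. 0x20 through 0x7E):

public class UrlCleaner {
    // Same test as Commons Lang's CharacterUtils.isAsciiPrintable.
    static boolean isAsciiPrintable(char ch) {
        return ch >= 32 && ch < 127;
    }

    // Builds a new String keeping only printable ASCII, so mojibake
    // such as Â or Ã (and other non-ASCII characters) is dropped.
    static String stripNonAsciiPrintable(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (isAsciiPrintable(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://www.google.com
        System.out.println(stripNonAsciiPrintable("http://www.google.com\u00C2\u00A7"));
    }
}

Note that this drops every non-ASCII character, not just the stray Â, which matches the RFC 3986 observation above.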
The string in var is output using utf-8, which results in the byte sequence:
c2 a7 c2 b0 c2 a7 23 5b 5d 7c c2 a3 25 2f 5e 3c 3e
This happens to be the iso-8859-1 encoding of the characters as you see them:
Â§Â°Â§#[]|Â£%/^<>
C2 is the encoding for Â.
I'm not sure how the Ã was produced; its encoding is C3.
We need the full code to learn how this happened, and a description of how the character encoding for text files on your system is configured.
Modifying the variable var is useless.
I've been having this problem for a long time; I've searched the internet many times for a solution and tried lots of them, but haven't found an adequate one.
I really don't know what to do, so if you could please help me I'd be very thankful.
(Sorry for my poor English.)
Question: How can I solve the charset incompatibility between the input file and a MySQL table?
Problem: When importing the file from my computer the information appears in my database, but some chars such as 'ã', 'ç', 'á' are shown as ?.
Additional information
I'm using MySQL; my version and variable status are:
MySQL VERSION : 5.5.10
HOST : localhost
USER : root
PORT : 3306
SERVER DEFAULT CHARSET : utf8
character_set_client : utf8
character_set_connection : utf8
character_set_database : utf8
character_set_filesystem : BINARY
character_set_results : utf8
character_set_server : utf8
character_set_system : utf8
collation_connection : utf8_general_ci
collation_database : utf8_general_ci
collation_server : utf8_general_ci
completion_type : NO_CHAIN
concurrent_insert : AUTO
The query that's being used is:
LOAD DATA LOCAL INFILE 'xxxxx/file.txt'
INTO TABLE xxxxTable
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
( status_ordenar,numero,newstatus,rede,data_emissao,inicio,termino,tempo_indisp
, cli_afet,qtd_cli_afet,cod_encerr,uf_ofensor,localidades,clientes_afetados
, especificacao,equipamentos,area_ofens,descricao_encerr,criticidade,cod_erro
, observacao,id_falha_perc,id_falha_conf,nba,solucao,falhapercebida,falhaconfirmada
, resp_i,resp_f,resp_ue,pre_handover,falha_identificada,report_netcool,tipo_falha
, num_notificacao,equip_afetados,descricao)
About the file being imported:
I've opened the file with OpenOffice using the following charsets:
UTF8 - Gave me strange chars in place of the 'ç', 'ã', etc...
ISO-8859-1 - OK.
WIN-1252 - OK.
ASCII/US - OK.
Already tested: I've tried several charsets for my database (latin1, utf-8, ascii), but all of them gave me the same result (? instead of 'á', 'ç', etc.).
Extra: I'm using Java with JDBC to generate and send the query.
file.txt is saved as ISO-8859-1 or Windows-1252 (these two are very similar), but is being interpreted as UTF-8 by MySQL. These are incompatible.
How can I tell?
See point 3.: the file displays correctly when interpreted as ISO-8859-1 or Windows-1252.
See point 1.: character_set_database : utf8
Solution: either convert the file to UTF-8, or tell MySQL to interpret it as ISO-8859-1 or Windows-1252.
Background: the characters you provide ('ã' etc.) are single-byte values in Windows-1252, and those bytes are illegal values in UTF-8, so MySQL substitutes each of them with a '?'.
Snippet from MySQL docs:
LOAD DATA INFILE Syntax
The character set indicated by the character_set_database system variable is used to interpret the information in the file.
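A hedged JDBC sketch of the second option (file path and table name are from the question; connection details are borrowed from the examples earlier on this page, and the allowLoadLocalInfile property is my assumption, as some Connector/J versions disable LOCAL INFILE by default):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadWithCharset {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/test?allowLoadLocalInfile=true", "root", "");
             Statement st = con.createStatement()) {
            // CHARACTER SET latin1 tells MySQL how to interpret the file,
            // overriding character_set_database (utf8 here).
            // Column list omitted for brevity; see the question's query.
            st.execute("LOAD DATA LOCAL INFILE 'xxxxx/file.txt' "
                     + "INTO TABLE xxxxTable "
                     + "CHARACTER SET latin1 "
                     + "FIELDS TERMINATED BY ';' "
                     + "LINES TERMINATED BY '\\n' "
                     + "IGNORE 1 LINES");
        }
    }
}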
I saved your characters with standard Windows Notepad as a UTF-8 file (Notepad++ is also OK).
Exact file content:
'ã', 'ç', 'á'
MySQL version: 5.5.22
Database charset: utf8
Database collation: utf8_general_ci
CREATE TABLE `abc` (
`qwe` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I imported the data with this command:
LOAD DATA LOCAL INFILE 'C:/test/utf8.txt'
INTO TABLE abc
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
( qwe)
Result (displayed in SQLyog):
So, first, check the original file with a reliable editor (Notepad, Notepad++). If the file is corrupted, you should get a fresh copy.
Second, if the file is OK, add the Java code you use for sending the data to MySQL to your question.
I have a DB where some names are written with Lithuanian letters, but when I try to get them using Java it ignores the Lithuanian letters:
DbConnection();
zadanie = connect.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_UPDATABLE);
sql = "SELECT * FROM Clients;";
dane = zadanie.executeQuery(sql);
String kas = "Imonė";
while (dane.next())
{
    String var = dane.getString("Pavadinimas");
    if (var != null) { var = var.trim(); }
    String rus = dane.getString("Rusys");
    System.out.println(kas + " " + rus);
}

void DbConnection() throws SQLException
{
    String baza = "jdbc:odbc:DatabaseDC";
    try
    {
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
    } catch (Exception e) { System.out.println("Connection error"); }
    connect = DriverManager.getConnection(baza);
}
In the DB the field type is TEXT, size 20; I don't use any additional letter decoding or anything like that.
It gives me "Imonė Imone", even though the DB contains "Imonė", which is the value read into rus.
Now that the JDBC-ODBC Bridge has been removed in Java 8, this particular question will increasingly be of historical interest only, but for the record:
The JDBC-ODBC Bridge has never worked correctly with the Access ODBC Drivers ("Jet" and "ACE") for Unicode characters above code point U+00FF. That is because Access stores such characters as Unicode but it does not use UTF-8 encoding. Instead, it uses a "compressed" variation of UTF-16LE where characters with code points U+00FF and below are stored as a single byte, while characters above U+00FF are stored as a null byte followed by their UTF-16LE byte pair(s).
If the string 'Imonė' is stored within the Access database so that it appears properly in Access itself
then it is stored as
I m o n ė
-- -- -- -- --------
49 6D 6F 6E 00 17 01
('ė' is U+0117).
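To make that scheme concrete, here is a toy decoder for the storage format exactly as described above (an illustration of this description only, with names of my choosing; the real Jet/ACE format has more to it):

public class CompressedUnicodeDemo {
    // A plain byte is a character at or below U+00FF; a 0x00 marker is
    // followed by one UTF-16LE code unit for characters above U+00FF.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b == 0x00 && i + 2 < data.length) {
                int lo = data[i + 1] & 0xFF;
                int hi = data[i + 2] & 0xFF;
                sb.append((char) (lo | (hi << 8)));  // little-endian pair
                i += 3;
            } else {
                sb.append((char) b);                 // single-byte character
                i += 1;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] stored = { 0x49, 0x6D, 0x6F, 0x6E, 0x00, 0x17, 0x01 };
        System.out.println(decode(stored));  // prints "Imonė"
    }
}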
The JDBC-ODBC Bridge does not understand what it receives from the Access ODBC driver for that final character, so it just returns
Imon?
On the other hand, if we try to store the string in the Access database with UTF-8 encoding, as would happen if the JDBC-ODBC Bridge attempted to insert the string itself
Statement s = con.createStatement();
s.executeUpdate("UPDATE vocabulary SET word='Imonė' WHERE ID=5");
the string would be UTF-8 encoded as
I m o n ė
-- -- -- -- -----
49 6D 6F 6E C4 97
and then the Access ODBC Driver will store it in the database as
I m o n Ä —
-- -- -- -- -- ---------
49 6D 6F 6E C4 00 14 20
C4 is 'Ä' in Windows-1252 which is U+00C4 so it is stored as just C4
97 is "em dash" in Windows-1252 which is U+2014 so it is stored as 00 14 20
Now the JDBC-ODBC Bridge can retrieve it okay (since the Access ODBC Driver "un-mangles" the character back to C4 97 on the way out), but if we open the database in Access we see
ImonÄ—
The JDBC-ODBC Bridge has never been able, and will never be able, to provide full native Unicode support for Access databases. Adding various properties to the JDBC connection will not solve the problem.
For full Unicode character support of Access databases without ODBC, consider using UCanAccess instead. (More details available in another question here.)
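For example, a minimal UCanAccess connection sketch (the file path is a placeholder; the vocabulary table and word column are borrowed from the UPDATE example above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UcanaccessDemo {
    public static void main(String[] args) throws Exception {
        // UCanAccess is a pure-Java JDBC driver for Access files, so no
        // ODBC layer is involved and Unicode text survives intact.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:ucanaccess://C:/path/to/Database.accdb");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT word FROM vocabulary WHERE ID=5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));  // e.g. "Imonė"
            }
        }
    }
}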
As you're using the JDBC-ODBC bridge, you can specify a charset in the connection details.
Try this:
Properties prop = new java.util.Properties();
prop.put("charSet", "UTF-8");
String baza="jdbc:odbc:DatabaseDC";
connect=DriverManager.getConnection(baza, prop);
Try using "Windows-1257" instead of UTF-8; this is the charset for the Baltic region.
java.util.Properties prop = new java.util.Properties();
prop.put("charSet", "Windows-1257");