i have DB where some names are written with Lithuanian letters, but when I try to get them using java it ignores Lithuanian letters
DbConnection();
zadanie=connect.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE,ResultSet.CONCUR_UPDATABLE);
sql="SELECT * FROM Clients;";
dane=zadanie.executeQuery(sql);
String kas="Imonė";
while(dane.next())
{
String var=dane.getString("Pavadinimas");
if (var!= null) {var =var.trim();}
String rus =dane.getString("Rusys");
System.out.println(kas+" "+rus);
}
void DbConnection() throws SQLException
{
String baza="jdbc:odbc:DatabaseDC";
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
}catch(Exception e){System.out.println("Connection error");}
connect=DriverManager.getConnection(baza);
}
in DB type of field is TEXT, size 20, don't use any additional letter decoding or something like this.
it gives me " Imonė Imone " despite that in DB is written "Imonė" which equals rus.
Now that the JDBC-ODBC Bridge has been removed from Java 8 this particular question will increasingly become just an item of historical interest, but for the record:
The JDBC-ODBC Bridge has never worked correctly with the Access ODBC Drivers ("Jet" and "ACE") for Unicode characters above code point U+00FF. That is because Access stores such characters as Unicode but it does not use UTF-8 encoding. Instead, it uses a "compressed" variation of UTF-16LE where characters with code points U+00FF and below are stored as a single byte, while characters above U+00FF are stored as a null byte followed by their UTF-16LE byte pair(s).
If the string 'Imonė' is stored within the Access database so that it appears properly in Access itself
then it is stored as
I m o n ė
-- -- -- -- --------
49 6D 6F 6E 00 17 01
('ė' is U+0117).
The JDBC-ODBC Bridge does not understand what it receives from the Access ODBC driver for that final character, so it just returns
Imon?
On the other hand, if we try to store the string in the Access database with UTF-8 encoding, as would happen if the JDBC-ODBC Bridge attempted to insert the string itself
Statement s = con.createStatement();
s.executeUpdate("UPDATE vocabulary SET word='Imonė' WHERE ID=5");
the string would be UTF-8 encoded as
I m o n ė
-- -- -- -- -----
49 6D 6F 6E C4 97
and then the Access ODBC Driver will store it in the database as
I m o n Ä —
-- -- -- -- -- ---------
49 6D 6F 6E C4 00 14 20
C4 is 'Ä' in Windows-1252 which is U+00C4 so it is stored as just C4
97 is "em dash" in Windows-1252 which is U+2014 so it is stored as 00 14 20
Now the JDBC-ODBC Bridge can retrieve it okay (since the Access ODBC Driver "un-mangles" the character back to C4 97 on the way out), but if we open the database in Access we see
ImonÄ—
The JDBC-ODBC Bridge has never and will never be able to provide full native Unicode support for Access databases. Adding various properties to the JDBC connection will not solve the problem.
For full Unicode character support of Access databases without ODBC, consider using UCanAccess instead. (More details available in another question here.)
As you're using the JDBC-ODBC bridge, you can specify a charset in the connection details.
Try this:
Properties prop = new java.util.Properties();
prop.put("charSet", "UTF-8");
String baza="jdbc:odbc:DatabaseDC";
connect=DriverManager.getConnection(baza, prop);
Try to use this "Windows-1257" instead of UTF-8, this is for Baltic region.
java.util.Properties prop = new java.util.Properties();
prop.put("charSet", "Windows-1257");
Related
I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but it my case it's too late because it's given by a closed source library.
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�".
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252");
This line produces the output "B�", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It's more likely to work, when all possible (or at least occuring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid values, you will have problems.
I tried this and it worked for some reason
Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: …Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang Kühn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
The original file was a UTF-8 encoded text file (comma delimited)
That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
The user thought the import was successful because all of the characters in the ASCII range looked okay.
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "ü";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// lets convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
bytes[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding fixing code above kind of works but fails for the following characters:
(Assuming the only characters used 1 byte characters from Windows 1252):
char utf-8 bytes | string decoded as cp1252 --> as cp1252 bytes
” e2 80 9d | â€� e2 80 3f
Á c3 81 | Ã� c3 3f
Í c3 8d | Ã� c3 3f
Ï c3 8f | Ã� c3 3f
Рc3 90 | � c3 3f
Ý c3 9d | Ã� c3 3f
It does work for some of the characters, e.g. these:
Þ c3 9e | Þ c3 9e Þ
ß c3 9f | ß c3 9f ß
à c3 a0 | Ã c3 a0 à
á c3 a1 | á c3 a1 á
â c3 a2 | â c3 a2 â
ã c3 a3 | ã c3 a3 ã
ä c3 a4 | ä c3 a4 ä
å c3 a5 | Ã¥ c3 a5 å
æ c3 a6 | æ c3 a6 æ
ç c3 a7 | ç c3 a7 ç
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit:
As a commenter said this won't work. When you convert a Windows-1252 byte array as it if was UTF-8 you are bound to get encoding exceptions. (See here and here).
You can use this tutorial
The charset you need should be defined in rt.jar (according to this)
RFC 8410 lists this as an example of a Ed25519 public key: MCowBQYDK2VwAyEAGb9ECWmEzf6FQbrBZ9w7lshQhqowtrbLDFw4rXAxZuE=
Decoding this with an ASN.1 decoder, this becomes:
30 2A
30 05
06 03 2B6570 // Algorithm Identifier
03 21 0019BF44096984CDFE8541BAC167DC3B96C85086AA30B6B6CB0C5C38AD703166E1
As expected, this matches the SubjectPublicKeyInfo definition in the RFC.
Using the Sun cryptography provider in Java 11+ I can use this code to generate an X25519 (not Ed25519 - which is the difference in the algorithm identifier below) public key:
import java.security.KeyPairGenerator;
import java.util.Base64;
public class PrintPublicKey {
public static void main(String args[]) throws Exception {
KeyPairGenerator generator = KeyPairGenerator.getInstance("X25519");
byte[] encodedPublicKey = generator.generateKeyPair().getPublic().getEncoded();
System.out.println(Base64.getEncoder().encodeToString(encodedPublicKey));
}
}
Which will output something like: MCwwBwYDK2VuBQADIQDlXKI/cMoICnQRrV+4c//viHnXMoB190/z2MX/otJQQw==
Decoding this with an ASN.1 decoder, this becomes:
30 2C
30 07
06 03 2B656E // Algorithm Identifier
05 00 // Algorithm Parameters - NULL
03 21 00E55CA23F70CA080A7411AD5FB873FFEF8879D7328075F74FF3D8C5FFA2D25043
This has an explicit NULL after the object identifier. Is this valid according to the specification? It says:
In this document, we define four new OIDs for identifying the different curve/algorithm pairs: the curves being curve25519 and curve448 and the algorithms being ECDH and EdDSA in pure mode.
For all of the OIDs, the parameters MUST be absent.
The paragraph after the one you quoted says this:
It is possible to find systems that require the parameters to be
present. This can be due to either a defect in the original 1997
syntax or a programming error where developers never got input where
this was not true. The optimal solution is to fix these systems;
where this is not possible, the problem needs to be restricted to
that subsystem and not propagated to the Internet.
So a plausible explanation for the Oracle implementation's behavior is that they want to be interoperable with old systems that require parameters. It is the kind of thing that you do to prevent big customers with large support contracts from complaining loudly that "upgrading to Java 11 broke my infrastructure".
I have this variable String var = class.getSomething that contains this url http://www.google.com§°§#[]|£%/^<> .The output that comes out is this: http://www.google.comç°§#[]|£%/^<>. How can i delete that Ã? Thanks!
You could do this, it replaces any character for empty getting your purpouse.
str = str.replace("Â", "");
With that you will replace  for nothing, getting the result you want.
Use String.replace
var = var.replace("Ã", "");
specify the charset as UTF-8 to get rid of unwanted extra chars :
String var = class.getSomething;
var = new String(var.getBytes(),"UTF-8");
Do you really want to delete only that one character or all invalid characters? Otherwise you can check each character with CharacterUtils.isAsciiPrintable(char ch). However, according to RFC 3986 even fewer character are allowed in URLs (alphanumerics and "-_.+=!*'()~,:;/?$#&%", see Characters allowed in a URL).
In any case, you have to create a new String object (like with replace in the answer by Elias MP or putting valid characters one by one into a StringBuilder and convert it to a String) as Strings are immutable in Java.
The string in var is output using utf-8, which results in the byte sequence:
c2 a7 c2 b0 c2 a7 23 5b 5d 7c c2 a3 25 2f 5e 3c 3e
This happens to be the iso-8859-1 encoding of the characters as you see them:
§ ° §#[]| £%/^<>
ç°§#[]|£%/^<>
C2 is the encoding for Â.
I'm not sure how the à was produced; it's encoding is C3.
We need the full code to learn how this happened, and a description how the character encoding for text files on your system is configured.
Modifying the variable var is useless.
JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters. This behaviour is different from what MySQL's internal functions do.
Character encoding is a rabbit hole that i've been stuck in for the last week, and in interest of not generating 100 obvious answers i'll demonstrate whats happening with a couple of code examples.
Mysql:
[admin#yarekt ~]$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names latin1' | tail -1| hexdump -C
00000000 81 0a |..|
00000002
[admin#yarekt ~]$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names utf8' | tail -1| hexdump -C
00000000 c2 81 0a |...|
00000003
This is pretty obvious and works exactly as expected. 0x81 is an undefined latin1 codepoint. It is represented as \u0081 in UTF8 or c2 81 in hex "on disk".
Now the weirdness comes from JDBC, Take this groovy example:
#GrabConfig(systemClassLoader=true)
#Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("C281") using utf8) as a;' ) { println "$it.a --" }
The output of this query is two bytes, c2 81 as expected. Its pretty easy to understand whats happening here. The Mysql Connection is defaulting to UTF8. The unhexxed column is also cast to UTF8 (without encoding, as the source is binary, the data after CONVERT() is still c2 81).
Now consider this case. The connection is still in UTF8, as is default with JDBC. we cast our 0x81 byte as latin1, so hopefully mysql will convert it to c2 81 like it did in the bash example above.
#GrabConfig(systemClassLoader=true)
#Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("81") using latin1) as a;' ) { println "$it.a --" }
Running this with groovy latin1_test.groovy | hexdump -C yields this:
00000000 ef bf bd 0a |....|
00000004
ef bf bd is a utf8 replacement char. A char used when a utf8 conversion has failed.
JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters
Yes, this is the default behavior of CharsetDecoder instances which by default, when the (byte) input is malformed, will perform a substitution of this unmappable byte sequence with Unicode's replacement character, U+FFFD.
Examples of methods which use this behavior are all Readers but also String constructors which take a byte array as an argument. And this is the reason why you should never use String to store binary data!
The only solution to make that an error is to grab the raw byte input, create your own decoder and tell it to fail in that situation...
I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but it my case it's too late because it's given by a closed source library.
As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�".
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252");
This line produces the output "B�", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It's more likely to work, when all possible (or at least occuring) byte sequences are valid in that encoding. Since UTF-8 has several byte sequences that are simply not valid values, you will have problems.
I tried this and it worked for some reason
Code to repair encoding problem (it doesn't work perfectly, which we will see shortly):
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: …Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang Kühn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
The original file was a UTF-8 encoded text file (comma delimited)
That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
The user thought the import was successful because all of the characters in the ASCII range looked okay.
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "ü";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// lets convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
bytes[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding fixing code above kind of works but fails for the following characters:
(Assuming the only characters used 1 byte characters from Windows 1252):
char utf-8 bytes | string decoded as cp1252 --> as cp1252 bytes
” e2 80 9d | â€� e2 80 3f
Á c3 81 | Ã� c3 3f
Í c3 8d | Ã� c3 3f
Ï c3 8f | Ã� c3 3f
Рc3 90 | � c3 3f
Ý c3 9d | Ã� c3 3f
It does work for some of the characters, e.g. these:
Þ c3 9e | Þ c3 9e Þ
ß c3 9f | ß c3 9f ß
à c3 a0 | Ã c3 a0 à
á c3 a1 | á c3 a1 á
â c3 a2 | â c3 a2 â
ã c3 a3 | ã c3 a3 ã
ä c3 a4 | ä c3 a4 ä
å c3 a5 | Ã¥ c3 a5 å
æ c3 a6 | æ c3 a6 æ
ç c3 a7 | ç c3 a7 ç
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.
What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit:
As a commenter said this won't work. When you convert a Windows-1252 byte array as it if was UTF-8 you are bound to get encoding exceptions. (See here and here).
You can use this tutorial
The charset you need should be defined in rt.jar (according to this)