"Fix" String encoding in Java - java

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).
Is there a way to convert this String back to the right encoding?
I know it's easy to do if you have access to the original byte array, but in my case it's too late because it's given by a closed-source library.

As there seems to be some confusion on whether this is possible or not I think I'll need to provide an extensive example.
The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").
For this example I'll choose the German word "Bär" (meaning bear) as the input:
byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.
(If your JVM doesn't support that encoding, then you can use ISO-8859-1 instead, because those three letters (and most others) are at the same position in those two encodings).
The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):
String is = new String(ib, "UTF-8");
System.out.println(is);
This obviously produces the incorrect output "B�r" (0xE4 followed by 0x72 is not a valid UTF-8 sequence, so 0xE4 is decoded to the replacement character U+FFFD).
The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.
Now some people claim that getting the UTF-8 encoded bytes from that is will return an array with the same values as the initial array:
byte[] utf8Again = is.getBytes("UTF-8");
But that returns the UTF-8 encoding of the three characters B, � and r, and definitely returns the wrong result when re-interpreted as Windows-1252:
System.out.println(new String(utf8Again, "Windows-1252"));
This line produces the output "Bï¿½r", which is totally wrong (it is also the same output that would result if the initial array had contained the non-word "Bür" instead).
So in this case you can't undo the operation, because some information was lost.
There are in fact cases where such mis-encodings can be undone. It is more likely to work when all possible (or at least all occurring) byte sequences are valid in that encoding. Since UTF-8 has many byte sequences that are simply not valid, you will have problems.
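To make the loss concrete, here is a minimal sketch (my own illustration, not from the question) that uses a strict CharsetDecoder to test whether a byte array is even valid UTF-8. When this returns false, the lossy substitution described above has already happened by the time you hold a String, and the original bytes cannot be recovered:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {

    /** Returns true only if the bytes are strictly valid UTF-8,
     *  i.e. decoding them as UTF-8 loses no information. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of inserting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Bär" in Windows-1252: the byte 0xE4 is not valid inside a UTF-8 stream
        byte[] ib = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
        System.out.println(isValidUtf8(ib)); // prints: false
    }
}
```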

I tried this and it worked for some reason
Code to repair the encoding problem (it doesn't work perfectly, as we will see shortly):
// input is the mis-decoded String handed to us
final Charset fromCharset = Charset.forName("windows-1252");
final Charset toCharset = Charset.forName("UTF-8");
String fixed = new String(input.getBytes(fromCharset), toCharset);
System.out.println(input);
System.out.println(fixed);
The results are:
input: â€¦Und ich beweg mich (aber heut nur langsam)
fixed: …Und ich beweg mich (aber heut nur langsam)
Here's another example:
input: Waun da wuan ned wa (feat. Wolfgang KÃ¼hn)
fixed: Waun da wuan ned wa (feat. Wolfgang Kühn)
Here's what is happening and why the trick above seems to work:
The original file was a UTF-8 encoded text file (comma delimited)
That file was imported with Excel BUT the user mistakenly entered Windows 1252 for the encoding (which was probably the default encoding on his or her computer)
The user thought the import was successful because all of the characters in the ASCII range looked okay.
Now, when we try to "reverse" the process, here is what happens:
// we start with this garbage, two characters we don't want!
String input = "Ã¼";
final Charset cp1252 = Charset.forName("windows-1252");
final Charset utf8 = Charset.forName("UTF-8");
// lets convert it to bytes in windows-1252:
// this gives you 2 bytes: c3 bc
// "Ã" ==> c3
// "¼" ==> bc
byte[] windows1252Bytes = input.getBytes(cp1252);
// but in utf-8, c3 bc is "ü"
String fixed = new String(windows1252Bytes, utf8);
System.out.println(input);
System.out.println(fixed);
The encoding-fixing code above kind of works but fails for the following characters (assuming the only characters used are 1-byte characters from Windows-1252):
char utf-8 bytes | string decoded as cp1252 --> as cp1252 bytes
” e2 80 9d | â€� e2 80 3f
Á c3 81 | Ã� c3 3f
Í c3 8d | Ã� c3 3f
Ï c3 8f | Ã� c3 3f
Ð c3 90 | Ã� c3 3f
Ý c3 9d | Ã� c3 3f
It does work for some of the characters, e.g. these:
Þ c3 9e | Ãž c3 9e Þ
ß c3 9f | ÃŸ c3 9f ß
à c3 a0 | Ã  c3 a0 à
á c3 a1 | Ã¡ c3 a1 á
â c3 a2 | Ã¢ c3 a2 â
ã c3 a3 | Ã£ c3 a3 ã
ä c3 a4 | Ã¤ c3 a4 ä
å c3 a5 | Ã¥ c3 a5 å
æ c3 a6 | Ã¦ c3 a6 æ
ç c3 a7 | Ã§ c3 a7 ç
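The pattern behind the failures: the five bytes 81, 8d, 8f, 90 and 9d are undefined in Windows-1252, so Java decodes them to U+FFFD, which then re-encodes as '?' (3f). A small sketch of my own (not the answer's code) that predicts whether a character survives the round trip:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Cp1252RoundTrip {

    /** True if decoding this char's UTF-8 bytes as Windows-1252 and
     *  re-encoding them as Windows-1252 gives the original bytes back. */
    static boolean survivesMojibake(char c) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] utf8 = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
        String mangled = new String(utf8, cp1252);   // the mistaken Excel import
        byte[] back = mangled.getBytes(cp1252);      // what the "fix" recovers
        return Arrays.equals(utf8, back);
    }

    public static void main(String[] args) {
        System.out.println(survivesMojibake('\u00E4')); // ä -> true: c3 and a4 are both defined in cp1252
        System.out.println(survivesMojibake('\u00C1')); // Á -> false: its second byte 0x81 is undefined in cp1252
    }
}
```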
NOTE - I originally thought this was relevant to your question (and as I was working on the same thing myself I figured I'd share what I've learned), but it seems my problem was slightly different. Maybe this will help someone else.

What you want to do is impossible. Once you have a Java String, the information about the byte array is lost. You may have luck doing a "manual conversion". Create a list of all windows-1252 characters and their mapping to UTF-8. Then iterate over all characters in the string to convert them to the right encoding.
Edit:
As a commenter said, this won't work. When you decode a Windows-1252 byte array as if it were UTF-8, you are bound to get encoding errors. (See here and here).

You can use this tutorial
The charset you need should be defined in rt.jar (according to this)

Related

How to convert character encoding from Shift-JIS to UTF-8 [duplicate]


delete unwanted characters from URL

I have this variable String var = class.getSomething that contains this URL: http://www.google.com§°§#[]|£%/^<>. The output that comes out is: http://www.google.comÂ§Â°Â§#[]|Â£%/^<>. How can I delete those extra Â characters? Thanks!
You could do this; it replaces that character with an empty string, which achieves your purpose:
str = str.replace("Â", "");
With that you replace Â with nothing, getting the result you want.
Use String.replace
var = var.replace("Ã", "");
Specify the charset as UTF-8 to get rid of the unwanted extra chars:
String var = class.getSomething;
var = new String(var.getBytes(), "UTF-8"); // note: getBytes() with no argument uses the platform default charset
Do you really want to delete only that one character, or all invalid characters? Otherwise you can check each character with CharacterUtils.isAsciiPrintable(char ch). However, according to RFC 3986 even fewer characters are allowed in URLs (alphanumerics and "-_.+=!*'()~,:;/?$#&%"; see Characters allowed in a URL).
In any case, you have to create a new String object (like with replace in the answer by Elias MP or putting valid characters one by one into a StringBuilder and convert it to a String) as Strings are immutable in Java.
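If you do want to drop everything outside printable ASCII rather than one specific character, a StringBuilder-based sketch (hypothetical helper name; plain Java instead of the commons-lang CharacterUtils mentioned above):

```java
public class UrlCleaner {

    /** Keeps only printable ASCII (0x20-0x7E); everything else is dropped.
     *  A blunt instrument -- a real URL validator should follow RFC 3986. */
    static String stripNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c <= 0x7E) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Â§" (the mojibake pair) is outside printable ASCII and gets removed
        System.out.println(stripNonAscii("http://www.google.com\u00C2\u00A7")); // prints: http://www.google.com
    }
}
```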
The string in var is output using utf-8, which results in the byte sequence:
c2 a7 c2 b0 c2 a7 23 5b 5d 7c c2 a3 25 2f 5e 3c 3e
This happens to be the iso-8859-1 encoding of the characters as you see them:
Â§Â°Â§#[]|Â£%/^<>
C2 is the encoding for Â.
I'm not sure how the Ã was produced; its encoding is C3.
We need the full code to learn how this happened, and a description how the character encoding for text files on your system is configured.
Modifying the variable var is useless.

Whacky latin1 to UTF8 conversion in JDBC

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters. This behaviour is different from what MySQL's internal functions do.
Character encoding is a rabbit hole that I've been stuck in for the last week, and in the interest of not generating 100 obvious answers I'll demonstrate what's happening with a couple of code examples.
Mysql:
[admin@yarekt ~]$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names latin1' | tail -1 | hexdump -C
00000000 81 0a |..|
00000002
[admin@yarekt ~]$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names utf8' | tail -1 | hexdump -C
00000000 c2 81 0a |...|
00000003
This is pretty obvious and works exactly as expected. 0x81 is an undefined latin1 codepoint. It is represented as \u0081 in UTF8 or c2 81 in hex "on disk".
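The same conversion can be reproduced in plain Java (a sketch for illustration; Java's ISO-8859-1 charset maps all 256 byte values one-to-one, much like the MySQL conversion shown above):

```java
import java.nio.charset.StandardCharsets;

public class Latin1ToUtf8 {
    public static void main(String[] args) {
        // 0x81 decoded as ISO-8859-1 is the control character U+0081 ...
        String s = new String(new byte[] { (byte) 0x81 }, StandardCharsets.ISO_8859_1);
        // ... which UTF-8 encodes as the two bytes c2 81, matching the hexdump above
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b); // prints: c2 81
        }
    }
}
```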
Now the weirdness comes from JDBC, Take this groovy example:
@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("C281") using utf8) as a;' ) { println "$it.a --" }
The output of this query is two bytes, c2 81, as expected. It's pretty easy to understand what's happening here: the MySQL connection defaults to UTF-8, and the unhexed column is also cast to UTF-8 (without conversion, as the source is binary, so the data after CONVERT() is still c2 81).
Now consider this case. The connection is still in UTF-8, as is the default with JDBC. We cast our 0x81 byte as latin1, so hopefully MySQL will convert it to c2 81 like it did in the bash example above.
@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("81") using latin1) as a;' ) { println "$it.a --" }
Running this with groovy latin1_test.groovy | hexdump -C yields this:
00000000 ef bf bd 0a |....|
00000004
ef bf bd is the UTF-8 replacement character, emitted when a UTF-8 conversion has failed.
JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters
Yes, this is the default behavior of CharsetDecoder instances: when the byte input is malformed, they substitute the offending byte sequence with Unicode's replacement character, U+FFFD.
Examples of methods which use this behavior are all Readers, but also the String constructors which take a byte array as an argument. And this is the reason why you should never use String to store binary data!
The only solution to make that an error is to grab the raw byte input, create your own decoder and tell it to fail in that situation...
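A sketch of that approach (my own illustration; in the JDBC case you would first fetch the raw bytes yourself, e.g. with ResultSet.getBytes, instead of letting the driver decode them). A decoder configured with CodingErrorAction.REPORT throws instead of substituting U+FFFD:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {

    /** Decodes bytes as UTF-8, failing loudly on malformed input
     *  instead of silently inserting the replacement character. */
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decodeUtf8Strict(new byte[] { (byte) 0xc2, (byte) 0x81 }).length()); // valid UTF-8: prints 1
        try {
            decodeUtf8Strict(new byte[] { (byte) 0x81 }); // a lone 0x81 is malformed UTF-8
        } catch (CharacterCodingException e) {
            System.out.println("malformed input reported"); // thrown instead of producing ef bf bd
        }
    }
}
```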

Android BLE BluetoothAdapter.LeScanCallback scanRecord length ambiguity

I am using the example project from Google (BluetoothLeGatt) to receive data from a BLE device and trying to read a specific byte within its scanRecord obtained by the onLeScan method.
My problem is that there is a mismatch between the data I am observing on the network and what I see in the logs.
This is on Android 4.3 and using a Samsung Galaxy S4 to test it.
To verify that the scanRecord logs are correct on Android, I am using TI's Packet Sniffer to observe the byte stream being broadcasted by the device, and here it is:
That is 31 bytes of data being broadcasted by the device to the network, and there are no other working devices around.
02 01 1A 1A FF 4C 00 02 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 0C C6 64
On the other hand, the Android logs claim that the data being received has a length of 62 bytes; it matches the sniffed data up to the 29th [0-indexed] byte and has 0s for the rest.
02-12 15:34:09.548: D/DEBUG(26801): len: 62
data:02011a1aff4c000215000000000000000000000000000000000000000cc60000000000000000000000000000000000000000000000000000000000000000
And this is the code piece I used in order to obtain the logs within the LeScanCallback method:
int len = scanRecord.length;
String scanHex = bytesToHex(scanRecord);
Log.d("DEBUG", "len: " + len + " data:" + scanHex);
The method used to convert byte array to hex representation:
private static final char[] hexArray = "0123456789abcdef".toCharArray(); // referenced below but missing from the original snippet

private static String bytesToHex(byte[] bytes) {
    char[] hexChars = new char[bytes.length * 2];
    int v;
    for (int j = 0; j < bytes.length; j++) {
        v = bytes[j] & 0xFF;
        hexChars[j * 2] = hexArray[v >>> 4];
        hexChars[j * 2 + 1] = hexArray[v & 0x0F];
    }
    return new String(hexChars);
}
I used a few other example projects, including Dave Smith's example and RadiusNetworks' Android iBeacon Library, and I ended up with the same results. I can't possibly understand why I receive 62 bytes of data when Packet Sniffer shows (and I also know) that it should be 31 bytes. This would not be my main concern if I were able to read the data in the last byte correctly (I get 00 instead of 64 from Android's BluetoothAdapter), but that is not the case either.
I would appreciate any suggestions about what might potentially be the reason for this mismatch, for both the data (last byte only) and the data size, between what Android receives and what is actually on the network.
Your transmission is malformed, containing 31 bytes of payload data (PDU length of 37) when its internal length fields indicate it should in total contain only 30 bytes (PDU length 36).
Let's take a look at your data
02 01 1a
This is a length (2) of type codes - 01 and 1a, and good so far
1a ff 4c ...
Now we have a problem - the 1a is a length code for this field (manufacturer specific data), value of 26. Yet 27 bytes of data follow it in your case, instead of the proper 26 you have indicated you will provide.
Now, if you have a properly formed packet, you will still get a larger buffer padded with meaningless (likely uninitialized) values following the proper content, but you can simply ignore that by parsing the buffer in accordance with the field-length values and ignoring anything not accounted for in the proclaimed lengths.
But with your current malformed packet, the copying of packet data to the buffer stops at the proclaimed content size, and the unannounced extra byte never makes it into the buffer your program receives - so you see instead something random there, as with the rest of the unused length.
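That length-driven parsing can be sketched like this (a hypothetical helper, not from the question's code; it assumes the standard AD-structure layout of one length byte followed by that many bytes of type + data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AdRecordParser {

    /** Splits a BLE advertising buffer into its length-prefixed AD
     *  structures, ignoring the zero padding after the payload.
     *  Each returned array is [type, data...]. */
    static List<byte[]> parse(byte[] scanRecord) {
        List<byte[]> records = new ArrayList<>();
        int i = 0;
        while (i < scanRecord.length) {
            int len = scanRecord[i] & 0xFF;
            if (len == 0) break;                       // a zero length byte means we hit the padding
            int end = Math.min(i + 1 + len, scanRecord.length);
            records.add(Arrays.copyOfRange(scanRecord, i + 1, end));
            i = end;
        }
        return records;
    }

    public static void main(String[] args) {
        // flags structure, a short manufacturer-specific structure, then padding
        byte[] scanRecord = { 0x02, 0x01, 0x1a, 0x03, (byte) 0xff, 0x4c, 0x00, 0x00, 0x00 };
        System.out.println(parse(scanRecord).size()); // prints: 2 (trailing zeros ignored)
    }
}
```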
Probably, when you made up your all-zeroes "region UUID" (might want to rethink that) you simply typed an extra byte...

Java cannot retrieve Unicode (Lithuanian) letters from Access via JDBC-ODBC

I have a DB where some names are written with Lithuanian letters, but when I try to get them using Java it ignores the Lithuanian letters:
DbConnection();
zadanie=connect.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE,ResultSet.CONCUR_UPDATABLE);
sql="SELECT * FROM Clients;";
dane=zadanie.executeQuery(sql);
String kas="Imonė";
while(dane.next())
{
String var=dane.getString("Pavadinimas");
if (var!= null) {var =var.trim();}
String rus =dane.getString("Rusys");
System.out.println(kas+" "+rus);
}
void DbConnection() throws SQLException
{
String baza="jdbc:odbc:DatabaseDC";
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
}catch(Exception e){System.out.println("Connection error");}
connect=DriverManager.getConnection(baza);
}
In the DB the field type is TEXT, size 20; I don't use any additional letter decoding or anything like that.
It gives me "Imonė Imone" despite the fact that the DB contains "Imonė", which is what rus should equal.
Now that the JDBC-ODBC Bridge has been removed from Java 8 this particular question will increasingly become just an item of historical interest, but for the record:
The JDBC-ODBC Bridge has never worked correctly with the Access ODBC Drivers ("Jet" and "ACE") for Unicode characters above code point U+00FF. That is because Access stores such characters as Unicode but it does not use UTF-8 encoding. Instead, it uses a "compressed" variation of UTF-16LE where characters with code points U+00FF and below are stored as a single byte, while characters above U+00FF are stored as a null byte followed by their UTF-16LE byte pair(s).
If the string 'Imonė' is stored within the Access database so that it appears properly in Access itself
then it is stored as
I m o n ė
-- -- -- -- --------
49 6D 6F 6E 00 17 01
('ė' is U+0117).
The JDBC-ODBC Bridge does not understand what it receives from the Access ODBC driver for that final character, so it just returns
Imon?
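To illustrate my reading of that storage scheme (a sketch of the description above, not Access's actual implementation, and deliberately naive about embedded NULs), a decoder for the "compressed" form would treat a null byte as introducing one UTF-16LE code unit:

```java
public class AccessCompressed {

    /** Decodes the "compressed" storage form described above:
     *  plain bytes are code points <= U+00FF, while a 0x00 marker
     *  introduces a two-byte UTF-16LE code unit. */
    static String decodeCompressed(byte[] b) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < b.length) {
            if (b[i] == 0x00 && i + 2 < b.length) {
                // little-endian: low byte first, then high byte
                sb.append((char) ((b[i + 1] & 0xFF) | ((b[i + 2] & 0xFF) << 8)));
                i += 3;
            } else {
                sb.append((char) (b[i] & 0xFF));
                i += 1;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // the bytes shown above for "Imonė": 49 6D 6F 6E 00 17 01
        byte[] stored = { 0x49, 0x6D, 0x6F, 0x6E, 0x00, 0x17, 0x01 };
        System.out.println(decodeCompressed(stored)); // prints: Imonė
    }
}
```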
On the other hand, if we try to store the string in the Access database with UTF-8 encoding, as would happen if the JDBC-ODBC Bridge attempted to insert the string itself
Statement s = con.createStatement();
s.executeUpdate("UPDATE vocabulary SET word='Imonė' WHERE ID=5");
the string would be UTF-8 encoded as
I m o n ė
-- -- -- -- -----
49 6D 6F 6E C4 97
and then the Access ODBC Driver will store it in the database as
I m o n Ä —
-- -- -- -- -- ---------
49 6D 6F 6E C4 00 14 20
C4 is 'Ä' in Windows-1252 which is U+00C4 so it is stored as just C4
97 is "em dash" in Windows-1252 which is U+2014 so it is stored as 00 14 20
Now the JDBC-ODBC Bridge can retrieve it okay (since the Access ODBC Driver "un-mangles" the character back to C4 97 on the way out), but if we open the database in Access we see
ImonÄ—
The JDBC-ODBC Bridge has never been, and will never be, able to provide full native Unicode support for Access databases. Adding various properties to the JDBC connection will not solve the problem.
For full Unicode character support of Access databases without ODBC, consider using UCanAccess instead. (More details available in another question here.)
As you're using the JDBC-ODBC bridge, you can specify a charset in the connection details.
Try this:
Properties prop = new java.util.Properties();
prop.put("charSet", "UTF-8");
String baza="jdbc:odbc:DatabaseDC";
connect=DriverManager.getConnection(baza, prop);
Try using "Windows-1257" instead of UTF-8; it is the charset for the Baltic region.
java.util.Properties prop = new java.util.Properties();
prop.put("charSet", "Windows-1257");
