In a MySQL database I have a column containing a varchar string encoded as ISO-8859-1 (latin1_swedish_ci).
When the string is not Latin-1, MySQL stores it, for example, as "à¸à¸µà¹à¸à¸.".
Using Java I need to extract it and convert it to UTF-8.
How can I do it?
Thanks
Do you mean like ...
byte[] inIso_8859_1 = "à¸à¸µà¹à¸à¸.".getBytes("ISO-8859-1");
byte[] inUtf_8 = new String(inIso_8859_1, "ISO-8859-1").getBytes("UTF-8");
To check the UTF-8 encoded bytes:
String s = new String(inUtf_8, "UTF-8");
System.out.println(s);
prints
à¸à¸µà¹à¸à¸.
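The same trick works for the mojibake coming out of the database: re-encode the misdecoded string with ISO-8859-1 to recover the original bytes, then decode those bytes as UTF-8. A minimal sketch, using "café" as a stand-in for the Thai text:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeFix {
    public static void main(String[] args) {
        // Simulate what the database hands back: UTF-8 bytes misdecoded as Latin-1
        String original = "café";
        String mojibake = new String(original.getBytes(StandardCharsets.UTF_8),
                                     StandardCharsets.ISO_8859_1); // "cafÃ©"

        // The fix: re-encode with the charset that was wrongly used to decode,
        // then decode with the charset the bytes were really in
        String repaired = new String(mojibake.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(repaired); // prints café
    }
}
```

This only works because ISO-8859-1 maps every byte value to a character, so the round trip loses nothing.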
I want to save a protocol-buffers object via a String in Java,
but when I use ByteString with the UTF_8 charset, the parse result is not correct.
public static void test2() throws InvalidProtocolBufferException {
    CrcCertInfoRequest data = CrcCertInfoRequest.newBuilder().setCompanyType(222).build();
    Charset charset = StandardCharsets.UTF_8;
    String proStr = data.toByteString().toString(charset);
    ByteString bs2 = ByteString.copyFrom(proStr, charset);
    String json = ObjectMapperUtils.toJSON(data);
    System.out.println("proStr=" + proStr.length() + "json=" + json.length());
    System.out.println(ObjectMapperUtils.toJSON(CrcCertInfoRequest.parseFrom(bs2)));
    System.out.println(ObjectMapperUtils.toJSON(ObjectMapperUtils.fromJSON(json, CrcCertInfoRequest.class)));
}
code output:
proStr=3json=119
{"appId":0,"createSource":0,"certType":0,"accountType":0,"companyType":3104751,"industryCategory1":0,"industryCategory2":0}
{"appId":0,"createSource":0,"certType":0,"accountType":0,"companyType":222,"industryCategory1":0,"industryCategory2":0}
The integer field companyType parses incorrectly: it is supposed to be 222 but comes back as 3104751.
I tried other charsets; with ISO_8859_1 it works, but I'm not sure it is always safe.
The protobuf version is protobuf-java-3.16.1.jar.
The Java version is jdk1.8.0_171.jdk.
How can I save and parse protobuf data using a String in Java?
ByteString is an immutable sequence of bytes and is not an actual String. Interpreting the bytes as UTF-8 does not work because it's not UTF-8 data. It's also not ISO_8859_1 or any other String encoding even if the parsing is lenient enough to not throw an error.
how can I save and parse protobuf data using string in java?
Convert the raw bytes to Base64.
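A Base64 round trip preserves every byte, unlike a charset round trip. A sketch using java.util.Base64 (the byte array here is a stand-in for data.toByteString().toByteArray(); any raw bytes work, including values that are invalid UTF-8):

```java
import java.util.Arrays;
import java.util.Base64;

public class ProtoAsString {
    public static void main(String[] args) {
        // Stand-in for data.toByteString().toByteArray()
        byte[] raw = {8, (byte) 0xDE, 1};

        // Safe to store anywhere text is expected (JSON, XML, a DB varchar, ...)
        String asString = Base64.getEncoder().encodeToString(raw);
        byte[] restored = Base64.getDecoder().decode(asString);

        System.out.println(Arrays.equals(raw, restored)); // prints true
    }
}
```

On the protobuf side you would then call CrcCertInfoRequest.parseFrom(restored) on the decoded bytes.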
I am quite perplexed about why I should not encode Unicode text with UTF-8 for comparison when the other text (to compare) has been encoded with UTF-8.
I wanted to compare text (アクセス拒否, meaning "Access denied") stored in an external file encoded as UTF-8 with a constant string declared in a .java file as
public static final String ACCESS_DENIED_IN_JAPANESE = "\u30a2\u30af\u30bb\u30b9\u62d2\u5426"; // means Access denied
The Java file was encoded as Cp1252.
I read the file as an input stream using the code below. Note that I am using UTF-8 for decoding.
InputStream in = new FileInputStream("F:\\sample.txt");
int b1;
byte[] bytes = new byte[4096];
int i = 0;
while (true) {
    b1 = in.read();
    if (b1 == -1)
        break;
    bytes[i++] = (byte) b1;
}
String japTextFromFile = new String(bytes, 0, i, Charset.forName("UTF-8"));
Now when I compare:
System.out.println(ACCESS_DENIED_IN_JAPANESE.equals(japTextFromFile)); // result is `true`, and works fine
but when I encode ACCESS_DENIED_IN_JAPANESE with UTF-8 and try to compare it with japTextFromFile, the result is false. The code is
String encodedAccessDenied = new String(ACCESS_DENIED_IN_JAPANESE.getBytes(),Charset.forName("UTF-8"));
System.out.println(encodedAccessDenied.equals(japTextFromFile)); // result is `false`
So my doubt is: why does the above comparison fail when both strings are the same and have been encoded with UTF-8? The result should be true.
However, in the first case, comparing strings under different encodings (UTF-16 being Java's internal representation, UTF-8 for the file) gives true, which I think should be false since the encodings differ, no matter that the text we read is the same.
Where am I wrong in my understanding? Any clarification is greatly appreciated.
ACCESS_DENIED_IN_JAPANESE.getBytes() does not use UTF-8. It uses your platform's default charset. But then you use UTF-8 to turn those bytes back into a String. This gets you a different String from the one you started with.
Try this:
String encodedAccessDenied = new String(ACCESS_DENIED_IN_JAPANESE.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
System.out.println(encodedAccessDenied.equals(japTextFromFile)); // result is `true`
The best way I know is to put all static texts into a text file encoded with UTF-8, and then read those resources with an InputStreamReader whose charset is set to UTF-8 (FileReader does not accept an encoding parameter before Java 11).
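A sketch of reading such a UTF-8 resource with an explicit charset (the file name is hypothetical):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUtf8Resource {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the stream; the charset is explicit,
        // so the result does not depend on the platform default
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("messages.txt"), StandardCharsets.UTF_8))) {
            String japTextFromFile = reader.readLine();
            System.out.println("\u30a2\u30af\u30bb\u30b9\u62d2\u5426".equals(japTextFromFile));
        }
    }
}
```

This also replaces the manual byte-by-byte loop from the question with buffered, line-oriented reading.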
In Java, how can I convert a string containing Unicode characters to its UTF-8 percent-escaped form, e.g. from Rüppell's_Vulture to R%c3%bcppell's_Vulture?
String decoded = URLDecoder.decode("R%c3%bcppell's_Vulture", "UTF-8");
String encoded = URLEncoder.encode("Rüppell's_Vulture", "UTF-8");
With %xx escapes it is URL (percent) encoding.
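Checked end to end (note that URLEncoder emits uppercase hex digits and also escapes the apostrophe, so the re-encoded form differs slightly from the lowercase form in the question):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class PercentEncoding {
    public static void main(String[] args) throws Exception {
        // %c3%bc are the two UTF-8 bytes of ü
        String decoded = URLDecoder.decode("R%c3%bcppell's_Vulture", "UTF-8");
        System.out.println(decoded); // prints Rüppell's_Vulture

        // URLEncoder leaves only a-z, A-Z, 0-9, '.', '-', '*' and '_' unescaped
        String encoded = URLEncoder.encode(decoded, "UTF-8");
        System.out.println(encoded); // prints R%C3%BCppell%27s_Vulture
    }
}
```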
Copy it to a byte array with getBytes("UTF-8"), like this:
byte[] utf = s.getBytes("UTF-8");
I do not know a way of doing it with Strings alone (a String's internal encoding is fixed).
My code:
private static String convertToBase64(String string)
{
    final byte[] encodeBase64 =
        org.apache.commons.codec.binary.Base64.encodeBase64(string.getBytes());
    System.out.println(Hex.encodeHexString(encodeBase64));
    final byte[] data = string.getBytes();
    final String encoded =
        javax.xml.bind.DatatypeConverter.printBase64Binary(data);
    System.out.println(encoded);
    return encoded;
}
Now I call convertToBase64("stackoverflow"); and get the following result:
6333526859327476646d56795a6d787664773d3d
c3RhY2tvdmVyZmxvdw==
Why do I get different results?
Hex.encodeHexString hex-encodes whatever bytes you give it; here you pass it the already Base64-encoded bytes, so the first line is the hex representation of the Base64 text. The second line is the Base64 string printed directly.
From the API doc of Base64.encodeBase64():
byte[] containing Base64 characters in their UTF-8 representation.
So instead of
System.out.println(Hex.encodeHexString(encodeBase64));
you should write
System.out.println(new String(encodeBase64, "UTF-8"));
BTW: you should never use String.getBytes() without an explicit encoding, because the result depends on the platform default encoding (usually "Cp1252" on Windows and "UTF-8" on Linux).
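A self-contained way to see the difference, using java.util.Base64 and an explicit charset instead of the Commons Codec and JAXB helpers above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64VsHex {
    public static void main(String[] args) {
        byte[] data = "stackoverflow".getBytes(StandardCharsets.UTF_8);
        byte[] base64Bytes = Base64.getEncoder().encode(data);

        // Interpreting the Base64 bytes as text gives the expected string
        System.out.println(new String(base64Bytes, StandardCharsets.UTF_8)); // c3RhY2tvdmVyZmxvdw==

        // Hex-encoding those same bytes prints their byte values, not the text
        StringBuilder hex = new StringBuilder();
        for (byte b : base64Bytes) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // 6333526859327476646d56795a6d787664773d3d
    }
}
```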
For converting a string, I am converting it into a byte array as follows:
byte[] nameByteArray = cityName.getBytes();
To convert back, I did: String retrievedString = new String(nameByteArray); which obviously doesn't work. How would I convert it back?
What characters are there in your original city name? Try UTF-8 version like this:
byte[] nameByteArray = cityName.getBytes("UTF-8");
String retrievedString = new String(nameByteArray, "UTF-8");
which obviously doesn't work.
Actually that's exactly how you do it. The only thing that can go wrong is that you're implicitly using the platform default encoding, which could differ between systems, and might not be able to represent all characters in the string.
The solution is to explicitly use an encoding that can represent all characters, such as UTF-8:
byte[] nameByteArray = cityName.getBytes("UTF-8");
String retrievedString = new String(nameByteArray, "UTF-8");
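Since Java 7 you can also pass StandardCharsets.UTF_8 instead of the charset name, which avoids both the checked UnsupportedEncodingException and the risk of a typo in the name. A minimal round trip (the city name is just an example with non-ASCII characters):

```java
import java.nio.charset.StandardCharsets;

public class CityRoundTrip {
    public static void main(String[] args) {
        String cityName = "São Paulo"; // non-ASCII characters survive the round trip
        byte[] nameByteArray = cityName.getBytes(StandardCharsets.UTF_8);
        String retrievedString = new String(nameByteArray, StandardCharsets.UTF_8);
        System.out.println(cityName.equals(retrievedString)); // prints true
    }
}
```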