I want to save a protocol-buffers object as a String in Java,
but when I convert the ByteString to a String with UTF_8, the parsed result is not correct:
public static void test2() throws InvalidProtocolBufferException {
    CrcCertInfoRequest data = CrcCertInfoRequest.newBuilder().setCompanyType(222).build();
    Charset charset = StandardCharsets.UTF_8;
    String proStr = data.toByteString().toString(charset);
    ByteString bs2 = ByteString.copyFrom(proStr, charset);
    String json = ObjectMapperUtils.toJSON(data);
    System.out.println("proStr=" + proStr.length() + "json=" + json.length());
    System.out.println(ObjectMapperUtils.toJSON(CrcCertInfoRequest.parseFrom(bs2)));
    System.out.println(ObjectMapperUtils.toJSON(ObjectMapperUtils.fromJSON(json, CrcCertInfoRequest.class)));
}
code output:
proStr=3json=119
{"appId":0,"createSource":0,"certType":0,"accountType":0,"companyType":3104751,"industryCategory1":0,"industryCategory2":0}
{"appId":0,"createSource":0,"certType":0,"accountType":0,"companyType":222,"industryCategory1":0,"industryCategory2":0}
The integer field companyType is parsed incorrectly: it is supposed to be 222 but comes back as 3104751.
I tried other charsets; ISO_8859_1 seems to work, but I'm not sure it is always safe.
The protobuf version is protobuf-java-3.16.1.jar.
The Java version is jdk1.8.0_171.
How can I save and parse protobuf data using a String in Java?
ByteString is an immutable sequence of bytes and is not an actual String. Interpreting the bytes as UTF-8 does not work because it's not UTF-8 data. It's also not ISO_8859_1 or any other String encoding even if the parsing is lenient enough to not throw an error.
How can I save and parse protobuf data using a String in Java?
Convert the raw bytes to Base64.
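A minimal sketch of the round trip, assuming the same CrcCertInfoRequest message as above, its standard generated getter, and java.util.Base64 (available since Java 8):
    // Message -> bytes -> Base64 text (safe to store or transport as a String)
    CrcCertInfoRequest data = CrcCertInfoRequest.newBuilder().setCompanyType(222).build();
    String base64 = Base64.getEncoder().encodeToString(data.toByteArray());

    // Base64 text -> bytes -> message
    CrcCertInfoRequest parsed = CrcCertInfoRequest.parseFrom(Base64.getDecoder().decode(base64));
    System.out.println(parsed.getCompanyType()); // prints 222
Unlike a charset round trip, Base64 maps every possible byte value to a printable character and back, so no information is lost.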
Related
I have this string
"=?UTF-8?B?VGLNBGNDQA==?="
that I need to decode into a standard Java String.
I wrote this quick and dirty main to decode it, but I'm having trouble.
String s = "=?UTF-8?B?VGLNBGNDQA==?=";
s = s.split("=\\?UTF-8\\?B\\?")[1].split("\\?=")[0];
System.out.println(s);
byte[] decoded = Base64.getDecoder().decode(s);
String x = new String(decoded, "UTF8");
System.out.println(decoded);
System.out.println(x);
It is actually printing a strange string
"TbοΏ½cC#"
I do not know what the text behind the encoded string is, but I assume my program works, since I can convert any other encoded string without problems, for example
"=?UTF-8?B?SGlfR3V5cyE="
That is "Hi_Guys!".
Should I assume that string is malformed?
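One way to tell is to decode the Base64 payload with a strict CharsetDecoder, which throws on malformed bytes instead of silently substituting U+FFFD; a sketch:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Utf8Check {
    public static void main(String[] args) {
        byte[] decoded = Base64.getDecoder().decode("VGLNBGNDQA==");
        try {
            // REPORT makes the decoder fail instead of inserting replacement characters
            String text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(decoded))
                    .toString();
            System.out.println("Valid UTF-8: " + text);
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8 - the encoded-word payload is malformed");
        }
    }
}
If this reports a malformed sequence, the payload is not valid UTF-8 and the encoded-word itself is broken, not your decoder.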
I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10πππππππ’π10πππππππ’ππ
I would like to convert those characters to UTF-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?
The Apache commons-text library has a StringEscapeUtils class with an unescapeHtml4() utility method:
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml().
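A minimal, self-contained sketch, assuming the commons-text dependency is on the classpath; the numeric entities here are just hypothetical placeholders, not taken from the question:
import org.apache.commons.text.StringEscapeUtils;

public class UnescapeDemo {
    public static void main(String[] args) {
        // HTML numeric character references are resolved back to the characters they encode
        String htmlStr = "caf&eacute; &#120146;&#120147;";   // hypothetical input
        String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
        System.out.println(utf8Str);   // prints "café 𝕒𝕓"
    }
}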
@Bohemian's code is correct; it works for me. Your unescaped string is 10πππππππ’π10πππππππ’ππ.
Now, I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your String is incorrectly encoded, i.e. it encodes the high surrogate and the low surrogate of each supplementary character separately, instead of encoding the whole codepoint (it seems the original string is a UTF-16 encoded string, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once you have your String unescaped by StringEscapeUtils.unescapeHtml(htmlStr) (which unescapes your string successfully despite it being encoded incorrectly), it doesn't make much sense to talk about "string encodings", as Java Strings are "unaware" of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from a String encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
And do whatever you need with that byte array.
Now, if what you need is to write a UTF-8 encoded string to a file, instead of using that byte array you can specify the encoding when you create the java.io.Writer.
Try this code to unescape your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
    String str = "10πππππππ’π10πππππππ’ππ";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (Writer output = new OutputStreamWriter(
            new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
        output.write(javaString);
    }
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
    String str = "10πππππππ’π10πππππππ’ππ";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
        for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
            output.write(b);
        }
    }
}
I need to pass Base64-encoded data into XML as a string value. I noticed that the code below prints different string representations. Which one is correct, and why?
String example = "Hello universe!";
byte[] base64data = Base64.encodeBase64(example.getBytes());
System.out.println(new String(base64data));
System.out.println(DatatypeConverter.printBase64Binary(base64data));
System.out.println(new String(Base64.decodeBase64(base64data), "UTF-8"));
And what I get as a result:
SGVsbG8gdW5pdmVyc2Uh
U0dWc2JHOGdkVzVwZG1WeWMyVWg=
Hello universe!
U0dWc2JHOGdkVzVwZG1WeWMyVWg= decoded is SGVsbG8gdW5pdmVyc2Uh, which is Hello universe! encoded. So you did the encoding twice.
There is no difference between the two Base64 APIs; you are just using them the wrong way. Don't encode the already encoded data again.
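A sketch of the single-pass version, assuming the same commons-codec Base64 and javax.xml.bind.DatatypeConverter imports as in the question; both lines then print SGVsbG8gdW5pdmVyc2Uh:
    String example = "Hello universe!";
    byte[] raw = example.getBytes(StandardCharsets.UTF_8);

    // Encode the raw bytes exactly once
    System.out.println(new String(Base64.encodeBase64(raw), StandardCharsets.US_ASCII));
    System.out.println(DatatypeConverter.printBase64Binary(raw));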
In a MySQL database I have a column that contains a varchar string encoded with ISO-8859-1 (latin1_swedish_ci).
When the string is not Latin-1, MySQL stores it, for example, as "à ¸à ¸¡à ¹à ¸à ¸."
Using Java I need to extract it and convert it to UTF-8.
Do you know how can I do it?
Thanks
Do you mean like ...
byte[] inIso_8859_1 = "à ¸à ¸¡à ¹à ¸à ¸.".getBytes("ISO-8859-1");
byte[] inUtf_8 = new String(inIso_8859_1, "ISO-8859-1").getBytes("UTF-8");
To check the UTF-8 encoded bytes:
String s = new String(inUtf_8, "UTF-8");
System.out.println(s);
prints
à ¸à ¸¡à ¹à ¸à ¸.
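If the column actually holds UTF-8 bytes that the JDBC driver decoded as Latin-1 (the usual cause of values like the one above), a common fix is to re-interpret the bytes; a sketch, with the helper name being hypothetical:
import java.nio.charset.StandardCharsets;

public class Latin1ToUtf8 {
    // The driver read UTF-8 bytes as ISO-8859-1, so get those bytes back
    // and re-decode them as UTF-8 to recover the original text.
    static String fixMojibake(String misread) {
        byte[] rawBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }
}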
My code:
private static String convertToBase64(String string)
{
    final byte[] encodeBase64 =
        org.apache.commons.codec.binary.Base64.encodeBase64(string.getBytes());
    System.out.println(Hex.encodeHexString(encodeBase64));

    final byte[] data = string.getBytes();
    final String encoded =
        javax.xml.bind.DatatypeConverter.printBase64Binary(data);
    System.out.println(encoded);
    return encoded;
}
Now I'm calling it: convertToBase64("stackoverflow"); and I get the following result:
6333526859327476646d56795a6d787664773d3d
c3RhY2tvdmVyZmxvdw==
Why do I get different results?
I think Hex.encodeHexString encodes your string to hex code, while the second one is a normal String.
From the API doc of Base64.encodeBase64():
byte[] containing Base64 characters in their UTF-8 representation.
So instead
System.out.println(Hex.encodeHexString(encodeBase64));
you should write
System.out.println(new String(encodeBase64, "UTF-8"));
BTW: You should never use the String.getBytes() version without an explicit encoding, because the result depends on the default platform encoding (for Windows this is usually "Cp1252", for Linux "UTF-8").
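A sketch of the method with both fixes applied, assuming java.nio.charset.StandardCharsets is imported; both println calls then show the same Base64 text:
private static String convertToBase64(String string)
{
    // Explicit encoding so the result does not depend on the platform default
    final byte[] data = string.getBytes(StandardCharsets.UTF_8);

    final byte[] encodeBase64 = org.apache.commons.codec.binary.Base64.encodeBase64(data);
    System.out.println(new String(encodeBase64, StandardCharsets.UTF_8)); // c3RhY2tvdmVyZmxvdw==

    final String encoded = javax.xml.bind.DatatypeConverter.printBase64Binary(data);
    System.out.println(encoded);                                          // c3RhY2tvdmVyZmxvdw==
    return encoded;
}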