Could anyone please let me know how to convert protobuf's ByteString to an octal escape sequence String in Java?
In my case, I am getting the ByteString value as \376\024\367, so when I print the String value to the console using System.out.println(), I should get "\376\024\367".
Many thanks.
Normally, you'd convert a ByteString to a String using ByteString#toString(Charset). This method lets you specify what charset the text is encoded in. If it's UTF-8, you can also use the method toStringUtf8() as a shortcut.
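For example, a minimal sketch of the normal decoding path (this assumes the bytes really are text in the given charset; the three bytes from the question are not valid UTF-8, so they would come out as replacement characters):
import com.google.protobuf.ByteString;
import java.nio.charset.StandardCharsets;

public class DecodeByteString {
    public static void main(String[] args) {
        ByteString bs = ByteString.copyFromUtf8("hello");

        // Decode with an explicit charset...
        String viaCharset = bs.toString(StandardCharsets.UTF_8);
        // ...or use the UTF-8 shortcut.
        String viaShortcut = bs.toStringUtf8();

        System.out.println(viaCharset + " / " + viaShortcut);
    }
}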
From your question, though, it sounds like you actually want to produce the escaped format using C-style three-digit octal escapes. AFAIK there's no public function to do this, but you can see the code here. You could copy that code into your own project and use it.
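If copying that code is not an option, here is a rough hand-rolled sketch of the same idea (a hypothetical helper, not the protobuf implementation): printable ASCII bytes are passed through and everything else becomes a three-digit octal escape.
import com.google.protobuf.ByteString;

public class OctalEscape {

    // Hypothetical helper: renders each non-printable byte as \NNN (octal),
    // similar in spirit to protobuf's text-format escaping.
    static String escapeBytes(ByteString bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.size(); i++) {
            int b = bytes.byteAt(i) & 0xFF;            // treat the byte as unsigned
            if (b >= 0x20 && b <= 0x7E && b != '\\' && b != '"') {
                sb.append((char) b);                   // printable ASCII: keep as-is
            } else {
                sb.append('\\')
                  .append((char) ('0' + ((b >>> 6) & 7)))
                  .append((char) ('0' + ((b >>> 3) & 7)))
                  .append((char) ('0' + (b & 7)));     // three-digit octal escape
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteString bs = ByteString.copyFrom(new byte[] {(byte) 0376, 024, (byte) 0367});
        System.out.println(escapeBytes(bs));           // prints \376\024\367
    }
}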
I have used Akka's ByteString: http://doc.akka.io/japi/akka/2.3.7/akka/util/ByteString.ByteStrings.html
It provides a decodeString(java.lang.String charset) method.
Otherwise, see https://github.com/akka/akka/issues/18738
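For what it's worth, a minimal sketch using Akka's ByteString (assuming the bytes are UTF-8 text):
import akka.util.ByteString;

public class AkkaDecode {
    public static void main(String[] args) {
        ByteString bs = ByteString.fromString("hello");
        // decodeString takes the name of the charset the bytes were encoded in.
        System.out.println(bs.decodeString("UTF-8"));
    }
}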
While printing certain Unicode characters in Java we get the output as '?'. Why is that, and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you are printing or the characters the console can display.
I would check which encoding you are using throughout, and try to determine whether you are reading, storing, and printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
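For example, a minimal sketch of forcing the output encoding (this assumes the console itself is set to UTF-8 and has a font that can display the character):
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out in a PrintStream with an explicit encoding
        // instead of relying on the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");

        StringBuilder strg = new StringBuilder("unicodecharacter");
        strg.insert(5, "\u200d");   // U+200D ZERO WIDTH JOINER, as in the question
        out.println(strg);
    }
}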
Java's default behaviour when it cannot decode a character is to replace it with the replacement character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as Unicode; it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
I wrote an open-source library that has a utility that converts any String to a Unicode sequence and vice versa. It helps to diagnose such issues. For instance, to print your String you can use something like this:
String str = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
        StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it, and how to use it, at Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison. See the paragraph "String Unicode converter".
I'm pulling JSON data through a REST API using Requests in Python. Unfortunately, one of the fields contains all sorts of unescaped and control characters that break the JSON.
I don't control the data, but I can request it undecoded as a string that the application stores as a Java byte array.
For example: [B#1cf3bd82
The question is: how do I decode the string back into the original UTF-8 text as I work through the JSON? All of the examples I've found seem to work with a bytes object, not an encoded string.
Thoughts?
You're currently printing out the result of calling toString() on the byte[]. That's never a good idea - arrays don't override toString().
You should use the new String(byte[], Charset) constructor:
String text = new String(bytes, StandardCharsets.UTF_8);
It's not entirely clear to me from the question what is happening in terms of the data, but basically you need to modify the Java code; any Python code is probably irrelevant here.
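As a self-contained sketch (the byte values here are made up for illustration):
import java.nio.charset.StandardCharsets;

public class BytesToText {
    public static void main(String[] args) {
        // Example bytes: "héllo" encoded as UTF-8 (0xC3 0xA9 is 'é').
        byte[] bytes = {'h', (byte) 0xC3, (byte) 0xA9, 'l', 'l', 'o'};

        System.out.println(bytes);                                      // the array's default toString(), e.g. [B@1cf3bd82
        System.out.println(new String(bytes, StandardCharsets.UTF_8));  // héllo
    }
}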
I am trying to debug a flaky Java application. I can't (easily) debug it in the only way I know how: putting a log statement in it, re-compiling, and then checking the logs.
(I don't have access to a reliable set of source code, and I'm not a Java developer.)
The actual question:
If I did this:
str = URLDecoder.decode("%25C3%2596");
What would be in str?
Would it realize that this is double-encoded and handle that, i.e. turn it into %C3%96 and then decode that? (Which decodes into a German umlaut.)
Thanks
--Justin Wyllie
From the Java API documentation for URLDecoder:
A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits.
So my guess would be - most likely not.
You could however call the decode method twice.
str = URLDecoder.decode(URLDecoder.decode("%25C3%2596"));
str = URLDecoder.decode("%25C3%2596");
The result of this operation is system-dependent, which is why that method is deprecated.
The result of this call:
str = URLDecoder.decode("%25C3%2596", "UTF-8");
...would be %C3%96, which is Ö in percent-encoded UTF-8. The API does not try to recursively decode any percent signs.
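A quick sketch to confirm: decoding once with an explicit charset yields the still-encoded form, and decoding a second time yields the character itself.
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DoubleDecode {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String once = URLDecoder.decode("%25C3%2596", "UTF-8");
        String twice = URLDecoder.decode(once, "UTF-8");

        System.out.println(once);   // %C3%96
        System.out.println(twice);  // Ö
    }
}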
In my Java application I have been passed in a string that looks like this:
"\u00a5123"
When printing that string into the console, I get the same string as the output (as expected).
However, I want to print that out by having the unicode converted into the actual yen symbol (\u00a5 -> yen symbol) - how would I go about doing this?
i.e. so it looks like this: "[yen symbol]123"
I wrote a little program:
public static void main(String[] args) {
    System.out.println("\u00a5123");
}
Its output:
¥123
i.e. it outputs exactly what you stated in your post, so I suspect something else is going on. What version of Java are you using?
edit:
In response to your clarification, there are a couple of different techniques. The most straightforward is to look for a "\u" followed by four hex digits, extract that piece, and replace it with the character for that code point (using the Character class). This of course assumes the string will not contain a \u that isn't part of an escape.
I am not aware of anything in the standard library that parses the String as though it were an encoded Java string literal.
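A minimal sketch of that approach using a regular expression (the class and method names here are made up for illustration):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces every \uXXXX escape with the character it names.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u00a5123"));   // ¥123
    }
}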
As has been mentioned before, these strings will have to be parsed to get the desired result.
Tokenize the string by using \u as separator. For example: \u63A5\u53D7 => { "63A5", "53D7" }
Process these strings as follows:
String hex = "63A5";
int intValue = Integer.parseInt(hex, 16);
System.out.println((char)intValue);
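Putting the two steps together, a minimal sketch (assuming the input consists only of \uXXXX escapes, as in the example above):
public class TokenizeEscapes {
    public static void main(String[] args) {
        String input = "\\u63A5\\u53D7";        // literal backslash-u escapes
        StringBuilder out = new StringBuilder();

        // Split on the \u separator and convert each hex token to its character.
        for (String hex : input.split("\\\\u")) {
            if (!hex.isEmpty()) {
                out.append((char) Integer.parseInt(hex, 16));
            }
        }
        System.out.println(out);                // 接受
    }
}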
You're probably going to have to write a parser for these, unless you can find one in a third-party library. There is nothing in the JDK that will parse them for you. I know because I fairly recently had the idea of using this kind of escape to smuggle Unicode through a Latin-1-only database (I ended up doing something else, by the way).
I will tell you that java.util.Properties escapes and unescapes Unicode characters in this manner when reading and writing files (since the files have to be ASCII). The methods it uses for this are private, so you can't call them, but you could use the JDK source code to inspire your solution.
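As a side note, Properties.load will itself unescape \uXXXX sequences, so one (hacky) sketch is to feed the string through it as a fake properties entry:
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropertiesUnescape {
    public static void main(String[] args) throws IOException {
        String escaped = "\\u00a5123";              // the literal text \u00a5123

        // Properties.load processes \uXXXX escapes in keys and values.
        Properties p = new Properties();
        p.load(new StringReader("key=" + escaped));

        System.out.println(p.getProperty("key"));   // ¥123
    }
}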
You could replace the hex-parsing snippet above ("63A5") with this:
System.out.println((char)0x63A5);
Here is the code to print all of the box-drawing Unicode characters.
public static void printBox()
{
    for (int i = 0x2500; i <= 0x257F; i++)
    {
        System.out.printf("0x%x : %c\n", i, (char) i);
    }
}