Convert Freebase Unicode codepoints to Java String - java

I'm doing some Freebase queries. Sometimes the result of the query contains Unicode characters. How could I convert those characters into a Java String? (e.g., The_Police_$0028band$0029 → The_Police_(band)). I've tried:
new String(arg_in_byte,"UTF-8")
but it doesn't work. I saw in another question that one solution is the method replaceAll but I think that there is some other method that will be cleaner.

Those aren't UTF-8 encoded; they use Freebase's own escaping of Unicode code points. If your Java client library for Freebase doesn't include the necessary decoding method, you'll need to write one yourself: take the four digits after the dollar sign ($), interpret them as a hexadecimal integer, and convert that to a Java char (Java strings store Unicode text as UTF-16 internally).
Here is some documentation on the encoding:
http://wiki.freebase.com/wiki/MQL_key_escaping
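A minimal sketch of such a decoder, assuming every escape is exactly a dollar sign followed by four hex digits, as described above. The class and method names are hypothetical and not part of any Freebase client library; anything that does not match the pattern is copied through unchanged.

```java
// Hypothetical helper for illustration; not part of any Freebase client library.
final class MqlKeyDecoder {

    // Replaces each "$XXXX" (dollar sign plus four hex digits) with the char
    // whose UTF-16 code unit is 0xXXXX. All other text is copied as-is.
    static String decode(String key) {
        StringBuilder out = new StringBuilder(key.length());
        int i = 0;
        while (i < key.length()) {
            char c = key.charAt(i);
            if (c == '$' && i + 5 <= key.length() && isHex(key, i + 1, i + 5)) {
                out.append((char) Integer.parseInt(key.substring(i + 1, i + 5), 16));
                i += 5; // skip the dollar sign and the four digits
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int j = from; j < to; j++) {
            if (Character.digit(s.charAt(j), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(decode("The_Police_$0028band$0029")); // The_Police_(band)
    }
}
```

A single pass with a StringBuilder avoids the chain of replaceAll calls mentioned in the question and handles any escaped character, not just parentheses.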

Related

How do I maintain the backslash when converting to json String using Json Format of Protobuf?

I have to use gRPC.
I was converting the object I received into json string, and the following problem occurred
example proto
hash: "v\016\177\350\207y\225wM\335]1(Z\266\305\376\027\310_v\321\016Q\v\332\030\303^\032|\375"
However, if I convert using Protobuf's util JsonFormat, I get the following result:
"hash": "dg5/6Id5lXdN3V0xKFq2xf4XyF920Q5RC9oYw14afP0="
I want to get this back to its original form. Is there another library I can use, or a way to decode it in reverse?
Forget about the format, basically; these are just two ways of representing the same data. The second version is base-64, and decodes to the bytes:
76-0E-7F-E8-87-79-95-77-4D-DD-5D-31-28-5A-B6-C5-FE-17-C8-5F-76-D1-0E-51-0B-DA-18-C3-5E-1A-7C-FD
The first version is C-literal style with octal escapes; v is ASCII 118, aka hex 0x76; \016 is escaped octal for decimal 14, aka hex 0x0E; \177 is escaped octal for decimal 127, aka hex 0x7F - and so on. Most languages have a base-64 encode/decode; the C-literal style with octal escape sequences is ... more niche, and you might need to write your own decoder for that. Depending on where the first string came from, it is worth noting that protobuf (at least the schema variant) also allows fixed-width unicode escapes, via \uNNNN and \UNNNNNNNN, IIRC. And note: the octal in .proto schemas can short-circuit: \12n means the same as \012n - at most 3 digits are taken, but if a non-digit character is encountered, it is still valid as a shorter form.
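To check that both strings really carry the same bytes, something like the following could be used. It is only a sketch: decodeCEscaped is a hypothetical hand-written helper (not a protobuf API), it handles octal escapes and the usual single-letter escapes only, and it assumes all unescaped characters are plain ASCII. The base-64 side uses java.util.Base64 from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for illustration; not part of the protobuf library.
final class ProtoBytesDemo {

    // Decodes C-literal style text into raw bytes. Handles octal escapes of
    // one to three digits and the usual single-letter escapes. Hex and
    // Unicode escapes are left out of this sketch.
    static byte[] decodeCEscaped(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i++);
            if (c != '\\') {
                out.write(c); // assumes unescaped characters are plain ASCII
                continue;
            }
            if (i == text.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char e = text.charAt(i++);
            if (isOctal(e)) {
                int value = e - '0';
                // Up to two more octal digits; stop early at the first non-digit.
                for (int n = 0; n < 2 && i < text.length() && isOctal(text.charAt(i)); n++) {
                    value = value * 8 + (text.charAt(i++) - '0');
                }
                out.write(value);
            } else {
                out.write(simpleEscape(e));
            }
        }
        return out.toByteArray();
    }

    private static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    private static int simpleEscape(char e) {
        switch (e) {
            case 'a': return 0x07;
            case 'b': return 0x08;
            case 't': return 0x09;
            case 'n': return 0x0A;
            case 'v': return 0x0B;
            case 'f': return 0x0C;
            case 'r': return 0x0D;
            case '\\': case '\'': case '"': return e;
            default: throw new IllegalArgumentException("unsupported escape: \\" + e);
        }
    }

    public static void main(String[] args) {
        String escaped = "v\\016\\177\\350\\207y\\225wM\\335]1(Z\\266\\305\\376\\027\\310_v"
                + "\\321\\016Q\\v\\332\\030\\303^\\032|\\375";
        byte[] fromText = decodeCEscaped(escaped);
        byte[] fromJson = Base64.getDecoder().decode("dg5/6Id5lXdN3V0xKFq2xf4XyF920Q5RC9oYw14afP0=");
        System.out.println(Arrays.equals(fromText, fromJson)); // true
    }
}
```

If all you need is the original bytes back from the JSON output, the Base64.getDecoder().decode(...) call alone is enough; the octal decoder is only there to show that the two representations agree.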

How do I store accented characters in S3 metadata?

I am trying to store accented characters such as ò in the metadata of an S3 object. I am using the REST API which according to this page only accepts US-ASCII: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
Is there a way to convert Strings in Scala or Java from Bòrd to B\u00F2rd?
I have tried using Normalizer.normalize(str, Normalizer.Form.NFD) but the character when submitted to S3 is still causing an error because it appears as ò. When I try to print out the returned String it is also showing ò.
A normalized Unicode string is just normalized in terms of composing characters, not necessarily to ASCII. Using NFKC would be more likely to convert characters to ASCII forms, but certainly would not do so reliably.
It sounds like what you want is to escape non-ASCII characters. You could use e.g. UnicodeEscaper from commons-lang, and UnicodeUnescaper to translate back.
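If you would rather not pull in a library, the same idea can be sketched in plain Java. This is not the commons-lang API: AsciiEscaper, escape and unescape are hypothetical names, each UTF-16 code unit above 0x7F becomes a backslash, the letter u and four hex digits, and a backslash that is already in the input is not escaped (so input that already contains such sequences would not round-trip).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the commons-lang implementation.
final class AsciiEscaper {

    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every UTF-16 code unit above 0x7F with its four-digit hex escape.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Turns the four-digit hex escapes back into the original characters.
    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("B\u00F2rd");
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals("B\u00F2rd"));
    }
}
```

With this sketch, Bòrd is stored as B\u00F2rd, which is pure US-ASCII, and unescape restores the original when you read the metadata back.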

Unable to convert Hyphen to UTF-8

I'm reading some text that I got from Wikipedia.
The text contains hyphen like in this String: "Australia for the [[2011–12 NBL season]]"
What I'm trying to do is convert the text to UTF-8, using this code:
String myStr = "Australia for the [[2011–12 NBL season]]";
new String(myStr.getBytes(), "utf-8");
The result is:
Australia for the [[2011�12 NBL season]]
The problem is that the hyphen is not being mapped correctly.
The hyphen value in bytes is [-106] (I have no idea what to do with it...)
Do you know how to convert it to a hyphen that utf-8 encoding recognizes?
I would be happy to replace other special characters as well by some general code, but also specific "hyphens" replacement code will help.
The problem code point is U+2013 EN DASH which can be represented with the escape \u2013.
Try replacing the string with "2011\u201312". If this works then there is a mismatch between your editor character encoding and the one the compiler is using.
Otherwise, the problem is with the transcoding operation from string to whatever device you are writing to. Anywhere where you convert from bytes to chars or chars to bytes is a potential point of corruption when the wrong encoding is used; this can include System.out.
Note: Java strings are always UTF-16.
new String(myStr.getBytes(), "utf-8");
This code takes UTF-16, converts it to the platform encoding, which might be anything, then pretends it's UTF-8 and converts it back to UTF-16. At best, the platform encoding is UTF-8 and this is a no-op; otherwise it will just corrupt the data.
This is how you create UTF-8 in Java:
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // Java 7
You can read more here.
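As a small illustration of the points above, here is a sketch that encodes a string containing the en dash to UTF-8 and decodes it again, always naming the charset explicitly. The class and helper names are made up for the example; the en dash is written as an escape so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the helper names are illustrative only.
final class EnDashUtf8 {

    // String (UTF-16 in memory) to UTF-8 bytes, with an explicit charset.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes back to a String, again with an explicit charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "2011\u201312"; // en dash written as an escape
        byte[] utf8 = toUtf8(text);
        System.out.println(hex(utf8));                    // 32 30 31 31 E2 80 93 31 32
        System.out.println(fromUtf8(utf8).equals(text));  // true
    }
}
```

The en dash becomes the three bytes E2 80 93 in UTF-8. The round trip is lossless because the same charset is named on both sides, unlike getBytes() with no argument.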
This probably happens because the editor saves the source file as Windows-1252 (extended Latin-1) while the compiler reads it as UTF-8. The two encodings must match; alternatively, write the character in the source as "\u2013", the ASCII-only escape for the en dash.

Escaped HTML won't unescape (now: unescaped HTML won't escape back)

So I'm currently using the commons lang apache library.
When I tried unescaping this string: &#128512;
This returns the same string: &#128512;
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a String with fewer characters, it works:
String characters = "&#12851;";
StringEscapeUtils.unescapeHtml(characters);
Output: ㈳
Any ideas? When I tried unescaping this String "&#128512;" on an online unescaping utility, it works, so maybe it's a bug in the Apache Commons Lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the String successfully. The problem now is that when I try to escape the result of that unescape, it won't bring back the original String (&#128512;).
unescapeHtml() leaves &#128512; untouched because, as the documentation says, it only unescapes HTML 4.0 entities, which are limited to 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML supports up to 1,114,111 (10FFFFh) character entities.
This is the Unicode character U+1F600 (decimal 128512), GRINNING FACE.
The String you mentioned is the HTML escape of U+1F600. If you unescape it using Apache Commons Lang, it gives you the smiley.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
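A short sketch of what this means for U+1F600, using only standard JDK methods (Character.toChars, String.codePointAt, String.codePointCount). The class and method names are just for the example.

```java
// Hypothetical demo class showing how a supplementary character is stored.
final class SurrogateDemo {

    // Builds a String from a single Unicode code point (BMP or supplementary).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String face = fromCodePoint(0x1F600);

        // Two char values (a surrogate pair), but only one code point.
        System.out.println(face.length());                          // 2
        System.out.println(face.codePointCount(0, face.length()));  // 1

        // High surrogate first, then low surrogate.
        System.out.printf("%04X %04X%n", (int) face.charAt(0), (int) face.charAt(1)); // D83D DE00

        // codePointAt reassembles the pair into the original code point.
        System.out.println(Integer.toHexString(face.codePointAt(0))); // 1f600
    }
}
```

Code that walks a String char by char sees the two surrogates separately, so anything that escapes or counts characters should iterate over code points instead.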
Regarding your update that it's not converting back to &#128512;
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
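For illustration, here is a dependency-free sketch of both directions. It is not how Apache Commons Lang does it: NumericCharRef, decode and encode are hypothetical names, named entities such as &amp; are ignored, and encode emits a decimal reference for every code point above 0x7F. Because encode walks code points rather than char values, a supplementary character comes back as a single reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the Apache Commons Lang implementation.
final class NumericCharRef {

    // Group 1: hex digits after "&#x"; group 2: decimal digits after "&#".
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    // Replaces decimal and hexadecimal numeric character references with the
    // characters they denote. Values that are not valid code points stay as-is.
    static String decode(String s) {
        Matcher m = REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Writes every non-ASCII code point as a decimal numeric character reference.
    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String face = decode("&#128512;");
        System.out.println(face.codePointAt(0) == 0x1F600); // true
        System.out.println(encode(face));                   // &#128512;
    }
}
```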
Have added few SoP to help you understand this unicode better.
Well - the solution is pretty easy:
use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead! (unless you're using Java <1.5, which you probably aren't)
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml4(characters);
I think the problem is that there is no Unicode character "&#128512;",
so the method simply returns this string.
The doc of the function says only:
Returns: a new unescaped String, null if null string input
If it's a HTML specific question, then you can just use JavaScript for this purpose.
You can do
escape("&#128512;") which gives you %26%23128512%3B
unescape("%26%23128512%3B") which gives you back &#128512;

XML, java , unicode

In XML, if a single Unicode character is written as \ue123 in Java,
how can a string of two characters be written?
Note: I tried \u123\u123 but it didn't work!
Well \u123\u123 doesn't work because \u needs to be followed by four hex digits. But this should work fine:
String text = "\u0123\u0123";
Note that this is just the Java string literal side - it has nothing to do with XML. XML has different ways of escaping the characters it needs to, but if you use an appropriate encoding (e.g. UTF-8) you shouldn't need to escape non-ASCII characters.
