I'm working with XML. If a single Unicode character is written as \ue123 in Java,
how can a string of two characters be written?
Note: I tried \u123\u123 but it didn't work!
Well \u123\u123 doesn't work because \u needs to be followed by four hex digits. But this should work fine:
String text = "\u0123\u0123";
Note that this is just the Java string literal side - it has nothing to do with XML. XML has different ways of escaping the characters it needs to, but if you use an appropriate encoding (e.g. UTF-8) you shouldn't need to escape non-ASCII characters.
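As a quick check of the four-hex-digit rule, here is a minimal sketch (the class and method names are made up for illustration) that builds the two-character string and inspects it:

```java
class TwoCharLiteral {
    // Each escape in the literal below has exactly four hex digits,
    // so the compiler turns the literal into two chars.
    static String twoChars() {
        return "\u0123\u0123";
    }

    public static void main(String[] args) {
        String text = twoChars();
        System.out.println(text.length());                      // 2
        System.out.println(Integer.toHexString(text.charAt(0))); // 123
    }
}
```

The three-digit form fails at compile time, because the compiler processes these escapes before it even sees the string literal.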
I am trying to validate a file's content when it is uploaded and I am stuck at the Unicode encoding. I am not interested in finding special Unicode characters that are outside the ASCII range. I am trying to find out whether the content of the file contains at least one Unicode escape pattern, like \u0046 for example.
For example, I exclude any file that contains the 'script' word, but what if the file contains this word written in Unicode? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?
So, as far as I have searched on the Internet, I've seen Unicode characters written like \u0046, or like U+0046. Based on this, I have written the following regex:
(\\u|U\+)....
This means \u or U+ followed by four characters. This pattern does what I want, but I wonder if there are any other ways to write a Unicode character. Is it always \u or U+? Can there be more or fewer than 4 characters after \u or U+?
Thanks
The notation U+ followed by any number of hex digits is how the Unicode standard names code points; it is not functional anywhere in code. In Java source code and *.properties files, \u followed by four hex digits is a UTF-16 code unit, and it is parsed automatically.
The pattern to search for that:
"\\\\u[0-9A-Fa-f]{4}"
Or a String.contains on:
"\\u"
In some languages other than Java a longer escape covers the full range beyond U+FFFF, for example \UXXXXXXXX (eight hex digits) in C, C++, and Python. Unfortunately, up to Java 8 there is no such escape in Java.
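To show the pattern above in use, here is a small sketch that scans raw, undecoded text for \uXXXX sequences. The class name and the sample input are made up; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeScanner {
    // Regex: one literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");

    // Returns every escape sequence found in the raw text, in order.
    static List<String> findEscapes(String raw) {
        List<String> found = new ArrayList<>();
        Matcher m = ESCAPE.matcher(raw);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as six literal characters,
        // which is what an uploaded file would contain on disk.
        String raw = "<\\u0073cript>alert(1)</script>";
        System.out.println(findEscapes(raw).size()); // 1
    }
}
```

This only detects the Java-style form. If the upload could be interpreted by something else (a browser, a JSON parser), other escape syntaxes would need their own patterns.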
I'm doing some Freebase queries. Sometimes the result of the query contains Unicode characters. How could I convert those characters into a Java String? (e.g., The_Police_$0028band$0029 → The_Police_(band)). I've tried:
new String(arg_in_byte,"UTF-8")
but it doesn't work. I saw in another question that one solution is the method replaceAll but I think that there is some other method that will be cleaner.
Those aren't UTF-8 encoded, but rather private encoding of Unicode codepoints. If your Java client library for Freebase doesn't include the necessary decoding method, you'll need to write one yourself to take the four digits after the dollar sign ($), interpret them as a hexadecimal integer and then convert that to a Java character (which also uses Unicode code points internally).
Here is some documentation on the encoding:
http://wiki.freebase.com/wiki/MQL_key_escaping
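A minimal sketch of such a decoder, assuming every escape is a dollar sign followed by exactly four hex digits as described above (the class name MqlKeyDecoder is made up, and this is not part of any Freebase client library):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MqlKeyDecoder {
    // A dollar sign followed by exactly four hex digits, e.g. $0028.
    private static final Pattern ESCAPE = Pattern.compile("\\$([0-9A-Fa-f]{4})");

    // Replaces each $XXXX escape with the char whose value is XXXX (hex).
    static String decode(String key) {
        Matcher m = ESCAPE.matcher(key);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(key, last, m.start());                    // text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16)); // the decoded char
            last = m.end();
        }
        out.append(key, last, key.length());                     // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("The_Police_$0028band$0029")); // The_Police_(band)
    }
}
```

Building the result by hand avoids the special meaning that $ and backslash have in replaceAll replacement strings.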
So I'm currently using the commons lang apache library.
When I tried unescaping this string: &#128512;
This returns the same string: &#128512;
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml(characters);
Output: &#128512;
But when I tried unescaping a String with one fewer digit, it works:
String characters = "&#12851;";
StringEscapeUtils.unescapeHtml(characters);
Output: ㈳
Any ideas? When I tried unescaping this String "&#128512;" with an online unescaping utility, it worked, so maybe it's a bug in the Apache Commons Lang library? Or can anyone recommend another library?
Thanks.
UPDATES:
I'm now able to unescape the String successfully. The problem now is that when I try to escape the result of that unescape, I don't get the original String (&#128512;) back.
unescapeHtml() leaves &#128512; untouched because – as the documentation says – it only unescapes HTML 4.0 entities, which are limited to 65,536 characters. Unfortunately, 128,512 is far beyond that limit.
Have you tried using unescapeXml()?
XML supports character references up to 1,114,111 (10FFFFh).
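If pulling in a different library method is not an option, numeric character references can also be decoded by hand. The following is a dependency-free sketch, not the Commons Lang implementation; the class name is made up, and it handles only numeric references (decimal and hex), not named entities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericReferenceDecoder {
    // Group 1: hex digits after "&#x"; group 2: decimal digits after "&#".
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int codePoint;
            try {
                codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // too many digits to be a code point
            }
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint); // emits a surrogate pair above U+FFFF
            } else {
                out.append(m.group());          // leave invalid references untouched
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String smiley = decode("&#128512;");
        System.out.println(smiley.length()); // 2
    }
}
```

The length is 2 because the decoded character lies above U+FFFF and therefore occupies two char values in a Java String.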
This is the Unicode character U+1F600 (128512), GRINNING FACE.
Refer to the URL for details
The String you have mentioned is the HTML escape of U+1F600. If you unescape it using Apache Commons Lang, it will produce the expected smiley.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
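A short sketch of what that means for U+1F600, using only java.lang.Character (the class and method names are made up for illustration):

```java
class SurrogatePairDemo {
    // Converts a code point to its UTF-16 char sequence (one or two chars).
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x1F600);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00

        String s = new String(units);
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (actual character)
        System.out.println(s.codePointAt(0));                // 128512
    }
}
```

So length() counts code units, not characters, which is why code that walks a String char by char can split a supplementary character in half.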
Regarding your update that it's not converting back to &#128512;:
You can also represent a character using a Numeric Character Reference, of the form &#dddd;, where dddd is the decimal value representing the character's Unicode scalar value. You can alternatively use a hexadecimal representation &#xhhhh;, where hhhh is the hexadecimal value equivalent to the decimal value.
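One way to get the decimal form back is to walk the string by code point rather than by char. This is a minimal sketch with a made-up class name; it only rewrites non-ASCII characters and does not escape markup characters such as & or <.

```java
class NumericReferenceEncoder {
    // Replaces every non-ASCII code point with a decimal reference "&#dddd;".
    static String encodeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i); // joins surrogate pairs
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);  // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(encodeNonAscii("hi " + smiley)); // hi &#128512;
    }
}
```

Iterating with charAt instead would emit two separate references for the two surrogates, which is a plausible reason for an escape call not round-tripping.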
A good site for this
I have added a few System.out.println statements to help you understand this Unicode handling better.
Well, the solution is pretty easy:
use org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 instead! (unless you're using Java <1.5, which you probably aren't)
String characters = "&#128512;";
StringEscapeUtils.unescapeHtml4(characters);
I think the problem is that there is no Unicode character "&#128512;",
so the method simply returns this string.
The doc of the function says only:
Returns: a new unescaped String, null if null string input
If it's a HTML specific question, then you can just use JavaScript for this purpose.
You can do
escape("&#128512;") which gives you %26%23128512%3B
unescape("%26%23128512%3B") which gives you back &#128512;
Since Java holds characters internally in UTF-16, what if you need to output in a certain encoding that includes characters that are not in unicode at all?
Java can only handle characters which are present in Unicode, basically. Text outside the BMP (i.e. above U+FFFF) is encoded as surrogate pairs (as each char is a UTF-16 code unit)... but if you want characters which aren't in Unicode at all, you're on your own - you could probably find some area of Unicode which is reserved for private use, and map the characters there... but you may well have "fun" in all kinds of odd ways.
Do you definitely need to handle characters which aren't in Unicode? I thought it covered almost everything these days...
I've stumbled upon a few xxx_fr.properties, xxx_en.properties, etc. files and I'm a bit surprised that they contain both HTML entities and \uxxxx escapes.
I guess the HTML entities are fine as long as these resources are served to something expecting HTML, but what about the \uxxxx escapes?
Does Java specify that \uxxxx escapes are fine in .properties files?
Yes - see the documentation for load(Reader), which states
Characters in keys and elements can be
represented in escape sequences
similar to those used for character
and string literals.
and then clarifies that
Only a single 'u' character is allowed in a Unicode escape sequence.
Hence a Unicode escape sequence containing a single 'u' character is definitely supported.
Note that there's nothing special going on here at loading time with HTML entities - the String &amp; for example would simply be seen within Java as a String containing 5 characters. As you point out, this might be interpreted in a special way if it were output to some other component later.
On the other hand, the escape sequence \u0061 would be seen within Java as the single-character string 'a', and would be indistinguishable from the file having contained that character instead.
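Both points can be checked with a small sketch that loads properties from an in-memory string instead of a file. The keys letter and entity and the class name are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Parses the given .properties content and returns the value for key.
    static String loadValue(String fileContent, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The doubled backslash means the content holds a literal
        // backslash-u escape, exactly as it would appear in a file on disk.
        String content = "letter=\\u0061\nentity=&amp;\n";
        System.out.println(loadValue(content, "letter"));          // a
        System.out.println(loadValue(content, "entity").length()); // 5
    }
}
```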
The \u type of escaping is a standard Java way of representing Unicode characters. You can read about it in the Java Internationalization FAQ; the question "How do I specify non-ASCII strings in a properties file?" is the one you're most interested in:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#properties-escape
And that's not Properties related only; you can use those in your typical Java code as well. See the Text Representation block:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#core-textrep