I've stumbled upon a few xxx_fr.properties, xxx_en.properties, etc. files, and I'm a bit surprised that they contain both HTML entities and \uxxxx escapes.
I guess the HTML entities are fine as long as these resources are served to something expecting HTML, but what about the \uxxxx escapes?
Does Java specify that \uxxxx escapes are allowed in .properties files?
Yes - see the documentation for load(Reader), which states
Characters in keys and elements can be
represented in escape sequences
similar to those used for character
and string literals.
and then clarifies that
Only a single 'u' character is allowed in a Unicode escape sequence.
Hence a Unicode escape sequence containing a single 'u' character is definitely supported.
Note that there's nothing special going on here at loading time with HTML entities - the String &amp; for example would simply be seen within Java as a String containing 5 characters. As you point out, this might be interpreted in a special way if it were output to some other component later.
On the other hand, the escape sequence \u0061 would be seen within Java as the single-character string 'a', and would be indistinguishable from the file having contained that character instead.
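To make the difference concrete, here is a minimal sketch that loads two entries from an in-memory string. The class name EscapeDemo and the keys are made up for illustration; only Properties.load(Reader) comes from the documentation quoted above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class EscapeDemo {
    // Loads properties from an in-memory string so the example needs no file.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as six literal characters,
        // which is what the loader would find in a real .properties file.
        Properties props = load("escaped=\\u0061\nentity=&amp;\n");
        System.out.println(props.getProperty("escaped"));         // a
        System.out.println(props.getProperty("entity").length()); // 5
    }
}
```

The Unicode escape is decoded by the loader, while the HTML entity comes through untouched as five ordinary characters.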
The \u escape is the standard Java way of representing Unicode characters. You can read about it in the Java Internationalization FAQ. The question "How do I specify non-ASCII strings in a properties file?" is the one you're most interested in:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#properties-escape
And this isn't specific to Properties; you can use these escapes in ordinary Java code as well. See the Text Representation block:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#core-textrep
I am trying to validate a file's content when it is uploaded, and I am stuck at the Unicode encoding. I am not interested in finding special Unicode characters that are outside the ASCII range. I am trying to find out whether the content of the file contains at least one Unicode escape pattern, like \u0046 for example.
For example, I exclude any file that contains the 'script' word, but what if the file contains this word written in Unicode? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?
So, as far as I have searched on the Internet, I've seen Unicode characters written like \u0046, or like U+0046. Based on this, I have written the following regex:
(\\u|U\+)....
This means \u or U+ followed by any four characters. This pattern accomplishes what I want, but I wonder if there are any other ways to write a Unicode character. Is it always \u or U+? Can there be more or fewer than 4 characters after \u or U+?
Thanks
The notation U+ followed by any number of hex digits belongs to the Unicode standard and has no function anywhere in code. In Java source code and *.properties files, \u followed by four hex digits denotes a single UTF-16 code unit and is parsed automatically.
The pattern to search for that:
"\\\\u[0-9A-Fa-f]{4}"
Or a String.contains on:
"\\u"
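In some other languages a longer escape such as \Uxxxxxxxx (eight hex digits in C and Python) covers the full Unicode range. Unfortunately, up to Java 8 there is no such escape in Java; a character beyond U+FFFF has to be written as a surrogate pair of two \u escapes.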
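As a quick check of the four-hex-digit pattern given above, the sketch below wraps it in a helper. The class and method names are invented for this example, and it only looks at raw text that has not yet been decoded by Java.

```java
import java.util.regex.Pattern;

class UnicodeEscapeFinder {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");

    // True if the raw (undecoded) text contains at least one such escape.
    static boolean containsEscape(String raw) {
        return ESCAPE.matcher(raw).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEscape("s\\u0063ript")); // true
        System.out.println(containsEscape("script"));       // false
    }
}
```

Unlike the original `....` version, this rejects sequences whose four characters are not all hex digits.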
I have a properties file which may or may not contain Unicode-escaped characters in its values. Please see the sample below. My job is to ensure that if a value in the properties file contains a non-ASCII character, it is Unicode-escaped. So, in the sample below, the first entry is OK, and all entries like the second one should be converted to look like the first.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially my question is how can I differentiate in Java between cari\u00F1o and cariño since as far as Java is concerned it treats them as identical.
Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages such as those from Eastern Europe, Russia, or China without escaping them.
As such there are only a few non-ascii characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open the file as an InputStream, using a FileInputStream or Class.getResourceAsStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the line terminators \r and \n, which is the ASCII range of characters you would expect in a properties file.
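The byte scan described above might look like the following sketch. The class name AsciiScanner and the method firstNonAscii are hypothetical, and accepting the tab character is an extra assumption on top of the range given above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class AsciiScanner {
    // Returns the offset of the first byte that is not printable ASCII,
    // CR, LF or tab (tab is an assumption), or -1 if every byte is acceptable.
    static long firstNonAscii(InputStream in) {
        try {
            long offset = 0;
            int b;
            while ((b = in.read()) != -1) {
                boolean ok = (b >= 0x20 && b <= 0x7E)
                        || b == '\r' || b == '\n' || b == '\t';
                if (!ok) {
                    return offset;
                }
                offset++;
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The literal below holds a real n-with-tilde, i.e. an unescaped value.
        byte[] raw = "nonescaped=cari\u00F1o\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstNonAscii(new ByteArrayInputStream(raw))); // 15
    }
}
```

For a real file, wrap the FileInputStream in a BufferedInputStream so that reading one byte at a time stays cheap.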
I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties file. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
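If running the tool is not an option, the same idea can be sketched in a few lines. This is an illustration of the conversion, not a reproduction of native2ascii: the class name is made up, and the uppercase hex digits are chosen only to match the sample in the question.

```java
class NonAsciiEscaper {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // and copies ASCII characters (including existing escapes) unchanged.
    static String escape(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input holds a real n-with-tilde; output holds the six-character escape.
        System.out.println(escape("nonescaped=cari\u00F1o"));
    }
}
```

The input must already be decoded with the file's actual encoding (ISO-8859-1 for a classic properties file) before it is passed to escape.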
Your problem is that the Java Properties class decodes the properties files, assuming ISO-8859-1 encoding, and parsing escaped unicode characters.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
It's actually a feature that you do not need to care about this by default. The thing that strikes me as most odd is that the (only) encoding is ISO-8859-1, probably for historical reasons.
The library ICU4J seems to be what you're looking for. See the Normalization page.
In XML, if a single Unicode character is written as \ue123 in Java,
how can a string of two characters be written?
Note that I tried \u123\u123 but it didn't work!
Well \u123\u123 doesn't work because \u needs to be followed by four hex digits. But this should work fine:
String text = "\u0123\u0123";
Note that this is just the Java string literal side - it has nothing to do with XML. XML has different ways of escaping the characters it needs to, but if you use an appropriate encoding (e.g. UTF-8) you shouldn't need to escape non-ASCII characters.
I have a java JSP/servlet application in a Tomcat, and fronted by an Apache.
The server side checks to make sure only letters in ranges [A..Z][a..z], digits, and punctuation symbols are accepted.
However, when, for example, a Chinese character is entered, the value on the server side looks something like '&#5960;'.
Hence, as far as the server-side is concerned, these are valid punctuation symbols and digits.
Any pointers that can help? Driving me insane after a 10 coding marathon.
You can use Apache Commons StringEscapeUtils.unescapeHtml() in Java.
unescapeHtml(String str)
does the following:
Unescapes a string containing entity escapes to a string containing
the actual Unicode characters corresponding to the escapes.
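To illustrate what unescaping does before validation runs, here is a hand-rolled sketch that uses only the standard library. It is not the Commons implementation and handles only decimal numeric references such as &#5960;, not named or hexadecimal entities; the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDecoder {
    // Matches decimal numeric character references: ampersand, hash, digits, semicolon.
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the reference as it is when the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&#5960;");
        System.out.println(decoded.length());                    // 1
        System.out.println(Integer.toHexString(decoded.charAt(0))); // 1748
    }
}
```

Once the value is decoded, the letters-digits-punctuation check sees the real character instead of the ampersand, hash, digits and semicolon that make up the reference.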
You need to process the text using a unicode encoding like UTF-8.
First make sure your server is handling requests with UTF-8 encoding. Where you set or configure that will depend on how you're implementing your JSPs/Servlets, but see: http://docs.oracle.com/javaee/6/api/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)
I read text from different sources, which can contain characters from different languages and extended characters like € ƒ „ … † ® ©. I am then supposed to write it to an XML file, and I am using PrintWriter in Java to write whatever string I read. Extended characters with an ASCII value greater than 127 produce an illegal character error in the XML file, so how can I encode them properly while writing the XML?
First, there's no such thing as an ASCII code above 127. ASCII only defines values up to 127. "Extended ASCII" is an ambiguous term, as it's used to describe many different encodings.
Now, as for XML: use whichever XML API you want to write the string, without worrying about the contents (so long as they are representable in XML; various control characters in the range U+0000 to U+001F aren't representable, unfortunately). Don't try to create the XML from scratch yourself - that's what XML APIs are for. Make sure that your XML document uses an encoding that will cope with the characters you need (UTF-8 is normally a good choice, and is often the default), make sure that your Java strings have the right Unicode data in them, and you should be fine.
EDIT: I hadn't actually spotted this bit before:
I am using PrintWriter in java to write to an XML
Don't. Please use an XML API. There are plenty around, and you'll have a lot less to worry about. I'd also not recommend using PrintWriter anyway for the most part - suppressing exceptions isn't really a good idea in most cases.
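As one possible illustration, the sketch below uses the StAX API that ships with the JDK (javax.xml.stream). The element name "text" and the class name are invented, and any other XML API would serve equally well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class XmlTextWriter {
    // Writes the text into a one-element UTF-8 document; the API escapes markup characters.
    static byte[] write(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(bytes, "UTF-8");
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("text");
            xml.writeCharacters(text);
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
            return bytes.toByteArray();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the document again and returns the text of the first element.
    static String readBack(byte[] xmlBytes) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xmlBytes));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getElementText();
                }
            }
            throw new IllegalStateException("no element found");
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling XmlTextWriter.readBack(XmlTextWriter.write(s)) should return s for any string made of characters that XML can represent, including the extended characters from the question.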
Use the &#value; syntax. Space would be &#32;