I have a property file which may or may not contain Unicode-escaped characters in the values of its keys. Please see the sample below. My job is to ensure that if a value in the property file contains a non-ASCII character, then it should be Unicode escaped. So, in the sample below, the first entry is OK, and all entries like the second entry should be converted to the form of the first entry.
##sample.properties
escaped=cari\u00F1o
nonescaped=cariño
normal=darling
Essentially my question is: how can I differentiate in Java between cari\u00F1o and cariño, since as far as Java is concerned it treats them as identical?
Properties files in Java must be saved in the ISO-8859-1 character set for Java to read them properly. That means that it is possible to use special characters from Western European languages without escaping them. It is not possible to use characters from other languages, such as those from Eastern Europe, Russia, or China, without escaping them.
As such there are only a few non-ASCII characters that can appear in a properties file without being escaped.
To detect whether characters have been escaped or not, you will need to open the properties file directly, rather than through the Properties class. The Properties class does all the unescaping for you when you load a file through it. You should open it using the File class or through Class.getResourceAsStream as an InputStream. Once you do so, you can scan through the input stream one byte at a time and ensure that all bytes are in the 0x20-0x7E range, plus the newlines \r and \n, which is the ASCII range of characters you would expect in a properties file.
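A minimal sketch of that byte-level check, assuming the sample.properties file from the question (the file name and the extra allowance for tabs are my own assumptions):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PropertiesAsciiCheck {
    public static void main(String[] args) throws IOException {
        // File name is just a placeholder; point this at your own properties file.
        try (InputStream in = new FileInputStream("sample.properties")) {
            int b;
            int position = 0;
            while ((b = in.read()) != -1) {
                position++;
                // Accept printable ASCII (0x20-0x7E) plus tab, carriage return and line feed.
                boolean ok = (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\r' || b == '\n';
                if (!ok) {
                    System.out.println("Non-ASCII or control byte 0x"
                            + Integer.toHexString(b) + " at position " + position);
                }
            }
        }
    }
}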
I would suggest that your translators don't try to write properties files directly. They should provide you with documents like spreadsheets that you convert into properties files. Or they could use a translation editor such as Attesoro (which I wrote) to let them save the properties files properly escaped.
You could simply use the native2ascii tool, which performs exactly this conversion (it will convert all non-ASCII characters to escapes but leave existing escapes intact).
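For example, assuming the unescaped file was saved as UTF-8, the invocation would look something like this (the file names are placeholders; check your JDK documentation for the exact options):

native2ascii -encoding UTF-8 nonescaped.properties escaped.properties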
Your problem is that the Java Properties class decodes the properties files, assuming ISO-8859-1 encoding and parsing escaped Unicode characters.
So from a Properties point of view, these two strings are indeed the same.
I believe if you need to differentiate these two, you will need to write your own parser.
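To illustrate, a quick sketch using the sample file from the question (the key names come from that sample):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes the sample.properties from the question, saved as ISO-8859-1.
        try (InputStream in = new FileInputStream("sample.properties")) {
            props.load(in); // load(InputStream) assumes ISO-8859-1 and unescapes \uXXXX
        }
        // Both keys come back as the identical string "cariño";
        // the escaping information is already gone at this point.
        System.out.println(props.getProperty("escaped")
                .equals(props.getProperty("nonescaped"))); // true
    }
}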
It's actually a feature that you do not need to care about by default. The one thing that strikes me as most odd is that the (only) encoding is ISO-8859-1, probably for historical reasons.
The library ICU4J seems to be what you're looking for. See the Normalization page.
I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they would wish to use. However, I was confused as to how I would combine multiple encodings in a single file. For example, suppose that A characters come from one charset and B characters come from another; would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset since tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs for various reasons, so the scope of this question is purely in the standard Java packages/code.
You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or use some special separator character or byte sequence which indicates the start and end of the character group. This way you can get the bytes of the specific character group and finally decode them using the desired character encoding.
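A rough sketch of that idea, assuming fixed byte positions and two arbitrarily chosen charsets (the file name and charsets are made up for the example):

import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MixedEncodingDemo {
    public static void main(String[] args) throws Exception {
        Charset charsetA = StandardCharsets.UTF_16BE;   // assumed charset for the "A" part
        Charset charsetB = StandardCharsets.ISO_8859_1; // assumed charset for the "B" part

        byte[] partA = "AAAAA".getBytes(charsetA);
        byte[] partB = "BBBBB".getBytes(charsetB);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(partA);
        out.write(partB);
        Path file = Paths.get("mixed.bin");
        Files.write(file, out.toByteArray());

        // Reading back: you must know where each group starts and ends.
        byte[] all = Files.readAllBytes(file);
        String a = new String(all, 0, partA.length, charsetA);
        String b = new String(all, partA.length, partB.length, charsetB);
        System.out.println(a + b); // AAAAABBBBB
    }
}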
This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically all characters mankind is aware of.
Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
I was wondering about this as well, because my client just asked a similar question. Like BalusC mentioned, this is not a Java-specific problem.
After a bit of back and forth, I found the real question might be 'multiple encodings of information', rather than a multi-encoding file.
For example, we have an XML string whose text needs to be encoded as ISO-8859-1; if we save it as a file, then we need to encode it. The default encoding for XML is UTF-8, and we do not necessarily have to encode the whole XML document as ISO-8859-1. The XML node is just a vehicle for passing information over to another system; it is only the content (the value of the XML node) that needs to be persisted as ISO-8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it over. Once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as ISO-8859-1.
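On the receiving side that might look roughly like this (the file name and node value are placeholders for whatever your XML parsing produced):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PersistNodeValue {
    public static void main(String[] args) throws Exception {
        // Pretend this string was read from a node of the UTF-8 encoded XML document.
        String nodeValue = "cariño";

        // Persist just the value as ISO-8859-1; the XML document itself can stay UTF-8.
        Files.write(Paths.get("value.txt"), nodeValue.getBytes(StandardCharsets.ISO_8859_1));
    }
}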
I'm stumbling upon a few xxx_fr.properties, xxx_en.properties, etc. files and I'm a bit surprised, for they contain both HTML entities and \uxxxx escapes.
I guess the HTML entities are fine as long as these resources are served to something expecting HTML, but what about the \uxxxx escapes?
Does Java specify that \uxxxx escapes are fine in .properties files?
Yes - see the documentation for load(Reader), which states
Characters in keys and elements can be represented in escape sequences similar to those used for character and string literals.
and then clarifies that
Only a single 'u' character is allowed in a Unicode escape sequence.
Hence a Unicode escape sequence containing a single 'u' character is definitely supported.
Note that there's nothing special going on here at loading time with HTML entities - the String &amp; for example would simply be seen within Java as a String containing 5 characters. As you point out, this might be interpreted in a special way if it were output to some other component later.
On the other hand, the escape sequence \u0061 would be seen within Java as the single-character string 'a', and would be indistinguishable from the file having contained that character instead.
The \u type escaping is a standard Java way of representing Unicode characters. You can read about it in the Java Internationalization FAQ, with the question "How do I specify non-ASCII strings in a properties file?" being the one you're most interested in:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#properties-escape
And that's not related to Properties only; you can use those escapes in your typical Java code as well. See the Text Representation section:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#core-textrep
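For instance, both of these literals compile to the same string, because the compiler processes \uXXXX escapes in source code (reusing the value from the properties example earlier on this page):

public class EscapeInSource {
    public static void main(String[] args) {
        // The compiler turns \u00F1 into ñ before the program even runs.
        String escaped = "cari\u00F1o";
        String plain = "cariño";
        System.out.println(escaped.equals(plain)); // true
    }
}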
When dealing with non-English filenames.
The problem is that my program cannot guarantee those directories and filenames are in English; if some filenames use Japanese or Chinese characters, it will display characters like '?'.
Can anybody suggest what I need to do to access non-English file names?
The problem is that my program cannot guarantee those directories and filenames are in English. If a filename uses Japanese or Chinese characters, it will display characters like '?'.
The problem is apparently that "it" is using the wrong character set to display the filenames. The solution depends on whether "it" is your program (via a GUI), some other application, the command shell / terminal emulator, or the user's web browser. If you could provide more information, maybe I could offer some suggestions.
But turning the characters into underscores is most likely a bad solution. It is liable to lead to filename clashes, and those Chinese / Japanese / etc characters are most likely meaningful to the people who created the files.
By the way, the correct term for "english" letters is Latin.
EDIT
For your use-case, you don't need to store the PDF file using a filename that bears any relation to the supplied filename. I suggest that you try to solve the problem by using a filename consisting of Latin letters and digits generated from (say) System.currentTimeMillis(). If that fails, then your real problem has nothing to do with the filenames at all.
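A sketch of that idea (the prefix and extension are just examples):

// Generate a safe, ASCII-only name instead of reusing the supplied filename.
String storedName = "attachment-" + System.currentTimeMillis() + ".pdf";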
EDIT 2
You ask about the statement
if (fileName.startsWith("=?iso-8859"))
This seems to be trying to unpick a filename in MIME encoded-word format; see RFC 2047 Section 2
Firstly, I think that code may be unnecessary. The javadoc is not specific, but I think that the Part.getFileName() method should deal with decoding of the filename.
Second, if the decoding is necessary, then you are going about it the wrong way. The stuff after the charset cannot simply be treated as the value of the filename. Look at the RFC.
Third, if you need to, you should use the relevant MimeUtility methods to decode "word" tokens ... like the filename (see the sketch after these points).
Fourthly, ISO-8859-1 is NOT a suitable encoding for characters in non-Latin character sets.
Finally, examine the raw email headers of the emails that you are trying to decode and look for the header line that starts
Content-Disposition: attachment; filename=...
If the filename looks like "=?iso-8859-1?...", and the filename is supposed to contain japanese / chinese / etc characters, then the problem is in the client (or whatever) that constructed the email. The character set needs to be "utf-8" or one of the other multibyte character sets.
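To illustrate the third point above, a minimal sketch using JavaMail's MimeUtility (the encoded-word value here is invented; it just decodes to a UTF-8 filename):

import javax.mail.internet.MimeUtility;

public class DecodeFilename {
    public static void main(String[] args) throws Exception {
        // An RFC 2047 encoded-word, as it might appear in a filename= parameter.
        String raw = "=?UTF-8?B?w7HDscOxLnBkZg==?=";
        String decoded = MimeUtility.decodeText(raw);
        System.out.println(decoded); // ñññ.pdf
    }
}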
Java uses Unicode natively - you don't need to replace special characters, as Unicode has no special characters - every code point is treated equally. Your replaceSpChars() may be the culprit here.
We are using Java and Oracle for development.
I have a table in an Oracle database which has a CLOB column in it. Some XYZ application dumps a text file into this column. The text file has multiple rows.
Is it possible that while reading the same CLOB through a Java application, the escape sequences (newline characters, etc.) may get lost?
The reason I ask is that we are going to parse this file line by line, and if the escape sequences are lost, then we would be in trouble. I would have done this analysis myself, but I am on vacation and my team needs urgent help.
Would really appreciate it if you could provide any thoughts/inputs.
You need to ensure that you use the one correct and same character encoding throughout the whole process. I strongly recommend you pick UTF-8 for that. It covers every human character known in the world. Every step which involves handling of character data should be instructed to use the very same encoding.
In SQL context, ensure that the DB and table are created with the UTF-8 charset. In JDBC context, ensure that the JDBC driver is using UTF-8; this is often configurable via the JDBC connection string. In Java code context, ensure that you're using UTF-8 when reading/writing character data from/to streams; you can specify it as the 2nd constructor argument in InputStreamReader and OutputStreamWriter.
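For example, when you handle the character streams yourself (the file names are placeholders; the point is the explicit UTF-8 in both constructors):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Copy {
    public static void main(String[] args) throws Exception {
        try (Reader in = new InputStreamReader(new FileInputStream("in.txt"), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        }
    }
}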
A CLOB stores character data. Carriage returns and line feeds are valid characters, though unprintable ones. As long as your XYZ app is correctly filling your CLOBs, the contents should be just as manageable to you as if they had come from the file.
Depending on the platform and the nature of said "XYZ app," lines could be separated by either \r (Mac), \r\n (DOS/Windows) or \n (Unix/Linux), and you should make allowance for this fact if necessary. This is one aspect where BufferedReader.readLine() is more convenient, as it transparently gets rid of this difference for you.
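Something along these lines, assuming the CLOB arrives as a java.sql.Clob from your query (the helper name is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Clob;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ClobLines {
    // Reads every line of a CLOB; readLine() hides the \r / \r\n / \n difference.
    static List<String> readLines(Clob clob) throws SQLException, IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(clob.getCharacterStream())) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}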
I'm not 100% sure what you mean by escape sequences in this context. Within a (for example) Java literal string, "\n" is an escape sequence representing a newline, but once that string is outputted into something (say, a database), it's not an escape sequence any more, it's an actual newline character.
Anyhow, to your direct question: Java can read text from Oracle CLOBs perfectly fine. Newlines are not lost.
I read text from different sources which can contain characters from different languages / extended characters like € ƒ „ … † ® ©. I am then supposed to write it to an XML file, and I am using PrintWriter in Java to write whatever string I read to the XML file. These extended characters, which have ASCII values greater than 127, give an illegal-character error in the XML file, so how can I encode them properly while writing to XML?
First, there's no such thing as an ASCII code above 127. ASCII only defines values up to 127. "Extended ASCII" is an ambiguous term, as it's used to describe many different encodings.
Now, as for XML: use whichever XML API you want to write the string, without worrying about the contents (so long as they are representable in XML; various control characters in the range U+0000 to U+001F aren't representable, unfortunately). Don't try to create the XML from scratch yourself - that's what XML APIs are for. Make sure that your XML document uses an encoding that will cope with the characters you need (UTF-8 is normally a good choice, and is often the default), make sure that your Java strings have the right Unicode data in them, and you should be fine.
EDIT: I hadn't actually spotted this bit before:
I am using PrintWriter in Java to write to an XML file
Don't. Please use an XML API. There are plenty around, and you'll have a lot less to worry about. I'd also not recommend using PrintWriter anyway for the most part - suppressing exceptions isn't really a good idea in most cases.
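For instance, a minimal sketch with the standard StAX API (the element name, content, and file name are invented for the example):

import java.io.FileOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class WriteXml {
    public static void main(String[] args) throws Exception {
        try (FileOutputStream out = new FileOutputStream("out.xml")) {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("text");
            // The writer escapes whatever needs escaping; just pass the Java string.
            writer.writeCharacters("€ ƒ „ … † ® ©");
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        }
    }
}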
Use the &#value; syntax. A space, for example, would be &#32;.
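If you build such references by hand, a rough sketch of the idea (only a fallback; an XML API also handles &, < and > for you):

public class XmlRefs {
    // Turns characters above the ASCII range into numeric character references,
    // e.g. '€' becomes &#8364;.
    static String toXmlReferences(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 127) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}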