I need to generate text data files, both with a UTF-8 byte order mark (BOM) and without one. How do I do that?
So far the file has been generated like this:
File(fileName).writeText(source, Charsets.UTF_8)
But this does not provide a way to add the BOM on demand.
Note 1:
In the question How to add a UTF-8 BOM in java, the answers use BufferedWriter and PrintStream.print(), but that would mean changing the file generation to a more Java-oriented style (this is my last option).
Note 2:
Another question, Java: UTF-8 and BOM, from 2012, points to a Java bug: the BOM is not handled. The comments there suggest never using a BOM, but that is not an option in my case because the files are sent to different services, some of which require it and others don't. Does anybody know of more recent news about this, and whether it applies to Kotlin?
The BOM is a single Unicode character, U+FEFF. You can easily add it yourself, if it's required.
File(fileName).writeText("\uFEFF" + source, Charsets.UTF_8)
The harder part is that the BOM is not stripped automatically when the file is read back in. This is why people recommend not adding a BOM when it's not needed.
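If you also control the reading side, stripping a leading BOM back out is just a prefix check. A minimal plain-Java sketch (the class and method names are only illustrative; in Kotlin, removePrefix("\uFEFF") on the result of readText does the same):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomAwareRead {
    // Reads a UTF-8 text file and drops a leading BOM, if present.
    static String readWithoutBom(String fileName) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}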
I am zipping a file whose name contains some special characters, like Péréquation LES HOPITAUX NEUFS.xls, to a different folder, say temp.
I am able to zip the file, but the problem is that the file name automatically changes to
P+¬r+¬quation LES HOPITAUX NEUFS.xls.
How can I support unicode characters for file names inside a zip archive?
It depends a little bit on what code you're using to create the archive. The old Java compression classes are not as flexible as you need.
You may use Apache Commons Compress. Michael Simons wrote this nice piece of code:
ZipArchiveOutputStream ostream = ...; // Your initialization code here
ostream.setEncoding("Cp437"); // This should handle your "special" characters
ostream.setFallbackToUTF8(true); // For "unknown" characters!
ostream.setUseLanguageEncodingFlag(true);
ostream.setCreateUnicodeExtraFields(
ZipArchiveOutputStream.UnicodeExtraFieldPolicy.NOT_ENCODEABLE);
If you're using Java 7, then you finally have a Charset parameter (which can be UTF-8) on the ZipOutputStream constructor.
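For example, a minimal sketch of writing an archive with UTF-8 entry names through that constructor (the file names and content are placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class UnicodeZip {
    public static void main(String[] args) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(
                new FileOutputStream("temp/out.zip"), StandardCharsets.UTF_8)) {
            // The entry name can contain non-ASCII characters; it is encoded
            // with the Charset passed to the constructor above.
            zos.putNextEntry(new ZipEntry("Péréquation LES HOPITAUX NEUFS.xls"));
            zos.write("dummy content".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
    }
}

Whether the receiving tool then displays the name correctly still depends on how well it handles UTF-8 names, as discussed below.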
The big problem, anyway, is that many implementations don't understand Unicode encoding, because the original ZIP file format is ASCII and there was no official standard for Unicode. See this post for further details.
The ZIP specification (historically) does not specify what character encoding is to be used for the embedded file names and comments; the original IBM PC character encoding set, commonly referred to as IBM Code Page 437, is supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding to encode and decode all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/ZIP files.
Consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to a java.util.jar/zip-based tool, and vice versa, if the file name contains characters that are not compatible between Cp437 (alternatively, tools might simply use the default platform encoding) and UTF-8.
For most European languages, you're "lucky" :-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most of the characters are simply out of luck. This is why bug 4244499 was No. 1 on the Top 25 Java Bugs for so many years. The bug is no longer on the list :-) as it has finally been "fixed" in OpenJDK 7, b57. I still keep a snapshot as the record/kudos for myself :-)
The solution (I would rather say "solution" than "fix") in JDK 7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" parameter, as shown below.
ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)
With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create ZIP files encoded in non-UTF-8 via the new ZipOutputStream(os, charset) constructor, if necessary.
zip is a stripped-down version of the Jar tool with a "-encoding" option to support non-UTF-8 encodings for entry names and comments; it can serve as a demo for how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...
I am currently using FreeMarker to generate a number of configuration files. So far these have been either XML files or proprietary-format text files. I would now like to generate some Java .properties files, but have hit a couple of issues.
The first is character encoding. As far as I can see simply adding
<#ftl encoding="8859_1">
to the start of the file should sort this out.
The second issue is the escaping of the keys and values. The keys are probably ok as I would be hardcoding these in the template anyway so I can escape them in the template. The values will be coming from my data model and so will need escaping.
I can see how I could create my own user-defined directive and, by installing it as a shared variable, use it in my template.
Is this the best or only way to do this? I would have thought generating .properties files is something that has been tackled many times before and was hoping something may already exist before I start writing my own code.
The class java.util.Properties has various store methods for saving properties to OutputStreams or Writers. This seems preferable to trying to adapt FreeMarker.
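A minimal sketch (the keys, values and file name are placeholders); note that store(OutputStream, ...) always writes ISO 8859-1 and \uXXXX-escapes everything else, while store(Writer, ...) uses whatever encoding the Writer was created with:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

public class WriteProps {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Special characters (=, :, backslashes, non-Latin-1, ...) are escaped for you on store().
        props.setProperty("greeting", "caf\u00e9 = coffee");
        props.setProperty("path", "C:\\temp\\output");
        try (OutputStream out = new FileOutputStream("generated.properties")) {
            props.store(out, "generated configuration");
        }
    }
}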
I don't see what charset issues would be specific to generating properties files. But note that the charset of the template and the charset of the output are independent, so you might as well use the same charset for these templates as for the others (such as UTF-8).
As for escaping, always use auto-escaping if you can. In 2.3.24 that will be especially sleek, but unless you are allowed to use unreleased versions, you will have to wait for it until the end of February or so. (If you can use unreleased/unofficial versions, you can find the internal testing releases in the developer list archive.) Before 2.3.24, there's <#escape x as propEsc(x)>all the template content here</#escape>, where propEsc is a TemplateMethodModelEx (not a TemplateDirectiveModel) that you have added as a shared variable or the like. That way, all ${...}-s will be magically escaped.
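A minimal sketch of such a propEsc method; the class name and the exact set of escaped characters are illustrative, not something shipped with FreeMarker:

import java.util.List;

import freemarker.template.TemplateMethodModelEx;
import freemarker.template.TemplateModelException;
import freemarker.template.TemplateScalarModel;

// Escapes a value for use in a .properties file; register it as a shared variable,
// e.g. configuration.setSharedVariable("propEsc", new PropEsc());
public class PropEsc implements TemplateMethodModelEx {
    @Override
    public Object exec(List args) throws TemplateModelException {
        if (args.size() != 1) {
            throw new TemplateModelException("propEsc expects exactly one argument");
        }
        String value = ((TemplateScalarModel) args.get(0)).getAsString();
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '=':  sb.append("\\="); break;
                case ':':  sb.append("\\:"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}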
I am reading a text file in my program which contains the Unicode BOM character \ufeff (65279) in places. This causes several issues in further parsing.
Right now I am detecting and filtering these characters myself but would like to know if Java standard library or Guava has a way to do this more cleanly.
There is no built-in way of dealing with a (UTF-8) BOM in Java or, indeed, in Guava.
There is currently a bug report on the Guava website about dealing with a BOM in Guava IO.
There are several SO posts (here and here) on how to detect/skip the BOM while reading a file in plain Java.
Your BOM (\ufeff) seems to be UTF-16, which, according to the same Guava report, should be dealt with automatically by Java. This SO post seems to suggest the same.
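If you do end up skipping it yourself in plain Java, a minimal sketch (the class and method names are just illustrative) that unreads the first character whenever it is not a BOM:

import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

public class BomSkipper {
    // Wraps a Reader and silently consumes a leading U+FEFF, if present.
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader pushback = new PushbackReader(in, 1);
        int first = pushback.read();
        if (first != -1 && first != '\uFEFF') {
            pushback.unread(first);  // not a BOM: put it back
        }
        return pushback;
    }
}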
I'm programming an application with other people for college homework, and sometimes we use non-English characters in comments or in Strings displayed in the views. The problem is that every one of us is using a different OS, and sometimes different IDEs, to program.
Concretely, one is using macOS, another Windows 7, and another and I use Ubuntu Linux. Furthermore, all of them use Eclipse and I use gedit. We have no idea whether Eclipse or gedit can be configured to work properly with UTF-8; at least, I haven't found anything for mine.
The fact is that what I write with non-English characters appears with strange symbols on the Windows and macOS virtual machines, and vice versa, and sometimes what my non-Linux friends write provokes compilation warnings like this: warning: unmappable character for encoding UTF8.
Do you have any idea how to solve this? It is not really urgent, but it would be a help.
Thank you.
Not sure about gedit, but you can certainly configure Eclipse to use whatever encoding you like for source code. It's part of the project properties (and saved in the .settings directory within the project).
Eclipse works fine with UTF-8; see Michael's answer about configuring it. That may really be necessary on Windows and/or macOS. Ubuntu uses UTF-8 as the default encoding, so I don't think it's necessary to configure Eclipse there.
As for Gedit, this picture shows that it is possible to change the encoding when saving a file in Gedit.
Anyway, you need to make sure that all of you use UTF-8 for your sources. This is the only reasonable way to achieve cross-platform portability of your sources.
You could avoid the issue in Strings by using character escape sequences, and using only ASCII encoding for the files.
For example, an en dash can be expressed as "\u2013".
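In source it might look like this (the identifiers are only illustrative); the file stays pure ASCII, and the compiler turns the escapes back into the real characters:

public class AsciiOnlyStrings {
    static final String EN_DASH = "\u2013";   // an en dash
    static final String U_UMLAUT = "\u00fc";  // a lowercase u with umlaut
}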
You can quickly search for the Java code for individual characters here.
As Sergey notes below, this works best for small numbers of non-ASCII characters. An alternative is to put all the UTF-8 strings in resource files. Eclipse provides a handy wizard for this.
If your UTF-8 file contains a BOM (byte order mark) then you will have a problem. It is a known bug; see here and here.
The BOM is optional with UTF-8, and most of the time it is not there because it breaks many tools (like Javadoc, XML parsers, ...).
More info here.
(Disclaimer: I looked at a number of posts on here before asking; I found this one particularly helpful. I was just looking for a bit of a sanity check from you folks, if possible.)
Hi All,
I have an internal Java product that I have built for processing data files for loading into a database (a.k.a. an ETL tool). I have pre-rolled stages for XSLT transformation and for doing things like pattern replacement within the original file. The input files can be of any format: they may be flat data files or XML data files, and you configure the stages you require for the particular data feed being loaded.
I have up until now ignored the issue of file encoding (a mistake, I know) because everything was working fine, in the main. However, I am now coming up against file encoding issues. To cut a long story short, because of the way stages can be configured together, I need to detect the encoding of the input file and create a Java Reader object with the appropriate arguments. I just wanted to do a quick sanity check with you folks before I dive into something I can't claim to fully comprehend. My plan is to:
Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
Use JUniversalChardet or jchardet to sniff the input file encoding
Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?
Is there any way I can be confident of backwards compatibility with any data loaded using my existing approach of letting the Java runtime pick the encoding (windows-1252)?
Thanks in advance,
-James
With flat character data files, any encoding detection will need to rely on statistics and heuristics (like the presence of a BOM, or character/pattern frequency) because there are byte sequences that will be legal in more than one encoding, but map to different characters.
XML encoding detection should be more straightforward, but it is certainly possible to create ambiguously encoded XML (e.g. by leaving out the encoding in the header).
It may make more sense to use encoding detection APIs to indicate the probability of error to the user rather than rely on them as decision makers.
When you transform data from bytes to chars in Java, you are transcoding from encoding X to UTF-16(BE). What gets sent to your database depends on your database, its JDBC driver and how you've configured the column. That probably involves transcoding from UTF-16 to something else. Assuming you're not altering the database, existing character data should be safe; you might run into issues if you intend to parse BLOBs. If you've already parsed files written in disparate encodings but treated them as another encoding, the corruption has already taken place; there are no silver bullets to fix that. If you need to alter the character set of a database from "ANSI" to Unicode, that might get painful.
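A minimal sketch of where those two transcoding steps sit in code (the file names and encodings are placeholders, with windows-1252 standing in for the detected or configured input encoding):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    public static void main(String[] args) throws IOException {
        Charset inputEncoding = Charset.forName("windows-1252");
        try (Reader in = new InputStreamReader(new FileInputStream("input.txt"), inputEncoding);
             Writer out = new OutputStreamWriter(new FileOutputStream("output.txt"), StandardCharsets.UTF_16)) {
            // The Reader decodes bytes -> UTF-16 chars in memory;
            // the Writer encodes chars -> UTF-16 bytes on disk.
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}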
Adoption of Unicode wherever possible is a good idea. It may not be possible, but prefer file formats where you can make encoding unambiguous - things like XML (which makes it easy) or JSON (which mandates UTF-8).
Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right-way option generally does break backwards compatibility), with perhaps additional thought about whether UTF-8 would be a better choice.
Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings that you have tested, so that you know your sniffer correctly distinguishes and identifies them.
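If you go down the sniffing route, a minimal sketch using juniversalchardet (assuming the org.mozilla.universalchardet dependency is on the classpath); a null result means the detector is not confident, which is a good point to fall back to asking the user, as suggested above:

import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class EncodingSniffer {
    // Returns the detected charset name (e.g. "UTF-8", "WINDOWS-1252"), or null if unsure.
    static String sniff(String fileName) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        try (FileInputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        return detector.getDetectedCharset();
    }
}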
Another option here is to use some form of metadata (a file naming convention, if nothing more robust is an option) that lets your code know that the data was provided as UTF-16 and behave accordingly; otherwise, convert it to UTF-16 before moving forward.