Right way to deal with Unicode BOM in a text file - java

I am reading a text file in my program which contains the Unicode BOM character \ufeff (65279) in places. This causes several issues in further parsing.
Right now I am detecting and filtering these characters myself, but I would like to know whether the Java standard library or Guava has a way to do this more cleanly.

There is no built-in way of dealing with a (UTF-8) BOM in Java or, indeed, in Guava.
There is currently a bug report on the Guava website about dealing with a BOM in Guava IO.
There are several SO posts (here and here) on how to detect/skip the BOM while reading a file in plain Java.
Your BOM (\ufeff) seems to be UTF-16, which, according to the same Guava report, should be dealt with automatically by Java. This SO post seems to suggest the same.
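If you do need to strip a UTF-8 BOM by hand in the meantime, a minimal plain-Java sketch along the lines of those posts (the class and method names are just illustrative) would be:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    // Opens a file as a Reader, silently consuming a leading UTF-8 BOM (EF BB BF) if present.
    public static Reader open(String fileName) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(fileName), 3);
        byte[] maybeBom = new byte[3];
        int read = in.read(maybeBom, 0, 3);
        boolean isBom = read == 3
                && maybeBom[0] == (byte) 0xEF
                && maybeBom[1] == (byte) 0xBB
                && maybeBom[2] == (byte) 0xBF;
        if (!isBom && read > 0) {
            in.unread(maybeBom, 0, read); // not a BOM: push the bytes back for normal reading
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}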

Related

How to add a UTF-8 BOM in Kotlin?

I need to generate text data files, both with a UTF-8 byte order mark and without it. How do I do that?
So far the file has been generated like this:
File(fileName).writeText(source, Charsets.UTF_8)
But this does not provide a way to add the BOM on demand.
Note 1:
The question How to add a UTF-8 BOM in Java uses BufferedWriter and PrintStream.print(), but that would mean changing the file generation to a more Java-oriented style (which is my last resort).
Note 2:
Another question, Java: UTF-8 and BOM from 2012, points to a Java bug report about the BOM not being handled. The comments there suggest never using a BOM, but that is not an option in my case because the files are sent to different services, some of which require it and others don't. Does anybody know of more recent news about this, and whether it applies to Kotlin?
The BOM is a single Unicode character, U+FEFF. You can easily add it yourself, if it's required.
File(fileName).writeText("\uFEFF" + source, Charsets.UTF_8)
The harder part is that the BOM is not stripped automatically when the file is read back in. This is why people recommend not adding a BOM when it's not needed.
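For the read-back side, a minimal plain-Java sketch (usable from Kotlin as-is) that drops a leading U+FEFF if the writer added one:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomStripper {
    // Reads a UTF-8 text file and removes a leading BOM character, if any.
    public static String readWithoutBom(Path path) throws IOException {
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}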

using java for Arabic NLP

I'm working on Arabic natural language processing such as word stemming, tokenization etc.
In order to deal with words/chars, I need to write Arabic letters in Java. So my question is: is it good practice to write Arabic letters directly in Java source, without escaping them?
example:
which one is better:
if (word.startsWith("ت")) {...}
or
if (word.startsWith("\u062A")) {...}
You should write the Arabic letters themselves for the sake of readability; to the machine there is no real difference. Also set your source encoding to UTF-8, since Arabic characters cannot be represented in ASCII.
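As a quick check that the two spellings really compile to the same string (assuming the source file itself is saved and compiled as UTF-8):

public class ArabicLiteralCheck {
    public static void main(String[] args) {
        String literal = "ت";        // Arabic letter teh written directly
        String escaped = "\u062A";   // the same code point as a Unicode escape
        System.out.println(literal.equals(escaped));      // true
        System.out.println("تجربة".startsWith(literal));  // true
    }
}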
If you are familiar with Python, then the NLTK module will be of great help to you.
I would go with the real characters in your master copy, ensuring your compiler is configured for the correct encoding. You can always run it through native2ascii if you need the escaped version for any reason. Once you get going you may well find you don't actually have that many hard-coded strings in the source code, as things like gazetteer lists of potential named entities etc. are better represented as external text files.
GATE has a basic named entity annotation plugin for Arabic which may be a good starting point for your work (full disclosure: I'm one of the GATE core development team).

fusion chart not supporting chinese characters

FusionCharts is not displaying Chinese characters. I am using the Spring framework for this task and generating the XML file dynamically with dom4j. When I use Chinese characters in my chart, it shows some other characters instead; with English text there is no problem. Kindly help me solve this problem.
Did you specify the proper encoding when reading and writing the file? The safest way to handle Chinese characters is to always provide an encoding (e.g. UTF-8) when reading or writing the file. Once you have a valid Java string, you don't need to worry about encoding anymore.
Anyway, can you post the Chinese characters you are dealing with? They might be special ones that need special handling.
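Since you mention dom4j, the "always provide an encoding" part when writing the generated XML might look roughly like this sketch (adjust to however your Document is actually produced; the class and method names here are only illustrative):

import java.io.FileOutputStream;
import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

public class ChartXmlWriter {
    // Writes a dom4j Document with an explicit UTF-8 declaration so that
    // Chinese characters survive the round trip to the chart.
    public static void write(Document chartXml, String fileName) throws Exception {
        OutputFormat format = OutputFormat.createPrettyPrint();
        format.setEncoding("UTF-8");
        XMLWriter writer = new XMLWriter(new FileOutputStream(fileName), format);
        try {
            writer.write(chartXml);
        } finally {
            writer.close();
        }
    }
}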

Java problems with UTF-8 in different OS

I'm programming an application with other people for college homework, and sometimes we use non-English characters in comments or in strings displayed in the views. The problem is that each of us is using a different OS and sometimes a different IDE.
Concretely, one is using macOS, another Windows 7, and another and I use Ubuntu Linux. Furthermore, all of them use Eclipse and I use gedit. We have no idea whether Eclipse or gedit can be configured to work properly with UTF-8; at least I haven't found anything for mine.
The fact is that what I write with non-English characters appears in the Windows and macOS virtual machines as strange symbols, and vice versa, and sometimes what my non-Linux friends write provokes compilation warnings like this: warning: unmappable character for encoding UTF8.
Do you have any idea how to solve this? It is not really urgent, but it would be a help.
Thank you.
Not sure about gedit, but you can certainly configure Eclipse to use whatever encoding you like for source code. It's part of the project properties (and saved in the .settings directory within the project).
Eclipse works fine with UTF-8. See Michael's answer about configuring it. Maybe for Windows and/or MacOS it is really necessary. Ubuntu uses UTF-8 as the default encoding so I don't think it's necessary to configure Eclipse there.
As for Gedit, this picture shows that it is possible to change the encoding when saving a file in Gedit.
Anyway, you need to make sure that all of you use UTF-8 for your sources. This is the only reasonable way to achieve cross-platform portability of your sources.
You could avoid the issue in Strings by using character escape sequences, and using only ASCII encoding for the files.
For example, an en dash can be expressed as "\u2013".
You can quickly search for the Java code for individual characters here.
As Sergey notes below, this works best for small numbers of non-ASCII characters. An alternative is to put all the UTF-8 strings in resource files. Eclipse provides a handy wizard for this.
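As a sketch of that combination, keeping the source files ASCII-only with escapes and moving longer strings into a (hypothetical) messages.properties resource bundle:

import java.util.ResourceBundle;

public class Messages {
    public static void main(String[] args) {
        // En dash written as an escape, so this .java file stays pure ASCII
        String dash = "\u2013";

        // Longer non-English strings live in messages.properties; the values there
        // can use Unicode escapes as well, so that file also stays portable.
        ResourceBundle bundle = ResourceBundle.getBundle("messages");
        System.out.println(bundle.getString("greeting") + " " + dash);
    }
}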
If your UTF-8 file contains a BOM (byte order mark), then you will have a problem. It is a known bug; see here and here.
The BOM is optional with UTF-8, and most of the time it is not there because it breaks many tools (like Javadoc, XML parsers, ...).
More info here.

Java File parsing toolkit design, quick file encoding sanity check

(Disclaimer: I looked at a number of posts on here before asking, and found this one particularly helpful; I was just looking for a bit of a sanity check from you folks if possible.)
Hi All,
I have an internal Java product that I have built for processing data files for loading into a database (AKA an ETL tool). I have pre-rolled stages for XSLT transformation and for things like pattern replacement within the original file. The input files can be of any format; they may be flat data files or XML data files, and you configure the stages you require for the particular data feed being loaded.
Up until now I have ignored the issue of file encoding (a mistake, I know) because everything was working fine, in the main. However, I am now coming up against file encoding issues. To cut a long story short: because of the way stages can be configured together, I need to detect the encoding of the input file and create a Java Reader with the appropriate arguments. I just wanted a quick sanity check with you folks before I dive into something I can't claim to fully comprehend:
1. Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
2. Use JUniversalChardet or jchardet to sniff the input file encoding
3. Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?
Is there any way I can be confident of backwards compatibility with data already loaded using my existing approach of letting the Java runtime use its default encoding (windows-1252)?
Thanks in advance,
-James
With flat character data files, any encoding detection will need to rely on statistics and heuristics (like the presence of a BOM, or character/pattern frequency) because there are byte sequences that will be legal in more than one encoding, but map to different characters.
XML encoding detection should be more straightforward, but it is certainly possible to create ambiguously encoded XML (e.g. by leaving out the encoding in the header).
It may make more sense to use encoding detection APIs to indicate the probability of error to the user rather than rely on them as decision makers.
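If you do go the sniffing route, a minimal sketch with juniversalchardet (assuming the library is on the classpath; the fallback name is just a placeholder for whatever your current default is) could look like:

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class EncodingSniffer {
    // Returns the detected charset name, or the supplied fallback when the
    // detector cannot name one with confidence.
    public static String sniff(String fileName, String fallback) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream in = new FileInputStream(fileName)) {
            byte[] buf = new byte[4096];
            int read;
            while ((read = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, read);
            }
        }
        detector.dataEnd();
        String detected = detector.getDetectedCharset();
        detector.reset();
        return detected != null ? detected : fallback;
    }
}

The returned name can then be passed to Charset.forName, or simply surfaced to the user as a suggestion rather than treated as a hard decision.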
When you transform data from bytes to chars in Java, you are transcoding from encoding X to UTF-16(BE). What gets sent to your database depends on your database, its JDBC driver and how you've configured the column. That probably involves transcoding from UTF-16 to something else. Assuming you're not altering the database, existing character data should be safe; you might run into issues if you intend parsing BLOBs. If you've already parsed files written in disparate encodings, but treated them as another encoding, the corruption has already taken place - there are no silver bullets to fix that. If you need to alter the character set of a database from "ANSI" to Unicode, that might get painful.
Adoption of Unicode wherever possible is a good idea. It may not be possible, but prefer file formats where you can make encoding unambiguous - things like XML (which makes it easy) or JSON (which mandates UTF-8).
Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right way generally does break backwards compatibility), perhaps with some additional thought about whether UTF-8 would be a better choice.
Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings that you have tested, so that you know your sniffer correctly distinguishes and identifies them.
Another option here is to use some form of metadata (a file naming convention, if nothing more robust is an option) that lets your code know the data was provided according to the UTF-16 standard and behave accordingly; otherwise, convert it to UTF-16 before moving forward.
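Tying those pieces together, the per-stage conversion to your UTF-16 standard could be as simple as this sketch, where the detected charset comes from whatever sniffing or metadata step runs beforehand (names are illustrative):

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StageIo {
    // Copies an input file, decoded with its detected charset, to a stage output
    // file encoded in the toolkit's standard UTF-16.
    public static void normalize(File input, Charset detected, File output) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(new FileInputStream(input), detected));
             Writer out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output), StandardCharsets.UTF_16))) {
            char[] buf = new char[8192];
            int read;
            while ((read = in.read(buf)) != -1) {
                out.write(buf, 0, read);
            }
        }
    }
}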
