Unicode aware CSV parser in Java

I'm looking for a Java implementation of a CSV (comma-separated values) parser that handles Unicode data properly, e.g. UTF-8 CSV files with Chinese text. I suppose such a parser should use code-point-aware methods internally while iterating, comparing, etc.
An Apache 2 license or similar would work best.

I don't believe in reinventing the wheel. So I do not want to write my own parser and go through the same headaches someone else did.
I personally like the CSV Parser from Ostermiller. They also have a Maven Repository if interested.
You can also check out OpenCSV. There is a Stack Overflow question already about parsing unicode.

Have you tried Commons CSV?

It's pretty easy to write yourself. Open the file with a FileInputStream and an InputStreamReader that uses UTF-8. Wrap it in a BufferedReader so you can iterate through it using readLine(), getting each line as a String. Use regular expressions to split each line into fields.
The only tricky part is constructing the regexes so they don't treat commas that are enclosed within quotes as field delimiters.
The approach above is a bit inefficient, but fast enough for most apps. If you have real performance requirements then you'll need something that iterates through characters. I wrote one a few years ago that uses a state machine that worked ok.
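A minimal sketch of the readLine-plus-regex approach described above, assuming well-formed quoting and no line breaks inside quoted fields. The class and method names are hypothetical, not from any library. The lookahead matches a comma only when an even number of double quotes follows it, which means the comma sits outside any quoted field.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class RegexCsvSketch {
    // A comma followed by an even number of double quotes up to end of line.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        // A limit of -1 keeps trailing empty fields.
        for (String raw : COMMA_OUTSIDE_QUOTES.split(line, -1)) {
            String field = raw;
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                // Strip the enclosing quotes and collapse doubled quotes.
                field = field.substring(1, field.length() - 1).replace("\"\"", "\"");
            }
            fields.add(field);
        }
        return fields;
    }

    // Decodes the stream as UTF-8 and splits it line by line.
    static List<List<String>> readAll(InputStream in) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(splitLine(line));
            }
        }
        return rows;
    }
}
```

Passing `new FileInputStream("data.csv")` (a placeholder file name) to `readAll` gives the file-based variant. The lookahead rescans the rest of the line for every comma, which is where the inefficiency mentioned above comes from.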

Related

Read CSV with semicolons into a String on Java

My main problem is that I'm trying to read a CSV delimited by ; in Java and the problem comes when I try to read a field of the CSV that contains a ;. For example:
"I want you to do that;"
In this case the field is read as
"I want you to do that"
And it creates another field that is just an empty string.
I use a BufferedReader to read the CSV and the split method to separate it with the ;. I'm not allowed to use libraries like OpenCSV so I want to find a solution with the method I'm using.

Parse according to the quotation marks
If the data incidentally containing the delimiter is wrapped in double quotes (QUOTATION MARK), then you should have no problem with parsing. Your parsing should look first for pairs of double quote characters. After that, look for delimiters outside of those pairs.
Rather than writing the parsing code yourself, I highly recommend using a CSV library. In the Java ecosystem, you have a wealth of good products to choose from. For example, I have made successful use of Apache Commons CSV.
See also the specification for CSV: RFC 4180.
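If a library really is off the table, the quote-first parsing described above can be sketched as a small character scanner instead of `split`. This is only an illustration under stated assumptions: fields containing `;` are wrapped in double quotes, a literal quote is written as two quotes, and no field spans more than one line. The class name `DelimitedLineParser` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class DelimitedLineParser {
    static List<String> parse(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // delimiters inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing delimiter
        return fields;
    }
}
```

With this sketch, `DelimitedLineParser.parse("a;\"I want you to do that;\";b", ';')` yields the three fields `a`, `I want you to do that;` and `b`. Because the delimiter and the quote are both ASCII, comparing UTF-16 `char` values is safe for any Unicode text: surrogate halves never equal either character.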

How would you go about separating strings in a file if they may contain any arbitrary characters?

I'm currently writing a Java program which involves goals. It's basically a to-do list. Each goal has a few strings, such as name, description etc. I can save and load these goals to a file. My issue was separating the strings - I couldn't think of a character that couldn't be in the string itself. I ended up prefixing each string with its length and then a colon.
I'm sure there is something in the Java API that will handle this, like ObjectOutputStream. I'm curious about the 'general case', though. This must be an issue for any program that saves and loads strings from a file without being able to assume anything about the string. Is there a better way to go about this?
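For reference, the length-prefix scheme described above might look roughly like this. It is a sketch, not the asker's actual code: the class name `LengthPrefixed` is hypothetical, and lengths are counted in Java `char` units (UTF-16 code units), which only works if the reader and writer agree on that.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class LengthPrefixed {
    // Writes each value as <length>:<value>, with nothing between entries.
    static String encode(List<String> values) {
        StringBuilder out = new StringBuilder();
        for (String v : values) {
            out.append(v.length()).append(':').append(v);
        }
        return out.toString();
    }

    // Reads the length up to the first colon, then takes exactly that many chars.
    static List<String> decode(String data) {
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos < data.length()) {
            int colon = data.indexOf(':', pos);
            if (colon < 0) {
                throw new IllegalArgumentException("missing ':' after position " + pos);
            }
            int length = Integer.parseInt(data.substring(pos, colon));
            int start = colon + 1;
            if (length < 0 || start + length > data.length()) {
                throw new IllegalArgumentException("bad length at position " + pos);
            }
            values.add(data.substring(start, start + length));
            pos = start + length;
        }
        return values;
    }
}
```

Since the decoder never searches inside a value, the values may contain colons, digits, or line breaks without confusing it.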

There are a couple of ways to handle your case, e.g.:
Encoding your strings with something like Base64
Applying a well-defined format, e.g. JSON or CSV
Plenty of tools support this, including:
Apache Commons Codec for Base64 encoding/decoding
Jackson for JSON serializing/deserializing
opencsv for CSV serializing/deserializing
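As one possible illustration of the Base64 option, here is a sketch that uses the JDK's own `java.util.Base64` (available since Java 8) rather than Apache Commons Codec. The class name `Base64Fields` and the choice of a comma as terminator are assumptions for the example; the comma is safe because it is not part of the standard Base64 alphabet.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical helper for illustration only.
final class Base64Fields {
    // Encodes each value as Base64 of its UTF-8 bytes, terminated by a comma.
    static String encode(List<String> values) {
        StringBuilder out = new StringBuilder();
        for (String v : values) {
            byte[] utf8 = v.getBytes(StandardCharsets.UTF_8);
            out.append(Base64.getEncoder().encodeToString(utf8)).append(',');
        }
        return out.toString();
    }

    // Splits on the terminating commas and decodes each chunk back to text.
    static List<String> decode(String data) {
        List<String> values = new ArrayList<>();
        int start = 0;
        int comma;
        while ((comma = data.indexOf(',', start)) >= 0) {
            byte[] utf8 = Base64.getDecoder().decode(data.substring(start, comma));
            values.add(new String(utf8, StandardCharsets.UTF_8));
            start = comma + 1;
        }
        return values;
    }
}
```

Terminating every field, rather than separating fields, lets a list holding a single empty string round-trip distinctly from an empty list. The cost is that the file is no longer human-readable and grows by roughly a third.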

Replacing Java unicode encodings with actual characters

When I make web queries, for accented characters, I get special character encodings back as strings such as "\u00f3", but I need to replace each one with the actual character, like "ó", before making another query.
How would I find these cases without actually looking for each one, one by one?

It seems you're handling JSON formatted data.
Use any of the many freely available JSON libraries to handle this (and other parsing issues) for you instead of trying to do it manually.
The one from JSON.org is pretty widely used, but there are surely others that work just as well.
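If pulling in a JSON library is not possible, a regex can find every such escape in one pass. This is only a sketch under the assumption that the input contains plain backslash-u escapes with exactly four hex digits; it does not handle other JSON escapes or an escaped backslash in front of the `u`, both of which a real JSON parser gets right. The class name `UnicodeUnescape` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class UnicodeUnescape {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Four hex digits always fit in one UTF-16 code unit.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' in the decoded text.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Characters outside the Basic Multilingual Plane arrive as two consecutive escapes (a surrogate pair); decoding each one separately and concatenating the results reproduces the pair.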

If data contains commas, how to store it in csv?

If the task is to create a csv file out of some data where commas may be present, is there a way to do it without later confusing which comma is a delimiter and which comma is part of a value?
Obviously, we can use a different delimiter, replace all occurrences, or replace the original comma with something else, but for the purpose of this question let's say that modifying the original data is not an option and a comma is the only delimiter allowed.
How would you approach something like this? Would it be easier to create the xls instead? Can you recommend any java libraries that handle this well?

A true CSV reader should be able to handle this; the values should be in quotes, e.g.:
one,two,"a, b, c",four
...per item #6 in Section 2 of RFC 4180.

While there's no single CSV standard, the usual convention is to surround entries containing commas in double quotes (i.e. ").
Preempting the next question: What to do if your data contains a double quote? In this case each one is usually replaced with a pair of double quotes.
While I hate to cite Wikipedia as a source, they do have a pretty good roundup of basic rules and examples for CSV formatting.
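The two conventions above (wrap in double quotes, double any embedded quote) fit in a few lines on the writing side. A minimal sketch with hypothetical names, assuming a comma delimiter and that fields containing line breaks should be quoted as well:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
final class CsvWriterSketch {
    // Quotes a field only when it contains a comma, a quote, or a line break.
    static String escapeField(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        // Double each embedded quote, then wrap the whole field in quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    static String toLine(List<String> fields) {
        return fields.stream()
                .map(CsvWriterSketch::escapeField)
                .collect(Collectors.joining(","));
    }
}
```

For the fields `one`, `two`, `a, b, c`, `four`, `toLine` produces `one,two,"a, b, c",four`, and the original data is left unmodified.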

I would either use a different delimiter or use a library like Apache POI.

I think the best way is to use Apache POI: http://poi.apache.org/
You can easily create XLS documents without much hassle.
However, if you really need CSV and not XLS, you can surround the value with quotes. This should also solve the problem.

Usually, you work with , as separator and ' as quote. So your values would look like:
foo, 'bar, baz', iik, aje

"the task is to create a csv file"
Actually an impossible task, since there is no such thing as "a CSV" file. Different Microsoft products have used different (subtly different, I grant) formats and named them all "CSV". As most spreadsheets can read delimiter-separated value (DSV) files, you might be better off writing one of those.

Java library to escape/clean XML?

I get some malformed xml text input like:
"<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"
I want to clean the input so to get:
"<Tag>something</Tag> 8 &gt; 3, 2 &lt; 3, ... <Tag>something</Tag>"
That is, escape special symbols like < and > yet keep the valid tags (<Tag>something</Tag>; note, with the same case).
Do you know of any Java library to do this? Probably an XML/HTML parser? (Though I don't really need a parser, simply a "clean" procedure.)

JTidy is "HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML"
But it can also be used with XML. Check the documentation. It's incredibly smart, so it will probably work for you.

I don't know of any library that would do that. Your input is malformed XML, and no proper XML parser would accept it. More important, it is not always possible to distinguish an actual tag from something that looks-like-a-tag-but-is-really-text. Therefore any heuristic-based attempt that you make to solve the problem will be fragile; i.e. it could occasionally produce malformed XML.
The best approach is to address the problem before you assemble the XML.
If you generate the XML by (for example) unparsing a DOM, then the unparser will take care of the escaping for you.
If you are generating the XML by templating or string bashing, then you need to call something like StringEscapeUtils.escapeXml on the relevant text chunks ... before the XML tags get incorporated.
If you leave the problem until after the "XML" has been assembled, it cannot be properly fixed.

The best solution is to fix the program generating your text input. The easiest such fix would involve an escape utility like the other answers suggested. If that's not an option, I'd use a regular expression like
</?[a-zA-Z]+ */?>
to match the expected tags, and then split the string up into tags (which you want to pass through unchanged) and text between tags (against which you want to apply an escape method.)
I wouldn't count on an XML parser to be able to do it for you because what you're dealing with isn't valid XML. It is possible for the existing lack of escaping to produce ambiguities, so you might not be able to do a perfect job either.
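The split-and-escape idea can be sketched as follows, using the regular expression above unchanged. The class name is hypothetical, and the escape method is a hand-rolled stand-in for a library escaper. It assumes tags carry no attributes and that the text contains no entities that are already escaped, since those would be escaped a second time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TagPreservingEscaper {
    // Opening, closing, or self-closing tags made of letters only.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z]+ */?>");

    static String clean(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(input.substring(last, m.start()))); // text before the tag
            out.append(m.group());                                    // tag passes through as-is
            last = m.end();
        }
        out.append(escapeText(input.substring(last)));                // trailing text
        return out.toString();
    }

    // Escapes '&' first so the entities added afterwards are not re-escaped.
    static String escapeText(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Text that happens to look like a tag, such as `a <b> c` meant literally, still slips through unescaped, which is the ambiguity mentioned above.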

Check out Guava's XmlEscaper. It is in pre-release for version 11 but the code is available.

Apache Commons Lang contains a class named StringEscapeUtils which does exactly what you want! The method you'd want to use is escapeXml, I presume.
