How to properly parse CSV file to 2d Array? - java

I'm trying to parse a csv file into a 2d array, where each row is a data entry and each column is a field in that entry.
Doing this all at once simplifies and separates my processing code from my parsing code.
I tried to write a simple parser that used String.split to separate the file by commas. As I have discovered, this is a horrible approach: it fails on special cases such as double quotes, line feeds, and other special characters.
What is the proper way to parse a CSV file into a 2d array as I have described?
Code samples in Java would be appreciated.
The array can be a dynamic list object, a vector, or something similar; it just has to be indexable with two indices.

Have a look at Commons CSV?
CSVParser parser = new CSVParser(new FileReader(file));
String[] line;
while ((line = parser.getLine()) != null) {
// process
}

If your file has fields with double-quoted entries that contain separators, or fields with line feeds, then I doubt that it is a real CSV file. A proper CSV file looks something like this:
1;John;Doe;engineer,manager
2;Bart;Foo;engineer,dilbert
while this is "something else":
1;John;Doe;"engineer;manager"
2;Bart;Foo;
"engineer,dilbert"
And the first example is parseable with String.split on each line.
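For input like the second example, where a quoted field contains the separator or a line break, a hand-written parser has to track whether it is currently inside quotes. The following is a minimal sketch of that idea, not a replacement for a tested library. The class name SimpleCsv and the method parse are made up for illustration, and the quoting rules are assumed to be those of RFC 4180 (double quotes around a field, "" for a literal quote). Blank lines between records are skipped.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleCsv {

    // Parses CSV text into rows of fields. A quoted field may contain the
    // separator, line breaks, and "" (which stands for one literal quote).
    static List<List<String>> parse(String text, char separator) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean pending = false; // true once the current row has any content
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean nextIsQuote = i + 1 < text.length() && text.charAt(i + 1) == '"';
            if (inQuotes) {
                if (ch != '"') {
                    field.append(ch);          // separators and newlines are data here
                } else if (nextIsQuote) {
                    field.append('"');         // "" inside quotes is an escaped quote
                    i++;
                } else {
                    inQuotes = false;          // closing quote
                }
            } else if (ch == '"') {
                inQuotes = true;
                pending = true;
            } else if (ch == separator) {
                row.add(field.toString());
                field.setLength(0);
                pending = true;
            } else if (ch == '\n' || ch == '\r') {
                if (ch == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;                       // treat \r\n as one line break
                }
                if (pending) {                 // end of a non-empty record
                    row.add(field.toString());
                    rows.add(row);
                    row = new ArrayList<>();
                    field.setLength(0);
                    pending = false;
                }
            } else {
                field.append(ch);
                pending = true;
            }
        }
        if (pending) {                         // last record without trailing newline
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "1;John;\"engineer;manager\"\n2;Bart;\"line one\nline two\"\n";
        List<List<String>> rows = parse(text, ';');
        System.out.println(rows.get(0).get(2)); // row 0, column 2
    }
}
```

The result is indexable with two indices through rows.get(r).get(c). Rows are allowed to have different lengths, so the caller has to check the column count if a rectangular table is required.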

Related

Read CSV with semicolons into a String on Java

My main problem is that I'm trying to read a CSV delimited by ; in Java and the problem comes when I try to read a field of the CSV that contains a ;. For example:
"I want you to do that;"
In this case the field is recognized as
"I want you to do that"
And it creates another field that is just an empty string.
I use a BufferedReader to read the CSV and the split method to separate it with the ;. I'm not allowed to use libraries like OpenCSV so I want to find a solution with the method I'm using.
Parse according to the quotation marks
If the data that incidentally contains the delimiter is wrapped in double quotes (QUOTATION MARK), then you should have no problem with parsing. Your parsing should look first for pairs of double quote characters. After that, look for delimiters outside of those pairs.
Rather than writing the parsing code yourself, I highly recommend using a CSV library. In the Java ecosystem, you have a wealth of good products to choose from. For example, I have made successful use of Apache Commons CSV.
See also the specification for CSV: RFC 4180.
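If libraries are not an option and you want to stay with split, one way to sketch the "delimiters outside of quote pairs" rule is a regular expression with a lookahead: a ; counts as a separator only when an even number of double quotes follows it on the line. The class and method names below are invented for illustration. The sketch assumes each record sits on a single line and that quotes are balanced; it rescans the rest of the line for every ;, so it is not meant for very long lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class QuoteAwareSplit {

    // Matches ';' only if the remainder of the line holds an even number of
    // double quotes, which means the ';' is not inside a quoted field.
    private static final Pattern SEPARATOR =
            Pattern.compile(";(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty fields, like split(";", -1).
        for (String raw : SEPARATOR.split(line, -1)) {
            String field = raw;
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                // Drop the surrounding quotes and turn "" back into ".
                field = field.substring(1, field.length() - 1).replace("\"\"", "\"");
            }
            fields.add(field);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("1;\"I want you to do that;\";done"));
    }
}
```

Each line returned by BufferedReader.readLine() can be passed to split as is. A field that spans several lines would need the lines to be joined first.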

Java - CSVReader split correctly with comma inside the values

I am reading a csv file in java using CSVReader. It is "working" but I've found a minor problem when I try to split the line into an array of strings like this:
reader = new CSVReader(new FileReader(csvFile));
line = reader.readNext();
String[] lineDetail = line[0].split(";", -1);
Here is my problem: the line below works correctly:
[ABEL MESQUITA JR.;178957;1;2015;RR;DEM;55;1;MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE PARLAMENTAR;0;;WM PAPELARIA E ESCRITÓRIO;12132854000100;3592;0;2017-04-26 00:00:00;296;0;296;4;2017;0;;;1377952;5828;0;3074;6266962]
But the line below when I try to read using CSVReader, results in 3 arrays of strings:
[ABEL MESQUITA JR.;178957;1;2015;RR;DEM;55;3;COMBUSTÍVEIS E LUBRIFICANTES.;1;Veículos Automotores;B.B. PETROLEO LTDA;03625917000170;4339;0;2017-01-31 00:00:00;4007, 06;0;4007, 06;1;2017;0;;;1354058;5711;0;3074;6196889]
The arrays look like this:
ABEL MESQUITA JR.;178957;1;2015;RR;DEM;55;3;COMBUSTÍVEIS E LUBRIFICANTES.;1;Veículos Automotores;B.B. PETROLEO LTDA;03625917000170;4339;0;2017-01-31 00:00:00;4007
06;0;4007
06;1;2017;0;;;1354058;5711;0;3074;6196889
I think the problem is because of this value: 4007, 06 since in the first line the value is an integer 296.
Does anyone know how to make the CSVReader return only one array, instead of 3?
Thanks in advance!!
EDIT 1
The result that I need is the second and third array concatenated with the first. So I would have the 4007,06 together instead of separated.
Your line of CSV data is being split on commas, because the comma is the default CSV field separator.
To avoid splitting on commas, you could initialize your CSV reader, before reading the file, to use some unused character (for example, Tab) as the field separator. This way you will get one field per row. But that approach seems pointless, because there is an easier way:
Why don't you configure the field separator to be ; so that the reader splits your CSV directly into semicolon-separated fields and you don't need any additional splitting? This will also solve your problem with commas.
Your line[0].split(";", -1) is basically a bug, because it cannot split this valid CSV into 2 values:
Value 1; "Value;2"
With your approach you will get 3 values instead.
For further advice on CSVReader, please add which one you are using (by stating its package).
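The miscount is easy to reproduce with the standard library alone. The snippet below is only a demonstration of the failure, with a made-up class name; it does not use CSVReader at all.

```java
final class NaiveSplitDemo {

    // Counts the pieces produced by splitting on every ';', quoted or not.
    static int naiveFieldCount(String line) {
        return line.split(";", -1).length;
    }

    public static void main(String[] args) {
        String line = "Value 1; \"Value;2\"";
        // The ';' inside the quotes is treated as a separator as well.
        System.out.println(naiveFieldCount(line)); // prints 3
    }
}
```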

Skip empty lines while CSV parsing

I'm currently pulling a CSV file from a URL and modifying its entries. I'm using a StreamReader to read each line of the CSV and split it into an array, where I can modify each entry based on its position.
The CSV is generated by an e-form provider, and one particular form entry is a multi-line field where a user can add multiple notes. However, when users enter a new note, they separate it from the previous one with a line break.
CSV Example:
"FName","LName","Email","Note 1: some text
Note 2: some text"
Since my code is splitting each CSV entry by line, once it reaches these notes, it believes it to be a new CSV entry. This is causing my code that modifies the entries to not work since the element positions become incorrect. (CSV entries with empty or single line note fields work fine)
Any ideas on the best approach to take for this? I've tried adding code to replace carriage returns or to skip empty lines but it doesn't seem to help.
You can check whether the first column value in a row is null. If it is, skip that row and continue with the next line.
Assuming the CSV example you have provided is supposed to be just one entry in the CSV file (with the last field spanning over several different lines due to newline breaks), you could try something like this, using 2 loops.
1. Keep a variable currentRecord (of String[] type) for the current CSV record and a recordList (a List or an array) to hold all the CSV records.
2. Read a line of the CSV file.
3. Split it into an array of strings using the comma as the delimiter. Keep this array in a temporary variable.
4. If the size of this array is 1, append this string to the last element (the 4th) in currentRecord (if currentRecord is not null).
5. Keep reading lines off the CSV file and repeating step 4 until the array size is 4.
6. If the size is 4, this indicates that the record is the next record in the CSV file, and you can add the currentRecord to recordList.
7. Keep repeating steps 2 to 6 until you reach the end of the CSV file.
It would be better if you can remove the line breaks in the field and clean the CSV file before parsing it though. It'll make things much simpler.
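A variation on these steps, sketched here in Java, is to decide where a record ends by counting double quotes rather than by counting fields: as long as the lines read so far contain an odd number of quotes, a quoted field is still open and the next line belongs to the same record. This is a swapped-in technique, not the exact procedure listed above, and the class and method names are hypothetical. It assumes quotes inside a field are escaped by doubling them, and it only groups lines; splitting each record into fields is a separate step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LogicalRecords {

    // Groups physical lines into logical CSV records. A record is complete
    // once it contains an even number of double quotes.
    static List<String> join(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean open = false; // true while a quoted field is still unterminated
        for (String line : lines) {
            if (open) {
                current.append('\n'); // restore the line break inside the field
            }
            current.append(line);
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    open = !open;
                }
            }
            if (!open) {
                if (current.length() > 0) { // skip blank lines between records
                    records.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() > 0) { // input ended inside a quoted field
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"FName\",\"LName\",\"Email\",\"Note 1: some text",
                "Note 2: some text\"",
                "",
                "\"A\",\"B\",\"C\",\"\"");
        System.out.println(join(lines).size());
    }
}
```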
Use a proper CSV library to handle the writing and parsing. There are a few edge cases to handle here, not only the new line. Users could also insert commas or quotes in their notes, and handling all of this yourself becomes very messy.
Try uniVocity-parsers as it can handle all sorts of situations when parsing and writing CSV.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Making a simple CSV file in Java

I would like to store some Strings in a file and then read them back again. The problem is that the Strings could be anything; for instance, a single field could even be something like "Entry1","Entry2". So if I simply look for commas and split the Strings accordingly, it will definitely fail.
Is there any built-in Java class that handles situations like that? If not how can I make a simple CSV parser in Java?
You might want to have a look at this specification for CSV. Bear in mind that there is no officially recognized specification. You could also try this parser.
There is an Apache Commons library for CSV too that can help.
If you do not know the delimiter, it will not be possible to do this, so you have to find it out somehow. If the delimiter can vary, your only hope is to deduce it from the formatting of the known data. When Excel imports CSV files, it lets the user choose the delimiter, and this is a solution you could use as well.
I would recommend openCSV: http://opencsv.sourceforge.net/
I have used it for numerous Java projects requiring CSV support, both reading and writing. A simple example of how it writes a CSV from the docs:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), ',');
// feed in your array (or convert your data to an array)
String[] entries = "first,second,third".split(",");
writer.writeNext(entries);
writer.close();
Assuming you can make a String[] out of your data it's that simple.
To deal with commas in your entries you'd need to quote the entire entry:
"Make,Take,Break","Top,Right,Left,Bottom"
With OpenCSV you can provide a quote character; you just pass it in the constructor:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), ',', '"');
That should take care of the needs you listed.
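If you would rather not depend on a library for writing, the quoting rule itself is small. The sketch below uses invented names (CsvEscape, escape, toLine) and assumes RFC 4180 conventions: a field is wrapped in double quotes when it contains the separator, a quote, or a line break, and quotes inside it are doubled. It is not how OpenCSV is implemented.

```java
import java.util.StringJoiner;

final class CsvEscape {

    // Quotes a field only when needed and doubles any embedded quotes.
    static String escape(String field, char separator) {
        boolean needsQuotes = field.indexOf(separator) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV line from the given fields.
    static String toLine(String[] fields, char separator) {
        StringJoiner joiner = new StringJoiner(String.valueOf(separator));
        for (String field : fields) {
            joiner.add(escape(field, separator));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String[] entries = {"Make,Take,Break", "plain", "say \"hi\""};
        System.out.println(toLine(entries, ','));
    }
}
```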

openCSV library removing quotes from variable

I am attempting to parse a csv file with opencsv. The last column in the file has a parameter which is actually json. This json is enclosed in "". The problem is opencsv is removing some " from inside the json causing my code to break.
CSVReader reader = new CSVReader(new FileReader("c:\\Json.csv"), ',');
String[] nextLine = reader.readNext();
String json = nextLine[6]; // the JSON column, which comes back with quotes missing
Has anyone seen this before?
example of json as found in the csv
"{"type":"Polygon","coordinates":[[[-66.9,18.05],[-66.9,18.05]]]}"
Technically, your JSON is not valid CSV format unless you escape the quotation marks you want to keep. Double quote characters can be escaped with an additional double quote character (as mentioned in the RFC) or with a backslash (as expected by the default parameters in CSVReader).
So in your example, the CSV content should be:
"{""type"":""Polygon"",""coordinates"":[[[-66.9,18.05],[-66.9,18.05]]]}"
or
"{\"type\":\"Polygon\",\"coordinates\":[[[-66.9,18.05],[-66.9,18.05]]]}"
Either form lets the value be parsed with all the internal quotation marks intact (but without the surrounding quotation marks). CSVReader will read both as:
{"type":"Polygon","coordinates":[[[-66.9,18.05],[-66.9,18.05]]]}
Also note that you can tell CSVReader to use different characters for quotes and escaping, but you should probably stick with the default since they are more universally accepted.
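To make the first (RFC style) form concrete, here is a small sketch of what decoding a single quoted field amounts to. The class and method names are hypothetical, and this is not opencsv's code; it covers only the doubled-quote convention, not backslash escaping.

```java
final class CsvUnquote {

    // Removes the surrounding quotes from a quoted field and turns each ""
    // back into a single ". Unquoted fields are returned unchanged.
    static String unquote(String field) {
        boolean quoted = field.length() >= 2
                && field.charAt(0) == '"'
                && field.charAt(field.length() - 1) == '"';
        if (!quoted) {
            return field;
        }
        return field.substring(1, field.length() - 1).replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        String raw = "\"{\"\"type\"\":\"\"Polygon\"\"}\"";
        System.out.println(unquote(raw));
    }
}
```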
