Resolving invalid data in CSV file with Apache Commons - java

Using the Apache Commons CSV library to parse CSV data, I encounter the following error:
java.lang.IllegalStateException: IOException reading next record: java.io.IOException:
(line 46196) invalid char between encapsulated token and delimiter
I am using the following setup:
try {
    File csvInput = getLatestFilefromDir(CSV_PATH);
    reader = new FileReader(csvInput);
    final CSVFormat csvFormat = CSVFormat.Builder.create()
            .setHeader(HEADERS)
            .setDelimiter(';')
            .setQuote('"')
            .setEscape('\\')
            .setSkipHeaderRecord(true)
            .build();
    Iterable<CSVRecord> csvRecords = csvFormat.parse(reader);
    for (CSVRecord csvRecord : csvRecords) {
        // processing
    }
} catch (Exception e) {
    log.error("Error retrieving CSV data.");
    e.printStackTrace();
}
As the error suggests, the data has a defect. The invalid entry is:
"TABLE_NAME";"ATTRIBUTE";"VALUE"
"SWAP_LEG_TYPE";"SWAP_LEG_TYPE_DESC";"The payments (PAY or RECEIVE) of this \"Leg\" are based on the yield linked to a specific equity or an index. (or to the actual market price of the equity or the index ???)"
"CNTPTY_TYPE";"CNTPTY_TYPE_DESC";"With Local Government we mean the so called \Regional Governments or Local Authorities\\" (RGLA) as defined by the EBA (European Banking Authority).\""
Changing the data is out of my control. Assuming the backslash is used for escaping quotes, as in the other rows, in this entry it was used incorrectly and made it into the CSV file; presumably it should have been
...Authorities\ \" (RGLA)...
Is there a way to replace the string before parsing?
Or what can I do to extend the CSVFormat builder to accept such data?
I am thinking of a simple method that reads the whole input and just replaces the string \\ with \, as this is the only such instance in a million lines, but that seems wrong.

This is a slightly modified version of your original setup that should solve the issue; setQuote(null) does all the magic.
final CSVFormat csvFormat = CSVFormat.Builder.create()
        .setHeader(HEADERS)
        .setDelimiter(';')
        .setQuote(null)
        .setEscape('\\')
        .setSkipHeaderRecord(true)
        .build();
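If you would rather keep the quote character and fix the bad escape before parsing, as you suggested yourself, a minimal sketch could read the file into memory, collapse the doubled backslash, and feed the cleaned text to the same format. The file path, the blanket replacement of \\ with \, and the column index below are assumptions based only on the rows you showed:

import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class PreprocessThenParse {
    public static void main(String[] args) throws Exception {
        // Read the whole file, collapse the doubled backslash, and only then
        // hand the cleaned text to Commons CSV. "input.csv" is a placeholder path.
        String raw = new String(Files.readAllBytes(Paths.get("input.csv")), StandardCharsets.UTF_8);
        String fixed = raw.replace("\\\\", "\\"); // turns \\" back into \" so it parses as an escaped quote

        CSVFormat csvFormat = CSVFormat.Builder.create()
                .setDelimiter(';')
                .setQuote('"')
                .setEscape('\\')
                .build();

        try (Reader reader = new StringReader(fixed)) {
            for (CSVRecord record : csvFormat.parse(reader)) {
                System.out.println(record.get(2)); // the VALUE column in the sample data
            }
        }
    }
}

This only makes sense if the doubled backslash never appears legitimately elsewhere in the data, which is the assumption stated in the question.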

Related

Parsing a CSV file with multi-line fields using au.com.bytecode.opencsv.CSVReader

I want to parse a .csv file in Java. Most of the lines (rows) in the file follow a typical .csv convention, but there are some cases that cause trouble. I am not actually sure whether those cases are even allowed in a CSV-styled document. The biggest troublemaker is a multi-line cell with the text wrapped in quotes:
"text",12345,"text2"
"text",45678,"text2"
"text",23456,"text
accross multiple
lines"
A single cell in this case can be uniquely identified by the quotes, so I guess it could work, but I cannot make CSVReader parse it correctly. Any ideas on how to set it up so that it recognizes these multi-line fields as a single field?
This is how I parse the CSV file in Java:
try (FileInputStream fio = new FileInputStream(csvFile);
     InputStreamReader isr = new InputStreamReader(fio, StandardCharsets.UTF_8);
     CSVReader reader = new CSVReader(isr, ';', '"', true)) {
    String[] line;
    reader.readNext(); // skip header
    while ((line = reader.readNext()) != null) {
        updatedDataTable.add(processOneLine(line, csvFile));
    }
} catch (Exception e) {
    log.error("Error", e);
}
CSVReader has a small set of constructor parameters that define the parsing rules, but I cannot configure it in a way that tolerates the issue described above.
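For reference, quoted fields that span multiple lines are handled out of the box by Apache Commons CSV, the library used in the other questions on this page. A minimal sketch, assuming a comma-separated file named data.csv shaped like the sample above:

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class MultiLineFieldDemo {
    public static void main(String[] args) throws Exception {
        // "data.csv" is a placeholder; the sample rows above are comma separated
        // with double-quoted text fields, which matches CSVFormat.DEFAULT.
        try (Reader in = Files.newBufferedReader(Paths.get("data.csv"))) {
            for (CSVRecord record : CSVFormat.DEFAULT.parse(in)) {
                // The quoted third field keeps its embedded newlines as one value;
                // they are escaped here only so the output stays on one line.
                System.out.println(record.get(2).replace("\n", "\\n"));
            }
        }
    }
}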

Importing two CSV files into Java and then parsing them. The first one works, the second doesn't

I'm working on my code where I am importing two CSV files and then parsing them:
//Importing CSV File for betreuen
String filename = "betreuen_4.csv";
File file = new File(filename);
//Importing CSV File for lieferant
String filename1 = "lieferant.csv";
File file1 = new File(filename1);
I then proceed to parse them. For the first CSV file everything works fine. The code is:
try {
    Scanner inputStream = new Scanner(file);
    while (inputStream.hasNext()) {
        String data = inputStream.next();
        String[] values = data.split(",");
        int PInummer = Integer.parseInt(values[1]);
        String MNummer = values[0];
        String KundenID = values[2];
        //System.out.println(MNummer);
        //create the caring object with the required parameters
        //Caring caring = new Caring(MNummer, PInummer, KundenID);
        //betreuen.add(caring);
    }
    inputStream.close();
} catch (FileNotFoundException d) {
    d.printStackTrace();
}
I then proceed to parse the other CSV file; the code is:
// parsing csv file lieferant
try {
    Scanner inputStream1 = new Scanner(file1);
    while (inputStream1.hasNext()) {
        String data1 = inputStream1.next();
        String[] values1 = data1.split(",");
        int LIDnummer = Integer.parseInt(values1[0]);
        String citynames = values1[1];
        System.out.println(LIDnummer);
        String firmanames = values1[2];
        //create the suppliers object with the required parameters
        //Suppliers suppliers = new Suppliers(LIDnummer, citynames, firmanames);
        //lieferant.add(suppliers);
    }
    inputStream1.close();
} catch (FileNotFoundException d) {
    d.printStackTrace();
}
The first error I get is:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
at Verbindung.main(Verbindung.java:61)
So I look at my array, which is firmanames at line 61, and I think it's impossible that it's out of range, since my CSV file has three columns and index 2 (which I know is the third column in the CSV file) is my list of company names. I know the array is not empty, because when I wrote
`System.out.println(firmanames)`
it would print out the first three company names. So, in order to see if something else was causing the problem, I commented line 61 out and ran the code again. I get the following error:
`Exception in thread "main" java.lang.NumberFormatException: For input
string: "Ridge"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at Verbindung.main(Verbindung.java:58)`
I googled these errors, and they say I am trying to parse something into an Integer that cannot be an integer, but the only thing I am trying to parse into an Integer is:
int LIDnummer = Integer.parseInt(values1[0]);
which indeed is a column containing only integers.
My second column is indeed just a column of city names in the USA. The only thing about that column is that some town names contain spaces, like Middle brook, but I don't think that would cause problems for a String. Also, my company column has names like AT&T, but I would think the & symbol would not cause problems for a String either. I don't know where I am going wrong here.
I can't include the CSV file, but here is a picture of part of it. The length of each column is 1000.
A pic of the csv file
Scanner by default splits its input by whitespace (docs). Whitespace means spaces, tabs and newlines.
So your code will, I think, split the whole input file at every space and every newline, which is not what you want.
So, the first three elements your code will read are
5416499,Prairie
Ridge,NIKE
1765368,Edison,Cartier
I suggest using BufferedReader's readLine method and then calling split on each line.
The alternative is to explicitly tell Scanner how you want it to split the input
Scanner inputStream1 = new Scanner(file1).useDelimiter("\n");
but I think this is not the best use of Scanner when a simpler class (BufferedReader) will do.
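A minimal sketch of the BufferedReader approach, using the lieferant.csv file name and column order from the question (the commented-out Suppliers class is left out):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLieferant {
    public static void main(String[] args) {
        // Read line by line and split each line on commas, so spaces inside
        // city names such as "Middle brook" no longer break the record apart.
        try (BufferedReader br = new BufferedReader(new FileReader("lieferant.csv"))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] values1 = line.split(",");
                int LIDnummer = Integer.parseInt(values1[0]);
                String citynames = values1[1];
                String firmanames = values1[2];
                System.out.println(LIDnummer + " " + citynames + " " + firmanames);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}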
First of all, I would highly suggest you try and use an existing CSV parser, for example this one.
But if you really want to use your own, you are going to need to do some simple debugging. I don't know how large your file is, but the symptoms you describe lead me to believe that somewhere in the CSV there may be a missing comma or an accidental escape character. You need to find out which line it is. So run this code and check its output before it crashes:
int line = 1;
try {
    Scanner inputStream1 = new Scanner(file1);
    while (inputStream1.hasNext()) {
        String data1 = inputStream1.next();
        String[] values1 = data1.split(",");
        int LIDnummer = Integer.parseInt(values1[0]);
        String citynames = values1[1];
        System.out.println(LIDnummer);
        String firmanames = values1[2];
        line++;
    }
} catch (ArrayIndexOutOfBoundsException e) {
    System.err.println("The issue in the csv is at line: " + line);
}
Once you find what line it is, the answer should be obvious. If not, post a picture of that line and we'll see...

Apache Commons CSV Mapping not found

I am trying to read a CSV file with certain headers into a Java object using Apache Commons CSV. However, when I run the code, I get the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Mapping for Color not found, expected one of [Color, Name, Price, House Cost, Rent, 1 House, 2 Houses, 3 Houses, 4 Houses, Hotel, Mortgage]
at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:102)
at GameBoard.<init>(GameBoard.java:25)
at Game.main(Game.java:3)
Can someone explain where the exception is coming from? It appears to me that Apache Commons somehow is not matching my input to a column. Is there something wrong on my part or is something else broken? Here is my code snippet:
Reader in;
Iterable<CSVRecord> records = null;
try {
    in = new FileReader(new File(Objects.requireNonNull(getClass().getClassLoader().getResource("Properties.csv")).getFile()));
    records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
} catch (IOException | NullPointerException e) {
    e.printStackTrace();
    System.exit(1);
}
for (CSVRecord record : records) {
    spaces.add(new Property(
            record.get("Color"),
            record.get("Name"),
            Integer.parseInt(record.get("Price")),
And here are my csv headers (sorry, one was cut off but that's not the point):
Thanks!
I had the same problem, which only occurs if you reference the first column; all other column names work. The reason is that the UTF-8 representation of the file prepends the characters "0xEF,0xBB,0xBF", the byte order mark (see the Wikipedia page). This seems to be a known problem for commons-csv, but since it is application specific, it won't be fixed there (CSVFormat.EXCEL.parse should handle byte order marks).
However, there is a documented workaround for this:
http://commons.apache.org/proper/commons-csv/user-guide.html#Handling_Byte_Order_Marks
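The documented workaround amounts to wrapping the input stream so the BOM is stripped before Commons CSV reads the header. A minimal sketch using BOMInputStream from commons-io; the file name mirrors the question and is otherwise a placeholder:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import org.apache.commons.io.input.BOMInputStream;

public class BomSafeParse {
    public static void main(String[] args) throws Exception {
        // BOMInputStream silently skips a leading UTF-8 byte order mark,
        // so the first header is read as "Color" instead of "\uFEFFColor".
        try (Reader in = new InputStreamReader(
                new BOMInputStream(new FileInputStream("Properties.csv")),
                StandardCharsets.UTF_8)) {
            for (CSVRecord record : CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in)) {
                System.out.println(record.get("Color"));
            }
        }
    }
}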
I got the same weird exception. It actually said "Expecting one of ..." and then listed the field it said it could not find - just like in your case.
The reason was that I had set the wrong CSVFormat:
CSVFormat csvFormat = CSVFormat.newFormat(';');
This meant that my code was trying to separate fields on semi-colons in a file that actually had comma separators.
Once I used the DEFAULT CSVFormat, everything started to work.
CSVFormat csvFormat = CSVFormat.DEFAULT;
So the answer is that probably you must set CSVFormat correctly for your file.
Moving from Spring Boot version 2.4.5 to 2.6.7 brought about this error. I had to convert each csvRecord to a map before assigning it to my POJO, as follows:
for (CSVRecord csvRecord : csvRecords) {
    Map<String, String> csvMap = csvRecord.toMap();
    Model newModel = new Model();
    newModel.setSomething(csvMap.get("your_item"));
}
I also got the same exception by giving the header a different name in the CSV file, like xyz, and then trying to get the value by calling csvRecord.get("x_z"). I resolved my problem by changing the header name.
BufferedReader fileReader = null;
CSVParser csvParser = null;
try {
    fileReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    csvParser = new CSVParser(fileReader,
            CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim());
    Iterable<CSVRecord> csvRecords = csvParser.getRecords();
    for (CSVRecord csvRecord : csvRecords) {
        // process each record here
    }
} catch (Exception e) {
    System.out.println("Reading CSV Error!");
    e.printStackTrace();
} finally {
    try {
        fileReader.close();
        csvParser.close();
    } catch (IOException e) {
        System.out.println("Closing fileReader/csvParser Error!");
        e.printStackTrace();
    }
}

How can I parse a CSV file (Excel-style, not separated by commas) in Java?

I have CSV files (Excel) which have data in them, and I need to parse the data using Java.
The data in those files is not separated by commas; the CSV files have a number of columns and a number of rows (each cell has data) where all the data is written.
I need to go through all the files until I reach the EOF (end of file) of each file and parse the data.
The files also contain empty rows, so an empty row is not a criterion to stop parsing; I think only EOF will indicate that I've reached the end of a specific file.
Many thanks.
You can use opencsv to parse the Excel CSV. I've used this myself; all you need to do is split on the ';'. Empty cells will be parsed as well.
You can find info here : http://opencsv.sourceforge.net/
And to parse the excelCSV you can do:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), ';');
Aside from other suggestions, I would offer Jackson CSV module. Jackson has very powerful data-binding functionality, and CSV module allows reading/writing as CSV as an alternative to JSON (or XML, YAML, and other supported formats). So you can also do conversions between other data formats, in addition to powerful CSV-to/from-POJO binding.
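A minimal sketch of the Jackson approach, assuming the jackson-dataformat-csv module is on the classpath, the first row is a header, and ';' is the column separator:

import java.io.File;
import java.util.Map;

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class JacksonCsvDemo {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // Treat the first row as the header and split columns on ';'.
        CsvSchema schema = CsvSchema.emptySchema()
                .withHeader()
                .withColumnSeparator(';');

        try (MappingIterator<Map<String, String>> it = mapper
                .readerFor(Map.class)
                .with(schema)
                .readValues(new File("yourfile.csv"))) {
            while (it.hasNext()) {
                Map<String, String> row = it.next(); // header name -> cell value
                System.out.println(row);
            }
        }
    }
}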
Please use a stream object to read the CSV file:
FileInputStream fis = new FileInputStream("FileName.CSV");
BufferedInputStream bis = new BufferedInputStream(fis);
InputStreamReader isr = new InputStreamReader(bis);
Read the InputStream and store the file contents in a String object.
Then use StringTokenizer with ',' (comma) as the delimiter; you will get the tokens.
Manipulate the tokens to get the values.
String str = "This is String , split by StringTokenizer, created by mkyong";
StringTokenizer st = new StringTokenizer(str);

System.out.println("---- Split by space ------");
while (st.hasMoreElements()) {
    System.out.println(st.nextElement());
}

System.out.println("---- Split by comma ',' ------");
StringTokenizer st2 = new StringTokenizer(str, ",");
while (st2.hasMoreElements()) {
    System.out.println(st2.nextElement());
}
Thanks,
Pavan
Suppose you have the CSV file content in the form of a string:
String fileContent;
Generally, the CSV file content is parsed into a List<List<String>>.
final List<String> rows = new ArrayList<String>(Lists.newArrayList(fileContent.split("[\\r\\n]+")));
This splits the file content into a list of rows. Then use OpenCSV's CSVParser to parse each comma-separated line into a List<String>:
final CSVParser parser = new CSVParser();
final List<List<String>> csvDetails = new ArrayList<List<String>>();
rows.forEach(t -> {
    try {
        csvDetails.add(Lists.newArrayList(parser.parseLine(t)));
    } catch (Exception e) {
        throw new RuntimeException("Exception occurred while parsing the data");
    }
});

Creating an Android app database with a big amount of data

The database of my application needs to be filled with a lot of data, so during onCreate() there are not only a few CREATE TABLE SQL instructions, but also a lot of inserts. The solution I chose is to store all these instructions in a SQL file located in res/raw, which is loaded with Resources.openRawResource(id).
It works well, but I am facing an encoding issue: I have some accented characters in the SQL file which appear incorrectly in my application. This is my code:
public String getFileContent(Resources resources, int rawId) throws IOException {
    InputStream is = resources.openRawResource(rawId);
    int size = is.available();
    // Read the entire asset into a local byte buffer.
    byte[] buffer = new byte[size];
    is.read(buffer);
    is.close();
    // Convert the buffer into a string.
    return new String(buffer);
}

public void onCreate(SQLiteDatabase db) {
    try {
        // get file content
        String sqlCode = getFileContent(mCtx.getResources(), R.raw.db_create);
        // execute code
        for (String sqlStatements : sqlCode.split(";")) {
            db.execSQL(sqlStatements);
        }
        Log.v("Creating database done.");
    } catch (IOException e) {
        // Should never happen!
        Log.e("Error reading sql file " + e.getMessage(), e);
        throw new RuntimeException(e);
    } catch (SQLException e) {
        Log.e("Error executing sql code " + e.getMessage(), e);
        throw new RuntimeException(e);
    }
}
The solution I found to avoid this is to load the SQL instructions from a huge static final String instead of a file, and then all accented characters appear correctly.
But isn't there a more elegant way to load SQL instructions than a big static final String attribute containing all of them?
I think your problem is in this line:
return new String(buffer);
You're converting the array of bytes in to a java.lang.String but you're not telling Java/Android the encoding to use. So the bytes for your accented characters aren't being converted correctly as the wrong encoding is being used.
If you use the String(byte[],<encoding>) constructor you can specify the encoding your file has and your characters will be converted correctly.
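A minimal sketch of the question's read method with the charset made explicit; this assumes the res/raw SQL file is saved as UTF-8:

import java.io.IOException;
import java.io.InputStream;

import android.content.res.Resources;

public class SqlFileLoader {

    // Same logic as the question's getFileContent(), but with an explicit
    // charset so accented characters survive the byte-to-String conversion.
    public static String getFileContent(Resources resources, int rawId) throws IOException {
        InputStream is = resources.openRawResource(rawId);
        try {
            byte[] buffer = new byte[is.available()];
            is.read(buffer);
            return new String(buffer, "UTF-8");
        } finally {
            is.close();
        }
    }
}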
The SQL file solution seems perfect; you just need to make sure the file is saved with UTF-8 encoding, otherwise all the accented characters will be lost. If you don't want to change the file's encoding, then you need to pass an extra argument to new String(bytes, charset) defining the file's encoding.
Do prefer file resources over a static final String to avoid having all those unnecessary bytes loaded into memory. On mobile phones you want to save all the memory you can!
I am using a different approach:
Instead of executing loads of SQL statements (which would take a long time to complete), I build my SQLite database on the desktop, put it in the assets folder, create an empty SQLite database in Android, and copy the database from the assets folder into the database folder. This gives a huge increase in speed. Note that you need to create an empty database in Android first; only then can you copy over and overwrite it, otherwise Android will not allow you to write a database into the database folder. There are several examples on the internet.
BTW, this approach seems to work best if the database file has no extension.
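A minimal sketch of the copy step described above, assuming the empty database has already been created by Android; the database file name is a placeholder:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import android.content.Context;

public class DatabaseCopier {

    // Copies a prebuilt SQLite file from assets/ over the (empty) database
    // that Android has already created at the standard database path.
    public static void copyFromAssets(Context context, String dbName) throws IOException {
        InputStream in = context.getAssets().open(dbName);
        OutputStream out = new FileOutputStream(context.getDatabasePath(dbName));
        try {
            byte[] buffer = new byte[4096];
            int length;
            while ((length = in.read(buffer)) > 0) {
                out.write(buffer, 0, length);
            }
            out.flush();
        } finally {
            in.close();
            out.close();
        }
    }
}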
It looks like you are passing all your sql statements in one string. That's a problem because execSQL expects "a single statement that is not a query" (see documentation [here][1]). Following is a somewhat-ugly-but-working solution.
I have all my sql statements in a file like this:
INSERT INTO table1 VALUES (1, 2, 3);
INSERT INTO table1 VALUES (4, 5, 6);
INSERT INTO table1 VALUES (7, 8, 9);
Notice the blank line between statements (a semicolon followed by two newlines).
Then, I do this:
String text = new String(buffer, "UTF-8");
for (String command : text.split(";\n\n")) {
    try {
        command = command.trim();
        //Log.d(TAG, "command: " + command);
        if (command.length() > 0)
            db.execSQL(command.trim());
    } catch (Exception e) {
        // do whatever you need here
    }
}
My data columns contain blobs of text with new lines AND semicolons, so I had to find a different command-separator. Just be sure to get creative with the split str: use something you know doesn't exist in your data.
HTH
Gerardo
[1]: http://developer.android.com/reference/android/database/sqlite/SQLiteDatabase.html#execSQL(java.lang.String, java.lang.Object[])
