Using Same Escape and Quote Character Breaks CSV

I have a simple CSV file like this:
SellerProductID;ProductTextLong
1000;"a ""good"" Product"
This is my attempt to read it with Apache Commons CSV:
try (Reader reader = new StringReader(content)) {
CSVFormat format = CSVFormat.DEFAULT.withDelimiter(';').withHeader().withEscape('"').withQuote('"');
CSVParser records = format.parse(reader);
System.out.println(records.iterator().next());
}
That fails with the following exception:
Exception in thread "main" java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 2) EOF reached before encapsulated token finished
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.next(CSVParser.java:171)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.next(CSVParser.java:137)
Caused by: java.io.IOException: (startline 2) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142)
... 3 more
Other CSV tools (e.g. Google Sheets) can load the CSV just fine.
It works if I use a different quote or escape character, but sadly the customer's CSV format is fixed.
How do I configure Apache CSV to allow the same escape and quote character? Or is there any way to modify a stream to replace the quote characters on the fly (the files are gigantic)?

The entire problem is that " is not the "escape character".
From Wikipedia:
Embedded double quote characters may then be represented by a pair of consecutive double quotes, or by prefixing a double quote with an escape character such as a backslash.
So in this case, "" is just two quote characters next to each other, while the escape character is a different character used to escape quotes, line breaks, or separators.
This fixes it (note that withEscape() is given a different character; the example data doesn't show what the escape character actually is):
try (Reader reader = new StringReader(content)) {
CSVFormat format = CSVFormat.DEFAULT.withDelimiter(';').withHeader().withEscape('/').withQuote('"');
CSVParser records = format.parse(reader);
System.out.println(records.iterator().next());
}
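To illustrate the distinction, here is a minimal sketch of how a doubled quote is read inside a quoted field without any escape character. It uses only the standard library; the class DoubledQuoteDemo and the method splitLine are hypothetical names for illustration and are not part of Apache Commons CSV, and the sketch ignores line breaks inside fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache Commons CSV.
class DoubledQuoteDemo {

    // Splits one CSV line into fields. Inside a quoted field, two consecutive
    // quote characters stand for one literal quote; no escape character is used.
    static List<String> splitLine(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        field.append(quote); // doubled quote: keep one literal quote
                        i++;                 // and skip the second one
                    } else {
                        inQuotes = false;    // single quote: end of the quoted part
                    }
                } else {
                    field.append(c);
                }
            } else if (c == quote) {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The data row from the question.
        System.out.println(splitLine("1000;\"a \"\"good\"\" Product\"", ';', '"'));
        // prints [1000, a "good" Product]
    }
}
```

The quote character alone is enough to represent an embedded quote here, which is why declaring it as the escape character as well is unnecessary.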

I have looked over your issue, and this article and this post might help you. Also try adding .withNullString("") to the format.

Related

CSV parsing with univocity-parsers and backslash-escaped quotes

I'm having some trouble parsing CSV with backslash-escaped quotes \". Most lines in the source CSV don't include escaped quotes, but for those that do I can't seem to find the appropriate settings for correct parsing.
CSV example (each line with 4 columns):
1,,No quote escape,test
2,,"One quote escape\"",test
3,,"Two \"quote escapes\"",test
4,,"Two \"quote escapes\" 2",test
CSV parser settings:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Code snippet:
CsvParserSettings settings = new CsvParserSettings();
settings.setDelimiterDetectionEnabled(true);
settings.setLineSeparatorDetectionEnabled(true);
settings.getFormat().setQuote('"');
settings.getFormat().setQuoteEscape('\\');
CsvParser parser = new CsvParser(settings);
parser.beginParsing(file, StandardCharsets.UTF_8);
...
Lines are parsed correctly until two escaped quotes are present in one line. Expected parsed lines are:
- 1,null,No quote escape,test
- 2,null,One quote escape",test
- 3,null,Two "quote escapes",test
- 4,null,Two "quote escapes" 2,test
Upon further inspection I found an existing issue for v2.9.1.

reading an ISO-8859-1 encoded data from CSV

I have a CSV file that I need to read and analyse. I use the methods and classes from Apache Commons CSV.
The input file uses the regular low ASCII (0x0 - 0x7f) characters. Some of the fields also include line breaks.
However, in addition, some of the fields may contain characters 0xe4 and 0xe5 which need to be converted to '{' and '}' respectively. I have looked at the input file with a hex view so I am certain that it is really 0xe4 and 0xe5, and not some Unicode.
FileReader in = new FileReader(INPUT_CSV);
System.out.println(in.getEncoding());
records = CSVFormat.RFC4180.withFirstRecordAsHeader().withDelimiter('|').withQuote('#').parse(in);
The getEncoding() method says that the file is UTF-8 encoded, and I suspect this is where it goes wrong.
Then I read the records in a loop:
for (CSVRecord record : records) {
// some analysis in here
String toProcess = record.get("TO_PROCESS"); // this is the field which may contain the 0xe4 and 0xe5
toProcess = StringUtils.replaceChars(toProcess, OPENING_BRACKET,'{');
toProcess = StringUtils.replaceChars(toProcess, CLOSING_BRACKET,'}');
}
Yet, this replacement does not work, and the output strings have a three character sequence 0xef 0xbf 0xbd instead of the brackets I was hoping to see.
Is it possible to force the ISO-8859-1 on the input? Or while reading the strings from the input file?
p.s.
Opening and closing brackets are defined as
static char OPENING_BRACKET = 228; // 'ä'
static char CLOSING_BRACKET = 229; // 'å'
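One way to force the encoding, sketched here with the standard library only, is to build the Reader with an explicit charset instead of relying on FileReader's default. The class Latin1ReaderDemo and the method decode are hypothetical names, and the byte array stands in for the file contents described in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the byte array simulates the input file.
class Latin1ReaderDemo {

    // Decodes raw bytes through a Reader that uses an explicit charset.
    static String decode(byte[] raw, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {'a', (byte) 0xE4, 'b', (byte) 0xE5};

        // Read as ISO-8859-1, the bytes 0xE4 and 0xE5 become U+00E4 and U+00E5,
        // so the replacement from the question can find them.
        String text = decode(raw, StandardCharsets.ISO_8859_1);
        System.out.println(text.replace('\u00E4', '{').replace('\u00E5', '}')); // prints a{b}
    }
}
```

For a real file, the same idea would be new InputStreamReader(new FileInputStream(INPUT_CSV), StandardCharsets.ISO_8859_1), passed to parse() in place of the FileReader.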

Java: OpenCSV escape character in fields

I have my input file with quoted fields. Below is how I am initializing the CSV reader
CSVParser parser = new CSVParserBuilder().withSeparator(CSVParser.DEFAULT_SEPARATOR).build();
CSVReader reader = new CSVReaderBuilder(new FileReader("abc.txt")).withCSVParser(parser).build();
With the following input, it reads properly.
"1","abc","this works properly with ""quotes"" as well"
With the following input, it fails
"1","abc","this fails with \""backslash\"" and ""quotes"". "
I know that in Java the backslash is an escape character. Is there a workaround to read the above line properly? Unfortunately, I can't change the input format, as it's generated by our client's legacy system.

How to skip invalid double quote character line in csv file using java?

I have a CSV file containing 78400 lines (25 MB).
When I read the CSV file line by line, one column in the 2nd line has an error.
It contains a backslash character.
When I read this column, all the remaining columns in the CSV file are read as a single column.
"CDE","456","6346","testdata2","MyData2","ClassB"
"ABC","123","4567\","testdata","MyData","ClassA"
"CDE","456","6346","testdata2","MyData2","ClassB"
How can I skip that line by using the line separator in Java?
You can write a method that splits the line into fields and then checks each field for a \ character:
String line = br.readLine();
String[] words = line.split(",");
for (String word : words) {
    boolean escape = word.indexOf('\\') >= 0; // true if this field contains a backslash
}
Once you have identified the backslash, you can handle that line specially.
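A minimal sketch of skipping such lines, using only the standard library. It assumes that no field contains a line break and that a backslash directly before a quote always marks a broken line; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class for illustration.
class SkipBrokenLinesDemo {

    // Keeps only the lines that do not contain a backslash directly before a quote.
    static List<String> skipBrokenLines(BufferedReader br) {
        return br.lines()
                 .filter(line -> !line.contains("\\\""))
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The three sample lines from the question; the second one is broken.
        String content = String.join("\n",
            "\"CDE\",\"456\",\"6346\",\"testdata2\",\"MyData2\",\"ClassB\"",
            "\"ABC\",\"123\",\"4567\\\",\"testdata\",\"MyData\",\"ClassA\"",
            "\"CDE\",\"456\",\"6346\",\"testdata2\",\"MyData2\",\"ClassB\"");

        List<String> kept = skipBrokenLines(new BufferedReader(new StringReader(content)));
        System.out.println(kept.size()); // prints 2
    }
}
```

The surviving lines can then be handed to whichever CSV parser is in use.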
If you are using openCSV then just define your parser with an escape character other than backslash. If you don't want an escape character you can use the ICSVParser.NULL_CHARACTER or if you are using the 3.9 version of openCSV you can use the RFC4180Parser.
RFC4180ParserBuilder rfc4180ParserBuilder = new RFC4180ParserBuilder();
ICSVParser rfc4180Parser = rfc4180ParserBuilder.build();
CSVReaderBuilder builder = new CSVReaderBuilder(sr);
CSVReader reader = builder.withCSVParser(rfc4180Parser).build();

Invalid char between encapsulated token and delimiter in Apache Commons CSV library

I am getting the following error while parsing the CSV file using the Apache Commons CSV library.
Exception in thread "main" java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450)
at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:327)
at parse.csv.file.CSVFileParser.main(CSVFileParser.java:29)
What's the meaning of this error ?
We ran into this issue when we had embedded quotes in our data.
0,"020"1,"BS:5252525 ORDER:99999"4
Solution applied was CSVFormat csvFileFormat = CSVFormat.DEFAULT.withQuote(null);
@Cuga's tip helped us resolve this. Thanks, @Cuga.
Full code is
public static void main(String[] args) throws IOException {
FileReader fileReader = null;
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withQuote(null);
String fileName = "test.csv";
fileReader = new FileReader(fileName);
CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();
for (CSVRecord csvRecord : csvRecords) {
System.out.println(csvRecord);
}
csvFileParser.close();
}
Result is
CSVRecord [comment=null, mapping=null, recordNumber=1, values=[0, "020"1, "BS:5252525 ORDER:99999"4]]
That line in the CSV file contains an invalid character between one of your cells and either the end of line, end of file, or the next cell. A very common cause for this is a failure to escape your encapsulating character (the character that is used to "wrap" each cell, so the CSV parser knows where a cell (token) starts and ends).
I found the solution to the problem.
One of my CSV file has an attribute as follows:
"attribute with nested "quote" "
Because of the nested quotes in the attribute, the parser fails.
To avoid this problem, escape each nested quote by doubling it:
"attribute with nested ""quote"" "
This is one way to solve the problem.
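If you control the code that writes the file, the doubling can be done when each field is written. A minimal sketch, assuming the double quote is the encapsulating character; the class QuoteFieldDemo and the method quoteField are hypothetical names.

```java
// Hypothetical helper for illustration; assumes '"' is the encapsulating character.
class QuoteFieldDemo {

    // Wraps a value in quotes and doubles every quote embedded in it.
    static String quoteField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quoteField("attribute with nested \"quote\" "));
        // prints "attribute with nested ""quote"" "
    }
}
```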
We ran into this same error with data containing quotes in otherwise unquoted input, i.e.:
some cell|this "cell" caused issues|other data
It was hard to find, but in Apache's docs, they mention the withQuote() method which can take null as a value.
We were getting the exact same error message and this (thankfully) ended up fixing the issue for us.
I ran into this issue when I forgot to call .withNullString("") on my CSVFormat. Basically, this exception always occurs when:
- your quote symbol is wrong
- your null string representation is wrong
- your column separator char is wrong
Make sure you know the details of your format. Also, some programs write a leading byte-order mark (for example, Excel uses \uFEFF) to denote the encoding of the file. This can also trip up your parser.
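A leading byte-order mark can be dropped before the Reader is handed to the parser. The following is a minimal standard-library sketch; skipBom and readAll are hypothetical helper names, and the sketch assumes the input has already been decoded into characters so that the mark shows up as \uFEFF.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class for illustration.
class BomSkipDemo {

    // Returns a Reader positioned after a leading \uFEFF, if one is present.
    static Reader skipBom(Reader in) {
        PushbackReader reader = new PushbackReader(in, 1);
        try {
            int first = reader.read();
            if (first != -1 && first != '\uFEFF') {
                reader.unread(first); // not a BOM: put the character back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader;
    }

    // Reads everything that is left in the Reader into a String.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(skipBom(new StringReader("\uFEFFid;name")))); // prints id;name
    }
}
```

The Reader returned by skipBom can be passed to the CSV parser in place of the original one.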
