CSV parsing with univocity-parsers and backslash-escaped quotes - java

I'm having trouble parsing CSV with backslash-escaped quotes (\"). Most lines in the source CSV don't contain escaped quotes, but for the ones that do I can't find settings that parse them correctly.
CSV example (each line with 4 columns):
1,,No quote escape,test
2,,"One quote escape\"",test
3,,"Two \"quote escapes\"",test
4,,"Two \"quote escapes\" 2",test
CSV parser settings:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Code snippet:
CsvParserSettings settings = new CsvParserSettings();
settings.setDelimiterDetectionEnabled(true);
settings.setLineSeparatorDetectionEnabled(true);
settings.getFormat().setQuote('"');
settings.getFormat().setQuoteEscape('\\');
CsvParser parser = new CsvParser(settings);
parser.beginParsing(file, StandardCharsets.UTF_8);
...
Lines are parsed correctly until two escaped quotes are present in one line. Expected parsed lines are:
- 1,null,No quote escape,test
- 2,null,One quote escape",test
- 3,null,Two "quote escapes",test
- 4,null,Two "quote escapes" 2,test

Upon further inspection I found an existing issue for v2.9.1.
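To make the expected behaviour concrete, here is a minimal hand-rolled sketch of the parsing rule described above: inside a quoted field, \" is a literal quote, and an unquoted empty field becomes null. It uses only the JDK; the class name BackslashCsv and the method parseLine are hypothetical, and this is not how univocity-parsers works internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-rolled sketch of backslash-escaped quote handling (hypothetical helper, JDK only). */
final class BackslashCsv {

    /** Splits one comma-separated line, treating \" inside a quoted field as a literal quote. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;   // currently inside a quoted field
        boolean wasQuoted = false;  // the current field was quoted, so empty means "" and not null

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '\\' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');  // escaped quote: keep the quote, drop the backslash
                    i++;
                } else if (c == '"') {
                    inQuotes = false;   // closing quote
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
                wasQuoted = true;
            } else if (c == ',') {
                fields.add(finish(field, wasQuoted));
                field.setLength(0);
                wasQuoted = false;
            } else {
                field.append(c);
            }
        }
        fields.add(finish(field, wasQuoted));
        return fields;
    }

    /** Unquoted empty fields become null, matching the expected output in the question. */
    private static String finish(StringBuilder field, boolean wasQuoted) {
        return field.length() == 0 && !wasQuoted ? null : field.toString();
    }
}
```

Calling BackslashCsv.parseLine on line 4 of the example yields the four fields 4, null, Two "quote escapes" 2 and test. The sketch handles a single line only, so quoted fields that span line breaks are out of scope.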

Related

Using Same Escape and Quote Character Breaks CSV

I have a simple CSV file like this:
SellerProductID;ProductTextLong
1000;"a ""good"" Product"
This is my attempt to read it with Apache Commons CSV:
try (Reader reader = new StringReader(content)) {
CSVFormat format = CSVFormat.DEFAULT.withDelimiter(';').withHeader().withEscape('"').withQuote('"');
CSVParser records = format.parse(reader);
System.out.println(records.iterator().next());
}
That doesn't work because of:
Exception in thread "main" java.lang.IllegalStateException: IOException reading next record: java.io.IOException: (startline 2) EOF reached before encapsulated token finished
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:145)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.next(CSVParser.java:171)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.next(CSVParser.java:137)
Caused by: java.io.IOException: (startline 2) EOF reached before encapsulated token finished
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:288)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:158)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:674)
at org.apache.commons.csv.CSVParser$CSVRecordIterator.getNextRecord(CSVParser.java:142)
... 3 more
Other CSV tools (e.g. Google Sheets) can load the CSV just fine.
It works if I use another quote or escape character, but sadly the customer's CSV is set.
How do I configure Apache CSV to allow the same escape and quote character? Or is there any way to modify a stream to replace the quote characters on the fly (the files are gigantic)?
The entire problem is that " is not the "escape character".
From Wikipedia:
Embedded double quote characters may then be represented by a pair of consecutive double quotes, or by prefixing a double quote with an escape character such as a backslash.
So in this case, "" is just two quote characters next to each other, while the escape character is a different character used to escape quotes, line breaks or separators.
This fixes it (note that withEscape() is called differently, but the example data doesn't show what the escape character actually is):
try (Reader reader = new StringReader(content)) {
CSVFormat format = CSVFormat.DEFAULT.withDelimiter(';').withHeader().withEscape('/').withQuote('"');
CSVParser records = format.parse(reader);
System.out.println(records.iterator().next());
}
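To illustrate the distinction, here is a small JDK-only sketch of the doubled-quote rule: inside a quoted field, "" stands for one literal quote and no separate escape character is involved. The class DoubledQuoteCsv and its parseLine method are hypothetical names for this illustration, not part of Apache Commons CSV.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of RFC 4180 style quoting: a doubled quote inside a quoted field is one literal quote. */
final class DoubledQuoteCsv {

    static List<String> parseLine(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');  // "" inside quotes: emit a single quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;   // a lone quote closes the field
                } else {
                    field.append(c);    // every other character, backslash included, is literal
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }
}
```

With the data row from the question, DoubledQuoteCsv.parseLine("1000;\"a \"\"good\"\" Product\"", ';') returns the two fields 1000 and a "good" Product.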
I have looked over your issue; this article and this post might help you. Also try adding .withNullString("").

Java: OpenCSV escape character in fields

I have my input file with quoted fields. Below is how I am initializing the CSV reader
CSVParser parser = new CSVParserBuilder().withSeparator(CSVParser.DEFAULT_SEPARATOR).build();
CSVReader reader = new CSVReaderBuilder(new FileReader("abc.txt")).withCSVParser(parser).build();
With the following input, it reads properly.
"1","abc","this works properly with ""quotes"" as well"
With the following input, it fails
"1","abc","this fails with \""backslash\"" and ""quotes"". "
I know that in Java the backslash is an escape character. Is there a workaround to read the above line properly? Unfortunately, I can't change the input format, as it's generated by our client's legacy system.

preserve /t and /n in XML attribute with Java parser

In an XML file parsed to a Document, I want to get an XML attribute that has embedded tabs and new lines.
I've googled and found that the XML spec says attribute text is "normalized", replacing whitespace characters with a space.
I guess I have to replace the tabs and line breaks with appropriate escaped characters before I parse the XML.
In all of my googling I have not found a straightforward way to get from the File to a Document where the attribute text is returned with tabs and line breaks preserved.
The XML file is generated by a third-party application, so it cannot be addressed there.
I want to use the JDK parser.
My initial attempts at reading the File into a String and parsing the String fail with a parse error on the first byte.
Any suggestions on a straightforward approach?
An example element is on Pastebin: https://pastebin.com/pc9uGbSD
I parse the XML like this:
public ReadPlexExport(Path xmlPath, ExportType exType) throws Exception {
this.xmlPath = xmlPath;
this.type = exType;
this.doc = DBF.newDocumentBuilder().parse(this.xmlPath.toFile());
}
The quick and dirty solution to my immediate problem was to read the XML file line by line as a text file, on each line replacing \t characters with the escaped tab value, writing the line to a new file, then appending an escaped line break.
The new XML files could be parsed. The original XML would always be in a form that allowed this hack as \t and line breaks would only ever occur in Attributes.
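A rough in-memory sketch of that workaround, assuming (as stated above) that tabs and line breaks only ever occur inside attribute values. It works on a String instead of rewriting the file line by line, and the class name AttributeWhitespace is hypothetical. Raw whitespace is replaced with the character references &#9; and &#10;, which the parser resolves back to a tab and a line feed without normalizing them to spaces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: protect tabs and line breaks in attributes by turning them into character references. */
final class AttributeWhitespace {

    /** Replaces raw line breaks and tabs with &#10; and &#9;. Only safe if they occur solely inside attributes. */
    static String escapeWhitespace(String xml) {
        return xml.replace("\r\n", "&#10;")
                  .replace("\n", "&#10;")
                  .replace("\t", "&#9;");
    }

    /** Escapes the whitespace, then parses the result with the JDK DOM parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapeWhitespace(xml))));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

After AttributeWhitespace.parse(xml), getAttribute on the element returns the value with its tabs and line feeds intact. Any line break outside an attribute, for example after an XML declaration, would break this approach, since a character reference is not allowed there.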

How to skip invalid double quote character line in csv file using java?

I have a CSV file containing 78400 lines (25 MB).
When I read the file line by line, one column in the 2nd line causes an error.
It contains a backslash character.
When I read this column, all the remaining columns in the CSV file are read as a single column.
"CDE","456","6346","testdata2","MyData2","ClassB"
"ABC","123","4567\","testdata","MyData","ClassA"
"CDE","456","6346","testdata2","MyData2","ClassB"
How can I skip that line, using the line separator, in Java?
You can write a method that splits the line into fields and checks each one for the \ character:
String line = br.readLine();
String[] words = line.split(",");
boolean escape = false;
for (String word : words) {
    // flag the line if any field contains a backslash
    escape |= word.indexOf('\\') >= 0;
}
You can then identify the escape and handle it specially.
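As a sketch of how that check could be used to skip the bad lines altogether, the following JDK-only helper drops every line that contains a backslash before the rest is handed to a CSV parser. The class and method names are hypothetical, and it assumes that no record spans several lines and that a backslash never appears in valid data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: keep only the lines that contain no backslash. */
final class BackslashLineFilter {

    static List<String> linesWithoutBackslash(Reader source) {
        try (BufferedReader br = new BufferedReader(source)) {
            return br.lines()
                     .filter(line -> line.indexOf('\\') < 0)  // drop lines with a stray backslash
                     .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the three sample lines above, this returns the first and third line and skips the one containing "4567\".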
If you are using openCSV, just define your parser with an escape character other than backslash. If you don't want an escape character at all, you can use ICSVParser.NULL_CHARACTER, or, if you are using version 3.9 of openCSV, you can use the RFC4180Parser.
RFC4180ParserBuilder rfc4180ParserBuilder = new RFC4180ParserBuilder();
ICSVParser rfc4180Parser = rfc4180ParserBuilder.build();
// sr is the Reader over the CSV input
CSVReaderBuilder builder = new CSVReaderBuilder(sr);
CSVReader reader = builder.withCSVParser(rfc4180Parser).build();

Bindy CRLF for UNMARSHAL

I want to unmarshal a CSV file to a bean.
The issue is that the record separator (the "newline") will be a semicolon ";".
The CsvRecord annotation has a crlf separator for marshalling to a CSV file. Is there a workaround for the reverse scenario? As of now I am replacing the semicolon with a NEWLINE character.
But I might have a requirement where the NEWLINE could be either the conventional "\r\n" or ";".
Any suggestions would be of great help.
You can set a custom newline character with
@CsvRecord(separator = ",", crlf=";")
public class Order {
...
}
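For the case where the record separator may be either "\r\n" or ";", one option is to split the input into records before unmarshalling, much like the replace-with-NEWLINE workaround in the question. The sketch below is plain JDK code, not a Bindy API; the class name RecordSplitter is hypothetical, and it assumes that ";" never appears inside a field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Sketch: split raw input into records on either a line break or a semicolon. */
final class RecordSplitter {

    static List<String> splitRecords(String input) {
        List<String> records = new ArrayList<>();
        // The delimiter is a regex: an optional \r followed by \n, or a single semicolon.
        try (Scanner scanner = new Scanner(input).useDelimiter("\\r?\\n|;")) {
            while (scanner.hasNext()) {
                String record = scanner.next();
                if (!record.isEmpty()) {   // skip empty tokens between adjacent separators
                    records.add(record);
                }
            }
        }
        return records;
    }
}
```

Each returned record can then be passed to the unmarshalling step one at a time, or the records can be re-joined with a single separator that matches the crlf setting.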
