CSV parsing with Commons CSV - Quotes within quotes causing IOException - java

I am using Commons CSV to parse CSV content relating to TV shows. One of the shows has a name that includes double quotes:
116,6,2,29 Sep 10,""JJ" (60 min)","http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"
The show name is "JJ" (60 min), which is already in double quotes. This throws an IOException: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.
ArrayList<String> allElements = new ArrayList<String>();
CSVFormat csvFormat = CSVFormat.DEFAULT;
CSVParser csvFileParser = new CSVParser(new StringReader(line), csvFormat);
List<CSVRecord> csvRecords = null;
csvRecords = csvFileParser.getRecords();
for (CSVRecord record : csvRecords) {
int length = record.size();
for (int x = 0; x < length; x++) {
allElements.add(record.get(x));
}
}
csvFileParser.close();
return allElements;
CSVFormat.DEFAULT already sets withQuote('"')
I think that this CSV is not properly formatted as ""JJ" (60 min)" should be """JJ"" (60 min)" - but is there a way to get commons CSV to handle this or do I need to fix this entry manually?
Additional information: Other show names contain spaces and commas within the CSV entry and are placed within double quotes.

The problem here is that the quotes are not properly escaped. Your parser doesn't handle that. Try univocity-parsers as this is the only parser for java I know that can handle unescaped quotes inside a quoted value. It is also 4 times faster than Commons CSV. Try this code:
//configure the parser to handle your situation
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(STOP_AT_CLOSING_QUOTE);
//create the parser
CsvParser parser = new CsvParser(settings);
//parse your line
String[] out = parser.parseLine("116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"");
for(String e : out){
System.out.println(e);
}
This will print:
116
6
2
29 Sep 10
"JJ" (60 min)
http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj
Hope it helps.
Disclosure: I'm the author of this library, it's open source and free (Apache 2.0 license)

Quoting mainly allows a field to contain separator characters. If embedded quotes in a field are not escaped, this can't work, so there isn't any point in using quotes. If your example value was "JJ", 60 Min, how is a parser to know the comma is part of the field? The data format can't handle embedded commas reliably, so if you want to be able to do that, it is best to change the source to generate an RFC-compliant CSV format.
Otherwise, it looks like the data source is simply surrounding non-numeric fields with quotes and separating the fields with commas, so the parser needs to do the reverse. You should probably just treat the data as comma-delimited and strip the leading/trailing quotes yourself with removeStart/removeEnd.
You might use CSVFormat.withQuote(null), or forget about that and just use String.split(",").
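A minimal sketch of the split-and-strip idea, using only the JDK rather than removeStart/removeEnd from Commons Lang. The class and method names are made up for illustration, and it assumes no field contains a comma, which the question says is not true for every row, so treat it as a starting point only:

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveQuotedSplit {

    // Hypothetical helper: splits on every comma, then removes one leading
    // and one trailing double quote from a field when both are present.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        // The -1 limit keeps trailing empty fields instead of dropping them.
        for (String raw : line.split(",", -1)) {
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                raw = raw.substring(1, raw.length() - 1);
            }
            fields.add(raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\","
                + "\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";
        parse(line).forEach(System.out::println);
    }
}
```

Any quoted field that itself contains a comma will be split in two by this approach, which is exactly the ambiguity described above.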

You can use withEscape('\\') to ignore quotes within quotes
CSVFormat csvFormat = CSVFormat.DEFAULT.withEscape('\\');
Reference: https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html

I think that having both quotations AND spaces in the same token is what confuses the parser. Try this:
CSVFormat csvFormat = CSVFormat.DEFAULT.withQuote('"').withQuote(' ');
That should fix it.
Example
For your input line:
String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";
Output is (and no exception is thrown):
[116, 6, 2, 29 Sep 10, ""JJ" (60 min)", "http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"]

No need for special parsers: just add a double quote in front of the double quote:
116,6,2,29 Sep 10,"""JJ"" (60 min)",...
It's all specified in RFC 4180
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
This is already implemented by CSVFormat.DEFAULT.
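If you control the code that writes the file, the RFC 4180 rule quoted above is a one-liner. This is only a sketch with a made-up class and method name; it always encloses the field, whereas a real writer would usually quote only when needed:

```java
public class Rfc4180Quote {

    // Hypothetical helper: doubles every embedded double quote, then
    // encloses the whole field in double quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Prints the show name in the escaped form shown above.
        System.out.println(quote("\"JJ\" (60 min)"));
    }
}
```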

Related

How to avoid backslash before comma in CSVFormat

I am creating a CSV file using CSVFormat in Java. The problem I am facing, in both the header and the values, is that whenever the string is long and contains a comma, the API always inserts a \ before the comma. As a result the header is not formed correctly, and the values in the CSV file spill into the next cell. Here is the code I have so far:
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT.withHeader("\""+SampleEnum.MY_NAME.getHeader()+"\"", "\""+SampleEnum.MY_TITLE.getHeader()+"\"",
"\""+SampleEnum.MY_ID.getHeader()+"\"", "\""+SampleEnum.MY_NUMBER.getHeader()+"\"", "\""+SampleEnum.MY_EXTERNAL_KEY.getHeader()+"\"",
"\""+SampleEnum.DATE.getHeader()+"\"","\""+SampleEnum.MY_ACTION.getHeader()+"\"",
"\"\"\""+SampleEnum.MY__DEFI.getHeader()+"\"\"\"", SampleEnum.MY_ACTION.getHeader(),
SampleEnum.CCHK.getHeader(), SampleEnum.DISTANCE_FROM_LOCATION.getHeader(),
SampleEnum.TCOE.getHeader(), SampleEnum.HGTR.getHeader(),SampleEnum._BLANK.getHeader(),
SampleEnum.LOCATION_MAP.getHeader(), SampleEnum.SUBMISSION_ID.getHeader())
.withDelimiter(',').withEscape('\\').withQuote('"').withTrim().withQuoteMode(QuoteMode.MINIMAL)
)) {
sampleModel.forEach(sf -> {
try {
csvPrinter.printRecord(sf.getMyName(),
sf.getMyTitle(),
sf.getMyID(),
sf.getMyNo(),
So now the problem is that I am getting output like this:
"\"Name:\"","\"Title\"","\"ID #:\"","\"Store #:\"","\"Store #: External Key\"","\"Date:\"","\"\"\"It's performance issue in detail to include dates,times, circumstances, etc.\"\"\""
I am getting a \ before each comma, and when this happens in a value, the next portion of the text shifts to the next cell.
My required output is:
"Name:","Title:","Employee ID #:","Store #:","Store #: CurrierKey","Date:","Stage of Disciplinary Action:","""Describe your view about the company, times, circumstances, etc.""",
I am trying to follow
https://commons.apache.org/proper/commons-csv/jacoco/org.apache.commons.csv/CSVFormat.java.html
but I am unable to understand the fix. Please help.
This happens because you are using QuoteMode.NONE which has the following Javadoc:
Never quotes fields. When the delimiter occurs in data, the printer prefixes it with the escape character. If the escape character is not set, format validation throws an exception.
You can use QuoteMode.MINIMAL to quote only fields that contain special characters (e.g. the field delimiter, the quote character, or a character of the line separator string).
I suggest that you use CSVFormat.DEFAULT and then configure everything yourself if you cannot use one of the other formats. Check if the backslash (\) is really the right escape character for your use case. Normally it would be a double quote ("). Also, you probably want to remove all the double quotes from your header definition as they get added automatically (if necessary) based on your configuration.
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withHeader("AAAA", "BB\"BB", "CC,CC", "DD'DD")
.withDelimiter(',')
.withEscape('\\') // <- maybe you want '"' instead
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.MINIMAL)
)) {
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "ZZ'ZZ");
}
System.out.println(out);
AAAA,"BB\"BB","CC,CC",DD'DD
WWWW,"XX\"XX","YY,YY",ZZ'ZZ
After your edit, it seems like you want all fields to be quoted with a double quote as escape character. Therefore, you can use QuoteMode.ALL and .withEscape('"') like this:
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withHeader("AAAA", "BB\"BB", "CC,CC", "\"DD\"", "1")
.withDelimiter(',')
.withEscape('"')
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.ALL)
)) {
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "\"DD\"", "2");
}
System.out.println(out);
"AAAA","BB""BB","CC,CC","""DD""","1"
"WWWW","XX""XX","YY,YY","""DD""","2"
In your comment, you state that you only want double quotes when required and triple quotes for one field only. Then, you can use QuoteMode.MINIMAL and .withEscape('"') as suggested in the first example. The triple quotes get generated when you surround your input of that field with double quotes (once because there is a special character and the field needs to be quoted, the second one because you added your explicit " and the third one is there to escape your explicit quote).
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withHeader("AAAA", "BB\"BB", "CC,CC", "\"DD\"", "1")
.withDelimiter(',')
.withEscape('"')
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.MINIMAL)
)) {
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "\"DD\"", "2");
}
System.out.println(out);
AAAA,"BB""BB","CC,CC","""DD""",1
WWWW,"XX""XX","YY,YY","""DD""",2
As per the chat you want total control when the header has quotes and when not. There is no combination of QuoteMode and escape character that will give the desired result. Consequently, I suggest that you manually construct the header:
StringBuilder out = new StringBuilder();
try (CSVPrinter csvPrinter = new CSVPrinter(out,
CSVFormat.DEFAULT
.withDelimiter(',').withEscape('"')
.withQuote('"').withRecordSeparator('\n').withTrim()
.withQuoteMode(QuoteMode.MINIMAL))
) {
out.append(String.join(",", "\"AAAA\"", "\"BBBB\"", "\"CC,CC\"", "\"\"\"DD\"\"\"", "1"));
out.append("\n");
csvPrinter.printRecord("WWWW", "XX\"XX", "YY,YY", "\"DD\"", "2");
}
System.out.println(out);
"AAAA","BBBB","CC,CC","""DD""",1
WWWW,"XX""XX","YY,YY","""DD""",2

How do I extract variable data from a line of string with tags?

I'm trying to write Java code to go to a website, read the HTML code line-by-line, extract certain pieces of data, including an embedded URL to go to another website, and repeat the process 100 times.
I've been able to isolate most of the pieces of data I need using expressions like:
s.ranking = line.substring(line.indexOf(">")+1, line.length() -7);
But I'm having problems with the following line:
<strong>Writer:</strong> Dylan <br/><strong>Producer:</strong> Tom Wilson&nbsp <br/><strong>Released:</strong> July '65, Columbia<br/>12 weeks; No. 2</p>
I need to extract and save the Writer data (Dylan). The producer data (Tom Wilson) and the Release date data (July '65). Some of the pages will have multiple writers and will be labeled "Writers:", and some will have multiple producers, labeled "Producers:"
How do I capture "Dylan" ,"Tom Wilson" and "July '65" from the above line in Java?
Thank you very much!
DM
The best approach is to use an HTML parser. But as I read in your comment: "I'm doing this for a class and am learning about finding, isolating and extracting data."
You can do something like this:
String producer = "Producer:";
String writer = "Writer:";
String released = "Released:";
String s = "<strong>Writer:</strong> Dylan <br/><strong>Producer:</strong> Tom Wilson&nbsp <br/><strong>Released:</strong> July '65, Columbia<br/>12 weeks; No. 2</p> ";
int writerIndex = s.lastIndexOf(writer);
int producerIndex = s.lastIndexOf(producer);
int releasedIndex = s.lastIndexOf(released);
String writerExtracted = s.substring(writerIndex + writer.length(),
producerIndex).replaceAll("\\<.*?>", "");
System.out.println(writerExtracted);
String producerExtracted = s.substring(
producerIndex + producer.length(), releasedIndex).replaceAll(
"\\<.*?>", "");
System.out.println(producerExtracted);
String releasedExtracted = s.substring(
releasedIndex + released.length(), s.length()).replaceAll(
"\\<.*?>", "");
System.out.println(releasedExtracted);
Output:
Dylan
Tom Wilson&nbsp
July '65, Columbia12 weeks; No. 2
NOTE: you can get rid of entities such as &#039 or &nbsp using another regex.
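Here is a rough sketch of that cleanup step combined with a pattern that picks up whatever label sits inside the strong tags, so "Writers:" and "Producers:" are handled the same way as the singular forms. The class name, method name, and both regexes are my own assumptions, not something from the page being scraped:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelExtractor {

    // Captures the label inside <strong>...:</strong> and the text that
    // follows it, up to (but not including) the next tag.
    private static final Pattern FIELD =
            Pattern.compile("<strong>(\\w+):</strong>([^<]*)");

    // Hypothetical helper: returns label -> cleaned value, in page order.
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(html);
        while (m.find()) {
            String value = m.group(2)
                    .replaceAll("&nbsp;?", " ")   // entity with or without ';'
                    .replaceAll("&#0?39;?", "'")  // numeric apostrophe entity
                    .trim();
            result.put(m.group(1), value);
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "<strong>Writer:</strong> Dylan <br/><strong>Producer:</strong> "
                + "Tom Wilson&nbsp <br/><strong>Released:</strong> "
                + "July '65, Columbia<br/>12 weeks; No. 2</p>";
        extract(s).forEach((label, value) -> System.out.println(label + " -> " + value));
    }
}
```

Because the value stops at the next tag, the "12 weeks; No. 2" part is no longer glued onto the release information. Splitting "July '65, Columbia" on the comma to keep only the date is left to you.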

How to read data from CSV if it contains more than the expected separators?

I use CsvJdbc to read data from a CSV. I get the CSV from a web service request, so it is not loaded from a file. I set these properties:
Properties props = new java.util.Properties();
props.put("separator", ";"); // separator is a semicolon
props.put("fileExtension", ".txt"); // file extension is .txt
props.put("charset", "UTF-8"); // UTF-8
My sample1.txt contains this data:
code;description
c01;d01
c02;d02
My sample2.txt contains this data:
code;description
c01;d01
c02;d0;;;;;2
Removing the headers from the CSV is an option for me, but changing the semicolon separator is not.
EDIT: My query for resultSet: SELECT * FROM myCSV
I want to read code column in sample1.txt and sample2.txt with:
resultSet.getString(1)
and read the full description column, including its semicolons (d0;;;;;2). Is this possible with the CsvJdbc driver, or do I need to change drivers?
Thanks for any advice!
This is a problem that occurs when you have messy, invalid input, which you need to try to interpret, that's being read by a too-high-level package that only handles clean input. A similar example is trying to read arbitrary HTML with an XML parser - close, but no cigar.
You can guess where I'm going: you need to pre-process your input.
The preprocessing may be very easy if you can make some assumptions about the data - for example, if there are guaranteed to be no quoted semi-colons in the first column.
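Under exactly that assumption (the first column never contains a semicolon), the preprocessing can be as small as a split with a limit of 2, so only the first ; acts as a separator. A sketch, with a made-up class and method name:

```java
public class FirstSeparatorSplit {

    // Hypothetical helper: returns {code, description}. Only the first ';'
    // is treated as a separator; later ones stay in the description.
    static String[] splitCodeAndDescription(String line) {
        return line.split(";", 2);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"c01;d01", "c02;d0;;;;;2"}) {
            String[] parts = splitCodeAndDescription(line);
            System.out.println(parts[0] + " | " + parts[1]);
        }
    }
}
```

A line with no semicolon at all comes back as a single-element array, so check the length before reading index 1 on real input.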
You could try supercsv. We have implemented such a solution in our project. More on this can be found in http://supercsv.sourceforge.net/
and
Using CsvBeanReader to read a CSV file with a variable number of columns
In the end I solved this problem without the CsvJdbc or SuperCSV drivers. Both drivers work fine and offer many features, including querying data from a CSV file, but in my case I don't need to query the CSV. Unfortunately, the description column sometimes contains one or more semicolons, which is also my separator.
First I tried the code in the answer from @Maher Abuthraa and modified it to:
private String createDescriptionFromResult(ResultSet resultSet, int columnCount) throws SQLException {
if (columnCount > 2) {
StringBuilder data_list = new StringBuilder();
for (int ii = 2; ii <= columnCount; ii++) {
data_list.append(resultSet.getString(ii));
if (ii != columnCount)
data_list.append(";");
}
// data_list has all data from all index you are looking for ..
return data_list.toString();
} else {
// use standard way
return resultSet.getString(2);
}
}
The loop starts from 2 because column 1 is the code and only the description column contains extra semicolons. The CsvJdbc driver splits columns on the ; separator, so these semicolons disappear from the column data. I therefore add the semicolons back to the description explicitly, except after the last column, because that is not relevant in my case.
This code works fine, but it did not solve my whole problem. When I declared two columns in the CSV header, I got an error on any row that contained more than two semicolons. So I tried ignoring the headers, or adding extra column names (or simply ;) to the header. In SuperCSV the ignore-headers option works fine.
My colleague's opinion was: you don't need a CSV driver, because what you are loading is not really CSV if the separator is sometimes part of the data.
I think my colleague is right, so I loaded the CSV data with the following code:
InputStream in = null;
try {
in = new ByteArrayInputStream(csvData);
List lines = IOUtils.readLines(in, "UTF-8");
Iterator it = lines.iterator();
String line = "";
while (it.hasNext()) {
line = (String) it.next();
String description = null;
String code = null;
String[] columns = line.split(";");
if (columns.length >= 2) {
code = columns[0];
String[] dest = new String[columns.length - 1];
System.arraycopy(columns, 1, dest, 0, columns.length - 1);
description = org.apache.commons.lang.StringUtils.join(dest, ";");
(...)
OK, my solution is to read all the fields if there are more than 2 columns, like this:
int ccc = meta.getColumnCount();
if (ccc > 2) {
ArrayList<String> data_list = new ArrayList<String>();
// JDBC column indexes are 1-based, so the last column is index ccc
for (int ii = 1; ii <= ccc; ii++) {
data_list.add(resultSet.getString(ii));
}
//data_list has all data from all index you are looking for ..
} else {
//use standard way
resultSet.getString(1);
}
If the table is defined to have as many columns as there could be semi-colons in the source, ignoring the initial column definitions, then the excess semi-colons would be consumed by the database driver automatically.
The most likely reason for them to appear in the final column is because the parser returns the balance of the row to the terminator in the field.
Simply increasing the number of columns in the table to match the maximum possible in the input will avoid the need for custom parsing in the program. Try:
code;description;dummy1;dummy2;dummy3;dummy4;dummy5
c01;d01
c02;d0;;;;;2
Then, the additional ';' delimiters will be consumed by the parser correctly.

error during grouping files based on the date field

I have a large file which has 10,000 rows and each row has a date appended at the end. All the fields in a row are tab separated. There are 10 dates available and those 10 dates have randomly been assigned to all the 10,000 rows. I am now writing a java code to write all those rows with the same date into a separate file where each file has the corresponding rows with that date.
I am trying to do it using string manipulations, but when I am trying to sort the rows based on date, I am getting an error while mentioning the date and the error says the literal is out of range. Here is the code that I used. Please have a look at it let me know if this is the right approach, if not, kindly suggest a better approach. I tried changing the datatype to Long, but still the same error. The row in the file looks something like this:
Each field is tab separated and the fields are:
business id, category, city, biz.name, longitude, state, latitude, type, date
qarobAbxGSHI7ygf1f7a_Q ["Sandwiches","Restaurants"] Gilbert Jersey Mike's Subs -111.8120071 AZ 3.5 33.3788385 business 06012010
The code is:
File f=new File(fn);
if(f.exists() && f.length()>0)
{
BufferedReader br=new BufferedReader(new FileReader(fn));
BufferedWriter bw = new BufferedWriter(new FileWriter("FilteredDate.txt"));
String s=null;
while((s=br.readLine())!=null){
String[] st=s.split("\t");
if(Integer.parseInt(st[13])==06012010){
Thanks a lot for your time..
Try this,
List<String> sampleList = new ArrayList<String>();
sampleList.add("06012012");
sampleList.add("06012013");
sampleList.add("06012014");
sampleList.add("06012015");
//
//
String[] sampleArray = s.split(" ");
if (sampleArray != null)
{
String sample = sampleArray[sampleArray.length - 1];
if (sampleList.contains(sample))
{
stringBuilder.append(sample + "\n");
}
}
I suggest not using split, but rather
String str = s.substring(s.lastIndexOf('\t') + 1);
In any case, you are reading st[13] when, as far as I can see, you only have 9 columns. You might just need st[8].
One last thing: look at this post to learn what 06012010 really means. In Java, an integer literal with a leading 0 is octal, so it is not the decimal number 6012010.
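Putting those points together, here is a sketch that keeps the date as a String (so no octal surprises and no parseInt) and groups the rows by the last tab-separated field. The class and method names are invented for the example, and writing each group to its own file is left out:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByDate {

    // Hypothetical helper: maps each date string to the rows ending with it.
    static Map<String, List<String>> groupByLastField(List<String> rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String row : rows) {
            // Everything after the last tab; the +1 skips the tab itself.
            String date = row.substring(row.lastIndexOf('\t') + 1);
            groups.computeIfAbsent(date, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "id1\tGilbert\tAZ\t06012010",
                "id2\tPhoenix\tAZ\t06022010",
                "id3\tTempe\tAZ\t06012010");
        groupByLastField(rows).forEach((date, group) ->
                System.out.println(date + ": " + group.size() + " row(s)"));
    }
}
```

From there, each map entry can be written to a file named after its key, for example with a BufferedWriter per date.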

How do you escape colon (:) in Properties file?

I am using a properties file to store my application's configuration values.
In one of the instances, I have to store a value as
xxx:yyy:zzz. When I do that, the colon is escaped with a backslash \, resulting in the value showing as xxx\:yyy\:zzz in the properties file.
I am aware that the colon : is a standard delimiter of the Properties Java class. However I still need to save the value without the back slash \.
Any suggestions on how to handle this?
Put the properties into the Properties object and save it using a store(...) method. The method will perform any escaping required. The Java documentation says:
"... For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded."
You only need to manually escape characters if you are creating / writing the file by hand.
Conversely, if you want the file to contain unescaped colon characters, store(...) cannot produce it. An unescaped colon in a key makes the file ambiguous, because Properties.load(...) treats the first unescaped : or = as the key/value separator. If you go down this route, you'll need to implement your own custom store method, and possibly a custom load as well.
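To see that the backslash only exists in the file and not in the value your program reads back, here is a small round-trip sketch using in-memory writers instead of a real file. The class and helper names are made up for the example:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ColonRoundTrip {

    // Hypothetical helper: returns the text that store(...) would write.
    static String stored(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Hypothetical helper: loads the given text and looks up one key.
    static String loaded(String fileContent, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String text = stored("mykey", "xxx:yyy:zzz");
        System.out.println(text);                   // file form, colons escaped
        System.out.println(loaded(text, "mykey"));  // value as the program sees it
    }
}
```

As far as I can tell, load(...) only looks for the first separator on a line, so a hand-edited line with unescaped colons in the value (not the key) is still read back with the colons intact.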
I came across the same issue. Forward slashes / also get escaped by the store() method in Properties.
I solved this issue by creating my own CustomProperties class (extending java.util.Properties) and commenting out the call to saveConvert() in the customStore0() method.
Here is my CustomProperties class:
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.util.Date;
import java.util.Enumeration;
import java.util.Properties;
public class CustomProperties extends Properties {
private static final long serialVersionUID = 1L;
@Override
public void store(OutputStream out, String comments) throws IOException {
customStore0(new BufferedWriter(new OutputStreamWriter(out, "8859_1")),
comments, true);
}
//Overridden to stop '/' or ':' chars from being escaped, by not calling
//saveConvert(key, true, escUnicode)
private void customStore0(BufferedWriter bw, String comments, boolean escUnicode)
throws IOException {
bw.write("#" + new Date().toString());
bw.newLine();
synchronized (this) {
for (Enumeration e = keys(); e.hasMoreElements();) {
String key = (String) e.nextElement();
String val = (String) get(key);
// Commented out to stop '/' or ':' chars being replaced
//key = saveConvert(key, true, escUnicode);
//val = saveConvert(val, false, escUnicode);
bw.write(key + "=" + val);
bw.newLine();
}
}
bw.flush();
}
}
We hit this question a couple of days ago. We were manipulating existing properties files with URLs as values.
It's risky but if your property values are less than 40 characters then you can use the "list" method instead of "store":
http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#list(java.io.PrintWriter)
We had a quick look at the JDK code and hacked out a custom implementation of store that works for our purposes:
public void store(Properties props, String propertyFilePath) throws FileNotFoundException {
PrintWriter pw = new PrintWriter(propertyFilePath);
for (Enumeration e = props.propertyNames(); e.hasMoreElements();) {
String key = (String) e.nextElement();
pw.println(key + "=" + props.getProperty(key));
}
pw.close();
}
If you use the xml variant of the properties file (using loadFromXML and storeToXML) this shouldn't be a problem.
Try using unicode.
The unicode for a colon is \u003A
Additionally the unicode for a space is: \u0020
For a list of basic Latin characters see: https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)
For example:
ProperName\u003A\NameContinues=Some property value
Will expect a property with a key:
ProperName:NameContinues
And will have a value of:
Some property value
For me it worked by using \ before special character,
e.g,
Before: VCS\u003aIC\u0020Server\u003a=Migration
After: VCS\:IC\ Server\:=Migration
: is escaped with \: and (space) with \ (\ followed by <Space>).
For more info : https://en.wikipedia.org/wiki/.properties
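A quick way to convince yourself that the Before and After lines describe the same key is to load both and look up the key with a literal colon and space. This is a sketch with an invented class name; the doubled backslashes are only Java string escaping:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class EscapedKeyLoad {

    // Hypothetical helper: parses properties-file text held in a String.
    static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Same key written with unicode escapes and with backslash escapes.
        String unicodeForm = "VCS\\u003aIC\\u0020Server\\u003a=Migration";
        String backslashForm = "VCS\\:IC\\ Server\\:=Migration";

        System.out.println(load(unicodeForm).getProperty("VCS:IC Server:"));
        System.out.println(load(backslashForm).getProperty("VCS:IC Server:"));
    }
}
```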
For people like me that get here for this when using Spring Boot configuration properties files: You need to enclose in [..]:
E.g.:
my.test\:key=value
is not enough, you need this in your application.properties for example:
my.[test\:key]=value
See also SpringBoot2 ConfigurationProperties removes colon from yaml keys
It's simple,
just use apostrophes ' ' there
E.g.:
Instead of this(case 1)
File file= new File("f:\\properties\\gog\\esave\\apple");
prop.setProperty("basedir",file.toString());
Use this(case 2)
File file= new File("f':'\\properties\\gog\\esave\\apple");
prop.setProperty("basedir",file.toString());
Output will be
Case 1: basedir = f\:\\properties\\gog\\esave\\apple
Case 2: basedir = f:\\properties\\gog\\esave\\apple
I hope this will help you
