OpenCSV parser in Java unable to parse double quotes in the data

I have the following CSV file:
"id","Description","vale"
1,New"Account","val1"
I am unable to read this file with the opencsv jar. It cannot read New"Account because of the double quote inside the data. My CSVReader constructor call is as follows:
csvReader = new CSVReader(new FileReader(currentFile), ',', '\"', '\0');

This is invalid csv:
1,New"Account","val1"
should be:
1,"New""Account","val1" -> if you want 1 New"Account val1
or
1,"New""Account""","val1" -> if you want 1 New"Account" val1
Quotes inside (quoted) fields must be escaped with another quote.
While you could change your code to read the malformed CSV correctly, the CSV data should be fixed in the first place, because you might get more errors with larger CSV files or with updates to that data.
Usually, quotes are used when there is a separator or another quote inside the field. So if you ignored the quotes and only split on the separator, there would be problems as soon as a separator appears inside a field in a future update of the data, for example:
1,"John, Doe",123
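If you control the code that writes the file, the doubling rule described above can be applied when each field is written. This is only a minimal sketch using plain string handling; the class and method names (CsvQuoting, quoteField) are made up for illustration and are not part of opencsv.

```java
public class CsvQuoting {
    /** Wraps a field in double quotes and doubles every quote inside it. */
    public static String quoteField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Rebuilds the corrected data row from the answer above.
        String row = String.join(",", "1", quoteField("New\"Account"), quoteField("val1"));
        System.out.println(row); // prints 1,"New""Account","val1"
    }
}
```

A reader that follows the same convention turns the doubled quote back into a single one, so the field value comes out as New"Account.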

That is as designed. Your constructor specifies '\"' as the quote character, so OpenCSV treats that character as a quote: when it reads one, it ignores all commas until a matching quote is found.
To get around this you could use a FilterReader that replaces every quote with a space before OpenCSV sees the data.
Reader reader = new FilterReader(fileReader) {
    private int filter(int ch) {
        return ch == '"' ? ' ' : ch;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int red = super.read(cbuf, off, len);
        for (int i = off; i < off + red; i++) {
            cbuf[i] = (char) filter(cbuf[i]);
        }
        return red;
    }

    @Override
    public int read() throws IOException {
        return filter(super.read());
    }
};

Related

How to extract line with syntax error when parsing PlSQL using Antlr4

I am using the grammar file for PlSql from this Github repository. I want to underline the line in plsql file that I parse if it has a syntax error. I have the following snippet to do so:
public static class UnderlineListener extends BaseErrorListener {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer,
                            Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg,
                            RecognitionException e)
    {
        System.err.println("line " + line + ":" + charPositionInLine + " " + msg);
        underlineError(recognizer, (Token) offendingSymbol,
                       line, charPositionInLine);
    }

    protected void underlineError(Recognizer recognizer,
                                  Token offendingToken, int line,
                                  int charPositionInLine) {
        CommonTokenStream tokens =
            (CommonTokenStream) recognizer.getInputStream();
        String input = tokens.getTokenSource().getInputStream().toString();
        String[] lines = input.split("\n");
        String errorLine = lines[line - 1];
        System.err.println(errorLine);
        for (int i = 0; i < charPositionInLine; i++) System.err.print(" ");
        int start = offendingToken.getStartIndex();
        int stop = offendingToken.getStopIndex();
        if (start >= 0 && stop >= 0) {
            for (int i = start; i <= stop; i++) System.err.print("^");
        }
        System.err.println();
    }
}
While this works fine in most cases, some scripting languages, like PlSql, need special handling for case-sensitivity. This means I had to use CaseChangingCharStream as follows:
CharStream s = CharStreams.fromPath(Paths.get("test.sql"));
CaseChangingCharStream upper = new CaseChangingCharStream(s, true);
Lexer lexer = new SomeSQLLexer(upper);
Now when I try to get the input text inside my UnderlineListener using String input = tokens.getTokenSource().getInputStream().toString();, I do not get the actual text of my test.sql. This is because getInputStream() returns a CaseChangingCharStream object, which does not give the actual text of my test.sql.
How do I get the actual file text in my case? One way could be to pass the file content to the constructor of UnderlineListener, but I would prefer to stick to the above method of getting the actual file text, since it can also be used where CaseChangingCharStream is not used.

I have found a workaround. The current implementation of CaseChangingCharStream.java does not have a getter method, like getCharStream(), to access final CharStream stream; attribute. Simply adding a getter method for it allows us to access the underlying CharStream object as follows:
CaseChangingCharStream modifiedCharStream = (CaseChangingCharStream) tokens.getTokenSource().getInputStream();
String input = modifiedCharStream.getCharStream().toString();

Line breaks in field treated as end of line while parsing csv file

In a CSV file I have a record that renders like this:
,"SKYY SPA MARTINI
2 oz. SKYY Vodka
Fresh cucumber
Fresh mint
Splash of simple syrup
Muddle cucumber & mint with syrup.
Add SKYY Vodka and shake with ice.
Strain into a chilled martini glass.
Garnish with a fresh mint sprig and cucumber slice.",
with each line ending with an LF (line feed) character.
I thought that this would be treated as a single string and that the line feeds wouldn't be treated as new lines, but this isn't the case, and it is breaking my script. Is there a way to have the reader treat line breaks as record separators only when they're not enclosed in quotes? This is my current code; I couldn't find a tokenizer setting that would allow this.
// instantiate description line mapper
DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
DefaultLineMapper<LCBOProduct> lineMapper = new DefaultLineMapper<>();
lineMapper.setLineTokenizer(lineTokenizer);
lineMapper.setFieldSetMapper(fieldSetMapper);
// set description line mapper
reader.setLineMapper(lineMapper);
return reader;

Inspired by this CSV regex post, I have written a quick-and-dirty method for doing this:
public static void main(String[] args) {
    String line = "\"BEEP\",\"BOOP\",\"TWO SHOTS\rOF VODKA\"\r\"BOOP\",\"BEEP\",\"LEMON\rWEDGES\"";
    String quote = "\"";
    String splitter = "\r";
    String delimiter = ",";
    parse(line, delimiter, quote, splitter);
}

public static void parse(String data, String delimiter, String quote, String splitter) {
    // Split on the splitter only when it is followed by an even number of quotes,
    // i.e. when it is not inside a quoted field.
    String regex = splitter + "(?=(?:[^" + quote + "]*" + quote + "[^" + quote + "]*" + quote + ")*[^" + quote + "]*$)";
    String[] lines = data.split(regex, -1);
    List<String[]> records = new ArrayList<String[]>();
    for (String line : lines) {
        records.add(line.split(delimiter, -1));
    }
    for (String[] line : records) {
        for (String record : line) {
            System.out.println("RECORD: " + record); // do whatever
        }
    }
}
Of course, considering the large size of some CSV files, you will need to chug along with a StringBuilder and likely use myStringBuilder.toString().split(regex, -1); for the parse method.
This is likely not the Spring way of doing things. But as Jim Garrison commented, this is an edge case that I'm not sure if Spring has ways of solving.
A more complex regex may be required if the records start using other nasty characters (commas, quotes, etc.). I don't know what the source of these records could be, but some sanitizing may be in order before splitting the file.
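As an alternative to a more complex regex, the record splitting can also be done with a small character scanner that tracks whether it is inside quotes. This is just a sketch under the assumption that fields are quoted with double quotes and that a quote inside a field is doubled; the class name QuotedRecordSplitter is made up, and this is not a Spring Batch API.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedRecordSplitter {
    /** Splits data into records at line breaks that are outside double quotes. */
    public static List<String> splitRecords(String data) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                // A doubled quote toggles the flag twice, so the state is unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if ((c == '\n' || c == '\r') && !inQuotes) {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < data.length() && data.charAt(i + 1) == '\n') {
                    i++;
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        records.add(current.toString()); // last record (empty if data ends with a line break)
        return records;
    }

    public static void main(String[] args) {
        String data = "\"BEEP\",\"TWO SHOTS\nOF VODKA\"\n\"BOOP\",\"LEMON\nWEDGES\"";
        for (String record : splitRecords(data)) {
            System.out.println("RECORD: " + record.replace("\n", "\\n"));
        }
    }
}
```

Each returned record still contains its quotes and embedded line breaks, so it can then be handed to whatever field-level parsing you already use.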

OpenCSV not escaping the quotes(")

I have a CSV file that can contain delimiters or unclosed quotes inside quoted fields. How do I make CSVReader ignore the quotes and delimiters inside quotes?
For example:
123|Bhajji|Maga|39|"I said Hey|" I am "5|'10."|"I a do "you"|get that"
This is the content of the file.
Below is the program that reads the CSV file.
@Test
public void readFromCsv() throws IOException {
    FileInputStream fis = new FileInputStream(
            "/home/netspurt/awesomefile.csv");
    InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
    CSVReader reader = new CSVReader(isr, '|', '\"');
    for (String[] row; (row = reader.readNext()) != null;) {
        System.out.println(Arrays.toString(row));
    }
    reader.close();
    isr.close();
    fis.close();
}
I get output like this:
[123, Bhajji, Maga, 39, I said Hey| I am "5|'10., I am an idiot do "you|get that]
What happened to the quote after you?
Edit:
The opencsv dependency:
<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>3.4</version>
</dependency>

From the source code of com.opencsv:opencsv:
/**
 * Constructs CSVReader.
 *
 * @param reader the reader to an underlying CSV source.
 * @param separator the delimiter to use for separating entries
 * @param quotechar the character to use for quoted elements
 * @param escape the character to use for escaping a separator or quote
 */
public CSVReader(Reader reader, char separator,
                 char quotechar, char escape) {
    this(reader, separator, quotechar, escape, DEFAULT_SKIP_LINES, CSVParser.DEFAULT_STRICT_QUOTES);
}
See http://sourceforge.net/p/opencsv/source/ci/master/tree/src/main/java/com/opencsv/CSVReader.java
There is a constructor with an additional escape parameter, which lets you escape separators and quotes (as per the Javadoc).

The CSV format specifies that a quote (") inside a field must be preceded by another quote ("). Doing that solved my problem:
123|Bhajji|Maga|39|"I said Hey|"" I am ""5|'10."|"I a do ""you""|get that"
Reference: https://www.ietf.org/rfc/rfc4180.txt
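To make the doubled-quote rule concrete, here is a minimal hand-written line parser. It is only a sketch of the convention, not how OpenCSV is implemented; the class name DoubledQuoteParser and its parseLine method are made up for illustration, and it handles a single line only.

```java
import java.util.ArrayList;
import java.util.List;

public class DoubledQuoteParser {
    /** Splits one line into fields; inside quotes, a doubled quote is one literal quote. */
    public static List<String> parseLine(String line, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        field.append(quote); // doubled quote: keep one, skip the other
                        i++;
                    } else {
                        inQuotes = false;    // closing quote, not part of the data
                    }
                } else {
                    field.append(c);         // separators inside quotes are plain data
                }
            } else if (c == quote) {
                inQuotes = true;             // opening quote, not part of the data
            } else if (c == separator) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "123|Bhajji|Maga|39|\"I said Hey|\"\" I am \"\"5|'10.\"|\"I a do \"\"you\"\"|get that\"";
        for (String f : parseLine(line, '|', '"')) {
            System.out.println("[" + f + "]");
        }
    }
}
```

With the corrected line, the fifth field comes out as I said Hey|" I am "5|'10. and the sixth as I a do "you"|get that, so both the inner quotes and the inner separators survive.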

Sorry but I don't have enough rep to add a comment so I will have to add an answer.
For your original question of what happened to the quote after the you the answer is the same as what happened to the quote before the I.
For CSV data the quote immediately before and after the separator is the start and end of the field data and is thus removed. That is why those two quotes are missing.

You need to escape the quotes that are part of the field. The default escape character is the backslash (\).
Taking a guess as to which quotes you want to escape, the string should look like this:
123|Bhajji|Maga|39|"I said \"Hey I am \"5'10. Do \"you\" get that?\""

How to get proper string array when parsing CSV?

Using jcsv I'm trying to parse a CSV to a specified type. When I parse it, it says length of the data param is 1. This is incorrect. I tried removing line breaks, but it still says 1. Am I just missing something in plain sight?
This is my input string (the csvString variable):
"Symbol","Last","Chg(%)","Vol",
INTC,23.90,1.06,28419200,
GE,26.83,0.19,22707700,
PFE,31.88,-0.03,17036200,
MRK,49.83,0.50,11565500,
T,35.41,0.37,11471300,
This is the Parser
public class BuySignalParser implements CSVEntryParser<BuySignal> {
    @Override
    public BuySignal parseEntry(String... data) {
        // console says "Length 1"
        System.out.println("Length " + data.length);
        if (data.length != 4) {
            throw new IllegalArgumentException("data is not a valid BuySignal record");
        }
        String symbol = data[0];
        double last = Double.parseDouble(data[1]);
        double change = Double.parseDouble(data[2]);
        double volume = Double.parseDouble(data[3]);
        return new BuySignal(symbol, last, change, volume);
    }
}
And this is where I use the parser (right from the example)
CSVReader<BuySignal> cReader = new CSVReaderBuilder<BuySignal>(new StringReader( csvString)).entryParser(new BuySignalParser()).build();
List<BuySignal> signals = cReader.readAll();

jcsv allows different delimiter characters. The default is a semicolon. Use CSVStrategy.UK_DEFAULT to use commas.
Also, each line has four commas, which usually indicates five values. You might want to remove the delimiters at the end of each line.
I don't know how to make jcsv ignore the first line.
I typically use CSVHelper to parse CSV files, and while jcsv seems pretty good, here is how you would do it with CSVHelper:
Reader reader = new InputStreamReader(new FileInputStream("persons.csv"), "UTF-8");
// bring in the first line with the headers if you want them
List<String> firstRow = CSVHelper.parseLine(reader);
List<String> dataRow = CSVHelper.parseLine(reader);
while (dataRow != null) {
    // ...put your code here to construct your objects from the strings
    dataRow = CSVHelper.parseLine(reader);
}

You shouldn't have commas at the end of lines. Generally there are cell delimiters (commas) and line delimiters (newlines). By placing commas at the end of the line it looks like the entire file is one long line.
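The field counts mentioned in these answers can be checked with plain String.split. This is only an illustration of the counting and says nothing about jcsv internals; the class TrailingDelimiterDemo and its countFields helper are made up for this sketch.

```java
import java.util.regex.Pattern;

public class TrailingDelimiterDemo {
    /** Counts the fields in a line; the -1 limit keeps trailing empty fields. */
    public static int countFields(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1).length;
    }

    public static void main(String[] args) {
        String line = "INTC,23.90,1.06,28419200,";
        // Splitting on a semicolon finds no delimiter, so the whole line is one field.
        System.out.println(countFields(line, ";")); // prints 1
        // Splitting on a comma yields five fields; the last one is empty.
        System.out.println(countFields(line, ",")); // prints 5
        // Without the trailing comma there are the expected four fields.
        System.out.println(countFields("INTC,23.90,1.06,28419200", ",")); // prints 4
    }
}
```

So with this data the parser's data.length != 4 check would only pass once the delimiter is a comma and the trailing commas are gone.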

Reading Java Properties file without escaping values

My application needs to use a .properties file for configuration.
In the properties file, users are allowed to specify paths.
Problem
Properties files need backslashes in values to be escaped, e.g.
dir = c:\\mydir
Needed
I need some way to accept a properties file where the values are not escaped, so that the users can specify:
dir = c:\mydir

Why not simply extend the Properties class so that it escapes backslashes (doubling each one) while loading? A nice feature of this is that the rest of your program can still use the original Properties class.
public class PropertiesEx extends Properties {
    public void load(FileInputStream fis) throws IOException {
        Scanner in = new Scanner(fis);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (in.hasNext()) {
            out.write(in.nextLine().replace("\\", "\\\\").getBytes());
            out.write("\n".getBytes());
        }
        InputStream is = new ByteArrayInputStream(out.toByteArray());
        super.load(is);
    }
}
Using the new class is as simple as:
PropertiesEx p = new PropertiesEx();
p.load(new FileInputStream("C:\\temp\\demo.properties"));
p.list(System.out);
The escaping code could also be improved upon, but the general principle is there.

Two options:
use the XML properties format instead
Write your own parser for a modified .properties format without escapes
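As a rough sketch of the first option: the XML properties format read by Properties.loadFromXML does not use backslash escapes, so a path can be written as is. The class name XmlPropertiesDemo and the sample document are made up for illustration, and the XML is held in a string only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class XmlPropertiesDemo {
    // Sample document; in the XML text itself the value is c:\mydir with a single backslash.
    public static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <entry key=\"dir\">c:\\mydir</entry>\n"
          + "</properties>\n";

    /** Loads properties from an XML document held in a string. */
    public static Properties loadXml(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(loadXml(SAMPLE).getProperty("dir"));
    }
}
```

The trade-off is that users then have to edit XML, where characters such as & and < need XML escaping instead.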

You can "preprocess" the file before loading the properties, for example:
public InputStream preprocessPropertiesFile(String myFile) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (Scanner in = new Scanner(new FileReader(myFile))) {
        while (in.hasNextLine()) {
            // double every backslash and restore the line break that nextLine() strips
            out.write(in.nextLine().replace("\\", "\\\\").getBytes());
            out.write("\n".getBytes());
        }
    }
    return new ByteArrayInputStream(out.toByteArray());
}
And your code could look like this:
Properties properties = new Properties();
properties.load(preprocessPropertiesFile("path/myfile.properties"));
Doing this, your .properties file can look the way you need, and the property values will still be ready to use.
*I know there should be better ways to manipulate files, but I hope this helps.

The right way would be to provide your users with a property file editor (or a plugin for their favorite text editor) which lets them enter the text as plain text and saves the file in the property file format.
If you don't want this, you are effectively defining a new format for the same (or a subset of the) content model as the property files have.
Go the whole way and actually specify your format, and then think about a way to either
transform the format to the canonical one, and then use this for loading the files, or
parse this format and populate a Properties object from it.
Both of these approaches will only work directly if you actually can control your property object's creation, otherwise you will have to store the transformed format with your application.
So, let's see how we can define this. The content model of normal property files is simple:
A map of string keys to string values, both allowing arbitrary Java strings.
The escaping which you want to avoid serves just to allow arbitrary Java strings, and not just a subset of these.
An often sufficient subset would be:
A map of string keys (not containing any whitespace, : or =) to string values (not containing any leading or trailing white space or line breaks).
In your example dir = c:\mydir, the key would be dir and the value c:\mydir.
If we want our keys and values to contain any Unicode character (other than the forbidden ones mentioned), we should use UTF-8 (or UTF-16) as the storage encoding - since we have no way to escape characters outside of the storage encoding. Otherwise, US-ASCII or ISO-8859-1 (as normal property files) or any other encoding supported by Java would be enough, but make sure to include this in your specification of the content model (and make sure to read it this way).
Since we restricted our content model so that all "dangerous" characters are out of the way, we can now define the file format simply as this:
<simplepropertyfile> ::= (<line> <line break> )*
<line> ::= <comment> | <empty> | <key-value>
<comment> ::= <space>* "#" < any text excluding line breaks >
<key-value> ::= <space>* <key> <space>* "=" <space>* <value> <space>*
<empty> ::= <space>*
<key> ::= < any text excluding ':', '=' and whitespace >
<value> ::= < any text starting and ending not with whitespace,
not including line breaks >
<space> ::= < any whitespace, but not a line break >
<line break> ::= < one of "\n", "\r", and "\r\n" >
Every \ occurring in either key or value now is a real backslash, not anything which escapes something else.
Thus, for transforming it into the original format, we simply need to double it, like Grekz proposed, for example in a filtering reader:
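import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

public class DoubleBackslashFilter extends FilterReader {
    private boolean bufferedBackslash = false;

    public DoubleBackslashFilter(Reader org) {
        super(org);
    }

    @Override
    public int read() throws IOException {
        if (bufferedBackslash) {
            bufferedBackslash = false;
            return '\\';
        }
        int c = super.read();
        if (c == '\\')
            bufferedBackslash = true;
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int read = 0;
        if (bufferedBackslash && len > 0) {
            buf[off] = '\\';
            read++;
            off++;
            len--;
            bufferedBackslash = false;
        }
        if (len > 1) {
            // read at most half of the free space, so that every char could be doubled
            int step = super.read(buf, off, len / 2);
            if (step < 0) {
                // end of stream, unless a pending backslash was just delivered
                return read > 0 ? read : -1;
            }
            for (int i = 0; i < step; i++) {
                if (buf[off + i] == '\\') {
                    // shift everything from here on one char to the right.
                    System.arraycopy(buf, off + i, buf, off + i + 1, step - i);
                    // adjust parameters
                    step++;
                    i++;
                }
            }
            read += step;
        } else if (len == 1) {
            // only one free slot: fall back to the single-char read
            int c = read();
            if (c < 0) {
                return read > 0 ? read : -1;
            }
            buf[off] = (char) c;
            read++;
        }
        return read;
    }
}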
Then we would pass this Reader to our Properties object (or save the contents to a new file).
Instead, we could simply parse this format ourselves.
public Properties parse(Reader in) throws IOException {
    BufferedReader r = new BufferedReader(in);
    Properties prop = new Properties();
    Pattern keyValPattern = Pattern.compile("\\s*=\\s*");
    String line;
    while ((line = r.readLine()) != null) {
        line = line.trim(); // remove leading and trailing space
        if (line.equals("") || line.startsWith("#")) {
            continue; // ignore empty and comment lines
        }
        String[] kv = keyValPattern.split(line, 2);
        // the pattern also grabs space around the separator.
        if (kv.length < 2) {
            // no key-value separator. TODO: Throw exception or simply ignore this line?
            continue;
        }
        prop.setProperty(kv[0], kv[1]);
    }
    r.close();
    return prop;
}
Again, using Properties.store() after this, we can export it in the original format.

Based on @Ian Harrigan's answer, here is a complete solution for reading and writing Netbeans properties files (and other properties files that need escaping) as ASCII text files:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
/**
 * This class handles Netbeans properties files.
 * It is based on the work of : http://stackoverflow.com/questions/6233532/reading-java-properties-file-without-escaping-values.
 * It overrides both load methods in order to load a Netbeans property file, taking into account the \ that
 * would otherwise be treated as escapes by the original java properties load methods.
 * @author stephane
 */
public class NetbeansProperties extends Properties {
    @Override
    public synchronized void load(Reader reader) throws IOException {
        BufferedReader bfr = new BufferedReader( reader );
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String readLine = null;
        while( (readLine = bfr.readLine()) != null ) {
            out.write(readLine.replace("\\","\\\\").getBytes());
            out.write("\n".getBytes());
        }//while
        InputStream is = new ByteArrayInputStream(out.toByteArray());
        super.load(is);
    }//met

    @Override
    public void load(InputStream is) throws IOException {
        load( new InputStreamReader( is ) );
    }//met

    @Override
    public void store(Writer writer, String comments) throws IOException {
        PrintWriter out = new PrintWriter( writer );
        if( comments != null ) {
            out.print( '#' );
            out.println( comments );
        }//if
        List<String> listOrderedKey = new ArrayList<String>();
        listOrderedKey.addAll( this.stringPropertyNames() );
        Collections.sort(listOrderedKey );
        for( String key : listOrderedKey ) {
            String newValue = this.getProperty(key);
            out.println( key+"="+newValue );
        }//for
        out.flush(); // PrintWriter buffers, so push everything to the underlying writer
    }//met

    @Override
    public void store(OutputStream out, String comments) throws IOException {
        store( new OutputStreamWriter(out), comments );
    }//met
}//class

You could try using guava's Splitter: split on '=' and build a map from resulting Iterable.
The disadvantage of this solution is that it does not support comments.

@pdeva: one more solution
//Reads entire file in a String
//available in java1.5
Scanner scan = new Scanner(new File("C:/workspace/Test/src/myfile.properties"));
scan.useDelimiter("\\Z");
String content = scan.next();
//Use apache StringEscapeUtils.escapeJava() method to escape java characters
ByteArrayInputStream bi=new ByteArrayInputStream(StringEscapeUtils.escapeJava(content).getBytes());
//load properties file
Properties properties = new Properties();
properties.load(bi);

It's not an exact answer to your question, but a different solution that may be appropriate to your needs. In Java, you can use / as a path separator and it will work on Windows, Linux, and OS X alike. This is especially useful for relative paths.
In your example, you could use:
dir = c:/mydir
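A quick way to see the difference is to load a few one-line property files from strings. This is a minimal sketch; the class PathSeparatorDemo and its loadDir helper are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PathSeparatorDemo {
    /** Loads the given properties text and returns the value of the "dir" key. */
    public static String loadDir(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("dir");
    }

    public static void main(String[] args) {
        // File text: dir = c:/mydir (a forward slash needs no escaping)
        System.out.println(loadDir("dir = c:/mydir"));      // prints c:/mydir
        // File text: dir = c:\mydir (the single backslash is silently dropped)
        System.out.println(loadDir("dir = c:\\mydir"));     // prints c:mydir
        // File text: dir = c:\\mydir (the escaped form gives one backslash)
        System.out.println(loadDir("dir = c:\\\\mydir"));   // prints c:\mydir
    }
}
```

The middle case is the one that bites users who type Windows paths directly, which is why the forward-slash form is a convenient way around the escaping problem.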
