I am trying to export data from Mongo to Oracle. I used the below approach.
Step 1: Export the data to a CSV file using the mongoexport command.
Step 2: Read the data through Java code and do the necessary data transformation.
Step 3: Insert the data into Oracle.
The issue is that when any of the comment fields contains a newline character ('\n'), the data moves to the next line and the Java read fails to process the document.
There is an open bug with 10gen for this in JIRA. Has anyone faced this issue? Is there a workaround for it?
As with many formatting nuances in CSV, there is no agreed "standard" for how to handle embedded newline characters in a CSV field.
A common implementation is RFC-4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files, which suggests:
6) Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.
For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
This is the format that mongoexport is currently using. If you use a CSV parser compliant with RFC-4180 (e.g. SuperCSV, as suggested by @evanchooly) it should handle the quoted newlines as expected.
If you need an alternative to the format used by mongoexport or need more flexibility in your output, you can always write your own export script.
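For reference, a minimal SuperCSV reading sketch (the file name here is just a placeholder for the mongoexport output) that copes with quoted newlines might look like this:
import java.io.FileReader;
import java.util.List;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class MongoExportReader {
    public static void main(String[] args) throws Exception {
        // "export.csv" is a placeholder for the mongoexport output file.
        try (CsvListReader reader = new CsvListReader(
                new FileReader("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            List<String> row;
            while ((row = reader.read()) != null) {
                // A comment field with an embedded '\n' inside quotes still arrives as one value here.
                System.out.println(row);
            }
        }
    }
}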
Are you trying to parse the CSV manually? If so, take a look at http://opencsv.sourceforge.net/ or http://supercsv.sourceforge.net/ and see if they help.
Related
My main problem is that I'm trying to read a CSV delimited by ; in Java and the problem comes when I try to read a field of the CSV that contains a ;. For example:
"I want you to do that;"
In this case the field is recognized as
"I want you to do that"
And it creates another field that is just an empty string.
I use a BufferedReader to read the CSV and the split method to separate the fields on the ;. I'm not allowed to use libraries like OpenCSV, so I want to find a solution with the approach I'm using.
Parse according to the quotation marks
If data that happens to contain the delimiter is wrapped in double quotes (QUOTATION MARK), then you should have no problem with parsing. Your parser should first look for pairs of double-quote characters, and then look for delimiters only outside of those pairs.
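As a rough illustration of that approach without any library (the class and method names below are my own, not from the question), a quote-toggle while scanning the characters is enough:
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {
    // Splits a line on the given delimiter, ignoring delimiters inside double quotes.
    static List<String> splitQuoted(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;       // toggle quoted state
                current.append(c);          // keep the quote characters in the raw field
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);       // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());     // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitQuoted("a;\"I want you to do that;\";c", ';'));
        // prints: [a, "I want you to do that;", c]
    }
}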
Rather than writing the parsing code yourself, I highly recommend using a CSV library. In the Java ecosystem, you have a wealth of good products to choose from. For example, I have made successful use of Apache Commons CSV.
See also the specification for CSV: RFC 4180.
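If a library is allowed, a minimal Apache Commons CSV sketch for a ;-delimited file could look like this (the exact way of setting the delimiter varies slightly between Commons CSV versions):
import java.io.Reader;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class SemicolonCsvExample {
    public static void main(String[] args) throws Exception {
        Reader in = new StringReader("one;\"I want you to do that;\";three\r\n");
        // DEFAULT already understands quoted fields; we only change the delimiter to ';'.
        CSVFormat format = CSVFormat.DEFAULT.withDelimiter(';');
        for (CSVRecord record : format.parse(in)) {
            System.out.println(record.get(1));   // prints: I want you to do that;
        }
    }
}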
I have a delimited file in HDFS which contains newline characters (\n) in the data itself, and thus when reading with sc.textFile() the records split incorrectly. Since the data is not quoted, I cannot use the multiLine option available in Spark.
As a resolution, I am planning to create a custom reader in pyspark so that if any record contains fewer delimiters than expected, the next record will be merged with it (a rough sketch of this merging logic follows the sample data below). I am new to Spark and am looking for suggestions on whether this is the correct approach and whether it is possible to implement.
Sample data:
ID,NAME,AGE,SEX,LOCATION
1,Abc,33,M,India
2,De
f,45,F,Australia
3,Ijk,21,F,Canada
Expected output-
ID,NAME,AGE,SEX,LOCATION
1,Abc,33,M,India
2,Def,45,F,Australia
3,Ijk,21,F,Canada
(note that "De" and "f", split across two lines in the sample data, become "Def" in the expected output)
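The merging logic I have in mind, sketched in plain Java only to illustrate the idea (the method name and the simple comma count are placeholders; in Spark this would run per partition):
import java.util.ArrayList;
import java.util.List;

public class MergeSplitRecords {
    // Re-joins records broken by stray newlines: any accumulated line with fewer
    // than the expected number of delimiters is merged with the following line.
    static List<String> mergeShortRecords(List<String> lines, int expectedDelimiters) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String line : lines) {
            pending.append(line);
            long delimiters = pending.chars().filter(c -> c == ',').count();
            if (delimiters >= expectedDelimiters) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
            // otherwise keep accumulating until enough delimiters are seen
        }
        if (pending.length() > 0) {
            merged.add(pending.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "ID,NAME,AGE,SEX,LOCATION",
                "1,Abc,33,M,India",
                "2,De", "f,45,F,Australia",
                "3,Ijk,21,F,Canada");
        mergeShortRecords(lines, 4).forEach(System.out::println);   // prints the expected output above
    }
}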
If I have an application sitting on a *nix box and I generate a csv that is then passed off to thin client users what precautions must I take if they are on Windows and want a proper csv?
What I mean is this flow:
UNIX:
1) Generate csv using System.getProperty("line.separator")
2) Pass to thin client
Windows/Unix:
1) Download file from browser
2a) Open in Excel (Windows)
2b) Open in some spreadsheet application
I am not looking for answers that say "use Library X"; there is a lack of fondness here for adding the technical risk of a library that would be used for only one project.
According to [RFC 4180 section 2][1]:
there is no formal specification in existence, which allows for a
wide variety of interpretations of CSV files. This section
documents the format that seems to be followed by most
implementations:
Each record is located on a separate line, delimited by a line
break (CRLF). For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
The last record in the file may or may not have an ending line
break. For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx
There maybe an optional header line appearing as the first line
of the file with the same format as normal record lines. This
header will contain names corresponding to the fields in the file
and should contain the same number of fields as the records in
the rest of the file (the presence or absence of the header line
should be indicated via the optional "header" parameter of this
MIME type). For example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
Therefore System.getProperty("line.separator") is incorrect and \r\n should be used instead.
[1]: https://www.rfc-editor.org/rfc/rfc4180#section-2
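For example, a minimal sketch that writes CRLF explicitly, regardless of the platform the code runs on (the output path is just a placeholder):
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CrlfCsvWriter {
    // CRLF per RFC 4180, independent of the platform's line.separator.
    private static final String CRLF = "\r\n";

    public static void main(String[] args) throws Exception {
        // "export.csv" is a placeholder output path.
        try (BufferedWriter writer = Files.newBufferedWriter(
                Paths.get("export.csv"), StandardCharsets.UTF_8)) {
            writer.write("field_name,field_name,field_name" + CRLF);
            writer.write("aaa,bbb,ccc" + CRLF);
            writer.write("zzz,yyy,xxx" + CRLF);
        }
    }
}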
If the task is to create a csv file out of some data where commas may be present, is there a way to do it without later confusing which comma is a delimiter and which comma is part of a value?
Obviously, we can use a different delimiter, replace all occurrences, or replace the original comma with something else, but for the purpose of this question let's say that modifying the original data is not an option and a comma is the only delimiter allowed.
How would you approach something like this? Would it be easier to create the xls instead? Can you recommend any java libraries that handle this well?
A true CSV reader should be able to handle this; the values should be in quotes, e.g.:
one,two,"a, b, c",four
...per item #6 in Section 2 of the RFC.
While there's no single CSV standard, the usual convention is to surround entries containing commas in double quotes (i.e. ").
Pre-empting the next question: what to do if your data contains a double quote? In that case it is usually replaced with a pair of double quotes (i.e. escaped by doubling).
While I hate to cite Wikipedia as a source, it does have a pretty good roundup of basic rules and examples for CSV formatting.
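For example, a tiny helper along those lines (the class and method names below are mine, not from any library):
public class CsvEscape {
    // Minimal RFC 4180-style escaping: quote the field if it contains a comma,
    // a quote, or a line break, and double any embedded quotes.
    static String escape(String value) {
        boolean needsQuoting = value.contains(",") || value.contains("\"")
                || value.contains("\r") || value.contains("\n");
        if (!needsQuoting) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escape("a, b, c"));      // "a, b, c"
        System.out.println(escape("say \"hi\""));   // "say ""hi"""
    }
}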
I would either use a different delimiter or use a library like Apache POI.
I think the best way is to use Apache POI: http://poi.apache.org/
You can easily create XLS documents without much hassle.
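For example, a minimal POI sketch (the sheet name and output path are just placeholders) that writes a couple of cells to an XLS file:
import java.io.FileOutputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

public class PoiXlsExample {
    public static void main(String[] args) throws Exception {
        HSSFWorkbook wb = new HSSFWorkbook();
        Sheet sheet = wb.createSheet("export");
        Row row = sheet.createRow(0);
        row.createCell(0).setCellValue("foo");
        row.createCell(1).setCellValue("bar, baz");   // commas are a non-issue in spreadsheet cells
        // "data.xls" is just a placeholder output path.
        try (FileOutputStream out = new FileOutputStream("data.xls")) {
            wb.write(out);
        }
    }
}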
However, if you really need CSV and not XLS, you can surround the value with quotes. This should also solve the problem.
Usually you work with , as the separator and " as the quote character. So your values would look like:
foo,"bar, baz",iik,aje
the task is to create a csv file
Actually an impossible task, since there is no such thing as "a CSV file". Different Microsoft products have used different (subtly different, I grant) formats and named them all "CSV". As most spreadsheets can read delimiter-separated value (DSV) files, you might be better off writing one of those.
I have a CSV file in the below format. I get an issue if either one of the below CSV lines is read by the program:
"D",abc"def,"","0429"292"0","11","IJ80","Feb10_1.txt-2","FILE RECORD","05/02/2010","04/03/2010","","1","-91","",""
"D","abc"def","","04292920","11","IJ80","Feb10_1.txt-2","FILE RECORD","05/02/2010","04/03/2010","","1","-91","",""
The below split command is used to ignore the commas inside the double quotes. I got it from an earlier post, which is referenced below the code.
String items[] = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", 15);
System.out.println("items.length: " + items.length);
Regarding Java Split Command Parsing Csv File
The items.length is printed as 14 instead of 15. The abc"def is not recognized as an individual field and is incorrectly stored together with "D" as
"D",abc"def in items[0]. I want it to be stored in the below way:
items[0] should be "D" and items[1] should be abc"def
The same issue happens when there is a value "abc"def". I want it to be stored as
items[0] should be "D" and items[1] should be "abc"def"
Also, this split command works perfectly if the double quotes are doubled inside the double quotes (e.g. the field value is D,"abc""def",1).
How can I resolve this issue?
I think you would be much better off writing a parser for the CSV files rather than trying to use a regular expression. Once you start dealing with CSV files with carriage returns within the lines, the regex will probably fall apart. It wouldn't take that much code to write a simple while loop that goes through all the characters and splits up the data. It would be a lot easier to deal with "non-standard"* CSV files such as yours with a parser rather than a regex.
*I say non-standard because there isn't really an official standard for CSV, and when you're dealing with CSV files from many different systems, you see lots of weird things, like the abc"def field as shown above.
opencsv is a great, simple and lightweight CSV parser for Java. It will easily handle your data.
If possible, changing your CSV format would make the solution very simple.
See the following for an overview of Delimiter Separated Values, a common format on Unix-based systems:
http://www.faqs.org/docs/artu/ch05s02.html#id2901882
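For reference, a minimal opencsv reading loop looks roughly like this (the package is com.opencsv in newer opencsv releases, and the file name is a placeholder):
import java.io.FileReader;
import au.com.bytecode.opencsv.CSVReader;   // com.opencsv.CSVReader in newer releases

public class OpencsvExample {
    public static void main(String[] args) throws Exception {
        // "input.csv" is a placeholder file name.
        CSVReader reader = new CSVReader(new FileReader("input.csv"));
        String[] fields;
        while ((fields = reader.readNext()) != null) {
            // Quoted commas are handled by the parser rather than by split().
            System.out.println(fields.length + " fields, first = " + fields[0]);
        }
        reader.close();
    }
}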
opencsv is a very simple and good API for CSV parsing. This can also be handled with Linux sed commands before processing the file in Java: if the file is not in a proper format, convert the delimiter (here ",") into a pipe or another unique delimiter, so that delimiters inside field values and column delimiters can be distinguished easily by opencsv. Use the power of Linux together with your Java code.