Parsing entire csv file vs parsing line by line in java

I have a fairly large CSV file, approximately 80K to 120K rows (depending on the day). I'm successfully running code that parses the entire CSV file into Java objects using the @CsvBindByName annotation. Sample code:
Reader reader = Files.newBufferedReader(Paths.get(file));
CsvToBean<MyCustomClass> csvToBean = new CsvToBeanBuilder<MyCustomClass>(reader)
    .withType(MyCustomClass.class)
    .withIgnoreLeadingWhiteSpace(true)
    .build();
List<MyCustomClass> myCustomClass = csvToBean.parse();
I want to change this code to parse the CSV file line by line instead of the entire file at once, but retain the neatness of mapping to a Java bean. Essentially something like this:
CSVReader csvReader = new CSVReader(Files.newBufferedReader(Paths.get(csvFileLoc)));
String[] headerRow = csvReader.readNext(); // save the headerRow
String[] nextLine = null;
while ((nextLine = csvReader.readNext()) != null) {
    MyCustomClass myCustomClass = new MyCustomClass(); // one bean per row
    myCustomClass.setField1(nextLine[0]);
    myCustomClass.setField2(nextLine[1]);
    //.... so on
}
But the above solution ties me to knowing the column position of each field. What I would like is to map the string array I get from the CSV based on the header row, similar to what opencsv does when parsing the entire CSV file. However, I am not able to do that using opencsv, as far as I can tell. I had assumed this would be a pretty common practice, but I am unable to find any references to it online. It could be that I am not understanding the CsvToBean usage correctly. I could use csvToBean.iterator() to iterate over the beans, but I think the entire CSV file is loaded into memory by the build method, which defeats the purpose of reading line by line. Any suggestions are welcome.

Looking at the API docs further, I see that CsvToBean<T> implements Iterable<T> and has an iterator() method that returns an Iterator<T> that is documented as follows:
The iterator returned by this method takes one line of input at a time and returns one bean at a time.
So it looks like you could just write your loop as:
for (MyCustomClass myCustomClass : csvToBean) {
// . . . do something with the bean . . .
}
Just to clear up some potential confusion: you can see in the source code that the build() method of CsvToBeanBuilder just creates the CsvToBean object and doesn't read any input; it is the parse() method and the iterator of the CsvToBean object that actually read the input.
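To make the "one line in, one bean out" behaviour concrete without depending on opencsv, here is a minimal JDK-only sketch of the same idea: a lazy iterator that reads the header once and then maps each following line to an object by column name. The names LazyBeanReader and demo, the mapper function, and the naive comma split (no quoted fields) are all assumptions for illustration; this is not how opencsv is implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical helper: reads one CSV line at a time and maps it by header name. */
final class LazyBeanReader<T> implements Iterator<T> {
    private final BufferedReader reader;
    private final Function<Map<String, String>, T> mapper;
    private final String[] header;
    private String nextLine; // look-ahead of exactly one line

    LazyBeanReader(BufferedReader reader, Function<Map<String, String>, T> mapper) {
        this.reader = reader;
        this.mapper = mapper;
        String headerLine = readLine(); // only the header is read up front
        this.header = headerLine == null ? new String[0] : headerLine.split(",");
        this.nextLine = readLine();
    }

    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public T next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String[] cells = nextLine.split(",", -1); // naive split: assumes no quoted fields
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < header.length && i < cells.length; i++) {
            row.put(header[i].trim(), cells[i].trim());
        }
        nextLine = readLine(); // advance by one line only
        return mapper.apply(row);
    }

    /** Small demonstration on in-memory data. */
    static List<String> demo() {
        String csv = "name,age\nAnn,30\nBob,25\n";
        Iterator<String> beans = new LazyBeanReader<String>(
                new BufferedReader(new StringReader(csv)),
                row -> row.get("name") + ":" + row.get("age"));
        List<String> out = new ArrayList<>();
        while (beans.hasNext()) {
            out.add(beans.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

At any moment only the header and one look-ahead line are held in memory, which is the property the question is after.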

Related

Using I/O stream to parse CSV file

I have a CSV file of US population data for every county in the US. I need to get each population from the 8th column of the file. I'm using a FileReader and a BufferedReader, and I'm not sure how to use the split method to accomplish this. I know this isn't much information but I know that I'll be using my args[0] as the destination in my class.
I'm at a loss as to where to begin, to be honest.
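import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        // args[0] is the path to the CSV file
        try (BufferedReader buff = new BufferedReader(new FileReader(args[0]))) {
            String line;
            // TODO: read each line, split it, and add up the 8th column
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}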
The output should be an integer of the total US population. Any help with pointing me in the right direction would be great.

Don't reinvent the wheel and don't parse CSV yourself: use a library. Even a format as simple as CSV has nuances: fields can be quoted or unquoted, the file may or may not have a header, and so on. Besides that, you have to test and maintain the code you've written. So writing less code and reusing libraries is good.
There are plenty of CSV libraries for Java:
Apache Commons CSV
OpenCSV
Super CSV
Univocity
flatpack
IMHO, the first two are the most popular.
Here is an example for Apache Commons CSV:
final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);
for (final CSVRecord record : records) { // Simply iterate over the records with a for-each loop; all the parsing is handled for you
    String populationString = record.get(7); // Indexes are zero-based
    // Or, if your file has a header row and the format is configured to read it, look the column up by name:
    // String populationString = record.get("population");
    // … Do whatever you want with the population
}
Look how easy it is! And it will be similar with other parsers.
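If adding a dependency isn't an option and the file is known to contain no quoted or escaped fields, the total the question asks for can also be sketched with the JDK alone. The class name PopulationSum, the assumption that the first line is a header, and the sample data are hypothetical; the 8th column is index 7, as in the library example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class PopulationSum {
    /** Sums the values in the 8th column (index 7), skipping the first line as a header. */
    static long sumEighthColumn(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            return in.lines()
                    .skip(1)                                 // assumption: first line is a header
                    .filter(line -> !line.trim().isEmpty())  // ignore blank lines
                    .map(line -> line.split(",", -1))        // naive split: breaks on quoted commas
                    .mapToLong(cells -> Long.parseLong(cells[7].trim())) // fails on rows with fewer than 8 columns
                    .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Use the file given as args[0] if present, otherwise a tiny in-memory sample.
        Reader source = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("a,b,c,d,e,f,g,population\n1,2,3,4,5,6,7,100\n1,2,3,4,5,6,7,250\n");
        System.out.println(sumEighthColumn(source));
    }
}
```

A long is used for the total because the sum of county populations may not fit comfortably in an int once other columns or larger datasets are involved. For real-world files with quoted fields, the library approach remains the safer choice.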

Univocity - Detect missing column when parsing CSV

I'm using the Univocity library to parse CSV and it works perfectly, but I need a way to detect whether the file being parsed has fewer columns than required.
For example, if I'm expecting a 3-column file, with columns mapped to [H1,H2,H3], and I receive a file (which has no headers) that looks like
V1_H1,V1_H2
V2_H1,V2_H2
When using
record.getString("H3");
this returns null. Instead, I need the file either to fail parsing, or I need a way to check whether it is missing a column so that I can stop processing it.
Is there any way to achieve this?

So since my main issue here is to make sure that the headers count is the same as the number of columns provided in the CSV file, and since I'm using an iterator to iterate over records, I've added a check like:
CsvParser parser = new CsvParser(settings);
ResultIterator<Record, ParsingContext> iterator = parser.iterateRecords(inputStream).iterator();
if(iterator.getContext().parsedHeaders().length != settings.getHeaders().length){
throw new Exception("Invalid file");
}
It's working for me, not sure if there is a better way to do it.
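The same idea, comparing the number of columns actually present with the number expected before processing any records, can be sketched independently of the library. This JDK-only version uses hypothetical names (ColumnCountCheck, hasExpectedColumns) and a naive comma split, and it checks every row rather than only the first, so it illustrates the check itself rather than Univocity's API.

```java
import java.util.Arrays;
import java.util.List;

final class ColumnCountCheck {
    /** Returns true only if every non-blank line has exactly expectedHeaders.length columns. */
    static boolean hasExpectedColumns(List<String> lines, String[] expectedHeaders) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())
                // limit -1 keeps trailing empty cells, so "a,b," counts as 3 columns
                .allMatch(line -> line.split(",", -1).length == expectedHeaders.length);
    }

    public static void main(String[] args) {
        String[] headers = {"H1", "H2", "H3"};
        List<String> tooShort = Arrays.asList("V1_H1,V1_H2", "V2_H1,V2_H2");
        System.out.println(hasExpectedColumns(tooShort, headers));
    }
}
```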

I've looked at the Univocity documentation and found that there is a way to add annotations to the destination objects you are going to generate from the CSV input:
@Parsed
@Validate
public String notNulNotBlank; //This should fail if the field is null or blank
@Parsed
@Validate(nullable = true)
public String nullButNotBlank;
@Parsed
@Validate(allowBlanks = true)
public String notNullButBlank;
This will also help you to use the objects instead of having to work with fields.
Hope that helps :-)

Apache Beam - Reading JSON and Stream

I am writing Apache Beam code in which I have to read a JSON file that is placed in the project folder, read its data, and stream it.
This is the sample code to read JSON. Is this correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
Or should I use:
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the below json file. Read the complete testdata from this file and then Stream it.
{
  "testdata":{
    "siteOwner":"xxx",
    "siteInfo":{
      "siteID":"id_member",
      "siteplatform":"web",
      "siteType":"soap",
      "siteURL":"www"
    }
  }
}
The above code is not reading the json file, it is printing like
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?

This is the sample code to read JSON. Is this correct way of doing it?
To quickly answer your question, yes. Your sample code is the correct way to read a file containing JSON, where each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, then it will not be parseable.
The second code sample has the same effect.
The above code is not reading the json file, it is printing like
The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied. Elements in the pipeline are accessed by applying subsequent transforms, so the actual JSON string can be accessed inside the implementation of a transform.

Remove the dash present before the key value pair in a YAML file

As I understand it, the dash (-) before a key-value pair in a YAML file is needed to show that it is a separate block. Figure 1 shows the YAML I'm generating using the YamlBeans jar.
field1:
- childfield1:
datafield1:
param1:
childparam: paramvalue
param2:
childparam2: paramvalue
param3:
childparam3: paramvalue
datafield2: value2
Because my codebase can't be changed, I have to either create the YAML as shown in Figure 2 (with a tab added to each line of the YAML file) or remove the dash. You can clearly see that there are only two thin vertical lines in Figure 1 but three in Figure 2, which show the alignment of the blocks.
What I want to achieve is to remove that dash from the first block (at the child field) in the file. Using a YAML file reader and writer always introduces the dash.

Glancing quickly at (but admittedly not being familiar with) YamlBeans, it doesn't look like it's easy to subclass the behavior of the Emitter. One option, though, is to generate a temporary form in memory and then manipulate the results when writing them out to a file. For example:
// let YamlWriter write its contents to an in-memory buffer
StringWriter temp = new StringWriter();
YamlWriter yamlOut = new YamlWriter(temp);
yamlOut.write(someObject);
yamlOut.close(); // make sure everything has reached the buffer
// then dump the in-memory buffer out to a file, manipulating lines that
// start with a dash
try (PrintWriter out = new PrintWriter(new FileWriter(new File("someoutput.dat")));
     LineNumberReader in = new LineNumberReader(new StringReader(temp.toString()))) {
    String line;
    while ((line = in.readLine()) != null) {
        if (line.startsWith("-")) {
            line = line.substring(1);
        }
        out.println(line);
    }
} // try-with-resources closes (and flushes) the output file
My specifics may be off, but hopefully the approach of doing simple manipulations on a temporary copy is clear enough.
If I were doing this myself, I'd probably write a custom subclass of java.io.Writer and do the manipulation on the fly (but I haven't gone through YamlWriter/Emitter in enough detail to provide an example of how to do that).
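As a rough illustration of that last idea, here is a sketch of a java.io.Writer subclass that drops a leading dash while the text streams through it. The class name DashStrippingWriter is hypothetical, and it assumes the dash to remove is always the first non-whitespace character of a line. Unlike the substring(1) approach, it replaces the dash with a space so that the indentation of the rest of the block stays aligned. How well it plugs into YamlWriter has not been verified here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical wrapper: replaces a dash that starts a line (after indentation) with a space. */
final class DashStrippingWriter extends Writer {
    private final Writer out;
    private boolean atLineStart = true; // true until the first non-whitespace char of a line

    DashStrippingWriter(Writer out) {
        this.out = out;
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            char c = buf[i];
            if (c == '\n' || c == '\r') {
                atLineStart = true;               // next character begins a new line
            } else if (atLineStart && c != ' ' && c != '\t') {
                atLineStart = false;              // first real character of this line
                if (c == '-') {
                    c = ' ';                      // swap the dash for a space
                }
            }
            out.write(c);
        }
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    /** Convenience method for trying the writer on an in-memory string. */
    static String strip(String yaml) {
        StringWriter target = new StringWriter();
        try (Writer w = new DashStrippingWriter(target)) {
            w.write(yaml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toString();
    }

    public static void main(String[] args) {
        System.out.print(strip("field1:\n- childfield1:\n    datafield1: value\n"));
    }
}
```

Because the state is tracked per character, the writer works no matter how the caller chunks its writes. Note that it would also alter lines such as a "---" document marker or a value starting with a minus sign, so it only suits output where those cannot occur.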

Parsing a CSV file for a unique row using the new Java 8 Streams API

I am trying to use the new Java 8 Streams API (for which I am a complete newbie) to parse for a particular row (the one with 'Neda' in the name column) in a CSV file. Using the following article for motivation, I modified and fixed some errors so that I could parse the file containing 3 columns - 'name', 'age' and 'height'.
name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70
The parsing code is as follows:
@Override
public void init() throws Exception {
    Map<String, String> params = getParameters().getNamed();
    if (params.containsKey("csvfile")) {
        Path path = Paths.get(params.get("csvfile"));
        if (Files.exists(path)) {
            // use the new java 8 streams api to read the CSV column headings
            Stream<String> lines = Files.lines(path);
            List<String> columns = lines
                    .findFirst()
                    .map((line) -> Arrays.asList(line.split(",")))
                    .get();
            columns.forEach((l) -> System.out.println(l));
            // find the relevant sections from the CSV file
            // we are only interested in the row with Neda's name
            int nameIndex = columns.indexOf("name");
            int ageIndex = columns.indexOf("age");
            int heightIndex = columns.indexOf("height");
            // we need to know the index positions of the columns
            // have to re-read the csv file to extract the values
            lines = Files.lines(path);
            List<List<String>> values = lines
                    .skip(1)
                    .map((line) -> Arrays.asList(line.split(",")))
                    .collect(Collectors.toList());
            values.forEach((l) -> System.out.println(l));
        }
    }
}
Is there any way to avoid re-reading the file following the extraction of the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.
Is there a technique for using the streams API to create a map between the extracted column names (from the first scan of the file) and the values in the remaining rows?
How can I return just one row in the form of List<String> (instead of List<List<String>> containing all the rows). I would prefer to just find the row as a mapping between the column names and their corresponding values. (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.

Use a BufferedReader explicitly:
List<String> columns;
List<List<String>> values;
try(BufferedReader br=Files.newBufferedReader(path)) {
String firstLine=br.readLine();
if(firstLine==null) throw new IOException("empty file");
columns=Arrays.asList(firstLine.split(","));
values = br.lines()
.map(line -> Arrays.asList(line.split(",")))
.collect(Collectors.toList());
}
Files.lines(…) also resorts to BufferedReader.lines(…). The only difference is that Files.lines will configure the stream so that closing the stream will close the reader, which we don’t need here, as the explicit try(…) statement already ensures the closing of the BufferedReader.
Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
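Building on the single-pass BufferedReader approach, the remaining rows can be turned into column-name-to-value maps and the stream stopped at the first match. This is only a sketch: RowLookup, findRow, and toRow are hypothetical names, and the comma split assumes no quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class RowLookup {
    /** Returns the first row whose value in the given column equals wanted, keyed by column name. */
    static Optional<Map<String, String>> findRow(Reader source, String column, String wanted) {
        try (BufferedReader br = new BufferedReader(source)) {
            String firstLine = br.readLine();          // header is read exactly once
            if (firstLine == null) {
                return Optional.empty();               // empty input: nothing to find
            }
            String[] columns = firstLine.split(",");
            return br.lines()                          // lazily continues after the header
                    .map(line -> toRow(columns, line.split(",", -1)))
                    .filter(row -> wanted.equals(row.get(column)))
                    .findFirst();                      // short-circuits at the first match
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Pairs each column name with the cell at the same position, preserving column order. */
    private static Map<String, String> toRow(String[] columns, String[] cells) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < cells.length; i++) {
            row.put(columns[i], cells[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        String csv = "name,age,height\nMarianne,12,61\nJulie,13,73\nNeda,14,66\nJulia,15,62\n";
        System.out.println(findRow(new StringReader(csv), "name", "Neda"));
    }
}
```

Returning an Optional makes the "no such row" case explicit instead of relying on an empty list or a null.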

First, your concern that this code reads the file twice is unfounded. Files.lines returns a Stream of the lines that is lazily populated. So the first part of the code only reads the first line, and the second part reads the rest (it does read the first line a second time, though, even if that line is then skipped). Quoting its documentation:
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
Onto your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides such a method with the help of Stream.filter. This method takes a Predicate as argument, which is a function that returns true for all the items that should be kept, and false otherwise.
In this case, we want a Predicate that would return true when the name is equal to "Neda". This could be written as the lambda expression s -> s.equals("Neda").
So in the second part of your code, you could have:
lines = Files.lines(path);
List<List<String>> values = lines
.skip(1)
.map(line -> Arrays.asList(line.split(",")))
.filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
.collect(Collectors.toList());
Note, however, that this does not ensure that there is only a single item where the name is "Neda"; it collects all matching items into a List<List<String>>. You could add some logic to find the first item, or to throw an exception if no items are found, depending on your business requirement.
Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in @Holger's answer.

Using a CSV-processing library
Other Answers are good. But I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be nested in quote-marks. And there are many variations of CSV, such as those used in Postgres, MySQL, Mongo, Microsoft Excel, and so on.
The Java ecosystem offers several such libraries. I use Apache Commons CSV.
The Apache Commons CSV library does not make use of streams. But you have no need for streams if you use a library to do the scut work. The library makes easy work of looping over the rows of the file without loading a large file into memory.
create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?
Apache Commons CSV does this automatically when you call withHeader.
return just one row in the form of List
Yes, easy to do.
As you requested, we can fill List with each of the 3 field values for one particular row. This List acts as a tuple.
List < String > tuple = List.of(); // Our goal is to fill this list of values from a single row. Initialize to an empty nonmodifiable list.
We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.
CSVFormat format = CSVFormat.RFC4180.withHeader() ;
We specify the file path where to find our input file.
Path path = Path.of("/Users/basilbourque/people.csv");
We use try-with-resources syntax (see Tutorial) to automatically close our parser.
As we read in each row, we check for the name being Neda. If found, we fill our tuple List with that row's field values and interrupt the looping. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you cannot add or remove elements from the list.
try (
CSVParser parser =CSVParser.parse( path , StandardCharsets.UTF_8, format ) ;
)
{
for ( CSVRecord record : parser )
{
if ( record.get( "name" ).equals( "Neda" ) )
{
tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
break ;
}
}
}
catch ( FileNotFoundException e )
{
e.printStackTrace();
}
catch ( IOException e )
{
e.printStackTrace();
}
If we found success, we should see some items in our List.
if ( tuple.isEmpty() )
{
System.out.println( "Bummer. Failed to report a row for `Neda` name." );
} else
{
System.out.println( "Success. Found this row for name of `Neda`:" );
System.out.println( tuple.toString() );
}
When run.
Success. Found this row for name of `Neda`:
[Neda, 14, 66]
Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would then return a Person instance rather than a List<String>.
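A minimal sketch of such a class might look like the following. The field types (int for both age and height) and the fromRow factory are assumptions, since the sample data gives no units or constraints; with Apache Commons CSV the three strings would come from record.get( "name" ) and so on.

```java
import java.util.Objects;

/** Hypothetical value class for one row of the name/age/height file. */
final class Person {
    private final String name;
    private final int age;
    private final int height;

    Person(String name, int age, int height) {
        this.name = Objects.requireNonNull(name, "name");
        this.age = age;
        this.height = height;
    }

    /** Converts the three raw CSV strings into typed fields; throws NumberFormatException on bad numbers. */
    static Person fromRow(String name, String age, String height) {
        return new Person(name.trim(), Integer.parseInt(age.trim()), Integer.parseInt(height.trim()));
    }

    String name() {
        return name;
    }

    int age() {
        return age;
    }

    int height() {
        return height;
    }

    @Override
    public String toString() {
        return "Person[name=" + name + ", age=" + age + ", height=" + height + "]";
    }

    public static void main(String[] args) {
        System.out.println(fromRow("Neda", "14", "66"));
    }
}
```

On Java 16 or later, a record would express the same thing more compactly.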

I know I'm responding late, but maybe it will help someone in the future.
I've made a CSV parser/writer that is easy to use thanks to its builder pattern.
For your case, you can filter the lines you want to parse using
csvLineFilter(Predicate<String>)
Hope you find it handy. Here is the source code:
https://github.com/i7paradise/CsvUtils-Java8/
I've included a main class, Demo.java, to show how it works.
