I have two very large CSV files that will only continue to get larger with time. The documents I'm using to test are 170 columns wide and roughly 57,000 rows. This is using data from 2018 to now; ideally the end result will be able to run on CSVs with data going as far back as 2008, which will make the files massive.
Currently I'm using Univocity, but the creator has been inactive on answering questions for quite some time and their website has been down for weeks, so I'm open to changing parsers if need be.
Right now I have the following code:
public void test() throws IOException {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    parserSettings.setHeaderExtractionEnabled(false);

    CsvParser sourceParser = new CsvParser(parserSettings);
    sourceParser.beginParsing(sourceFile);

    Writer writer = new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8);
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    CsvWriter csvWriter = new CsvWriter(writer, writerSettings);
    csvWriter.writeRow(headers);

    String[] sourceRow;
    String[] compareRow;
    while ((sourceRow = sourceParser.parseNext()) != null) {
        CsvParser compareParser = new CsvParser(parserSettings);
        compareParser.beginParsing(Path.of("src/test/resources/" + compareCsv + ".csv").toFile());
        while ((compareRow = compareParser.parseNext()) != null) {
            if (Arrays.equals(sourceRow, compareRow)) {
                break;
            } else {
                if (compareRow[KEY_A].trim().equals(sourceRow[KEY_A].trim()) &&
                        compareRow[KEY_B].trim().equals(sourceRow[KEY_B].trim()) &&
                        compareRow[KEY_C].trim().equals(sourceRow[KEY_C].trim())) {
                    for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
                        csvWriter.writeRow(result);
                    }
                    break;
                }
            }
        }
        compareParser.stopParsing();
    }
}
This all works exactly as I need it to, but as you can tell it takes forever. I'm stopping and restarting the parsing of the compare file because order is not guaranteed in these files, so what is in row 1 of the source CSV could be in row 52,000 of the compare CSV.
The Question:
How do I get this faster? Here are my requirements:
Print row under following conditions:
KEY_A, KEY_B, KEY_C are equal but any other column is not equal
Source row is not found in compare CSV
Compare row is not found in source CSV
Presently I only have the first requirement working, but I need to tackle the speed issue first and foremost. Also, if I try to parse the file into memory I immediately run out of heap space and the application laughs at me.
Thanks in advance.
Also, if I try to parse the file into memory I immediately run out of heap space
Have you tried increasing the heap size? You don't say how large your data file is, but 57000 rows * 170 columns * 100 bytes per cell = 1 GB, which should pose no difficulty on modern hardware. Then, you can keep the comparison file in a HashMap for efficient lookup by key.
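For example, here is a minimal sketch of that HashMap idea using the same univocity classes as your code. It reuses your parserSettings, csvWriter, KEY_A/KEY_B/KEY_C indexes and getOnlyDifferentValues helper, and assumes a sourceFile/compareFile pair, so treat it as a sketch rather than drop-in code. As a bonus, the leftover entries cover your third requirement:
// Build an in-memory index of the compare file keyed by the three key columns,
// then stream the source file and look each row up in O(1).
Map<String, String[]> compareByKey = new HashMap<>();
CsvParser compareParser = new CsvParser(parserSettings);
compareParser.beginParsing(compareFile);
String[] row;
while ((row = compareParser.parseNext()) != null) {
    String key = row[KEY_A].trim() + '\u0000' + row[KEY_B].trim() + '\u0000' + row[KEY_C].trim();
    compareByKey.put(key, row);
}

CsvParser sourceParser = new CsvParser(parserSettings);
sourceParser.beginParsing(sourceFile);
String[] sourceRow;
while ((sourceRow = sourceParser.parseNext()) != null) {
    String key = sourceRow[KEY_A].trim() + '\u0000' + sourceRow[KEY_B].trim() + '\u0000' + sourceRow[KEY_C].trim();
    String[] compareRow = compareByKey.remove(key);
    if (compareRow == null) {
        csvWriter.writeRow(sourceRow);                   // source row not found in compare CSV
    } else if (!Arrays.equals(sourceRow, compareRow)) {
        for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
            csvWriter.writeRow(result);                  // keys equal, other columns differ
        }
    }
}
// Whatever remains in the map was never matched: compare rows not found in the source CSV.
for (String[] leftover : compareByKey.values()) {
    csvWriter.writeRow(leftover);
}
Joining the key parts with a separator that cannot appear in the data avoids accidental collisions between adjacent columns.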
Alternatively, you could import the CSVs into a database and make use of its join algorithms.
Or if you'd rather reinvent the wheel while scrupulously avoiding memory use, you could first sort the CSVs (by partitioning them into sets small enough to sort in memory, and then doing a k-way merge to merge the sublists), and then do a merge join. But the other solutions are likely to be a lot easier to implement :-)
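If you do go that route, the merge join itself is straightforward once both files are sorted by the composite key. A rough sketch, where sortedSourcePath, sortedComparePath, keyOf and report are all hypothetical placeholders and the key is assumed unique within each file:
// Merge-join step only; the external sort producing the two sorted files is not shown.
void mergeJoin(Path sortedSourcePath, Path sortedComparePath) throws IOException {
    try (BufferedReader source = Files.newBufferedReader(sortedSourcePath);
         BufferedReader compare = Files.newBufferedReader(sortedComparePath)) {
        String s = source.readLine();
        String c = compare.readLine();
        while (s != null && c != null) {
            int cmp = keyOf(s).compareTo(keyOf(c));
            if (cmp < 0) {
                report(s);                 // source row not found in compare CSV
                s = source.readLine();
            } else if (cmp > 0) {
                report(c);                 // compare row not found in source CSV
                c = compare.readLine();
            } else {
                if (!s.equals(c)) {
                    report(s);             // keys equal, some other column differs
                }
                s = source.readLine();
                c = compare.readLine();
            }
        }
        for (; s != null; s = source.readLine()) report(s);   // drain leftover source rows
        for (; c != null; c = compare.readLine()) report(c);  // drain leftover compare rows
    }
}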
I have a scenario where we get a code from one machine that needs to be sent to another machine, but converted to a string that the other machine understands. The scenarios are as follows:
if code is 'AGO PRF' then convert to 'AGO.P'
if code is 'HUSQ A' then convert to 'HUSQ.A'
if code is 'AIK B' then convert to 'AIK.B'
if code is 'ALUS WS' then convert to 'ALUS.WS'
if code is 'VST WSA' then convert to 'VST.WSA'
if code is 'SAB CL' then convert to 'SAB.CL'
if code is 'SPR WSB' then convert to 'NSPR.WSB'
if code is 'AXS PRD CL' then change it to 'AXS.PCL'
if code is 'CTEST RT' then convert to 'CTEST.R'
if code is 'ALUS U' then convert to 'ALUS.U'
if code is 'SFUN WI' then convert to 'SFUN.WI'
if code is 'RQI RT WI' then convert to 'RQI.RTWI'
if code is 'ECA WS WI' then change it to 'ECA.WSWI'.
I used a Map to feed in these values as keys and give out the output, but I want to know if there can be a more generic solution to this.
If there is neither a rule nor a regularity to the String replacement (I can find none), then you need either a mapping table stored in the DB or a static Map<String, String> of these constants:
I recommend the Map if the number of these is small and they won't change often.
I recommend reading from the DB if the number is larger. This also allows you to change the mapping at runtime, with no need to rebuild and redeploy the entire application.
In terms of the data structure, a dictionary-based one is the way to go: Map<String, String>. It doesn't allow you to store duplicate keys and is simple to use for the transformation:
List<String> listOfStringsToBeReplaced = loadFromSomewhere();
Map<String, String> map = loadFromDb();

List<String> listWithReplacedStrings = listOfStringsToBeReplaced.stream()
        .map(string -> map.getOrDefault(string, string))
        .collect(Collectors.toList());
I use Map::getOrDefault to either replace the value or keep it as is if no mapping is found.
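For completeness, a hard-coded version of that map can be built straight from the rules listed in your question (CODE_MAP is just an illustrative name; load the same pairs from a DB table instead if you expect them to change):
// Constant mapping built from the rules in the question.
private static final Map<String, String> CODE_MAP;
static {
    Map<String, String> m = new HashMap<>();
    m.put("AGO PRF", "AGO.P");
    m.put("HUSQ A", "HUSQ.A");
    m.put("AIK B", "AIK.B");
    m.put("ALUS WS", "ALUS.WS");
    m.put("VST WSA", "VST.WSA");
    m.put("SAB CL", "SAB.CL");
    m.put("SPR WSB", "NSPR.WSB");
    m.put("AXS PRD CL", "AXS.PCL");
    m.put("CTEST RT", "CTEST.R");
    m.put("ALUS U", "ALUS.U");
    m.put("SFUN WI", "SFUN.WI");
    m.put("RQI RT WI", "RQI.RTWI");
    m.put("ECA WS WI", "ECA.WSWI");
    CODE_MAP = Collections.unmodifiableMap(m);
}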
I am trying to use the new Java 8 Streams API (for which I am a complete newbie) to parse for a particular row (the one with 'Neda' in the name column) in a CSV file. Using the following article for motivation, I modified and fixed some errors so that I could parse the file containing 3 columns - 'name', 'age' and 'height'.
name,age,height
Marianne,12,61
Julie,13,73
Neda,14,66
Julia,15,62
Maryam,18,70
The parsing code is as follows:
@Override
public void init() throws Exception {
    Map<String, String> params = getParameters().getNamed();
    if (params.containsKey("csvfile")) {
        Path path = Paths.get(params.get("csvfile"));
        if (Files.exists(path)) {
            // use the new java 8 streams api to read the CSV column headings
            Stream<String> lines = Files.lines(path);
            List<String> columns = lines
                    .findFirst()
                    .map((line) -> Arrays.asList(line.split(",")))
                    .get();
            columns.forEach((l) -> System.out.println(l));

            // find the relevant sections from the CSV file
            // we are only interested in the row with Neda's name
            int nameIndex = columns.indexOf("name");
            int ageIndex = columns.indexOf("age");
            int heightIndex = columns.indexOf("height");

            // we need to know the index positions of the columns;
            // have to re-read the csv file to extract the values
            lines = Files.lines(path);
            List<List<String>> values = lines
                    .skip(1)
                    .map((line) -> Arrays.asList(line.split(",")))
                    .collect(Collectors.toList());

            values.forEach((l) -> System.out.println(l));
        }
    }
}
Is there any way to avoid re-reading the file following the extraction of the header line? Although this is a very small example file, I will be applying this logic to a large CSV file.
Is there technique to use the streams API to create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?
How can I return just one row in the form of List<String> (instead of List<List<String>> containing all the rows). I would prefer to just find the row as a mapping between the column names and their corresponding values. (a bit like a result set in JDBC). I see a Collectors.mapMerger function that might be helpful here, but I have no idea how to use it.
Use a BufferedReader explicitly:
List<String> columns;
List<List<String>> values;
try (BufferedReader br = Files.newBufferedReader(path)) {
    String firstLine = br.readLine();
    if (firstLine == null) throw new IOException("empty file");
    columns = Arrays.asList(firstLine.split(","));
    values = br.lines()
            .map(line -> Arrays.asList(line.split(",")))
            .collect(Collectors.toList());
}
Files.lines(…) also resorts to BufferedReader.lines(…). The only difference is that Files.lines will configure the stream so that closing the stream will close the reader, which we don’t need here, as the explicit try(…) statement already ensures the closing of the BufferedReader.
Note that there is no guarantee about the state of the reader after the stream returned by lines() has been processed, but we can safely read lines before performing the stream operation.
First, your concern that this code is reading the file twice is unfounded. Files.lines actually returns a Stream of the lines that is lazily populated. So the first part of the code only reads the first line, and the second part of the code reads the rest (it does read the first line a second time though, even if ignored). Quoting its documentation:
Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed.
Onto your second concern about returning just a single row. In functional programming, what you are trying to do is called filtering. The Stream API provides such a method with the help of Stream.filter. This method takes a Predicate as argument, which is a function that returns true for all the items that should be kept, and false otherwise.
In this case, we want a Predicate that would return true when the name is equal to "Neda". This could be written as the lambda expression s -> s.equals("Neda").
So in the second part of your code, you could have:
lines = Files.lines(path);
List<List<String>> values = lines
        .skip(1)
        .map(line -> Arrays.asList(line.split(",")))
        .filter(list -> list.get(0).equals("Neda")) // keep only items where the name is "Neda"
        .collect(Collectors.toList());
Note however that this does not ensure that there is only a single item where the name is "Neda", it collects all possible items into a List<List<String>>. You could add some logic to find the first item or throw an exception if no items are found, depending on your business requirement.
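For example, to stop at the first match and fail fast when there is none, a small variation of the same pipeline could be used (just a sketch, following the same split-based parsing as above):
List<String> nedaRow = Files.lines(path)
        .skip(1)
        .map(line -> Arrays.asList(line.split(",")))
        .filter(list -> list.get(0).equals("Neda"))
        .findFirst()                                   // stop at the first matching row
        .orElseThrow(() -> new IllegalStateException("no row found for name 'Neda'"));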
Note also that calling Files.lines(path) twice can be avoided by using a BufferedReader directly, as in Holger's answer.
Using a CSV-processing library
Other Answers are good. But I recommend using a CSV-processing library to read your input files. As others noted, the CSV format is not as simple as it may seem. To begin with, the values may or may not be nested in quote-marks. And there are many variations of CSV, such as those used in Postgres, MySQL, Mongo, Microsoft Excel, and so on.
The Java ecosystem offers several such libraries. I use Apache Commons CSV.
The Apache Commons CSV library does not make use of streams. But you have no need for streams here if a library is doing the scut work. The library makes easy work of looping over the rows from the file, without loading a large file into memory.
create a map between the extracted column names (in the first scan of the file) to the values in the remaining rows?
Apache Commons CSV does this automatically when you call withHeader.
return just one row in the form of List
Yes, easy to do.
As you requested, we can fill List with each of the 3 field values for one particular row. This List acts as a tuple.
List < String > tuple = List.of(); // Our goal is to fill this list of values from a single row. Initialize to an empty nonmodifiable list.
We specify the format we expect of our input file: standard CSV (RFC 4180), with the first row populated by column names.
CSVFormat format = CSVFormat.RFC4180.withHeader() ;
We specify the file path where to find our input file.
Path path = Path.of("/Users/basilbourque/people.csv");
We use try-with-resources syntax (see Tutorial) to automatically close our parser.
As we read in each row, we check for the name being Neda. If found, we fill our tuple List with that row's field values and interrupt the looping. We use List.of to conveniently return a List object of some unknown concrete class that is unmodifiable, meaning you cannot add nor remove elements from the list.
try (
        CSVParser parser = CSVParser.parse( path , StandardCharsets.UTF_8 , format ) ;
)
{
    for ( CSVRecord record : parser )
    {
        if ( record.get( "name" ).equals( "Neda" ) )
        {
            tuple = List.of( record.get( "name" ) , record.get( "age" ) , record.get( "height" ) );
            break;
        }
    }
}
catch ( FileNotFoundException e )
{
    e.printStackTrace();
}
catch ( IOException e )
{
    e.printStackTrace();
}
If we found success, we should see some items in our List.
if ( tuple.isEmpty() )
{
    System.out.println( "Bummer. Failed to report a row for `Neda` name." );
}
else
{
    System.out.println( "Success. Found this row for name of `Neda`:" );
    System.out.println( tuple.toString() );
}
When run.
Success. Found this row for name of Neda:
[Neda, 14, 66]
Instead of using a List as a tuple, I suggest you define a Person class to represent this data with proper data types. Our code here would return a Person instance rather than a List<String>.
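A minimal sketch of such a class (the numeric field types are an assumption; pick whatever suits your data):
// Simple immutable holder for one row of the people CSV.
public class Person {
    private final String name;
    private final int age;
    private final int height;

    public Person ( String name , int age , int height ) {
        this.name = name;
        this.age = age;
        this.height = height;
    }

    public String getName () { return this.name; }
    public int getAge () { return this.age; }
    public int getHeight () { return this.height; }

    @Override
    public String toString () {
        return "Person{ name=" + this.name + " | age=" + this.age + " | height=" + this.height + " }";
    }
}
Construction from a record would then be new Person( record.get( "name" ) , Integer.parseInt( record.get( "age" ) ) , Integer.parseInt( record.get( "height" ) ) ).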
I know I'm responding so late, but maybe it will help someone in the future
I've made a CSV parser/writer that is easy to use thanks to its builder pattern.
For your case: you can filter the lines you want to parse using
csvLineFilter(Predicate<String>)
Hope you find it handy, here is the source code
https://github.com/i7paradise/CsvUtils-Java8/
I've included a main class Demo.java to show how it works.
I'm getting the error java.lang.OutOfMemoryError: GC overhead limit exceeded when I try to execute the pipeline if the GATE Document I use is slightly large.
The code works fine if the GATE Document is small.
My Java code is something like this:
TestGate Class:
public void gateProcessor(Section section) throws Exception {
    Gate.init();
    Gate.getCreoleRegister().registerDirectories(....
    SerialAnalyserController pipeline .......
    pipeline.add(All the language analyzers)
    pipeline.add(My Jape File)
    Corpus corpus = Factory.newCorpus("Gate Corpus");
    Document doc = Factory.newDocument(section.getContent());
    corpus.add(doc);
    pipeline.setCorpus(corpus);
    pipeline.execute();
}
The Main Class Contains:
StringBuilder body = new StringBuilder();
int character;
FileInputStream file = new FileInputStream(
        new File(
                "filepath\\out.rtf")); // The Document in question
while (true) {
    character = file.read();
    if (character == -1) break;
    body.append((char) character);
}

Section section = new Section(body.toString()); // Creating object of type Section with content field = body.toString()
TestGate testgate = new TestGate();
testgate.gateProcessor(section);
Interestingly, this fails in the GATE Developer tool as well; the tool basically gets stuck if the document exceeds a specific limit, say more than 1 page.
This proves that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE Document?
You need to call
corpus.clear();
Factory.deleteResource(doc);
after each document, otherwise you'll eventually get an OutOfMemoryError on docs of any size if you run it enough times (although, by the way you initialize GATE in the method, it seems like you really only need to process a single document once).
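In your gateProcessor, that would look roughly like this sketch based on the code in the question:
// Tail of the question's gateProcessor with the cleanup added after execution.
corpus.add(doc);
pipeline.setCorpus(corpus);
try {
    pipeline.execute();
} finally {
    corpus.clear();              // detach the document from the corpus
    Factory.deleteResource(doc); // release the document and all of its annotations
}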
Besides that, annotations and features usually take lots of memory. If you have an annotation-intensive pipeline, i.e. you generate lots of annotations with lots of features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially; for instance, a JAPE or Groovy PR that generates n to the power of W annotations, where W is the number of words in your doc. Or if you have a feature for each possible word combination in your doc, that would generate factorial of W strings.
It creates the pipeline object every time, which is why it takes huge amounts of memory. That's why you should clean up after every use of ANNIE:
pipeline.cleanup();
pipeline=null;
I have a HashMap with a large number of entries which is serialized. If I make a small change to the HashMap, is it required that I overwrite the old file completely, or is there an alternative?
public class HashMapSerial {
    public static void main(String[] args) throws IOException {
        HashMap<String, Integer> hash = new HashMap<String, Integer>(100000);
        hash.put("hello", 1);
        hash.put("world", 2);
        // + (100000 - 2) more entries

        ObjectOutputStream s = new ObjectOutputStream(new FileOutputStream(new File("hash.out")));
        s.writeObject(hash); // write the hash map to file

        hash.put("hello", 10);
        s.writeObject(hash); // rewrite the whole hashmap again
    }
}
Since the change is only for the string "hello" and for no other element, is it possible to update the serialized file only for that entry instead of rewriting the whole HashMap once again?
Use a DB, or with simple file I/O keep track of up to where you have written previously.
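One way to read that suggestion: instead of re-serializing the whole map, append only the changed entries to a small update log and replay it when loading. A rough sketch; the hash.updates file name is illustrative only:
// On change: append the key/value pair instead of rewriting the whole map.
try (DataOutputStream log = new DataOutputStream(
        new FileOutputStream("hash.updates", true))) { // true = append mode
    log.writeUTF("hello"); // the key that changed
    log.writeInt(10);      // its new value
}

// On load: deserialize the original map once, then replay the update log over it.
try (DataInputStream log = new DataInputStream(new FileInputStream("hash.updates"))) {
    while (true) {
        hash.put(log.readUTF(), log.readInt());
    }
} catch (EOFException endOfLog) {
    // reached the end of the update log
}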
AFAIK, you can't do incremental saves with plain Java serialization.
You should instead use another system to store your data (such as a database).
Maybe it's overkill, but a NoSQL DB (Cassandra, for instance) would be simpler than trying to create your own system.