Calculate statistics on 20+ million records in Java

I have a CSV file (600 MB) with 20 million rows.
I need to read all this data, create a list of Java objects out of it, and calculate some metrics on the objects' fields, such as average, median, max, total sum and other statistics. What is the best way of doing this in Java?
I tried a simple .forEach loop and it took a while (20 min) to iterate over everything.
UPDATE:
I use a BufferedReader to read the data and convert the CSV file into a List of objects of some Java class. That part is pretty fast.
The program is stuck for 20 minutes in the forEach loop, where I iterate over those 20 million objects and divide them into 3 lists, depending on the values in the current object.
So basically, I iterate over the whole list once, and I have an if/else condition where I check whether or not a certain field in the object equals "X", "Y" or "Z", and depending on the answer I separate those 20 million records into 3 lists.
Then, for those 3 lists, I need to calculate different statistics such as median, average, total sum, etc.
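For reference, the split-and-summarize step described in the update can be done in a single pass with the Streams API. A minimal sketch, assuming a row type with a category field (the one compared to "X"/"Y"/"Z") and a numeric value field (both names hypothetical):

import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StatsByCategory {

    // Hypothetical row type: 'category' is the field compared to "X"/"Y"/"Z",
    // 'value' is the numeric field the statistics are computed on.
    static class Row {
        final String category;
        final double value;
        Row(String category, double value) { this.category = category; this.value = value; }
    }

    // One pass over the list: group by category and accumulate
    // count/sum/min/max/average per group, without building three separate lists.
    static Map<String, DoubleSummaryStatistics> summarize(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.category,
                        Collectors.summarizingDouble(r -> r.value)));
    }
}

DoubleSummaryStatistics covers count, sum, min, max and average; the median still requires keeping (or sorting) the values of each group, or using a streaming estimator.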

Having worked extensively with data volumes far exceeding 600 MB, I can make two statements:
600 MB is not a large amount of data, in particular if we are talking about tabular data;
such volumes have nothing to do with Big Data and are easily processed in memory on conventional hardware, which is the fastest option.
What you should do, however, is make sure that you read the data into column-wise contiguous arrays and use methods operating directly on those contiguous arrays of column-wise data.
Because a CSV file is stored row-wise, you are much better off reading it en bloc into a byte array and parsing that into a column-wise, pre-allocated representation.
Reading a block of 600 MB into memory from an SSD should take a few seconds; parsing time will depend on your algorithm (but it is essential to be able to seek within that structure instantly). Memory-wise you will use roughly triple the 600 MB, but on a 16 GB machine that should be a no-brainer.
So do not rush to SQL or to slicing files, and do not instantiate every cell as a Java object. In this exceptional case you do not want a list of Java objects; you want double[] and the like. You can get by with ArrayLists if you preallocate exact sizes, but other standard collections will kill you.
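A minimal sketch of that column-wise idea, assuming the relevant numeric column can be parsed straight into a preallocated double[] (row count, column index and file name are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ColumnWiseStats {

    public static void main(String[] args) throws IOException {
        int expectedRows = 20_000_000;        // known row count, or an upper bound
        int valueColumn = 3;                  // hypothetical index of the numeric column
        double[] values = new double[expectedRows];
        int n = 0;

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {
            reader.readLine();                // assumes a header line; drop this if there is none
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cells = line.split(",");       // fine for simple CSV without quoted commas
                values[n++] = Double.parseDouble(cells[valueColumn]);
            }
        }

        double[] col = Arrays.copyOf(values, n);        // trim to the rows actually read
        double sum = 0, max = Double.NEGATIVE_INFINITY;
        for (double v : col) { sum += v; if (v > max) max = v; }
        Arrays.sort(col);                               // sorting once gives median/percentiles
        double median = col[col.length / 2];            // middle element; good enough for a sketch
        System.out.printf("avg=%.3f max=%.3f sum=%.3f median=%.3f%n",
                sum / col.length, max, sum, median);
    }
}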
Having said all that, I would sooner recommend Python with NumPy for this task than Java. Java is good with objects, and not as good with contiguous memory blocks and the corresponding operations. C++ would do as well, or even R.

I suggest not loading all 600 MB into RAM as Java objects.
As you stated, this literally takes ages.
What you could do instead:
Use SQL:
Load your data into a database, and run your search queries against that database.
Don't loop over all objects in RAM; that makes your application perform poorly.
SQL is optimized for handling large amounts of data and running queries over it.
Read more about Database Management in Java: JDBC Basics
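Once the data is in a table, the aggregation itself is a single query. A minimal JDBC sketch, assuming an SQLite file (with its JDBC driver on the classpath) and hypothetical table/column names; any embedded database with a JDBC driver works the same way:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlStats {
    public static void main(String[] args) throws Exception {
        // Assumes the CSV has already been imported into a table 'records'
        // with columns 'category' and 'value' (names are hypothetical).
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:records.db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, AVG(value), MAX(value), SUM(value), COUNT(*) " +
                     "FROM records GROUP BY category")) {
            while (rs.next()) {
                System.out.printf("%s avg=%.3f max=%.3f sum=%.3f count=%d%n",
                        rs.getString(1), rs.getDouble(2), rs.getDouble(3),
                        rs.getDouble(4), rs.getLong(5));
            }
        }
    }
}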

Sounds like your program is simply running out of memory as you are adding stuff to a list. If you get close to the memory limit allocated to the JVM, most of the time will be spent by the garbage collector trying to do what it can to prevent you from running out of memory.
You should use a fast CSV library such as univocity-parsers to iterate over each row and perform the calculations you need without storing everything in memory. Use it like this:
CsvParserSettings parserSettings = new CsvParserSettings(); // configure the parser
parserSettings.selectFields("column3", "column1", "column10"); // only read values from the columns you need
CsvParser parser = new CsvParser(parserSettings);

// use this if you just need plain strings
for (String[] row : parser.iterate(new File("/path/to/your.csv"))) {
    // do stuff with the row
}

// or use records to get values ready for calculation
for (Record record : parser.iterateRecords(new File("/path/to/your.csv"))) {
    int someValue = record.getInt("columnName");
    // perform calculations
}
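Building on the record loop above, the statistics can be accumulated while streaming, so only the running totals stay in memory. A sketch, assuming hypothetical column names and that java.util.HashMap and java.util.DoubleSummaryStatistics are imported:

// "category" and "value" are hypothetical column names.
Map<String, DoubleSummaryStatistics> statsByCategory = new HashMap<>();
for (Record record : parser.iterateRecords(new File("/path/to/your.csv"))) {
    String category = record.getString("category");
    double value = record.getDouble("value");
    statsByCategory
            .computeIfAbsent(category, k -> new DoubleSummaryStatistics())
            .accept(value);
}
// statsByCategory.get("X").getAverage(), .getMax(), .getSum(), .getCount(), etc.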
Only store the data in a huge list if for some reason you need to run through all rows more than once. In that case, allocate more memory to your program with something like -Xms8G -Xmx8G. Keep in mind you can't have an ArrayList with more than Integer.MAX_VALUE elements, so that's your next limit even if you have enough memory.
If you really need a list, you can use the parser like this:
List<Record> twentyMillionRecords = parser.parseAllRecords(new File("/path/to/your.csv"), 20_000_000);
Otherwise your best bet is to run the parser as many times as needed. The parser I suggested should take a few seconds to go through the file each time.
Hope this helps
Disclaimer: I'm the author of this library. It's open source and free (apache 2.0 license)

I bet the majority of the time was spent reading the data. Using a BufferedReader should speed things up significantly.
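For completeness, a minimal sketch of the buffered read (buffer size and file name are arbitrary):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferedRead {
    public static void main(String[] args) throws IOException {
        // A larger buffer (64 KB here) means fewer trips to the disk.
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"), 1 << 16)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // parse and process the line
            }
        }
    }
}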

Related

Java - Millions of records, HashMap throws OutOfMemoryError

I'm reading a file and parsing a few of the fields of each record as a reference key and another field as the reference value. These keys and values are referred to by another process.
Hence, I chose a HashMap, so that I can get the value for each key easily.
But each of the files consists of tens of millions of records, so the HashMap throws an OutOfMemoryError. I expect that increasing the heap memory will not be a good solution if the input file grows in the future.
For similar questions on SO, most have suggested using a database. I fear I will not be given the option to use a DB. Is there any other way to handle the problem?
EDIT: I need to do this HashMap loading for 4 such files :( I need all four, because if I don't find a matching entry for my input in the first map, I need to look in the second, then, if it's not there, in the third, and finally in the fourth.
Edit 2: The files I have sum up to around 1 GB in total.
EDIT 3:
034560000010000001750
000234500010000100752
012340000010000300374
I have records like these in a file. I need to have 03456000001000000 as the key and 1750 as the value, for all the millions of records. I'll look up these keys and get the value for my other process.
Using a database will not reduce memory cost or runtime by itself.
However, the default hashmaps may not be what you are looking for, depending on your data types. When used with primitive values such as Integers, Java hashmaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses about 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate >56 bytes per int->int entry in a Java HashMap<Integer, Integer>.
If you encode the integers as Strings, and we're talking maybe 6-digit numbers, that is likely 24 bytes for the underlying char[] array (16-bit characters; 12 bytes overhead for the array; sizes are a multiple of 8!), plus 16 bytes for the surrounding String object (maybe 24, too), for the key and the value each. So that is then around 24+40+40, i.e. over 104 bytes per entry.
(Update: as your keys are 17 characters in length, make this 24+62+40, i.e. 136 bytes)
If you used a primitive hashmap such as GNU Trove's TIntIntHashMap, it would only take 8 bytes plus unused entries, so let's estimate 16 bytes per entry, at least 6 times less memory.
(Update: for TLongIntHashMap, estimate 12 bytes per entry, 24 bytes with overhead of unused buckets.)
Now you could also just store everything in a massive sorted list. This allows you to perform a fast join operation, you lose much of the overhead of unused entries, and you can probably process twice as many entries in much less time.
Oh, and if you know the valid value range, you can abuse an array as a "hashmap".
I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage and write each entry into the appropriate slot. Don't store the key at all; it is the offset in the array.
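A sketch combining those suggestions for the fixed-width records shown in the question: since a 17-digit key fits in a long and the value in an int, two parallel sorted arrays plus Arrays.binarySearch give compact storage and O(log n) lookups (field widths are taken from the sample records; adjust if they differ):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class FixedWidthLookup {
    // Parallel arrays: keys[i] belongs to values[i]. A 17-digit key fits in a long,
    // so there is no boxing and roughly 12 bytes per entry.
    private long[] keys;
    private int[] values;

    void load(String path, int expectedEntries) throws IOException {
        long[] k = new long[expectedEntries];
        int[] v = new int[expectedEntries];
        int n = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                k[n] = Long.parseLong(line.substring(0, 17));   // e.g. "03456000001000000"
                v[n] = Integer.parseInt(line.substring(17));    // e.g. "1750"
                n++;
            }
        }
        // Sort both arrays by key via an index permutation (boxes n Integers once, during load only).
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Long.compare(k[a], k[b]));
        keys = new long[n];
        values = new int[n];
        for (int i = 0; i < n; i++) { keys[i] = k[order[i]]; values[i] = v[order[i]]; }
    }

    // Returns the value for a key, or -1 if the key is absent.
    int get(String rawKey) {
        int pos = Arrays.binarySearch(keys, Long.parseLong(rawKey));
        return pos >= 0 ? values[pos] : -1;
    }
}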
Last but not least, Java by default only uses 25% of your memory. You probably want to increase its memory limit.
Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.
Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.
I understand your concerns about having an external component to depend on; however, note that a DBMS is not necessarily a server daemon. There are tiny DBMSs which link with your program and keep the whole dataset in a file, as SQLite does.
Such large data collections should be handled with a database. Java programs are limited in memory, and the limit varies from device to device. You provided no info about your program, but please remember that if it is run on different devices, some of them may have very little RAM and will crash very quickly. A DB (be it SQL or file-based) is a must when it comes to large-data programs.
You have to either
a) have enough memory to load the data into memory, or
b) read the data from disk, with an index which is either in memory or on disk.
Whether you use a database or not the problem is much the same. If you don't have enough memory, you will see a dramatic drop in performance if you start randomly accessing the disk.
There are alternatives like Chronicle Map which use off-heap memory and perform well up to about double your main memory size, so you won't get an out-of-memory error; however, you still have the problem that you can't keep more data in memory than you have main memory.
The memory footprint depends on how you approach the file in Java. A widely used solution is to stream the file using the Apache Commons IO LineIterator. Their recommended usage:
LineIterator it = FileUtils.lineIterator(file, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    it.close();
}
It's an optimized approach, but if the file is too big you can still end up with an OutOfMemoryError.
Since you write that you fear you will not be given the option to use a database, some kind of embedded DB might be the answer. If it is impossible to keep everything in memory, it must be stored somewhere else.
I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups, the memory load is smaller than if you keep the entire dataset in memory, but they are still fast.
You could try lazy loading it.

Java: Optimal approach for storing and reading 1 billion data records

I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then be able to read them multiple times in a non-sequential order.
The source records are generated as simple Java value objects, and I would like to read them back in the same format.
For now my best guess is to store these objects, using a fast serialization library such as Kryo, in a flat file and then use a Java FileChannel to make direct random-access reads of the records at specific positions in the file (when storing the data, I will keep a hashmap, also to be saved on disk, with the position of each record in the file so that I know where to read it).
Also, there is no need to optimize disk space. My key concern is to optimize read performance while having reasonable write performance (which, again, will happen only once).
Last precision: while the records are all of the same type (the same Java value object), their size in bytes is variable (e.g. they contain strings).
Is there any better approach than what I mentioned above? Any hint or suggestion would be greatly appreciated !
Many thanks,
Thomas
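A sketch of the flat-file-plus-offset-index approach described in the question; serialization is left abstract (Kryo or anything else that produces a byte[]), and the index is a plain in-memory map from record id to {offset, length}:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class RecordStore {
    private final FileChannel channel;
    private final Map<Long, long[]> index = new HashMap<>(); // record id -> {offset, length}

    public RecordStore(String path) throws IOException {
        channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Append a serialized record (produced by Kryo or any other serializer) and remember where it went.
    public void write(long id, byte[] serialized) throws IOException {
        long offset = channel.size();
        ByteBuffer buffer = ByteBuffer.wrap(serialized);
        long pos = offset;
        while (buffer.hasRemaining()) {
            pos += channel.write(buffer, pos);
        }
        index.put(id, new long[]{offset, serialized.length});
    }

    // Random-access read of a single record by id; returns the raw bytes for deserialization.
    public byte[] read(long id) throws IOException {
        long[] entry = index.get(id);
        ByteBuffer buffer = ByteBuffer.allocate((int) entry[1]);
        long pos = entry[0];
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, pos);
            if (n < 0) break;
            pos += n;
        }
        return buffer.array();
    }
}

For a billion records the index itself becomes large, so a primitive-keyed map or an on-disk index would be worth considering.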
You can use Apache Lucene; it will take care of everything you have mentioned above :)
It is super fast; you can search results more quickly than ever.
Apache Lucene persists objects in files and indexes them. We have used it in a couple of apps and it is super fast.
You could just use an embedded Derby database. It's written in Java and you can run it embedded within your process, so there is no overhead of inter-process or networked communication. It will store the data and allow you to query it etc., handling all the complexity and indexing for you.

file based merge sort on large datasets in Java

Given large datasets that don't fit in memory, is there any library or API to perform a sort in Java?
The implementation would possibly be similar to the Linux utility sort.
Java provides a general-purpose sorting routine which can be used as part of a larger solution to your problem. A common approach to sorting data that's too large to fit in memory is this:
1) Read as much data as will fit into main memory, let's say it's 1 GB
2) Quicksort that 1 GB (here's where you'd use Java's built-in sort from the Collections framework)
3) Write that sorted 1 GB to disk as "chunk-1"
4) Repeat steps 1-3 until you've gone through all the data, saving each data chunk in a separate file. So if your original data was 9 GB, you will now have 9 sorted chunks of data labeled "chunk-1" through "chunk-9"
5) You now just need a final merge sort to merge the 9 sorted chunks into a single fully sorted data set. The merge sort works very efficiently against these pre-sorted chunks: it essentially opens 9 file readers (one for each chunk) plus one file writer (for output). It then compares the current data element from each reader, selects the smallest value, and writes it to the output file. The reader from which that selected value came advances to its next data element, and the 9-way comparison to find the smallest value is repeated, again writing the answer to the output file. This process repeats until all data has been read from all the chunk files.
6) Once step 5 has finished reading all the data, you are done -- your output file now contains a fully sorted data set
With this approach you could easily write a generic "megasort" utility of your own that takes a filename and a maxMemory parameter and efficiently sorts the file by using temp files. I'd bet you can find at least a few implementations out there, but if not you can just roll your own as described above (a sketch of the merge step follows).
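A sketch of the final merge step (step 5 above), assuming each chunk file holds one sorted record per line; a PriorityQueue keeps track of the smallest current line across all open readers:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.PriorityQueue;

public class ChunkMerger {
    // One entry per chunk: the current line plus the reader it came from.
    private static class Entry implements Comparable<Entry> {
        final String line;
        final BufferedReader reader;
        Entry(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(Entry other) { return line.compareTo(other.line); }
    }

    public static void merge(List<String> chunkFiles, String outputFile) throws IOException {
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        for (String chunk : chunkFiles) {
            BufferedReader reader = new BufferedReader(new FileReader(chunk));
            String first = reader.readLine();
            if (first != null) heap.add(new Entry(first, reader));
        }
        try (PrintWriter out = new PrintWriter(outputFile)) {
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();          // globally smallest remaining line
                out.println(smallest.line);
                String next = smallest.reader.readLine();
                if (next != null) heap.add(new Entry(next, smallest.reader));
                else smallest.reader.close();          // this chunk is exhausted
            }
        }
    }
}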
The most common ways to handle large datasets are in memory (you can buy a server with 1 TB these days) or in a database.
If you are not going to use a database (or buy more memory), you can write it yourself fairly easily.
There are libraries that may help, which perform Map-Reduce functions, but they may add more complexity than they save.

ArrayList<ArrayList<String>> runs outofmemory (Java heap space). Any other option?

I am working with the ArrayList data structure for dealing with a CSV file. My machine is pretty powerful:
Memory: 8 GB of RAM
Processor: 4 CPUs, each an Intel Core i5 at 2.5 GHz
In Eclipse, I assigned -Xmx5120m (5 GB of RAM for the Java VM) using the VM arguments panel in Run As -> Configuration.
I am still getting "OutOfMemory: Java heap space" for my ArrayList<ArrayList<String>> if it is larger than about 468000 x 108. I am using ArrayList because I feel most comfortable with it and it makes it easy to process the data for my purposes.
Actually, I am using this 2-dimensional array for column-based access, like
arraylist.get(i).get(0)
where
0 < i < 468000
would represent one column. Since I do operations like replacing a column with another column, copying a column, inserting a column at an arbitrary position in the ArrayList, etc., I could only think of ArrayList, because it has amortized constant time for adding or inserting in the average case.
So now my question is:
Which other data structures could I use instead of ArrayList in order to reach sizes much larger than 468000 x 108 (for example, (833 * 1000000) x 108) and still be able to do all the operations I mentioned above? (But I still want to be able to do this on my machine using the capacity that I have.)
I could think of doing all of this sequentially, meaning processing the first 468000 x 108, writing it to a CSV file, then loading the next 468000 x 108 into the ArrayList and writing it to a different file, etc.
I don't think that I reached the limit of arraylist for my capacity.
I would appreciate any kind of help.
You are trying to stuff a file with 468,000 lines into 5G of memory, and are running out of memory.
The data structure isn't the problem.
You need to change your approach and not do that. Process chunks of the file at a time, only extract the data you need, etc.
Inserting somewhere within an ArrayList won't give you amortized constant time, as the list has to be copied internally; this only holds as long as you insert at the end.
Besides, when the ArrayList has to grow, it will calculate the new size by
int newCapacity = (oldCapacity * 3)/2 + 1;
which can waste huge amounts of memory in your case; it would be more efficient to use custom-sized String arrays instead of the list (or at least call trimToSize() once you're done reading a column).
As long as you only need a few columns at a time, I'd suggest storing each column in a separate file, which you can load and write on demand. If they only contain strings, you could think of some easily readable binary format and use DataOutputStream and DataInputStream, for instance. Inserting a column would then simply become a file-rename operation. You could also add some caching to keep the most recently or most often used columns in memory (see java.util.LinkedHashMap for the idea of a simple LRU cache). Don't use a database if you don't need transactions and the like, and don't store such data in a verbose format like XML; you'd take a huge performance hit otherwise.
Finally, I'd think about the content of the matrix, as strings can become pretty large: do you really need them as strings, or can you create a less memory-consuming representation of them? For instance, if you only have 60,000 different strings, you could create a mapping between them and a short, and work with the shorts in memory.
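A sketch of that dictionary-encoding idea: each distinct string gets a short code, so a column becomes a short[] (2 bytes per cell) and each distinct string is stored only once:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StringDictionary {
    private final Map<String, Short> codes = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    // Returns the code for a string, assigning a new one on first sight.
    // Works as long as there are at most 32,767 distinct strings.
    public short encode(String s) {
        Short code = codes.get(s);
        if (code == null) {
            code = (short) strings.size();
            strings.add(s);
            codes.put(s, code);
        }
        return code;
    }

    public String decode(short code) {
        return strings.get(code);
    }
}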
A good way to "change your approach", as others have suggested is to persist your data in a database or xml file, then work with smaller subsets of that data as you need them.

How to store millions of Double during a calculation?

My engine executes 1,000,000 simulations on X deals. During each simulation, for each deal, a specific condition may be verified. In that case, I store the value (which is a double) in an array. Each deal has its own list of values (i.e. these values are independent from one deal to another).
At the end of all the simulations, for each deal, I run an algorithm on its List<Double> to get some outputs. Unfortunately, this algorithm requires the complete list of these values, so I am not able to modify it to calculate the outputs "on the fly", i.e. during the simulations.
In "normal" conditions (i.e. X is low, and the condition is verified less than 10% of the time), the calculation finishes correctly, even if it could be improved.
My problem occurs when I have many deals (for example X = 30) and almost all of my simulations verify the specific condition (say 90% of simulations). Just to store the values, I then need about 900,000 * 30 * 64 bits of memory (about 216 MB). One of my future requirements is to be able to run 5,000,000 simulations...
So I can't continue with my current way of storing the values. For the moment, I use a "simple" structure of Map<String, List<Double>>, where the key is the ID of the element and List<Double> is the list of values.
So my question is how I can improve this specific part of my application in order to reduce memory usage during the simulations.
Also, another important note: for the final calculation, my List<Double> (or whatever structure I end up using) must be ordered. So if the solution to my previous question also provides a structure that keeps newly inserted elements ordered (such as a SortedMap), that would be really great!
I am using Java 1.6.
Edit 1
My engine is indeed executing some financial calculations, and in my case all deals are related. This means that I cannot run my calculations on the first deal, get the output, clear its List<Double>, then move on to the second deal, and so on.
Of course, as a temporary solution, we will increase the memory allocated to the engine, but it's not the solution I am expecting ;)
Edit 2
Regarding the algorithm itself: I can't give the exact algorithm here, but here are some hints:
We must work on a sorted List<Double>. I then calculate an index (computed from a given parameter and the size of the List itself). Finally, I return the index-th value of this List.
public static double algo(double input, List<Double> sortedList) {
    if (someSpecificCases) {
        return 0;
    }

    // Calculate the index value, using input and also the size of the sortedList...
    double index = ...;

    // Specific case where I return the first item of my list.
    if (index == 1) {
        return sortedList.get(0);
    }

    // Specific case where I return the last item of my list.
    if (index == sortedList.size()) {
        return sortedList.get(sortedList.size() - 1);
    }

    // Here, I need the index-th value of my list...
    double val = sortedList.get((int) index);
    double finalValue = someBasicCalculations(val);
    return finalValue;
}
I hope it will help to have such information now...
Edit 3
Currently, I will not consider any hardware modification (too long and complicated here :( ). The solution of increasing the memory will be done, but it's just a quick fix.
I was thinking of a solution that uses a temporary file: up to a certain threshold (for example 100,000 values), my List<Double> stores new values in memory. When the size of the List<Double> reaches this threshold, I append the list to a temporary file (one file per deal).
Something like that:
public void addNewValue(double v) {
    if (list.size() == 100000) {
        appendListInFile();
        list.clear();
    }
    list.add(v);
}
At the end of the whole calculation, for each deal, I will reconstruct the complete List<Double> from what I have in memory and in the temporary file. Then I run my algorithm, clear the values for this deal, and move on to the next one (I can do that now, as all the simulations are finished by then).
What do you think of such a solution? Do you think it is acceptable?
Of course I will lose some time reading and writing my values to an external file, but I think this can be acceptable, no?
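For what it's worth, a sketch of that spill-and-reload idea using DataOutputStream/DataInputStream (8 bytes per double, one temporary file per deal); the method and file names are placeholders, and this is just one possible layout:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

public class ValueSpiller {

    // Append the buffered values for one deal to its temporary file (8 bytes per double).
    static void appendListInFile(String dealFile, List<Double> buffer) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(dealFile, true)))) { // true = append
            for (double v : buffer) {
                out.writeDouble(v);
            }
        }
    }

    // Read everything back once all the simulations are finished.
    static double[] readAll(String dealFile, int valueCount) throws IOException {
        double[] values = new double[valueCount];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(dealFile)))) {
            for (int i = 0; i < valueCount; i++) {
                values[i] = in.readDouble();
            }
        }
        return values;
    }
}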
Your problem is algorithmic and you are looking for a "reduction in strength" optimization.
Unfortunately, you've been too coy in the problem description and say "Unfortunately, this algorithm requires the complete list of these values...", which is dubious. The simulation run has already passed a predicate, which in itself tells you something about the sets that pass through the sieve.
I expect the data that meets the criteria has a low information content and therefore is amenable to substantial compression.
Without further information, we really can't help you more.
You mentioned that the "engine" is not connected to a database, but have you considered using a database to store the lists of elements? Possibly an embedded DB such as SQLite?
If you used int or even short instead of string for the key field of your Map, that might save some memory.
If you need a collection object that guarantees order, then consider a Queue or a Stack instead of your List that you are currently using.
Possibly think of a way to run deals sequentially, as Dommer and Alan have already suggested.
I hope that was of some help!
EDIT:
Your comment about only having 30 keys is a good point.
In that case, since you have to calculate all your deals at the same time, have you considered serializing your Lists to disk (e.g. as XML)?
Or even just writing a text file to disk for each List, then, after the deals are calculated, loading one file/List at a time to verify that List of conditions?
Of course the disadvantage is slow file I/O, but this would reduce your server's memory requirement.
Can you get away with using floats instead of doubles? That would save you about 100 MB.
Just to clarify, do you need ALL of the information in memory at once? It sounds like you are doing financial simulations (maybe credit risk?). Say you are running 30 deals: do you need to store all of the values in memory? Or can you run the first deal (~900,000 * 64 bits), then discard the list of doubles (serialize it to disk or something) and proceed with the next? I thought this might be okay, as you say the deals are independent of one another.
Apologies if this sounds patronising; I'm just trying to get a proper idea of the problem.
The flippant answer is to get a bunch more memory. Sun JVMs can (almost happily) handle multi-gigabyte heaps, and if it's a batch job then longer GC pauses might not be a massive issue.
If you decide that is not a sane solution, the first thing to attempt would be to write a custom list-like collection that stores primitive doubles instead of the Double object wrappers. This saves the per-object overhead you pay for each Double wrapper. I think the Apache Commons collections project had primitive collection implementations; these might be a starting point.
Another level would be to maintain the list of doubles in an NIO buffer off-heap. This has the advantage that the space used for the data is not considered in GC runs, and could in theory lead you down the road of managing the data structure in a memory-mapped file.
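A minimal sketch of the off-heap variant, with a fixed capacity known up front (here, the number of simulations); this is one possible shape, not the only one:

import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

public class OffHeapDoubleList {
    // Backed by a direct buffer, so the values live outside the Java heap
    // and are not walked by the garbage collector.
    private final DoubleBuffer values;
    private int size = 0;

    public OffHeapDoubleList(int maxValues) {
        values = ByteBuffer.allocateDirect(maxValues * 8).asDoubleBuffer();
    }

    public void add(double v) {
        values.put(size++, v);
    }

    public double get(int index) {
        return values.get(index);
    }

    public int size() {
        return size;
    }
}

For the final, sorted-list-based algorithm you would still copy the values into a double[] and Arrays.sort it (or sort in place with custom code).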
From your description, it appears you will not be able to easily improve your memory usage. The size of a double is fixed, and if you need to retain all results until your final processing, you will not be able to reduce the size of that data.
If you need to reduce your memory usage, but can accept a longer run time, you could replace the Map<String, List<Double>> with a List<Double> and only process a single deal at a time.
If you have to have all the values from all the deals, your only option is to increase your available memory. Your calculation of the memory usage is based on just the size of a value and the number of values. Without a way to decrease the number of values you need, no data structure will be able to help you, you just need to increase your available memory.
From what you tell us it sounds like you need 10^6 x 30 processors (ie number of simulations multiplied by number of deals) each with a few K RAM. Perhaps, though, you don't have that many processors -- do you have 30 each of which has sufficient memory for the simulations for one deal ?
Seriously: parallelise your program and buy an 8-core computer with 32GB RAM (or 16-core w 64GB or ...). You are going to have to do this sooner or later, might as well do it now.
There is a technique I read about a while ago where you write the data to disk and only read/write a chunk at a time as you need it. Of course this describes virtual memory, but the difference here is that the programmer controls the flow and location rather than the OS. The advantage is that the OS only allocates so much virtual memory to use, whereas you have access to the whole HD.
Or an easier option is just to increase your swap/paged memory, which I think would be silly but would help in your case.
After a quick google it seems like this function might help you if you are running on Windows:
http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx
You say you need access to all the values, but you cannot possibly operate on all of them at once? Can you serialize the data such that you can store it in a single file, each record set apart either by some delimiter, a key value, or simply a byte count? Keep a byte counter either way. Let that be a "circular file" composed of a left file and a right file operating like opposing stacks. As data is popped (read) off the left file, it is processed and pushed (written) into the right file. If your next operation requires a previously processed value, reverse the direction of the file transfer. Think of your algorithm as residing at the read/write head of your hard drive. You have access as you would with a list, just using different methods and at a much reduced speed. The speed hit will be significant, but if you can optimize your sequence of serialization so that the most likely accessed data is at the top of the file in order of use, and possibly put the left and right files on different physical drives and your page file on a third drive, you will benefit from increased hard-disk performance due to sequential and simultaneous reads and writes. Of course it's a bit harder than it sounds: each change of direction requires finalizing both files. Logically, something like
if (current data flow is left to right) { send EOF to right_file; left_file = left_file - right_file; }
Practically, you would want to leave all data in place where it physically resides on the drive and just manipulate the beginning and ending addresses for the files in the master file table, literally operating like a pair of hard-disk stacks. This will be a much slower, more complicated process than simply adding more memory, but very much more efficient than separate files and all the overhead of one file per record * millions of records. Or just put all your data into a database. FWIW, this idea just came to me. I've never actually done it or even heard of it being done, but I imagine someone must have thought of it before me. If not, please let me know; I could really use the credit on my resume.
One solution would be to format the doubles as strings and then add them to a (fast) key-value store which orders entries by design.
Then you would only have to read sequentially from the store.
Here is a store that 'naturally' sorts entries as they are inserted.
And they boast that they are doing it at the rate of 100 million entries per second (searching is almost twice as fast):
http://forum.gwan.com/index.php?p=/discussion/comment/897/#Comment_897
With an API of only 3 calls, it should be easy to test.
A fourth call will provide range-based searches.
