File-based merge sort on large datasets in Java

Given large datasets that don't fit in memory, is there any library or API to perform a sort in Java?
The implementation would presumably be similar to the Linux sort utility.

Java provides a general-purpose sorting routine which can be used as part of the larger solution to your problem. A common approach to sorting data that's too large to fit in memory at once is this:
1) Read as much data as will fit into main memory; let's say it's 1 GB.
2) Sort that 1 GB in memory (here's where you'd use Java's built-in sort from the Collections framework).
3) Write that sorted 1 GB to disk as "chunk-1".
4) Repeat steps 1-3 until you've gone through all the data, saving each data chunk in a separate file. So if your original data was 9 GB, you will now have 9 sorted chunks of data labeled "chunk-1" through "chunk-9".
5) You now just need a final merge to combine the 9 sorted chunks into a single fully sorted data set. The merge works very efficiently against these pre-sorted chunks. It essentially opens 9 file readers (one for each chunk), plus one file writer (for the output). It then compares the first data element from each reader and selects the smallest value, which is written to the output file. The reader that supplied the selected value advances to its next data element, and the 9-way comparison to find the smallest value is repeated, again writing the answer to the output file. This process repeats until all data has been read from all the chunk files.
6) Once step 5 has finished reading all the data, you are done: your output file now contains a fully sorted data set.
With this approach you could easily write a generic "megasort" utility of your own that takes a filename and a maxMemory parameter and efficiently sorts the file by using temp files. I'd bet you could find at least a few implementations out there, but if not you can just roll your own as described above.
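To make this concrete, here is a minimal sketch of such a "megasort" utility. It assumes line-based text data sorted lexicographically, uses made-up file names ("input.txt", "sorted.txt", "chunk-N"), and leaves out error handling and tuning; it is an illustration of the steps above, not a production implementation.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MegaSort {

    // Steps 1-4: read a bounded number of lines at a time, sort each batch in memory
    // with Java's built-in sort, and write it to its own "chunk-N" file.
    static List<Path> splitIntoSortedChunks(Path input, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(maxLinesPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer, chunks.size()));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer, chunks.size()));
            }
        }
        return chunks;
    }

    static Path writeSortedChunk(List<String> lines, int index) throws IOException {
        Collections.sort(lines);                        // in-memory sort of one chunk
        Path chunk = Paths.get("chunk-" + (index + 1));
        Files.write(chunk, lines);
        return chunk;
    }

    // Steps 5-6: open one reader per chunk, repeatedly pick the smallest current line
    // across all readers, write it, and advance only the reader it came from.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        String[] current = new String[chunks.size()];
        for (int i = 0; i < chunks.size(); i++) {
            readers.add(Files.newBufferedReader(chunks.get(i)));
            current[i] = readers.get(i).readLine();
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            while (true) {
                int min = -1;
                for (int i = 0; i < current.length; i++) {   // the N-way comparison
                    if (current[i] != null && (min < 0 || current[i].compareTo(current[min]) < 0)) {
                        min = i;
                    }
                }
                if (min < 0) break;                          // every chunk is exhausted
                writer.write(current[min]);
                writer.newLine();
                current[min] = readers.get(min).readLine();  // advance the winning reader
            }
        }
        for (BufferedReader reader : readers) {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(Paths.get("input.txt"), 1_000_000);
        mergeChunks(chunks, Paths.get("sorted.txt"));
    }
}

The linear scan for the minimum is fine for a handful of chunks; with many chunks, a PriorityQueue keyed on the current line of each reader is the usual refinement.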

The most common way to handle large datasets is in memory (you can buy a server with 1 TB these days) or in a database.
If you are not going to use a database (or buy more memory) you can write it yourself fairly easily.
There are libraries which may help which perform Map-Reduce functions but they may add more complexity than they save.

Related

Calculating unique URLs in a huge dataset (150+ billions)

My problem is the following:
I get a list of URLs consisting of roughly 150 billion entries.
Each month I get a new batch of another ~150 billion entries.
I should remove the duplicates and store the rest.
This should be done on a single machine that is reasonably small compared to the task (around 32-64 GB of RAM is available, plus a good amount of disk space).
I should store the unique urls (the storage problem is already solved).
My chosen language for this task is Java.
This is not an interview question or something similar. I need to do this for a business case.
Is there an algorithm available that lets me achieve this goal (preferably in less than one month)? My first idea was a Bloom/Cuckoo filter, but I want to keep all of the URLs if possible.
I would implement a merge sort and eliminate the duplicates during the merge step.
You will want to stream the URLs in and create modest-sized batches that can be sorted in memory. Each of these sorted chunks is written to disk.
To merge these sorted chunks, stream in two (or more) of the files. Look at the next URL in each stream and take the smallest off of the stream, keeping track of the most recently output URL. As the next smallest URL is obtained, compare it to the most recently output URL - if it is a duplicate, skip it; otherwise output it (and remember it as the most recently output).
If your creation of sorted chunks gave you too many files to open at once, keep merging groups of files until you have one file. This result will have no duplicates.
You probably would use Arrays.parallelSort() to do the in-memory sorting of your initial chunks. You probably would benefit from removing duplicates from these initially sorted chunks while outputting the sorted array elements.
Definitely use buffered I/O.
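As a sketch of that chunk step, assuming each batch of URLs is handed over as a String[] and using a made-up chunk naming scheme, you could sort in parallel and drop duplicates while writing, roughly like this:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class ChunkWriter {
    // Sort one in-memory batch in parallel and write it out, dropping
    // duplicates (which sit next to each other once the batch is sorted).
    static Path writeSortedChunk(String[] urls, int chunkIndex) throws IOException {
        Arrays.parallelSort(urls);
        Path chunk = Paths.get("chunk-" + chunkIndex + ".txt");   // hypothetical naming
        try (BufferedWriter out = Files.newBufferedWriter(chunk)) {
            String last = null;
            for (String url : urls) {
                if (!url.equals(last)) {    // skip duplicates within this chunk
                    out.write(url);
                    out.newLine();
                    last = url;
                }
            }
        }
        return chunk;
    }
}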
When merging multiple files, I would create a priority queue that has the next record from each stream along with which stream it comes from. You grab the next item from the priority queue, read the next line from the stream that next item came from, and put that new line in the queue. The number of streams you can merge from will be limited either by the number of files you can have open or the memory required for buffered I/O from all the streams.
To implement this probably requires a page or so of code; it is pretty straightforward to run this on one machine. However, if it fits in your infrastructure, this problem is a good fit for a Hadoop cluster or something like that. If you want to run this fast on, e.g., AWS, you would probably want to use a queue (e.g., SQS on AWS) to manage the chunks to be sorted/merged; it becomes more involved, but will run much faster.
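A rough sketch of the priority-queue merge with duplicate elimination, assuming each chunk file is already sorted with one URL per line; the class name and the plain holder object are illustrative choices, not part of the original answer:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class DedupMerge {

    // One entry per input stream: the current line plus the reader it came from.
    private static final class Entry {
        final String line;
        final BufferedReader reader;
        Entry(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    public static void merge(List<Path> sortedChunks, Path output) throws IOException {
        PriorityQueue<Entry> queue = new PriorityQueue<>(Comparator.comparing((Entry e) -> e.line));
        for (Path chunk : sortedChunks) {
            BufferedReader reader = Files.newBufferedReader(chunk);
            String first = reader.readLine();
            if (first != null) {
                queue.add(new Entry(first, reader));
            } else {
                reader.close();
            }
        }
        String lastWritten = null;
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!queue.isEmpty()) {
                Entry smallest = queue.poll();
                if (!smallest.line.equals(lastWritten)) {       // skip duplicates across chunks
                    out.write(smallest.line);
                    out.newLine();
                    lastWritten = smallest.line;
                }
                String next = smallest.reader.readLine();       // refill from the same stream
                if (next != null) {
                    queue.add(new Entry(next, smallest.reader));
                } else {
                    smallest.reader.close();
                }
            }
        }
    }
}

The number of chunks you can merge in one pass is still bounded by open-file limits and buffer memory, as noted above; if there are too many, merge groups of chunks into intermediate files first.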
Other Considerations
Fault tolerance: This process will take a long time to run; if it fails in the middle, do you want to have to start over from the beginning? You might want to start with a step that just streams through the source file and breaks it into appropriately-sized, unsorted chunks. This step in itself will take a while as there is a fair bit of I/O involved. Put these chunks into a directory called "unsorted", for example. Then have a second process that repeatedly looks in the "unsorted" directory, picks a chunk, reads it, sorts it, writes it to the "sorted" directory and moves the unsorted chunk from "unsorted" to "archived". Then have a third process that reads chunks from "sorted", merges them (removing duplicates) and writes to "sorted1" or "final" (depending on whether it is merging all remaining files or not). The idea is to structure things so that you are always making forward progress, so if your server dies you can pick up where you left off.
Parallelism: This process will take a long time. The more servers that you can apply to it in parallel, the faster it will go. You can achieve this (if you have servers available) in much the same way as fault tolerance - you can do the sorting and (intermediate) merging steps on many machines in parallel (with appropriate file locking or some other scheme so they don't try to work on the same chunks).

Calculate statistics on 20+ million records in Java

I have a CSV file (600 MB) with 20 million rows.
I need to read all this data, create a list of Java objects out of it, and calculate some metrics on object fields, such as average, median, max, total sum and other statistics. What is the best way of doing this in Java?
I tried a simple .forEach loop and it took a while (20 min) to iterate over it.
UPDATE:
I use a BufferedReader to read the data and convert the CSV file into a List of objects of some Java class. That part is pretty fast.
It gets stuck for 20 minutes in the forEach loop, where I iterate over those 20 million objects and divide them into 3 lists, depending on the values in the current object.
So basically, I iterate over the whole list once, with an if/else condition that checks whether a certain field in the object equals "X", "Y" or "Z", and depending on the answer I separate those 20 million records into 3 lists.
Then, for those 3 lists, I need to calculate different statistics, such as median, average, total sum, etc.
Having worked extensively with data volumes far exceeding 600 MB, I can make two statements:
600 MB is not a large amount of data, in particular if we are talking about tabular data;
those amounts have nothing to do with Big Data and are actually easily processable on conventional hardware in memory, which is the fastest option.
What you should do, however, is make sure that you read that data into column-wise contiguous arrays and use methods operating directly on those contiguous arrays of column-wise data.
Because a CSV file is stored row-wise, you would be much better off reading it in one block into a byte array and parsing that into a column-wise pre-allocated representation.
Reading a block of 600 MB into memory from an SSD should take a few seconds; parsing it will depend on your algorithm (but it is essential to be able to seek within that structure instantly). Memory-wise you will use about triple the 600 MB, but on a 16 GB machine that should be a no-brainer.
So, do not rush to SQL or to slicing files, and do not instantiate every cell as a Java object. That is, in this exceptional case, you do not want a list of Java objects; you want double[] and the like. You can get by with ArrayLists if you preallocate exact sizes. Other standard collections will kill you.
Having said all that, I would rather recommend Python with NumPy for the task than Java. Java is good with objects, and not as good with contiguous memory blocks and the corresponding operations. C++ would do as well, or even R.
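To illustrate the column-wise idea in Java rather than NumPy, here is a rough sketch that reads one numeric column of a comma-separated file into a preallocated double[] and computes the statistics from that array. The file name, column index and row estimate are made up, and a real parser would have to deal with quoting and malformed rows:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ColumnStats {
    public static void main(String[] args) throws IOException {
        int columnIndex = 2;                         // hypothetical numeric column
        double[] values = new double[20_000_000];    // preallocate near the expected row count
        int n = 0;

        try (BufferedReader in = Files.newBufferedReader(Paths.get("data.csv"))) {
            in.readLine();                                    // skip the header row
            String line;
            while ((line = in.readLine()) != null) {
                if (n == values.length) {                     // grow if the estimate was low
                    values = Arrays.copyOf(values, values.length * 2);
                }
                String[] cells = line.split(",", -1);
                values[n++] = Double.parseDouble(cells[columnIndex]);
            }
        }

        values = Arrays.copyOf(values, n);                    // trim to the actual row count
        double sum = 0, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            sum += v;
            max = Math.max(max, v);
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);                                  // only needed for the median
        double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        System.out.printf("sum=%f avg=%f max=%f median=%f%n", sum, sum / n, max, median);
    }
}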
I highly suggest not loading all of the 600 MB into RAM as Java objects.
As you stated, this literally takes ages to load.
What you could do instead:
Use SQL:
Load your data into a database, and run your search queries against that database.
Don't loop over all the objects in RAM; this would make your application perform very poorly.
SQL is optimized for handling large amounts of data and performing queries on it.
Read more about Database Management in Java: JDBC Basics
Sounds like your program is simply running out of memory as you are adding stuff to a list. If you get close to the memory limit allocated to the JVM most of the time will be spent by the garbage collector trying to do what it can to prevent you running out of memory.
You should use a fast CSV library such as univocity-parsers to iterate over each row and perform the calculations you need without storing all in memory. Use it like this:
import java.io.File;

import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings parserSettings = new CsvParserSettings(); //configure the parser
parserSettings.selectFields("column3", "column1", "column10"); //only read values from the columns you need
CsvParser parser = new CsvParser(parserSettings);

//use this if you just need plain strings
for (String[] row : parser.iterate(new File("/path/to/your.csv"))) {
    //do stuff with the row
}

//or use records to get values ready for calculation
for (Record record : parser.iterateRecords(new File("/path/to/your.csv"))) {
    int someValue = record.getInt("columnName");
    //perform calculations
}
Only store the data in a huge list if for some reason you need to run through all rows more than once. In that case, allocate more memory to your program with something like -Xms8G -Xmx8G. Keep in mind you can't have an ArrayList with a size over Integer.MAX_VALUE, so that's your next limit even if you have enough memory.
If you really need a list, you can use the parser like this:
List<Record> twentyMillionRecords = parser.parseAllRecords(new File("/path/to/your.csv"), 20_000_000);
Otherwise your best bet is to run the parser as many times as needed. The parser I suggested should take a few seconds to go through the file each time.
Hope this helps
Disclaimer: I'm the author of this library. It's open source and free (apache 2.0 license)
I bet the majority of the time was spent reading the data. Using a BufferedReader should significantly speed things up.

Parsing 20 GB input file to an ArrayList

I need to sort a 20 GB file (which consists of random numbers) in ascending order, but I don't understand which technique I should use. I tried to use an ArrayList in my Java program, but it runs out of memory. Increasing the heap size didn't work either; I guess 20 GB is too big. Can anybody guide me on how I should proceed?
You should use an external sorting algorithm; do not try to fit this in memory.
http://en.wikipedia.org/wiki/External_sorting
If you think it is too complex, try this:
include H2 database in your project
create a new on-disk database (will be created automatically on first connection)
create some simple table where the numbers will be stored
read the data number by number and insert it into the database (do not forget to commit every 1,000 numbers or so)
select the numbers with an ORDER BY clause :)
use a JDBC ResultSet to fetch the results on the fly and write them to an output file
H2 database is simple, works very well with Java and can be embedded in your JAR (does not need any kind of installation or setup).
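A rough JDBC sketch of that H2 recipe, assuming the numbers arrive one per line in "input.txt" and that the H2 driver is on the classpath; the database path, table name and batch size are arbitrary:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2Sort {
    public static void main(String[] args) throws Exception {
        // On-disk H2 database; the file is created automatically on first connection.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./numbers_db")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS numbers(n BIGINT)");
            }

            // Insert number by number, committing every 1,000 rows or so.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"));
                 PreparedStatement insert = conn.prepareStatement("INSERT INTO numbers(n) VALUES (?)")) {
                String line;
                int count = 0;
                while ((line = in.readLine()) != null) {
                    insert.setLong(1, Long.parseLong(line.trim()));
                    insert.addBatch();
                    if (++count % 1000 == 0) {
                        insert.executeBatch();
                        conn.commit();
                    }
                }
                insert.executeBatch();
                conn.commit();
            }

            // Let the database sort, and stream the result set straight to the output file.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT n FROM numbers ORDER BY n");
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"))) {
                while (rs.next()) {
                    out.write(Long.toString(rs.getLong(1)));
                    out.newLine();
                }
            }
        }
    }
}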
You don't really need any special tools for this. This is a textbook case for external merge sort, wherein you read in parts of the large file at a time (say 100 MB), sort them, and write the sorted results to a temporary file. Read in another part, sort it, write it back out, until there's nothing left to sort. Then you read the sorted chunks back in, a smaller piece at a time (say 10 MB each), and merge them in memory. The tricky point is to merge those sorted pieces together in the right way. Read the external sorting page on Wikipedia as well, as already mentioned. Also, here's an implementation in Java that does this kind of external merge sorting.

Summing weights based on string in large file

I am pretty sure a similar discussion has already been had here, but I want to present the exact problem I am facing along with a possible solution from my side. Then I want to hear from you what would be a better approach, or how I can improve my logic.
PROBLEM
I have a huge file which contains lines. Each line is in the following format: <weight>,<some_name>. Now what I have to do is add up the weights of all the objects which have the same name. The problems are:
I don't know how frequently some_name appears in the file. It could appear only once, or it could be all of the millions of lines.
It is not ordered.
I am using a file stream (Java specific, but it doesn't matter).
SOLUTION 1: Assuming that I have huge RAM, what I am planning to do is read the file line by line and use the name as the key in my hash map. If it's already there, add to the sum; otherwise insert it. It will cost me m RAM (m = number of lines in the file) but overall processing would be fast.
SOLUTION 2: Assuming that I don't have huge RAM, I am going to do it in batches. Read the first 10,000 lines into a hash table, sum them up and dump them into a file. Do the same for the rest of the file. Once done processing the file, I will start reading the processed files and will repeat this process to sum it all up.
What do you suggest here?
Besides your suggestions, can I read the file in parallel? I have access to FileInputStream here; can I work with FileInputStream to make reading the file more efficient?
The second approach is not going to help you: in order to produce the final output, you need a sufficient amount of RAM to hold all the keys from the file, along with a single number per key for the running total. Whether you get there in one big step or by several iterations of 10K rows at a time does not change the footprint that you need at the end.
What would help is partitioning the keys in some way, e.g. by the first character of the key. If the name starts with a letter, process the file 26 times, the first time taking only the weights for keys starting with 'A' and ignoring all other keys, the second time taking only the 'B's, and so on. This will let you end up with 26 files that do not intersect.
Another valid approach would be using an external sorting algorithm to transform the unordered file into an ordered one. This would let you walk the ordered file, calculate totals as you go, and write them to the output, without even needing an in-memory table.
As far as optimizing the I/O goes, I would recommend using the newBufferedReader(Path path, Charset c) method of the java.nio.file.Files class: it gives you a BufferedReader that is optimized for reading efficiency.
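To make the first-character partitioning concrete, here is a sketch that combines it with the hash-map summation: one pass per starting letter, so only that letter's names are in memory at any time. It assumes lines of the exact form <weight>,<some_name>, integer weights, and names starting with a letter; anything else would need extra handling:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WeightSums {
    public static void main(String[] args) throws IOException {
        for (char letter = 'A'; letter <= 'Z'; letter++) {
            Map<String, Long> totals = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(
                    Paths.get("weights.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    int comma = line.indexOf(',');
                    long weight = Long.parseLong(line.substring(0, comma));
                    String name = line.substring(comma + 1);
                    if (Character.toUpperCase(name.charAt(0)) == letter) {
                        totals.merge(name, weight, Long::sum);   // add to the running total
                    }
                }
            }
            // Emit this letter's totals before moving on to the next partition.
            totals.forEach((name, sum) -> System.out.println(sum + "," + name));
        }
    }
}

This trades 26 passes over the input for a bounded memory footprint; writing 26 partition files in a single pass and then summing each file separately avoids the repeated reads at the cost of some disk space.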
Is the file static when you do this computation? If so, then you could disk sort the file based on the name and add up the consecutive entries.

Sort a file with huge volume of data given memory constraint

Points:
We process thousands of flat files in a day, concurrently.
Memory constraint is a major issue.
We use a thread for each file we process.
We don't sort by columns. Each line (record) in the file is treated as one column.
Can't Do:
We cannot use unix/linux's sort commands.
We cannot use any database system no matter how light they can be.
Now, we cannot just load everything into a collection and use its sort mechanism. It will eat up all the memory and the program is going to get a heap error.
In that situation, how would you sort the records/lines in a file?
It looks like what you are looking for is
external sorting.
Basically, you sort small chunks of data first, write them back to disk and then iterate over those to merge them all.
As others mentioned, you can process it in steps.
I would like to explain this in my own words (it differs on point 3):
Read the file sequentially, processing N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).
Sort the N records in memory and write them to a temp file. Loop until you have written all T temp files.
Open all T temp files at the same time, but read only one record per file (with buffers, of course). For each of these T records, find the smallest, write it to the final file, and advance only in that file.
Advantages:
The memory consumption is as low as you want.
You only do double the disk accesses compared to an everything-in-memory policy. Not bad! :-)
Example with numbers:
Original file with 1 million records.
Choose to have 100 temp files, so read and sort 10,000 records at a time, and drop these into their own temp file.
Open the 100 temp files at once and read the first record of each into memory.
Compare these first records, write the smallest and advance that temp file.
Loop on step 5, one million times.
EDITED
You mentioned a multi-threaded application, so I wonder...
As we have seen from these discussions, using less memory gives less performance, with a dramatic factor in this case. So I would also suggest using only one thread to process one sort at a time, rather than running it as a multi-threaded application.
If you run ten threads, each with a tenth of the memory available, your performance will be miserable, much worse than a tenth of the initial speed. If you use only one thread, queue the 9 other demands and process them in turn, your global performance will be much better, and you will finish the ten tasks much faster.
After reading this response:
Sort a file with huge volume of data given memory constraint
I suggest you consider this distribution sort. It could be a huge gain in your context.
The improvement over my proposal is that you don't need to open all the temp files at once; you only open one of them at a time. It saves your day! :-)
You can read the file in smaller parts, sort these and write them to temporary files. Then you read two of them sequentially again and merge them into a bigger temporary file, and so on. If there is only one left, you have your sorted file. Basically that's the merge sort algorithm performed on external files. It scales quite well with arbitrarily large files but causes some extra file I/O.
Edit: If you have some knowledge about the likely variance of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified: you read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can repeat the same process for it based on the 2nd char of the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the performance-wise stable merge sort.
In spite of your restriction, I would use the embedded database SQLite3. Like you, I work weekly with 10-15 million flat-file lines and it is very, very fast to import and generate sorted data, and you only need a small, free executable (sqlite3.exe). For example, once you download the .exe file, in a command prompt you can do this:
C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' TabLines
then:
sqlite> select * from tabLines order by line;
or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout
I would spin up an EC2 cluster and run Hadoop's MergeSort.
Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.
Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e. the divide-and-conquer strategy). Here is its website.
As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.
You could use the following divide-and-conquer strategy:
Create a function H() that can assign each record in the input file a number. For a record r2 that will be sorted behind a record r1 it must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that will fit into memory so you can sort them. Once you have done that you can just concatenate the sorted files to get one large sorted file.
Suppose you have this input file where each line represents a record
Alan Smith
Jon Doe
Bill Murray
Johnny Cash
Let's just build H() so that it uses the first letter of the record; you might get up to 26 files, but in this example you will just get 3:
<file1>
Alan Smith
<file2>
Bill Murray
<file10>
Jon Doe
Johnny Cash
Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. Now, if you just concatenate the 3 files you'll have a sorted version of the input.
Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in a way that the resulting parts which you need to sort don't overlap which will make merging the result much simpler.
The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
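As a sketch of this flow with H() being the first letter, assuming each partition fits in memory and using made-up file names: stream the input once to route each record into its partition file, then sort and concatenate the partitions in key order.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSort {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("records.txt");
        Path output = Paths.get("records-sorted.txt");

        // Divide: H() = first letter; append each record to the file for its letter.
        Map<Character, BufferedWriter> writers = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String record;
            while ((record = in.readLine()) != null) {
                char key = Character.toUpperCase(record.charAt(0));
                BufferedWriter w = writers.computeIfAbsent(key, k -> open(partitionPath(k)));
                w.write(record);
                w.newLine();
            }
        }
        for (BufferedWriter w : writers.values()) {
            w.close();
        }

        // Conquer: sort each (small) partition in memory, then concatenate in key order.
        Files.deleteIfExists(output);
        for (char key : writers.keySet()) {
            List<String> lines = Files.readAllLines(partitionPath(key));
            Collections.sort(lines);
            Files.write(output, lines, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    private static Path partitionPath(char key) {
        return Paths.get("partition-" + key + ".txt");
    }

    private static BufferedWriter open(Path path) {
        try {
            return Files.newBufferedWriter(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}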
If your restriction is only to not use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
Here is a way to do it without heavy use of sorting inside Java and without using a DB.
Assumptions: you have 1 TB of space and the files contain, or start with, unique numbers, but are unsorted.
Divide the file into N smaller files.
Read those N files one by one, and create one file for each line/number.
Name each file with its corresponding number. While naming, keep a counter updated to store the smallest value.
Now you can have the root folder of files sorted by name, or pause your program to give yourself time to fire a command on your OS to sort the files by name. You can do it programmatically too.
Now that you have a folder with files sorted by name, use the counter to start taking each file one by one, put the numbers into your OUTPUT file, and close it.
When you are done you will have a large file with sorted numbers.
I know you mentioned not using a database no matter how light... so, maybe this is not an option. But, what about hsqldb in memory... submit it, sort it by query, purge it. Just a thought.
You can use an SQLite file DB: load the data into the DB and then let it sort and return the results for you.
Advantages: No need to worry about writing the best sorting algorithm.
Disadvantage: You will need disk space, slower processing.
https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
You can do it with only two temp files - source and destination - and as little memory as you want.
On first step your source is the original file, on last step the destination is the result file.
On each iteration:
read from the source file into a sliding buffer a chunk of data half the size of the buffer;
sort the whole buffer
write to the destination file the first half of the buffer.
shift the second half of the buffer to the beginning and repeat
Keep a boolean flag that says whether you had to move some records in current iteration.
If the flag remains false, your file is sorted.
If it's raised, repeat the process using the destination file as a source.
Max number of iterations: (file size)/(buffer size)*2
You could download GNU sort for Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm Even if that uses too much memory, it can merge smaller sorted files as well. It automatically uses temp files.
There's also the sort that comes with Windows within cmd.exe. Both of these commands can specify the character column to sort by.
File sorting software for big files: https://github.com/lianzhoutw/filesort/.
It is based on the file merge sort algorithm.
If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.
You will have to scan the lines in the file, keeping only 2 rows in memory at a time, and swap them if they are not in the right order. Repeat the process until there are no more lines to swap.
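A toy sketch of that idea, under the strong assumption of fixed-width records so that two neighbours can be swapped by rewriting them in place (variable-length lines cannot be swapped this way). It is meant only to illustrate the seek-and-rewrite technique; the quadratic number of passes makes it impractical for genuinely large files.

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBubbleSort {
    public static void sort(String path, int recordLength) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            long records = file.length() / recordLength;
            byte[] a = new byte[recordLength];
            byte[] b = new byte[recordLength];
            boolean swapped = true;
            while (swapped) {                       // keep making passes until nothing moves
                swapped = false;
                for (long i = 0; i + 1 < records; i++) {
                    file.seek(i * recordLength);
                    file.readFully(a);              // only two records in memory at a time
                    file.readFully(b);
                    if (compare(a, b) > 0) {        // out of order: rewrite both in place
                        file.seek(i * recordLength);
                        file.write(b);
                        file.write(a);
                        swapped = true;
                    }
                }
            }
        }
    }

    private static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < x.length; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }
}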
