This is another question from Cracking the Coding Interview; I still have some doubts after reading the solution.
9.4 If you have a 2 GB file with one string per line, which sorting algorithm
would you use to sort the file and why?
SOLUTION
When an interviewer gives a size limit of 2GB, it should tell you something - in this case, it suggests that they don’t want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory.
Algorithm:
1. How much memory do we have available? Let's assume we have X MB of memory available.
2. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory, sort the lines as usual using any O(n log n) algorithm, and save the sorted lines back to disk. Then bring the next chunk into memory and sort it, until all K chunks are sorted.
3. Once we're done, merge the sorted chunks one by one.
The above algorithm is also known as an external sort. Step 3 is known as an N-way merge.
The rationale behind using an external sort is the size of the data: since it is too large to bring into memory all at once, we need a disk-based sorting algorithm.
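For concreteness, here is a minimal sketch of step 2 (sorting the chunks) in Java, assuming the chunk size is expressed as a number of lines and each sorted chunk is written to its own temporary file; the class and method names are illustrative, not part of the original solution:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class ChunkSorter {

        // Read the big file linesPerChunk lines at a time, sort each chunk in
        // memory, and write it to its own temporary file. Returns the chunk files.
        static List<Path> sortChunks(Path input, int linesPerChunk) throws IOException {
            List<Path> chunks = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>(linesPerChunk);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == linesPerChunk) {
                        chunks.add(writeSortedChunk(buffer, chunks.size()));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    chunks.add(writeSortedChunk(buffer, chunks.size()));
                }
            }
            return chunks;
        }

        private static Path writeSortedChunk(List<String> lines, int index) throws IOException {
            Collections.sort(lines);                 // any O(n log n) in-memory sort
            Path chunk = Files.createTempFile("chunk_" + index + "_", ".txt");
            Files.write(chunk, lines, StandardCharsets.UTF_8);
            return chunk;
        }
    }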
Doubt:
In step 3, when doing the merge, don't we need 2*X MB of space each time we compare two chunks? The limit was X MB. Should we instead make the chunks X/2 MB each, so that (X/2) * 2K = 2 GB and there are 2K chunks? Or am I just misunderstanding the merge step?
Thanks!
http://en.wikipedia.org/wiki/External_sorting
A quick look at Wikipedia tells me that during the merging process you never hold a whole chunk in memory. So basically, if you have K chunks, you will have K open file pointers, but you will only hold one line from each file in memory at any given time. You compare the lines you have in memory, output the smallest one (say, from chunk 5) to your sorted file (also an open file pointer, not held in memory), then read the next line from that file (in our example, file 5) into memory in its place, and repeat until you reach the end of all the chunks.
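A minimal sketch of that K-way merge in Java, assuming the sorted chunks live in temporary files and plain lexicographic line order is wanted; the class and method names are illustrative:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class KWayMerge {

        // One open chunk file plus the single line currently held in memory for it.
        private static final class Source {
            final BufferedReader reader;
            String current;
            Source(BufferedReader reader, String current) {
                this.reader = reader;
                this.current = current;
            }
        }

        // Merge already-sorted chunk files into one sorted output file.
        // Only K lines (one per chunk) are held in memory at any time.
        static void merge(List<Path> chunks, Path output) throws IOException {
            PriorityQueue<Source> heap =
                    new PriorityQueue<>(Comparator.comparing((Source s) -> s.current));
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new Source(reader, first));
                } else {
                    reader.close();
                }
            }
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                while (!heap.isEmpty()) {
                    Source smallest = heap.poll();     // chunk owning the smallest current line
                    out.write(smallest.current);
                    out.newLine();
                    smallest.current = smallest.reader.readLine();
                    if (smallest.current != null) {
                        heap.add(smallest);            // refill from the same chunk
                    } else {
                        smallest.reader.close();       // that chunk is exhausted
                    }
                }
            }
        }
    }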
First off, step 3 itself is not a merge sort; the whole thing is a merge sort. Step 3 is just a merge, with no sorting involved at all.
And as to the storage required, there are two possibilities.
The first is to merge the sorted data in groups of two. Say you have three groups:
A: 1 3 5 7 9
B: 0 2 4 6 8
C: 2 3 5 7
With that method, you would merge A and B in to a single group Y then merge Y and C into the final result Z:
Y: 0 1 2 3 4 5 6 7 8 9 (from merging A and B).
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging Y and C).
This has the advantage of a very small constant memory requirement in that you only ever need to store the "next" element from each of two lists but, of course, you need to do multiple merge operations.
The second way is a "proper" N-way merge where you select the next element from any of the groups. With that you would check the lowest value in every list to see which one comes next:
Z: 0 1 2 2 3 3 4 5 5 6 7 7 8 9 (from merging A, B and C).
This involves only one merge operation but it requires more storage, basically one element per list.
Which of these you choose depends on the available memory and the element size.
For example, if you have 100M of memory available to you and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 groups (of 100M each) for the sort phase, which means a proper N-way merge will need 100K times 20, or about 2M, well under your memory availability.
Alternatively, let's say you only have 1M available. That will be about 2000 (2G / 1M) groups and multiplying that by 100K gives 200M, well beyond your capacity.
So you would have to do that merge in multiple passes. Keep in mind though that it doesn't have to be multiple passes merging two lists.
You could find a middle ground where for example each pass merges ten lists. Ten groups of 100K is only a meg so will fit into your memory constraint and that will result in fewer merge passes.
The merging process is much simpler than that. You'll be outputting them to a new file, but basically you only need constant memory: you only need to read one element from each of the two input files at a time.
Which algorithm is faster when iterating through a large array: heap sort or merge sort? Why is one of these algorithms faster than the other?
Although the time complexity is the same, the constant factors are not. Generally merge sort will be significantly faster on a typical system with a 4-way (or greater) associative cache, since merge sort performs sequential reads from two runs and sequential writes to a single merged run. I recall a merge sort written in C being faster than an optimized heap sort written in assembly.
One issue is that heap sort swaps data (two reads and two writes per swap), while merge sort moves data (one read and one write per move).
The main drawback of merge sort is that it needs a second array (or vector) the same size as the original (or, optionally, half the size) as working storage; on a PC with 4 GB or more of RAM, this usually isn't an issue.
On my system, an Intel 3770K at 3.5 GHz running Windows 7 Pro 64-bit with Visual Studio 2015, sorting 2^24 = 16,777,216 64-bit unsigned integers takes 7.98 seconds with heap sort, 1.59 seconds with bottom-up merge sort, and 1.65 seconds with top-down merge sort.
Both sort methods have the same time complexity, and both are asymptotically optimal. The time required to merge in a merge sort is counterbalanced by the time required to build the heap in heapsort. The merge sort requires additional space; the heapsort may be implemented with additional space but does not require it. Heapsort, however, is unstable, in that it doesn't guarantee to keep 'equal' elements in their original order. If you test both methods fairly and under the same conditions, the differences will be minimal.
Let's say I have an array of size n, and I want to divide it into k new arrays of size n/k. What would the running time of this step be? I thought that, just as splitting an array in two gives 2^x = n => x = log2(n) => O(log n), the same idea works here: k^x = n => x = logk(n). But what's next?
Now I run bubble sort (an O(n^2) algorithm) on each of the k arrays, and then use a merge algorithm on all k arrays to produce a single sorted array of size n; let's say the merge complexity is O(kn).
In addition, I want to find a k that minimizes the runtime of the algorithm. How can I do that? I thought taking the derivative of the runtime function and finding its minimum would work; is that the right way?
Merge sort splits the array into successively smaller pieces until it gets down to a bunch of 2-element subarrays. Then it begins to apply the merge algorithm on successively larger subarrays.
Imagine you have an array of 16 elements. The merge sort does its merges like this:
8 merges of two 1-item subarrays
4 merges of two 2-item subarrays
2 merges of two 4-item subarrays
1 merge of two 8-item subarrays
There are four (log2(16)) passes, and in each pass it examines every item. Each pass is O(n). So the running time for this merge sort is O(n * log2(n)).
Now, imagine you have an array with 81 items, and you want to merge it using a 3-way merge sort. Now you have the following sequence of merges:
27 merges of three 1-item subarrays (gives 27 3-item subarrays)
9 merges of three 3-item subarrays (gives 9 9-item subarrays)
3 merges of three 9-item subarrays (gives 3 27-item subarrays)
1 merge of three 27-item subarrays
There are four (log3(81)) passes. Each merge is O(m * log2(k)), where m is the total number of items to be merged and k is the number of lists. So the first pass has 27 merges that each do 3*log2(3) comparisons, the next pass has 9 merges that each do 9*log2(3) comparisons, and so on. It works out that the total merge sort is O(n * log3(n) * log2(3)).
You can see that the 3-way merge sort lets you do fewer passes (a 3-way merge sort of 16 items would require only three passes), but each pass is a little more expensive. What you have to determine is if:
n * logk(n) * log2(k) < n * log2(n)
Where k is the number of subarrays you want to split the array into. I'll let you do that math.
You have to be careful, though, because asymptotic analysis doesn't take into account real-world effects. For example, a 2-way merge is incredibly simple. When you go to a k-way merge where k > 2, you end up having to use a heap or other priority queue data structure, which has quite a bit of overhead. So even if the math above tells you that a 3-way merge sort should be faster, you'll want to benchmark it against the standard 2-way merge.
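To illustrate the point about the priority queue, here is a rough sketch of a k-way merge of in-memory sorted runs; every output element costs one O(log k) heap operation, which is exactly the overhead a plain 2-way merge avoids (the names and the int[]-entry encoding are arbitrary choices, not a reference implementation):

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    public class MultiWayMerge {

        // Merge k sorted int arrays with a min-heap. Each heap operation is
        // O(log k), which is the per-element overhead a plain 2-way merge avoids.
        static int[] mergeK(int[][] runs) {
            int total = 0;
            for (int[] run : runs) {
                total += run.length;
            }

            // Heap entries: {value, run index, position within that run}.
            PriorityQueue<int[]> heap =
                    new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
            for (int i = 0; i < runs.length; i++) {
                if (runs[i].length > 0) {
                    heap.add(new int[] {runs[i][0], i, 0});
                }
            }

            int[] result = new int[total];
            int out = 0;
            while (!heap.isEmpty()) {
                int[] e = heap.poll();
                result[out++] = e[0];
                int run = e[1];
                int next = e[2] + 1;
                if (next < runs[run].length) {
                    heap.add(new int[] {runs[run][next], run, next});
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // A, B and C from the example further up.
            int[][] runs = {{1, 3, 5, 7, 9}, {0, 2, 4, 6, 8}, {2, 3, 5, 7}};
            System.out.println(Arrays.toString(mergeK(runs)));
            // [0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9]
        }
    }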
Update
You're right. If you simplify the equation, you end up with the equations being the same. So the computational complexity is the same regardless of the value of k.
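For reference, the simplification is just the change-of-base identity logk(n) = log2(n) / log2(k):

    n * logk(n) * log2(k) = n * (log2(n) / log2(k)) * log2(k) = n * log2(n)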
That makes sense, because if k = n, then you end up with a heap sort.
So then you have to determine if there's a point where the merge overhead, which increases as k increases, is offset by the decreased number of passes. You'll probably need to determine that empirically.
Traditionally we use merge sort for external sorting, and the answer to this question has been dominated by one fact: a merge sort requires streaming data from multiple files and writing to a single one. The bottleneck is the streaming, not the CPU. If you try to stream from too many locations on a disk at once, the access pattern degrades into random seeks, and throughput on random seeks is terrible.
The right answer on your hardware will vary (especially if you are using SSDs), but traditional Unix sort settled on a 16-way merge as a reasonable default.
I am looking for hash functions that can be used to generate batches out of an integer stream. Specifically, I want to map integers xi from a set or stream (say X) to another set of integers or strings (say Y) such that many xi are mapped to one yj. While doing that, I want to ensure that at most n xi are mapped to a single yj. As with hashing, I need to be able to reliably find the y given an x.
I would also like most of the yj to have close to n xi mapped to them (to avoid a very sparse mapping from X to Y).
One function I can think of is quotient:
    int BATCH_SIZE = 3;

    public int map(int x) {
        return x / BATCH_SIZE;
    }
For a stream of sequential integers it can work fairly well; e.g., the stream 1..9 will be mapped to
1 -> 0
2 -> 0
3 -> 1
4 -> 1
5 -> 1
6 -> 2
7 -> 2
8 -> 2
9 -> 3
and so on. However, for non-sequential large integers and a small batch size (my use case), this can generate a very sparse mapping (each batch will have only one element most of the time).
Are there any standard ways to generate such a mapping (batching)?
There's no way to get it to work under these assumptions.
You need to know how many items are in the stream and their distribution or you need to relax the ability to map item to batch precisely.
Let's say you have items a and b from the stream.
Are you going to put them together in the same batch or not? You can't answer this unless you know if you're going to get more items to fill the 2 or more batches (if you decide to put them in separate batches).
If you know how many items there will be (even approximately), you can take their distribution and build batches based on that. Say you have string hashes (uniformly distributed over 32 bits). If you know you are getting 1,000,000 items and you want batches of 100, you can divide the hash space into intervals of 2^32 / (1,000,000 / 100) and use the interval number as the batch id (yj). This doesn't guarantee batches of exactly batch_size, but they should be approximately batch_size. If the distribution is not uniform, things are more difficult, but it can still be done.
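A rough sketch of that interval-based batching in Java, under the stated assumptions (roughly uniform 32-bit hashes, a known approximate item count); all names and parameters here are illustrative:

    public class IntervalBatcher {

        // Assumes hashes are roughly uniform over the full 32-bit range and that the
        // approximate number of items is known up front. Names and parameters are
        // illustrative only.
        private final long intervalWidth;

        IntervalBatcher(long expectedItems, long batchSize) {
            long numBatches = Math.max(1, expectedItems / batchSize);
            this.intervalWidth = Math.max(1, (1L << 32) / numBatches);  // width of one batch interval
        }

        // Maps a 32-bit hash to a batch id; the same x always lands in the same batch.
        int batchOf(int x) {
            long unsigned = x & 0xFFFFFFFFL;   // treat the hash as unsigned
            return (int) (unsigned / intervalWidth);
        }

        public static void main(String[] args) {
            IntervalBatcher batcher = new IntervalBatcher(1_000_000, 100);  // roughly 10,000 batches
            System.out.println(batcher.batchOf("some-key".hashCode()));
            System.out.println(batcher.batchOf(42));
        }
    }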
If you relax the ability to map an item to its batch, then just group items into batches of batch_size as they come out of the stream. You could keep a map from stream item to batch if you have the space.
I have a large array (~400,000,000 entries) with integers from {0, 1, ..., 8}.
So I need 4 bits per entry. Around 200 MB.
At the moment I use a byte-array and save 2 numbers in each entry.
I wonder if there is a good method to compress this array. I did some quick research and found algorithms like Huffman coding or LZW. But those algorithms are all about compressing the data, sending the compressed data to someone, and decompressing it.
I just want a table that takes less memory so I can load it into RAM. The 200 MB table fits easily, but I'm thinking of even bigger tables.
It is important that I can still determine the values at certain positions.
Any tips?
Further information:
I just did a little experimenting, and it turns out that on average 2.14 consecutive numbers have the same value.
There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights.
So more than half of the numbers are 6s.
I only need a fast getValue(idx) function; setValue(idx, value) is not important.
It depends on what your data looks like. Are there repeating entries, or do the values change only slowly?
If so, you can try compressing chunks of your data and decompressing them when needed. The bigger the chunks, the more memory you can save and the worse the speed; IMHO, not a great deal. You could also store the data compressed and decompress it in memory.
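As a sketch of that chunked idea, one could compress fixed-size blocks of the packed array with java.util.zip and decompress a block only when an index inside it is requested; whether this pays off depends entirely on the regularities in the data, and the names below are illustrative:

    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class BlockCompressor {

        // Compress one fixed-size block of the packed byte array.
        static byte[] compressBlock(byte[] block) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(block);
            deflater.finish();
            // Buffer sized for the worst case (incompressible data grows slightly).
            byte[] buf = new byte[block.length + block.length / 100 + 64];
            int length = deflater.deflate(buf);
            deflater.end();
            return Arrays.copyOf(buf, length);
        }

        // Decompress a block on demand when an index inside it is requested.
        static byte[] decompressBlock(byte[] compressed, int originalLength) throws DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            inflater.inflate(out);
            inflater.end();
            return out;
        }
    }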
Otherwise, i.e., in the case of no regularities, you'll need at least log(9) / log(2) = 3.17 bits per entry, and there's nothing that can improve on that.
You can come pretty close to this value by packing 5 numbers into a short. As 9^5 = 59049 < 65536 = 2^16, it fits nearly perfectly. You'll need 3.2 bits per number; no big win. Packing five numbers is done via this formula
a + 9 * (b + 9 * (c + 9 * (d + 9 * e)))
and unpacking is trivial via a precomputed table.
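A small Java sketch of that packing scheme, including the precomputed unpack table (the class and method names are illustrative):

    public class Base9Packer {

        // 9^5 = 59049 possible tuples, each fits in an unsigned 16-bit value.
        private static final int TUPLES = 59049;

        // Precomputed unpack table: UNPACKED[code][i] is the i-th number of the tuple.
        private static final byte[][] UNPACKED = new byte[TUPLES][5];
        static {
            for (int code = 0; code < TUPLES; code++) {
                int rest = code;
                for (int i = 0; i < 5; i++) {
                    UNPACKED[code][i] = (byte) (rest % 9);
                    rest /= 9;
                }
            }
        }

        // a + 9 * (b + 9 * (c + 9 * (d + 9 * e))), each argument in 0..8.
        static int pack(int a, int b, int c, int d, int e) {
            return a + 9 * (b + 9 * (c + 9 * (d + 9 * e)));
        }

        // Returns the i-th number (i in 0..4) stored in the packed code.
        static int unpack(int code, int i) {
            return UNPACKED[code][i];
        }

        public static void main(String[] args) {
            int code = pack(6, 5, 6, 7, 4);
            System.out.println(unpack(code, 0) + " " + unpack(code, 3));  // prints "6 7"
        }
    }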
UPDATE after question update
Further information: I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.
The fact that on average about 2.14 consecutive numbers are the same could lead to some compression, but in this case it tells us nothing. There are nearly only fives and sixes, so encountering two equal consecutive numbers is more or less implied.
Given these facts, you can forget my optimization above. There are practically only 8 values there, as you can treat the single zero separately. So you need just 3 bits per value and a single index for the zero.
You can even create a HashMap for all values below four or above seven, store those 1+154+10373+385990+80 entries in it, and use only 2 bits per remaining value. But this is still far from ideal.
Assuming no regularities, you'd need 1.44 bits per value, as this is the entropy. You could go over all 5-tuples, compute their histogram, and use 1 byte to encode each of the 255 most frequent tuples. All the remaining tuples would map to the 256th value, telling you that you have to look in a HashMap for the rare tuple's value.
Some evaluation
I was curious whether it could work. Packing 5 numbers into one byte needs 85996340 bytes. There are nearly 5 million tuples which don't fit, so my idea was to use a hash map for them. Assuming rehashing rather than chaining, it makes sense to keep it maybe 50% full, so we need 10 million entries. Assuming a TIntShortHashMap (mapping indexes to tuples), each entry takes 6 bytes, leading to 60 MB. Too bad.
Packing only 4 numbers into one byte consumes 107495425 bytes and leaves 159531 tuples which don't fit. This looks better; however, I'm sure the denser packing could be improved a lot.
The results as produced by this little program:
*** Packing 5 numbers in a byte. ***
Normal packed size: 85996340.
Number of tuples in need of special handling: 4813535.
*** Packing 4 numbers in a byte. ***
Normal packed size: 107495425.
Number of tuples in need of special handling: 159531.
There are many options - most depend on how your data looks. You could use any of the following and even combinations of them.
LZW - or variants
In your case a variant that uses a 4-bit initial dictionary would probably be a good start.
You could compress your data in blocks so you could use the index requested to determine which block to decode on the fly.
This would be a good fit if there are repeating patterns in your data.
Difference Coding
Your edit suggests that your data may benefit from a differencing pass. Essentially you replace every value with the difference between it and its predecessor.
Again you would need to treat your data in blocks and difference fixed run lengths.
You may also find that differencing followed by LZW would be a good solution.
Fourier Transform
If some data loss would be acceptable then some of the Fourier Transform compression schemes may be effective.
Lossless JPEG
If your data has a 2-dimensional aspect then some of the JPEG algorithms may lend themselves well to it.
The bottom line
You need to bear in mind:
The longer time you spend compressing - up to a limit - the better compression ratio you can achieve
There is a real practical limit to how far you can go with lossless compression.
Once you go lossy you are essentially no longer restricted. You could approximate the whole of your data with new int[]{6} and get quite a few correct results.
Since more than half of the entries are sixes, just encode those as a single bit. Use 2 bits for the second most common value, and so on. Then you have something like this:
         # entries   bit pattern  # bits   total # of bits
zero             1   000000001         9                 9
ones           154   0000001           7              1078
twos         10373   000001            6             62238
threes      385990   00001             5           1929950
fours      8146188   0001              4          32584752
fives     85008968   01                2         170017936
sixes    265638366   1                 1         265638366
sevens    70791576   001               3         212374728
eights          80   00000001          8               640
----------------------------------------------------------
Total                                       682609697 bits
With 429981696 entries encoded in 682609697 bits, you would then need 1.59 bits per entry on average.
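If you want to double-check that arithmetic, here is a tiny snippet that reproduces the totals from the table above:

    public class CodeLengthCheck {
        public static void main(String[] args) {
            // Counts and code lengths taken from the table above.
            long[] counts  = {1, 154, 10373, 385990, 8146188, 85008968, 265638366, 70791576, 80};
            int[]  lengths = {9,   7,     6,      5,       4,        2,         1,        3,  8};

            long totalEntries = 0;
            long totalBits = 0;
            for (int i = 0; i < counts.length; i++) {
                totalEntries += counts[i];
                totalBits += counts[i] * lengths[i];
            }
            System.out.println(totalEntries);                       // 429981696
            System.out.println(totalBits);                          // 682609697
            System.out.println((double) totalBits / totalEntries);  // about 1.59 bits per entry
        }
    }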
Edit:
To allow for fast lookup, you can build an index into the compressed data that records where every n-th entry starts. Finding an exact value then involves decompressing n/2 entries on average. Depending on how fast it needs to be, you can adjust the number of entries in the index. To reduce the size of each pointer into the compressed data (and thus the size of the index), store an estimate of the position and just the offset from that estimate.
  entry no   actual position   estimated pos (n * 1.59)   offset from estimate
         0                 0                          0                      0
       100               162                        159                      3
       200               332                        318                     14
       300               471                        477                     -6
       400               642                        636                      6
       500               807                        795                     12
       600               943                        954                    -11
Only the rightmost column (the offset from the estimate) needs to be stored as the index.
The overhead for such an index, with one entry per 100 values and 10 bits per offset, would be 0.1 bits extra per entry.
There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours,
85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights
Total = 429981696 symbols
Assuming the symbols appear in random order, the entropy bound says you cannot do better than 618297161.7 bits ~ 73.707 MB, or on average 1.438 bits per symbol.
The minimum number of bits is SUM(count[i] * log2(429981696 / count[i])).
You can achieve this size using a range coder.
Given SQRT(N) = SQRT(429981696) = 20736:
You can again achieve O(SQRT(N)) complexity for accessing a random element by saving the arithmetic decoder state after every SQRT(N) decoded symbols, i.e. one saved state for each k = 0 .. CEIL(SQRT(N)) - 1. This allows fast decoding of the next block of 20736 symbols.
The complexity of accessing an element drops to O(1) if you access the memory stream in a linear way.
Additional memory used: 20736 * 4 = 81KB.
How about considering a caching solution, like MapDB or Apache JCS? These let you persist the collection to disk, enabling you to work with very large lists.
You should look into a BitSet to store it most efficiently. Contrary to what the name suggests, it is not exactly a set: it is ordered and you can access it by index.
Internally it uses an array of longs to store the bits and hence can update itself by using bit masks.
I don't believe you can store it any more efficiently natively, if you want even more efficiency, then you should consider packing/compression algorithms.
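As a sketch of that idea, here is a 4-bit-per-entry wrapper around BitSet providing the getValue(idx) the question asks for (class and method names are illustrative; for ~400,000,000 entries the required bit count still fits within BitSet's int indexing):

    import java.util.BitSet;

    public class FourBitArray {

        // Stores one value in the range 0..15 (here 0..8) per 4-bit slot.
        private final BitSet bits;

        FourBitArray(int size) {
            this.bits = new BitSet(size * 4);
        }

        int getValue(int idx) {
            int base = idx * 4;
            int value = 0;
            for (int i = 0; i < 4; i++) {
                if (bits.get(base + i)) {
                    value |= 1 << i;
                }
            }
            return value;
        }

        void setValue(int idx, int value) {
            int base = idx * 4;
            for (int i = 0; i < 4; i++) {
                bits.set(base + i, (value & (1 << i)) != 0);
            }
        }

        public static void main(String[] args) {
            FourBitArray array = new FourBitArray(1000);
            array.setValue(42, 6);
            System.out.println(array.getValue(42));  // prints 6
        }
    }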
I have a bunch of collected GPS points, and now I need to match them against 18000 other points. I have both sets in ArrayLists. Is there a better way to search? I am doing this in Java.
Here is a sample of the data. The points contain one additional parameter, ID1, by which a set of points can be grouped.
ID1 ID2 ID3 longi lati,
2 1 1 -79.911635 39.609849,
2 1 2 -79.91151 39.60956,
2 1 3 -79.9115 39.609489,
2 1 4 -79.911496 39.609433,
3 1 1 -79.908162 39.609841,
3 1 2 -79.908447 39.610019,
4 1 1 -79.911136 39.608433,
4 1 2 -79.910961 39.608446,
4 1 3 -79.910629 39.608451,
4 1 4 -79.910064 39.608493,
4 1 5 -79.909117 39.608586,
If you are looking for exact matches, then you can place the points in a set (both HashSet and TreeSet will work) and find the intersection (in Java, set1.retainAll(set2)). You will have to implement compareTo() or hashCode() accordingly, and equals() in any case, but that is the easy scenario.
If you are looking for "closer than X", you should use a quadtree. Place all the points from the first ArrayList in a quadtree, and then perform quick lookups using this data structure (which can yield the closest point in O(log N) per lookup instead of the O(N) per lookup of the brute-force approach). There is an open-source implementation of a quadtree in, for example, GeoTools.
You could also use the spatial index known as an R-tree. It is usually faster than a quadtree.
For example, this paper finds it to be 2-3 times faster in Oracle databases: http://pdf.aminer.org/000/300/406/incorporating_updates_in_domain_indexes_experiences_with_oracle_spatial_r.pdf
The Java Topology Suite (JTS) contains a good implementation of the R-tree: http://www.vividsolutions.com/jts/javadoc/com/vividsolutions/jts/index/strtree/STRtree.html
Note that GeoTools is based on JTS, so there may well also be an R-tree lurking inside its spatial index functionality: http://docs.geotools.org/latest/userguide/library/main/collection.html
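A minimal sketch of a "closer than X" lookup with the JTS STRtree mentioned above, assuming points are held as JTS Coordinate objects and the tolerance is given in degrees; the class and method names are illustrative, and the returned candidates still need an exact distance check:

    import com.vividsolutions.jts.geom.Coordinate;
    import com.vividsolutions.jts.geom.Envelope;
    import com.vividsolutions.jts.index.strtree.STRtree;

    import java.util.List;

    public class NearbyPointFinder {

        private final STRtree index = new STRtree();

        // Index the reference points once (e.g. the 18000-point list).
        void add(double longitude, double latitude) {
            Coordinate c = new Coordinate(longitude, latitude);
            index.insert(new Envelope(c), c);
        }

        // Returns all indexed points whose bounding box lies within `tolerance`
        // degrees of the query point; refine with an exact distance check.
        @SuppressWarnings("unchecked")
        List<Coordinate> candidatesNear(double longitude, double latitude, double tolerance) {
            Envelope search = new Envelope(new Coordinate(longitude, latitude));
            search.expandBy(tolerance);
            return index.query(search);
        }

        public static void main(String[] args) {
            NearbyPointFinder finder = new NearbyPointFinder();
            finder.add(-79.911635, 39.609849);
            finder.add(-79.908162, 39.609841);
            System.out.println(finder.candidatesNear(-79.9115, 39.6095, 0.001));
        }
    }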