Given a hard drive with 120 GB, 100 GB of which is filled with strings of length 256, and 2 GB of RAM, how do I sort those strings in Java most efficiently?
How long will it take?
A1. You probably want to implement some form of merge-sort.
A2: Longer than it would if you had 256GB RAM on your machine.
Edit: stung by criticism, I quote from Wikipedia's article on merge sort:
Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.
For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.
Here is how I'd do it:
Phase 1 is to split the 100GB into 50 partitions of 2GB, read each of the 50 partitions into memory, sort it using quicksort, and write it out. You want the sorted partitions at the top end of the disc.
Phase 2 is to then merge the 50 sorted partitions. This is the tricky bit because you don't have enough space on the disc to store the partitions AND the final sorted output. So ...
1. Do a 50-way merge to fill the first 20GB at the bottom end of the disc.
2. Slide the remaining data in the 50 partitions to the top to make another 20GB of free space contiguous with the end of the first 20GB.
3. Repeat steps 1. and 2. until completed.
This does a lot of disc IO, but you can use your 2GB of memory for buffering in the copying and merging steps, which improves data throughput by minimizing the number of disc seeks and doing large data transfers.
EDIT - @meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release disc space used by partitions (phase 2, step 2) by simply truncating the partition files.
The potential downsides of this are increased disk fragmentation, and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux / UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)
Finally, I'd like to point out that any theoretical predictions of the time taken by this algorithm (and others) are largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for "back of the envelope" calculations to give reliable answers. A proper treatment would require actual implementation, tuning and benchmarking.
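The two phases described in this answer can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: it assumes the strings are stored one per line in a text file, it writes the sorted partitions ("runs") as temporary files, and it ignores the disc-space problem that the sliding or truncating tricks address. The names ExternalSortSketch, sortToRuns, mergeRuns and maxLinesPerRun are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {

    // Phase 1: read the input in chunks of at most maxLinesPerRun lines,
    // sort each chunk in memory, and write it out as a sorted run file.
    static List<Path> sortToRuns(Path input, int maxLinesPerRun) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == maxLinesPerRun) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(writeRun(chunk));
            }
        }
        return runs;
    }

    // Sorts one chunk and stores it in a temporary file (location is an assumption).
    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    // One open run plus the line currently at its head.
    private static final class Head {
        final BufferedReader reader;
        String line;

        Head(BufferedReader reader, String line) {
            this.reader = reader;
            this.line = line;
        }
    }

    // Phase 2: k-way merge of the sorted runs, always emitting the smallest head.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Head> heads = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                String first = r.readLine();
                if (first != null) heads.add(new Head(r, first)); else r.close();
            }
            while (!heads.isEmpty()) {
                Head h = heads.poll();
                out.write(h.line);
                out.newLine();
                h.line = h.reader.readLine();
                if (h.line != null) heads.add(h); else h.reader.close();
            }
        }
    }
}
```

In a real run the chunk size would be chosen from the available heap rather than a line count, and the readers would be given large buffers so that each partition is read in big sequential transfers.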
I am basically repeating Krystian's answer, but elaborating:
Yes you need to do this more-or-less in place, since you have little RAM available. But naive in-place sorts would be a disaster here just due to the cost of moving strings around.
Rather than actually move strings around, just keep track of which strings should swap with which others and actually move them, once, at the end, to their final spot. That is, if you had 1000 strings, make an array of 1000 ints. array[i] is the location where string i should end up. If array[17] == 133 at the end, it means string 17 should end up in the spot for string 133. array[i] == i for all i to start. Swapping strings, then, is just a matter of swapping two ints.
Then, any in-place algorithm like quicksort works pretty well.
The running time is surely dominated by the final move of the strings. Assuming each one moves, you're moving around about 100GB of data in reasonably-sized writes. I might assume the drive / controller / OS can move about 100MB/sec for you. So, 1000 seconds or so? Roughly 17 minutes?
But does it fit in memory? You have 100GB of strings, each of which is 256 bytes. How many strings? 100 * 2^30 / 2^8, or about 419M strings. You need 419M ints, each is 4 bytes, or about 1.7GB. Voila, fits in your 2GB.
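The arithmetic above can be checked with a few lines of code. This is only a worked calculation; the class and method names are hypothetical, and it counts just the raw int array, not JVM overhead or whatever memory the sort itself needs.

```java
class IndexArrayBudget {

    // Number of fixed-size records that fit in dataBytes.
    static long stringCount(long dataBytes, int recordBytes) {
        return dataBytes / recordBytes;
    }

    // Bytes needed for one 4-byte int index per record.
    static long indexBytes(long count) {
        return count * Integer.BYTES;
    }

    public static void main(String[] args) {
        long count = stringCount(100L << 30, 256);   // 100 * 2^30 / 2^8
        System.out.println(count);                   // 419430400
        System.out.println(indexBytes(count));       // 1677721600
    }
}
```

So the index array alone takes about 1.68 * 10^9 bytes, which leaves only a few hundred megabytes of the 2GB for everything else.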
Sounds like a task that calls for an external sorting method. Volume 3 of "The Art of Computer Programming" contains a section with extensive discussion of external sorting methods.
I think you should use BogoSort. You might have to modify the algorithm a bit to allow for inplace sorting, but that shouldn't be too hard. :)
You should use a trie (aka a prefix tree) to build a tree-like structure that allows you to easily walk through your strings in an ordered manner by comparing their prefixes. In fact, you don't need to store it in memory. You can build the trie as a tree of directories on your file system (obviously, not the one which the data is coming from).
AFAIK, merge-sort requires as much free space as you have data. This may be a requirement for any external sort that avoids random access, though I'm not sure of this.
I have a Multiobjective Particle Swarm Optimization algorithm for a complex problem. It uses a big population (4000 particles) and is a time-consuming simulation (4 - 6 hours of execution).
As the algorithm keeps an archive, a repository of best solutions found so far, in order to analyze algorithm convergence and behavior I need to save some data from this repository and sometimes from the entire population at each iteration.
Currently in each iteration I'm (Java speaking) copying some attributes from the particle's object (from the repository and/or the population), formatting it to a StringBuffer in a method that runs in a separate thread from the simulation and, only at the end of the program execution I save it to a text file.
I think my algorithm is consuming memory in a bad way by doing this. But thinking also about performance I don't know what is the best way to save all these data: should I follow the same logic but save a .txt file each iteration instead of doing it by the end of the algorithm? Or should I save to a database? If so, should I save it in each iteration or at the end or another time? Or should I approach it differently somehow?
Edit: Repository data are often in a [5 - 10] MB range while the Population data occupies [100 - 200]MB memory. Every time I run the program I need about 20 simulations to analyze average convergence.
StringBuffer uses an array to keep characters, which is a contiguous area of memory. Whenever it needs to be expanded, it creates a new array that is roughly twice as big. Usually that's enough for most applications, but if you think that this buffer can be really big and want to eliminate the overhead of managing a contiguous block of memory, you can replace it with a list of Strings (or StringBuffers). This will require more memory, but it doesn't require this memory to be contiguous.
I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words / strings. They have duplication, almost 1 duplicate word every 100 words.
In my second program, I want to read the file. I can successfully read the file line by line using BufferedReader.
Now to remove duplicates, we can use Set (and its implementations), but Set has problems, as described in the following 3 scenarios:
With the default JVM size, the Set can hold up to 0.7-0.8 million words before an OutOfMemoryError.
With a 512M JVM size, the Set can hold up to 5-6 million words before an OOM error.
With a 1024M JVM size, the Set can hold up to 12-13 million words before an OOM error. Here, after adding 10 million records to the Set, operations become extremely slow. For example, adding the next ~4000 records took 60 seconds.
I have restrictions that I can't increase the JVM size further, and I want to remove duplicate words from the file.
Please let me know if you have any idea about any other ways/approaches to remove duplicate words using Java from such a gigantic file. Many Thanks :)
Addition of info to question: My words are basically alpha-numeric and they are IDs which are unique in our system. Hence they are not plain English words.
Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the latest word added to output in RAM and compare the candidates to it as well).
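Removing duplicates while merging could look something like the sketch below. It merges two already-sorted sequences and remembers only the last word written, as the answer suggests. The iterators stand in for sorted run files and the name DedupMerge is hypothetical; the sketch assumes no null elements.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class DedupMerge {

    // Merges two sorted sequences, emitting each distinct word exactly once.
    static List<String> mergeDistinct(Iterator<String> a, Iterator<String> b) {
        List<String> out = new ArrayList<>();
        String x = a.hasNext() ? a.next() : null;
        String y = b.hasNext() ? b.next() : null;
        String last = null; // latest word added to the output
        while (x != null || y != null) {
            String next;
            if (y == null || (x != null && x.compareTo(y) <= 0)) {
                next = x;
                x = a.hasNext() ? a.next() : null;
            } else {
                next = y;
                y = b.hasNext() ? b.next() : null;
            }
            // Duplicates are adjacent in sorted order, so one comparison is enough.
            if (!next.equals(last)) {
                out.add(next);
                last = next;
            }
        }
        return out;
    }
}
```

For a file-based version the output list would be replaced by a writer, so that only the current heads and the last written word are held in RAM.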
Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.
Process each of the letter files separately using a Set to remove duplicates.
You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem. Lookup and insert are quick. And its representation is relatively space efficient. You might be able to represent all of your words in RAM.
If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.
There is code here you could use to mergesort the large file:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
For large files I try not to read the data into memory but instead operate on a memory mapped file and let the OS page in/out memory as needed. If your set structures contain offsets into this memory mapped file instead of the actual strings it would consume significantly less memory.
Check out this article:
http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html
Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?
For WORDS in a common spoken language one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in, check it against a dictionary, if found skip it, if not found add it to the dictionary and write it out.
In this case your dictionary is only a few thousand words large. And you don't need to retain the source file since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).
If you have the possibility to insert the words into a temporary table of a database (using batch inserts), then it would just be a select distinct against that table.
One classic way to solve this kind of problem is a Bloom filter. Basically you hash your word a number of times and for each hash result set some bits in a bit vector. If you're checking a word and all the bits from its hashes are set in the vector you've probably (you can set this probability arbitrarily low by increasing the number of hashes/bits in the vector) seen it before and it's a duplicate.
This was how early spell checkers worked. They knew if a word was in the dictionary, but they couldn't tell you what the correct spelling was, because the filter can only tell you whether the current word has been seen.
There are a number of open source implementations out there, including java-bloomfilter.
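To make the idea concrete, here is a minimal Bloom filter sketch built on java.util.BitSet. It is not the API of java-bloomfilter or any other library; the class name, the method seenBefore, and the way the bit positions are derived from String.hashCode() are all assumptions made for illustration. Note that a false positive means a word that is actually new gets reported as a duplicate, so this only suits cases where dropping a small fraction of unique words is acceptable.

```java
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    // The mixing constant is an arbitrary choice for this sketch.
    private int position(String word, int i) {
        int h1 = word.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B1, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    // Returns true if the word was probably seen before, false if it is
    // definitely new. Either way the word is recorded in the filter.
    boolean seenBefore(String word) {
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            int p = position(word, i);
            if (!bits.get(p)) {
                seen = false;
                bits.set(p);
            }
        }
        return seen;
    }
}
```

A word would be written to the output only when seenBefore returns false. The bit vector size and the number of hashes control the false-positive rate.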
I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.
This is what I mean (in pseudo code):
Input parameters: Offset, Size
Allocate searchable structure of size Size (=Set, but need not be one)
Read Offset elements from stdin (or until EOF is encountered) and just copy them to stdout
Read Size elements from stdin (or until EOF), store them in Set. If duplicate, drop, else write to stdout.
Read elements from stdin until EOF, if they are in Set then drop, else write to stdout
Now pipe as many instances as you need (If storage is no problem, maybe only as many as you have cores) with increasing Offsets and sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines, if you are in a hurry.
Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80000 words. Based on that, you could just use a HashSet and add all your words to it (probably in all lower case to avoid case issues):
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
Set<String> words = new HashSet<String>();
Scanner in = new Scanner(System.in); // assumes the words arrive on standard input
while (in.hasNext()) {
    words.add(in.next().toLowerCase());
}
If they are real words, this isn't going to cause memory problems, and will be pretty fast too!
To not have to worry too much about implementation, you should use a database system, either plain old relational SQL or a NoSQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudo code)
for (word : stream) {
    if (!DB.exists(word)) {
        DB.put(word)
        outstream.add(word)
    }
}
The problem is in essence easy: you need to store things on disk because there is not enough memory, then either use sorting O(N log N) (unnecessary) or hashing O(N) to find the unique words.
If you want a solution that will very likely work but is not guaranteed to do so, use an LRU type hash table. According to the empirical Zipf's law you should be OK.
A follow up question to some smart guy out there, what if I have 64-bit machine and set heap size to say 12GB, shouldn't virtual memory take care of the problem (although not in an optimal way) or is java not designed this way?
Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.
The most performant solutions arise from omitting unnecessary stuff. You are only looking for duplicates, so do not store the words themselves, store hashes. But wait, you are not interested in the hashes either, only in whether they were seen already, so do not store them. Treat the hash as a really large number, and use a bitset to record whether you have already seen this number.
So your problem boils down to a really big, sparsely populated bitmap, with its size depending on the hash width. If your hash is up to 32 bits, you can use riak bitmap.
... gone thinking about really big bitmap for 128+ bit hashes %) (I'll be back )
I am sorting a number of integers from a file, which will probably be too large to fit into memory in one go. My current idea is to sort chunks with quicksort, then mergesort them together. I would like to make the chunks as big as possible, so I'd like to know how much I can read in one go.
I know about Runtime.freeMemory(), but how should I go about using it? Should I carefully work out what other variables I use in the program then create an array of size (freeMemory - variablesSizes), or is that too likely to go wrong?
Thanks!
Experiment until you find a size that works well. The largest array you can allocate on the heap isn't necessarily the fastest way to do it. In many circumstances, the entire heap does not fit in the computer's RAM, and might be swapped out in parts. Just because you can allocate a huge array does not mean it will be the best size for optimizing speed.
Some adaptive approach would probably be best (testing number of items sorted/second depending on array size) and adjusting for what you can fit without getting an OutOfMemoryError.
Simpler: stick with some large value that works well, but isn't necessarily the largest you can use.
Or: use an external library/database to do what you want - working with huge amounts of data is tricky to get right in general, and you will probably get better performance and shorter development time if you don't reinvent the wheel.
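As a starting point for such experiments, the heap that is still available can be estimated from the Runtime methods mentioned in the question. This is a rough sketch under the assumption that only a fraction of the remaining heap should go to the chunk; the class name ChunkSizer and the fraction parameter are hypothetical, and the result is an estimate, not a guarantee that the allocation succeeds.

```java
class ChunkSizer {

    // Estimates how many ints fit in the given fraction of the heap
    // that the JVM could still hand out.
    static int intsPerChunk(double fraction) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used; // maxMemory reflects -Xmx
        long ints = (long) (available * fraction) / Integer.BYTES;
        // Arrays cannot be longer than roughly Integer.MAX_VALUE elements.
        return (int) Math.min(ints, Integer.MAX_VALUE - 8);
    }

    public static void main(String[] args) {
        int chunkSize = intsPerChunk(0.5);
        int[] chunk = new int[chunkSize];
        System.out.println("chunk holds " + chunk.length + " ints");
    }
}
```

Using freeMemory() alone tends to underestimate, because it only reports free space in the heap allocated so far, not what the heap may still grow to.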
I'd start with a relatively small chunk size for the first chunk. Then I'd double the chunk for every next chunk until you get an OutOfMemoryError. Though that will probably trigger swapping.
I think figuring out exactly how much memory we can allocate is a sticky business, as by default in Java the JVM will allocate a heap space of 256M, but this can always be increased using -Xmx, so it is best to trade performance for portability by having a fixed chunk size of, let's say, around 150M.
If you go with Java's built-in Collections sorting functionality, you will have to use a Collection of some sort, which will not take int primitive types, but rather, you will have to use Integer objects. (List<Integer>)
In my experience (not to be taken as gospel), an int weighs in at (obviously) 4 bytes of RAM, whereas an Integer weighs in at 12 bytes on a 32-bit machine and 24 bytes on a 64-bit machine.
If you need to minimize memory footprint, use int[]; Arrays.sort(int[]) sorts primitive arrays directly, so you only need your own sorter if you want a custom ordering.
However, it might be easier all the way around to use List<Integer> and the built-in sorting functions, and just deal with having more, smaller Lists.
To answer the question though, you should definitely look at the Merge-Sort angle of attack to this problem and just pick an arbitrary List size to start with. You will likely find, after some experimentation, that there is a trade off between list size and number of chunks. Find the sweet spot and tell us your results!
My engine is executing 1,000,000 simulations on X deals. During each simulation, for each deal, a specific condition may be verified. In this case, I store the value (which is a double) into an array. Each deal will have its own list of values (i.e. these values are independent from one deal to another).
At the end of all the simulations, for each deal, I run an algorithm on its List<Double> to get some outputs. Unfortunately, this algorithm requires the complete list of these values, and thus, I am not able to modify my algorithm to calculate the outputs "on the fly", i.e. during the simulations.
In "normal" conditions (i.e. X is low, and the condition is verified less than 10% of the time), the calculation ends correctly, even if this may be enhanced.
My problem occurs when I have many deals (for example X = 30) and almost all of my simulations verify my specific condition (let's say 90% of simulations). So just to store the values, I need about 900,000 * 30 * 64 bits of memory (about 216MB). One of my future requirements is to be able to run 5,000,000 simulations...
So I can't continue with my current way of storing the values. For the moment, I used a "simple" structure of Map<String, List<Double>>, where the key is the ID of the element, and List<Double> the list of values.
So my question is how can I enhance this specific part of my application in order to reduce the memory usage during the simulations?
Another important note is that for the final calculation, my List<Double> (or whatever structure I will be using) must be ordered. So if the solution to my previous question also provides a structure that orders newly inserted elements (such as a SortedMap), that would be really great!
I am using Java 1.6.
Edit 1
My engine is executing some financial calculations indeed, and in my case, all deals are related. This means that I cannot run my calculations on the first deal, get the output, clean the List<Double>, and then move to the second deal, and so on.
Of course, as a temporary solution, we will increase the memory allocated to the engine, but it's not the solution I am expecting ;)
Edit 2
Regarding the algorithm itself. I can't give the exact algorithm here, but here are some hints:
We must work on a sorted List<Double>. I will then calculate an index (which is calculated against a given parameter and the size of the List itself). Then, I finally return the index-th value of this List.
public static double algo(double input, List<Double> sortedList) {
    if (someSpecificCases) {
        return 0;
    }
    // Calculate the index value, using input and also size of the sortedList...
    double index = ...;
    // Specific case where I return the first item of my list.
    if (index == 1) {
        return sortedList.get(0);
    }
    // Specific case where I return the last item of my list.
    if (index == sortedList.size()) {
        return sortedList.get(sortedList.size() - 1);
    }
    // Here, I need the index-th value of my list...
    double val = sortedList.get((int) index);
    double finalValue = someBasicCalculations(val);
    return finalValue;
}
I hope it will help to have such information now...
Edit 3
Currently, I will not consider any hardware modification (too long and complicated here :( ). The solution of increasing the memory will be done, but it's just a quick fix.
I was thinking of a solution that uses a temporary file: until a certain threshold (for example 100,000), my List<Double> stores new values in memory. When the size of the List<Double> reaches this threshold, I append this list to the temporary file (one file per deal).
Something like that:
public void addNewValue(double v) {
    if (list.size() == 100000) {
        appendListInFile();
        list.clear();
    }
    list.add(v);
}
At the end of the whole calculation, for each deal, I will reconstruct the complete List<Double> from what I have in memory and also in the temporary file. Then, I run my algorithm. I clean the values for this deal, and move to the second deal (I can do that now, as all the simulations are now finished).
What do you think of such solution? Do you think it is acceptable?
Of course I will lose some time to read and write my values in an external file, but I think this can be acceptable, no?
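A possible shape for this temporary-file idea is sketched below, under several assumptions: one store object per deal, values kept in a primitive double[] buffer instead of a List<Double>, and a binary temporary file written with DataOutputStream. The names SpillingDoubleStore and readAllSorted are hypothetical, the threshold is assumed to be at least 1, and the sketch uses try-with-resources, which needs Java 7 or later (on Java 1.6 the streams would have to be closed in finally blocks).

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

class SpillingDoubleStore {
    private final double[] buffer;   // in-memory values not yet written
    private int buffered = 0;
    private long spilled = 0;        // number of values already in the file
    private final File file;

    SpillingDoubleStore(int threshold) throws IOException {
        this.buffer = new double[threshold];
        this.file = File.createTempFile("deal", ".bin");
        this.file.deleteOnExit();
    }

    void addNewValue(double v) throws IOException {
        if (buffered == buffer.length) {
            appendBufferToFile();
        }
        buffer[buffered++] = v;
    }

    // Appends the buffered values to the temporary file and empties the buffer.
    private void appendBufferToFile() throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file, true)))) {
            for (int i = 0; i < buffered; i++) {
                out.writeDouble(buffer[i]);
            }
        }
        spilled += buffered;
        buffered = 0;
    }

    // Rebuilds the complete list for this deal (file plus buffer) and sorts it.
    double[] readAllSorted() throws IOException {
        double[] all = new double[Math.toIntExact(spilled + buffered)];
        int n = 0;
        if (spilled > 0) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(file)))) {
                while (n < spilled) {
                    all[n++] = in.readDouble();
                }
            }
        }
        System.arraycopy(buffer, 0, all, n, buffered);
        Arrays.sort(all);
        return all;
    }
}
```

With this layout only one deal's complete array is in memory at the end, while during the simulations each deal holds at most threshold values.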
Your problem is algorithmic and you are looking for a "reduction in strength" optimization.
Unfortunately, you've been too coy in the problem description and say "Unfortunately, this algorithm requires the complete list of these values...", which is dubious. The simulation run has already passed a predicate, which in itself tells you something about the sets that pass through the sieve.
I expect the data that meets the criteria has a low information content and therefore is amenable to substantial compression.
Without further information, we really can't help you more.
You mentioned that the "engine" is not connected to a database, but have you considered using a database to store the lists of elements? Possibly an embedded DB such as SQLite?
If you used int or even short instead of string for the key field of your Map, that might save some memory.
If you need a collection object that guarantees order, then consider a Queue or a Stack instead of your List that you are currently using.
Possibly think of a way to run deals sequentially, as Dommer and Alan have already suggested.
I hope that was of some help!
EDIT:
Your comment about only having 30 keys is a good point.
In that case, since you have to calculate all your deals at the same time, have you considered serializing your Lists to disk (e.g. as XML)?
Or even just writing a text file to disk for each List, then after the deals are calculated, loading one file/List at a time to verify that List of conditions?
Of course the disadvantage is slow file IO, but this would reduce your server's memory requirement.
Can you get away with using floats instead of doubles? That would save you about 100MB.
Just to clarify, do you need ALL of the information in memory at once? It sounds like you are doing financial simulations (maybe credit risk?). Say you are running 30 deals, do you need to store all of the values in memory? Or can you run the first deal (~900,000 * 64bits), then discard the list of double (serialize it to disk or something) and then proceed with the next? I thought this might be okay as you say the deals are independent of one another.
Apologies if this sounds patronising; I'm just trying to get a proper idea of the problem.
The flippant answer is to get a bunch more memory. Sun JVMs can (almost happily) handle multi-gigabyte heaps, and if it's a batch job then longer GC pauses might not be a massive issue.
If you decide that this is not a sane solution, the first thing to attempt would be to write a custom list-like collection that stores primitive doubles instead of Double wrapper objects. This saves the per-object overhead you pay for each Double. I think the Apache Commons Collections project had primitive collection implementations; these might be a starting point.
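A custom list-like collection of primitive doubles can be quite small. The following is a minimal sketch, not the API of any existing library; the class name DoubleArrayList and its methods are hypothetical.

```java
import java.util.Arrays;

class DoubleArrayList {
    private double[] data = new double[16];
    private int size = 0;

    // Appends a value, doubling the backing array when it is full.
    void add(double v) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = v;
    }

    int size() {
        return size;
    }

    // Returns a sorted copy of the stored values, with no boxing involved.
    double[] toSortedArray() {
        double[] copy = Arrays.copyOf(data, size);
        Arrays.sort(copy);
        return copy;
    }
}
```

Each stored value costs 8 bytes plus whatever slack is left in the backing array, compared with an object header and a reference for every boxed Double in a List<Double>.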
Another level would be to maintain the list of doubles in an NIO Buffer off heap. This has the advantage that the space used for the data is not considered in the GC runs, and it could in theory lead you down the road of managing the data structure in a memory-mapped file.
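An off-heap variant could be sketched with a direct ByteBuffer viewed as a DoubleBuffer. The class name OffHeapDoubles is hypothetical and the capacity is assumed to be known up front; the sketch does not grow the buffer or sort it.

```java
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;

class OffHeapDoubles {
    private final DoubleBuffer values;

    OffHeapDoubles(int capacity) {
        // Direct buffers are allocated outside the garbage-collected heap.
        int bytes = Math.multiplyExact(capacity, Double.BYTES);
        this.values = ByteBuffer.allocateDirect(bytes).asDoubleBuffer();
    }

    // Appends at the current position; throws BufferOverflowException when full.
    void add(double v) {
        values.put(v);
    }

    int size() {
        return values.position();
    }

    double get(int index) {
        return values.get(index);
    }
}
```

The amount of direct memory a JVM may allocate is itself limited, so this moves the data out of the heap rather than removing the memory requirement.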
From your description, it appears you will not be able to easily improve your memory usage. The size of a double is fixed, and if you need to retain all results until your final processing, you will not be able to reduce the size of that data.
If you need to reduce your memory usage, but can accept a longer run time, you could replace the Map<String, List<Double>> with a List<Double> and only process a single deal at a time.
If you have to have all the values from all the deals, your only option is to increase your available memory. Your calculation of the memory usage is based on just the size of a value and the number of values. Without a way to decrease the number of values you need, no data structure will be able to help you, you just need to increase your available memory.
From what you tell us it sounds like you need 10^6 x 30 processors (ie number of simulations multiplied by number of deals) each with a few K RAM. Perhaps, though, you don't have that many processors -- do you have 30 each of which has sufficient memory for the simulations for one deal ?
Seriously: parallelise your program and buy an 8-core computer with 32GB RAM (or 16-core w 64GB or ...). You are going to have to do this sooner or later, might as well do it now.
There was a theory that I read a while ago where you would write the data to disk and only read/write the chunk that you need. Of course this describes virtual memory, but the difference here is that the programmer controls the flow and location rather than the OS. The advantage there is that the OS is only allocated so much virtual memory to use, whereas you have access to the whole HD.
Or an easier option is just to increase your swap/paged memory, which I think would be silly but would help in your case.
After a quick google it seems like this function might help you if you are running on Windows:
http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx
You say you need access to all the values, but you cannot possibly operate on all of them at once? Can you serialize the data such that you can store it in a single file, with each record set apart either by some delimiter, a key value, or simply the byte count? Keep a byte counter either way. Let that be a "circular file" composed of a left file and a right file operating like opposing stacks. As data is popped (read) off the left file it is processed and pushed (written) into the right file. If your next operation requires a previously processed value, reverse the direction of the file transfer. Think of your algorithm as residing at the read/write head of your hard drive. You have access as you would with a list, just using different methods and at much reduced speed.
The speed hit will be significant, but if you can optimize your sequence of serialization so that the most likely accessed data is at the top of the file in order of use, and possibly put the left and right files on different physical drives and your page file on a third drive, you will benefit from increased hard disk performance due to sequential and simultaneous reads and writes. Of course it's a bit harder than it sounds. Each change of direction requires finalizing both files. Logically, something like:
if (current data flow is left to right) {send EOF to right_file; left_file = left_file - right_file;}
Practically, you would want to leave all data in place where it physically resides on the drive and just manipulate the beginning and ending addresses for the files in the master file table, literally operating like a pair of hard disk stacks. This will be a much slower, more complicated process than simply adding more memory, but very much more efficient than separate files and all that overhead for 1 file per record * millions of records. Or just put all your data into a database.
FWIW, this idea just came to me. I've never actually done it or even heard of it done. But I imagine someone must have thought of it before me. If not, please let me know. I could really use the credit on my resume.
One solution would be to format the doubles as strings and then add them in a (fast) Key Value store which is ordering by-design.
Then you would only have to read sequentially from the store.
Here is a store that 'naturally' sorts entries as they are inserted.
And they boast that they are doing it at the rate of 100 million entries per second (searching is almost twice as fast):
http://forum.gwan.com/index.php?p=/discussion/comment/897/#Comment_897
With an API of only 3 calls, it should be easy to test.
A fourth call will provide range-based searches.
OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files), it needs to make approximately 30,000x30,000 entries. This will probably be bigger once I run it on my full intended corpus too.
It consistently fails every time, after it indexes, while it is initializing the data structure (to be worked on later).
Things I have done include:
revamp my code to use a primitive int[] instead of a TreeMap
eliminate redundant structures, etc...
Also, I have run the program with -Xmx2g to max out my allocated memory
I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas?
Thanks,
B.
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
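A sparse representation stores only the non-zero cells. The sketch below is one simple way to do that, assuming most entries really are 0; the class name SparseIntMatrix is hypothetical, and a HashMap with boxed keys and values is used for brevity even though it has noticeable per-entry overhead. For comparison, a dense int[30000][30000] needs 30,000 * 30,000 * 4 bytes, about 3.6GB, before any object overhead.

```java
import java.util.HashMap;
import java.util.Map;

class SparseIntMatrix {
    private final int n;
    private final Map<Long, Integer> cells = new HashMap<>();

    SparseIntMatrix(int n) {
        this.n = n;
    }

    // Packs (row, col) into a single long key.
    private long key(int row, int col) {
        return (long) row * n + col;
    }

    // Cells that were never set read as 0.
    int get(int row, int col) {
        Integer v = cells.get(key(row, col));
        return v == null ? 0 : v;
    }

    // Storing 0 removes the cell so that only non-zero entries use memory.
    void set(int row, int col, int value) {
        if (value == 0) {
            cells.remove(key(row, col));
        } else {
            cells.put(key(row, col), value);
        }
    }

    int storedEntries() {
        return cells.size();
    }
}
```

Whether this helps depends entirely on how many cells end up non-zero; if a large share of the n² entries is populated, the map will use more memory than the dense array.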
There are several causes of out of memory issues.
Firstly, the simplest case is that you simply need more heap. You're using 512M max heap when your program could operate correctly with 2G. Increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs will use up to twice the memory of 32-bit VMs, depending on the makeup of that data.
If your problem isn't that simple then you can look at optimization. Replacing objects with primitives and so on. This might be an option. I can't really say based on what you've posted.
Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there is. Operating systems use this with virtual address spaces and using hard disk space as extra memory. This could mean only keeping some of the data structure in memory at a time and persisting the rest to secondary storage (eg file or database).
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.
So in the simple case, if what you're storing is 30K words and 30K x 30K combinations of words, you could divide it up across four servers:
A-M x A-M
A-M x N-Z
N-Z x A-M
N-Z x N-Z
That's just one idea. Again it's hard to comment without knowing specifics.
This is a common problem dealing with large datasets. You can optimize as much as you want, but the memory will never be enough (probably), and as soon as the dataset grows a little more you are still smoked. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
If you don't need a full 32 bits (size of integer) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to be doing a lot of processing in-memory.