I am sorting a number of integers from a file which will probably be too large to fit into memory in one go. My current idea is to sort chunks with quicksort, then merge-sort them together. I would like to make the chunks as big as possible, so I'd like to know how much I can read in at once.
I know about Runtime.freeMemory(), but how should I go about using it? Should I carefully work out what other variables I use in the program and then create an array of size (freeMemory - variableSizes), or is that too likely to go wrong?
Thanks!
Experiment until you find a size that works well. The largest array you can allocate on the heap isn't necessarily the fastest way to do it. In many circumstances the entire heap does not fit in the computer's RAM and may be swapped out in parts. Just because you can allocate a huge array does not mean it will be the best size for optimizing speed.
Some adaptive approach would probably be best (testing number of items sorted/second depending on array size) and adjusting for what you can fit without getting an OutOfMemoryError.
Simpler: stick with some large value that works well, but isn't necessarily the largest you can use.
Or: use an external library/database to do what you want - working with huge amounts of data is tricky to get right in general, and you will probably get better performance and shorter development time if you don't reinvent the wheel.
I'd start with a relatively small chunk size for the first chunk, then double the chunk size for every subsequent chunk until you get an OutOfMemoryError. Though that will probably trigger swapping.
I think figuring out exactly how much memory you can allocate is a sticky business. By default the JVM will allocate a heap of around 256M, but this can always be increased with -Xmx, so it is best to trade a little performance for portability by using a fixed chunk size of, let's say, around 150M.
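As a rough illustration of that trade-off, here is a small sketch that caps the chunk size at a fixed value but also never asks for more than a fraction of what the JVM reports as available; the 150M cap and the 50% safety factor are arbitrary assumptions, not recommendations:

    // Sketch only: derives a chunk size from Runtime's memory figures.
    // The fixed cap and the safety fraction are arbitrary assumptions.
    public class ChunkSizer {
        private static final long FIXED_CAP = 150L * 1024 * 1024; // ~150 MB, portable default

        public static long chunkSizeBytes() {
            Runtime rt = Runtime.getRuntime();
            // Memory the heap could still grow to, plus what is already free.
            long available = rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
            // Never claim more than half of what looks available.
            return Math.min(FIXED_CAP, available / 2);
        }

        public static void main(String[] args) {
            System.out.println("Chunk size: " + chunkSizeBytes() + " bytes");
        }
    }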
If you go with Java's built-in sorting functionality, you will have to use a Collection of some sort, which will not take the primitive int type; rather, you will have to use Integer objects (List<Integer>).
In my experience (not to be taken as gospel), an int weighs in at (obviously) 4 bytes of RAM, whereas an Integer weighs in at 12 bytes on a 32-bit machine and 24 bytes on a 64-bit machine.
If you need to minimize memory foot print, use int[] and then implement your own sorter...
However, it might be easier all the way around to use List<Integer> and the built-in sorting functions, and just deal with having more, smaller-sized lists.
To answer the question, though: you should definitely look at the merge-sort angle of attack on this problem and just pick an arbitrary list size to start with. You will likely find, after some experimentation, that there is a trade-off between list size and number of chunks. Find the sweet spot and tell us your results!
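If you go the int[] route, a minimal sketch of the chunking phase might look like the following. It assumes one integer per line of input and an arbitrary chunk size, and leans on Arrays.sort for primitives rather than a hand-written quicksort; the merge phase is omitted:

    // Sketch of the chunking phase only, assuming one int per line in the input.
    // CHUNK_SIZE is an arbitrary placeholder; the merge phase is not shown.
    import java.io.*;
    import java.util.Arrays;

    public class ChunkSorter {
        static final int CHUNK_SIZE = 10_000_000; // ints per chunk (assumption, ~40 MB)

        public static void sortChunks(File input) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(input))) {
                int[] buf = new int[CHUNK_SIZE];
                int chunk = 0;
                int n = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    buf[n++] = Integer.parseInt(line.trim());
                    if (n == CHUNK_SIZE) {
                        writeSorted(buf, n, chunk++);
                        n = 0;
                    }
                }
                if (n > 0) writeSorted(buf, n, chunk);
            }
        }

        private static void writeSorted(int[] buf, int n, int chunk) throws IOException {
            Arrays.sort(buf, 0, n); // built-in primitive sort
            try (PrintWriter out = new PrintWriter(new BufferedWriter(
                    new FileWriter("chunk-" + chunk + ".txt")))) {
                for (int i = 0; i < n; i++) out.println(buf[i]);
            }
        }
    }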
I have a graph algorithm that generates intermediate results associated with different nodes. Currently, I have solved this by using a ConcurrentHashMap<Node, List<Result>> (I am running multithreaded). So at first I add new results with map.get(node).add(result), and then I consume all results for a node at once with map.get(node).
However, I need to run on a pretty large graph where the number of intermediate results won't fit into memory (good old OutOfMemoryError). So I require some solution to write the results out to disk, because that's where there is still space.
Having looked at a lot of different "off-heap" maps and caches, as well as MapDB, I figured they are all not a fit for me. None of them seem to support multimaps (which I guess is what my map is) or mutable values (which the list would be). Additionally, MapDB has been very slow for me when trying to create a new collection for every node (even with a custom serializer based on FST).
I can hardly imagine, though, that I am the first and only one to have such a problem. All I need is a mapping from a key to a list which I only need to extend or read as a whole. What would an elegant and simple solution look like? Or is there an existing library that I can use for this?
Thanks in advance for saving my week :).
EDIT
I have seen many good answers, however, I have two important constraints: I don't want to depend on an external database (e.g. Redis) and I can't influence the heap size.
1. You can increase the heap size. The heap can be configured to be larger than the physical memory of your server, as long as this condition holds:

    heap size + size of other applications < physical memory + swap space

For instance, if the physical memory is 4G and the swap space is 4G, the heap size can be configured to 6G. But the program will suffer from page swapping.

2. You can use a database like Redis. Redis is a key-value store and has a List structure. I think this is the simplest way to solve your problem.

3. You can compress the Result instances. First serialize the instance and compress it, and define a class:

    class CompressResult {
        byte[] result;
        //...
    }

Then replace Result with CompressResult. But you have to deserialize the result when you want to use it. This works well if the Result class has many fields and is very complicated.
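For what it's worth, a minimal sketch of that compression idea, assuming Result implements Serializable and using the JDK's GZIP streams (the class and method names below are illustrative, not from the question):

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Wraps a serialized, gzip-compressed result. Assumes the payload is Serializable.
    class CompressedResult {
        private final byte[] bytes;

        CompressedResult(Serializable result) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos))) {
                oos.writeObject(result);
            }
            this.bytes = bos.toByteArray();
        }

        Object unpack() throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(
                    new GZIPInputStream(new ByteArrayInputStream(bytes)))) {
                return ois.readObject();
            }
        }
    }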
My recollection is that the JVM runs with a fairly small default maximum heap size. With -Xmx10000m you can tell the JVM to run with a 10,000 MB heap (or whatever number you select). If your underlying OS resources support a larger heap, that might work.
Could you please suggest to me (a novice in Android/Java) the most efficient way to deal with a relatively large amount of data?
I need to compute some stuff for each of the 1000...5000 elements of, say, a big datatype (x1, y1, z1 - double; flag1...flagn - boolean; desc1...descn - string) quite often (once a second), which is why I want to do it as fast as possible.
What way would be the best? To declare a multidimensional array, or produce an array for each element (x1[i], y1[i]...), special class, some sort of JavaBean? Which one is the most efficient in terms of speed etc? Which is the most common way to deal with that sort of thing in Java?
Many thanks in advance!
Nick, you've asked a very general question. I'll do my best to answer it, but please be aware that if you want anything more specific, you're going to need to drill down your question a bit.
Some back-of-the-envelope calculations show that for an array of 5000 doubles you'll use 8 bytes * 5000 = 40,000 bytes, or roughly 40 kB of memory. This isn't too bad, as memory on most Android devices is on the order of megabytes or even gigabytes. A good ol' ArrayList should do just fine for storing this data. You could probably make things a little faster by specifying the ArrayList's capacity in its constructor; that way the ArrayList doesn't have to dynamically expand every time you add more data to it.
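As a tiny illustration of that constructor hint (5000 is just the upper bound from the question, and the data is a placeholder):

    import java.util.ArrayList;
    import java.util.List;

    public class Presized {
        public static void main(String[] args) {
            // Passing an initial capacity avoids repeated internal array copies as the list grows.
            List<Double> xs = new ArrayList<>(5000);
            for (int i = 0; i < 5000; i++) {
                xs.add(Math.random()); // placeholder data
            }
            System.out.println(xs.size());
        }
    }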
Word of caution though. Since we are on a memory restricted device, what could potentially happen is if you generate a lot of these ArrayLists rapidly in succession, you might start triggering the garbage collector a lot. This could cause your app to slow down (the whole device actually). If you're really going to be generating lots of data, then don't store it in memory. Store it off on disk where you'll have plenty of room and won't be triggering the garbage collector all the time.
I think that the efficiency with which you write the computation you need to do on each element is way more important than the data structure you use to store it. The difference between using an array for each element or an array of objects (each of which is the instance of a class containing all elements) should practically be negligible. Use whatever data structures you feel most comfortable with and focus on writing efficient algorithms.
OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with an "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files), it needs to make approximately 30,000 x 30,000 entries. It will probably be bigger once I run it on my full intended corpus, too.
It consistently fails every time, after it indexes, while it is initializing the data structure (to be worked on later).
Things I have done include:
revamp my code to use a primitive int[] instead of a TreeMap
eliminate redundant structures, etc...
Also, I have run the program with -Xmx2g to max out my allocated memory
I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas?
Thanks,
B.
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
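If the sparse route fits, a deliberately naive sketch is to key a HashMap on the (row, column) pair and treat missing keys as zero. The class below is purely illustrative and not tuned for a 30,000 x 30,000 matrix:

    import java.util.HashMap;
    import java.util.Map;

    // Naive sparse matrix: stores only non-zero counts, keyed by row * n + col.
    // Fine as an illustration; a production version would use a primitive-keyed map.
    public class SparseCounts {
        private final long n;                       // number of unique words
        private final Map<Long, Integer> cells = new HashMap<>();

        public SparseCounts(long n) { this.n = n; }

        public int get(int row, int col) {
            return cells.getOrDefault(row * n + col, 0);
        }

        public void increment(int row, int col) {
            cells.merge(row * n + col, 1, Integer::sum);
        }
    }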
There are several causes of out of memory issues.
Firstly, the simplest case is that you simply need more heap. You're using a 512M max heap when your program could operate correctly with 2G. Increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs will use up to twice the memory of 32-bit VMs, depending on the makeup of that data.
If your problem isn't that simple then you can look at optimization. Replacing objects with primitives and so on. This might be an option. I can't really say based on what you've posted.
Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there is. Operating systems use this with virtual address spaces and using hard disk space as extra memory. This could mean only keeping some of the data structure in memory at a time and persisting the rest to secondary storage (eg file or database).
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.
So, in the simple case, if what you're storing is 30K words and 30K x 30K combinations of words, you could divide it up into four servers:
A-M x A-M
A-M x N-Z
N-Z x A-M
N-Z x N-Z
That's just one idea. Again, it's hard to comment without knowing specifics.
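Purely for illustration, a tiny helper that maps a word pair onto one of those four partitions (it assumes uppercase ASCII words and mirrors the A-M / N-Z split above):

    // Maps a word pair to one of four partitions: (A-M|N-Z) x (A-M|N-Z).
    // Assumes uppercase ASCII words; purely illustrative.
    public class Partitioner {
        static int partition(String first, String second) {
            int a = first.charAt(0) < 'N' ? 0 : 1;
            int b = second.charAt(0) < 'N' ? 0 : 1;
            return a * 2 + b; // 0..3
        }

        public static void main(String[] args) {
            System.out.println(partition("APPLE", "ZEBRA")); // prints 1
        }
    }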
This is a common problem dealing with large datasets. You can optimize as much as you want, but the memory will never be enough (probably), and as soon as the dataset grows a little more you are still smoked. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
If you don't need a full 32 bits (size of integer) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to be doing a lot of processing in-memory.
Given a harddrive with 120GB, 100 of which are filled with the strings of length 256 and 2 GB Ram how do I sort those strings in Java most efficiently?
How long will it take?
A1: You probably want to implement some form of merge sort.
A2: Longer than it would if you had 256GB of RAM in your machine.
Edit: stung by criticism, I quote from Wikipedia's article on merge sort:
Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.

For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.
Here is how I'd do it:
Phase 1 is to split the 100Gb into 50 partitions of 2Gb, read each of the 50 partitions into memory, sort using quicksort, and write out. You want the sorted partitions at the top end of the disc.
Phase 2 is to then merge the 50 sorted partitions. This is the tricky bit because you don't have enough space on the disc to store the partitions AND the final sorted output. So ...
1. Do a 50-way merge to fill the first 20Gb at the bottom end of the disc.
2. Slide the remaining data in the 50 partitions to the top to make another 20Gb of free space contiguous with the end of the first 20Gb.
3. Repeat steps 1 and 2 until completed.
This does a lot of disc IO, but you can make use of your 2Gb of memory for buffering in the copying and merging steps to get better throughput by minimizing the number of disc seeks and doing large data transfers.
EDIT - #meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release the disc space used by partitions (phase 2, step 2) by simply truncating the partition files.
The potential downsides of this are increased disk fragmentation and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux / UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)
Finally, I'd like to point out that any theoretical prediction of the time taken by this algorithm (and others) is largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for back-of-the-envelope calculations to give reliable answers. A proper treatment would require actual implementation, tuning, and benchmarking.
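To make the merge phase a bit more concrete, here is a hedged sketch of a k-way merge of already-sorted text files using a PriorityQueue with one buffered reader per partition; the one-record-per-line layout is an assumption, and the disc-space juggling described above is not shown:

    import java.io.*;
    import java.util.List;
    import java.util.PriorityQueue;

    // k-way merge of sorted text files, one record per line.
    // Keeps one buffered reader per partition, so memory use stays small.
    public class KWayMerge {
        private static class Head implements Comparable<Head> {
            final String line;
            final BufferedReader reader;
            Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
            public int compareTo(Head o) { return line.compareTo(o.line); }
        }

        public static void merge(List<File> sortedParts, File output) throws IOException {
            PriorityQueue<Head> heap = new PriorityQueue<>();
            for (File part : sortedParts) {
                BufferedReader r = new BufferedReader(new FileReader(part));
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
                else r.close();
            }
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)))) {
                while (!heap.isEmpty()) {
                    Head h = heap.poll();          // smallest remaining record
                    out.println(h.line);
                    String next = h.reader.readLine();
                    if (next != null) heap.add(new Head(next, h.reader));
                    else h.reader.close();
                }
            }
        }
    }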
I am basically repeating Krystian's answer, but elaborating:
Yes you need to do this more-or-less in place, since you have little RAM available. But naive in-place sorts would be a disaster here just due to the cost of moving strings around.
Rather than actually move strings around, just keep track of which strings should swap with which others and actually move them, once, at the end, to their final spot. That is, if you had 1000 strings, make an array of 1000 ints. array[i] is the location where string i should end up. If array[17] == 133 at the end, it means string 17 should end up in the spot for string 133. array[i] == i for all i to start. Swapping strings, then, is just a matter of swapping two ints.
Then, any in-place algorithm like quicksort works pretty well.
The running time is surely dominated by the final move of the strings. Assuming each one moves, you're moving around about 100GB of data in reasonably-sized writes. I might assume the drive / controller / OS can move about 100MB/sec for you. So, 1000 seconds or so? 20 minutes?
But does it fit in memory? You have 100GB of strings, each of which is 256 bytes. How many strings? 100 * 2^30 / 2^8, or about 419M strings. You need 419M ints, each is 4 bytes, or about 1.7GB. Voila, fits in your 2GB.
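Here is a small sketch of the "sort indices, move strings once" idea, shown on an in-memory String[] purely to illustrate the mechanics; the real version would of course be reading and writing the drive:

    import java.util.Arrays;

    // Sorts an index array by comparing the strings it points to,
    // then moves each string once into its final slot.
    public class IndexSort {
        public static String[] sortByIndex(String[] strings) {
            Integer[] order = new Integer[strings.length];
            for (int i = 0; i < order.length; i++) order[i] = i;

            // Only the small per-entry index array is shuffled during the sort.
            Arrays.sort(order, (a, b) -> strings[a].compareTo(strings[b]));

            // Single pass of moves at the end.
            String[] sorted = new String[strings.length];
            for (int pos = 0; pos < order.length; pos++) sorted[pos] = strings[order[pos]];
            return sorted;
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(sortByIndex(new String[]{"pear", "apple", "fig"})));
        }
    }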
Sounds like a task that calls for an external sorting method. Volume 3 of "The Art of Computer Programming" contains a section with an extensive discussion of external sorting methods.
I think you should use BogoSort. You might have to modify the algorithm a bit to allow for in-place sorting, but that shouldn't be too hard. :)
You could use a trie (a.k.a. a prefix tree) to build a tree-like structure that allows you to easily walk through your strings in an ordered manner by comparing their prefixes. In fact, you don't need to store it in memory: you can build the trie as a tree of directories on your file system (obviously not the one the data is coming from).
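If the trie route appeals, a bare-bones in-memory version (without the filesystem-backed variant the answer suggests) could look roughly like this; inserting strings and walking children in key order yields them back sorted:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Bare-bones trie: inserting strings and walking children in key order
    // yields the strings in sorted order.
    public class Trie {
        private final Map<Character, Trie> children = new TreeMap<>();
        private int count; // how many times this exact string was inserted

        public void insert(String s) {
            Trie node = this;
            for (int i = 0; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Trie());
            }
            node.count++;
        }

        public List<String> sorted() {
            List<String> out = new ArrayList<>();
            walk(new StringBuilder(), out);
            return out;
        }

        private void walk(StringBuilder prefix, List<String> out) {
            for (int i = 0; i < count; i++) out.add(prefix.toString());
            for (Map.Entry<Character, Trie> e : children.entrySet()) {
                prefix.append(e.getKey());
                e.getValue().walk(prefix, out);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }
    }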
AFAIK, merge-sort requires as much free space as you have data. This may be a requirement for any external sort that avoids random access, though I'm not sure of this.
When I was using C++ in college, I was told to use multidimensional arrays (hereafter MDA) whenever possible, since they exhibit better memory locality, being allocated in one big chunk. Arrays of arrays (AoA), on the other hand, are allocated in multiple smaller chunks, possibly scattered all over physical memory wherever vacancies are found.
So I guess the first question is: is this a myth, or is this an advice worth following?
Assuming that it's the latter, then the next question would be what to do in a language like Java that doesn't have true MDAs. It's not that hard to emulate an MDA with a 1D array, of course. Essentially, what is syntactic sugar in languages with MDAs can be implemented as library support in languages without them.
Is this worth the effort? Is this too low level of an optimization issue for a language like Java? Should we just abandon arrays and use Lists even for primitives?
Another question: in Java, does allocating an AoA at once (new int[M][N]) possibly yield a different memory layout than doing it hierarchically (new int[M][]; for (...) new int[N])?
Java and C# allocate memory in a much different fashion than C++ does. In fact, in .NET all the arrays of an AoA will certainly be close together if they are allocated one after another, because memory there is just one contiguous chunk without any fragmentation whatsoever.
But it is still true for C++, and still makes sense if you want maximum speed. However, you shouldn't follow that advice every time you want a multidimensional array; write maintainable code first and then profile it if it is slow. Premature optimization is the root of all evil in this world.
Is this worth the effort? Is this too low level of an optimization issue for a language like Java?
Generally speaking, it is not worth the effort. The best strategy is to forget about this issue in the first version of your application and implement it in a straightforward (i.e. easy to maintain) way. If the first version runs too slowly for your requirements, use a profiling tool to find the application's bottlenecks. If the profiling suggests that arrays of arrays are likely to be the problem, do some experiments changing your data structures to simulated multi-dimensional arrays and profile again to see if it makes a significant difference. [I suspect that it won't make much difference. But the most important thing is not to waste your time optimizing something unnecessarily.]
Should we just abandon arrays and use Lists even for primitives?
I wouldn't go that far. Assuming that you are dealing with arrays of a predetermined size:
arrays of objects will be a bit faster than equivalent lists of objects, and
arrays of primitives will be considerably faster and take considerably less space than equivalent lists of primitive wrappers.
On the other hand, if your application needs to "grow" the arrays, using a List will simplify your code.
I would not waste the effort to use a 1D array as a multidim array in Java, because there is no syntax to help. Of course one could define functions (methods) to hide the work for you, but you just end up with a function call instead of following a pointer as when using an array of arrays. Even if the compiler/interpreter speeds this up for you, I still don't think it is worth the effort. In addition, you can run into complications when trying to use code that expects 2D (or N-dim) arrays as arrays of arrays; I'm sure most general code out there will be written for arrays like that in Java. Also, with an array of arrays you can cheaply reassign rows (or columns, if you decide to think of it like that).
If you know that this multidim array is a bottleneck, you may disregard what I said and see if manually optimizing helps.
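For completeness, this is the kind of wrapper being described, with the row-major index arithmetic hidden behind hypothetical get/set methods:

    // 2D view over a single contiguous int[]; index arithmetic is row-major.
    public class Int2D {
        private final int rows, cols;
        private final int[] data;

        public Int2D(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.data = new int[rows * cols];
        }

        public int get(int r, int c) { return data[r * cols + c]; }
        public void set(int r, int c, int value) { data[r * cols + c] = value; }

        public static void main(String[] args) {
            Int2D m = new Int2D(3, 4);
            m.set(2, 1, 42);
            System.out.println(m.get(2, 1)); // 42
        }
    }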
From personal experience in Java, multidimensional arrays are far, far slower than one-dimensional arrays when loading a large amount of data or accessing elements at different positions. I wrote a program that took a screenshot image in BMP format and then searched the screenshot for a smaller image. Loading the screenshot image (approx. 3 MB) into a multidimensional array (three-dimensional: [xPos][yPos][color], with color=0 being the red value, and so forth) took 14 seconds. Loading it into a one-dimensional array took 1 second.
The gain for finding the smaller image in the larger image was similar. It took around 28 seconds to find the smaller image when both images were stored as multidimensional arrays, and around a second when both were stored as one-dimensional arrays. That said, I first wrote my program using multidimensional arrays for the sake of readability.