I'm using MapDB in a project that deals with billions of Objects that need to be mapped/queued. I don't need any kind of persistence after the program finishes (the MapDB databases are all temporary). I want the program to run as fast as possible, but I'm confused about MapDB's commit() function (which I assume is relevant to performance), even after reading the docs. My questions:
What exactly does commit do? My working understanding is that it serializes Objects from the heap to disk, thus freeing heap space. Is this accurate?
What happens to the references to Objects that were just committed? Do they get cleaned up by GC, or do they somehow 'reference' an Object on disk (with MapDB making this transparent?)
Ultimately I want to know how to use MapDB as efficiently as I can, but I can't do that without knowing what commit() is for. I'd appreciate any other advice that you might have for using MapDB efficiently.
The commit operation is an operation on transactions, just as in a database system. MapDB implements transactions, so commit effectively means 'make the changes I've made to this DB permanent and visible to other users of it'. The complementary operation is rollback, which discards all of the changes you've made within the current transaction. Commit doesn't (directly) affect what is in memory and what is not. You might want to look at compact() instead, if you're trying to reclaim heap space.
For your second question, if you're holding a strong reference to an object then you continue holding that strong reference. MapDB isn't going to delete it for you. You should think of MapDB as a normal Java Map, most of the time. When you call get, MapDB hides whether it's in memory or on disk from you and just returns you a usable reference to the retrieved object. That retrieved object will hang around in memory until it becomes garbage, just like anything else.
It is a good idea not to commit after every single change you make to a map, but instead on some sort of schedule, such as:

Every N changes
Every M seconds
After some sort of logical checkpoint in your code

Committing too often will make your application very slow.
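The batching advice above can be sketched without any MapDB specifics. Here `BatchingWriter` is a hypothetical helper, and `commitAction` is a stand-in for a call like MapDB's `db.commit()`:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: commit every N changes instead of after each one.
// `commitAction` stands in for something like MapDB's db.commit().
class BatchingWriter<K, V> {
    private final Map<K, V> map;
    private final Runnable commitAction;
    private final int batchSize;
    private int pending = 0;

    BatchingWriter(Map<K, V> map, Runnable commitAction, int batchSize) {
        this.map = map;
        this.commitAction = commitAction;
        this.batchSize = batchSize;
    }

    void put(K key, V value) {
        map.put(key, value);
        if (++pending >= batchSize) {
            commitAction.run();   // e.g. db.commit()
            pending = 0;
        }
    }
}
```

A time-based variant would trigger `commitAction` from a scheduled task instead of a counter; the logical-checkpoint variant simply calls it at the end of each unit of work.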
Related
Backstory: So I had this great idea, right? Sometimes you're collecting a massive amount of data, and you don't need to access all of it all the time, but you also may not need it after the program has finished, and you don't really want to muck around with database tables, etc. What if you had a library that would silently and automatically serialize objects to disk when you're not using them, and silently bring them back when you needed them? So I started writing a library; it has a number of collections like "DiskList" or "DiskMap" where you put your objects. They keep your objects via WeakReferences. While you're still using a given object, it has strong references to it, so it stays in memory. When you stop using it, the object is garbage collected, and just before that happens, the collection serializes it to disk (*). When you want the object again, you ask for it by index or key, like usual, and the collection deserializes it (or returns it from its inner cache, if it hasn't been GCd yet).
(*) See now, this is the sticking point. In order for this to work, I need to be able to be notified JUST BEFORE the object is GCd - after no other references to it exist (and therefore the object can no longer be modified), but before the object is wiped from memory. This is proving difficult. I thought briefly that using a ReferenceQueue would save me, but alas, it returns a Reference, whose referent has thus far always been null.
Is there a way, having been given an arbitrary object, to receive (via callback or queue, etc.) the object after it is ready to be garbage collected, but before it IS garbage collected?
I know (Object).finalize() can basically do that, but I'll have to deal with classes that don't belong to me, and whose finalize methods I can't legitimately override. I'd prefer not to go as arcane as custom classloaders, bytecode manipulation, or reflection, but I will if I have to.
(Also, if you know of existing libraries that do transparent disk caching, I'd look favorably on that, though my requirements on such a library would be fairly stringent.)
You can look for a cache that supports "write-behind caching" and tiering. Notable products are EHCache, Hazelcast, and Infinispan.
Or you can build something yourself with a cache and a time-to-idle expiry.
Then each cache access counts as "the usage" of the object.
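A minimal time-to-idle cache along those lines might look like this. The class name is illustrative, and the clock is passed in explicitly to keep the sketch deterministic; a real version would use `System.currentTimeMillis()`, run `sweep` from a scheduled task, and could write evicted values to disk first (the "write-behind" part):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of a time-to-idle cache: each get refreshes the entry's
// last-access time; a periodic sweep evicts entries idle too long.
class IdleExpiryCache<K, V> {
    private static class Entry<V> {
        V value;
        long lastAccess;
        Entry(V value, long now) { this.value = value; this.lastAccess = now; }
    }

    private final Map<K, Entry<V>> map = new HashMap<>();
    private final long maxIdleMillis;

    IdleExpiryCache(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    V get(K key, long now) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        e.lastAccess = now;        // an access counts as "usage"
        return e.value;
    }

    void put(K key, V value, long now) { map.put(key, new Entry<>(value, now)); }

    // Evict everything that has been idle longer than maxIdleMillis.
    void sweep(long now) {
        Iterator<Map.Entry<K, Entry<V>>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (now - it.next().getValue().lastAccess > maxIdleMillis) it.remove();
        }
    }

    int size() { return map.size(); }
}
```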
Is there a way, having been given an arbitrary object, to receive (via callback or queue, etc.) the object after it is ready to be garbage collected, but before it IS garbage collected?
This interferes heavily with garbage collection. Chances are high that it will bring down your application or your whole system. What you want to do is start disk I/O, and potentially allocate additional objects, precisely when the system is low on or out of memory. Even if you manage to make it work, you'll end up using more heap than before, since the heap has to be extended whenever the GC kicks in.
I have a database where I store invoices. I have to make complex operations for any given month with a series of algorithms using the information from all of the invoices. Retrieving and processing the necessary data for these operations takes a large amount of memory since there might be lots of invoices. The problem gets increasingly worse when the interval requested by the user for these calculations goes up to several years. The result is I'm getting a PermGen exception because it seems that the garbage collector is not doing its job between each month calculation.
I've always understood that using System.gc() to hint the GC to do its job is not good practice. So my question is: are there any other ways to free memory aside from that? Can you force the JVM to use disk swapping in order to store partial calculations temporarily?
Also, I've tried calling System.gc() at the end of each month's calculation, and the result was high CPU usage (due to the garbage collector being called) and reasonably lower memory use. This could do the job, but I don't think it's a proper solution.
Don't ever use System.gc(). It always takes a long time to run and often doesn't do anything.
The best thing to do is rewrite your code to minimize memory usage as much as possible. You haven't explained exactly how your code works, but here are two ideas:
Try to reuse the data structures you generate yourself for each month. So, say you have a list of invoices, reuse that list for the next month.
If you need all of it, consider writing the processed results to temporary files as you do the processing, then reloading them when you're ready.
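The second idea can be sketched with plain Java serialization. `PartialResults` and its method names are hypothetical, and a real version would likely want a more compact format than `ObjectOutputStream`:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: spill each month's processed results to a temp file, keeping
// only the current month in memory, then reload when aggregating.
class PartialResults {
    static Path spill(List<double[]> monthResults) throws IOException {
        Path tmp = Files.createTempFile("month-", ".bin");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(new ArrayList<>(monthResults));
        }
        return tmp; // caller keeps only this small handle, not the data
    }

    @SuppressWarnings("unchecked")
    static List<double[]> reload(Path tmp) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(tmp))) {
            return (List<double[]>) in.readObject();
        }
    }
}
```

Once a month's list is spilled and the in-memory reference dropped, the data becomes garbage without any System.gc() call.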
We should remember that System.gc() does not actually run the garbage collector; it simply asks the JVM to run it, and the JVM may or may not comply. All we can do is make unnecessary data structures available for garbage collection. You can do that by:

Assigning null to any reference to a data structure after it has been used, so that no active thread can reach it any more (in short, making it eligible for GC).
Reusing the same structures instead of creating new ones.
I have run into a situation in which I would like to store an in-memory cache of spatial data which is not immediately needed, and is not loaded from disk, but generated algorithmically. Because the data is accessed spatially, data would be deleted from the cache based on irrelevance factors and the distance from the location of the most recent read operation. The problem is that Java's garbage collection does not seem to integrate well with this system. I would like to use the spatial knowledge of the data to enable it to be garbage-collected by the JVM. Is there a way to mark these cache objects as garbage-collectible? If the JVM encounters an out-of-memory exception, is there a way to catch that exception and delete the cache objects to free up memory?
Or is this the wrong way to do things?
Is there a way to mark these cache objects as garbage-collectible?
The simplest way is to store:

some data with strong references, e.g. in a LinkedHashMap, possibly as an LRU cache.
data which you would like to retain if possible in a SoftReference cache. These will not be cleaned up immediately, but will be cleaned up before an OOME.
data which can be discarded with little cost in a WeakHashMap. This data is available until the next GC is performed.
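A rough sketch of the first two tiers, assuming a small strong-reference LRU in front of a `SoftReference` map; the class name and capacity are illustrative, not a library API:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a strong LRU tier (LinkedHashMap in access order) in front of
// a SoftReference tier that the GC may clear under memory pressure
// before an OutOfMemoryError would be thrown.
class TieredCache<K, V> {
    private final Map<K, SoftReference<V>> soft = new HashMap<>();
    private final Map<K, V> strong;

    TieredCache(int strongCapacity) {
        this.strong = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > strongCapacity) {
                    // Demote the evicted entry to the soft tier instead of dropping it.
                    soft.put(eldest.getKey(), new SoftReference<>(eldest.getValue()));
                    return true;
                }
                return false;
            }
        };
    }

    void put(K key, V value) { strong.put(key, value); }

    V get(K key) {
        V v = strong.get(key);
        if (v != null) return v;
        SoftReference<V> ref = soft.get(key);
        v = (ref == null) ? null : ref.get(); // null if the GC cleared it
        if (v != null) strong.put(key, v);    // promote back on access
        return v;
    }
}
```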
If the JVM encounters an out-of-memory exception, is there a way to catch that exception and delete the cache objects to free up memory?
You can do this, but it's not ideal, as the error can be thrown almost anywhere, in just about any thread.
I am trying to implement an LRU cache in Java which should be able to:
Change size dynamically, in the sense that I plan to hold values as SoftReferences registered with a ReferenceQueue. So, depending on memory consumption, the cache size will vary.
I plan to use ConcurrentHashMap where the value will be a soft reference and then periodically clear the queue to update the map.
But the problem with the above is that, how do I make it LRU?
I know that we have no control over GC, but can we manage the references to the cached values in such a manner that objects become softly reachable (and thus collectable) based on usage, i.e. the time they were last accessed, rather than in some random order?
Neither weak nor soft references are really well suited for this. Weak references tend to get cleared immediately, as soon as the object has no stronger references any more, and soft references get cleared only after the heap has grown to its maximum size, when an OutOfMemoryError would otherwise need to be thrown.
Typically it's more efficient to use a time-based approach with regular strong references, which are much cheaper for the VM than the Reference subclasses (faster for both the program and the GC to handle, and no extra memory for the reference object itself): release all objects that have not been used for a certain time. You can check this with a periodic TimerTask, which you would need anyway to operate your reference queue. The idea is that if it takes e.g. 10 ms to create an object and you keep it for at most 1 s after its last use, you will on average be only 1% slower than if you kept all objects forever. And since this will most likely use less memory, it will actually be faster.
Edit: One way to implement this is to use 3 buckets internally. Objects placed into the cache are always inserted into bucket 0. When an object is requested, the cache looks for it in all 3 buckets in order and places it into bucket 0 if it was not already there. The TimerTask is invoked at fixed intervals: it drops bucket 2 and places a new empty bucket at the front of the list, so the new bucket 0 is empty, the former bucket 0 becomes bucket 1, and the former bucket 1 becomes bucket 2. This ensures that idle objects survive at least one and at most two timer intervals, and that objects accessed more than once per interval are very fast to retrieve. The total maintenance overhead of such a data structure is considerably smaller than anything based on reference objects and reference queues.
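The three-bucket scheme might be sketched like this; the class name is illustrative, and `tick()` would be driven by the TimerTask mentioned above:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the three-bucket scheme: new and recently used entries live
// in bucket 0; each tick() drops the oldest bucket, so an idle entry
// survives at least one and at most two ticks.
class BucketCache<K, V> {
    private Map<K, V> b0 = new HashMap<>();
    private Map<K, V> b1 = new HashMap<>();
    private Map<K, V> b2 = new HashMap<>();

    void put(K key, V value) { b0.put(key, value); }

    V get(K key) {
        V v = b0.get(key);
        if (v == null) {
            v = b1.get(key);
            if (v == null) v = b2.get(key);
            if (v != null) b0.put(key, v);  // refresh on access
        }
        return v;
    }

    // Called from a TimerTask at fixed intervals: rotate the buckets,
    // discarding the oldest one wholesale.
    void tick() {
        b2 = b1;
        b1 = b0;
        b0 = new HashMap<>();
    }
}
```

Stale copies left in older buckets after a refresh are harmless: lookups hit bucket 0 first, and the rotation discards them eventually.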
Your question doesn't really make sense unless you want several of these caches at the same time. If you have only a single cache, don't give it a size limit and always use WeakReference. That way, the cache will automatically use all available free memory.
Prepare for some hot discussions with your sysadmins, though, since they will come complaining that your app has a memory leak and "will crash any moment!" sigh
The other option is to use a mature cache library like EHCache since it already knows everything that there is to know about caches and they spent years getting them right - literally. Unless you want to spend years debugging your code to make it work with every corner case of the Java memory model, I suggest that you avoid reinventing the wheel this time.
I would use a LinkedHashMap, as it supports access order and can be used as an LRU map. It can have a variable maximum size.
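A minimal sketch of that approach, using LinkedHashMap's access-order mode and `removeEldestEntry`; here `maxSize` is fixed, but it could be a mutable field to get the variable limit mentioned above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU map: LinkedHashMap in access order evicts the
// least-recently-used entry once the size limit is exceeded.
class LruMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruMap(int maxSize) {
        super(16, 0.75f, true);  // true = iterate in access order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}
```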
Switching between weak and soft references based on usage is very difficult to get right, because it's hard to determine (a) how much memory your cache is using exclusively, (b) how much is being used by the rest of the system, and (c) how much would be freed after a full GC.
You should note that weak and soft references are only cleared on a GC, and that discarding or changing them won't free any memory until a GC runs.
I'm working with a program that runs lengthy SQL queries and stores the processed results in a HashMap. Currently, to get around the slow execution time of each of the 20-200 queries, I am using a fixed thread pool and a custom callable to do the searching. As a result, each callable is creating a local copy of the data which it then returns to the main program to be included in the report.
I've noticed that 100 query reports, which used to run without issue, now cause me to run out of memory. My speculation is that because these callables are creating their own copy of the data, I'm doubling memory usage when I join them into another large HashMap. I realize I could try to coax the garbage collector to run by attempting to reduce the scope of the callable's table, but that level of restructuring is not really what I want to do if it's possible to avoid.
Could I improve memory usage by replacing the callables with runnables that instead of storing the data, write it to a concurrent HashMap? Or does it sound like I have some other problem here?
Don't create copies of the data; just pass references around, ensuring thread safety where needed. If you still get OOMs without the data copying, consider increasing the maximum heap available to the application.
The drawback of not copying the data is that thread safety is harder to achieve, though.
Do you really need all 100-200 reports at the same time?
Maybe it's worth limiting the first level of caching to just 50 reports and introducing a second level based on a WeakHashMap?
When the first level exceeds its size, the least-recently-used report is pushed down to the second level, whose lifetime depends on the amount of available memory (through the WeakHashMap).
Then, to look up a report, you first query the first level; if the value is not there, you query the second level; and if it's not there either, the report was reclaimed by the GC when memory ran low, and you have to query the DB again for it.
Do the results of the queries depend on other query results? If not, just have each thread write its results into a ConcurrentHashMap as it produces them, as you're implying. Do you really need to ask whether creating several unnecessary copies of the data is causing your program to run out of memory? That should be almost obvious.
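That suggestion can be sketched as follows; `runQuery` is a placeholder for the real SQL call, and the pool size is arbitrary:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: worker tasks write query results straight into one shared
// ConcurrentHashMap instead of building private copies and merging.
class SharedResults {
    static List<String> runQuery(int id) {
        return List.of("row-" + id);  // placeholder for the real SQL query
    }

    static Map<Integer, List<String>> runAll(int queries) throws InterruptedException {
        Map<Integer, List<String>> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < queries; i++) {
            final int id = i;
            pool.execute(() -> results.put(id, runQuery(id)));  // one copy, shared map
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return results;
    }
}
```

Since each task writes under a distinct key, there is no contention beyond the map's own internal locking, and no merge step at the end.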