Caching large amounts of spatial data in Java - is it feasible?

Caching large amounts of spatial data in Java - is it feasible? - java

I have run into a situation in which I would like to store an in-memory cache of spatial data which is not immediately needed, and is not loaded from disk, but generated algorithmically. Because the data is accessed spatially, data would be deleted from the cache based on irrelevance factors and the distance from the location of the most recent read operation. The problem is that Java's garbage collection does not seem to integrate well with this system. I would like to use the spatial knowledge of the data to enable it to be garbage-collected by the JVM. Is there a way to mark these cache objects as garbage-collectible? If the JVM encounters an out-of-memory exception, is there a way to catch that exception and delete the cache objects to free up memory?
Or is this the wrong way to do things?

Is there a way to mark these cache objects as garbage-collectible?
The simplest way is to store
some data with strong references e.g. in a LinkedHashMap, possible as a LRU cache.
data which you would like to retain if possible in a SoftReferences cache. These will not be cleaned up immediately but will be cleaned up before an OOME.
data which can be discarded with little cost in a WeakHashMap. This data is available until the GC is performed.
If the JVM encounters an out-of-memory exception, is there a way to catch that exception and delete the cache objects to free up memory?
You can do this but its not ideal as the error can be thrown anywhere in just about any thread.

Related

Using MapDB efficiently (confused about commits)

I'm using MapDB in a project that deals with billions of Objects that need to be mapped/queued. I don't need any kind of persistence after the program finishes (the MapDB databases are all temporary). I want the program to run as fast as possible, but I'm confused about MapDB's commit() function (which I assume is relevant to performance), even after reading the docs. My questions:
What exactly does commit do? My working understanding is that it serializes Objects from the heap to disk, thus freeing heap space. Is this accurate?
What happens to the references to Objects that were just committed? Do they get cleaned up by GC, or do they somehow 'reference' an Object on disk (with MapDB making this transparent?)
Ultimately I want to know how to use MapDB as efficiently as I can, but I can't do that without knowing what commit() is for. I'd appreciate any other advice that you might have for using MapDB efficiently.

The commit operation is an operation on transactions, just as you would find in a database system. MapDB implements transactions, so commit is effectively 'make the changes I've made to this DB permanent and visible to other users of it'. The complimentary operation is rollback, which discards all of the changes you've made within the current transaction. Commit doesn't (directly) affect what is in memory and what is not. You might want to look at compact() instead, if you're trying to reclaim heap space.
For your second question, if you're holding a strong reference to an object then you continue holding that strong reference. MapDB isn't going to delete it for you. You should think of MapDB as a normal Java Map, most of the time. When you call get, MapDB hides whether it's in memory or on disk from you and just returns you a usable reference to the retrieved object. That retrieved object will hang around in memory until it becomes garbage, just like anything else.

It is a good idea to try to commit not after every single change to a map you make, but instead do it on some sort of schedule.
like
Every N changes
Every M seconds
After some sort of logical checkpoints in your code.
Doing too many commits will make your application very slow.

java fastest concurrent random file R/W method for SSDs without memory swap

I have a linux box with 32GB of ram and a set of 4 SSD in a raid 0 config that maxes out at about 1GB of throughput (random 4k reads) and I am trying to determine the best way of accessing files on them randomly and conccurently using java. The two main ways I have seen so far are via random access file and mapped direct byte buffers.
Heres where it gets tricky though. I have my own memory cache for objects so any call to the objects stored in a file should go through to disk and not paged memory (I have disabled the swap space on my linux box to prevent this). Whilst mapped direct memory buffers are supposedly the fastest they rely on swapping which is not good because A) I am using all the free memory for the object cache, using mappedbytebuffers instead would incur a massive serialization overhead which is what the object cache is there to prevent.(My program is already CPU limited) B) with mappedbytebuffers the OS handles the details of when data is written to disk, I need to control this myself, ie. when I write(byte[]) it goes straight out to disk instantly, this is to prevent data corruption incase of power failure as I am not using ACID transactions.
On the other hand I need massive concurrency, ie. I need to read and write to multiple locations in the same file at the same time (whilst using offset/Range locks to prevent data corruption) I'm not sure how I can do this without mappedbytebuffers, I could always just que the reads/Writes but I'm not sure how this will negatively affect my throughput.
Finally I can not have a situation when I am creating new byte[] objects for reads or writes, this is because I perform almost a 100000 read/write operations per second, allocating and Garbage collecting all those objects would kill my program which is time sensitive and already CPU limited, reusing byte[] objects is fine through.
Please do not suggest any DB software as I have tried most of them and they add to much complexity and cpu overhead.
Anybody had this kind of dilemma?

Whilst mapped direct memory buffers are supposedly the fastest they rely on swapping
No, not if you have enough RAM. The mapping associates pages in memory with pages on disk. Unless the OS decides that it needs to recover RAM, the pages won't be swapped out. And if you are running short of RAM, all that disabling swap does is cause a fatal error rather than a performance degradation.
I am using all the free memory for the object cache
Unless your objects are extremely long-lived, this is a bad idea because the garbage collector will have to do a lot of work when it runs. You'll often find that a smaller cache results in higher overall throughput.
with mappedbytebuffers the OS handles the details of when data is written to disk, I need to control this myself, ie. when I write(byte[]) it goes straight out to disk instantly
Actually, it doesn't, unless you've mounted your filesystem with the sync option. And then you still run the risk of data loss from a failed drive (especially in RAID 0).
I'm not sure how I can do this without mappedbytebuffers
A RandomAccessFile will do this. However, you'll be paying for at least a kernel context switch on every write (and if you have the filesystem mounted for synchronous writes, each of those writes will involve a disk round-trip).
I am not using ACID transactions
Then I guess the data isn't really that valuable. So stop worrying about the possibility that someone will trip over a power cord.

Your objections to mapped byte buffers don't hold up. Your mapped files will be distinct from your object cache, and though they take address space they don't consume RAM. You can also sync your mapped byte buffers whenever you want (at the cost of some performance). Moreover, random access files end up using the same apparatus under the covers, so you can't save any performance there.
If mapped bytes buffers aren't getting you the performance you need, you might have to bypass the filesystem and write directly to raw partitions (which is what DBMS' do). To do that, you probably need to write C++ code for your data handling and access it through JNI.

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
Something along the lines of http://wwwasd.web.cern.ch/wwwasd/lhc++/Objectivity/V5.2/Java/guide/jgdStorage.fm.html and specifically non-garbage-collectible containers there (non-garbage-collectable?).
The problem is that I have lots of ordinary temporary objects, but I have even bigger (several Gigs) of objects that are stored for Cache purposes. For no reason should the Java GC traverse all those Cache gigabytes trying to find anything to collect, because they contain cached data which have their own timeouts.
This way I could partition my data in a custom way into infinite-lived and normal-lived objects, and hopefully GC would be quite fast because normal objects don't live so long and amount to smaller amounts.
There are some workarounds to this problem, such as Apache DirectMemory and Commercial Terracotta BigMemory(http://terracotta.org/products/bigmemory), but a java-native solution would be nicer (I mean free and probably more reliable?). Also I want to avoid serialization overhead which means it should happen within same jvm. To my understanding DirectMemory and BigMemory operate mainly off heap which means that the objects must be serialized/deserialized to/from memory outside jvm. Simply marking non-gc regions within the jvm would seem a better solution. Using Files for cache is not an option either, it has the same unaffordable serialization/deserialization overhead - use case is a HA server with lots of data used in random (human) order and low latency needed.

Any memory the JVM manages is also garbage-collected by the JVM. And any “live” objects which are directly available to Java methods without deserialization have to live in JVM memory. Therefore in my understanding you cannot have live objects which are immune to garbage collection.
On the other hand, the usage you describe should make the generational approach to garbage collection quite efficient. If your big objects stay around for a while, they will be checked for reclamation less often. So I doubt there is much to be gained from avoiding those checks.

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
No it is not possible.
You can prevent objects from being garbage collected by keeping them reachable, but the GC will still need to trace them to check reachability on each full; GC (at least).
Is simply my assumption, that when the jvm is starving it begins scanning all those unnecessary objects too.
Yes. That is correct. However, unless you've got LOTS of objects that you want to be treated this way, the overhead is likely to be insignificant. (And anyway, a better idea is to give the JVM more memory ... if that is possible.)

Quite simply, for you to be able to do this, the garbage collection algorithm would need to be aware of such a flag, and take it into account when doing its work.
I'm not aware of any of the standard GC algorithms having such a flag, so for this to work you would need to write your own GC algorithm (after deciding on some feasible way to communicate this information to it).
In principle, in fact, you've already started down this track - you're deciding how garbage collection should be done rather than being happy to leaving it to the JVM's GC algo. Is the situation you describe a measurable problem for you; something for which the existing garbage collection is insufficient, but your plan would work? Garbage collectors are extremely well-tuned, so I wouldn't be surprised if the "inefficient" default strategy is actually faster than your naively-optimal one.
(Doing manual memory management is tricky and error-prone at the best of times; managing some memory yourself while using a stock garbage collector to handle the rest seems even worse. I expect you'd run into a lot of edge cases where the GC assumes it "knows" what's happening with the whole heap, which would no longer be true. Steer clear if you can...)

The recommended approaches would be to use either a commerical RTSJ implementation to avoid GC, or to use off heap memory. One could also look into soft references for caches as well (they do get collected).
This is not recommended:
If for some reason you do not believe these options are sufficient, you could look into direct memory access which is UNSAFE (part of sun.misc.Unsafe). You can use the 'theUnsafe' field to get the 'Unsafe' instance. Unsafe allows to allocation/deallocate memory via 'allocateMemory' and 'freeMemory'. This is not under GC control nor limited by JVM heap size. The impact on GC/application, once you go down this route, is not guaranteed - which is why using byte buffers might be the way to go (if you're not using a RTSJ like implementation).
Hope this helps.

Living Java objects will always be part of the GC life cycle. Or said another way, marking an object to be non-gc is the same order of overhead than having your object referenced by a root reference (a static final map for instance).
But thinking a bit further, data put in a cache are most likely to be temporary, and would eventually be evicted. At that point you will start again to like the JVM and the GC.
If you have 100's of GBs of permanent data, you may want to rethink the architecture of your application, and try to shard and distribute your data (horizontally scalability).
Last but not least, lots of work has been done around serialization, and the overhead of serialization (I'm not speaking about the poor reputation of ObjectInputStream and ObjectOutputStream) is not that big.
More than that, if your data is mainly composed of primitive types (including bytes array), there is efficient way to readInt() or readBytes() from off heap buffers (for instannce netty.io's ChannelBuffer). This could be a way to go.

JCS Cache shutdown,guaranteed persistence to disk

Iam using JCS for caching.Now I am using disk cache to temporarily store all the data.The problem is when I use JCS,the keys are written to disk only if the cache is properly shutdown.
I am using the disk usage pattern as UPDATE which tells JCS to immediately write data to the disk without keeping it in memory.But the problem is we are not maitaining the key list of objects in the cache.So I use group cache access and get the keys from the cache and then iterate through the keys to get the results.
So now I am caught in a situation where I have to shutdown the cache properly i.e after all the data is written to disk using Indexed disk cache.But there is a complexity here,the indexed disk cache uses a background thread to write to disk which does not return anything on its status.
So now,I am unable to guarantee that indexed disk cache has written data to the disk to my front end implementation.Is there a way to tackle this situation,because now I am just sleeping some random time(say 10 seconds),before the cache is shutdown,which is a very stupid way of doing it actually.
Edit : I am facing this issue with Memory Cache as well,but a sleep of one second is mostly enough for 500mb of data.But the case of disk cache is little different.

It could be because your objects are stored in the memory and waiting to write to disk. If you need to write the objects immediately to disk while in execution then you need to make the MaxObjects of your cache configs to 0.
jcs.region.<yourRegion>.cacheattributes.MaxObjects=0
jcs.region.<yourRegion>.cacheattributes.DiskUsagePattern=UPDATE
I know you already aware of UPDATE. Adding it for reference again.

LRU LinkedHashMap that limits size based on available memory

I want to create a LinkedHashMap which will limit its size based on available memory (ie. when freeMemory + (maxMemory - allocatedMemory) gets below a certain threshold). This will be used as a form of cache, probably using "least recently used" as a caching strategy.
My concern though is that allocatedMemory also includes (I assume) un-garbage collected data, and thus will over-estimate the amount of used memory. I'm concerned about the unintended consequences this might have.
For example, the LinkedHashMap may keep deleting items because it thinks there isn't enough free memory, but the free memory doesn't increase because these deleted items aren't being garbage collected immediately.
Does anyone have any experience with this type of thing? Is my concern warranted? If so, can anyone suggest a good approach?
I should add that I also want to be able to "lock" the cache, basically saying "ok, from now on don't delete anything because of memory usage issues".

I know I'm biased, but I really have to strongly recommend our MapMaker for this. Use the softKeys() or softValues() feature, depending on whether it's GC collection of the key or of the value that more aptly describes when an entry can be cleaned up.

Caches tend to be problematic. IIRC, there's a SoftCache in Sun's JRE that has had many problems.
Anyway, the simplest thing is to use SoftReferences in the map. This should work fine so long as the overhead of SoftReference plus Map.Entry is significantly lower than the cached data.
Alternatively you can, like WeakHashMap, use a ReferenceQueue and either poll it or have a thread blocking on it (one thread per instance, unfortunately). Be careful with synchronisation issues.
"Locking" the map, you probably want to avoid if necessary. You'd need keep strong references to all the data (and evict if not null). That is going to be ugly.

I would strongly suggest using something like Ehcache instead of re-inventing a caching system. It's super simple to use, very configurable, and works great.

As matt b said, something like Ehcache or JbossCache is a good first step.
If you want something light-weight and in-process, look at google collections. For example, you can use MapMaker (http://google-collections.googlecode.com/svn/trunk/javadoc/index.html?com/google/common/collect/BiMap.html) to make a map with Soft/Weak keys and values, so it would cache only those items it has room for (though you wouldn't get LRU).

I had the same need in the past and this is how I implemented my cache:
there is a cache memory manager, which has a minimum and a maximum memory limit(max limit it matters anyway)
every registered cache has the following(important) parameters: maximum capacity(most of the time you have a higher limit, you don't want go hold more than X items) & percent memory usage
I use a LinkedHashMap and a ReentrantReadWriteLock to guard the cache.
every X puts I calculate the mean memory consumption per entry and trigger eviction(async) if the calculated memory limit > allowed memory limit.
of course memory computation doesn't actually shows the real memory consumption but comparing the computed memory with the real value(using a profiler) I find that it is close enough.
I was planning to put also an additional guard on the cache, to evict in case puts are going faster than memory based evictions, but until now I didn't find the need to do it.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.