I am looking for a Java-based caching library that supports multiple standard Map implementations as tiers (for instance on-heap, off-heap and flash-based maps). That is, instead of layering multiple independent caches that each keep their own eviction mechanism, I want a SINGLE logic that moves entries between the tiers as they become more or less frequently used.
My use case involves a huge number of relatively small entries, so separate caches would be very storage-inefficient: each lower level would also hold the entries of the tiers above it, and the usage metadata for each key would be duplicated.
The access time must be as low and consistent as possible, so I am not considering distributed/remote cache tiers (Redis, Memcached...) in this case.
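To make the requirement concrete, here is a minimal sketch of what "a single logic over several Map tiers" could look like. It is not taken from any library: TieredCache, tierOf and the capacity array are hypothetical names, and it uses plain LRU instead of frequency tracking to keep it short. One access-ordered LinkedHashMap holds the only per-key metadata (the tier index), and the tier maps hold the values. It assumes the tier maps start empty, and the victim search is a linear scan, so it only illustrates the idea.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: one recency index decides which tier (a plain Map) holds each entry. */
class TieredCache<K, V> {
    private final List<Map<K, V>> tiers;   // fastest tier first, all assumed empty at start
    private final int[] capacities;        // maximum number of entries per tier
    // The only per-key metadata: key -> tier index, iterated least recently used first.
    private final LinkedHashMap<K, Integer> index = new LinkedHashMap<>(16, 0.75f, true);

    TieredCache(List<Map<K, V>> tiers, int[] capacities) {
        if (tiers.size() != capacities.length) {
            throw new IllegalArgumentException("one capacity per tier is required");
        }
        this.tiers = tiers;
        this.capacities = capacities.clone();
    }

    synchronized V get(K key) {
        Integer tier = index.get(key);     // also marks the key as most recently used
        if (tier == null) return null;
        V value = tiers.get(tier).get(key);
        if (tier > 0) {                    // promote a hit from a slower tier
            tiers.get(tier).remove(key);
            place(key, value);
        }
        return value;
    }

    synchronized void put(K key, V value) {
        Integer old = index.get(key);
        if (old != null) tiers.get(old).remove(key);
        place(key, value);
    }

    /** Returns the tier currently holding the key, or -1; does not count as an access. */
    synchronized int tierOf(K key) {
        for (int t = 0; t < tiers.size(); t++) {
            if (tiers.get(t).containsKey(key)) return t;
        }
        return -1;
    }

    private void place(K key, V value) {
        tiers.get(0).put(key, value);
        index.put(key, 0);
        for (int t = 0; t < tiers.size(); t++) {
            while (tiers.get(t).size() > capacities[t] && demoteEldest(t)) { }
        }
    }

    /** Moves the least recently used entry of tier t one tier down, or drops it from the last tier. */
    private boolean demoteEldest(int t) {
        Iterator<Map.Entry<K, Integer>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<K, Integer> e = it.next();
            if (e.getValue() == t) {
                V value = tiers.get(t).remove(e.getKey());
                if (t + 1 < tiers.size()) {
                    tiers.get(t + 1).put(e.getKey(), value);
                    e.setValue(t + 1);     // setValue does not change the recency order
                } else {
                    it.remove();           // fell out of the last tier
                }
                return true;
            }
        }
        return false;
    }
}
```

With two tiers of capacity 2, putting a, b and c pushes a into the second tier; a later get of a brings it back to the first tier and pushes b down instead. Each entry lives in exactly one tier, which is the storage property asked for above.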
I've gone through javax.cache.Cache to understand its usage and behavior. It is stated that:
JCache is a Map-like data structure that provides temporary storage of
application data.
JCache and HashMap both store their elements in local heap memory and have no persistence behavior by default. By implementing a custom CacheLoader and CacheWriter we can achieve persistence. Other than that, when should I use it?
Caches usually have more management logic than a map, which is nothing more than a more or less simple data structure.
Some concepts that JCache implementations may provide:
Expiration: Entries may expire and get removed from the cache after a certain period of time or since last use
Eviction: elements get removed from the cache if space is limited. There can be different eviction strategies, e.g. LRU, FIFO, ...
Distribution: i.e. in a cluster, while Maps are local to a JVM
Persistence: Elements in the cache can be persistent and present after restart, contents of a Map are just lost
More Memory: Cache implementations may use more memory than the JVM heap provides, using an off-heap technique (BigMemory is one product name for it) where objects are serialized into a separately allocated byte buffer. This JVM-external memory is managed by the OS (paging) and not by the JVM
Option to store keys and values either by value or by reference (in maps you have to handle this yourself)
Option to apply security
Some of these are general concepts of JCache; others are implementation details of specific cache providers.
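To illustrate the first two concepts (expiration and eviction) without any cache provider, here is a minimal sketch built on the JDK's LinkedHashMap. SimpleCache and its injectable clock are hypothetical names for this example, not part of JCache. A real provider does far more; the point is only to show the kind of management logic a plain map lacks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Sketch: a map with LRU eviction and a fixed time-to-live per entry. */
class SimpleCache<K, V> {
    private static final class Timed<T> {
        final T value;
        final long expiresAt;
        Timed(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock;                 // injectable so expiry can be tested
    private final LinkedHashMap<K, Timed<V>> entries;

    SimpleCache(final int maxEntries, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxEntries;           // eviction: drop the LRU entry when full
            }
        };
    }

    synchronized void put(K key, V value) {
        entries.put(key, new Timed<>(value, clock.getAsLong() + ttlMillis));
    }

    synchronized V get(K key) {
        Timed<V> timed = entries.get(key);
        if (timed == null) return null;
        if (clock.getAsLong() >= timed.expiresAt) {   // expiration: checked lazily on read
            entries.remove(key);
            return null;
        }
        return timed.value;
    }
}
```

In production code the clock would typically be System::currentTimeMillis. Expired entries are only removed when they are read, which is one of several possible design choices.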
Here are the five main differences between the two.
Unlike java.util.Map, a Cache:
does not allow null keys or values; attempts to use null result in a java.lang.NullPointerException
provides the ability to read values from a javax.cache.integration.CacheLoader (read-through caching) when a requested value is not in the cache
provides the ability to write values to a javax.cache.integration.CacheWriter (write-through caching) when a value is created, updated or removed
provides the ability to observe cache entry changes
may capture and measure operational statistics
Source : GrepCode.com
Many caching implementations can keep cached objects off heap (outside the reach of the GC). The GC keeps track of each and every object allocated in Java. Imagine you have millions of objects in memory. If those objects are not off heap, believe me, GC pauses can make your application's performance horrible.
I have a simple country states hashmap, which is a simple static final unmodifiable concurrent hashmap.
Now we have implemented memcached cache in our application.
My question is: is it beneficial to get the values from the cache instead of from such a simple map?
What benefits will I get, or lose, if I move this map to the cache?
This really depends on the size of the data and how much memory you've allocated for your JVM.
For simple data like the states of a country, which amount to a few hundred entries at most, a simple HashMap suffices; using memcache is overkill and in fact slower.
If it's a large amount of data that grows (typically tens or hundreds of MB, or larger) and requires frequent access, memcache (or another out-of-process cache) would be better than in-memory storage.
A HashMap will be much faster because it is stored in the JVM's memory and a lookup just follows a reference. A lookup from memcache requires extra work: a network round trip plus serialization and deserialization of the value.
If your application is hosted on only one server then you don't need the distributed feature of memcache, and a HashMap will be damn fast.
But this is not the case for most web applications. In ~99% of cases a web application is hosted on multiple servers and you want distributed caching; memcache is best in such cases.
My algorithm will likely not be used on the web. The object I describe may be used by multiple threads, however.
The original object I had designed emulated pointers.
Reduced, a symbol would map to multiple pointers, and each unique pointer would map to a single symbol.
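As a rough illustration of the relation just described (one symbol to many pointers, each pointer to exactly one symbol), here is a minimal sketch with two maps kept in sync. SymbolTable, bind, symbolOf and pointersOf are hypothetical names, and pointers are modelled as long values; the original design is not shown in the question.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: symbol -> many pointers, pointer -> exactly one symbol. */
class SymbolTable {
    private final Map<String, Set<Long>> pointersBySymbol = new HashMap<>();
    private final Map<Long, String> symbolByPointer = new HashMap<>();

    /** Binds a pointer to a symbol, detaching it from any symbol it belonged to before. */
    void bind(String symbol, long pointer) {
        String previous = symbolByPointer.put(pointer, symbol);
        if (previous != null && !previous.equals(symbol)) {
            Set<Long> old = pointersBySymbol.get(previous);
            if (old != null) {
                old.remove(pointer);
                if (old.isEmpty()) pointersBySymbol.remove(previous);
            }
        }
        pointersBySymbol.computeIfAbsent(symbol, s -> new HashSet<>()).add(pointer);
    }

    /** Returns the symbol a pointer is bound to, or null if it is unbound. */
    String symbolOf(long pointer) {
        return symbolByPointer.get(pointer);
    }

    /** Returns a read-only view of the pointers bound to a symbol. */
    Set<Long> pointersOf(String symbol) {
        Set<Long> found = pointersBySymbol.get(symbol);
        return found == null ? Collections.<Long>emptySet() : Collections.unmodifiableSet(found);
    }
}
```

Both directions are average O(1) hash lookups, and nothing here is thread-safe yet.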
When I was finally finished and had a working algorithm, it turned out I actually needed six maps in total (these maps are called tens of thousands of times).
Initial testing with a very small sample set of symbols showed the program to be working very efficiently. However, I'm afraid that once I increase the number of symbols a few thousand-fold it will become sluggish.
Once the program completes and closes, the pointers do not need to persist.
I was wondering if I should reimplement my algorithm using a database as a backend. Would this be better than using all of these maps?
The maps are stored in memory. The database will be stored on a hard drive (I have an SSD, so I'm afraid there will be a large difference in performance on my machine vs a machine using SATA/PATA). The maps should also be O(1). The maps might also become very ugly once multithreading is introduced, unless I use thread-safe maps, which would slow the program down. A database would efficiently handle these tasks.
I've formally written out the proper relations, and I'm sure I can implement it in a database if that was the best option. Which is the better option?
If you do not need to persist that data structure, do not try to back it with a database. In your place, I would run some load tests with a realistic amount of data on the data structure you already have, and refine it from there if performance is not what you expect.
Anyway, the current trend is to use relational databases on disk for persistence and to cache frequently queried data in "big hashtables" in memory for performance. I doubt that falling back to a database would improve your performance.
If your data structures fit in memory, I would be shocked if using a database would be faster (not even considering the complexity of using a database implementation). By throwing away all the assumptions, features, safety and consistency that a database must maintain, you will gain performance. Even the best DB implementation, assuming enough memory to cache everything, pretty much has a ConcurrentHashMap as an upper bound on performance. As a practical matter, you won't get CHM performance even with great caching, because a DB API will require defensive copies or cache invalidations that you can avoid with your in-memory structure.
Apart from the likely performance boost simply from using an in-memory hashmap, you may also get additional performance by tuning your structure based on your specific use case. For example, perhaps the initial lookup is multi-threaded, but individual values are only accessed by a single thread. In that case, you can avoid locking those values.
Hard drives, even fast ones, are several orders of magnitude slower than memory. So if your goal is performance you should stay in memory and use maps. For thread safety you can just use a ConcurrentHashMap: its reads generally do not block and its updates lock only a small part of the table, so the synchronisation penalty in a multi-threaded environment should be minimal.
You should also check if a single thread does not provide enough performance - multiple threads always introduce some overhead and they need to bring enough gains to offset it.
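A minimal sketch of the ConcurrentHashMap suggestion, assuming Java 8 or later: several worker threads update shared counters without any explicit locking in the calling code. SharedCounts and its parameters are made up for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: several threads updating one ConcurrentHashMap without external locks. */
class SharedCounts {
    /** Starts `threads` workers; each increments every key in 0..keys-1 `rounds` times. */
    static ConcurrentHashMap<Integer, Integer> count(int threads, int keys, int rounds) {
        ConcurrentHashMap<Integer, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int r = 0; r < rounds; r++) {
                    for (int k = 0; k < keys; k++) {
                        // merge is an atomic read-modify-write for this key
                        counts.merge(k, 1, Integer::sum);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }
}
```

Calling SharedCounts.count(4, 10, 1000) yields ten keys that each map to 4000, with no lost updates. Running the same method with one thread and timing both runs is a cheap way to check whether the extra threads pay for their overhead.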
You may also want to check in-memory DBs such as HyperSQL or H2 Database.
I'm new to NoSQL, and I'm scratching my head trying to figure out the most appropriate NoSQL implementation for the application I'm trying to build.
My Java application needs to have an in-memory hashmap containing millions to billions of entries as it models a single-layer neural network. Right now we're using Trove in order to be able to use primitives as keys and values to reduce the size of the map and increase the access speed. The map is a map of maps where the outer map's keys are longs and the inner maps have long/float key/values.
We need to be able to read the saved state from disk to the map of maps when the application starts up. The changes to the map of maps need also to be saved to disk either continuously or according to some scheduled interval.
I was at first drawn towards OrientDB because of their document and object DBs, although I'm still not sure at this point what would be better. Then I came across Redis, which is a key value store and works with an in-memory dataset that can be dumped to disk, including master-slave replication. However, it doesn't look like the values of the map can be anything other than Strings.
Am I looking in the right places for a solution to my needs? Right now, I like the in-memory and master-slave aspect of Redis, but I like the object/document capabilities of OrientDB as my data structures are more complicated than simple Strings and being able to use Trove with the primitive key/value types is very advantageous. It would be better if reading was cheap and writing was expensive rather than the other way around.
Thoughts?
Why not just serialize the Trove data structures directly to disk? There appears to be some sort of support for that judging by the documentation (http://trove4j.sourceforge.net/javadocs/serialized-form.html), but it's hard to tell because it's all auto-generated cruft instead of lovingly-made tutorials. Still, for your use case it's not obvious why you need a proper database, so perhaps KISS applies.
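As a sketch of the "just write it to disk" idea, the following dumps a map of maps with DataOutputStream and reads it back. It deliberately uses java.util maps with boxed Long/Float as a stand-in, because the exact Trove API is not shown here; NestedMapStore and the file layout (row count, then per row: key, cell count, cells) are assumptions for the example only.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Sketch: dump a long -> (long -> float) map of maps to a flat binary file and read it back. */
class NestedMapStore {
    static void save(Map<Long, Map<Long, Float>> outer, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(outer.size());                    // number of rows
            for (Map.Entry<Long, Map<Long, Float>> row : outer.entrySet()) {
                out.writeLong(row.getKey());
                out.writeInt(row.getValue().size());       // number of cells in this row
                for (Map.Entry<Long, Float> cell : row.getValue().entrySet()) {
                    out.writeLong(cell.getKey());
                    out.writeFloat(cell.getValue());
                }
            }
        }
    }

    static Map<Long, Map<Long, Float>> load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int rows = in.readInt();
            Map<Long, Map<Long, Float>> outer = new HashMap<>();
            for (int r = 0; r < rows; r++) {
                long rowKey = in.readLong();
                int cells = in.readInt();
                Map<Long, Float> inner = new HashMap<>();
                for (int c = 0; c < cells; c++) {
                    long cellKey = in.readLong();
                    inner.put(cellKey, in.readFloat());
                }
                outer.put(rowKey, inner);
            }
            return outer;
        }
    }

    /** Convenience for trying it out: writes to a temp file, reads it back, deletes the file. */
    static Map<Long, Map<Long, Float>> roundTrip(Map<Long, Map<Long, Float>> outer) {
        try {
            Path file = Files.createTempFile("weights", ".bin");
            try {
                save(outer, file);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This writes a full snapshot each time, which fits the "scheduled interval" option better than continuous saving. The int size fields would also need to become longs if a single map ever held more than about two billion entries.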
OrientDB has the most flexible engine with index, graph, transactions and complex documents as JSON. Why not?
Check out Java-Chronicle. It's a low latency persistence library. I think you may find it offers excellent performance for this type of data.
If you'd like to use Redis for this, you'd likely be best served by using either ZSETs or HASHes as the underlying structures (Redis supports structures, not just string values). Unless you need to fetch parts of your maps based on the values or the sorted order of the values, HASHes would probably be best (in terms of memory and speed).
So you would probably want to use a long -> {long:float, ...} layout, that is, longs mapping to long/float maps. You can then fetch individual entries in a map with HGET, multiple entries with HMGET, or the full map with HGETALL. You can see the command reference at http://redis.io/commands
On the space saving side of things, depending on the expected size of your HASHes, you may be able to tune them to use less space with limited/no negative effects on performance.
On the persistence side of things, you can either run Redis with snapshots or using incremental saving with append-only files. You can see the persistence documentation here: http://redis.io/topics/persistence
If you'd like to ask more pointed questions, you should head over to the mailing list https://groups.google.com/forum/?fromgroups=#!topic/redis-db/33ZYReULius
Redis supports more complex data structures than simple strings, such as lists, (sorted) sets and hashes, which might come in handy for your domain model. On the other hand, your neural network may be able to take advantage of the rich graph capabilities of OrientDB, depending on its structure.
I'm developing a simple Java EE 5 "routing" application. Different messages from a MQ queue are first transformed and then, according to the value of a certain field, stored in different datasources (stored procedures in different ds need to be called).
For example valueX -> dataSource1, valueY -> dataSource2. All datasources are set up in the application server with different JNDI entries. Since the routing info usually won't change while the app is running, is it safe to cache the datasource lookups? For example, I would implement a singleton which holds a hashmap where I store valueX -> DataSource1. When a certain entry is not in the map, I would do the resource lookup and store the result in the map. Do I gain any performance with the cache, or are these resource lookups fast enough?
In general, what's the best way to build this kind of cache? I could use a cache for some other DB lookups too. For example, the mapping valueX -> resource name is defined in a simple table in a DB. Is it better to look up the values on demand and save the result in a map, to do a lookup every time, or even to read and save all entries on startup? Do I need to synchronize the access? Can I just create an "enum" singleton implementation?
It is safe from an operational/change-management point of view, but not from a programmer's.
From a programmer's point of view, the DataSource configuration can change at runtime, and therefore one should always repeat the lookup.
But this is not how things are happening in real life.
When a change to a DataSource is to be implemented, this is done via a change management procedure. There is a change request (c/r) record, and that record states that the application will have downtime. In other words, the operations folks executing the c/r will bring the application down, make the change and bring it back up. Nobody makes changes like this on a live AS -- for safety reasons. As a result, you shouldn't take into account the possibility that a DS changes at runtime.
So any permanent, synchronized, shared cache is fine in this case.
Will you get a performance boost? This depends on the AS implementation. It is likely to have a cache of its own, but that cache may be more generic and so slower, and in fact you cannot count on its presence at all.
Do you need to build a cache? The answer usually comes from performance tests. If there is no problem, why waste time and introduce risks?
Summary: yes, build a simple cache and use it -- if it is justified by the performance increase.
Specifics of the implementation depend on your preferences. I usually have a cache that does lookups on demand and has a synchronized map of jndi->object inside. For a high-concurrency cache I'd use read/write locks instead of naive synchronized -- i.e. many reads can go in parallel, while adding a new entry gets exclusive access. But those details depend heavily on the application.
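A minimal sketch of the on-demand cache with a read/write lock described above. LookupCache is a hypothetical name, and the lookup is passed in as a function so that the class does not depend on a running application server. It uses java.util.function.Function, so it assumes Java 8 or later rather than the Java EE 5 platform from the question. In a real application the function would wrap something like new InitialContext().lookup(name), converting the checked NamingException into an unchecked one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Sketch: name -> resource cache that performs the expensive lookup once per name. */
class LookupCache<T> {
    private final Map<String, T> entries = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Function<String, T> lookup;   // e.g. a wrapped JNDI lookup (assumption)

    LookupCache(Function<String, T> lookup) {
        this.lookup = lookup;
    }

    T get(String name) {
        lock.readLock().lock();                  // many readers may hold this at once
        try {
            T cached = entries.get(name);
            if (cached != null) return cached;
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();                 // exclusive access while adding an entry
        try {
            T cached = entries.get(name);        // re-check: another thread may have won the race
            if (cached == null) {
                cached = lookup.apply(name);
                if (cached != null) entries.put(name, cached);
            }
            return cached;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

On Java 8 or later, a ConcurrentHashMap with computeIfAbsent gives the same effect with less code; the explicit lock version is shown because it matches the approach described in the answer.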