For an upcoming project I will keep a large amount of data (up to 10GB) in RAM, but not as a cache. Is is possible to use BigMemory (in particular Go, i.e. the free edition) without EH Cache, simply as a non garbage collected memory storage? I have not found a clear answer in the docs, which mostly talk about the typical integration with EHCache.
Thank you.
Yes, EhCache is the API for BigMemory:
BigMemory Go currently uses Ehcache as its user-facing data access API.
Basically, the way BigMemory has been designed is as sort of another storage tier. You store things in the heap exceeding which you store things offheap (which is the bigmemory) and then exceeding which you store things on the disk. It makes sense to do so because in the nosql paradigm where we want to store bigdata; things work well if they are in key-value form. You can choose to store any kind of value by just making it serializable.
As for your constraint of "not as a cache", its very much possible to configure the cache so that values don't get evicted from the memory. Anyways if you use BigMemory Go, you get a limit of 32GB so storing 10GB won't trigger any eviction algorithms even without any configuration.
Related
I'm using Infinispan (7.2.3.Final) to store data into multiples caches.
The thing is : I only want to store datas locally, into files. I don't want to store datas into memory to avoid memory issues.
I get this error :
java.lang.OutOfMemoryError: Java heap space
Infinispan is a Java distributed in-memory. So I don't think it is relevant to use it if you don't want to use RAM. In my opinion, a good usage of Infinispan means you will have tune the memory (size and eviction) to find a good tradeoff between the running-cost, the complexity, and the performance.
You can configure Infinispan to persist data (doc). And you can configure it to evict data from RAM memory (doc). But I cannot advice a real configuration if you do not describe your use-case, and in particular why you think you need Infinispan (why not a data base?)
A possible usage is to keep all-in-memory. Obviously your data has to be small enough (I don't give number, some people can accept to pay several machines to reduce latency, it depends on your business...)
Ten year ago, we could use them to simply batch insertions in a data base. Now we use Kafka for this use case.
A frequent usage is keep hot data in memory. In this case we configure eviction and persistence. I think you are looking for eviction strategies here. There are several eviction strategies. But as far I know none allows to not use RAM at all: objects will pass through the memory, at least during the persistence.
What is the difference between "In memory distributed caching" vs "In memory data grid" ?
When do we use one over the other i.e., what are the practical use cases for "In memory data grid" ?
Can you name few popular "In memory data grid" frameworks which are compatible with Java applications ?
In-memory distributed caching is just that - a cache that is distributed across different nodes. It makes data highly available to the application(s) using it. They're usually key/value stores and support the standard put/get operations along with abilities to partition, replicate or backup data.
An in-memory data grid is a distributed cache with a bit of computing power thrown in. Over the abilities of a distributed cache, it allows you to do distributed SQLs, co-located processing etc...
plamb has given a good list of in-memory data grids.
With a In memory data grid you can do a in memory distributed cache. For example Oracle use Oracle Coherence to implement that kind of cache with Weblogic. So with this example I answered the last part of your question.
But it's the most expensive solution to just doing cache (money, memory, network, cpu) : In memory data grid is reliable more that you need to do cache, it can handle the real data so you don't need another backend.
If you need to do cache in memory distributed, solutions like EHCache, Memcached, Infinispan can do it. In fact almost Java EE application server give a solution to do it.
I have a Java application deployed on a cluster of JBoss AS 5.1 which requires a lot (> 3 GB) of data to be cached.
Right now the server cluster has just 2 nodes (separate machines).
Here are specific requirements:
Both nodes should not require data to be loaded into cache (i.e., there should either be replication or cache should reside on a separate server)
The data should never expire.
Both of the above requirements are REALLY important for the application. I'd be thankful if the suggestion would be made keeping both of these in mind.
I should also add a third requirement:
ease of use
The application was initially using Hashmap. I tried replacing the hashmap with JBoss Cache 3.2.1 for its replication and thread safety features. But i'm not really happy with JBoss Cache performance. Also when i load the data in the cache the 8 Gig of RAM is almost entirely being used (most of it is used by the cache entries).
I'd like to hear the experience of people who have handled such kind of caching scenario themselves. Thanks for your time in advance.
You can try out using GigaSpaces XAP datagrid is a replicated cache. It is very highly performant.
http://www.gigaspaces.com/datagrid
If you want a cache that provides a Java HashMap interface and can easily support gigabytes of cache data, with no expiry, then check out Oracle Coherence. This would use the Coherence "distributed cache" option (which is the default configuration). For more info, see: http://coherence.oracle.com/
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
I have 'solved' this problem before (work code, can't show you)... but, I can tell you this much:
with large volumes, a large amount of memory is used in overhead in HashMaps.
you can save a lot of memory by replacing java.util.* classes with smart uses of arrays.
every time you have memory allocations you also have to scan/collect that memory in the GC, so saving memory also improves performance.
Wherever you can, use arrays....
Edit: Apparently the concept of Hash Maps has been forgotten.... Has the Java implementation of HashMap made people believe it is the only way? A structured set of arrays, with a hash function, and a binary sort.... all basic structures... http://en.wikipedia.org/wiki/Hash_table
One array to add keys to. A parallel array to store the values in, and an int-based hash table to make a fast lookup in to the key array...
Computer Science - maybe second year ;-)
Edit again: I used to core concepts I have described in the JDOM project here: https://github.com/hunterhacker/jdom/blob/master/core/src/java/org/jdom2/StringBin.java
I am looking for some ideas, and maybe already some concrete implemenatation if somebody knows any, but I am willing to code the wanted cache on my own.
I want to have a cache that caches only as many gigs as I configure. In comparision to the rest of the app the cache part will use nearly 100% of memory, so we can generalize the used memory of the app beeing the cache size(+ garbage).
Are there methods for getting a guess of how much memory is used? Or is it better to rely on soft pointers? Soft pointer and running always at the top of the jvm memory limit might be very inefficent with lots of cpu cycles for memory cleaning? Can I do some analysis on existing objects, like a myObject.getMemoryUsage()?
The LinkedHashMap has enough cache hits for my purpose so I don't have to code some strategic caching monster, but I don't know how to solve this momory issue properly. Any ideas? I don't want OOME flying anywhere.
What is best pratice?
SoftReference are not a great idea as they tend to be clearer all at once. This means when you get a performance hit from a GC, you also get a hit having to re-build your cache.
You can use Instrumentation.getObjectSize() to get the shallow size of an Object and use reflection to obtain a deep size. However, doing this relatively expensive and not something you want to get doing very often.
Why can't you limit the size to a number of object? In fact, I would start with the simplest cache you can and only add what you really need.
LRU cache in Java.
EDIT: One way to track how much memory you are using is to Serialize the value and store it as a byte[]. This can give you fairly precise control however can slow down your solution by up to 1000x times. (Nothing comes for free ;)
I would recommend using the Java Caching System. Though if you wanted to roll your own, I'm not aware of any way to get an objects size in memory. Your best bet would be to extend AbstractMap and wrap the values in SoftReferences. Then you could set the java heap size to the maximum size you wanted. Though, your implementation would also have to find and clean out stale data. It's probably easier just to use JCS.
The problem with SoftReferences is that they give more work to the garbage collector. Although it doesn't meet your requirements, HBase has a very interesting strategy in order to prevent the cache from contributing to the garbage collection pauses : they store the cache in native memory :
https://issues.apache.org/jira/browse/HBASE-4027
https://issues.apache.org/jira/secure/attachment/12488272/HBase-4027+%281%29.pdf
A good start for your use-case would be to store all your data on disk. It might seem naive, but thanks to the I/O cache, frequently accessed data will reside in memory. I highly recommend reading these architecture notes from the Varnish caching system :
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
The best practice I find is to delegate the caching functionality outside of Java if possible. Java may be good in managing memory, but at dedicated caching system should be used for anything more than a simple LRU cache.
There is a large cost with GC when it kicks in.
EHCache is one of the more popular ones I know of. Java Caching System from another answer is good as well.
However, I generally offload that work to an underlying function (usually the JPA persistence layer by the application server, I let it get handled there so I don't have to deal with it on the application tier).
If you are caching other data such as web requests, http://hc.apache.org/httpclient-3.x/ is also another good candidate.
However, just remember you also have "a file system" there's absolutely nothing wrong with writing to the file system data you have retrieved. I've used the technique several times to fix out of memory errors due to improper use of ByteArrayOutputStreams
I am new to memcached and caching in general. I have a java web application running on Ubuntu + Tomcat + MySQL on a VPS Server with 1GB of memory.
Does it make sense to add a memcached layer with about 256MB for caching? Will this be too much load on the server? Which is more appropriate caching rendered html pages or database objects?
Please advise.
If you're going to cache pages, don't use memcached, use Varnish. However, there's a good chance that's not a great use of memory. Cacheing pages trades memory for computation and database work, but it does cost quite a lot of memory per page, so it's best for cases where the computation and database work needed to produce a single page amounts to a lot (or the pages are very small!). Also, consider that page cacheing won't be effective, or even possible, if you want to use per-user customisation on your pages (eg showing the number of items in a shopping cart). At least not without getting into some truly hairy shenanigans (edge-side includes, anyone?).
If you're not going to cache pages, and your app is on a single machine, then there's no point using memcached or similar. The point of cache servers like that is to make the memory on one machine work as a cache for another - like how a file server shares a disk, they're essentially memory servers. On a single machine, you might as well give all the memory to Java and cache objects on the heap.
Are you using an object-relational mapper? If so, see if it has any support for a second-level cache. The big three implementations (Hibernate, OpenJPA, and EclipseLink) all support in-memory caches. They're likely to do a much better job than you would if you did the cacheing yourself.
But, if you're not using a mapper, you have no choice but to do the cacheing yourself. There are extension points in LinkedHashMap for building LRU caches, and then of course there's the people's favourite, SoftReference, in combination with a HashMap. Plus, there are probably cache implementations out there you could download and use - i'd be shocked if there wasn't something in the Apache Commons libraries.
memcached won't add any noticeable load on your server, but it will be memory your app can't use. If you only plan to have a single app server for a while, you're better off using an in-JVM cache.
As far what to cache, the answer falls somewhere in the middle of the above. You don't want to cache exactly what's in your database and you certainly don't want to cache the final output. You have a data model representation in your application that isn't exactly what's in the DB (e.g. a User object might be made up of multiple queries from a few different tables). Cache that kind of thing as it's most reusable.
There's lots of info in the memcached site that should help you understand and get going with caching in general and memcached specifically.
It might make sense to do that, why don't try a smaller size like 64 MB and see how that goes. When you use more resources for the memcache, there is less for everything else. You should try it and see what will give you the best performance.