Java large object storage - Protocol Buffers, memory-mapped files

We have a Java program with large objects: tree structures, ArrayLists and MultiMaps.
The problem I'm having is that we have allocated 3 GB of heap memory, but it is still running out of space.
I'm wondering if anyone here can suggest a way to store these objects outside the heap and read chunks of data back into the Java program as needed for each processing call. I'm interested in storing them in files, not a database, for other reasons.
I came across 'memory-mapped files', and someone suggested 'Protocol Buffers' on a related question. Both are alien concepts to me at the moment, and I'm wondering if there is an easy way. I also couldn't find good examples of either.
Would really appreciate your help on this.
Performance is a very important consideration. I'm aware of JVM heap allocation, but I'm not looking to increase the JVM heap size.

You might consider storing your data in something like Chronicle Map. It uses off-heap memory and can be stored and accessed without creating any garbage. This lets you reduce the heap size, but you still need to buy a sensible amount of memory; I would suggest having at least 32 GB, whether you use on-heap or off-heap storage, for larger datasets.
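A minimal sketch of what that might look like, assuming the Chronicle Map 3.x builder API (the exact builder method names vary between versions, and the nodes.dat file and sizes are purely illustrative):

```java
import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;

public class OffHeapNodeStore {
    public static void main(String[] args) throws Exception {
        // Entry count and average key/value sizes must be declared up front so the
        // map can lay its data out off-heap in a memory-mapped file.
        try (ChronicleMap<String, String> nodes = ChronicleMap
                .of(String.class, String.class)
                .entries(1_000_000)
                .averageKey("node-0000001")
                .averageValue("serialized payload of a typical tree node")
                .createPersistedTo(new File("nodes.dat"))) {   // backed by a memory-mapped file

            nodes.put("node-0000001", "payload");
            System.out.println(nodes.get("node-0000001"));     // value is read from the off-heap store
        }
    }
}
```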
There is no reason I have to go for exotic solutions.
In that case, stick to an on-heap solution. You can buy 16 GB of memory for around $200.
I'm not looking to increase the JVM heap size.
Ask yourself how much time and money you are willing to invest to avoid increasing the heap. You can certainly do this, but to save 4 GB I wouldn't spend a day on it. To save 40 GB, 400 GB or 4 TB, that is a different story.

Protocol Buffers does not work well with memory-mapped files, because the file contains encoded data, which must be decoded before you can use it, and that decoding step generates heap objects. You might be able to use protobufs with memory-mapped files if you split the file into lots of small messages which you decode on demand when you need them and then immediately discard, but if you aren't careful you may waste a lot of time repeatedly decoding the same data.
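A rough sketch of that decode-on-demand pattern, assuming a hypothetical protoc-generated Record class and a records.bin file of length-prefixed messages (neither is from the original question):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LazyProtoReader {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("records.bin"), StandardOpenOption.READ)) {
            // The file is paged in by the OS; nothing is copied onto the Java heap yet.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            while (buf.hasRemaining()) {
                byte[] one = new byte[buf.getInt()];   // 4-byte length prefix, then the message
                buf.get(one);

                // Decode only this one small message; the resulting heap object is
                // short-lived and can be dropped as soon as it has been processed.
                Record record = Record.parseFrom(one); // Record = hypothetical generated class
                process(record);
            }
        }
    }

    static void process(Record r) { /* use the decoded message, then let it go */ }
}
```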
Cap'n Proto is a newer format that is very similar to Protocol Buffers but is explicitly designed to work with memory-mapped files. The on-disk format is designed such that it can be used in-place without a decoding step. We are working on a Java version which should be ready for production use within a few weeks.
(Disclosure: I'm the creator of Cap'n Proto, and also was previously the maintainer of Protocol Buffers at Google.)

You may be able to use immutable collections from Guava; they're usually less memory-hungry.
You may be able to use String.intern if strings take up a fair portion of your memory.
You may save a lot with trove4j if you have a lot of boxed primitives.
You can also apply small tricks like using smaller data types, etc.
But you really should make your office buy more memory before wasting your time on computers that have as much RAM as a smartphone!
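A hedged sketch of the first three suggestions above (the one-million-element workload and string values are just illustrative):

```java
import com.google.common.collect.ImmutableList;
import gnu.trove.list.array.TIntArrayList;

import java.util.ArrayList;
import java.util.List;

public class FootprintTricks {
    public static void main(String[] args) {
        // Boxed integers: each Integer is a separate heap object plus a reference.
        List<Integer> boxed = new ArrayList<>();
        // Primitive-backed list from trove4j: one int[] under the hood, no boxing.
        TIntArrayList primitive = new TIntArrayList();
        for (int i = 0; i < 1_000_000; i++) {
            boxed.add(i);
            primitive.add(i);
        }

        // Guava immutable copy: trimmed to exact size, no spare capacity, safely shareable.
        List<Integer> frozen = ImmutableList.copyOf(boxed);

        // Deduplicate repeated strings (e.g. a status column read from a file).
        String category = new String("ACTIVE").intern();

        System.out.println(frozen.size() + " " + primitive.size() + " " + category);
    }
}
```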

Related

Java Heap vs Cache

I have a question with respect to performance optimization.
Which is faster to retrieve from: a cache or Java's heap?
According to the definition I found:
https://www.google.co.in/search?client=ubuntu&channel=fs&q=cache+vs+heap&ie=utf-8&oe=utf-8&gfe_rd=cr&ei=G7V1Ve-xDoeCoAP6goHACg#channel=fs&q=difference+between+cache+and+RAM
If storing my data in a cache via my Java code is faster than storing it in the Java heap, should we always store data in a cache when it is needed for faster access in complex computations?
Kindly advise which one is faster, and in which scenarios each should be used over the other.
Thanks
You are mixing up different concepts.
The quote is:
The difference between RAM and cache is its performance, cost, and proximity to the CPU. Cache is faster, more costly, and closest to the CPU. Due to the cost there is much less cache than RAM. The most basic computer is a CPU and storage for data.
This is about computer architecture and applies to all computers, regardless of what programming language you are using. There is no way to directly control what data is inside the CPU cache; it automatically holds data that is requested very often. Programmers can, however, make their programs more "friendly" to a particular hardware architecture. For example, if the CPU has only a small cache, the code could be optimized to work on a smaller data set or to access memory in a cache-friendly order, as the sketch below illustrates.
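As a rough illustration (not a rigorous benchmark; the 4096x4096 matrix is arbitrary), here is the same summation run in a cache-friendly and a cache-hostile order:

```java
public class CacheFriendlyTraversal {
    public static void main(String[] args) {
        int n = 4096;
        int[][] matrix = new int[n][n];

        // Row-major traversal: each inner loop walks one int[] sequentially,
        // so the CPU cache and prefetcher can keep up.
        long t0 = System.nanoTime();
        long sum = 0;
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col++)
                sum += matrix[row][col];
        long rowMajor = System.nanoTime() - t0;

        // Column-major traversal: touches a different row array on every read,
        // causing far more cache misses even though the work is identical.
        t0 = System.nanoTime();
        for (int col = 0; col < n; col++)
            for (int row = 0; row < n; row++)
                sum += matrix[row][col];
        long colMajor = System.nanoTime() - t0;

        System.out.println(sum + " row-major: " + rowMajor + " ns, column-major: " + colMajor + " ns");
    }
}
```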
A Java cache is something different: it is a library that caches Java objects, e.g. to save requests to an external service. A Java cache can store the object data in the heap, outside the heap in separate memory, or on disk. In-heap storage gives the fastest access, since for any storage outside the heap the objects need to be converted to byte streams (called serialization or marshalling).

Java Memory aware cache

I am looking for some ideas, and maybe existing implementations if somebody knows of any, but I am willing to code the cache on my own.
I want a cache that caches only as many gigabytes as I configure. Compared to the rest of the app, the cache will use nearly 100% of memory, so we can treat the app's memory usage as the cache size (+ garbage).
Are there methods for estimating how much memory is used? Or is it better to rely on soft references? Soft references and always running at the top of the JVM memory limit might be very inefficient, with lots of CPU cycles spent on memory cleaning. Can I do some analysis on existing objects, like a myObject.getMemoryUsage()?
A LinkedHashMap has enough cache hits for my purpose, so I don't have to code some strategic caching monster, but I don't know how to solve this memory issue properly. Any ideas? I don't want OOMEs flying around anywhere.
What is best practice?
SoftReferences are not a great idea, as they tend to be cleared all at once. This means that when you take a performance hit from a GC, you also take the hit of having to rebuild your cache.
You can use Instrumentation.getObjectSize() to get the shallow size of an object and use reflection to obtain a deep size. However, doing this is relatively expensive and not something you want to be doing very often.
Why can't you limit the size to a number of objects? In fact, I would start with the simplest cache you can and only add what you really need.
LRU cache in Java.
EDIT: One way to track how much memory you are using is to serialize the value and store it as a byte[]. This can give you fairly precise control; however, it can slow down your solution by up to 1000x. (Nothing comes for free ;)
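A sketch combining the two ideas above, a size-bounded LRU map and serialized byte[] values, so eviction can work on an actual byte budget (the class name and its maxBytes parameter are illustrative, not an existing library API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Byte-budgeted LRU cache: values are kept serialized so their size is known exactly. */
public class ByteBudgetLruCache<K, V extends Serializable> {
    private final long maxBytes;
    private long usedBytes;
    // accessOrder = true makes iteration start at the least recently used entry.
    private final LinkedHashMap<K, byte[]> map = new LinkedHashMap<>(16, 0.75f, true);

    public ByteBudgetLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public synchronized void put(K key, V value) throws IOException {
        byte[] bytes = serialize(value);
        byte[] previous = map.put(key, bytes);
        usedBytes += bytes.length - (previous == null ? 0 : previous.length);
        evictUntilWithinBudget();
    }

    @SuppressWarnings("unchecked")
    public synchronized V get(K key) throws IOException, ClassNotFoundException {
        byte[] bytes = map.get(key);               // also marks the entry as recently used
        if (bytes == null) return null;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (V) in.readObject();            // pay the deserialization cost on every hit
        }
    }

    private void evictUntilWithinBudget() {
        Iterator<Map.Entry<K, byte[]>> it = map.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;   // evict least recently used first
            it.remove();
        }
    }

    private static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(value);
        }
        return bos.toByteArray();
    }
}
```

Usage would be something like new ByteBudgetLruCache<String, MyValue>(2L * 1024 * 1024 * 1024) for a roughly 2 GB budget; note that the budget counts serialized bytes only, not the map's own overhead.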
I would recommend using the Java Caching System (JCS). If you wanted to roll your own, I'm not aware of any way to get an object's size in memory; your best bet would be to extend AbstractMap and wrap the values in SoftReferences, then set the Java heap size to the maximum size you wanted. Your implementation would also have to find and clean out stale data, though. It's probably easier just to use JCS.
The problem with SoftReferences is that they give more work to the garbage collector. Although it doesn't meet your requirements, HBase has a very interesting strategy for preventing the cache from contributing to garbage collection pauses: they store the cache in native memory:
https://issues.apache.org/jira/browse/HBASE-4027
https://issues.apache.org/jira/secure/attachment/12488272/HBase-4027+%281%29.pdf
A good start for your use case would be to store all your data on disk. It might seem naive, but thanks to the OS I/O cache, frequently accessed data will reside in memory. I highly recommend reading these architecture notes from the Varnish caching system:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
The best practice I've found is to delegate the caching functionality outside of Java if possible. Java may be good at managing memory, but a dedicated caching system should be used for anything more than a simple LRU cache.
There is a large cost from GC when it kicks in.
EHCache is one of the more popular ones I know of; the Java Caching System from another answer is good as well.
However, I generally offload that work to an underlying layer (usually the JPA persistence layer provided by the application server); I let it be handled there so I don't have to deal with it in the application tier.
If you are caching other data such as web requests, http://hc.apache.org/httpclient-3.x/ is also a good candidate.
However, just remember that you also have a file system; there's absolutely nothing wrong with writing retrieved data to it. I've used that technique several times to fix out-of-memory errors caused by improper use of ByteArrayOutputStreams.
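For instance, a minimal sketch of spooling a stream to a temporary file instead of accumulating it in a ByteArrayOutputStream (the file name prefix is arbitrary):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToDisk {
    // Stream the data to a temp file and keep only the Path on the heap,
    // instead of holding the whole payload in a ByteArrayOutputStream.
    static Path spool(InputStream in) throws Exception {
        Path tmp = Files.createTempFile("retrieved-", ".bin");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;   // read it back later with Files.newInputStream(tmp)
    }
}
```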

Difference between "on-heap" and "off-heap"

Ehcache talks about on-heap and off-heap memory. What is the difference? What JVM args are used to configure them?
The on-heap store refers to objects that are present in the Java heap (and subject to GC). The off-heap store, on the other hand, refers to (serialized) objects that are managed by EHCache but stored outside the heap (and not subject to GC). As the off-heap store is still kept in memory, it is slightly slower than the on-heap store, but still faster than the disk store.
The internal details of how the off-heap store is managed and used aren't very evident from the link posted in the question, so it would be wise to check out the details of Terracotta BigMemory, which is used to manage the off-heap store. BigMemory (the off-heap store) is meant to avoid the overhead of GC on a heap that is several megabytes or gigabytes large. BigMemory uses the memory address space of the JVM process via direct ByteBuffers, which, unlike other native Java objects, are not subject to GC.
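Direct ByteBuffers are plain JDK API, so here is a tiny sketch of what "off-heap but inside the JVM process" means (the 256 MB figure is arbitrary):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // 256 MB outside the Java heap: the garbage collector never scans or moves it.
        // The allocation is still part of the JVM process and is capped by
        // -XX:MaxDirectMemorySize (which defaults to roughly the maximum heap size).
        ByteBuffer offHeap = ByteBuffer.allocateDirect(256 * 1024 * 1024);

        offHeap.putLong(0, 42L);                  // write at an absolute offset
        System.out.println(offHeap.getLong(0));   // read it back: 42
    }
}
```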
from http://code.google.com/p/fast-serialization/wiki/QuickStartHeapOff
What is heap offloading?
Usually all non-temporary objects you allocate are managed by Java's garbage collector. Although the VM does a decent job of garbage collection, at a certain point it has to do a so-called 'full GC'. A full GC involves scanning the complete allocated heap, which means GC pauses/slowdowns are proportional to an application's heap size. So don't trust anyone telling you 'memory is cheap': in Java, memory consumption hurts performance. Additionally, you may get notable pauses with heap sizes > 1 GB. This can be nasty if you have anything near-real-time going on; in a cluster or grid, a Java process might become unresponsive and get dropped from the cluster.
However, today's server applications (frequently built on top of bloated frameworks ;-) ) easily require heaps far beyond 4 GB.
One solution to these memory requirements is to 'offload' parts of the objects to the non-Java heap (memory allocated directly from the OS). Fortunately, java.nio provides classes to directly allocate, read and write 'unmanaged' chunks of memory (even memory-mapped files).
So one can allocate large amounts of 'unmanaged' memory and use it to store objects there. In order to store arbitrary objects in unmanaged memory, the most viable solution is serialization: the application serializes objects into the off-heap memory, and later reads them back using deserialization.
The heap size managed by the Java VM can be kept small, so GC pauses are in the millis, everybody is happy, job done.
It is clear that the performance of such an off-heap buffer depends mostly on the performance of the serialization implementation. Good news: for some reason FST serialization is pretty fast :-).
Sample usage scenarios:
Session cache in a server application: use a memory-mapped file to store gigabytes of (inactive) user sessions. Once a user logs into your application, you can quickly access the user-related data without having to hit a database.
Caching of computed results (queries, HTML pages, ...) (only applicable if the computation is slower than deserializing the result object, of course).
Very simple and fast persistence using memory-mapped files.
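A minimal sketch of the first scenario, serializing a value into a memory-mapped file and reading it back on demand (plain JDK serialization keeps the sketch self-contained; FST would be a faster drop-in, and the file name and 64 MB mapping size are arbitrary):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class MappedSessionStore {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Path.of("sessions.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            // Map 64 MB of the file; the OS pages it in and out, the Java heap stays small.
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64L * 1024 * 1024);

            // Serialize a session-like object into the mapped region, length-prefixed.
            Map<String, String> session = new HashMap<>();
            session.put("user", "alice");
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(session);
            }
            byte[] bytes = bos.toByteArray();
            region.putInt(bytes.length);
            region.put(bytes);

            // Read it back on demand: rewind, read the length, then deserialize.
            region.position(0);
            byte[] back = new byte[region.getInt()];
            region.get(back);
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(back))) {
                System.out.println(in.readObject());   // {user=alice}
            }
        }
    }
}
```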
Edit: For some scenarios one might choose more sophisticated garbage collection algorithms such as Concurrent Mark Sweep (CMS) or G1 to support larger heaps (but this also has its limits beyond roughly 16 GB heaps). There is also a commercial JVM (Azul) with an improved 'pauseless' GC available.
The heap is the place in memory where your dynamically allocated objects live. If you used new, then it's on the heap. That's as opposed to stack space, which is where the function call stack lives. If you have a local variable, then that reference is on the stack.
Java's heap is subject to garbage collection, and the objects in it are usable directly.
EHCache's off-heap storage takes your regular object off the heap, serializes it, and stores it as bytes in a chunk of memory that EHCache manages. It's like storing it to disk, but it's still in RAM. The objects are not directly usable in this state; they have to be deserialized first. They are also not subject to garbage collection.
I'm not 100% sure, but it sounds like the heap is a region of allocated space (in RAM) whose handling is built into the code, either Java itself or, more likely, ehcache's own functionality, while the off-heap RAM is their own separate store. It also sounds like the off-heap store is an order of magnitude slower, because it is not organized as a heap (one long, managed stretch of RAM) and instead uses different address spaces, which likely makes it somewhat less efficient.
Then of course the next tier down is hard-drive space itself.
I don't use ehcache, so you may not want to trust me, but this is what I gathered from their documentation.
The JVM doesn't know anything about off-heap memory. Ehcache implements an on-disk cache as well as an in-memory cache.

EhCache BigMemory vs Diskstore on RAM disk

How does the performance of BigMemory in Enterprise Ehcache compare to the Diskstore of Ehcache Community Edition used with a RAM disk?
BigMemory permits caches to use an additional type of memory store outside the object heap, thereby reducing the GC overhead we would have if all of the RAM were in the object heap. Serialization and deserialization take place when putting to and getting from this off-heap store.
Similarly, the Diskstore is also a second-level cache that stores serialized objects on disk.
In the link above it is mentioned that the off-heap store is two orders of magnitude faster than the Diskstore. What happens if I configure the Diskstore to store data on a RAM disk? Will BigMemory still have a noticeable performance benefit?
Are there other optimizations done by BigMemory? Has anyone come across experiments comparing the two approaches?
The following is an excerpt of the answer given to this question on the Terracotta forum:
"The three big problems I'd expect you to face with open source (community edition) Ehcache disk stores are: Firstly in open source only the values are stored on disk - the keys and the meta data to map keys to values is still stored in heap (which is not true for BigMemory). This means the heap would still be the limiting factor on cache size. Secondly the open source disk store is designed to be backed by a single (conventionally spinning disk - although some people do use SSD drives now), this means the backend is less concurrent (especially with regard to writing) than Enterprise BigMemory since the bottleneck is expected to be at the hardware level. Thirdly the serialization performed by the open source disk store is less space efficient so serialized values have much larger overheads."

Loading and analyzing massive amounts of data

So for some research work, I need to analyze a ton of raw movement data (currently almost a gig of data, and growing) and spit out quantitative information and plots.
I wrote most of it using Groovy (with JFreeChart for charting) and when performance became an issue, I rewrote the core parts in Java.
The problem is that analysis and plotting takes about a minute, whereas loading all of the data takes about 5-10 minutes. As you can imagine, this gets really annoying when I want to make small changes to plots and see the output.
I have a couple of ideas for fixing this:
Load all of the data into an SQLite database.
Pros: it'll be fast, and I'll be able to run SQL to get aggregate data if I need to.
Cons: I have to write all that code. Also, for some of the plots I need access to each data point, so with a couple hundred thousand files to load, some parts may still be slow.
Use Java RMI to return the object. All the data gets loaded into one root object, which, when serialized, is about 200 MB. I'm not sure how long it would take to transfer a 200 MB object through RMI (same client).
I'd have to run the server and load all the data, but that's not a big deal.
Major pro: this should take the least amount of time to write.
Run a server that loads the data and executes a Groovy script on command within the server VM. Overall, this seems like the best idea (for implementation time vs. performance, as well as other long-term benefits).
What I'd like to know is: have other people tackled this problem?
Post-analysis (3/29/2011): A couple of months after writing this question, I ended up having to learn R to run some statistics. Using R was far, far easier and faster for data analysis and aggregation than what I was doing.
Eventually, I ended up using Java to run preliminary aggregation and then ran everything else in R. R also made it much easier to produce beautiful charts than JFreeChart did.
Databases are very scalable if you are going to have massive amounts of data. In MS SQL we currently group/sum/filter about 30 GB of data in 4 minutes (somewhere around 17 million records, I think).
If the data is not going to grow very much, then I'd try out approach #2. You can make a simple test application that creates a 200-400 MB object with random data and test the performance of transferring it before deciding whether you want to go that route.
Before you make a decision, it's probably worth understanding what is going on with your JVM as well as your physical system resources.
There are several factors that could be at play here:
jvm heap size
garbage collection algorithms
how much physical memory you have
how you load the data - is it from a file that is fragmented all over the disk?
do you even need to load all of the data at once - can it be done in batches?
if you are doing it in batches you can vary the batch size and see what happens
if your system has multiple cores perhaps you could look at using more than one thread at a time to process/load data
if using multiple cores already and disk I/O is the bottleneck, perhaps you could try loading from different disks at the same time
You should also look at http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp if you aren't familiar with the settings for the VM.
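If you do go the batch/multi-thread route from the list above, here is a rough sketch of loading files in parallel while keeping only per-file aggregates on the heap (the "data" directory and the line-count summary are placeholders for your real layout and parsing):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder layout: one raw movement file per trace under ./data
        List<Path> files;
        try (Stream<Path> listing = Files.list(Path.of("data"))) {
            files = listing.collect(Collectors.toList());
        }

        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // One batch (here: one file) per task; keep only the aggregate, not the raw records.
        List<Future<Long>> pending = new ArrayList<>();
        for (Path file : files) {
            pending.add(pool.submit(() -> summarize(file)));
        }

        long total = 0;
        for (Future<Long> f : pending) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("aggregate = " + total);
    }

    // Stand-in for real per-file parsing/aggregation: just counts lines.
    static long summarize(Path file) throws Exception {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        }
    }
}
```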
If your data has relational properties, there is nothing more natural than storing it in an SQL database. There you can solve your biggest problem, performance, at the cost of "just" writing the appropriate SQL code.
Seems very straightforward to me.
I'd look into analysis using R. It's a statistical language with graphing capabilities. It could put you ahead, especially if that's the kind of analysis you intend to do. Why write all that code?
I would recommend running a profiler to see what part of the loading process is taking the most time and if there's a possible quick win optimization. You can download an evaluation license of JProfiler or YourKit.
Ah, yes: large data structures in Java. Good luck with that, surviving "death by garbage collection" and all. What Java seems to do best is wrapping a UI around some other processing engine, although it does free developers from most memory-management tasks, for a price. If it were me, I would most likely do the heavy crunching in Perl (having had to recode several chunks of a batch system in Perl instead of Java at a past job, for performance reasons), then spit the results back to your existing graphing code.
However, given your suggested choices, you probably want to go with the SQL DB route. Just make sure that it really is faster for a few sample queries, and watch the query-plan data and all that (assuming your system will log or interactively show such details).
Edit (to Jim Ferrans), re: Java big-N faster than Perl (comment below): the benchmarks you referenced are primarily small "arithmetic" loops, rather than something that does a few hundred MB of I/O and stores it in a Map / %hash / dictionary / associative array for later revisiting. Java I/O might have gotten better, but I suspect all the abstraction still makes it comparatively slow, and I know the GC is a killer. I haven't checked this lately; I don't process multi-GB data files on a daily basis at my current job like I used to.
Feeding the trolls (12/21): I measured Perl to be faster than Java at a batch of sequential string processing. In fact, depending on which machine I used, Perl was between 3 and 25 times faster than Java for this kind of work (batch + string). Of course, the particular thrash test I put together did not involve any numeric work, which I suspect Java would have done a bit better at, nor did it involve caching a lot of data in a Map/hash, which I suspect Perl would have done a bit better at. Note that Java did much better at using large numbers of threads, though.
