In my Spark program, I'm interested in allocating and using data that is not touched by Java's garbage collector. Basically, I want to do the memory management of such data myself, like you would in C++. Is this a good use case for off-heap memory? Secondly, how do you read and write off-heap memory in Java or Scala? I tried searching for examples, but couldn't find any.
Manual memory management is a viable optimization strategy for garbage-collected languages. Garbage collection is a known source of overhead, and algorithms can be tailored to minimize it. For example, when picking a hash table implementation one might prefer open addressing because it allocates its entries manually on the main array instead of handing them over to the language's memory allocator and its GC. As another example, here's a Trie searcher that packs the Trie into a single byte array in order to minimize the GC overhead. Similar optimization can be used for regular expressions.
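To make that concrete, here's a minimal sketch of the open-addressing idea (the class and method names are mine, and it omits resizing and deletion): an int-to-int map whose entries live entirely in two flat arrays, so no per-entry node objects are ever handed to the GC.

// Minimal open-addressing int-to-int map: all entries live in two flat arrays,
// so put() never allocates per-entry node objects for the GC to track.
// Illustrative sketch only: capacity must be a power of two, key 0 is reserved
// as "empty", and there is no resizing or deletion.
class IntIntOpenMap {
    private final int[] keys;
    private final int[] values;
    private final int mask;

    IntIntOpenMap(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    void put(int key, int value) {
        int i = (key * 0x9E3779B9) & mask;        // cheap hash, wrapped to capacity
        while (keys[i] != 0 && keys[i] != key) {  // linear probing
            i = (i + 1) & mask;
        }
        keys[i] = key;
        values[i] = value;
    }

    int get(int key, int missing) {
        int i = (key * 0x9E3779B9) & mask;
        while (keys[i] != 0) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return missing;
    }
}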
That kind of optimization, where Java arrays are used as low-level storage for the data, goes hand in hand with data-oriented design, where data is stored in arrays in order to achieve better cache locality. Data-oriented design is widely used in gamedev, where performance matters.
In JavaScript this kind of array-backed data storage is an important part of asm.js.
The array-backed approach is sufficiently supported by most garbage collectors used in the Java world, as they'll try to avoid moving the large arrays around.
If you want to dig deeper: on Linux you can create a file inside the "/dev/shm" filesystem. This filesystem is backed by RAM and won't be flushed to disk unless your operating system is out of memory. Memory-mapping such files (with FileChannel.map) is a good enough way to get off-heap memory directly from the operating system. (MappedByteBuffer operations are JIT-optimized down to direct memory accesses, apart from the bounds checks.)
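A rough sketch of that approach (the file name is made up, and sizing and error handling are simplified):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: map 64 MB of RAM-backed storage from /dev/shm. The mapped region
// lives outside the Java heap, so the GC never scans or moves it.
public class ShmExample {
    public static void main(String[] args) throws Exception {
        long size = 64L * 1024 * 1024;
        try (RandomAccessFile file = new RandomAccessFile("/dev/shm/my-off-heap-data", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.putLong(0, 42L);                 // write at byte offset 0
            System.out.println(buf.getLong(0));  // read it back: 42
        }
    }
}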
If you want to go even deeper, then you'll have to resort to JNI libraries in order to access the C-level memory allocator, malloc.
If you are not able to achieve "Efficiency with Algorithms, Performance with Data Structures", and efficiency and performance are that critical, you could consider using sun.misc.Unsafe. As the name suggests, it is unsafe!
Spark already uses it, as mentioned in Project Tungsten.
Also, you can start here to understand it better.
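To give a rough idea of what using it looks like, here's a hedged sketch (it grabs the well-known theUnsafe singleton via reflection; on newer JDKs you may need extra flags, and the API may change or disappear):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: allocate 1 KB of raw off-heap memory, write and read a long, then
// free it manually. Forgetting freeMemory() leaks; a bad address can crash the
// whole JVM - that is the "unsafe" part.
public class UnsafeExample {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory(1024);
        try {
            unsafe.putLong(address, 42L);
            System.out.println(unsafe.getLong(address));  // 42
        } finally {
            unsafe.freeMemory(address);
        }
    }
}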
Note: Spark provides a highly concurrent execution environment for applications, most likely spanning multiple JVMs across multiple machines, so manual memory management will be extremely complex. Fundamentally, Spark promotes re-computation over global shared memory. So, perhaps, you could store partially computed data/results in another store like HDFS/Kafka/Cassandra.
Have a look at ByteBuffer.allocateDirect(int bytes). You don't need to memory map files to make use of them.
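A minimal sketch of what that looks like:

import java.nio.ByteBuffer;

// Sketch: the backing memory of a direct buffer lives outside the Java heap;
// only the small ByteBuffer wrapper object itself is managed by the GC.
public class DirectBufferExample {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MB off-heap
        buf.putInt(0, 123);                                      // absolute put at byte offset 0
        System.out.println(buf.getInt(0));                       // prints 123
    }
}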
Off heap can be a good choice if the objects will stick there for a while (i.e. are reused). If you'll be allocating/deallocating them as you go, that's going to be slower.
Unsafe is cool but it's going to be removed. Probably in Java 9.
I need to store lots of data (Objects) in memory (for computations).
Since computations are done based on this data it is critical that all data will reside in the same JVM process memory.
Most data will be built from Strings, Integers and other sub-objects (Collections, HashSet, etc...).
Since the memory overhead of Java objects is significant (Strings are UTF-16, and each object has 8 bytes of overhead), I'm looking for libraries which enable storing such data in memory with lower overhead.
I've read interesting articles about reducing memory:
* http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf
* http://blog.griddynamics.com/2010/01/java-tricks-reducing-memory-consumption.html
I was just wondering if there is some library out there for such scenarios, or whether I'll need to start from scratch.
To understand my requirement better, imagine a server which processes a high volume of records and needs to analyze them against millions of other records that are kept in memory (for a high processing rate).
For collection overhead, have a look at Trove - its memory overhead is lower than that of the built-in collection classes (especially for maps and sets, which in the JDK are based on maps).
If you have large objects it might be worthwhile to save them "serialized" as some compact binary representation (not Java serialization) and deserialize them back into full-blown objects when needed. You could also use a cache library that can page out to disk - take a look at Infinispan or Ehcache. Also, some of those libraries (Ehcache among them, if memory serves) provide "off-heap storage" as part of your JVM process - a chunk of memory not subject to GC, managed by the (native) library. If you have an efficient binary representation you could store it there (it won't lower your footprint, but might make GC behave better).
For the String bit, you can store the byte[] you get from String.getBytes("UTF8"). If you need a String object again, you can then recreate it from the byte array. It will of course cost some extra CPU to create the String objects over and over again, so it's a trade-off between size and speed.
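A minimal sketch of that trade-off:

import java.nio.charset.StandardCharsets;

// Sketch: keep the UTF-8 bytes instead of the String (roughly half the size
// for ASCII-heavy text compared to a UTF-16 String), and rebuild on demand.
byte[] stored = "some long record text".getBytes(StandardCharsets.UTF_8);
// ... later, pay some CPU to materialise the String again:
String restored = new String(stored, StandardCharsets.UTF_8);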
Regarding strings, also look into the -XX:+UseCompressedStrings JVM option, but it looks like it has been dropped from the latest JVM updates; see this other question.
I'm writing a library where:
It will need to run on a wide range of different platforms / Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64 bit machines with Windows or Linux)
Achieving high performance is a priority, to the extent that I care about CPU cache line efficiency in object access
In some areas, quite large graphs of small objects will be traversed / processed (let's say around 1GB scale)
The main workload is almost exclusively reads
Reads will be scattered across the object graph, but not totally randomly (i.e. there will be significant hotspots, with occasional reads to less frequently accessed areas)
The object graph will be accessed concurrently (but not modified) by multiple threads. There is no locking, on the assumption that concurrent modification will not occur.
Are there some rules of thumb / guidelines for designing small objects so that they utilise CPU cache lines effectively in this kind of environment?
I'm particularly interested in sizing and structuring the objects correctly, so that e.g. the most commonly accessed fields fit in the first cache line etc.
Note: I am fully aware that this is implementation dependent, that I will need to benchmark, and of the general risks of premature optimization. No need to waste any further bandwidth pointing this out. :-)
A first step towards cache line efficiency is to provide for referential locality (i.e. keeping your data close together). This is hard to do in Java, where almost everything is allocated by the runtime and accessed by reference.
To avoid references, the following might be obvious:
* have non-reference types (i.e. int, char, etc.) as fields in your objects
* keep your objects in arrays
* keep your objects small
These rules will at least ensure some referential locality when working on a single object and when traversing the object references in your object graph.
Another approach might be to not use objects for your data at all, but to have global non-reference-typed arrays (of the same size) for each item that would normally be a field in your class; each instance is then identified by a common index into these arrays.
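A small sketch of that arrays-of-fields approach (the class and field names are made up):

// Sketch: instead of a Particle class with x, y, mass fields, keep one
// primitive array per field and address a "particle" by index. Values of the
// same field are then contiguous in memory, which helps cache behaviour when
// a loop only touches some of the fields.
class Particles {
    final double[] x;
    final double[] y;
    final double[] mass;

    Particles(int count) {
        x = new double[count];
        y = new double[count];
        mass = new double[count];
    }

    // Example traversal: only the x and y arrays are pulled through the cache.
    void translate(double dx, double dy) {
        for (int i = 0; i < x.length; i++) {
            x[i] += dx;
            y[i] += dy;
        }
    }
}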
Then, for optimizing the size of the arrays or chunks thereof, you have to know the MMU characteristics (page/cache size, number of cache lines, etc.). I don't know if Java provides this in the System or Runtime classes, but you could pass this information in as system properties on start-up.
Of course this is totally orthogonal to what you should normally be doing in Java :)
Best regards
You may require information about the various caches of your CPU; you can access it from Java using Cachesize (currently supporting Intel CPUs). This can help with developing cache-aware algorithms.
Disclaimer: author of the lib.
I'm filling up the JVM Heap Space.
Changing parameters to give more heap space to the JVM, or changing my algorithm so that it doesn't use so much space, are the two most commonly recommended options.
But, if those two have already been tried and applied, and I still get out of memory exceptions, I'd like to see what the other options are.
I found out about this example of "Using a memory mapped file for a huge matrix" and a library called HugeCollections which are an interesting way to solve my problem. Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one.
My question is, is there any other library doing this, or a good way of achieving it (having collection objects (lists and sets) memory mapped)?
You don't say what sort of collections you're using, or the way that you're using them, so it's hard to give recommendations. However, here are a few things to keep in mind:
Keeping the objects on the Java heap will always be the simplest option, and RAM is relatively cheap.
Blindly moving to memory-mapped data is very likely to give horrendous performance, especially if you're moving around in the file and/or making lots of changes. Hash-based collection types are the worst, as they work by distributing data. Tree-based collection types are generally a better choice, and linear collections can go both ways.
Once you move off-heap, you need a way to translate your objects to/from Java. Object serialization is the easiest, but adds lots of overhead. Binary objects accessed via byte buffers are usually a better choice (there's a sketch below), but you need to be thread-conscious.
You also have to manage your own garbage collection for off-heap objects. Not a problem if all you're doing is creating/updating, but quickly becomes a pain if you're deleting.
If you have a lot of data, and need to access that data in varied ways, a database is probably your best bet.
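To illustrate the byte-buffer point above, here's a hedged sketch of a fixed-width binary record accessed in place (the record fields and offsets are invented for illustration; the buffer could just as well be memory-mapped):

import java.nio.ByteBuffer;

// Sketch: each record is 20 bytes laid out as [id:long][price:double][qty:int],
// read and written at computed offsets instead of being deserialized into objects.
class TradeRecords {
    static final int RECORD_SIZE = 20;
    static final int ID_OFFSET = 0, PRICE_OFFSET = 8, QTY_OFFSET = 16;

    private final ByteBuffer buf;

    TradeRecords(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
    }

    void put(int index, long id, double price, int qty) {
        int base = index * RECORD_SIZE;
        buf.putLong(base + ID_OFFSET, id);
        buf.putDouble(base + PRICE_OFFSET, price);
        buf.putInt(base + QTY_OFFSET, qty);
    }

    double price(int index) {
        return buf.getDouble(index * RECORD_SIZE + PRICE_OFFSET);
    }
}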
"Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one." I agree - and I wrote it. ;)
I suggest you look at https://github.com/peter-lawrey/Java-Chronicle, which is higher performance and has been used a fair bit. It is really designed for List & Queue, but you could use it for a Map or Set with additional data structures.
Depending on your requirements, you could write your own library. E.g. for time-series data I wrote a different library, which unfortunately is not open source, but it can load tables of 500+ GB pretty cleanly.
"it's not in any Maven repo"
Neither is this one, but I would be happy for someone to add it.
Sounds like you're either having trouble with a memory leak, or trying to put too large an Object into memory.
Have you tried making a rough estimate of the amount of memory needed to load your data?
Assuming you have no memory leaks or other issues, and you really need so much storage that you can't fit it in the heap (which I find unlikely), you have basically only one option:
Don't put your data on the heap. Simple as that. Which method you use to move your data out is very dependent on your requirements (what kind of data, how frequently it is updated, and how much of it is there really?).
Note: You can use very large heaps with a 64-bit VM and if necessary enlarge the swap space of the OS. It may be the simplest solution to just brutally increase the maximum heap size (even if it means lots of swapping). I certainly would try that first in the situation you outlined.
Context is: Java EE 5.
I have a server running some huge app. I need to refactor the classes so that their memory footprint is as low as possible, in exchange for CPU time (of which there's plenty).
I already know of ways to use bit operations to stuff multiple booleans, shorts or bytes into an int (for example).
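Something along these lines (the field layout is just illustrative):

// Sketch: pack four flags and a 0-255 counter into one int instead of keeping
// several separate fields.
int packed = 0;
packed |= 1;             // flag A in bit 0
packed |= 1 << 1;        // flag B in bit 1
packed |= 42 << 8;       // counter in bits 8-15

boolean flagA = (packed & 1) != 0;
boolean flagB = (packed & (1 << 1)) != 0;
int counter = (packed >>> 8) & 0xFF;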
I'd like other optimization ideas from you, like what to do with Strings, which collections are better to use, and anything else that you happen to know.
Thx,
you guys rule!
This PDF about memory efficiency in Java might be of interest to you.
The standard collections in particular seem to be huge memory wasters. But the first step, before doing any micro-optimizations, would be to profile your application, create heap dumps, and analyze them.
A couple of things to consider:
* If you are done with an object and it will remain in scope, set it to null.
* Use StringBuilder (or StringBuffer if you need thread safety) instead of concatenating Strings.
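For instance, a small sketch of the StringBuilder point (assuming some java.util.List<String> names to join):

// Concatenating with + in a loop creates a new String (and a hidden builder)
// on every iteration; reusing one StringBuilder avoids that churn.
StringBuilder sb = new StringBuilder(names.size() * 16);  // pre-size if you can
for (String name : names) {
    sb.append(name).append(',');
}
String csv = sb.toString();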
However, if your memory usage is such an issue it may be an architectural problem with the code.
When writing a Java program, do I have influence on how the CPU will utilize its cache to store my data? For example, if I have an array that is accessed a lot, does it help if it's small enough to fit in one cache line (typically 128 bytes on a 64-bit machine)? What if I keep a much-used object within that limit - can I expect the memory used by its members to be close together and to stay in cache?
Background: I'm building a compressed digital tree that's heavily inspired by Judy arrays, which are implemented in C. While I'm mostly after its node compression techniques, Judy has CPU cache optimization as a central design goal, and the node types as well as the heuristics for switching between them are heavily influenced by that. I was wondering if I have any chance of getting those benefits, too?
Edit: The general advice of the answers so far is: don't try to micro-optimize machine-level details when you're as far away from the machine as you are in Java. I totally agree, so I felt I had to add some (hopefully) clarifying comments to better explain why I think the question still makes sense. These are below:
There are some things that are just generally easier for computers to handle because of the way they are built. I have seen Java code run noticeably faster on compressed data (from memory), even though the decompression had to use additional CPU cycles. If the data were stored on disk, it's obvious why that is so, but of course in RAM it's the same principle.
Now, computer science has lots to say about what those things are, for example, locality of reference is great in C and I guess it's still great in Java, maybe even more so, if it helps the optimizing runtime to do more clever things. But how you accomplish it might be very different. In C, I might write code that manages larger chunks of memory itself and uses adjacent pointers for related data.
In Java, I can't (and don't want to) know much about how memory is going to be managed by a particular runtime. So I have to take optimizations to a higher level of abstraction, too. My question is basically, how do I do that? For locality of reference, what does "close together" mean at the level of abstraction I'm working on in Java? Same object? Same type? Same array?
In general, I don't think that abstraction layers change the "laws of physics", metaphorically speaking. Doubling your array in size every time you run out of space is a good strategy in Java, too, even though you don't call malloc() anymore.
The key to good performance with Java is to write idiomatic code, rather than trying to outwit the JIT compiler. If you write your code to try to influence it to do things in a certain way at the native instruction level, you are more likely to shoot yourself in the foot.
That isn't to say that common principles like locality of reference don't matter. They do, but I would consider the use of arrays and such to be performance-aware, idiomatic code, but not "tricky."
HotSpot and other optimizing runtimes are extremely clever about how they optimize code for specific processors. (For an example, check out this discussion.) If I were an expert machine language programmer, I'd write machine language, not Java. And if I'm not, it would be unwise to think that I could do a better job of optimizing my code than the experts.
Also, even if you do know the best way to implement something for a particular CPU, the beauty of Java is write-once-run-anywhere. Clever tricks to "optimize" Java code tend to make optimization opportunities harder for the JIT to recognize. Straightforward code that adheres to common idioms is easier for an optimizer to recognize. So even when you get the best Java code for your testbed, that code might perform horribly on a different architecture, or at best, fail to take advantage of enhancements in future JITs.
If you want good performance, keep it simple. Teams of really smart people are working to make it fast.
If the data you're crunching is primarily or wholly made up of primitives (e.g. in numeric problems), I would advise the following.
Allocate a flat structure of fixed-size arrays-of-primitives at initialisation time, and make sure the data therein is periodically compacted/defragmented (0->n, where n is the smallest max index possible given your element count), to be iterated over using a for-loop. This is the only way to guarantee contiguous allocation in Java, and compaction further serves to improve locality of reference. Compaction is beneficial as it reduces the need to iterate over unused elements, reducing the number of conditionals: as the for-loop iterates, termination occurs earlier, and less iteration = less movement through the heap = fewer chances for a cache miss. While compaction creates an overhead in and of itself, this may be done only periodically (with respect to your primary areas of processing) if you so choose.
Even better, you can interleave values in these pre-allocated arrays. For instance, if you are representing spatial transforms for many thousands of entities in 2D space, and are processing the equations of motion for each such entity, you might have a tight loop like:
int axIdx, ayIdx, vxIdx, vyIdx, xIdx, yIdx;
//Acceleration, velocity, and displacement for each
//of x and y totals 6 elements per entity.
for (axIdx = 0; axIdx < array.length; axIdx += 6)
{
ayIdx = axIdx+1;
vxIdx = axIdx+2;
vyIdx = axIdx+3;
xIdx = axIdx+4;
yIdx = axIdx+5;
//velocity1 = velocity0 + acceleration
array[vxIdx] += array[axIdx];
array[vyIdx] += array[ayIdx];
//displacement1 = displacement0 + velocity
array[xIdx] += array[vxIdx];
array[yIdx] += array[vyIdx];
}
This example ignores such issues as rendering of those entities using their associated (x,y)... rendering always requires non-primitives (thus, references/pointers). If you do need such object instances, then you can no longer guarantee locality of reference, and will likely be jumping around all over the heap. So if you can split your code into sections where you have primitive-intensive processing as shown above, then this approach will help you a lot. For games at least, AI, dynamic terrain, and physics can be some of the most processor-intensive aspects, and are all numeric, so this approach can be very beneficial.
If you are down to where an improvement of a few percent makes a difference, use C where you'll get an improvement of 50-100%!
If you think that the ease of use of Java makes it a better language to use, then don't screw it up with questionable optimizations.
The good news is that Java will do a lot of stuff beneath the covers to improve your code at runtime, but it almost certainly won't do the kind of optimizations you're talking about.
If you decide to go with Java, just write your code as clearly as you can; don't take minor optimizations into account at all. (Major ones, like using the right collections for the right job and not allocating/freeing objects inside a loop, are still worthwhile.)
So far the advice is pretty strong: in general it's best not to try to outsmart the JIT. But as you say, some knowledge of the details is useful sometimes.
Regarding memory layout for objects, Sun's JVM (now Oracle's) lays out an object's fields in memory by type (i.e. doubles and longs first, then ints and floats, then shorts and chars, after that bytes and booleans, and finally object references). You can get more details here.
Local variables are usually kept on the stack (that is, references and primitive types).
As Nick mentions, the best way to control memory layout in Java is by using primitive arrays. That way you can make sure data is contiguous in memory. Be careful about array sizes though: GCs have trouble with very large arrays. It also has the downside that you have to do some memory management yourself.
On the upside, you can use a Flyweight pattern to get Object-like usability while keeping fast performance.
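A hedged sketch of that flyweight idea (the names are made up): one reusable cursor object gives field-style access into the backing primitive arrays without allocating an object per element.

// Sketch: data lives in flat primitive arrays; a reusable "view" is pointed at
// an index to read an element like an object, so no per-element instances exist.
class PointStore {
    final double[] xs;
    final double[] ys;

    PointStore(int count) {
        xs = new double[count];
        ys = new double[count];
    }

    // Flyweight cursor: create one per thread and repoint it as needed.
    class PointView {
        int index;
        double x() { return xs[index]; }
        double y() { return ys[index]; }
    }
}

Repointing a view at another element is just an int assignment, so the hot loop allocates nothing.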
If you need the extra oomph in performance, generating your own bytecode on the fly helps with some problems, as long as the generated code is executed enough times and your VM's native code cache doesn't get full (which disables the JIT for all practical purposes).
To the best of my knowledge: No. You pretty much have to be writing in machine code to get that level of optimization. With assembly you're a step away because you no longer control where things are stored. With a compiler you're two steps away because you don't even control the details of the generated code. With Java you're three steps away because there's a JVM interpreting your code on the fly.
I don't know of any constructs in Java that let you control things on that level of detail. In theory you could indirectly influence it by how you organize your program and data, but you're so far away that I don't see how you could do it reliably, or even know whether or not it was happening.