Optimising Java objects for CPU cache line efficiency - java

I'm writing a library where:
It will need to run on a wide range of different platforms / Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64 bit machines with Windows or Linux)
Achieving high performance is a priority, to the extent that I care about CPU cache line efficiency in object access
In some areas, quite large graphs of small objects will be traversed / processed (let's say around 1GB scale)
The main workload is almost exclusively reads
Reads will be scattered across the object graph, but not totally randomly (i.e. there will be significant hotspots, with occasional reads to less frequently accessed areas)
The object graph will be accessed concurrently (but not modified) by multiple threads. There is no locking, on the assumption that concurrent modification will not occur.
Are there some rules of thumb / guidelines for designing small objects so that they utilise CPU cache lines effectively in this kind of environment?
I'm particularly interested in sizing and structuring the objects correctly, so that e.g. the most commonly accessed fields fit in the first cache line etc.
Note: I am fully aware that this is implementation dependent, that I will need to benchmark, and of the general risks of premature optimization. No need to waste any further bandwidth pointing this out. :-)

A first step towards cache line efficiency is to provide for locality of reference (i.e. keeping related data close together in memory). This is hard to do in Java, where almost every object is allocated by the runtime and accessed by reference.
To avoid references, the following might be obvious:
have non-reference types (e.g. int, char) as fields in your objects
keep your objects in arrays
keep your objects small
These rules will at least ensure some referential locality when working on a single object and when traversing the object references in your object graph.
Another approach might be to not use objects for your data at all, but to have global arrays of primitive types (all of the same length), one for each item that would normally be a field in your class. Each instance is then identified by a common index into these arrays.
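A minimal sketch of the parallel-array idea described above, assuming a hypothetical tree-shaped graph whose nodes carry one int payload. The class name NodeStore and its fields are made up for illustration; node "references" are plain int indices into the arrays.

```java
// Hypothetical example: graph nodes stored as parallel primitive arrays
// instead of one small object per node. All names are illustrative.
final class NodeStore {
    private final int[] weight;      // payload of every node
    private final int[] firstChild;  // index of the first child, or -1
    private final int[] nextSibling; // index of the next sibling, or -1
    private int size;

    NodeStore(int capacity) {
        weight = new int[capacity];
        firstChild = new int[capacity];
        nextSibling = new int[capacity];
    }

    /** Adds a node and returns its index, which plays the role of a reference. */
    int add(int w) {
        int id = size++;
        weight[id] = w;
        firstChild[id] = -1;
        nextSibling[id] = -1;
        return id;
    }

    /** Links child as the new first child of parent. */
    void addChild(int parent, int child) {
        nextSibling[child] = firstChild[parent];
        firstChild[parent] = child;
    }

    /** Sums the weights of a node's direct children. */
    int sumChildWeights(int parent) {
        int sum = 0;
        for (int c = firstChild[parent]; c != -1; c = nextSibling[c]) {
            sum += weight[c];
        }
        return sum;
    }
}
```

Because the workload here is read-only after construction, such a store can be shared between threads without locking once it has been safely published. Whether it actually beats small objects is something only a benchmark on the target JVM can tell.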
Then, to optimize the size of the arrays or chunks thereof, you have to know the cache and memory characteristics of the machine (page size, cache size, cache line size, etc.). I don't know if Java provides this in the System or Runtime classes, but you could pass this information as system properties on start up.
Of course this is totally orthogonal to what you should normally be doing in Java :)
Best regards

You may need information about the various caches of your CPU; you can access it from Java using Cachesize (currently supporting Intel CPUs). This can help when developing cache-aware algorithms.
Disclaimer: I am the author of the library.

Related

Why does an array's memory in Java need to be divisible by 8 bytes?

While researching arrays in Java, I read that an array's memory usage consists of a 12-byte object header plus the storage for the elements of the particular data type. But if the total is not divisible by 8 bytes, padding needs to be added. Why is this? I tried to search for this but did not find an answer.
https://study.com/academy/lesson/java-arrays-memory-use-performance.html#:~:text=The%20memory%20allocation%20for%20an,a%20multiple%20of%208%20bytes.
This was the website where I read about this.
The details you’re specifying here are specific to particular implementations of the JVM and are not generally required to be true. For example, as a project many years back I put together an implementation of a JVM written purely in JavaScript, where the implementation relied on underlying JS primitives and therefore couldn’t control the exact sizes of the objects being created. I have no idea how big the underlying array objects I allocated in JavaScript to represent a single Java array were, and I certainly didn’t manually pad them. :-)
As @cpvmrd mentioned, a specific implementation of the JVM might decide to pad things this way either for performance reasons (to align objects to a particular boundary to make 64-bit loads fast) or for reasons of processor alignment (some operations either lead to performance degradation or to bus errors if data of size N are loaded from an address that isn’t a multiple of N). But that’s up to the implementation to sort out.
It is for performance.
The processor reads memory in "chunks" of a fixed size (a word). On a 64-bit CPU, memory is generally configured to return one 64-bit word per access. Intel CPUs can perform accesses that are not on a word boundary, but there is a performance penalty, as internally the CPU performs two memory accesses and a math operation to load one word.
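The padding arithmetic discussed above can be sketched as follows. This is only an estimate under stated assumptions: the header size is passed in as a parameter (the 12 bytes quoted in the question is one possible value; real header sizes depend on the JVM and its settings), and the 8-byte alignment is the figure from the question, not something the Java specification guarantees. The class name ArraySizeEstimate is hypothetical.

```java
// Hypothetical helper: estimates the padded size of an array under the
// assumptions stated in the question (fixed header, 8-byte alignment).
final class ArraySizeEstimate {
    /** Rounds n up to the next multiple of alignment, which must be a power of two. */
    static long alignUp(long n, long alignment) {
        return (n + alignment - 1) & ~(alignment - 1);
    }

    /** Header plus element storage, rounded up to a multiple of 8 bytes. */
    static long paddedArrayBytes(long headerBytes, long elementBytes, long length) {
        return alignUp(headerBytes + elementBytes * length, 8);
    }
}
```

With a 12-byte header, a one-element byte array needs 13 bytes and is rounded up to 16, while a three-element int array needs exactly 24 bytes and gets no padding.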

How is off heap memory read/written in Java?

In my Spark program, I'm interested in allocating and using data that is not touched by Java's garbage collector. Basically, I want to do the memory management of such data myself like you would do in C++. Is this a good use case for off-heap memory? Secondly, how do you read and write off-heap memory in Java or Scala? I tried searching for examples, but couldn't find any.
Manual memory management is a viable optimization strategy for garbage collected languages. Garbage collection is a known source of overhead and algorithms can be tailored to minimize it. For example, when picking a hash table implementation one might prefer Open Addressing because it allocates its entries manually on the main array instead of handing them over to the language's memory allocator and its GC. As another example, here's a Trie searcher that packs the Trie into a single byte array in order to minimize the GC overhead. Similar optimization can be used for regular expressions.
That kind of optimization, when the Java arrays are used as a low-level storage for the data, goes hand in hand with the Data-oriented design, where data is stored in arrays in order to achieve better cache locality. Data-oriented design is widely used in gamedev, where the performance matters.
In JavaScript this kind of array-backed data storage is an important part of asm.js.
The array-backed approach is sufficiently supported by most garbage collectors used in the Java world, as they'll try to avoid moving the large arrays around.
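As an illustration of the open-addressing idea mentioned above, here is a minimal sketch of an int-to-int map that keeps all entries in two primitive arrays, so no per-entry objects are created. It is deliberately simplified and rests on several assumptions of this sketch: the capacity is fixed and must be a power of two, the caller must keep the table less than full, Integer.MIN_VALUE is reserved as the "empty" marker, and there is no removal. The name IntIntOpenMap is hypothetical.

```java
import java.util.Arrays;

// Hypothetical fixed-capacity open-addressing map with linear probing.
final class IntIntOpenMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved key (assumption)
    private final int[] keys;
    private final int[] values;
    private final int mask;

    /** capacityPow2 must be a power of two and larger than the number of entries stored. */
    IntIntOpenMap(int capacityPow2) {
        keys = new int[capacityPow2];
        values = new int[capacityPow2];
        Arrays.fill(keys, EMPTY);
        mask = capacityPow2 - 1;
    }

    void put(int key, int value) {
        int i = slot(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probing: try the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    int getOrDefault(int key, int defaultValue) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return defaultValue;
    }

    private int slot(int key) {
        int h = key * 0x9E3779B9; // multiplicative scramble of the key bits
        return (h ^ (h >>> 16)) & mask;
    }
}
```

A production version would need resizing and deletion; the point here is only that the whole table is two arrays, which the garbage collector sees as two objects regardless of how many entries it holds.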
If you want to dig deeper, in Linux you can create a file inside the "/dev/shm" filesystem. This filesystem is backed by RAM and won't be flushed to disk unless your operating system is out of memory. Memory-mapping such files (with FileChannel.map) is a good enough way to get the off-heap memory directly from the operating system. (MappedByteBuffer operations are JIT-optimized to direct memory access, minus the boundary checks).
If you want to go even deeper, then you'll have to resort to JNI libraries in order to access the C-level memory allocator, malloc.
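The memory-mapping route mentioned above can be sketched roughly like this. The helper name MappedScratch is made up, and the demo maps a temporary file so that it runs anywhere; on Linux one would instead pass a path under /dev/shm, as the answer suggests.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps a file into memory and uses it as off-heap storage.
final class MappedScratch {
    /** Maps sizeBytes of the given file read-write; the file is created or extended as needed. */
    static MappedByteBuffer map(Path file, long sizeBytes) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("scratch", ".bin"); // stand-in for a /dev/shm path
        file.toFile().deleteOnExit();
        MappedByteBuffer buf = map(file, 1024);
        buf.putLong(0, 42L);    // absolute write at byte offset 0
        buf.putDouble(8, 1.5);  // absolute write at byte offset 8
        System.out.println(buf.getLong(0) + " " + buf.getDouble(8));
    }
}
```

The buffer contents live outside the Java heap, so the garbage collector never scans or moves them; only the small MappedByteBuffer object itself is on the heap.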
If you are not able to achieve "Efficiency with Algorithms, Performance with Data Structures", and if efficiency and performance are so critical, you could consider using "sun.misc.Unsafe". As the name suggests it is unsafe!!!
Spark is already using it as mentioned in project-tungsten.
Also, you can start here, to understand it better!!!
Note: Spark provides a highly concurrent environment for executing applications, with multiple JVMs most likely spread across multiple machines, so manual memory management will be extremely complex. Fundamentally, Spark promotes re-computation over global shared memory. So, perhaps, you could store partially computed data/results in another store like HDFS/Kafka/Cassandra!!!
Have a look at ByteBuffer.allocateDirect(int bytes). You don't need to memory map files to make use of them.
Off heap can be a good choice if the objects will stick there for a while (i.e. are reused). If you'll be allocating/deallocating them as you go, that's going to be slower.
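A minimal sketch of the ByteBuffer.allocateDirect suggestion above: a fixed-size array of longs held in a direct buffer that is allocated once and then reused. The wrapper class OffHeapLongs is hypothetical; only the ByteBuffer calls are standard JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical wrapper: a fixed-length array of longs stored off-heap.
final class OffHeapLongs {
    private final ByteBuffer buf;

    OffHeapLongs(int count) {
        // multiplyExact guards against int overflow for very large counts.
        buf = ByteBuffer.allocateDirect(Math.multiplyExact(count, Long.BYTES))
                        .order(ByteOrder.nativeOrder());
    }

    void set(int index, long value) {
        buf.putLong(index * Long.BYTES, value); // absolute put, bounds-checked
    }

    long get(int index) {
        return buf.getLong(index * Long.BYTES);
    }

    int length() {
        return buf.capacity() / Long.BYTES;
    }
}
```

Usage is the same as for a long[]: create it once with `new OffHeapLongs(1_000_000)`, then call `set` and `get` by index. The native memory is released only when the buffer object is garbage collected, which is one reason allocating such buffers in a hot loop tends to be slow.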
Unsafe is cool but it's going to be removed. Probably in Java 9.

Why is my JDBC call consuming 4 times more memory than the actual size of the data?

I wrote a small java program which loads data from DB2 database using simple JDBC call. I am using select query to get data and using java statement for this purpose. I have properly closed statement and connection objects. I am using 64 bit JVM for compilation and for running the program.
The query is returning 52 million records, each row having 24 columns, which takes me around 4 minutes to load complete data in Unix (having multiprocessor environment). I am using HashMap as data-structure to load the data: Map<String, Map<String, GridTradeStatus>>. The bean GridTradeStatus is a simple getter/setter bean with 24 properties in it.
The memory required for the program is alarmingly high. Java heap size goes up to 5.8 - 6GB to load complete data while actual used heap size remains between 4.7 - 4.9GB. I know that we should not load this much data into memory but my business requirements are in that way only.
The question is this: when I put the whole data of my table in a flat file, it comes out to roughly 1.2GB. I want to know why my Java program is consuming 4 times more memory than the actual size of the data.
There is nothing surprising here (to me at least).
a.) Strings in java consume double the space compared to most common text formats (because Strings are always represented as UTF-16 in the heap). Also, String as an object has quite some overhead (String object itself, reference to the char[] it contains, hashCode etc.). For small strings the String object costs easily as much memory as the data it contains.
b.) You put stuff into a HashMap. HashMap is not exactly memory efficient. First it uses a default load factor of 75%, which means a map with many entries has also a big bucket array. Then, each entry in the map is an object itself, which costs at least two references (key and value) plus object overhead.
In conclusion you pretty much have to expect the memory requirements to increase quite a bit. A factor of 4 is reasonable if your average data String is relatively short.
If you think you cannot afford a ratio of 1:4 between the size of the data in a flat file and the memory necessary to load the Strings into a HashMap, you should consider not using Java but a lower-level language such as C++ or even C.
Of course there are possible optimizations :
use byte[] instead of String (about half the size)
do not use default HashMap parameters (initial size / load factor) but tweak them to meet your actual requirements.
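The two optimizations listed above might look like the following sketch. The class and method names are hypothetical. Note one caveat that is an addition of this sketch rather than part of the answer: a byte[] compares by identity, so it works as a stored value but not directly as a HashMap key.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helpers for loading large data sets more compactly.
final class CompactLoading {
    /** Creates a map sized so that expectedEntries fit without rehashing. */
    static <K, V> Map<K, V> presized(int expectedEntries, float loadFactor) {
        int initialCapacity = (int) Math.ceil(expectedEntries / (double) loadFactor);
        return new HashMap<>(initialCapacity, loadFactor);
    }

    /** Encodes text as UTF-8 bytes; ASCII-only data takes one byte per character. */
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes produced by encode back into a String when needed. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Presizing avoids the repeated bucket-array growth and rehashing that happens when tens of millions of entries are put into a map created with the defaults; the per-entry object overhead described in point b.) remains.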
What follows is mainly based on experience and opinion. I generally use 4 language levels:
high-level scripting language (Python, Ruby, or even bash ...) when performance is not a requirement and speed of development is
mid-level language (Java, less frequently high-level C++) when performance matters but I also want simplicity of development and robustness (strong typing, ...)
low-level language (low-level C++, or C) when performance is a high requirement and I accept spending much more time writing and testing individual modules
assembly language for the small parts where performance is critical and has been proved to be so by profiling.
IMHO you can tweak Java code to greatly reduce the memory footprint, but you risk losing much of what makes Java attractive by giving up its excellent string and collections support. It might be as easy, and perhaps more efficient, to code a small part of the application in C++ and use JNI to tie it all together.

Memory footprint minimization in Java EE 5, for classes, primitive data types and Strings

Context is: Java EE 5.
I have a server running some huge app. I need to refactor the classes, so that their memory footprint is low (towards lowest possible), in exchange for CPU time (of which there's plenty).
I already know of ways to use bit operations to stuff multiple booleans, shorts or bytes into an int (for example).
I need other optimization ideas from you, such as what to do with Strings, which collections are better to use, and anything else that you happen to know.
Thx,
you guys rule!
This pdf about memory efficiency in java might be of interest to you.
Especially the standard collections seem to be huge memory wasters. But the first step before doing any micro-optimizations would be to profile your application, create heap dumps and analyze these.
A couple of things to consider
If you are done with an object but the reference to it will remain in scope, set the reference to null
Use StringBuilder (or StringBuffer if you need thread safety) instead of concatenating Strings.
However, if your memory usage is such an issue it may be an architectural problem with the code.

What can I do in Java code to optimize for CPU caching?

When writing a Java program, do I have influence on how the CPU will utilize its cache to store my data? For example, if I have an array that is accessed a lot, does it help if it's small enough to fit in one cache line (typically 64 bytes on current x86-64 CPUs)? What if I keep a much-used object within that limit: can I expect the memory used by its members to be close together and to stay in cache?
Background: I'm building a compressed digital tree, that's heavily inspired by the Judy arrays, which are in C. While I'm mostly after its node compression techniques, Judy has CPU cache optimization as a central design goal and the node types as well as the heuristics for switching between them are heavily influenced by that. I was wondering if I have any chance of getting those benefits, too?
Edit: The general advice of the answers so far is, don't try to microoptimize machine-level details when you're so far away from the machine as you're in Java. I totally agree, so felt I had to add some (hopefully) clarifying comments, to better explain why I think the question still makes sense. These are below:
There are some things that are just generally easier for computers to handle because of the way they are built. I have seen Java code run noticeably faster on compressed data (from memory), even though the decompression had to use additional CPU cycles. If the data were stored on disk, it's obvious why that is so, but of course in RAM it's the same principle.
Now, computer science has lots to say about what those things are, for example, locality of reference is great in C and I guess it's still great in Java, maybe even more so, if it helps the optimizing runtime to do more clever things. But how you accomplish it might be very different. In C, I might write code that manages larger chunks of memory itself and uses adjacent pointers for related data.
In Java, I can't (and don't want to) know much about how memory is going to be managed by a particular runtime. So I have to take optimizations to a higher level of abstraction, too. My question is basically, how do I do that? For locality of reference, what does "close together" mean at the level of abstraction I'm working on in Java? Same object? Same type? Same array?
In general, I don't think that abstraction layers change the "laws of physics", metaphorically speaking. Doubling your array in size every time you run out of space is a good strategy in Java, too, even though you don't call malloc() anymore.
The key to good performance with Java is to write idiomatic code, rather than trying to outwit the JIT compiler. If you write your code to try to influence it to do things in a certain way at the native instruction level, you are more likely to shoot yourself in the foot.
That isn't to say that common principles like locality of reference don't matter. They do, but I would consider the use of arrays and such to be performance-aware, idiomatic code, but not "tricky."
HotSpot and other optimizing runtimes are extremely clever about how they optimize code for specific processors. (For an example, check out this discussion.) If I were an expert machine language programmer, I'd write machine language, not Java. And if I'm not, it would be unwise to think that I could do a better job of optimizing my code than the experts.
Also, even if you do know the best way to implement something for a particular CPU, the beauty of Java is write-once-run-anywhere. Clever tricks to "optimize" Java code tend to make optimization opportunities harder for the JIT to recognize. Straightforward code that adheres to common idioms is easier for an optimizer to recognize. So even when you get the best Java code for your testbed, that code might perform horribly on a different architecture, or at best, fail to take advantage of enhancements in future JITs.
If you want good performance, keep it simple. Teams of really smart people are working to make it fast.
If the data you're crunching is primarily or wholly made up of primitives (eg. in numeric problems), I would advise the following.
Allocate a flat structure of fixed-size arrays-of-primitives at initialisation-time, and make sure the data therein is periodically compacted/defragmented (0->n where n is the smallest max index possible given your element count), to be iterated over using a for-loop. This is the only way to guarantee contiguous allocation in Java, and compaction further serves to improve locality of reference. Compaction is beneficial, as it reduces the need to iterate over unused elements, reducing the number of conditionals: as the for loop iterates, the termination occurs earlier, and less iteration = less movement through the heap = fewer chances for a cache miss. While compaction creates an overhead in and of itself, this may be done only periodically (with respect to your primary areas of processing) if you so choose.
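A compaction pass of the kind described above could look like this sketch. It assumes, purely for illustration, that liveness is tracked in a parallel boolean array; the names Compactor, values and alive are hypothetical.

```java
import java.util.Arrays;

// Hypothetical compaction pass over a pre-allocated primitive array.
final class Compactor {
    /**
     * Moves live entries to the front of values, preserving their order,
     * and returns the new element count. alive[i] says whether slot i is in use.
     */
    static int compact(double[] values, boolean[] alive, int count) {
        int write = 0;
        for (int read = 0; read < count; read++) {
            if (alive[read]) {
                values[write] = values[read];
                alive[write] = true;
                write++;
            }
        }
        // Everything from write up to the old count is now unused.
        Arrays.fill(alive, write, count, false);
        return write;
    }
}
```

After a pass, the processing loop only needs to run from 0 to the returned count and no longer has to test a liveness flag per element.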
Even better, you can interleave values in these pre-allocated arrays. For instance, if you are representing spatial transforms for many thousands of entities in 2D space, and are processing the equations of motion for each such, you might have a tight loop like
    // array is a double[] (or float[]) holding 6 interleaved values per entity:
    // ax, ay, vx, vy, x, y.
    int axIdx, ayIdx, vxIdx, vyIdx, xIdx, yIdx;
    // Acceleration, velocity, and displacement for each
    // of x and y totals 6 elements per entity.
    for (axIdx = 0; axIdx < array.length; axIdx += 6)
    {
        ayIdx = axIdx + 1;
        vxIdx = axIdx + 2;
        vyIdx = axIdx + 3;
        xIdx = axIdx + 4;
        yIdx = axIdx + 5;
        // velocity1 = velocity0 + acceleration
        array[vxIdx] += array[axIdx];
        array[vyIdx] += array[ayIdx];
        // displacement1 = displacement0 + velocity
        array[xIdx] += array[vxIdx];
        array[yIdx] += array[vyIdx];
    }
This example ignores such issues as rendering of those entities using their associated (x,y)... rendering always requires non-primitives (thus, references/pointers). If you do need such object instances, then you can no longer guarantee locality of reference, and will likely be jumping around all over the heap. So if you can split your code into sections where you have primitive-intensive processing as shown above, then this approach will help you a lot. For games at least, AI, dynamic terrain, and physics can be some of the most processor-intensive aspects, and are all numeric, so this approach can be very beneficial.
If you are down to where an improvement of a few percent makes a difference, use C where you'll get an improvement of 50-100%!
If you think that the ease of use of Java makes it a better language to use, then don't screw it up with questionable optimizations.
The good news is that Java will do a lot of stuff beneath the covers to improve your code at runtime, but it almost certainly won't do the kind of optimizations you're talking about.
If you decide to go with Java, just write your code as clearly as you can, and don't take minor optimizations into account at all. (Major ones like using the right collections for the right job, not allocating/freeing objects inside a loop, etc. are still worthwhile.)
So far the advice is pretty strong, in general it's best not to try and outsmart the JIT. But as you say some knowledge about the details is useful sometimes.
Regarding memory layout for objects, Sun's JVM (now Oracle's) lays out an object's fields in memory by type (i.e. doubles and longs first, then ints and floats, then shorts and chars, after that bytes and booleans and finally object references). You can get more details here.
Local variables (that is, references and primitive types) are usually kept on the stack.
As Nick mentions, the best way to ensure the memory layout in Java is by using primitive arrays. That way you can make sure that data is contiguous in memory. Be careful about array sizes though, GCs have trouble with large arrays. It also has the downside that you have to do some memory management yourself.
On the upside, you can use a Flyweight pattern to get Object-like usability while keeping fast performance.
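One way to read the Flyweight suggestion above is a small reusable "cursor" object that exposes object-like accessors over a primitive array. The following sketch assumes 2D points stored interleaved in a double[]; the class name PointCursor is hypothetical.

```java
// Hypothetical flyweight: one reusable view over interleaved (x, y) pairs.
final class PointCursor {
    private final double[] data; // layout: x0, y0, x1, y1, ...
    private int base;            // offset of the current point in data

    PointCursor(double[] data) {
        this.data = data;
    }

    /** Repositions this cursor on the point at the given index and returns it. */
    PointCursor at(int index) {
        base = index * 2;
        return this;
    }

    double x() { return data[base]; }

    double y() { return data[base + 1]; }

    void set(double x, double y) {
        data[base] = x;
        data[base + 1] = y;
    }
}
```

A single cursor is created once and moved with `at(i)` inside a loop, so no per-point objects are allocated. The cursor holds mutable position state, so each thread would need its own instance even if the underlying array is shared.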
If you need the extra oomph in performance, generating your own bytecode on the fly helps with some problems, as long as the generated code is executed enough times and your VM's native code cache doesn't get full (which disables the JIT for all practical purposes).
To the best of my knowledge: No. You pretty much have to be writing in machine code to get that level of optimization. With assembly you're a step away because you no longer control where things are stored. With a compiler you're two steps away because you don't even control the details of the generated code. With Java you're three steps away because there's a JVM interpreting your code on the fly.
I don't know of any constructs in Java that let you control things on that level of detail. In theory you could indirectly influence it by how you organize your program and data, but you're so far away that I don't see how you could do it reliably, or even know whether or not it was happening.
