Data alignment vs. cache locality - java

From memory, data can only be read in the natural word size of the architecture. For example, on a 32-bit system, data is read from memory in 4-byte chunks. If a 2-byte or 1-byte value is stored in memory, reading it will still require accessing a 4-byte word. (In the case of the 2-byte value, two 4-byte accesses might be required if the value straddles a word boundary.)
Therefore, access to an individual value is fastest when it requires touching a single word and minimal additional work (such as masking). If I'm correct, this is the reason virtual machines (such as the JVM or Android's Dalvik) lay out member variables on 4-byte boundaries inside object instances.
Another concept is cache friendliness, i.e. locality (e.g. L1, L2). If many values must be traversed/processed directly after each other, it is beneficial that they are stored close to each other (ideally, in a contiguous block). This is spatial locality. If this isn't possible, at least the operations on the same value should be done within the same time period (temporal locality -- i.e. there is a high chance that the value is still in cache while the operations are performed on it).
As far as I can see, the above two concepts can be "contradictory" in some cases, and the choice between them depends on their usage scenario. For example, a smaller amount of contiguous data is more cache friendly than a greater amount (trivial), yet if random access is commonly required on some data, the word-aligned (but greater-sized) structure might be beneficial -- unless the whole structure fits in the cache. Therefore, whether locality (~arrays) or alignment benefits should be preferred depends on how the values will be manipulated, I think.
There is a scenario which is interesting for me: let's assume a pathfinding algorithm which receives the input graph (and other auxiliary structures) as arrays. (Most of its input arrays store values that are <= 32767.)
The pathfinding algorithm performs very many random accesses on the arrays (in several loops). In this sense, an int[] might be desired for the input data (on Android/ARM), because the values will be on a word boundary when accessed. (On the other hand, if sequential traversals were needed, then a smaller datatype would be recommended -- especially for large arrays -- because of a higher probability of cache-friendliness.)
However, what if the (randomly accessed) input data would fit in L1/L2 as a short[], but not as an int[]? In such a case, would the advantage of the 4-byte alignment of int[] for random access be outweighed by the cache-friendliness of short[]?
In a concrete application, of course, I'd make measurements for comparison. That wouldn't necessarily answer the above questions, however.

If you can ensure that moving to short leads to significantly better locality (i.e. everything fits in cache), this outweighs the alignment penalties.
Access to cache is in the low nanoseconds (<10 ns); access to RAM is around 60-80 ns.
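For a rough check on a concrete device, here is a minimal (non-JMH) measurement sketch of the int[] vs. short[] random-access comparison discussed above; the array size, access pattern and class name are arbitrary choices, and a proper benchmark harness would be needed for reliable numbers:
import java.util.Random;

// Rough comparison of random access over int[] vs. short[] with the same element count.
// Not a rigorous benchmark (no warmup control, single run); treat the output as indicative only.
public class RandomAccessProbe {
    public static void main(String[] args) {
        final int n = 4 << 20;                    // 4M elements
        int[] asInts = new int[n];
        short[] asShorts = new short[n];
        int[] order = new int[n];                 // pre-generated random access pattern

        Random rnd = new Random(1);
        for (int i = 0; i < n; i++) {
            short v = (short) rnd.nextInt(32768); // values <= 32767, as in the question
            asInts[i] = v;
            asShorts[i] = v;
            order[i] = rnd.nextInt(n);
        }

        long sum = 0, t0 = System.nanoTime();
        for (int i : order) sum += asInts[i];
        long intNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i : order) sum += asShorts[i];
        long shortNanos = System.nanoTime() - t0;

        System.out.println("int[]   random sum took " + intNanos + " ns");
        System.out.println("short[] random sum took " + shortNanos + " ns (checksum " + sum + ")");
    }
}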

Related

Growable multidimensional data structure supporting range queries

Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense/help achieving the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement that in Java any answers should take that into account (e.g. no pointer-magic, assume 8-byte references etc.).
The situation
We have somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not considering the size of the values themselves, the space needed to hold the references alone would be about 80 megabytes (10 million 8-byte references), so that would be a lower bound for memory consumption.
We can further assume that the structure won't be completely filled, but will be quite dense.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep, and we'd either have to accept slower queries or take rebalancing measures. Additionally, the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that, though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
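For reference, a minimal sketch of this nested-map layout, assuming String keys for illustration (a real implementation would use the actual dimension key types):
import java.util.HashMap;
import java.util.Map;

// Nested-map ("map tree") layout as described above: one map level per dimension.
final class NestedMapStore<V> {
    private final Map<String, Map<String, Map<String, Map<String, V>>>> root = new HashMap<>();

    void put(String a, String b, String c, String d, V value) {
        root.computeIfAbsent(a, k -> new HashMap<>())
            .computeIfAbsent(b, k -> new HashMap<>())
            .computeIfAbsent(c, k -> new HashMap<>())
            .put(d, value);
    }

    V get(String a, String b, String c, String d) {
        Map<String, Map<String, Map<String, V>>> levelB = root.get(a);
        if (levelB == null) return null;
        Map<String, Map<String, V>> levelC = levelB.get(b);
        if (levelC == null) return null;
        Map<String, V> levelD = levelC.get(c);
        if (levelD == null) return null;
        return levelD.get(d);
    }
}
Every level allocates its own HashMap nodes, which is where the higher memory footprint mentioned above comes from.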
A single large map
An alternative would be to keep the sets for individual available keys and use a single large map instead.
Construction and queries were fast as well but memory consumption was even higher due to each map node being larger (they need to hold all dimensions of a key now).
What we're thinking of at the moment
Building insertion-order index-maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (flattened to a 1-d array of course) - aka n-ary tree. That tree would grow on demand, i.e. if we need a new array then instead of creating a larger one and copying all the data we'd just create the new block. Any needed non-leaf nodes would be created on demand, replacing the root if needed.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
        [x,null,x,null]
       /               \
[A1/B1,null,A2/B2,null]  [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low depending on the size of each array (having 4 dimensions and 4 values per dimension in an array we'd have arrays of length 256 and thus get a maximum tree depth of 2-4 in most cases).
Does this make sense?
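As a concrete sketch of the insertion-order index map from the proposal above (the class name is illustrative):
import java.util.HashMap;
import java.util.Map;

// Maps each incoming dimension key to the next free dense index,
// so indices grow one step at a time without gaps.
final class IndexMap<K> {
    private final Map<K, Integer> indices = new HashMap<>();

    int indexOf(K key) {
        Integer index = indices.get(key);
        if (index == null) {
            index = indices.size();   // next dense index
            indices.put(key, index);
        }
        return index;
    }
}
A repeated key returns its existing index, so the indices stay dense, which is what the n-ary tree of flattened arrays relies on.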
If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next are four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys, so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
One thing I'm not clear on is whether you want a key to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should construct the dictionaries statically. I guess it depends on whether your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.
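For what it's worth, here is a rough Java transliteration of this flat-array scheme, using the dimension sizes from the question; String keys and all names are illustrative, and a real implementation should validate unknown keys on reads instead of silently assigning new indices:
import java.util.HashMap;
import java.util.Map;

// Flat array addressed via four per-dimension key->index maps.
// Dimension sizes follow the question: A=100, B=5, C=10000, D=2.
final class FlatStructure<V> {
    private static final int A = 100, B = 5, C = 10_000, D = 2;

    private final Object[] cells = new Object[A * B * C * D];   // 10M references, ~80 MB
    private final Map<String, Integer> aIdx = new HashMap<>();
    private final Map<String, Integer> bIdx = new HashMap<>();
    private final Map<String, Integer> cIdx = new HashMap<>();
    private final Map<String, Integer> dIdx = new HashMap<>();

    void put(String a, String b, String c, String d, V value) {
        cells[offset(a, b, c, d)] = value;
    }

    @SuppressWarnings("unchecked")
    V get(String a, String b, String c, String d) {
        return (V) cells[offset(a, b, c, d)];
    }

    // Same math as above: DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex))
    private int offset(String a, String b, String c, String d) {
        return index(dIdx, d)
                + D * (index(cIdx, c)
                + C * (index(bIdx, b)
                + B * index(aIdx, a)));
    }

    private static int index(Map<String, Integer> dict, String key) {
        Integer i = dict.get(key);
        if (i == null) {
            i = dict.size();   // assign the next free index for this dimension
            dict.put(key, i);
        }
        return i;
    }
}
Range queries then reduce to index arithmetic over contiguous slices of the cells array.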
There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*-Trees (like the R-Tree, but much faster for updates) or the PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient; it scales better with dimensions, and its depth is limited by the maximum bit width of the values, i.e. a maximum value of 10000 requires 14 bits, so the depth will not exceed 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because such a query will on average return 33%-50% of the whole tree anyway.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the differing selectivity of your dimensions, which may result in something close to a full scan of the tree. For example, when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (the one with the least selectivity).
To avoid this, especially for the PH-Tree, I would try to multiply your values by a fixed constant. For example multiply A by 100, B by 2000, C by 1 and D by 5000. This allows all values to range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).

Why is the minimum granularity defined as 8192 in Java 8 in order to switch from parallel sort to Arrays.sort regardless of the type of data

I was going through the concepts of parallel sort introduced in Java 8.
As per the doc.
If the length of the specified array is less than the minimum
granularity, then it is sorted using the appropriate Arrays.sort
method.
The spec however doesn't specify this minimum limit.
When I looked up the code in java.util.Arrays, it was defined as
private static final int MIN_ARRAY_SORT_GRAN = 1 << 13;
i.e., 8192 values in the array
As per the explanation provided here.
I understand why the value was Hard-coded as 8192.
It was designed keeping the current CPU architecture in mind.
With the -XX:+UseCompressedOops option being enabled by default, any
system with less than 32GB RAM would be using 32bit(4bytes) pointers.
Now, with a L1 Cache size of 32KB for data portion, we can pass
32KB/4Bytes = 8KB of data at once to CPU for computation. That's
equivalent to 8192 bytes of data being processed at once.
So for a function that sorts a byte array, parallelSort(byte[]), this makes sense: you can keep the minimum parallel sort limit at 8192 values (each value = 1 byte for a byte array).
But If you consider public static void parallelSort(int[] a)
An int variable is 4 bytes (32 bits). So ideally, of those 8192 bytes, we can store 8192/4 = 2048 numbers in the CPU cache at once.
So the minimum granularity in this case is supposed to be 2048.
Why are all parallelSort functions in Java (be it byte[], int[], long[], etc.) using 8192 as the default min. number of values needed in order to perform parallel sorting?
Shouldn't it vary according to the types passed to the parallelSort function?
First, it seems that you've misread the linked explanation. The L1 data cache is 32 KB, so int[] fits it ideally: 32768/4 = 8192 ints can be placed into the L1 cache at once.
Second, I don't think the given explanation is correct. It concentrates on pointers, so it mainly speaks about sorting object arrays, but when you compare the data in an object array, you always need to dereference these pointers to access the real data. And if your objects have non-primitive fields, you'll have to dereference them even further. For example, if you sort an array of strings, you have to access not only the array itself, but also the String objects and the char[] arrays stored inside them. All of this requires many additional cache lines.
I did not find any explicit explanation of this particular value in the review thread for this change. Previously it was 256; it was then changed to 8192 as part of the JDK-8014076 update. I think it simply showed the best performance on some reasonable test suite. Keeping separate thresholds for different cases would add more complexity, and the tests probably showed that it doesn't pay off. Note that an ideal threshold is impossible for Object[] arrays anyway, as the compare function is user-specified and can have arbitrary complexity. For a sufficiently complex compare function it's probably reasonable to parallelize even very small arrays.
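For reference, the constant only changes what Arrays.parallelSort does internally; the call site looks the same either way. A small usage sketch (array lengths chosen arbitrarily on either side of the 8192 threshold):
import java.util.Arrays;
import java.util.Random;

public class ParallelSortDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42);

        // Below the minimum granularity, parallelSort is expected to fall back
        // to the sequential dual-pivot sort, regardless of the element type.
        int[] small = rnd.ints(8_000).toArray();
        Arrays.parallelSort(small);

        // Well above the granularity, the array is split into fork/join subtasks
        // on the common pool (assuming the pool has more than one worker).
        int[] large = rnd.ints(1_000_000).toArray();
        Arrays.parallelSort(large);

        System.out.println("sorted: " + (small[0] <= small[small.length - 1])
                + ", " + (large[0] <= large[large.length - 1]));
    }
}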

Serialization: Converting bytes to bytes?

The object itself is a sequence of bytes, and that is how the machine understands all data, whether it's an object, text, images, etc. Could you clarify why we are converting a sequence of bytes (an object) into another sequence of bytes? Do we restructure the bytes when we serialize, or build a template that holds this object to give it a special meaning when it is transmitted over the network? Suppose a certain method takes the object from memory as it is and puts it into IP datagrams to send through the network; what issues may arise?
First: compression.
You must understand that an image file on disk and the image rendered in memory are not the same. On disk, images are usually compressed (forget about BMP). With current network throughput and HDD capacities, compression is essential.
Second: architecture.
A number in memory is just a sequence of bits, yes. But how many bits count as one number? 8? 16? 32? 64? Any of them. There are bytes, words, integers, longs, floats (hell, floats!) and a couple of dozen other types. Byte order also matters: so-called big-endian and little-endian. So the raw bytes of 123456789 written by one machine may not be read back as the same number on a machine with a different byte order.
Third: file (read: transmission) format != object-in-memory format.
Well, there is a difference between the data structure in a file (or network packet) and the object loaded from that file into memory. Additionally, the in-memory structure of an object can differ from program version to version; an image loaded into memory on Windows 3.1 and on, for example, Vista is a hugely different thing. Also think of structure packing and alignment on 4-, 8-, 16- or 32-bit boundaries, etc.
The object itself includes many references, which are pointers to where another component of the object happens to exist in memory on this particular machine at this particular moment. The point of serialization is that it converts objects into bytes that can be read at some other time, possibly on some other machine.
Additionally, object representations in memory are optimized for fast access and modification, not necessarily taking the minimum number of bytes. Some serialization protocols, especially for use in RPCs or data storage, optimize for how many bytes have to be transmitted or stored using compression algorithms that make it more difficult to access or modify the properties of the object in exchange for using fewer bytes to do it.
The object itself is a sequence of bytes
No. The object itself isn't just a 'sequence of bytes', unless it contains nothing but primitive data. It can contain
references to other objects
those objects may already have been serialized, in which case a back-reference needs to be serialized, not the referenced object all over again
those references may be null
there may be no object at all, just primitive data
All these things increase the complexity of the task well beyond the naive notion of just serializing 'a sequence of bytes'.
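To make the contrast concrete, here is a minimal Java serialization round trip; the classes and field values are made up for illustration. In memory, Person holds a reference (effectively a pointer) to its Address, while the serialized byte[] encodes the whole graph structurally so another JVM can rebuild it:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Address implements Serializable {
    String city = "Springfield";
}

class Person implements Serializable {
    String name = "Homer";
    Address address = new Address();   // in memory: a reference; on the wire: encoded structurally
}

public class SerializeDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new Person());
        }
        byte[] wireFormat = bos.toByteArray();   // self-describing, machine-independent bytes

        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wireFormat))) {
            Person copy = (Person) in.readObject();
            System.out.println(copy.name + " lives in " + copy.address.city);
        }
    }
}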

fastest way to map a large number of longs

I'm writing a Java application that transforms numbers (long) into a small set of result objects. This mapping process is very critical to the app's performance, as it is needed very often.
public static Object computeResult(long input) {
    Object result;
    // ... calculate
    return result;
}
There are about 150,000,000 different key objects, and about 3,000 distinct values.
The transformation from the input number (long) to the output (immutable object) can be computed by my algorithm with a speed of 4,000,000 transformations per second. (using 4 threads)
I would like to cache the mapping of the 150M different possible inputs to make the translation even faster, but I found some difficulties creating such a cache:
public class Cache {
    private static long[] sortedInputs; // 150M length
    private static Object[] results;    // 150M length

    public static Object lookupCachedResult(long input) {
        int index = Arrays.binarySearch(sortedInputs, input);
        return results[index];
    }
}
I tried to create two arrays with a length of 150M. The first array holds all possible input longs, sorted numerically. The second array holds a reference to one of the 3000 distinct, precalculated result objects at the index corresponding to the first array's input.
To get the cached result, I do a binary search for the input number in the first array. The cached result is then looked up in the second array at the same index.
Sadly, this cache method is not faster than computing the results. Not even half as fast: only about 1.5M lookups per second (also using 4 threads).
Can anyone think of a faster way to cache results in such a scenario?
I doubt there is a database engine that is able to answer more than 4,000,000 queries per second on, let's say an average workstation.
Hashing is the way to go here, but I would avoid using HashMap, as it only works with objects, i.e. you must box each long into a Long when you insert it, which can slow things down. Maybe this performance issue is not significant thanks to the JIT, but I would recommend at least trying the following and measuring performance against the HashMap variant:
Save your longs in a long array of some length n > 3000 and do the hashing by hand via a very simple (and thus efficient) hash function like index = key % n. Since you know your 3000 possible values beforehand, you can empirically find an array length n such that this trivial hash function won't cause collisions. That way you circumvent rehashing etc. and get true O(1) performance.
Secondly, I would recommend you look at Java numerical libraries like
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are backed by native LAPACK and BLAS implementations that are usually highly optimized by very smart people. Maybe you can formulate your algorithm in terms of matrix/vector algebra so that it computes the whole long array at one time (or chunk-wise).
There are about 150,000,000 different key objects, and about 3,000 distinct values.
With so few distinct values, you should ensure that they get re-used (unless they're pretty small objects). For this, an Interner is perfect (though you can run your own).
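If you'd rather not pull in a library, a hand-rolled interner is only a few lines; this sketch assumes the result objects implement equals/hashCode sensibly (the class name is illustrative):
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Returns a canonical instance for each distinct value, so the ~3000 result
// objects are shared instead of being duplicated per key.
final class SimpleInterner<T> {
    private final ConcurrentMap<T, T> pool = new ConcurrentHashMap<>();

    T intern(T value) {
        T canonical = pool.putIfAbsent(value, value);
        return canonical != null ? canonical : value;
    }
}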
I tried HashMap and TreeMap; both attempts ended in an OutOfMemoryError.
There's a huge memory overhead for both of them. And there isn't much point in using a TreeMap, as it uses a sort of binary search, which you've already tried.
There are at least three implementations of a long-to-Object map available; google for "primitive collections". Such a map should use only slightly more memory than your two arrays. With hashing usually being O(1) (let's ignore the worst case, as there's no reason for it to happen, is there?) and much better memory locality, it'll beat(*) your binary search by a factor of about 20. Your binary search needs log2(150e6), i.e., about 27 steps, while hashing may need on average maybe two. This depends on how tightly you pack the hash table; that is usually a parameter given when it gets created.
In case you roll your own (which you most probably shouldn't), I'd suggest using an array of size 1 << 28, i.e., 268,435,456 entries, so that you can use bitwise operations for indexing.
(*) Such predictions are hard, but I'm sure it's worth trying.
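A minimal sketch of such a hand-rolled open-addressing long-to-Object map, using the 1 << 28 capacity suggested above so indexing can use a bitwise mask. The empty-slot sentinel, the mixing constant and the lack of resizing are simplifying assumptions, and the two backing arrays alone need several gigabytes of heap:
import java.util.Arrays;

final class LongObjectMap {
    private static final int CAPACITY = 1 << 28;        // 268,435,456 slots (power of two)
    private static final int MASK = CAPACITY - 1;
    private static final long EMPTY = Long.MIN_VALUE;   // assumes this key never occurs

    private final long[] keys = new long[CAPACITY];
    private final Object[] values = new Object[CAPACITY];

    LongObjectMap() {
        Arrays.fill(keys, EMPTY);
    }

    private static int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;             // cheap bit mixing
        return (int) (h ^ (h >>> 32)) & MASK;           // bitwise indexing, no modulo
    }

    void put(long key, Object value) {
        int i = slot(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & MASK;                         // linear probing
        }
        keys[i] = key;
        values[i] = value;
    }

    Object get(long key) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & MASK;
        }
        return null;
    }
}
With 150M entries in roughly 268M slots the load factor stays around 0.56, so probe chains should stay short on average.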

Is there a way to efficiently store a sequence of enum values in Java?

I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be why you want to save a few bytes on enum references; other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.
One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits.
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.
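For completeness, here is a minimal sketch of the radix-packing idea from the question combined with ordinal(): each long word holds as many base-radix digits as fit (27 for a 5-constant enum). The class name and the append-only restriction are simplifications for illustration:
import java.util.AbstractList;
import java.util.Arrays;

// Packs enum ordinals into long words using radix encoding. Append-only:
// set() and remove() are not supported in this sketch.
final class PackedEnumList<E extends Enum<E>> extends AbstractList<E> {
    private final E[] constants;
    private final int radix;
    private final int perWord;
    private long[] words = new long[16];
    private int size;

    PackedEnumList(Class<E> type) {
        this.constants = type.getEnumConstants();
        this.radix = constants.length;
        if (radix < 2) throw new IllegalArgumentException("need at least two enum constants");
        this.perWord = elementsPerWord(radix);   // 27 for radix 5
    }

    private static int elementsPerWord(int radix) {
        int count = 0;
        for (long max = Long.MAX_VALUE; max >= radix; max /= radix) count++;
        return count;
    }

    @Override public boolean add(E e) {
        if (size / perWord >= words.length) {
            words = Arrays.copyOf(words, words.length * 2);
        }
        words[size / perWord] += e.ordinal() * pow(radix, size % perWord);
        size++;
        return true;
    }

    @Override public E get(int index) {
        long digit = (words[index / perWord] / pow(radix, index % perWord)) % radix;
        return constants[(int) digit];
    }

    @Override public int size() { return size; }

    private static long pow(long base, int exp) {
        long result = 1;
        for (int i = 0; i < exp; i++) result *= base;
        return result;
    }
}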
