An array of ints in Java is stored as a block of 32-bit values in memory. How is an array of Integer objects stored? i.e.
int[] vs. Integer[]
I'd imagine that each element in the Integer array is a reference to an Integer object, and that the Integer object has object storage overheads, just like any other object.
I'm hoping however that the JVM does some magical cleverness under the hood given that Integers are immutable and stores it just like an array of ints.
Is my hope woefully naive? Is an Integer array much slower than an int array in an application where every last ounce of performance matters?
No VM I know of will store an Integer[] array like an int[] array for the following reasons:
There can be null entries in an Integer[] array, and an int array has no spare bits to indicate this. The VM could store this 1-bit-per-slot information in a hidden bit array, though.
You can synchronize on the elements of an Integer array. This is much harder to overcome than the first point, since you would have to store a monitor object for each array slot.
The elements of an Integer[] can be compared for identity. You could, for example, create two Integer objects with the value 1 via new, store them in different array slots, and later retrieve and compare them via ==. This must evaluate to false, so you would have to store this information somewhere. Or you keep a reference to one of the Integer objects somewhere and use it for the comparison, making sure one of the == comparisons is false and the other true. This means the whole concept of object identity is quite hard to handle for an optimized Integer array (the snippet after this list illustrates the == behavior).
You can cast an Integer[] to e.g. Object[] and pass it to methods expecting just an Object[]. This means all the code that handles Object[] must now be able to handle the special Integer[] representation too, making it slower and larger.
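To make the identity point concrete, here is a minimal snippet (new Integer(...) is deprecated since Java 9 but still legal):

Integer a = new Integer(1);
Integer b = new Integer(1);
System.out.println(a == b);      // false: two distinct objects
System.out.println(a.equals(b)); // true: same value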
Taking all this into account, it would probably be possible to make a special Integer[] which saves some space in comparison to a naive implementation, but the additional complexity would likely affect a lot of other code, making it slower in the end.
The overhead of using Integer[] instead of int[] can be quite large in both space and time. On a typical 32-bit VM an Integer object consumes 16 bytes (8 bytes for the object header, 4 for the payload and 4 for alignment), while the Integer[] itself uses as much space as an int[]. On 64-bit VMs (using 64-bit pointers, which is not always the case) an Integer object consumes 24 bytes (16 for the header, 4 for the payload and 4 for alignment). In addition, a slot in the Integer[] uses 8 bytes instead of the 4 in an int[]. This means you can expect an overhead of 16 to 28 bytes per slot, i.e. 4 to 7 times the memory of a plain int array.
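As a back-of-the-envelope check of these numbers, a sketch assuming a 64-bit VM with uncompressed 8-byte references, a 16-byte array header, 24-byte Integer objects, and no sharing via the autobox cache:

long n = 1_000_000;
long intArrayBytes     = 16 + 4 * n;           // array header + packed ints
long integerArrayBytes = 16 + 8 * n + 24 * n;  // array header + references + Integer objects
System.out.println("int[]:     ~" + intArrayBytes / 1024 / 1024 + " MB");     // ~3 MB
System.out.println("Integer[]: ~" + integerArrayBytes / 1024 / 1024 + " MB"); // ~30 MB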
The performance overhead can be significant too, for two main reasons:
Since you use more memory, you put much more pressure on the memory subsystem, making cache misses more likely with Integer[]. If you traverse the contents of an int[] linearly, the cache will have prefetched most of the entries by the time you need them (since the layout is linear too). But with an Integer array, the Integer objects themselves may be scattered randomly across the heap, making it hard for the cache to predict where the next memory reference will point.
The garbage collector has to do much more work, both because of the additional memory used and because it has to scan and move each Integer object separately; an int[] is just one object, and its contents don't have to be scanned (they contain no references to other objects).
To sum it up: using an int[] in performance-critical work will be both much faster and much more memory-efficient than using an Integer array on current VMs, and it is unlikely this will change much in the near future.
John Rose is working on fixnums in the JVM to fix this problem.
I think your hope is woefully naive. Specifically, it needs to deal with the fact that Integer can be null, whereas int cannot be. That alone is reason enough to store the object pointer.
That said, the actual object pointer will be to an immutable Integer instance, and for a select subset of values (the small integers cached by autoboxing) those instances are shared.
It won't be much slower, but because an Integer[] must accept "null" as an entry and int[] doesn't have to, there will be some amount of bookkeeping involved, even if Integer[] is backed by an int[].
So if every last ounce of performance matters, use int[].
The reason that Integer can be null, whereas int cannot, is that Integer is a full-fledged Java object, with all of the overhead that entails. There's value in this, since you can write
Integer foo = null;
which is a way of saying that foo will have a value, but doesn't have one yet.
Another difference is that int arithmetic performs no overflow checking. For instance,
int bar = Integer.MAX_VALUE;
bar++;
will merrily increment bar, and you end up with a large negative number, which is probably not what you intended in the first place.
foo = Integer.MAX_VALUE;
foo++;
behaves exactly the same way: foo is auto-unboxed to an int, incremented with silent wrap-around, and boxed again, so the Integer wrapper gives you no overflow protection either.
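If you do want overflow to fail loudly, Java 8 added exact-arithmetic helpers: Math.addExact and Math.incrementExact throw an ArithmeticException instead of wrapping:

int bar = Integer.MAX_VALUE;
bar = Math.addExact(bar, 1); // throws ArithmeticException: integer overflow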
One last point is that Integer, being a Java object, carries with it the space overhead of an object. I think that someone else may need to chime in here, but I believe that every object consumes 12 bytes for overhead, and then the space for the data storage itself. If you're after performance and space, I wonder whether Integer is the right solution.
Related
I'm reading in a lot of data from a file. There may be 100 different data objects with the necessary headings, but there can be well over 300,000 values stored in each of these data objects. The values need to be stored in the same order in which they are read. This is the constructor for the data object:
public Data(String heading, ArrayList<Float> values) {
this.heading = heading;
this.values = values;
}
What would be the quickest way to store and retrieve these values sequentially in RAM?
Although in your comments you mention "quickness", without specifying what operation needs to be "quick", your main concern seems to be heap memory consumption.
Let's assume 100 groups of 300,000 numbers (you've used words like "may be" and "well over" but this will do as an example).
That's 30,000,000 numbers to store, plus 100 headings and some structural overhead for grouping.
A primitive Java float is 32 bits, that is 4 bytes. So at an absolute minimum, you're going to need 30,000,000 * 4 bytes == 120MB.
An array of primitives - float[30000000] - is just all the values concatenated into a contiguous chunk of memory, so it will consume this theoretical minimum of 120MB -- plus a few bytes of once-per-array overhead that I won't go into detail about here.
A Java Float wrapper object is 12 bytes (on a typical 32-bit JVM). When you store an object (rather than a primitive) in an array, the reference itself is 4 bytes. So an array of Float - Float[30000000] - will consume 30,000,000 * (12 + 4) == 480MB.
So, you can cut your memory use by more than half by using primitives rather than wrappers.
An ArrayList is quite a light wrapper around an array of Object and so has about the same memory costs. The once-per-list overheads are too small to have an impact compared to the elements, at these list sizes. But there are some caveats:
ArrayList can only store Objects, not primitives, so if you choose a List you're stuck with the 12-bytes-per-element overhead of Float.
There are some third-party libraries that provide lists of primitives - see: Create a List of primitive int?
The capacity of an ArrayList is dynamic, and to achieve this, if you grow the list to be bigger than its backing array, it will:
create a new array, 50% bigger than the old array
copy the contents of the old array into the new array (this sounds expensive, but hardware is very fast at doing this)
discard the old array
This means that if the backing array happens to have 30 million elements, and is full, ArrayList.add() will replace the array with one of 45 million elements, even if your List only needs 30,000,001.
You can avoid this if you know the needed capacity in advance, by providing the capacity in the constructor.
You can use ArrayList.trimToSize() to drop unneeded capacity and claw some memory back after you've filled the ArrayList.
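For example (assuming the element count, or at least a good upper bound, is known up front):

ArrayList<Float> values = new ArrayList<>(30_000_000); // pre-sized: no grow-and-copy cycles
// ... fill the list ...
values.trimToSize(); // afterwards, drop any unused backing-array capacity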
If I was striving to use as little heap memory as possible, I would aim to store my lists of numbers as arrays of primitives:
class Data {
String header;
float[] values;
}
... and I would just put these into an ArrayList<Data>.
With this structure, you have O(1) access to arbitrary values, and you can use Arrays.binarySearch() (if the values are sorted) to find by value within a group.
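For instance (illustrative usage of the Data class above):

List<Data> groups = new ArrayList<>();
// ... fill groups ...
float v = groups.get(3).values[42];                        // O(1) access by index
int pos = Arrays.binarySearch(groups.get(3).values, 7.5f); // only valid if values is sorted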
If at all possible, I would find out the size of each group before reading the values, and initialise the array to the right size. If you can, make your input file format facilitate this:
String line;
while ((line = readLine()) != null) {
    if (isHeader(line)) {
        ParsedHeader header = new ParsedHeader(line);
        currentArray = new float[header.size()];
        arrayIndex = 0;
        currentGroup = new Group(header.name(), currentArray);
        groups.add(currentGroup);
    } else if (isValue(line)) {
        currentArray[arrayIndex++] = parseValue(line);
    }
}
If you can't change the input format, consider making two passes through the file - once to discover group lengths, once again to fill your arrays.
If you have to consume the file in one pass, and the file format can't provide group lengths before groups, then you'll have to do something that allows a "list" to grow arbitrarily. There are several options:
Consume each group into an ArrayList<Float> - when the group is complete, convert it into a float[]:
float[] array = new float[list.size()];
int i = 0;
for (Float f : list) {
    array[i++] = f; // auto-unboxes Float to float
}
Use a third-party list-of-float library class
Copy the logic used by ArrayList to replace your array with a bigger one when needed (see the sketch after this list) -- http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
Any number of approaches discussed in Computer Science textbooks, for example a linked list of arrays.
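A minimal sketch of that ArrayList-style growth for primitives (the class and method names here are illustrative, not from any library):

class FloatList {
    private float[] data = new float[16];
    private int size = 0;

    void add(float value) {
        if (size == data.length) {
            // grow by 50%, like ArrayList does
            data = java.util.Arrays.copyOf(data, data.length + (data.length >> 1));
        }
        data[size++] = value;
    }

    float get(int index) {
        if (index >= size) throw new IndexOutOfBoundsException(String.valueOf(index));
        return data[index];
    }

    int size() { return size; }

    float[] toArray() { return java.util.Arrays.copyOf(data, size); }
}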
However none of this considers your reasons for slurping all these numbers into memory in the first place, nor whether this store meets your needs when it comes to processing the numbers.
You should step back and consider what your actual data processing requirement is, and whether slurping into memory is the best approach.
See whether you can do your processing by storing only a slice of the data at a time, rather than holding the whole thing in memory. For example, to calculate max/min/mean you don't need every number in memory -- you just need to keep running totals (see the sketch below).
Or, consider using a lightweight database library.
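A one-pass sketch of that max/min/mean case (hasNextValue() and nextValue() are hypothetical stand-ins for your real input source):

double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
long count = 0;
while (hasNextValue()) {          // hypothetical: more input available?
    double value = nextValue();   // hypothetical: read the next number
    min = Math.min(min, value);
    max = Math.max(max, value);
    sum += value;
    count++;
}
double mean = sum / count;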
You could use a red-black BST, which would be an extremely efficient way to store/retrieve data. It relies on nodes that link to other nodes, so there's no limit on the size of the input, as long as you have enough memory for the JVM.
I was going through the concepts of parallel sort introduced in Java 8.
As per the doc.
If the length of the specified array is less than the minimum
granularity, then it is sorted using the appropriate Arrays.sort
method.
The spec however doesn't specify this minimum limit.
When I looked up the Code in java.util.Arrays it was defined as
private static final int MIN_ARRAY_SORT_GRAN = 1 << 13;
i.e., 8192 values in the array
As per the explanation provided here, I understand why the value was hard-coded as 8192: it was designed with current CPU architecture in mind.
With the -XX:+UseCompressedOops option being enabled by default, any
system with less than 32GB RAM would be using 32bit(4bytes) pointers.
Now, with a L1 Cache size of 32KB for data portion, we can pass
32KB/4Bytes = 8KB of data at once to CPU for computation. That's
equivalent to 8192 bytes of data being processed at once.
So for a function sorting a byte array, parallelSort(byte[]), this makes sense: you can keep the minimum parallel-sort limit at 8192 values (each value = 1 byte for a byte array).
But if you consider public static void parallelSort(int[] a):
An int variable is 4 bytes (32 bits). So ideally, of the 8192 bytes, we can store 8192/4 = 2048 numbers in the CPU cache at once.
So the minimum granularity in this case is supposed to be 2048.
Why are all parallelSort functions in Java (be it byte[], int[], long[], etc.) using 8192 as the default min. number of values needed in order to perform parallel sorting?
Shouldn't it vary according to the types passed to the parallelSort function?
First, it seems that you've misread the linked explanation. The L1 data cache is 32KB, so int[] fits it ideally: 32768/4 = 8192 ints can be placed in the L1 cache at once.
Second, I don't think the given explanation is correct. It concentrates on pointers, so it mainly applies to sorting object arrays; but when you compare the data in an object array, you always need to dereference those pointers to access the real data. And if your objects have non-primitive fields, you'll have to dereference even further. For example, if you sort an array of strings, you have to access not only the array itself, but also the String objects and the char[] arrays stored inside them. All of this requires many additional cache lines.
I did not find any explicit explanation of this particular value in the review thread for this change. Previously it was 256; it was changed to 8192 as part of the JDK-8014076 update. I assume it simply showed the best performance on some reasonable test suite. Keeping separate thresholds for different cases would add more complexity, and the tests probably showed that it doesn't pay off. Note that an ideal threshold is impossible for Object[] arrays anyway, as the compare function is user-specified and can have arbitrary complexity. For a sufficiently complex compare function, it's probably reasonable to parallelize even very small arrays.
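For what it's worth, the threshold is applied internally, so the calling code is identical either way:

int[] small = new int[1_000];     // below the 8192-element granularity: sorted sequentially
int[] large = new int[1_000_000]; // above it (given a parallel common pool): sorted via fork/join
// ... fill the arrays ...
java.util.Arrays.parallelSort(small);
java.util.Arrays.parallelSort(large);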
I'm trying to get a general idea of the memory cost difference between an Integer array and int array. While there seems to be a lot of information out there about the differences between a primitive int and Integer object, I'm still a little confused as to how to calculate the memory costs of an int[] and Integer[] array (overhead costs, padding, etc).
Any help would be appreciated. Thanks!
In addition to storing the length of the array, an array of ints needs space for N 4-byte elements, while an array of Integers needs space for N references, whose size is platform-dependent; commonly, that would be 4 bytes on 32-bit platforms or 8 bytes on 64-bit platforms.
As far as int[] goes, no additional memory is required to store the data. Integer[], on the other hand, needs Integer objects, which could all be distinct or could be shared (e.g. through the interning of small values done by the Java platform itself). Therefore, Integer[] requires up to N additional objects, each containing a 4-byte int.
Assuming that all Integers in an Integer[] array are distinct objects, the array and its contents will take two to three times the space of an int[] array. On the other hand, if all objects are shared and the memory cost of the shared objects is attributed elsewhere, there may be no additional overhead at all on 32-bit platforms, and about a 2x overhead on 64-bit platforms.
Here is a comparison on jdk6u26 of the size of an array of 1024 Integers as opposed to 1024 ints. Note that in the case of an Integer[] array containing low-valued Integers, these can be shared with other uses of those Integers in the JVM via the autobox cache.
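That measurement isn't reproduced here, but a rough version can be sketched with Runtime memory deltas (expect some GC noise; new Integer(i) forces distinct objects so the autobox cache doesn't skew the boxed case):

public class ArraySizeComparison {
    private static long used() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = used();
        int[] ints = new int[1024];
        long afterInts = used();
        Integer[] boxed = new Integer[1024];
        for (int i = 0; i < boxed.length; i++) {
            boxed[i] = new Integer(i); // distinct objects, bypassing the autobox cache
        }
        long afterBoxed = used();
        System.out.println("int[1024]:     ~" + (afterInts - before) + " bytes");
        System.out.println("Integer[1024]: ~" + (afterBoxed - afterInts) + " bytes");
        System.out.println(ints.length + boxed.length); // keep both arrays reachable
    }
}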
So I'm working on a project for my algorithms class. I'm doing some research online and see that some people use an ArrayList<Integer> and some use an int[] array. My question is: which is better to use for a min-heap, and why? The project requires me to keep the 10,000 largest numbers from a very large list of numbers.
If you know the array size in advance, a bare int[] array is faster. The performance difference is probably negligible -- but the idea is that ArrayList is internally implemented as an Object[] array, so you save that overhead, plus the overhead of dealing with Integer vs. int.
An int[] will consume less memory than an ArrayList<Integer>. Part of that is simply the overhead added by each Integer, about 16 bytes per instance. This video goes through the memory impact of various objects and collections on 32-bit and 64-bit JVMs. At about the 9:30 mark it covers the memory associated with each object; at about the 11:15 mark it covers how much memory various types (including object references) take.
For an int[], you have 1 Object (the int[]) and it will actually contain all of the individual int values as contiguous memory.
For an ArrayList<Integer>, you have the ArrayList object, the Object[] object and all of the Integer objects. Additionally, the Object[] doesn't actually contain the Integer objects in contiguous memory, rather it contains object references in contiguous memory. The Integer objects themselves are elsewhere on the heap.
So the end result is that an ArrayList<Integer> requires roughly 6x the memory of an int[]. The backing Object[] and the int[] take the same amount of memory (~40,000 bytes). The 10k Integer objects take ~20 bytes each, for a total of ~200,000 bytes. So the ArrayList will be a minimum of ~240,000 bytes, compared to approximately 40,000 bytes for the int[].
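For the stated project (keeping the 10,000 largest numbers), a primitive int[] min-heap avoids all of that boxing. A minimal sketch (illustrative code, not from the JDK): the smallest kept value sits at heap[0], so a new value only enters the heap if it beats the current minimum.

class TopK {
    private final int[] heap;
    private int size = 0;

    TopK(int k) { heap = new int[k]; }

    void offer(int value) {
        if (size < heap.length) {          // heap not yet full: append and sift up
            heap[size] = value;
            int i = size++;
            while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
                swap(i, (i - 1) / 2);
                i = (i - 1) / 2;
            }
        } else if (value > heap[0]) {      // beats the current minimum: replace root
            heap[0] = value;
            siftDown();
        }
    }

    private void siftDown() {
        int i = 0;
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < size && heap[left] < heap[smallest]) smallest = left;
            if (right < size && heap[right] < heap[smallest]) smallest = right;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }

    private void swap(int a, int b) { int t = heap[a]; heap[a] = heap[b]; heap[b] = t; }

    int[] result() { return java.util.Arrays.copyOf(heap, size); } // the K largest, heap order
}

Feed every number through offer(); when the input is exhausted, result() holds the 10,000 largest values (unsorted).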
I'm using a java program to get some data from a DB. I then calculate some numbers and start storing them in an array. The machine I'm using has 4 gigs of RAM. Now, I don't know how many numbers there will be in advance, so I use an ArrayList<Double>. But I do know there will be roughly 300 million numbers.
So, since one double is 8 bytes a rough estimate of the memory this array will consume is 2.4 gigs (probably more because of the overheads of an ArrayList). After this, I want to calculate the median of this array and am using the org.apache.commons.math3.stat.descriptive.rank.Median library which takes as input a double[] array. So, I need to convert the ArrayList<Double> to double[].
I did see many questions where this is raised, and they all mention that there is no way around looping through the entire array. That's fine, but since both objects are then kept in memory, this brings my memory requirement up to 4.8 gigs. Now we have a problem, since the total RAM available is 4 gigs.
First of all, is my suspicion correct that the program will at some point give me a memory error (it is currently running)? And if so, how can I calculate the median without having to allocate double the memory? I want to avoid sorting the array, since the median can be computed in O(n).
Your problem is even worse than you realize, because ArrayList<Double> is much less efficient than 8 bytes per entry. Each entry is actually a Double object, to which the ArrayList keeps a reference in its backing array. A Double object is around 16 bytes (an object header plus the 8-byte double itself), and the reference to it adds another 4, bringing the total up to roughly 20 bytes per entry, even excluding overhead for memory management and such.
If the constraints were a little wider, you could implement your own DoubleArray that is backed by a double[] but knows how to resize itself. However, the resizing means you'll have to keep a copy of both the old and the new array in memory at the same time, also blowing your memory limit.
That still leaves a few options though:
Loop through the input twice; once to count the entries, once to read them into a right-sized double[]. It depends on the nature of your input whether this is possible, of course.
Make some assumption on the maximum input size (perhaps user-configurable), and allocate a double[] up front that is this fixed size. Use only the part of it that's filled.
Use float instead of double to cut memory requirements in half, at the expense of some precision.
Rethink your algorithm to avoid holding everything in memory at once.
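On the questioner's O(n) point: the median can be found in expected O(n) time, in place, with quickselect, so no second double[] copy and no full sort are needed. A sketch of the standard algorithm (not the Apache Commons implementation; note that it reorders the array, and for even lengths it returns the upper middle element):

static double median(double[] a) {
    return select(a, 0, a.length - 1, a.length / 2);
}

static double select(double[] a, int lo, int hi, int k) {
    while (lo < hi) {
        int p = partition(a, lo, hi);
        if (p == k) return a[p];
        if (p < k) lo = p + 1; else hi = p - 1;
    }
    return a[lo];
}

static int partition(double[] a, int lo, int hi) {
    double pivot = a[hi]; // last element as pivot; randomize in practice
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) { double t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    double t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}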
There are many open source libraries that create dynamic arrays for primitives. One of these:
http://trove.starlight-systems.com/
The median is the value at the middle of a sorted list, so you don't have to use a second array; you can just sort in place:
Collections.sort(myArray);
final double median = myArray.get(myArray.size() / 2);
(For an even-sized list this picks the upper of the two middle values; average the elements at size/2 - 1 and size/2 if you need the conventional definition.)
And since you get that data from a DB anyways, you could just tell the DB to give you the median instead of doing it in Java, which will save all the time (and memory) for transmitting the data as well.
I agree: use Trove4j's TDoubleArrayList class (see the javadoc) to store doubles, or TFloatArrayList for floats. Combining the previous answers, we get:
// guess an initial capacity to avoid resizing
TDoubleArrayList data = new TDoubleArrayList(initialCapacity);
// ... fill data ...
data.sort();
double median = data.get(data.size() / 2);