Buffer is an abstract class having concrete subclasses such as ByteBuffer, IntBuffer, etc. It seems to be a container of data of a specific primitive type. What are the benefits of a Buffer? Why wouldn't I just use an array or a list?
A buffer can be defined, in its simplest form, as a contiguous block of memory of some type. Hence a byte buffer of size 4K (4096 bytes) may occupy memory locations 0xf000 through 0xffff inclusive.
As to why a buffer type may be used instead of an array or list: neither of those two alternatives has the built-in features of limit, position or mark.
On the first item, a buffer separates the capacity from the limit in that you can have a capacity of 1000 with a current limit of 10. In other words, it lets the usable size vary anywhere up to and including the capacity.
For the other two features, the current position provides an in-built way to read or write the next element, easing sequential processing, and the mark provides a way to save the current position for later reset.
All these features would require extra variables if you needed them in conjunction with an array or list.
Of course, if you don't need any of these features then, by all means, use an array.
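As a small illustration of capacity, limit, position and mark, here is a minimal sketch using java.nio.ByteBuffer. The class name BufferDemo and the sample values are made up for the example; flip() is the standard call that sets the limit to the current position and rewinds.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class BufferDemo {
    /** Writes three bytes, then reads them back using limit, position and mark. */
    static int[] run() {
        ByteBuffer buf = ByteBuffer.allocate(1000);        // capacity 1000, limit 1000, position 0
        buf.put((byte) 10).put((byte) 20).put((byte) 30);  // position advances to 3
        buf.flip();                                        // limit = 3, position = 0
        int limitAfterFlip = buf.limit();
        byte first = buf.get();                            // reads 10, position becomes 1
        buf.mark();                                        // remember position 1
        byte second = buf.get();                           // reads 20, position becomes 2
        buf.reset();                                       // jump back to the marked position 1
        byte again = buf.get();                            // reads 20 a second time
        return new int[] { buf.capacity(), limitAfterFlip, first, second, again, buf.position() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```

Doing the same with a plain byte[] would need three extra int variables that you keep in sync yourself.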
When I read the source code of SparseArray in Android SDK, I met a method named ArrayUtils.newUnpaddedObjectArray(capacity). I don't understand what the unpadded array means?
You can find more information in class VMRuntime, which is used by class ArrayUtils.
The JavaDoc of VMRuntime.newUnpaddedArray(...) says:
Returns an array of at least minLength, but potentially larger. The increased size comes from avoiding any padding after the array. The amount of padding varies depending on the componentType and the memory allocator implementation.
Java data types use different sizes in memory. If only the needed capacity were allocated, some space at the end would be left unused before the next memory allocation. This space is called padding. So, in order not to waste this space, the array is made a little bit larger.
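To make the idea concrete, here is a rough sketch of the arithmetic. It is not the VMRuntime implementation: the 12-byte array header and the 8-byte allocation alignment are assumptions chosen for illustration, and the names UnpaddedLength and unpaddedLength are hypothetical.

```java
class UnpaddedLength {
    static final int HEADER_BYTES = 12; // assumed array header size
    static final int ALIGNMENT = 8;     // assumed allocation granularity

    /** Largest element count that fits in the block the allocator would hand out anyway. */
    static int unpaddedLength(int minLength, int elementBytes) {
        long needed = HEADER_BYTES + (long) minLength * elementBytes;
        long allocated = (needed + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; // round up to the alignment
        long fits = (allocated - HEADER_BYTES) / elementBytes;
        return (int) Math.min(fits, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(unpaddedLength(10, 4)); // int-sized elements
        System.out.println(unpaddedLength(5, 1));  // byte-sized elements
    }
}
```

Under these assumptions a request for 10 four-byte elements needs 52 bytes, the allocator rounds that up to 56, and the extra 4 bytes are enough for an eleventh element.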
I am parsing data where precision is not my main concern. I often get java.lang.OutOfMemoryError even if I use maximum Java heap size. So my main concern here is memory usage, and java heap space. Should I use double or float data type?
I consistently get OOM exceptions because I use a great number of ArrayLists with numbers.
Well that is your problem!
An ArrayList of N 32-bit floating point values takes at least [1] 20 * N bytes in a 32-bit JVM and 24 * N bytes in a 64-bit JVM [2].
An ArrayList of N 64-bit floating point values takes the same amount of space [3].
The above only accounts for the backing array and the list elements. If you have huge numbers of small ArrayList objects, the overhead of the ArrayList object itself may be significant. (Add 16 or 24 bytes for each ArrayList object.)
If you make use of dynamic resizing, this may generate object churn as the backing array grows. At some points, the backing array may be as much as twice as large as it needs to be.
By contrast:
An array of 32-bit floating point values takes approximately 4 * N bytes [4].
An array of 64-bit floating point values takes approximately 8 * N bytes [4].
There is no wastage due to dynamic resizing.
Solutions:
ArrayList<Float> versus ArrayList<Double> makes no difference. It is NOT a solution.
For maximal saving, use float[] or double[] depending on your precision requirements. Preallocate the arrays to hold the exact number of elements required.
If you want the flexibility of dynamic resizing there are 3rd-party libraries that implement space efficient lists of primitive types. Alternatively implement your own. However, you won't be able to use the standard List<...> API because that forces you down the path of using Float OR Double.
1 - The actual space used depends on how the ArrayList was created and populated. If you pre-allocate an ArrayList with exactly the correct capacity, you will use the space I said above. If you build the array by repeatedly appending to an ArrayList with the default initial capacity, you will use on average N * 2 bytes extra space for a 32-bit JVM. This is due to the heuristic that ArrayList uses to grow the backing array when it is full.
2 - On a 64-bit JVM, a pointer occupies 8 bytes rather than 4 ... unless you are using compressed oops.
3 - The reason it takes the same amount of bytes is that on a typical JVM a Float and a Double are both 16 bytes due to heap node padding.
4 - There is a header overhead of (typically) 12 bytes per array, and the array's heap node size is padded to a multiple of 8 bytes.
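The figures above can be turned into a back-of-the-envelope estimator. This is only a sketch: the constants are the typical values quoted in the footnotes (16 bytes per boxed Float or Double, 12 bytes of array header, 8-byte padding), and the class and method names are hypothetical. Real numbers depend on the JVM and its settings.

```java
class FootprintEstimate {
    static final int BOXED_BYTES = 16;        // typical Float/Double object size (footnote 3)
    static final int ARRAY_HEADER_BYTES = 12; // typical array header (footnote 4)

    /** Backing-array slots plus boxed elements of an exactly sized ArrayList. */
    static long boxedList(long n, int referenceBytes) {
        return n * (referenceBytes + BOXED_BYTES);
    }

    /** A primitive array: header plus payload, padded to a multiple of 8 bytes. */
    static long primitiveArray(long n, int elementBytes) {
        long raw = ARRAY_HEADER_BYTES + n * elementBytes;
        return (raw + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("ArrayList<Float>, 4-byte refs: " + boxedList(n, 4));
        System.out.println("ArrayList<Float>, 8-byte refs: " + boxedList(n, 8));
        System.out.println("float[]:  " + primitiveArray(n, Float.BYTES));
        System.out.println("double[]: " + primitiveArray(n, Double.BYTES));
    }
}
```

For a million values this gives roughly 20 MB or 24 MB for the boxed list against about 4 MB for a float[].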
If your memory usage is related to a huge amount (many millions) of floating-point numbers (which can be verified with a decent memory profiler), then you're most probably storing them in some data structures like arrays or lists.
Recommendations (I guess, you are already following most of them...):
Prefer float over double if number range and precision are sufficient, as that consumes only half the size.
Do not use the java.lang.Float or java.lang.Double classes for storage, as they have a considerable memory overhead compared to the naked scalar values.
Be sure to use arrays, not collections like java.util.List, as they store boxed java.lang.Float instances instead of the naked numbers.
But above that, have a decent memory profiler show you which instances occupy most of your memory. Maybe there are other memory consumers besides the float/double data.
EDIT:
The OP's recent comment "I consistently get OOM exceptions because I use a great number of ArrayLists with numbers" makes it clear. ArrayList<Float> wastes a lot of memory when compared to float[] (Stephen C gave detailed numbers in his answer), but gives the benefit of dynamic resizing.
So, I see the following possibilities:
If you can tell the array size from the beginning, then immediately use float[] arrays.
If you need the dynamic size while initializing instances, use ArrayList<Float> while building one object (when size still increases), and then copy the contents to a float[] array for long-term storage. Then the wasteful ArrayLists exist only for a limited timespan.
If you need dynamic sizes over the whole lifespan of your data, create your own FloatArrayList class based on a float[] array, resembling the ArrayList<Float> as far as your code needs it (that can range from a very shallow implementation up to a full-featured List, maybe based on AbstractList).
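A very shallow version of such a class might look like the sketch below. It is only one possible shape: the growth factor of roughly 1.5 is an arbitrary choice, it does not implement List, and it makes no attempt to handle capacities near Integer.MAX_VALUE.

```java
import java.util.Arrays;

class FloatArrayList {
    private float[] data;
    private int size;

    FloatArrayList(int initialCapacity) {
        data = new float[Math.max(initialCapacity, 1)];
    }

    /** Appends a value, growing the backing array by about 50% when it is full. */
    void add(float value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length + (data.length >> 1) + 1);
        }
        data[size++] = value;
    }

    float get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    /** Returns an exactly sized copy, suitable for long-term storage. */
    float[] toArray() {
        return Arrays.copyOf(data, size);
    }
}
```

Usage is the same as with ArrayList<Float>, minus the boxing: call add(1.5f) while the data is being built, then keep either the list or the result of toArray().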
I was going through the concepts of parallel sort introduced in Java 8.
As per the doc.
If the length of the specified array is less than the minimum
granularity, then it is sorted using the appropriate Arrays.sort
method.
The spec however doesn't specify this minimum limit.
When I looked up the Code in java.util.Arrays it was defined as
private static final int MIN_ARRAY_SORT_GRAN = 1 << 13;
i.e., 8192 values in the array
As per the explanation provided here.
I understand why the value was Hard-coded as 8192.
It was designed keeping the current CPU architecture in mind.
With the -XX:+UseCompressedOops option being enabled by default, any
system with less than 32GB RAM would be using 32bit(4bytes) pointers.
Now, with a L1 Cache size of 32KB for data portion, we can pass
32KB/4Bytes = 8KB of data at once to CPU for computation. That's
equivalent to 8192 bytes of data being processed at once.
So for a function which is working on sorting a byte array parallelSort(byte[]) this makes sense. You can keep minimum parallel sort limit as 8192 values (each value = 1 byte for byte array).
But if you consider public static void parallelSort(int[] a):
An int variable takes 4 bytes (32 bits). So ideally, of the 8192 bytes, we can store 8192/4 = 2048 numbers in the CPU cache at once.
So the minimum granularity in this case is supposed to be 2048.
Why are all parallelSort functions in Java (be it byte[], int[], long[], etc.) using 8192 as the default min. number of values needed in order to perform parallel sorting?
Shouldn't it vary according to the types passed to the parallelSort function?
First, it seems that you've misread the linked explanation. The L1 data cache is 32 KB, so for int[] it fits ideally: 32768/4 = 8192 ints could be placed into the L1 cache.
Second, I don't think the given explanation is correct. It concentrates on pointers, so it is mainly about sorting object arrays, but when you compare the data in an object array, you always need to dereference these pointers to access the real data. And if your objects have non-primitive fields, you'll have to dereference them even further. For example, if you sort an array of strings, you have to access not only the array itself, but also the String objects and the char[] arrays which are stored inside them. All of these would require many additional cache lines.
I did not find any explicit explanation of this particular value in the review thread for this change. Previously it was 256, then it was changed to 8192 as part of the JDK-8014076 update. I think it just showed the best performance on some reasonable test suite. Keeping separate thresholds for different cases would add more complexity, and the tests probably showed that it doesn't pay off. Note that an ideal threshold is impossible for Object[] arrays, as the compare function is user-specified and could have arbitrary complexity. For a sufficiently complex compare function it's probably reasonable to parallelize even very small arrays.
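For reference, the cache arithmetic from the first point can be checked with a few lines. The 32 KB figure is the L1 data cache size assumed in the question; the class name CacheFit is made up for the example, and nothing here claims to be the JDK authors' reasoning.

```java
class CacheFit {
    static final int L1_DATA_BYTES = 32 * 1024; // L1 data cache size assumed in the question

    /** Number of elements of the given width that fit into the assumed L1 data cache. */
    static int elementsThatFit(int elementBytes) {
        return L1_DATA_BYTES / elementBytes;
    }

    public static void main(String[] args) {
        System.out.println("byte: " + elementsThatFit(Byte.BYTES));
        System.out.println("int:  " + elementsThatFit(Integer.BYTES));
        System.out.println("long: " + elementsThatFit(Long.BYTES));
    }
}
```

Only the int case comes out at 8192, which fits the view that the threshold is an empirically chosen constant rather than a per-type cache calculation.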
Does anyone know of a Java class to store bytes that satisfies the following conditions?
Stores bytes efficiently (i.e. not one object per bytes).
Grows automatically, like a StringBuilder.
Allows indexed access to all of its bytes (without copying everything to a byte[]).
Nothing I've found so far satisfies these. Specifically:
byte[] : Doesn't satisfy 2.
ByteBuffer : Doesn't satisfy 2.
ByteArrayOutputStream : Doesn't satisfy 3.
ArrayList : Doesn't satisfy 1 (AFAIK, unless there's some special-case optimisation).
If I can efficiently remove bytes from the beginning of the array that would be nice. If I were writing it from scratch I would implement it as something like
{ ArrayList<byte[256]> data; int startOffset; int size; }
and then the obvious functions. Does something like this exist?
Most straightforward would be to subclass ByteArrayOutputStream and add functionality to access the underlying byte[].
Removal of bytes from the beginning can be implemented in different ways depending on your requirements. If you need to remove a chunk, System.arraycopy should work fine. If you need to remove single bytes, I would add a headIndex which keeps track of the beginning of the data (performing an arraycopy after enough data is "removed").
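A minimal sketch of that approach follows. It relies on the protected buf and count fields of ByteArrayOutputStream; the class name, the method names and the "compact once more than half is dead" rule are my own choices for illustration.

```java
import java.io.ByteArrayOutputStream;

class IndexedByteStream extends ByteArrayOutputStream {
    private int head; // index of the first byte that has not been removed

    /** Number of live bytes (written minus removed). */
    synchronized int length() {
        return count - head;
    }

    /** Indexed access without copying the backing array. */
    synchronized byte byteAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length());
        }
        return buf[head + index];
    }

    /** Drops n bytes from the front; compacts once more than half the array is dead. */
    synchronized void removeFirst(int n) {
        if (n < 0 || n > length()) {
            throw new IllegalArgumentException("cannot remove " + n + " of " + length());
        }
        head += n;
        if (head > count / 2) {
            System.arraycopy(buf, head, buf, 0, count - head);
            count -= head;
            head = 0;
        }
    }

    @Override
    public synchronized void reset() {
        super.reset();
        head = 0;
    }
}
```

Note that the inherited size() and toByteArray() still include removed bytes until the next compaction; a fuller implementation would override those as well.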
There are some implementations for high performance primitive collections such as:
hppc or Koloboke
You'd have to write one. Off the top of my head what I would do is create an ArrayList internally and store the bytes 4 to each int, with appropriate functions for masking off the bytes. Performance will be sub optimal for removing and adding individual bytes. However it will store the object in the minimal size if that is a real consideration, wasting no more than 3 bytes for storage (on top of the overhead for the ArrayList).
The laziest method will be ArrayList. It's not as inefficient as you seem to believe, since Byte instances can and will be shared, meaning there will be only 256 Byte objects in the entire VM unless you yourself do a "new Byte()" somewhere.
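The sharing can be observed directly, because autoboxing a byte goes through Byte.valueOf, which caches every possible value. A small sketch (the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

class ByteCacheDemo {
    /** Returns true if two separately boxed equal bytes are the same object. */
    static boolean sharesInstances() {
        List<Byte> list = new ArrayList<>();
        list.add((byte) 42); // autoboxed via Byte.valueOf
        list.add((byte) 42);
        return list.get(0) == list.get(1); // reference comparison on purpose
    }

    public static void main(String[] args) {
        System.out.println(sharesInstances());
    }
}
```

Each list slot still holds a reference, so the list costs a pointer per byte even though no new Byte objects are created.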
What is the maximum size of HashSet, Vector, LinkedList? I know that ArrayList can store more than 3277000 numbers.
However the size of list depends on the memory (heap) size. If it reaches maximum the JDK throws an OutOfMemoryError.
But I don't know the limit for the number of elements in HashSet, Vector and LinkedList.
There is no specified maximum size of these structures.
The actual practical size limit is probably somewhere in the region of Integer.MAX_VALUE (i.e. 2147483647, roughly 2 billion elements), as that's the maximum size of an array in Java.
A HashSet uses a HashMap internally, so it has the same maximum size as that
A HashMap uses an array which always has a size that is a power of two, so it can be at most 2^30 = 1073741824 elements big (since the next power of two is bigger than Integer.MAX_VALUE).
Normally the number of elements is at most the number of buckets multiplied by the load factor (0.75 by default). However, when the HashMap stops resizing, then it will still allow you to add elements, exploiting the fact that each bucket is managed via a linked list. Therefore the only limit for elements in a HashMap/HashSet is memory.
A Vector uses an array internally which has a maximum size of exactly Integer.MAX_VALUE, so it can't support more than that many elements
A LinkedList doesn't use an array as the underlying storage, so that doesn't limit the size. It uses a classical doubly linked list structure with no inherent limit, so its size is only bounded by the available memory. Note that a LinkedList will report the size wrongly if it is bigger than Integer.MAX_VALUE, because it uses a int field to store the size and the return type of size() is int as well.
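The power-of-two limit mentioned for HashMap is easy to verify with plain int arithmetic. This sketch only illustrates the arithmetic and does not inspect HashMap itself; the class name is hypothetical.

```java
class TableLimit {
    /** Largest power of two that is still a valid (positive) int array length. */
    static int largestPowerOfTwo() {
        return Integer.highestOneBit(Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(largestPowerOfTwo()); // 2^30
        System.out.println(1 << 31);             // the next power of two overflows int
    }
}
```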
Note that the Collection API does define how a Collection with more than Integer.MAX_VALUE elements should behave. Most importantly, it states this in the size() documentation:
If this collection contains more than Integer.MAX_VALUE elements, returns Integer.MAX_VALUE.
Note that while HashMap, HashSet and LinkedList seem to support more than Integer.MAX_VALUE elements, none of those implement the size() method in this way (i.e. they simply let the internal size field overflow).
This leads me to believe that other operations also aren't well-defined in this condition.
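The difference between the two behaviours can be sketched with a plain int counter. The method names are hypothetical and the code does not touch any real collection; it only shows what an overflowing size field does compared with what the size() contract asks for.

```java
class SizeOverflow {
    /** What a plain int size field does when incremented past the maximum. */
    static int incrementPastMax() {
        int size = Integer.MAX_VALUE;
        size++; // silently wraps around
        return size;
    }

    /** What the Collection.size() contract asks for: clamp at Integer.MAX_VALUE. */
    static int saturatingSize(long actualElements) {
        return (int) Math.min(actualElements, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(incrementPastMax());
        System.out.println(saturatingSize(3_000_000_000L));
    }
}
```

A collection that wants to honour the contract would have to track its element count in a long (or detect the overflow) and clamp in size().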
So I'd say it's safe to use those general-purpose collections with up to Integer.MAX_VALUE elements. If you know that you'll need to store more than that, then you should switch to dedicated collection implementations that actually support this.
In all cases, you're likely to be limited by the JVM heap size rather than anything else. Eventually you'll always get down to arrays so I very much doubt that any of them will manage more than 2^31 - 1 elements, but you're very, very likely to run out of heap before then anyway.
It very much depends on the implementation details.
A HashSet uses an array as an underlying store, which by default it attempts to grow when the collection is 75% full. This means it will fail if you try to add more than about 750,000,000 entries. (It cannot grow the array from 2^30 to 2^31 entries.)
Increasing the load factor increases the maximum size of the collection. e.g. a load factor of 10 allows 10 billion elements. (It is worth noting that HashSet is relatively inefficient past 100 million elements as the distribution of the 32-bit hashcode starts to look less random, and the number of collisions increases)
A Vector doubles its capacity and starts at 10. This means it will fail to grow above approx 1.34 billion. Changing the initial size to 2^n-1 gives you slightly more head room.
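The 1.34 billion figure follows from repeated doubling, which the sketch below reproduces. It models only the simplified reasoning in this answer (double until the next step would exceed Integer.MAX_VALUE); an actual Vector may treat the last growth step differently, and the class name is made up.

```java
class DoublingLimit {
    /** Largest capacity reachable from the initial one by doubling without exceeding int range. */
    static long largestCapacityByDoubling(long initial) {
        long capacity = initial;
        while (capacity * 2 <= Integer.MAX_VALUE) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(largestCapacityByDoubling(10));   // default initial capacity
        System.out.println(largestCapacityByDoubling(1023)); // an initial capacity of the form 2^n - 1
    }
}
```

Starting at 10 the sequence stops at 10 * 2^27 = 1,342,177,280, while starting at 1023 it reaches 2,145,386,496, which is where the extra head room comes from.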
BTW: Use ArrayList rather than Vector if you can.
A LinkedList has no inherent limit and can grow beyond 2.1 billion. At this point size() could return Integer.MAX_VALUE, however some functions such as toArray will fail as it cannot put all objects into an array; it will instead give you the first Integer.MAX_VALUE rather than throw an exception.
As @Joachim Sauer notes, the current OpenJDK could return an incorrect result for sizes above Integer.MAX_VALUE, e.g. it could be a negative number.
The maximum size depends on the memory settings of the JVM and of course the available system memory. Specific size of memory consumption per list entry also differs between platforms, so the easiest way might be to run simple tests.
As stated in other answers, an array cannot reach 2^31 entries. Other data types are limited either by this or they will likely misreport their size() eventually. However, these theoretical limits cannot be reached on some systems:
On a 32 bit system, the number of bytes available never exceeds 2^32 exactly. And that is assuming that you have no operating system taking up memory. A 32 bit pointer is 4 bytes. Anything which does not rely on arrays must include at least one pointer per entry: this means that the maximum number of entries is 2^32/4 or 2^30 for things that do not utilize arrays.
A plain array can achieve its theoretical limit, but only a byte array: a short array of length 2^31-1 would use up about 2^32+38 bytes.
Some Java VMs have introduced a new memory model that uses compressed pointers. By adjusting pointer alignment, slightly more than 2^32 bytes may be referenced with 32-bit pointers. Around four times more. This is enough to cause a LinkedList size() to become negative, but not enough to allow it to wrap around to zero.
A sixty four bit system has sixty four bit pointers, making all pointers twice as big, making non array lists a bunch fatter. This also means that the maximum capacity supported jumps to 2^64 bytes exactly. This is enough for a 2D array to reach its theoretical maximum. byte[0x7fffffff][0x7fffffff] uses memory approximately equal to 40 + 40*(2^31-1) + (2^31-1)*(2^31-1) = 40 + 40*(2^31-1) + (2^62 - 2^32 + 1) bytes.
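The estimate can be evaluated exactly with BigInteger and compared against 2^64. The 40-byte per-array overhead is the figure used in this answer, not a measured value, and the class name is hypothetical.

```java
import java.math.BigInteger;

class TwoDArrayBytes {
    /** 40 + 40*n + n*n for n = 2^31 - 1, using the answer's 40-byte overhead figure. */
    static BigInteger estimate() {
        BigInteger n = BigInteger.valueOf(Integer.MAX_VALUE);
        BigInteger overhead = BigInteger.valueOf(40);
        return overhead.add(overhead.multiply(n)).add(n.multiply(n));
    }

    /** Number of bytes addressable with 64-bit pointers. */
    static BigInteger addressable64() {
        return BigInteger.ONE.shiftLeft(64);
    }

    public static void main(String[] args) {
        System.out.println("estimate: " + estimate());
        System.out.println("2^64:     " + addressable64());
        System.out.println("fits:     " + (estimate().compareTo(addressable64()) < 0));
    }
}
```

The total is a little over 2^62 bytes, so it fits within a 64-bit address space with room to spare, at least on paper.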