In Java, an array can have at most Integer.MAX_VALUE items, as it uses an int as the array index.
What is the best Object to use when I want to use a long as an index?
For example, if I want to calculate all prime numbers below 5 billion using a prime sieve, I cannot use an array as 5000000000 is too large to store in an integer.
A sieve of 5000000000 elements does not need an array of 5,000,000,000 values; it needs 5,000,000,000 bits. Unfortunately, BitSet uses an int index as well, but you can implement your own bit set by allocating 5000000000 / 32 integers and then using bit operations to access the corresponding bit (see the sketch after this list):
Use long as the actual position pos
The location of the int inside the int array is (int)(pos / 32)
The location of the bit inside the int is (int)(pos % 32)
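A minimal sketch of such a long-indexed bit set, following the layout above (the class name LongBitSet is just for illustration):

// Backing storage: one bit per candidate number, 32 bits per int.
class LongBitSet {
    private final int[] words;

    LongBitSet(long nBits) {
        words = new int[(int) ((nBits + 31) / 32)];   // 5,000,000,000 bits -> 156,250,000 ints
    }

    void set(long pos) {
        words[(int) (pos / 32)] |= 1 << (int) (pos % 32);
    }

    boolean get(long pos) {
        return (words[(int) (pos / 32)] & (1 << (int) (pos % 32))) != 0;
    }
}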
Another approach would be to switch to a segmented sieve, which reduces the memory requirement to √N. A good explanation of how that works is given here.
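For illustration, a minimal sketch of a segmented sieve, assuming you only want to count the primes below the limit (the method name countPrimes is made up for the example); it keeps just the primes up to √N plus one segment of √N flags in memory at a time:

import java.util.Arrays;

static long countPrimes(long limit) {
    int segSize = (int) Math.sqrt(limit) + 1;
    // 1) Small primes up to sqrt(limit) with a plain sieve.
    boolean[] composite = new boolean[segSize + 1];
    int[] smallPrimes = new int[segSize];
    int nSmall = 0;
    for (int i = 2; i <= segSize; i++) {
        if (!composite[i]) {
            smallPrimes[nSmall++] = i;
            for (long j = (long) i * i; j <= segSize; j += i)
                composite[(int) j] = true;
        }
    }
    // 2) Sieve [low, low + segSize) one segment at a time.
    long count = 0;
    boolean[] seg = new boolean[segSize];
    for (long low = 2; low < limit; low += segSize) {
        Arrays.fill(seg, false);
        long high = Math.min(low + segSize, limit);
        for (int k = 0; k < nSmall; k++) {
            long p = smallPrimes[k];
            if (p * p >= high) break;
            // First multiple of p in [low, high) that is >= p*p.
            long start = Math.max(p * p, ((low + p - 1) / p) * p);
            for (long j = start; j < high; j += p)
                seg[(int) (j - low)] = true;
        }
        for (long j = low; j < high; j++)
            if (!seg[(int) (j - low)]) count++;
    }
    return count;
}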
According to the BitSet implementation, it internally uses an array of longs:
/**
* The internal field corresponding to the serialField "bits".
*/
private long[] words;
But for the set method it uses int:
public void set(int bitIndex) {...}
So basically we can store (2^31 - 1) * 64 = 2,147,483,647 * 64 = 137,438,953,408 bits, but using int indexing we have access only to the first 2^31 = 2,147,483,648 bits.
Which means that 137,438,953,408 - 2,147,483,648 = 135,291,469,760 bits are unavailable.
But if the developers of this class had used long instead of int for bit indexing, it would solve all these problems, since with long we could address 2^63 - 1 = 9,223,372,036,854,775,807 bits.
The restriction does not make sense even from a performance point of view.
What is the reasoning behind using int instead of long for indexing and giving up billions of bits?
P.S. One could say that the problem is a 2 GiB heap size limit, but today that is not an issue anymore.
The documentation of java.util.BitSet states:
The bits of a BitSet are indexed by nonnegative integers.
This is what it is supposed to do, so no long indexes needed.
That its internal data structure could support more than 2^31 individual bits is an implementation detail that has no relevance for the public interface of the class (they could have used a boolean[] array and the class would still work, albeit with a bigger memory footprint and more runtime for some methods).
The question remains: will the public interface of this class change to support long indexes?
This is highly unlikely, because supporting long indexes would mean that methods like
int cardinality()
int nextClearBit(int fromIndex) (and similar methods: next/previous clear/set bit)
int size()
IntStream stream()
would also need to be changed, which would break existing code.
The only way I can think of to get a BitSet-like class with long indexes would be an additional class BigBitSet (or LongBitSet or whatever you like), so that people needing bit sets with more than 2^31 bits could switch to that new class.
Whether such a class would ever be added to the java.util package is another question - for that you would have to convince the JCP executive board that this is an important addition / a gaping hole in the current Java ecosystem.
Each chunk of 64 bits is packed into a long, not one long per bit index, so the long[] words array will use up to 268,435,456 bytes when you call set(2147483647) with the maximum int index, or just one long if you only call bitset.set(1). Example in jshell:
BitSet b = new BitSet();
b.size();
==> 64 (ie words is length 1 can store 64 bits)
b.set(1);
b.size();
==> 64 (ie words is still length 1)
b.set(64);
b.size();
==> 128 (ie words array is length 2, can store up to 128 bits)
Usually you use bit sets to index into something else. Let’s say you use this bitset to index into an array.
BitSet b = new BitSet();
b.set(2147483647);
ArrayList<X> items = new ArrayList<X>();
// ...add a looot of elements to the ArrayList...
// then:
X item = items.get(b.nextSetBit(0));
To make this work, the array list would have to contain 2,147,483,648 elements, which would use at least 2 GB of RAM even if each element needed only 1 byte of storage, and would almost certainly exhaust the heap with an OutOfMemoryError.
I have arrays containing random unique numbers from 0 to Integer.MAX_VALUE.
How can I generate a unique id/signature (int) to identify each array uniquely, rather than searching through each array and checking each number?
e.g
int[] x = {2,4,8,1,88,12....};
int[] y = {123,456,64,87,1,12...};
int[] z = {2,4,8,1...};
int[] xx = {213,3534,778,1,2,234....};
..................
..................
and so on.
Each array can have a different length; the numbers are not repeated within an array but can be repeated in other arrays. The purpose of a unique id for each array is to identify it through the id so that searching can be made fast. The arrays contain ids of components, and the unique signature/id for the array will identify the components contained in it.
Also, the generated id should be the same regardless of the order of the values in the array. Like {1,5} and {5,1} should generate the same id.
I have looked up different number pairing methods, but the resulting number grows as the length of the array increases, to the point where it can no longer fit in an int.
The IDs assigned to components can be adjusted; they don't have to be a sequence of integers as long as a good range of numbers is available. The only requirement is that once ids are generated for the arrays (collections of component ids) they should not collide, and they can be regenerated at runtime if the collection in an array changes.
This can be approximately solved with a hash function h() with an order-normalization function (such as sort()). A hash function is lossy, since the number of unique hashes (2^32 or 2^64) is smaller than the number of possible variable length sets of integers, resulting in a small chance of two distinct sets having the same ID (hash collision). Typically this won't be a problem if
you use a good hash function, and
your data set is not ridiculously large.
The order normalization function would ensure that sets {x, y} and {y, x} are hashed to the same value.
For the hash function you have many options, but choose a hash that minimizes collision probability, such as a cryptographic hash (SHA-256, MD5) or if you need bleeding edge performance use MurmurHash3 or other hash du jour. MurmurHash3 can produce an integer as output, while the cryptographic hashes require an extra step of extracting 4 or 8 bytes from the binary output and unpacking to an integer. (Use any consistent selection of bytes such as first or last.)
In pseudocode:
int getId(setOfInts) {
    intList = convert setOfInts to integer list
    sortedIntList = sort(intList)
    ilBytes = cast sortedIntList to byte array
    hashdigest = hash(ilBytes)
    leadingBytes = extract 4 or 8 leading bytes of hashdigest
    idInt = cast leadingBytes to integer
    return idInt
}
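For illustration, a runnable Java version of the same idea, assuming SHA-256 as the hash and taking the leading 4 bytes of the digest as the id:

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.Arrays;

static int getId(int[] setOfInts) throws Exception {
    int[] sorted = setOfInts.clone();
    Arrays.sort(sorted);                                  // order normalization
    ByteBuffer bytes = ByteBuffer.allocate(sorted.length * Integer.BYTES);
    for (int v : sorted) bytes.putInt(v);                 // int list -> byte array
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(bytes.array());
    return ByteBuffer.wrap(digest).getInt();              // leading 4 bytes as the id
}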
Strictly speaking, what you ask for is not possible: even with arrays of just two elements, there are many more possible arrays (about 2^61 after ignoring order) than possible signatures (2^32). And your arrays aren't limited to two elements, so your situation is exponentially worse.
However, if you can accept a low rate of duplicates and false matches, a simple approach is to just add together all of the elements with the + operator (which essentially computes the sum modulo 2^32). This is the approach taken by java.util.Set<Integer>'s hashCode() method. It doesn't completely eliminate the need to compare whole arrays (because you'll need to detect false matches), but it will radically reduce the number of such comparisons (because very few arrays will match any given array).
You want {1, 5} and {5, 1} to have the same ID. That rules out standard hash functions, which will give different results in that situation. One option is to sort the array before hashing. Be aware that cryptographic hashes are slow; you may find that a non-crypto hash like FNV is sufficient. It will certainly be faster.
To avoid sorting, simply add all the numbers mod 2^32 or mod 2^64, as @ruakh suggests, and accept that you will have a proportion of collisions. Adding in the array length will avoid some collisions: {5, 1} will not match {1, 2, 3} in that case, as (2+(5+1)) != (3+(1+2+3)). You may want to test with your real data to see if this gives enough of an advantage.
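A minimal sketch of that sum-plus-length signature (the method name is made up for the example):

// Order-independent signature: array length plus the sum of the elements.
// int arithmetic wraps around silently, which gives the "mod 2^32" for free.
static int signature(int[] values) {
    int sum = values.length;
    for (int v : values) sum += v;
    return sum;
}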
I am analyzing the source code of HashMap in JDK 7, and I found that when we invoke the put() method to add elements, it uses indexFor() to calculate the index at which to store the element in the array. The method is listed below.
Now I am wondering why it uses h & (length-1) to get the index. Is it used to get a more random array index? Can we use length or length-2 (if it exists) instead?
Can anyone help me to understand this? Thanks in advance!
static int indexFor(int h, int length) {
    // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
    return h & (length-1);
}
Is it used for to get more random array index?
No
Can we use length or length-2(if exists) instead?
No
This is just a mathematical trick. To map your hash to the range [0, L) you calculate hash % L. Division can be expensive, so the developers chose to take advantage of how numbers are stored in binary. This works only for powers of two: with a single bit set, subtracting one sets all less-significant bits and clears the original bit. Binary-AND-ing this mask with the hash gives the same result as calculating the modulo; every bit more significant than we can handle is simply dropped.
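For example, a small illustration assuming a table length of 16:

int length = 16;                 // a power of two: binary 10000
int mask = length - 1;           // binary 01111
int h = 1234567;
System.out.println(h % length);  // 7
System.out.println(h & mask);    // 7, same result without a division

As a side note, for a negative h the expression h % length could be negative, while h & (length - 1) always yields a valid non-negative index.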
Using Java, I have to create a program that takes an ordered set of numbers and returns the length of the longest consecutive number sequence. For example, for the set (1,18,12,6,8,7,13,2,3,4,9,10) the method should return 5 because the longest consecutive sequence is (6,7,8,9,10).
It should be as efficient as possible, and I can't use hash maps. For plain iteration I'm guessing the best option will be sorting the array (n log n) and then running through the array once more (n)?
If you have such a large input that an O(n log n) algorithm would be too slow and you want an algorithm without using hashmap, you could use radix sort and still get the same O(n) performance.
Radix Sort: http://en.wikipedia.org/wiki/Radix_sort
Basically it sorts the input by applying bucket sort on the lowest k bits (I usually use 4 or 8), then on the next-lowest k bits, and so on, until all of the bits have been sorted on.
The code would be like below (sorry, I'm not so familiar with Java, so it may contain some mistakes, but I hope you can get what I mean):
// Needs java.util.ArrayList and java.util.List imported.
static final int RADIX_POW2 = 4; // you could also use 8 if you want it
                                 // twice as fast and 16 times as space-taking.
static final int RADIX = 1 << RADIX_POW2;

// One bucket-sort pass over RADIX_POW2 bits of each value, starting at bit `shift`.
static void radix_sort_part(int[] input, List<Integer>[] buckets, int shift) {
    for (int x : input) buckets[(x >> shift) & (RADIX - 1)].add(x);
    int count = 0;
    for (List<Integer> bucket : buckets) {
        for (int x : bucket)
            input[count++] = x;
        bucket.clear();
    }
}

static void radix_sort_full(int[] input) {
    @SuppressWarnings("unchecked")
    List<Integer>[] buckets = new List[RADIX];
    for (int i = 0; i < RADIX; i++)
        buckets[i] = new ArrayList<>();
    // I'm performing radix sorts on the full 32 bits (assuming non-negative
    // inputs), but if the range of your inputs is smaller, you only need to
    // perform it on that range.
    for (int i = 0; i < Integer.SIZE / RADIX_POW2; i++)
        radix_sort_part(input, buckets, i * RADIX_POW2);
}

static int find_max_consecutive(int[] input) {
    radix_sort_full(input);
    int maxconsecutive = 1;
    int currentconsecutive = 1;
    for (int i = 1; i < input.length; i++) {
        if (input[i] == input[i - 1] + 1) currentconsecutive++;
        else currentconsecutive = 1;   // the run is broken, start counting again
        if (currentconsecutive > maxconsecutive) maxconsecutive = currentconsecutive;
    }
    return maxconsecutive;
}
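For example, with the sample input from the question:

int[] data = {1, 18, 12, 6, 8, 7, 13, 2, 3, 4, 9, 10};
System.out.println(find_max_consecutive(data)); // prints 5, for the run 6,7,8,9,10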
However, I think this algorithm is slow if you have many relatively small inputs and you need to solve this problem again and again.
And for a large input, this algorithm could be as memory-consuming as a hashmap and not as fast. So if I were asked to choose, I would rather use a hashmap.
EDIT:
I forgot to mention that radix sort takes time that is proportional to how many times it has to perform bucket sort, that is, ( (number of bits of the integers)/RADIX_POW2 ).
So the exact time complexity of this algorithm is O(dn), where d is (number of bits of the integers)/RADIX_POW2.
This means that if you want to use the algorithm on long numbers, it takes twice as long, and if you want to use it on BigInteger or String or something like that, it would take time proportional to how big the integers (strings) are.
How do you return the number of distinct/unique values in an array? For example:
int[] a = {1,2,2,4,5,5};
Set<Integer> s = new HashSet<Integer>();
for (int i : a) s.add(i);
int distinctCount = s.size();
A set stores each unique (as defined by .equals()) element in it only once, and you can use this to simplify the problem. Create a Set (I'd use a HashSet), iterate your array, adding each integer to the Set, then return .size() of the Set.
An efficient method: sort the array with Arrays.sort, then write a simple loop that walks the sorted array and counts the positions where a value differs from its predecessor.
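A minimal sketch of that approach, using the sample array from the question:

int[] a = {1, 2, 2, 4, 5, 5};
java.util.Arrays.sort(a);
int distinct = a.length == 0 ? 0 : 1;
for (int i = 1; i < a.length; i++)
    if (a[i] != a[i - 1]) distinct++;   // a new value starts here
// distinct == 4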
It really depends on the number of elements in the array. If you're not dealing with a large amount of integers, a HashSet or a binary tree would probably be the best approach. On the other hand, if you have a large array of diverse integers (say, more than a billion) it might make sense to allocate a 2^32 / 8 bytes = 512 MiB array in which each bit represents the existence or non-existence of an integer, and then count the number of set bits in the end.
A binary tree approach would take n * log n time, while the bit-array approach takes n time. Also, a binary tree requires two pointers per node, so its memory usage would be a lot higher. Similar considerations apply to hash tables.
Of course, if your set is small, then just use the inbuilt HashSet.
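A minimal sketch of the bit-array counting approach described above, assuming the input array is called a (negative values are handled by treating the 32 bits as an unsigned index):

// One bit per possible int value: 2^26 longs * 64 bits = 2^32 bits = 512 MiB.
long[] seen = new long[1 << 26];
for (int v : a) {
    long idx = v & 0xFFFFFFFFL;                  // the 32 bits as an unsigned index
    seen[(int) (idx >>> 6)] |= 1L << (idx & 63);
}
long distinct = 0;
for (long word : seen)
    distinct += Long.bitCount(word);             // count the set bits at the end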