Combination Algorithm from multiple sets

I am trying to write an algorithm that tells me how many pairs I could generate with items coming from multiple sets of values. For example, I have the following sets:
{1,2,3} {4,5} {6}
From these sets I can generate 11 pairs:
{1,4}, {1,5}, {1,6}, {2,4}, {2,5}, {2,6}, {3,4}, {3,5}, {3,6}, {4,6}, {5,6}
I wrote the following algorithm:
int result = 0;
for (int k = 0; k < numberOfSets; k++) { // map is a list that stores the size of each set
    int size1 = map.get(k);
    for (int l = k + 1; l < numberOfSets; l++) {
        int size2 = map.get(l);
        result += size1 * size2;
    }
}
But as you can see, the algorithm does not scale well: as the number of sets increases, it starts performing very poorly.
Am I missing something? Is there an algorithm that can help me with this? I have been looking at combination and permutation algorithms, but I am not sure that's the right path for this.
Thank you very much in advance.

First of all, if the order within a pair matters, then starting the inner loop at int l=k+1 is wrong, because you are missing e.g. {4,1}. If you consider {4,1} equal to {1,4}, then the result is correct; otherwise it isn't.
Second, to complicate matters further, you don't say whether the pairs need to be unique. E.g. {1,2}, {2,3}, {4} will generate {2,4} twice. If it should be counted only once, the result of your code is incorrect (you will need to scan the sets, actually generate the pairs, and keep a Set<Pair<Integer,Integer>> to remove the duplicates).
The good news: although this double loop is O(N^2) in the number of sets, even with thousands of sets the resulting millions of integer multiplications and additions are fast enough on today's computers. E.g. Eigen deals quite well with O(N^3) floating-point multiplications (see matrix multiplication operations).

Assuming you only care about the number of pairs, and are counting duplicates, then there is a more efficient algorithm:
We will keep track of the current number of pairs, and the number of elements which we have encountered so far.
Go over the list from the end to the start.
For each new set, the number of new pairs we can make is the size of the set * the number of elements encountered so far. Add this to the current number of pairs.
Add the size of the new set to the number of elements which we have encountered so far.
The code:
int numberOfPairs = 0;
int elementsEncountered = 0;
for (int k = numberOfSets - 1; k >= 0; k--) {
    int sizeOfCurrentSet = map.get(k);
    int numberOfNewPairs = sizeOfCurrentSet * elementsEncountered;
    numberOfPairs += numberOfNewPairs;
    elementsEncountered += sizeOfCurrentSet;
}
The key point to realize is that when we count the number of new pairs that each set contributes, it doesn't matter from which set we select the second element of the pair. That is, we don't need to keep track of any set which we have already analyzed.
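As a rough, self-contained sketch of the one-pass idea (the class and method names are made up for illustration, and the set sizes are passed as a plain int[] rather than the map list from the question), the version below also includes a cross-check based on the identity: sum over i<j of s_i*s_j = ((sum of s_i)^2 - sum of s_i^2) / 2.

```java
final class PairCounter {
    // One pass: every element of the current set pairs with every element seen so far.
    static long countPairs(int[] setSizes) {
        long pairs = 0;
        long seen = 0;
        for (int size : setSizes) {
            pairs += size * seen;
            seen += size;
        }
        return pairs;
    }

    // Cross-check: ((sum of sizes)^2 - sum of squared sizes) / 2.
    static long countPairsClosedForm(int[] setSizes) {
        long total = 0;
        long squares = 0;
        for (int size : setSizes) {
            total += size;
            squares += (long) size * size;
        }
        return (total * total - squares) / 2;
    }
}
```

For the sets in the question, PairCounter.countPairs(new int[]{3, 2, 1}) gives 11, the same as the double loop. Using long is only a precaution against overflow with many large sets.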

Related

Find the only unique element in an array of a million elements

I was asked this question in a recent interview.
You are given an array that has a million elements. All the elements are duplicates except one. My task is to find the unique element.
var arr = [3, 4, 3, 2, 2, 6, 7, 2, 3........]
My approach was to go through the entire array in a for loop, and then create a map with index as the number in the array and the value as the frequency of the number occurring in the array. Then loop through our map again and return the index that has value of 1.
I said my approach would take O(n) time complexity. The interviewer told me to optimize it in less than O(n) complexity. I said that we cannot, as we have to go through the entire array with a million elements.
Finally, he didn't seem satisfied and moved onto the next question.
I understand that going through a million elements in the array is expensive, but how could we find a unique element without doing a linear scan of the entire array?
PS: the array is not sorted.

I'm certain that you can't solve this problem without going through the whole array, at least if you don't have any additional information (like the elements being sorted and restricted to certain values), so the problem has a minimum time complexity of O(n). You can, however, reduce the memory complexity to O(1) with an XOR-based solution, provided every other element appears in the array an even number of times (which seems to be the most common variant of the problem), if that's of any interest to you:
int unique(int[] array)
{
    int unpaired = array[0];
    for (int i = 1; i < array.length; i++)
        unpaired = unpaired ^ array[i];
    return unpaired;
}
Basically, every XORed element cancels out with the other one, so your result is the only element that didn't cancel out.

Assuming the array is unordered, you can't. Every value is independent of the next, so nothing can be deduced about a value from any of the other values.
If it's an ordered array of values, then that's another matter and depends entirely on the ordering used.
I agree the easiest way is to have another container and store the frequency of the values.

In fact, since the number of elements in the array was fixed, you could do much better than what you have proposed.
By "creating a map with index as the number in the array and the value as the frequency of the number occurring in the array", you create a map with 2^32 positions (assuming the array had 32-bit integers), and then you have to pass through that map to find the first position whose value is one. It means that you are using a large auxiliary space and in the worst case you are doing about 10^6+2^32 operations (one million to create the map and 2^32 to find the element).
Instead of doing so, you could sort the array with some n*log(n) algorithm and then search for the element in the sorted array, because in your case, n = 10^6.
For instance, using merge sort, you would use a much smaller auxiliary space (just an array of 10^6 integers) and would do about (10^6)*log2(10^6)+10^6 operations to sort and then find the element, which is approximately 21*10^6 (many, many times smaller than 10^6+2^32).
PS: sorting the array decreases the search from a quadratic to a linear cost, because with a sorted array we just have to access the adjacent positions to check if a current position is unique or not.
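A minimal sketch of the sort-then-scan idea, under a few assumptions: the class and method names are invented, the input is copied so the caller's array is left alone, and Arrays.sort is used for convenience (for int[] it is not a merge sort, but the O(n log n) reasoning is the same).

```java
import java.util.Arrays;

final class UniqueBySorting {
    // Returns the value that occurs exactly once; throws if there is none.
    static int findUnique(int[] input) {
        int[] sorted = input.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            // In a sorted array, duplicates are always neighbours.
            boolean sameAsPrevious = i > 0 && sorted[i] == sorted[i - 1];
            boolean sameAsNext = i < sorted.length - 1 && sorted[i] == sorted[i + 1];
            if (!sameAsPrevious && !sameAsNext) {
                return sorted[i];
            }
        }
        throw new IllegalArgumentException("no unique element");
    }
}
```

Unlike the XOR trick, this does not care how many times the duplicated values repeat.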

Your approach seems fine. It could be that he was looking for an edge case where the array is of even size, meaning there are either no unmatched elements or two or more of them. He just went about asking it the wrong way.

Determining the element that occurred the most in O(n) time and O(1) space

Let me start off by saying that this is not a homework question. I am trying to design a cache whose eviction policy depends on entries that occurred the most in the cache. In software terms, assume we have an array with different elements and we just want to find the element that occurred the most. For example: {1,2,2,5,7,3,2,3} should return 2. Since I am working with hardware, the naive O(n^2) solution would require a tremendous hardware overhead. The smarter solution of using a hash table works well for software because the hash table size can change but in hardware, I will have a fixed size hash table, probably not that big, so collisions will lead to wrong decisions. My question is, in software, can we solve the above problem in O(n) time complexity and O(1) space?

There can't be an O(n) time, O(1) space solution, at least not for the generic case.
As amit points out, by solving this, we find the solution to the element distinctness problem (determining whether all the elements of a list are distinct), which has been proven to take Θ(n log n) time when not using elements to index the computer's memory. If we were to use elements to index the computer's memory, given an unbounded range of values, this requires at least Θ(n) space. Given the reduction of that problem to this one, the bounds for that problem enforce identical bounds on this problem.
However, practically speaking, the range would mostly be bounded, if for no other reason than the type one typically uses to store each element in has a fixed size (e.g. a 32-bit integer). If this is the case, this would allow for an O(n) time, O(1) space solution, albeit possibly too slow and using too much space due to the large constant factors involved (as the time and space complexity would depend on the range of values).
2 options:
Counting sort
Keeping an array of the number of occurrences of each element (the array index being the element), outputting the most frequent.
If you have a bounded range of values, this approach would be O(1) space (and O(n) time). But technically so would the hash table approach, so the constant factors here are presumably too large for this to be acceptable.
Related options are radix sort (has an in-place variant, similar to quicksort) and bucket sort.
Quicksort
Repeatedly partitioning the data based on a selected pivot (through swapping) and recursing on the partitions.
After sorting we can just iterate through the array, keeping track of the maximum number of consecutive elements.
This would take O(n log n) time and O(1) space.

As you say, the maximum element in your cache may be a very big number, but the following is one possible solution. It assumes every element is a non-negative integer that is also a valid index into the array.
Iterate over the array.
Let's say the maximum element that the array holds is m, and let k = m + 1.
For each index i, get the original element it holds, which is array[i] % k.
Now go to that index and add k to the value stored there.
Do the above for all the indexes in the array.
Finally, iterate over the array and return the index with the maximum element.
TC -> O(N)
SC -> O(1)
It may not be feasible for large m, as in your case. But see if you can optimize or alter this algorithm.
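A sketch of this in-place counting trick, with some assumptions of my own spelled out: every value must lie in the range 0 .. arr.length-1, the increment used is k = arr.length (any k larger than the maximum value works), and arr.length * arr.length must fit in an int. The names are illustrative.

```java
final class InPlaceCounter {
    // Assumes 0 <= arr[i] < arr.length for every i, and a non-empty array.
    static int mostFrequent(int[] arr) {
        int k = arr.length;
        // arr[i] % k is always the original value, even after increments.
        for (int i = 0; i < arr.length; i++) {
            arr[arr[i] % k] += k;
        }
        // arr[j] / k is now the number of times j occurred.
        int best = 0;
        for (int j = 1; j < arr.length; j++) {
            if (arr[j] / k > arr[best] / k) {
                best = j;
            }
        }
        // Undo the increments so the caller gets the original data back.
        for (int i = 0; i < arr.length; i++) {
            arr[i] %= k;
        }
        return best;
    }
}
```

Strictly speaking the counts are hidden in the spare bits of the array itself, so the O(1) extra space only holds while those bits are available.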

A solution off the top of my head:
As the numbers can be large, I consider hashing instead of storing them directly in an array.
Let there be n numbers, input[0] to input[n-1].
Suppose the number occurring the maximum number of times occurs K times.
Let us create n/K buckets, initially all empty.
hash(num) tells whether num is present in any of the buckets.
hash_2(num) stores the number of times num is present in any of the buckets.
for (i = 0 to n-1)
    if input[i] is already present in one of the buckets, increase its count, something like hash_2(input[i])++
    else if there is an empty bucket, insert input[i] into the first empty bucket with count 1 and set hash(input[i]) = true
    else (all buckets are full), decrease the count of all numbers in the buckets by 1, and don't add input[i] to any of the buckets.
    If the count of any number becomes zero [see hash_2(number)], set hash(number) = false.
This way, you finally get at most n/K elements, and the required number is one of them, so you need to traverse the input again, O(N), to find the actual number.
The space used is O(n/K) and the time complexity is O(N), assuming the hash operations are O(1).
So the performance really depends on K. If K << n, this method performs poorly.
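The bucket scheme above looks essentially like the Misra-Gries frequent-items algorithm. Here is a hedged sketch that uses a single HashMap in place of the two hash functions; the class name, the method name and the buckets parameter are my own, and the caller has to pick the number of buckets (roughly n/K).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class FrequentCandidates {
    static int mostFrequent(int[] input, int buckets) {
        if (input.length == 0) throw new IllegalArgumentException("empty input");
        // Pass 1: keep at most `buckets` candidates with approximate counts.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : input) {
            if (counts.containsKey(x)) {
                counts.merge(x, 1, Integer::sum);
            } else if (counts.size() < buckets) {
                counts.put(x, 1);
            } else {
                // All buckets full: decrement everyone, drop those reaching zero.
                Iterator<Map.Entry<Integer, Integer>> it = counts.entrySet().iterator();
                while (it.hasNext()) {
                    Map.Entry<Integer, Integer> e = it.next();
                    if (e.getValue() == 1) it.remove();
                    else e.setValue(e.getValue() - 1);
                }
            }
        }
        // Pass 2: count the surviving candidates exactly.
        Map<Integer, Integer> exact = new HashMap<>();
        for (int candidate : counts.keySet()) exact.put(candidate, 0);
        for (int x : input) {
            if (exact.containsKey(x)) exact.merge(x, 1, Integer::sum);
        }
        int best = input[0];
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : exact.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With too few buckets the true winner can be evicted in the first pass, so the result is only reliable when the most frequent value occurs more than n / (buckets + 1) times.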

I don't think this answers the question as stated in the title, but actually you can implement a cache with the Least-Frequently-Used eviction policy having constant average time for put, get and remove operations. If you maintain your data structure properly, there's no need to scan all items in order to find the item to evict.
The idea is having a hash table which maps keys to value records. A value record contains the value itself plus a reference to a "counter node". A counter node is a part of a doubly linked list, and consists of:
An access counter
The set of keys having this access count (as a hash set)
next pointer
prev pointer
The list is maintained such that it's always sorted by the access counter (where the head is min), and the counter values are unique. A node with access counter C contains all keys having this access count. Note that this doesn't increase the overall space complexity of the data structure.
A get(K) operation involves promoting K by migrating it to another counter node (either a new one or the next one in the list).
An eviction operation triggered by a put operation roughly consists of checking the head of the list, removing an arbitrary key from its key set, and then removing it from the hash table.
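A compact sketch of such a cache in Java. This is a variant, not the exact structure described above: instead of a doubly linked list of counter nodes it keeps a map from access count to the keys having that count, plus the current minimum count. All names (LfuCache, keysByCount, minCount) are illustrative, and there is no explicit remove operation.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

final class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Integer> counts = new HashMap<>();
    private final Map<Integer, LinkedHashSet<K>> keysByCount = new HashMap<>();
    private int minCount = 0;

    LfuCache(int capacity) { this.capacity = capacity; }

    V get(K key) {
        if (!values.containsKey(key)) return null;
        promote(key);
        return values.get(key);
    }

    void put(K key, V value) {
        if (capacity <= 0) return;
        if (values.containsKey(key)) {
            values.put(key, value);
            promote(key);
            return;
        }
        if (values.size() >= capacity) evict();
        values.put(key, value);
        counts.put(key, 1);
        keysByCount.computeIfAbsent(1, c -> new LinkedHashSet<>()).add(key);
        minCount = 1; // a brand-new key always has the lowest count
    }

    // Move the key from its current count bucket to the next one.
    private void promote(K key) {
        int count = counts.get(key);
        counts.put(key, count + 1);
        LinkedHashSet<K> bucket = keysByCount.get(count);
        bucket.remove(key);
        if (bucket.isEmpty()) {
            keysByCount.remove(count);
            if (minCount == count) minCount = count + 1;
        }
        keysByCount.computeIfAbsent(count + 1, c -> new LinkedHashSet<>()).add(key);
    }

    // Remove one key from the bucket with the lowest access count.
    private void evict() {
        LinkedHashSet<K> bucket = keysByCount.get(minCount);
        K victim = bucket.iterator().next();
        bucket.remove(victim);
        if (bucket.isEmpty()) keysByCount.remove(minCount);
        values.remove(victim);
        counts.remove(victim);
    }
}
```

Every operation touches a constant number of hash-table entries on average, so nothing ever scans all items to pick the one to evict.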

It is possible if we make reasonable (to me, anyway) assumptions about your data set.
As you say you could do it if you could hash, because you can simply count-by-hash. The problem is that you may get non-unique hashes. You mention 20bit numbers, so presumably 2^20 possible values and a desire for a small and fixed amount of working memory for the actual hash counts. This, one presumes, will therefore lead to hash collisions and thus a breakdown of the hashing algorithm. But you can fix this by doing more than one pass with complementary hashing algorithms.
Because these are memory addresses, it's likely not all of the bits are actually going to be capable of being set. For example if you only ever allocate word (4 byte) chunks you can ignore the two least significant bits. I suspect, but don't know, that you're actually only dealing with larger allocation boundaries so it may be even better than this.
Assuming word aligned; that means we have 18 bits to hash.
Next, you presumably have a maximum cache size which is presumably pretty small. I'm going to assume that you're allocating a maximum of <=256 items because then we can use a single byte for the count.
Okay, so to make our hashes we break up the number in the cache into two nine-bit numbers, in order of significance highest to lowest, and discard the last two bits as discussed above. Take the first of these chunks and use it as a hash to give a first-part count. Then we take the second of these chunks and use it as a hash, but this time we only count if the first-part hash matches the one we identified as having the highest count. The value left with the highest second-pass count is then identified as the most frequent one.
This runs in O(n) time and requires a 512 byte hash table for counting. If that's too large a table you could divide into three chunks and use a 64 byte table.
Added later
I've been thinking about this and I've realised it has a failure condition: if the first pass counts two groups as having the same number of elements, it cannot effectively distinguish between them. Oh well

Assumption: all the elements are integers; for other data types we can also achieve this by using hashCode().
We can achieve a time complexity of O(n log n) with O(1) space.
First, sort the array; the time complexity is O(n log n) (we should use an in-place sorting algorithm like quicksort in order to maintain the space complexity).
Use four integer variables: current, which indicates the value we are referring to; count, which indicates the number of occurrences of current; result, which indicates the final result; and resultCount, which indicates the number of occurrences of result.
Iterate from the start to the end of the array data:
int result = 0;
int resultCount = -1;
int current = data[0];
int count = 1;
for (int i = 1; i < data.length; i++) {
    if (data[i] == current) {
        count++;
    } else {
        if (count > resultCount) {
            result = current;
            resultCount = count;
        }
        current = data[i];
        count = 1;
    }
}
if (count > resultCount) {
    result = current;
    resultCount = count;
}
return result;
So, in the end, only 4 variables are used.

Finding pairs of duplicates in a Java ArrayList

I'm looking to find the number of duplicate pairs in a Java ArrayList.
I can work it out on paper but I don't know if there is some form of mathematical formula for working this out easily as I'm trying to avoid nested for loops in my code.
An example using the data set [2,2,3,2,2]:
0:1, 0:3, 0:4, 1:3, 1:4, 3:4. So the answer is six duplicate pairs?

You just need to count how many times each number appears (I would go with a map here) and calculate 2-combinations ( http://en.wikipedia.org/wiki/Combination ) of that count for each number with a count > 1.
So basically you need a method to calculate n!/(k!(n-k)!) with k being 2 and n being the count.
Taking your example [2,2,3,2,2], the number 2 appears 4 times, so the math would go:
4!/(2!(4-2)!) = 24/4 = 6 --> 6 pairs
If you don't want to implement the factorial function, you can use the ArithmeticUtils from Apache Commons, they already have the factorial implemented.

If you want to avoid nested loops (at the expense of having 2 loops), you could:
for each number in the list, find how many times each number is repeated (maybe use a Map with key = number, value = times that number occurred in the List)
for each number in the map, calculate the number of possible combinations based on the times that it occurred (0 or 1 times = no duplicate pairs, 2 or more = n!/(2*(n-2)!) = (n*(n-1))/2 duplicate pairs)
sum all the possible combinations
Doing a sort like ElKamina suggests would allow for some optimization on this method.
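The two-loop approach could look roughly like this in plain Java (the class and method names are invented for illustration). It uses n*(n-1)/2 directly, so no factorial is needed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicatePairs {
    static long count(List<Integer> values) {
        // Loop 1: how many times does each number occur?
        Map<Integer, Integer> occurrences = new HashMap<>();
        for (Integer value : values) {
            occurrences.merge(value, 1, Integer::sum);
        }
        // Loop 2: a number occurring n times contributes n*(n-1)/2 pairs
        // (which is 0 when n is 1).
        long pairs = 0;
        for (int n : occurrences.values()) {
            pairs += (long) n * (n - 1) / 2;
        }
        return pairs;
    }
}
```

For the data set [2,2,3,2,2] this returns 6, matching the pairs listed in the question.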

Sort the numbers first. Then, if there are k copies of a given number, there will be k*(k-1)/2 pairs from that number. Now sum it over all the numbers.

Using Guava, if your elements were Strings:
Multiset<String> multiset = HashMultiset.create(list);
int pairs = 0;
for (Multiset.Entry<String> entry : multiset.entrySet()) {
    pairs += IntMath.binomial(entry.getCount(), 2);
}
return pairs;
That uses Guava's Multiset and math utilities.

Find all differences in an array in O(n)

Question: Given a sorted array A, find all possible differences of elements from A.
My solution:
for (int i=0; i<n-1; ++i) {
    for (int j=i+1; j<n; ++j) {
        System.out.println(Math.abs(a[i]-a[j]));
    }
}
Sure, it's O(n^2), but I don't over count things at all. I looked online and I found this: http://www.careercup.com/question?id=9111881. It says you can't do better, but at an interview I was told you can do O(n). Which is right?

A first thought is that you aren't using the fact that the array is sorted. Let's assume it's in increasing order (decreasing can be handled analogously).
We can also use the fact that the differences telescope (i>j):
a_i - a_j = (a_i - a_(i-1)) + (a_(i-1) - a_(i-2)) + ... + (a_(j+1) - a_j)
Now build a new sequence, call it s, that has the simple difference, meaning (a_i - a_(i-1)). This takes only one pass (O(n)) to do, and you may as well skip over repeats, meaning skip a_i if a_i = a_(i+1).
All possible differences a_i - a_j with i>j are of the form s_(j+1) + s_(j+2) + ... + s_i, where s_i = a_i - a_(i-1). So maybe if you count that as having found them, then you did it in O(n) time. To print them, however, may take as many as n(n-1)/2 calls, and that's definitely O(n^2).

For example, for an array with the elements {2^1, 2^2, ..., 2^n} there are n⋅(n-1)/2 possible differences, and no two of them are equal. So there are O(n^2) differences.
Since you have to enumerate all of them, you also need at least O(n^2) time.

sorted or unsorted doesn't matter, if you have to calculate each difference there is no way to do it in less than n^2,
the question was asked wrong, or you just do O(n) and then print 42 the other N times :D

You can get another counter-example by assuming the array contents are random integers before sorting. Then the chance that two differences, Ai - Aj vs Ak - Al, or even Ai - Aj vs Aj - Ak, are the same is too small for there to be only O(n) distinct differences Ai - Aj.
Given that, the question to your interviewer is to explain the special circumstances that allow an O(n) solution. One possibility is that the array values are all numbers in the range 0..n, because in this case the maximum absolute difference is only n.
I can do this in O(n lg n) but not O(n). Represent the array contents by an array of size n+1 with element i set to 1 where there is a value i in the array. Then use FFT to convolve the array with its reverse (i.e. compute its autocorrelation): there is a difference Ai - Aj = k where the element of the result corresponding to a shift of k is non-zero.

If the interviewer is fond of theoretical games, perhaps he was thinking of using a table of inputs and results? Any problem with a limit on the size of the input, and that has a known solution, can be solved by table lookup. Given that you have first created and stored that table, which might be large.
So if the array size is limited, the problem can be solved by table lookup, which (given some assumptions) can even be done in constant time. Granted, even for a maximum array size of two (assuming 32-bit integers) the table will not fit in a normal computer's memory, or on the disks. For larger max sizes of the array, you're into "won't fit in the known universe" size. But, theoretically, it can be done.
(But in reality, I think that Jens Gustedt's comment is more likely.)

Yes, you can surely do that; it's a slightly tricky method.
To find the differences in O(n) (counting each bitset operation as one step) you will need to use a bitset (std::bitset in C++) or a similar data structure in your language.
Initialize two bitsets, say A and B.
You can do as follows:
For each iteration through the array:
1--store the consecutive difference in bitset A
2--left-shift B by the consecutive difference
3--store the consecutive difference in bitset B
4--take A = A or B
For example, here is the code.
Here N is the size of the array.
for (int i = 1; i < N; i++) {
    int diff = arr[i] - arr[i-1];
    A[diff] = 1;
    B <<= diff;
    B[diff] = 1;
    A = A | B;
}
The bits in A which are 1 will be the differences.
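A Java rendering of the same idea, as a sketch. java.util.BitSet has no shift operation, so this uses BigInteger as the bit container instead; that substitution and the names are my own. Each shift or OR costs time proportional to the largest difference, so this is only "O(n)" if those operations are treated as constant.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class DifferenceBits {
    // Expects the array sorted in increasing order.
    static List<Integer> differences(int[] sorted) {
        BigInteger all = BigInteger.ZERO;        // plays the role of A
        BigInteger endingHere = BigInteger.ZERO; // plays the role of B
        for (int i = 1; i < sorted.length; i++) {
            int diff = sorted[i] - sorted[i - 1];
            // Every difference ending at i-1 grows by diff; diff itself is new.
            endingHere = endingHere.shiftLeft(diff).setBit(diff);
            all = all.or(endingHere);
        }
        List<Integer> result = new ArrayList<>();
        for (int d = 0; d < all.bitLength(); d++) {
            if (all.testBit(d)) result.add(d);
        }
        return result;
    }
}
```

Each distinct difference is reported once, in increasing order, so repeated differences are not counted.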

First of all, the array needs to be sorted.
Let's take a sorted array ar = {1,2,3,4}.
This is what we were doing in the O(n^2) version:
for(int i=0; i<n; i++)
    for(int j=i+1; j<n; j++) sum+=abs(ar[i]-ar[j]);
If we write out the operations in detail, they look like this:
when i = 0 | sum = sum + {(2-1)+(3-1)+(4-1)}
when i = 1 | sum = sum + {(3-2)+(4-2)}
when i = 2 | sum = sum + {(4-3)}
If we write them all out:
sum = (-1-1-1) + (2-2-2) + (3+3-3) + (4+4+4)
We can see that
the number at index 0 is added to the sum 0 times and subtracted from the sum 3 times.
the number at index 1 is added to the sum 1 time and subtracted from the sum 2 times.
the number at index 2 is added to the sum 2 times and subtracted from the sum 1 time.
the number at index 3 is added to the sum 3 times and subtracted from the sum 0 times.
So we can say that
the number at index i will be added to the sum i times
and will be subtracted from the sum (n-i)-1 times.
Then the generalized expression for
each element will be
sum = sum + (i*ar[i]) - ((n-i)-1)*ar[i];
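Put together as a small Java method (names are illustrative; note that this computes the sum of all pairwise differences, not the list of differences, and it expects the array to be sorted in increasing order):

```java
final class DifferenceSum {
    // Sum of ar[j] - ar[i] over all pairs i < j, for a sorted array.
    static long sumOfDifferences(int[] ar) {
        int n = ar.length;
        long sum = 0;
        for (int i = 0; i < n; i++) {
            // ar[i] is added i times and subtracted (n - i - 1) times.
            sum += (long) i * ar[i] - (long) (n - i - 1) * ar[i];
        }
        return sum;
    }
}
```

For ar = {1,2,3,4} the four terms are -3, -2, 3 and 12, giving 10, the same as the nested loop.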

Java ( Counting Distinct Integers )

How do you return the number of distinct/unique values in an array? For example:
int[] a = {1,2,2,4,5,5};

Set<Integer> s = new HashSet<Integer>();
for (int i : a) s.add(i);
int distinctCount = s.size();

A set stores each unique (as defined by .equals()) element in it only once, and you can use this to simplify the problem. Create a Set (I'd use a HashSet), iterate your array, adding each integer to the Set, then return .size() of the Set.

An efficient method: Sort the array with Arrays.sort. Write a simple loop to count up adjacent equal values.
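A minimal sketch of the sort-and-count approach (the class and method names are made up, and the array is cloned first so the caller's data is not reordered):

```java
import java.util.Arrays;

final class DistinctCounter {
    static int countDistinct(int[] a) {
        if (a.length == 0) {
            return 0;
        }
        int[] sorted = a.clone();
        Arrays.sort(sorted);
        // After sorting, a new distinct value starts wherever neighbours differ.
        int distinct = 1;
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] != sorted[i - 1]) {
                distinct++;
            }
        }
        return distinct;
    }
}
```

For int[] a = {1,2,2,4,5,5} this returns 4. Compared with the HashSet version it avoids boxing each int, at the cost of O(n log n) time.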

It really depends on the number of elements in the array. If you're not dealing with a large number of integers, a HashSet or a binary tree would probably be the best approach. On the other hand, if you have a large array of diverse integers (say, more than a billion) it might make sense to allocate a 2^32 / 8 = 2^29 byte = 512 MByte byte array in which each bit represents the existence or non-existence of an integer and then count the number of set bits in the end.
A binary tree approach would take n * log n time, while an array approach would take n time. Also, a binary tree requires two pointers per node, so your memory usage would be a lot higher as well. Similar considerations apply to hash tables as well.
Of course, if your set is small, then just use the inbuilt HashSet.
