How do you return the number of distinct/unique values in an array? For example:
int[] a = {1,2,2,4,5,5};
Set<Integer> s = new HashSet<Integer>();
for (int i : a) s.add(i);
int distinctCount = s.size();
A set stores each unique (as defined by .equals()) element in it only once, and you can use this to simplify the problem. Create a Set (I'd use a HashSet), iterate your array, adding each integer to the Set, then return .size() of the Set.
An efficient method: sort the array with Arrays.sort, then write a simple loop over adjacent elements to count the distinct values.
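For illustration, a minimal sketch of that sort-and-scan idea (the method name is mine, and it works on a copy to avoid mutating the caller's array):
import java.util.Arrays;

// After sorting, each distinct value is the first element of a run of equal values.
static int countDistinct(int[] a) {
    if (a.length == 0) return 0;
    int[] copy = a.clone();
    Arrays.sort(copy);                                // O(n log n)
    int distinct = 1;                                 // the first element always starts a run
    for (int i = 1; i < copy.length; i++) {
        if (copy[i] != copy[i - 1]) distinct++;       // a new value begins here
    }
    return distinct;
}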
Really depends on the number of elements in the array. If you're not dealing with a large amount of integers, a HashSet or a binary tree would probably be the best approach. On the other hand, if you have a large array of diverse integers (say, more than a billion) it might make sense to allocate a bit array of 2^32 bits (2^32 / 8 = 512 MB) in which each bit represents the existence or non-existence of an integer, and then count the number of set bits at the end.
A binary tree approach would take n * log n time, while an array approach would take n time. Also, a binary tree requires two pointers per node, so your memory usage would be a lot higher as well. Similar considerations apply to hash tables as well.
Of course, if your set is small, then just use the inbuilt HashSet.
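As a rough sketch of that bit-array idea (names are mine, a long[] stands in for the byte array, and it assumes you can afford the 512 MB up front):
// One bit per possible int value: 2^26 longs * 64 bits = 2^32 bits = 512 MB.
static long countDistinctViaBits(int[] a) {
    long[] bits = new long[1 << 26];
    for (int v : a) {
        long idx = (long) v - Integer.MIN_VALUE;      // map the int range onto [0, 2^32)
        bits[(int) (idx >>> 6)] |= 1L << (idx & 63);  // set the bit for this value
    }
    long count = 0;
    for (long word : bits) count += Long.bitCount(word);  // count the set bits
    return count;
}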
I am currently implementing the Mash algorithm for genome comparison. For this I need to create a sketch S(A) for each genome, which is a set of a certain size s containing the lowest hash values corresponding to k-mers in the genome. To compare two genomes, I am computing the Jaccard index, for which I need an additional sketch of the union of the two genomes, i.e. S(A u B) for two genomes A and B. This should contain the s lowest hash values found in the union of A and B.
I am computing the sketches as a TreeSet because in the original algorithm to compute a sketch I need to remove the biggest value from the set whenever I add a new value that is lower and the sketch has already reached the maximum size s. This is very easily accomplished using TreeSet because the largest value will be in the last position of the set.
I have computed a union of two sketches and now want to remove the larger elements to reduce the sketch size to size s. I first implemented this using a while loop and removing the last element until reaching the desired size.
The following is my code using an example TreeSet and size s = 10:
SortedSet<Integer> example = new TreeSet<>();
for (int i = 0; i < 15; i++) {
    example.add(i);
}
while (example.size() > 10) example.remove(example.last());
However, in the real application sketch sizes will be much larger and the size of the union can be up to two times the size of a single sketch. Trying to find a more efficient way to reduce the sketch size I found that you could convert the TreeSet to an array. Thus, my second approach would be the following:
Object[] temp = example.toArray();
int value = (int) temp[10];
example = example.headSet(value);
So, here I am getting the value at index s from the array, which I can then use to create a headSet from the TreeSet.
Now, I am wondering if there is a more efficient way to reduce the size of the TreeSet, one where I neither have to remove elements one by one until the desired size is reached nor generate an extra array.
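For comparison, a third variant I could imagine (a sketch only, using java.util.Iterator, with s being the target sketch size, 10 in the example above): advance an iterator past the s smallest elements, then clear the tail view in place.
Iterator<Integer> it = example.iterator();
for (int i = 0; i < s && it.hasNext(); i++) it.next();   // skip the s smallest values
if (it.hasNext()) {
    int boundary = it.next();                            // the (s+1)-th smallest value
    example.tailSet(boundary).clear();                   // drop boundary and everything larger
}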
I found this solution
https://www.geeksforgeeks.org/count-distinct-elements-in-an-array/
The problem is that the time complexity must be O(n), the space complexity must be O(1), I can't import any additional libraries, and the code must be as short as possible. I wasn't able to find a solution based on sorting that is faster than O(n log n), so I guess I need to find a clever way. The answer would be the third solution from the link above, but it requires an additional library. Is it even possible to find a better way?
Edit:
In fact, I need to create a function that works exactly like
java.util.Arrays.stream(myarray).distinct().count();
It must have O(n) time complexity and O(1) space complexity.
Basically, I have to create it using only loops, arrays and if statements. It is also forbidden to import anything other than import java.util.Scanner;, so I can't use any ready-made methods like those in java.util.Arrays.
For example:
Input:
{1,12,3,0,1,3,15,6}
Output:
6
A maximally short solution with O(n) time complexity, using only Java 8+ built-in APIs, i.e. no additional libraries needed.
The code assumes myarray is an array of int, long, double, or objects (see note 1).
long count = java.util.Arrays.stream(myarray).distinct().count();
1) Objects must have valid equals() and hashCode() implementations.
A solution in O(n) time complexity and O(1) space complexity is possible in theory, but it might not be very practical. The basic idea is this:
let aMin be the minimum value of an entry in arr
let aMax be the maximum value of an entry in arr
let seenOnce and seenTwice be boolean arrays
    whose indices are in the range [aMin..aMax]
initialize all elements of seenOnce and seenTwice to FALSE

countUnique = 0
for a in arr {
    if (!seenOnce[a - aMin]) {
        // seeing `a` for the first time, so count it
        seenOnce[a - aMin] = TRUE
        countUnique = countUnique + 1
    } else if (!seenTwice[a - aMin]) {
        // seeing `a` for a second time, so un-count it
        countUnique = countUnique - 1
        seenTwice[a - aMin] = TRUE
    }
}
If the values in arr could be any ints at all, then each of the boolean arrays will contain 2^32 entries, for a total of over 8 billion booleans. That's 1 GB of memory, provided we're careful to implement all those booleans in one bit each. But it is O(1): the same 1 GB is consumed regardless of whether arr contains two elements or a billion...
I was asked this question in a recent interview.
You are given an array that has a million elements. All the elements are duplicates except one. My task is to find the unique element.
var arr = [3, 4, 3, 2, 2, 6, 7, 2, 3........]
My approach was to go through the entire array in a for loop and create a map with index as the number in the array and the value as the frequency of the number occurring in the array, then loop through the map again and return the index that has a value of 1.
I said my approach would take O(n) time complexity. The interviewer told me to optimize it in less than O(n) complexity. I said that we cannot, as we have to go through the entire array with a million elements.
Finally, he didn't seem satisfied and moved onto the next question.
I understand going through million elements in the array is expensive, but how could we find a unique element without doing a linear scan of the entire array?
PS: the array is not sorted.
I'm certain that you can't solve this problem without going through the whole array, at least if you don't have any additional information (like the elements being sorted or restricted to certain values), so the problem has a minimum time complexity of O(n). You can, however, reduce the memory complexity to O(1) with a XOR-based solution, provided every element other than the unique one appears an even number of times, which seems to be the most common variant of the problem, if that's of any interest to you:
int unique(int[] array)
{
    int unpaired = array[0];
    for (int i = 1; i < array.length; i++)
        unpaired = unpaired ^ array[i];  // paired values cancel out
    return unpaired;                     // only the unpaired value remains
}
Basically, every XORed element cancels out with the other one, so your result is the only element that didn't cancel out.
Assuming the array is unordered, you can't. Every value is independent of the next, so nothing can be deduced about a value from any of the other values.
If it's an ordered array of values, then that's another matter and depends entirely on the ordering used.
I agree the easiest way is to have another container and store the frequency of the values.
In fact, since the number of elements in the array is fixed, you could do much better than what you have proposed.
By "creating a map with index as the number in the array and the value as the frequency of the number occurring in the array", you create a map with 2^32 positions (assuming the array holds 32-bit integers), and then you have to pass through that map to find the first position whose value is one. This means you are using a large amount of auxiliary space, and in the worst case you are doing about 10^6 + 2^32 operations (one million to create the map and 2^32 to find the element).
Instead of doing so, you could sort the array with some n*log(n) algorithm and then search for the element in the sorted array, because in your case, n = 10^6.
For instance, using the merge sort, you would use a much smaller auxiliary space (just an array of 10^6 integers) and would do about (10^6)*log(10^6)+10^6 operations to sort and then find the element, which is approximately 21*10^6 (many many times smaller than 10^6+2^32).
PS: sorting the array decreases the search from a quadratic to a linear cost, because with a sorted array we just have to check the adjacent positions to see whether the value at the current position is unique or not.
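For concreteness, a hedged sketch of that sort-then-scan idea (the helper name is mine, not the interviewer's intended answer; it also handles duplicates that occur more than twice, as in the example):
import java.util.Arrays;

// After sorting, the unique element is the value whose run has length 1.
static int findUnique(int[] arr) {
    int[] copy = arr.clone();
    Arrays.sort(copy);                                       // O(n log n)
    int i = 0;
    while (i < copy.length) {
        int j = i;
        while (j < copy.length && copy[j] == copy[i]) j++;   // advance to the end of the run
        if (j - i == 1) return copy[i];                      // run of length 1: the unique element
        i = j;
    }
    throw new IllegalArgumentException("no unique element found");
}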
Your approach seems fine. It could be that he was looking for an edge case where the array is of even size, meaning there are either no unmatched elements or there are two or more, and he just went about asking it the wrong way.
Let me start off by saying that this is not a homework question. I am trying to design a cache whose eviction policy depends on entries that occurred the most in the cache. In software terms, assume we have an array with different elements and we just want to find the element that occurred the most. For example: {1,2,2,5,7,3,2,3} should return 2. Since I am working with hardware, the naive O(n^2) solution would require a tremendous hardware overhead. The smarter solution of using a hash table works well for software because the hash table size can change but in hardware, I will have a fixed size hash table, probably not that big, so collisions will lead to wrong decisions. My question is, in software, can we solve the above problem in O(n) time complexity and O(1) space?
There can't be an O(n) time, O(1) space solution, at least not for the generic case.
As amit points out, by solving this we find the solution to the element distinctness problem (determining whether all the elements of a list are distinct), which has been proven to take Θ(n log n) time when not using elements to index the computer's memory. If we were to use elements to index the computer's memory, given an unbounded range of values, this requires at least Θ(n) space. Given the reduction of this problem to that one, the bounds for that problem enforce identical bounds on this problem.
However, practically speaking, the range would mostly be bounded, if for no other reason than the type one typically uses to store each element in has a fixed size (e.g. a 32-bit integer). If this is the case, this would allow for an O(n) time, O(1) space solution, albeit possibly too slow and using too much space due to the large constant factors involved (as the time and space complexity would depend on the range of values).
2 options:
Counting sort
Keeping an array of the number of occurrences of each element (the array index being the element), outputting the most frequent.
If you have a bounded range of values, this approach would be O(1) space (and O(n) time). But technically so would the hash table approach, so presumably the constant factors here are too large for this to be acceptable.
Related options are radix sort (has an in-place variant, similar to quicksort) and bucket sort.
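For concreteness, a minimal sketch of the counting option (assuming values lie in [0, range), an assumption not given in the question; the method name is mine):
// Counting array: O(n) time, O(range) space (constant with respect to n).
static int mostFrequentByCounting(int[] arr, int range) {
    int[] counts = new int[range];
    for (int v : arr) counts[v]++;               // one pass to tally occurrences
    int best = 0;
    for (int v = 1; v < range; v++) {
        if (counts[v] > counts[best]) best = v;
    }
    return best;                                 // the value with the highest count
}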
Quicksort
Repeatedly partitioning the data based on a selected pivot (through swapping) and recursing on the partitions.
After sorting we can just iterate through the array, keeping track of the maximum number of consecutive elements.
This would take O(n log n) time and O(1) space.
As you say, the maximum element in your cache may be a very big number, but the following is one possible solution.
Iterate over the array.
Let's say the maximum element that the array holds is m.
For each index i, get the element it holds; call it array[i].
Now go to index array[i] and add m to it.
Do the above for all the indexes in the array.
Finally, iterate over the array and return the index holding the maximum element.
Time complexity: O(N)
Space complexity: O(1)
It may not be feasible for large m, as in your case, but see if you can optimize or alter this algorithm (a sketch follows below).
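Here is a hedged sketch of that idea, under the stronger assumption that every value lies in [0, arr.length) so values can double as indices (the method name is mine; the modulo keeps the original values readable while counts accumulate in multiples of n):
// In-place counting: destroys the array's contents (they can be restored with % n).
static int mostFrequentInPlace(int[] arr) {
    int n = arr.length;
    for (int i = 0; i < n; i++) {
        arr[arr[i] % n] += n;                      // record one occurrence of the original value arr[i] % n
    }
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (arr[i] / n > arr[best] / n) best = i;  // arr[i] / n is the occurrence count of value i
    }
    return best;                                   // index == most frequent value
}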
A solution off the top of my head:
As the numbers can be large, I would consider hashing instead of storing them directly in an array.
Let there be n numbers, 0 to n-1.
Suppose the number occurring the maximum number of times occurs K times.
Let us create n/k buckets, initially all empty.
hash(num) tells whether num is present in any of the buckets.
hash_2(num) stores the number of times num is present in its bucket.
for (i = 0 to n-1)
    if input[i] is already present in one of the buckets, increase its count: hash_2(input[i])++
    else, if there is an empty bucket, insert input[i] into the first empty bucket and set hash(input[i]) = true
    else (all buckets full), decrease the count of every number in the buckets by 1 and do not add input[i] to any bucket;
        if the count of any number becomes zero (see hash_2(number)), set hash(number) = false
This way, you will finally be left with at most n/k candidate elements (one per bucket), and the required number is one of them, so you need to traverse the input again in O(N) to find the actual number.
The space used is O(n/K) and the time complexity is O(N), assuming an O(1) hash implementation.
So the performance really depends on K: if K << n, this method performs poorly.
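To make the bucket scheme concrete, here is a rough sketch (naming is mine; it is essentially the Misra-Gries heavy-hitters idea, with a HashMap standing in for the buckets and a second pass to verify the candidates):
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

static int mostFrequentCandidate(int[] input, int numBuckets) {
    Map<Integer, Integer> buckets = new HashMap<>();       // value -> count, at most numBuckets entries
    for (int x : input) {
        Integer c = buckets.get(x);
        if (c != null) {
            buckets.put(x, c + 1);                         // already tracked: increment its count
        } else if (buckets.size() < numBuckets) {
            buckets.put(x, 1);                             // free bucket: start tracking x
        } else {
            // all buckets full: decrement every count and drop those that reach zero
            Iterator<Map.Entry<Integer, Integer>> it = buckets.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Integer, Integer> e = it.next();
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }
    // Second O(N) pass: the surviving keys are only candidates, so count them exactly.
    int best = input[0], bestCount = 0;
    for (int candidate : buckets.keySet()) {
        int count = 0;
        for (int x : input) if (x == candidate) count++;
        if (count > bestCount) { bestCount = count; best = candidate; }
    }
    return best;
}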
I don't think this answers the question as stated in the title, but actually you can implement a cache with the Least-Frequently-Used eviction policy having constant average time for put, get and remove operations. If you maintain your data structure properly, there's no need to scan all items in order to find the item to evict.
The idea is to have a hash table which maps keys to value records. A value record contains the value itself plus a reference to a "counter node". A counter node is part of a doubly linked list, and consists of:
An access counter
The set of keys having this access count (as a hash set)
A next pointer
A prev pointer
The list is maintained such that it's always sorted by the access counter (where the head is min), and the counter values are unique. A node with access counter C contains all keys having this access count. Note that this doesn't increase the overall space complexity of the data structure.
A get(K) operation involves promoting K by migrating it to another counter node (either a new one or the next one in the list).
An eviction operation triggered by a put operation roughly consists of checking the head of the list, removing an arbitrary key from its key set, and then removing it from the hash table.
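A skeletal sketch of that structure (class and field names are mine, and the promotion/eviction bodies are only outlined in comments):
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LfuSketch<K, V> {

    // One node per distinct access count, kept in a doubly linked list
    // that stays sorted by count (head = minimum count).
    static class CounterNode<K> {
        long count;
        Set<K> keys = new HashSet<>();     // all keys with exactly this access count
        CounterNode<K> prev, next;
        CounterNode(long count) { this.count = count; }
    }

    static class ValueRecord<K, V> {
        V value;
        CounterNode<K> node;               // the counter node currently holding this key
    }

    Map<K, ValueRecord<K, V>> table = new HashMap<>();
    CounterNode<K> head;                   // node with the smallest access count

    // get(key): look up the record, then move the key from its current counter node
    // to a node with count + 1 (splicing in a new node if needed); roughly O(1).

    // evict(): take any key from head.keys, remove it from that set and from the
    // table, and unlink the head node if its key set becomes empty; O(1).
}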
It is possible if we make reasonable (to me, anyway) assumptions about your data set.
As you say, you could do it if you could hash, because you can simply count by hash. The problem is that you may get non-unique hashes. You mention 20-bit numbers, so presumably 2^20 possible values and a desire for a small and fixed amount of working memory for the actual hash counts. This, one presumes, will lead to hash collisions and thus a breakdown of the hashing algorithm. But you can fix this by doing more than one pass with complementary hashing algorithms.
Because these are memory addresses, it's likely not all of the bits are actually going to be capable of being set. For example if you only ever allocate word (4 byte) chunks you can ignore the two least significant bits. I suspect, but don't know, that you're actually only dealing with larger allocation boundaries so it may be even better than this.
Assuming word alignment, that means we have 18 bits to hash.
Next, you presumably have a maximum cache size which is presumably pretty small. I'm going to assume that you're allocating a maximum of <=256 items because then we can use a single byte for the count.
Okay, so to make our hashes we break the number in the cache into two nine-bit chunks, ordered from most to least significant, discarding the last two bits as discussed above. Take the first chunk and use it as a hash to produce a first-pass count. Then take the second chunk and use it as a hash, but this time only count it if the first chunk matches the one we identified as having the highest count. The value left with the highest count in the second pass is now uniquely identified as having the highest count overall.
This runs in O(n) time and requires a 512 byte hash table for counting. If that's too large a table you could divide into three chunks and use a 64 byte table.
Added later
I've been thinking about this and I've realised it has a failure condition: if the first pass counts two groups as having the same number of elements, it cannot effectively distinguish between them. Oh well
Assumption: all the elements are integers; for other data types we can also achieve this if we use hashCode().
We can achieve O(n log n) time complexity and O(1) space.
First, sort the array; the time complexity is O(n log n) (we should use an in-place sorting algorithm like quicksort in order to maintain the space complexity).
Then use four integer variables: current, which indicates the value we are currently referring to; count, which indicates the number of occurrences of current; result, which indicates the final result; and resultCount, which indicates the number of occurrences of result.
Iterate from the start to the end of the array data:
static int mostFrequent(int[] data) {   // assumes data is already sorted and non-empty
    int result = 0;
    int resultCount = -1;
    int current = data[0];
    int count = 1;
    for (int i = 1; i < data.length; i++) {
        if (data[i] == current) {
            count++;
        } else {
            if (count > resultCount) {
                result = current;
                resultCount = count;
            }
            current = data[i];
            count = 1;
        }
    }
    if (count > resultCount) {          // don't forget the last run
        result = current;
        resultCount = count;
    }
    return result;
}
So, in the end, only four variables are used.
I have a file which has many random integers (around a million), each separated by whitespace. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in Java?
I can think of
1. Create a hash map where the key is the integer from the file and the value is its count. For every number in the file, check if that key already exists in the hash map; if yes, increment the value, else make a new entry in the hash map.
2. Make a BST where each node holds an integer from the file along with a count. For every integer from the file, see if there is already a node for it in the BST; if yes, increment the count stored in that node.
I feel the hash map is the better option if I can come up with a good hashing function.
Can someone please suggest the best way of doing this? Is there any other efficient algorithm that I can use?
Edit #2:
Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is older obsolete post that could still be valid if someone had MANY more points than a million)
A HashMap would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers, and for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a hash has to make allowances for should make it a few times faster).
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases though (it would have to be a wide range and "grouped" data, and even then the tree might win).
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.
Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashMap will.
Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't bother optimizing. It will be fast enough. Only bother if it's something that's going to run repeatedly within an application.
HashMap
A million integers is not really a lot, even for interpreted languages, but especially for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.
Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)
Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement on step 4, you only need to step forward through the array in steps equivalent to your 10th-longest run. Any run longer than that will overlap with your sampling. If the tenth-longest run is 100 elements long, you only need to sample elements 100, 200, 300, and so on, and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th-longest is sure to overlap with your sampling.
You should apply this optimisation only once your 10th-longest run is very long compared to other runs in the array.
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out
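A rough sketch of that approach without the sampling refinement (the helper name and the small min-heap are my own choices, not part of the answer):
import java.util.Arrays;
import java.util.PriorityQueue;

// Sort, then walk the runs, keeping the ten longest in a small min-heap.
static int[] topTenByFrequency(int[] numbers) {
    int[] sorted = numbers.clone();
    Arrays.sort(sorted);                                            // step 3: put the list in order
    PriorityQueue<int[]> topRuns = new PriorityQueue<>((a, b) -> a[0] - b[0]);  // {runLength, value}
    int i = 0;
    while (i < sorted.length) {
        int j = i;
        while (j < sorted.length && sorted[j] == sorted[i]) j++;    // end of the current run
        topRuns.offer(new int[] { j - i, sorted[i] });
        if (topRuns.size() > 10) topRuns.poll();                    // keep only the ten longest runs
        i = j;
    }
    int[] result = new int[topRuns.size()];
    for (int k = result.length - 1; k >= 0; k--) result[k] = topRuns.poll()[1];
    return result;                                                  // result[0] is the most frequent value
}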
If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.
If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.
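For illustration, a small sketch of the HashMap<Integer, int[]> counter trick described above (the method name is mine):
import java.util.HashMap;
import java.util.Map;

// The one-element int[] acts as a mutable counter, so incrementing mutates the
// array in place instead of boxing a new Integer value on every update.
static Map<Integer, int[]> countFrequencies(int[] numbers) {
    Map<Integer, int[]> counts = new HashMap<>();
    for (int n : numbers) {
        int[] counter = counts.get(n);
        if (counter == null) counts.put(n, new int[] { 1 });   // first occurrence
        else counter[0]++;                                      // increment in place
    }
    return counts;
}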
Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done searching the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(n log n).
For example:
// Requires java.io.File, java.io.FileInputStream and java.util.{Scanner, HashMap, List, ArrayList, Collections}.
public static class Count implements Comparable<Count> {
    int value;
    int count;

    public Count(int value) {
        this.value = value;
        this.count = 1;
    }

    public void increment() {
        count++;
    }

    public int compareTo(Count other) {
        return other.count - count;   // larger counts sort first
    }
}

public static void main(String args[]) throws Exception {
    Scanner input = new Scanner(new FileInputStream(new File("...")));
    HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
    while (input.hasNextInt()) {
        int tempInt = input.nextInt();
        Count tempCount = dataset.get(tempInt);
        if (tempCount != null) {
            tempCount.increment();
        } else {
            dataset.put(tempInt, new Count(tempInt));
        }
    }
    List<Count> counts = new ArrayList<Count>(dataset.values());
    Collections.sort(counts);
    // counts is now sorted by descending count; the first ten entries are the top 10
}
Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache where the element's access count determines whether it stays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html
This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
    return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?
The correct way to do it is with a linked list. When you insert an element, you go down the linked list; if it's there, you increment the node's count, otherwise you create a new node with a count of 1. After you have inserted each element, you would have a sorted list of elements in O(n*log(n)).
For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.