Selecting top 10 most frequently occurring strings from an array, java

Selecting top 10 most frequently occurring strings from an array, java - java

I have an array of strings from which I want to find the top 10 most frequently occurring strings.
One primitive way of doing this is to of course loop through the array once, get a stack/queue of all the distinct strings, store these distinct strings in an array, then check the number of times each string in this new array occurs in the original array, and finally store the values in 'n' distinct integers, where n is the number of distinct strings.
Obviously this is a horrible method when it comes to time efficiency, so I was wondering if there is a better way of doing this.

If you don't care about memory, you can build a hash map holding the count of each string: you loop through all your strings and for each one you do
myhash[mystring] += 1
if the string is already present in the hash, or
myhash[mystring] = 1
otherwise.
If you consider that looking up a value in a hash map is made in constant time (which could not be true) then this algorithm is "only" O(n) (but it takes up a lot of memory).

If you care about memory, you can sort the array and then count how many times each string appears easily (each string will appear firstly at position i, i+1, i+2, ..., i+k and nowhere else).
Sorting will take O(n log n), than O(n) for counting occurences of strings.

You could use a Guava Multiset adding all the strings then call Multisets.copyHighestCountFirst() only looking at the first 10
See this question for an example

Related

How to make zeros for next n-1 elements of n similar elements in an array?

I have an integer 667778 and I need to output it as 607008.
I used an array 6,6,7,7,7,8 and xor next similar elements.
I need to this in constant time.
suppose
int arr[]={6,6,7,7,7,8}
int ele=arr[0];
for (int i=1;i<arr.length;i++)
{
if(arr[i]==ele)
arr[i]=0;
else
ele=arr[i];
}
output array arr has [6,0,7,0,0,8]
It is taking O(n) n is size of the array
How can i do this in constant time?

Unless the number given will always be 6 digits (in which case you can hard code it, which is technically constant time, but will give equal performance to the loop), then you can't get constant time because the basis of the problem requires looping through the array in the first place.
Is there are reason you want it to work in constant time anyways, as O(n) is the fastest a program can read the data anyways.
Edit:
After reading your comments, I think you need to come up with a different approach so calculating the XORs won't be inside the loop. I can't provide much more help without the original problem.

Find the only unique element in an array of a million elements

I was asked this question in a recent interview.
You are given an array that has a million elements. All the elements are duplicates except one. My task is to find the unique element.
var arr = [3, 4, 3, 2, 2, 6, 7, 2, 3........]
My approach was to go through the entire array in a for loop, and then create a map with index as the number in the array and the value as the frequency of the number occurring in the array. Then loop through our map again and return the index that has value of 1.
I said my approach would take O(n) time complexity. The interviewer told me to optimize it in less than O(n) complexity. I said that we cannot, as we have to go through the entire array with a million elements.
Finally, he didn't seem satisfied and moved onto the next question.
I understand going through million elements in the array is expensive, but how could we find a unique element without doing a linear scan of the entire array?
PS: the array is not sorted.

I'm certain that you can't solve this problem without going through the whole array, at least if you don't have any additional information (like the elements being sorted and restricted to certain values), so the problem has a minimum time complexity of O(n). You can, however, reduce the memory complexity to O(1) with a XOR-based solution, if every element is in the array an even number of times, which seems to be the most common variant of the problem, if that's of any interest to you:
int unique(int[] array)
{
int unpaired = array[0];
for(int i = 1; i < array.length; i++)
unpaired = unpaired ^ array[i];
return unpaired;
}
Basically, every XORed element cancels out with the other one, so your result is the only element that didn't cancel out.

Assuming the array is un-ordered, you can't. Every value is mutually exclusive to the next so nothing can be deduced about a value from any of the other values?
If it's an ordered array of values, then that's another matter and depends entirely on the ordering used.
I agree the easiest way is to have another container and store the frequency of the values.

In fact, since the number of elements in the array was fix, you could do much better than what you have proposed.
By "creating a map with index as the number in the array and the value as the frequency of the number occurring in the array", you create a map with 2^32 positions (assuming the array had 32-bit integers), and then you have to pass though that map to find the first position whose value is one. It means that you are using a large auxiliary space and in the worst case you are doing about 10^6+2^32 operations (one million to create the map and 2^32 to find the element).
Instead of doing so, you could sort the array with some n*log(n) algorithm and then search for the element in the sorted array, because in your case, n = 10^6.
For instance, using the merge sort, you would use a much smaller auxiliary space (just an array of 10^6 integers) and would do about (10^6)*log(10^6)+10^6 operations to sort and then find the element, which is approximately 21*10^6 (many many times smaller than 10^6+2^32).
PS: sorting the array decreases the search from a quadratic to a linear cost, because with a sorted array we just have to access the adjacent positions to check if a current position is unique or not.

Your approach seems fine. It could be that he was looking for an edge-case where the array is of even size, meaning there is either no unmatched elements or there are two or more. He just went about asking it the wrong way.

Best way to compare - Using Sorting or adding to Set

I have two Strings like
String one = "one, two, three, four";
String two ="two,nine,ten";
Now, if any of the numbers two / nine / ten is present in the first string, I need to return true.
And, I Split both strings . splitOne[], splitTwo[] are present now.
Now, one way will be to compare each and every element more like a bubble sort.
This will give me a complexity of O(n^2).
Will adding the elements to a HashSet get me better complexity ?
For adding to set, I need to iterate through both lists and add each element.
Which of these will require lesser time ? Is there any significant difference ?

It really depends on your use-case.
There is no point in trying to sort it yourself. There are much more efficient methods available, one of which is, obviously, using a HashSet.
If you really are working with up to around 30 words then HashSet is certainly your way to go. However, as the number of strings gets bigger you are going to start running into space problems. For a start String.split will eat huge amounts of memory when you get into the thousands of strings, let alone the HashSet.
If you wish to avoid using database then there are solutions such as a Bloom Filter.
At the extreme end you would probably want to use a database of some sort.

You can use HashMap and also can maintain the number occurences as value.
Or instead of spliting strings, Split one string and conpare till the source String exits in distination one.
private boolean testArray(){
String one = "one, two, three,four,nine,ten";
String two ="two,nine,ten,11";
String strTwo[] = two.split(",");
for (String string : strTwo) {
if(!one.contains(string)) return false;
}
return true;
}

Lets say the no. of elements in the first set is N and the no. of elements in the second set is M.
Using a hashset will require O(N+M) as O(N) used for adding while O(M) is use for checking. (Assuming comparisons are O(1))
The 'bubble sort' way will take O(NM).
Theoretically, I think O(N+M) with a hashset will be faster in complexity. However, the constant factor of a hashset should be higher and hence you might not see any improvement for lower values of N and M.
Alternatively, since you are dealing with strings, comparison between strings aren't O(1). You can create a trie using the first set, taking O(A) time, where A is the number of characters in the first set. And then O(B) in total to traverse the trie to check, where O(B) is the number of characters in the second set. This might give you better performances than a HashSet as it is independent of any hashing function (and hence collision checking).

Running time of insertion into 2 hashtables with iteration and printing

I have a program that does the following:
Iterates through a string, placing words into a HashMap<String, Integer> where the key represents the unique word, and the value represents a running total occurrences (incremented each time the word is found).
I believe up to this point we are O(n) since each of the insertions is constant time.
Then, I iterate through the hashmap and insert the values into a new HashMap<Integer, List<String>>. The String goes into the List in the value where the count matches. I think that we are still at O(n) because the operations used on HashMaps and Lists are constant time.
Then, I iterate through the HashMap and print the Strings in each List.
Does anything in this program cause me to go above O(n) complexity?

That is O(n), unless your word-parsing algorithm is not linear (but it should be).

You're correct, with a caveat. In a hash table, insertions and lookups take expected O(1) time each, so the expected runtime of your algorithm is O(n). If you have a bad hash function, there's a chance it will take longer than that, usually (for most reasonable hash table implementations) O(n2) in the worst-case.
Additionally, as #Paul Draper pointed out, this assumes that the computation of the hash code for each string takes time O(1) and that comparing the strings in the table takes time O(1). If you have strings whose lengths aren't bounded from above by some constant, it might take longer to compute the hash codes. In fact, a more accurate analysis would be that the runtime is O(n + L), where L is the total length of all the strings.
Hope this helps!

Beyond the two issues that Paul Draper and templatetypedef point out, there's another potential one. You write that your second map is a hashmap < int,list < string > >. This allows for a total linear complexity only if the implementation you choose for the list allows for (amortized) constant time appending. This is the case if you use an ArrayList and you add entries at the end, or you choose a LinkedList and add entries at either end.
I think this covers the default choices for most developers, so it's not really an obstacle.

Most frequently repeated numbers in a huge list of numbers

I have a file which has a many random integers(around a million) each seperated by a white space. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in java?
I can think of
1. Create a hash map, key is the integer from the file and the value is the count. For every number in the file, check if that key already exists in the hash map, if yes, value++, else make a new entry in hash
2. Make a BST, each node is the integer from the file. For every integer from the file see if there is a node in the BST if yes, do value++, value is part of the node.
I feel hash map is better option if i can come up with good hashing function,
Can some one pl suggest me what is the best of doing this ? Is there is anyother efficient algo that i can use?

Edit #2:
Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is older obsolete post that could still be valid if someone had MANY more points than a million)
A HashSet would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers, and for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a Hash has to make allowances for should make it a few times faster).
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases though (It would have to be a wide range and "grouped" data, and even then the tree might win.
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.
Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashSet will.

Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't both optimizing. It will be fast enough. Only bother if it's something that's going to run within an application.

HashMap
A million integers is not really a lot, even for interpreted languages, but especially for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.

Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)

Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement on step 4, you only need to step forward through the array in steps equilivent to your 10th longest run. Any run longer than that will overlap with your sampling. If the tenth longest run is 100 elements long, you only need to sample element 100, 200, 300 and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th longest is sure to overlap with your sampling.
You should apply this optimisation after your 10th run length is very long compared to other runs in the array.
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out

If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.

If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.

Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done searching the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(nLogn).
For example:
public static class Count implements Comparable<Count> {
int value;
int count;
public Count(int value) {
this.value = value;
this.count = 1;
}
public void increment() {
count++;
}
public int compareTo(Count other) {
return other.count - count;
}
}
public static void main(String args[]) throws Exception {
Scanner input = new Scanner(new FileInputStream(new File("...")));
HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
while (input.hasNextInt()) {
int tempInt = input.nextInt();
Count tempCount = dataset.get(tempInt);
if (tempCount != null) {
tempCount.increment();
} else {
dataset.put(tempInt, new Count(tempInt));
}
}
List<Count> counts = new ArrayList<Count>(dataset.values());
Collections.sort(counts);

Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache where the element's access count determines whether it syays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html

This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?

The correct way to do it is with a linked list. When you insert an element, you go down the linked list, if its there you increment the nodes count, otherwise create a new node with count of 1. After you inserted each element, you would have a sorted list of elements in O(n*log(n)).
For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.