removing duplicate strings from a massive array in java efficiently? - java

I'm considering the best possible way to remove duplicates from an (Unsorted) array of strings - the array contains millions or tens of millions of stringz..The array is already prepopulated so the optimization goal is only on removing dups and not preventing dups from initially populating!!
I was thinking along the lines of doing a sort and then binary search to get a log(n) search instead of n (linear) search. This would give me nlogn + n searches which althout is better than an unsorted (n^2) search = but this still seems slow. (Was also considering along the lines of hashing but not sure about the throughput)
Please help! Looking for an efficient solution that addresses both speed and memory since there are millions of strings involved without using Collections API!

Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:
HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));
If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.

ANALYSIS
Let's perform some analysis:
Using HashSet. Time complexity - O(n). Space complexity O(n). Note, that it requires about 8 * array size bytes (8-16 bytes - a reference to a new object).
Quick Sort. Time - O(n*log n). Space O(log n) (the worst case O(n*n) and O(n) respectively).
Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n)
Heap Sort. Time O(n * log n). Space O(1). (but it is slower than 2 and 3).
In case of Heap Sort you can through away duplicates on fly, so you'll save a final pass after sorting.
CONCLUSION
If time is your concern, and you don't mind allocating 8 * array.length bytes for a HashSet - this solution seems to be optimal.
If space is a concern - then QuickSort + one pass.
If space is a big concern - implement a Heap with throwing away duplicates on fly. It's still O(n * log n) but without additional space.

I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in-place if needed (in this case in-place implementation is a bit harder than with normal mergesort because adjacent parts could contain gaps from the removed duplicates which also need to be closed when merging).
For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort

Creating a hashset to handle this task is way too expensive. Demonstrably, in fact the whole point of them telling you not to use the Collections API is because they don't want to hear the word hash. So that leaves the code following.
Note that you offered them binary search AFTER sorting the array: that makes no sense, which may be the reason your proposal was rejected.
OPTION 1:
public static void removeDuplicates(String[] input){
Arrays.sort(input);//Use mergesort/quicksort here: n log n
for(int i=1; i<input.length; i++){
if(input[i-1] == input[i])
input[i-1]=null;
}
}
OPTION 2:
public static String[] removeDuplicates(String[] input){
Arrays.sort(input);//Use mergesort here: n log n
int size = 1;
for(int i=1; i<input.length; i++){
if(input[i-1] != input[i])
size++;
}
System.out.println(size);
String output[] = new String[size];
output[0]=input[0];
int n=1;
for(int i=1;i<input.length;i++)
if(input[i-1]!=input[i])
output[n++]=input[i];
//final step: either return output or copy output into input;
//here I just return output
return output;
}
OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array, if that is unacceptable, you must make a copy.
public static String[] removeDuplicates(String[] input){
Arrays.sort(input);//Use mergesort/quicksort here: n log n
int outputLength = 0;
for(int i=1; i<input.length; i++){
// I think equals is safer, but are nulls allowed in the input???
if(input[i-1].equals(input[i]))
input[i-1]=null;
else
outputLength++;
}
// check if there were zero duplicates
if (outputLength == input.length)
return input;
String[] output = new String[outputLength];
int idx = 0;
for ( int i=1; i<input.length; i++)
if (input[i] != null)
output[idx++] = input[i];
return output;
}

Hi do you need to put them into an array. It would be faster to use a collection using hash values like a set. Here each value is unique because of its hash value.
If you put all entries to a set collection type. You can use the
HashSet(int initialCapacity)
constructor to prevent memory expansion while run time.
Set<T> mySet = new HashSet<T>(Arrays.asList(someArray))
Arrays.asList() has runtime O(n) if memory do not have to be expanded.

Since this is an interview question, I think they want you to come up with your own implementation instead of using the set api.
Instead of sorting it first and compare it again, you can build a binary tree and create an empty array to store the result.
The first element in the array will be the root.
If the next element is equals to the node, return. -> this remove the duplicate elements
If the next element is less than the node, compare it to the left, else compare it to the right.
Keep doing the above the 2 steps until you reach to the end of the tree, then you can create a new node and know this has no duplicate yet.
Insert this new node value to the array.
After the traverse of all elements of the original array, you get a new copy of an array with no duplicate in the original order.
Traversing takes O(n) and searching the binary tree takes O(logn) (insertion should only take O(1) since you are just attaching it and not re-allocating/balancing the tree) so the total should be O(nlogn).

O.K., if they want super speed, let's use the hashcodes of the Strings as much as possible.
Loop through the array, get the hashcode for each String, and add it to your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for positives and one for negatives, and they will each be huge.
Loop again through the array, with another BitSet. True means the String passes. If the hashcode for the String does not exist in the Bitset, you can just mark it as true. Else, mark it as possibly duplicate, as false. While you are at it, count how many possible duplicates.
Collect all the possible duplicates into a big String[], named possibleDuplicates. Sort it.
Now go through the possible duplicates in the original array and binary Search in the possibleDuplicates. If present, well, you are still stuck, cause you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...

Related

Best way to retrieve K largest elements from large unsorted arrays?

I recently had a coding test during an interview. I was told:
There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?
During this, I was strongly hinted that I needed to sort the array.
So, I suggested to use built-in sort() or maybe a custom implementation if performance really mattered. I was then told that using a Collection or array to store the k largest and for-loop it is possible to achieve approximately O(N), in hindsight, I think it's O(N*k) because each iteration needs to compare to the K sized array to find the smallest element to replace, while the need to sort the array would cause the code to be at least O(N log N).
I then reviewed this link on SO that suggests priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N). Write a program to find 100 largest numbers out of an array of 1 billion numbers
Is the for-loop method bad? How should I justify pros/cons of using the for-loop or the priorityqueue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other method of retrieval is called on the sorted array, it should be constant time. Is there some performance factor when running the actual code that I didn't consider when theorizing pseudocode?
Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:
Find the kth largest number x using Quickselect (O(n))
Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
Return your saved elements
(If there are repeated elements, you can avoid them by keeping count of how many duplicates of x you need to add to the result.)
The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.
There is a large unsorted array of one million ints. The user wants to retrieve the K largest elements.
During this, I was strongly hinted that I needed to sort the array.
So, I suggested using a built-in sort() or maybe a custom
implementation
That wasn't really a hint I guess, but rather a sort of trick to deceive you (to test how strong your knowledge is).
If you choose to approach the problem by sorting the whole source array using the built-in Dual-Pivot Quicksort, you can't obtain time complexity better than O(n log n).
Instead, we can maintain a PriorytyQueue which would store the result. And while iterating over the source array for each element we need to check whether the queue has reached the size K, if not the element should be added to the queue, otherwise (is size equals to K) we need to compare the next element against the lowest element in the queue - if the next element is smaller or equal we should ignore it if it is greater the lowest element has to be removed and the new element needs to be added.
The time complexity of this approach would be O(n log k) because adding a new element into the PriorytyQueue of size k costs O(k) and in the worst-case scenario this operation can be performed n times (because we're iterating over the array of size n).
Note that the best case time complexity would be Ω(n), i.e. linear.
So the difference between sorting and using a PriorytyQueue in terms of Big O boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n this approach would give a significant performance gain.
Here's an implementation:
public static int[] getHighestK(int[] arr, int k) {
Queue<Integer> queue = new PriorityQueue<>();
for (int next: arr) {
if (queue.size() == k && queue.peek() < next) queue.remove();
if (queue.size() < k) queue.add(next);
}
return toIntArray(queue);
}
public static int[] toIntArray(Collection<Integer> source) {
return source.stream().mapToInt(Integer::intValue).toArray();
}
main()
public static void main(String[] args) {
System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
}
Output:
[9, 12, 27]
Sorting in O(n)
We can achieve worst case time complexity of O(n) when there are some constraints regarding the contents of the given array. Let's say it contains only numbers in the range [-1000,1000] (sure, you haven't been told that, but it's always good to clarify the problem requirements during the interview).
In this case, we can use Counting sort which has linear time complexity. Or better, just build a histogram (first step of Counting Sort) and look at the highest-valued buckets until you've seen K counts. (i.e. don't actually expand back to a fully sorted array, just expand counts back into the top K sorted elements.) Creating a histogram is only efficient if the array of counts (possible input values) is smaller than the size of the input array.
Another possibility is when the given array is partially sorted, consisting of several sorted chunks. In this case, we can use Timsort which is good at finding sorted runs. It will deal with them in a linear time.
And Timsort is already implemented in Java, it's used to sort objects (not primitives). So we can take advantage of the well-optimized and thoroughly tested implementation instead of writing our own, which is great. But since we are given an array of primitives, using built-in Timsort would have an additional cost - we need to copy the contents of the array into a list (or array) of wrapper type.
This is a classic problem that can be solved with so-called heapselect, a simple variation on heapsort. It also can be solved with quickselect, but like quicksort has poor quadratic worst-case time complexity.
Simply keep a priority queue, implemented as binary heap, of size k of the k smallest values. Walk through the array, and insert values into the heap (worst case O(log k)). When the priority queue is too large, delete the minimum value at the root (worst case O(log k)). After going through the n array elements, you have removed the n-k smallest elements, so the k largest elements remain. It's easy to see the worst-case time complexity is O(n log k), which is faster than O(n log n) at the cost of only O(k) space for the heap.
Here is one idea. I will think for creating array (int) with max size (2147483647) as it is max value of int (2147483647). Then for every number in for-each that I get from the original array just put the same index (as the number) +1 inside the empty array that I created.
So in the end of this for each I will have something like [1,0,2,0,3] (array that I created) which represent numbers [0, 2, 2, 4, 4, 4] (initial array).
So to find the K biggest elements you can make backward for over the created array and count back from K to 0 every time when you have different element then 0. If you have for example 2 you have to count this number 2 times.
The limitation of this approach is that it works only with integers because of the nature of the array...
Also the representation of int in java is -2147483648 to 2147483647 which mean that in the array that need to be created only the positive numbers can be placed.
NOTE: if you know that there is max number of the int then you can lower the created array size with that max number. For example if the max int is 1000 then your array which you need to create is with size 1000 and then this algorithm should perform very fast.
I think you misunderstood what you needed to sort.
You need to keep the K-sized list sorted, you don't need to sort the original N-sized input array. That way the time complexity would be O(N * log(K)) in the worst case (assuming you need to update the K-sized list almost every time).
The requirements said that N was very large, but K is much smaller, so O(N * log(K)) is also smaller than O(N * log(N)).
You only need to update the K-sized list for each record that is larger than the K-th largest element before it. For a randomly distributed list with N much larger than K, that will be negligible, so the time complexity will be closer to O(N).
For the K-sized list, you can take a look at the implementation of Is there a PriorityQueue implementation with fixed capacity and custom comparator? , which uses a PriorityQueue with some additional logic around it.
There is an algorithm to do this in worst-case time complexity O(n*log(k)) with very benign time constants (since there is just one pass through the original array, and the inner part that contributes to the log(k) is only accessed relatively seldomly if the input data is well-behaved).
Initialize a priority queue implemented with a binary heap A of maximum size k (internally using an array for storage). In the worst case, this has O(log(k)) for inserting, deleting and searching/manipulating the minimum element (in fact, retrieving the minimum is O(1)).
Iterate through the original unsorted array, and for each value v:
If A is not yet full then
insert v into A,
else, if v>min(A) then (*)
insert v into A,
remove the lowest value from A.
(*) Note that A can return repeated values if some of the highest k values occur repeatedly in the source set. You can avoid that by a search operation to make sure that v is not yet in A. You'd also want to find a suitable data structure for that (as the priority queue has linear complexity), i.e. a secondary hash table or balanced binary search tree or something like that, both of which are available in java.util.
The java.util.PriorityQueue helpfully guarantees the time complexity of its operations:
this implementation provides O(log(n)) time for the enqueing and dequeing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).
Note that as laid out above, we only ever remove the lowest (first) element from A, so we enjoy the O(log(k)) for that. If you want to avoid duplicates as mentioned above, then you also need to search for any new value added to it (with O(k)), which opens you up to a worst-case overall scenario of O(n*k) instead of O(n*log(k)) in case of a pre-sorted input array, where every single element v causes the inner loop to fire.

Count distinct elements in an array in O(n) only with loops and arrays in java

I found this solution
https://www.geeksforgeeks.org/count-distinct-elements-in-an-array/
The problem is that time complexity must be at O(n), space complexity must be at O(1), but i can't import any additional libraries and the code must be maximally short. I wasn't able to find a solution with sorting faster than O(nlog n), so i guess i need to find a clever way. And the answer is the third solution from the link above, but it requires additional library. Is it even possible to find a better way?
Edit:
In fact, i need to create a function that works exactly like
java.util.Arrays.stream(myarray).distinct().count();
It must have time complexity at O(n) and space complexity at O(1).
Basically i have to create it using only loops, arrays and if statements. Also it is forbidden to import anything other than import java.util.Scanner; and because of that i can't do it with any ready to use methods like java.util.Arrays.*;.
For example:
Input:
{1,12,3,0,1,3,15,6}
Output:
6
Maximally short solution with O(n) time complexity, using only Java 8+ built-in APIs, i.e. no additional libraries needed.
The code assumes myarray is an array of int, long, double, or object1.
long count = java.util.Arrays.stream(myarray).distinct().count();
1) Object must have valid equals() and hashCode() implementation.
A solution in O(n) time complexity and O(1) space complexity is possible in theory, but it might not be very practical. The basic idea is this:
let aMin be the minimum value of an entry in arr
let aMax be the maximum value of an entry in arr
let seenOnce and seenTwice be boolean arrays
whose indices are in the range [aMin..aMax]
initialize all elements of seenOnce and seenTwice to FALSE
countUnique = 0;
for a in arr {
if (!seenOnce[a - aMin]) {
// seeing `a` for the first time, so count it
seenOnce[a - aMin] = TRUE
countUnique = countUnique + 1
} else if (!seenTwice[a - aMin]) {
// seeing `a` for a second time, so un-count it
countUnique = countUnique - 1
seenTwice[a - aMin] = TRUE
}
}
If the values in arr could be any ints at all, then each of the boolean arrays will contain 2^32 entries, for a total of over 8 billion booleans. That's 1Gb of memory, provided we're careful to implement all those booleans in one bit each. But it is O(1): same 1Gb consumed regardless of whether arr contains two elements or a billion...

More efficient alternative to these "for" loops?

I'm taking an introductory course to Java and one of my latest projects involve making sure an array doesn't contain any duplicate elements (has distinct elements). I used a for loop with an inner for loop, and it works, but I've heard that you should try to avoid using many iterations in a program (and other methods in my classes have a fair number of iterations as well). Is there any efficient alternative to this code? I'm not asking for code of course, just "concepts." Would there potentially be a recursive way to do this? Thanks!
The array sizes are generally <= 10.
/** Iterates through a String array ARRAY to see if each element in ARRAY is
* distinct. Returns false if ARRAY contains duplicates. */
boolean distinctElements(String[] array) { //Efficient?
for (int i = 0; i < array.length; i += 1) {
for (int j = i + 1; j < array.length; j += 1) {
if (array[i] == array[j]) {
return false;
}
}
} return true;
}
"Efficiency" is almost always a trade-off. Occasionally, there are algorithms that are simply better than others, but often they are only better in certain circumstances.
For example, this code above: it's got time complexity O(n^2).
One improvement might be to sort the strings: you can then compare the strings by comparing if an element is equal to its neighbours. The time complexity here is reduced to O(n log n), because of the sorting, which dominates the linear comparison of elements.
However - what if you don't want to change the elements of the array - for instance, some other bit of your code relies on them being in their original order - now you also have to copy the array and then sort it, and then look for duplicates. This doesn't increase the overall time or storage complexity, but it does increase the overall time and storage, since more work is being done and more memory is required.
Big-oh notation only gives you a bound on the time ignoring multiplicative factors. Maybe you only have access to a really slow sorting algorithm: actually, it turns out to be faster just to use your O(n^2) loops, because then you don't have to invoke the very slow sort.
This could be the case when you have very small inputs. An oft-cited example of an algorithm that has poor time complexity but actually is useful in practice is Bubble Sort: it's O(n^2) in the worst case, but if you have a small and/or nearly-sorted array, it can actually be pretty darn fast, and pretty darn simple to implement - never forget the inefficiency of you having to write and debug the code, and to have to ask questions on SO when it doesn't work as you expect.
What if you know that the elements are already sorted, because you know something about their source. Now you can simply iterate through the array, comparing neighbours, and the time complexity is now O(n). I can't remember where I read it, but I once saw a blog post saying (I paraphrase):
A given computer can never be made to go quicker; it can only ever do less work.
If you can exploit some property to do less work, that improves your efficiency.
So, efficiency is a subjective criterion:
Whenever you ask "is this efficient", you have to be able to answer the question: "efficient with respect to what?". It might be space; it might be time; it might be how long it takes you to write the code.
You have to know the constraints of the hardware that you're going to run it on - memory, disk, network requirements etc may influence your choices.
You need to know the requirements of the user on whose behalf you are running it. One user might want the results as soon as possible; another user might want the results tomorrow. There is never a need to find a solution better than "good enough" (although that can be a moving goal once the user sees what is possible).
You also have to know what inputs you want it to be efficient for, and what properties of that input you can exploit to avoid unnecessary work.
First, array[i] == array[j] tests reference equality. That's not how you test String(s) for value equality.
I would add each element to a Set. If any element isn't successfully added (because it's a duplicate), Set.add(E) returns false. Something like,
static boolean distinctElements(String[] array) {
Set<String> set = new HashSet<>();
for (String str : array) {
if (!set.add(str)) {
return false;
}
}
return true;
}
You could render the above without a short-circuit like
static boolean distinctElements(String[] array) {
Set<String> set = new HashSet<>(Arrays.asList(array));
return set.size() == array.length;
}

java linkedlist string sort algorithm

So, I have several lists of word pairs and I need to sort them in ascending or descending order. The method I'm using right now is the insertion sort algorithm. And this seems to work fine for the smaller lists. But every time I try to sort a large list, it freezes, with no errors. I tried to see what was going on by debugging with printing out "a was swapped for b"
and you can see it start working and it slows down and eventually stops like the computer just said, "there's too many, I give up". My question is, is there something wrong with my code, or do I simply need to use a more efficient method, and if so, which one and what would it look like?
for (int j=0; j < wordpair_list.size()-1; j++){
for (int i=0; i < wordpair_list.size()-1; i++){
String wordA_1 = wordpair_list.get(i).getWordA();
String wordA_2 = wordpair_list.get(i+1).getWordA();
if (wordA_1.compareToIgnoreCase(wordA_2) < 0){
WordPair temp = wordpair_list.get(i);
wordpair_list.set(i,wordpair_list.get(i+1));
wordpair_list.set(i+1, temp);
}
}
}
that's for descending. all i do for ascending is swap the '>' in the if statement to '<'
I think you are performing bubble sort. As others have pointed out, performing get() and set() operations are expensive with linked lists.
I am not conversant with Java, but it appears that you can use ListIterators to carry out bubble sort in O(N^2)
ListIterator listIterator(int index) Returns a list-iterator of the
elements in this list (in proper sequence), starting at the specified
position in the list. Throws IndexOutOfBoundsException if the
specified index is is out of range (index < 0 || index >= size()).
For bubble sort, you just need to swap the adjacent elements, so you can iterate through the list like an array and keep swapping if needed.
Moreover, you can skip the section of the list that is already sorted. Take a look at a good bubble sort algorithm.
http://en.wikipedia.org/wiki/Bubble_sort
Besides the fact that insertion sort is already O(N^2) algorithm, the access (both get and set) to the item in the linked list by the item index is O(N) operation, making your code O(N^3), in other words extremely slow.
Basically, you have two options:
copy the linked list into a temp array, sort an array using your algorithm (array access by index is O(1), overall algorithm will stay at O(N^2) (unless you choose a better one), create a new sorted linked list from the array.
use some other algorithm that does not need indexed access (for example, it is actually possible to implement insertion sort without indexed operations because you can swap two adjacent items in the linked list, through you will have to use a linked list implementation where you have direct access to the "previous" and "next" links).
First of all this is not insert sort, this is closer to bubble sort. Second, there is nothing wrong with your code, but as expected for this sorting algorithm it is quadratic. Thus for larger input it may take a long time to finish(e.g. for 200 000 elements it may take several minutes).
EDIT: as you are using List the complexity is even higher - up to cubic. As set in a list is not constant. You may try to implement the algorithm with Array to save this added complexity.

Most frequently repeated numbers in a huge list of numbers

I have a file which has a many random integers(around a million) each seperated by a white space. I need to find the top 10 most frequently occurring numbers in that file. What is the most efficient way of doing this in java?
I can think of
1. Create a hash map, key is the integer from the file and the value is the count. For every number in the file, check if that key already exists in the hash map, if yes, value++, else make a new entry in hash
2. Make a BST, each node is the integer from the file. For every integer from the file see if there is a node in the BST if yes, do value++, value is part of the node.
I feel hash map is better option if i can come up with good hashing function,
Can some one pl suggest me what is the best of doing this ? Is there is anyother efficient algo that i can use?
Edit #2:
Okay, I screwed up my own first rule--never optimize prematurely. The worst case for this is probably using a stock HashMap with a wide range--so I just did that. It still runs in like a second, so forget everything else here and just do that.
And I'll make ANOTHER note to myself to ALWAYS test speed before worrying about tricky implementations.
(Below is older obsolete post that could still be valid if someone had MANY more points than a million)
A HashSet would work, but if your integers have a reasonable range (say, 1-1000), it would be more efficient to create an array of 1000 integers, and for each of your million integers, increment that element of the array. (Pretty much the same idea as a HashMap, but optimizing out a few of the unknowns that a Hash has to make allowances for should make it a few times faster).
You could also create a tree. Each node in the tree would contain (value, count) and the tree would be organized by value (lower values on the left, higher on the right). Traverse to your node, if it doesn't exist--insert it--if it does, then just increment the count.
The range and distribution of your values would determine which of these two (or a regular hash) would perform better. I think a regular hash wouldn't have many "winning" cases though (It would have to be a wide range and "grouped" data, and even then the tree might win.
Since this is pretty trivial--I recommend you implement more than one solution and test speeds against the actual data set.
Edit: RE the comment
TreeMap would work, but would still add a layer of indirection (and it's so amazingly easy and fun to implement yourself). If you use the stock implementation, you have to use Integers and convert constantly to and from int for every increase. There is the indirection of the pointer to the Integer, and the fact that you are storing at least 2x as many objects. This doesn't even count any overhead for the method calls since they should be inlined with any luck.
Normally this would be an optimization (evil), but when you start to get near hundreds of thousands of nodes, you occasionally have to ensure efficiency, so the built-in TreeMap is going to be inefficient for the same reasons the built-in HashSet will.
Java handles hashing. You don't need to write a hash function. Just start pushing stuff in the hash map.
Also, if this is something that only needs to run once (or only occasionally), then don't both optimizing. It will be fast enough. Only bother if it's something that's going to run within an application.
HashMap
A million integers is not really a lot, even for interpreted languages, but especially for a speedy language like Java. You'll probably barely even notice the execution time. I'd try this first and move to something more complicated if you deem this too slow.
It will probably take longer to do string splitting and parsing to convert to integers than even the simplest algorithm to find frequencies using a HashMap.
Why use a hashtable? Just use an array that is the same size as the range of your numbers. Then you don't waste time executing the hashing function. Then sort the values after you're done. O(N log N)
Allocate an array / vector of the same size as the number of input items you have
Fill the array from your file with numbers, one number per element
Put the list in order
Iterate through the list and keep track of the the top 10 runs of numbers that you have encountered.
Output the top ten runs at the end.
As a refinement on step 4, you only need to step forward through the array in steps equilivent to your 10th longest run. Any run longer than that will overlap with your sampling. If the tenth longest run is 100 elements long, you only need to sample element 100, 200, 300 and at each point count the run of the integer you find there (both forwards and backwards). Any run longer than your 10th longest is sure to overlap with your sampling.
You should apply this optimisation after your 10th run length is very long compared to other runs in the array.
A map is overkill for this question unless you have very few unique numbers each with a large number of repeats.
NB: Similar to gshauger's answer but fleshed out
If you have to make it as efficient as possible, use an array of ints, with the position representing the value and the content representing the count. That way you avoid autoboxing and unboxing, the most likely killer of a standard Java collection.
If the range of numbers is too large then take a look at PJC and its IntKeyIntMap implementations. It will avoid the autoboxing as well. I don't know if it will be fast enough for you, though.
If the range of numbers is small (e.g. 0-1000), use an array. Otherwise, use a HashMap<Integer, int[]>, where the values are all length 1 arrays. It should be much faster to increment a value in an array of primitives than create a new Integer each time you want to increment a value. You're still creating Integer objects for the keys, but that's hard to avoid. It's not feasible to create an array of 2^31-1 ints, after all.
If all of the input is normalized so you don't have values like 01 instead of 1, use Strings as keys in the map so you don't have to create Integer keys.
Use a HashMap to create your dataset (value-count pairs) in memory as you traverse the file. The HashMap should give you close to O(1) access to the elements while you create the dataset (technically, in the worst case HashMap is O(n)). Once you are done searching the file, use Collections.sort() on the value Collection returned by HashMap.values() to create a sorted list of value-count pairs. Using Collections.sort() is guaranteed O(nLogn).
For example:
public static class Count implements Comparable<Count> {
int value;
int count;
public Count(int value) {
this.value = value;
this.count = 1;
}
public void increment() {
count++;
}
public int compareTo(Count other) {
return other.count - count;
}
}
public static void main(String args[]) throws Exception {
Scanner input = new Scanner(new FileInputStream(new File("...")));
HashMap<Integer, Count> dataset = new HashMap<Integer, Count>();
while (input.hasNextInt()) {
int tempInt = input.nextInt();
Count tempCount = dataset.get(tempInt);
if (tempCount != null) {
tempCount.increment();
} else {
dataset.put(tempInt, new Count(tempInt));
}
}
List<Count> counts = new ArrayList<Count>(dataset.values());
Collections.sort(counts);
Actually, there is an O(n) algorithm for doing exactly what you want to do. Your use case is similar to an LFU cache where the element's access count determines whether it syays in the cache or is evicted from it.
http://dhruvbird.blogspot.com/2009/11/o1-approach-to-lfu-page-replacement.html
This is the source for java.lang.Integer.hashCode(), which is the hashing function that will be used if you store your entries as a HashMap<Integer, Integer>:
public int hashCode() {
return value;
}
So in other words, the (default) hash value of a java.lang.Integer is the integer itself.
What is more efficient than that?
The correct way to do it is with a linked list. When you insert an element, you go down the linked list, if its there you increment the nodes count, otherwise create a new node with count of 1. After you inserted each element, you would have a sorted list of elements in O(n*log(n)).
For your methods, you are doing n inserts and then sorting in O(n*log(n)), so your coefficient on the complexity is higher.

Categories