How to efficiently sort one million elements? - java

I need to compare about 60,000 elements with a list of 935,000 elements, and if they match I need to perform a calculation.
I already implemented everything needed, but the process takes about 40 minutes. I have a unique 7-digit number in both lists. Both the 935,000-element and the 60,000-element files are unsorted. Is it more efficient to sort the big list (and with which sort?) before I try to find the elements? Keep in mind that I have to do this calculation only once a month, so I don't need to repeat the process every day.
Basically which is faster:
unsorted linear search
sort list first and then search with another algorithm

Try it out.
You've got Collections.sort() which will do the heavy lifting for you, and Collections.binarySearch() which will allow you to find the elements in the sorted list.

When you search the unsorted list, you have to look through half the elements on average before you find the one you're looking for. When you do that 60,000 times on a list of 935,000 elements, that works out to about
935,000 * 1/2 * 60,000 = 28,050,000,000 operations
If you sort the list first (using mergesort) it will take about n * log(n) operations. Then you can use binary search to find each element in log(n) lookups for each of the 60,000 elements in your shorter list. That's about
935,000 * log(935,000) + log(935,000) * 60,000 = 19,735,434 operations
It should be a lot faster if you sort the list first, then use a search algorithm that takes advantage of the sorted list.
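A minimal sketch of that combination; the sample values below are invented stand-ins for the real 7-digit numbers:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedLookup {
    public static void main(String[] args) {
        List<Long> big = new ArrayList<>(List.of(9000001L, 1000002L, 5000003L));
        List<Long> small = List.of(5000003L, 1234567L);

        Collections.sort(big);                            // O(n log n), paid once
        for (Long key : small) {
            int pos = Collections.binarySearch(big, key); // O(log n) per lookup
            if (pos >= 0) {
                // match found: perform the calculation here
                System.out.println("match: " + key);
            }
        }
    }
}
```

Note that Collections.binarySearch is only valid on a list already sorted in natural order, which is exactly what the preceding sort guarantees.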

What would work quite well is to sort both lists and then iterate over both at the same time.
Use Collections.sort() to sort both lists.
You start with an index into each sorted list and basically walk straight through them. You start with the first element of the short list and compare it to the first elements of the long list. When you reach an element in the long list with a higher 7-digit number than the current number in the short list, you increment your index into the short list. This way no element needs to be checked twice.
But actually, since you want to find the intersection of two lists, you might be better off just using longList.retainAll(shortList) to get the intersection of the two lists. Be aware that retainAll on a plain list does a contains() scan for every element, which is O(n*m); wrap the short list in a HashSet first so that each membership test is O(1).
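A small sketch of that idea, with invented sample values; wrapping the short list in a HashSet keeps each membership test O(1):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Intersection {
    public static void main(String[] args) {
        List<Long> longList = new ArrayList<>(List.of(1000002L, 5000003L, 9000001L));
        List<Long> shortList = List.of(5000003L, 1234567L);

        // Keep only the elements also present in shortList; the HashSet
        // makes each contains() check constant time.
        longList.retainAll(new HashSet<>(shortList));
        System.out.println(longList); // [5000003]
    }
}
```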

You can sort both lists and compare them element by element, incrementing the first or second index (i and j in the example below) as needed:
List<Long> first = ...;
List<Long> second = ...;
Collections.sort(first);
Collections.sort(second);
int i = 0;
int j = 0;
while (i < first.size() && j < second.size()) {
    int cmp = first.get(i).compareTo(second.get(j));
    if (cmp == 0) {
        // Action for equal elements; advance both indices
        i++;
        j++;
    } else if (cmp > 0) {
        j++;
    } else {
        i++;
    }
}
The complexity of this code is O(n log(n)), where n is the size of the larger list; the two sorts dominate, while the merge-style comparison itself is linear.

Related

Adding the last n element of a sorted set to another

I have two TreeSets, A and B. I want to merge the last n elements of B into A in an efficient way, i.e. using merging of sorted sets.
Is there an efficient way to do this implemented in Java?
The only way I see is to find the n-th largest element of B, b_n, get the tailSet view, then call A.addAll(B.tailSet(b_n)), but this is not good enough for us, as it requires an additional n iterations over B, and calling tailSet() is also not free.
The optimal scenario would be something like A.addFromTail(B, n), using the same merging technique as addAll but stopping after adding n elements.
TreeSet has a descendingIterator which allows you to do something like
Iterator<String> iterator = b.descendingIterator();
for (int i = 0; i < Math.min(n, b.size()); i++) { // Math.min guards against n > b.size()
    a.add(iterator.next());
}
Another way would be to use the Stream API (since Java 8) with
Iterator<String> iterator = b.descendingIterator();
a.addAll(Stream.generate(iterator::next).limit(Math.min(n, b.size())).collect(Collectors.toList()));
TreeSet.size() has a complexity of O(1), so this stays lightweight and you only traverse the last n elements of B.

How to constantly find the two smallest values from two different lists?

I have two different lists that contain integers and I need to constantly find the two smallest values between these two lists; I should note that I do NOT want to merge these two lists together, since they are different types.
I would like to know if my approach is good or bad. If it is bad, please let me know how I can make it more efficient.
Constantly keep both lists sorted in descending order, so the mins will be at the bottom
Find the two mins from list1, compare them with the two mins from list2, and take the two smallest of those four values
Remove the two mins from the associated list(s), combine their values (required), and add the result to list2
I am essentially performing a portion of the Huffman code, where I want to have a list of the frequency of chars in descending order.
Finding a min in a List can be done in linear time without any sorting. Sorting and then finding the min on every iteration will be O(m*n log n), where m is the number of iterations and n the size of the list.
A better way would be to use a PriorityQueue (min-heap), where the min is always at the top of the heap, instead of sorting on every iteration.
Using a min-heap is the standard approach when implementing Huffman codes and greedy algorithms in general.
Although this would definitely work, the task of keeping the lists sorted at all times should be a reason for concern:
If your lists allow random access (e.g. ArrayList), then deleting from them costs you O(n)
If your lists allow O(1) deletions (e.g. LinkedList), then finding the insertion spot costs you O(n)
That is on top of the initial sorting, which would cost you O(n log n). In other words, there is no advantage to sorting your lists in the first place: maintaining them would cost you O(n) per operation, so you might as well do linear searches.
In other words, the algorithm works, but it is inefficient. Instead of maintaining sorted lists, use containers that maintain minimum or maximum for you, and allow for fast insertions/deletions (e.g. PriorityQueue).
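As a rough illustration of the heap-based approach (a hypothetical sketch with made-up values, not the asker's real frequency data): keep each list in its own PriorityQueue, pop the two overall-smallest values by comparing the heads, and push the combined value back, as in Huffman's algorithm.

```java
import java.util.List;
import java.util.PriorityQueue;

public class TwoListMins {
    // Poll the smaller of the two heap heads; assumes at least one heap is non-empty.
    static int popSmaller(PriorityQueue<Integer> a, PriorityQueue<Integer> b) {
        if (b.isEmpty() || (!a.isEmpty() && a.peek() <= b.peek())) return a.poll();
        return b.poll();
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> q1 = new PriorityQueue<>(List.of(5, 3, 9));
        PriorityQueue<Integer> q2 = new PriorityQueue<>(List.of(4, 7));

        int first = popSmaller(q1, q2);   // smallest across both heaps
        int second = popSmaller(q1, q2);  // second-smallest across both heaps
        q2.add(first + second);           // combine and reinsert, Huffman-style
        System.out.println(first + " " + second); // 3 4
    }
}
```

Both poll() and add() are O(log n), so each combine step is logarithmic instead of a full re-sort.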
Why don't you keep your min values in variables instead of keeping sorted lists?
List<Integer> list1, list2;
int min1 = Integer.MAX_VALUE, min2 = Integer.MAX_VALUE;
void setMin(int newValue) {
    if (newValue < min1) {
        min2 = min1;   // the old minimum becomes the second-smallest
        min1 = newValue;
    } else if (newValue < min2) {
        min2 = newValue;
    }
}
void updateList1(int newValue) {
setMin(newValue);
list1.add(newValue);
}
void updateList2(int newValue) {
setMin(newValue);
list2.add(newValue);
}

java linkedlist string sort algorithm

So, I have several lists of word pairs and I need to sort them in ascending or descending order. The method I'm using right now is the insertion sort algorithm, and this seems to work fine for the smaller lists. But every time I try to sort a large list, it freezes, with no errors. I tried to see what was going on by printing out "a was swapped for b" while debugging,
and you can see it start working, then it slows down and eventually stops, like the computer just said, "there's too many, I give up". My question is, is there something wrong with my code, or do I simply need to use a more efficient method, and if so, which one and what would it look like?
for (int j=0; j < wordpair_list.size()-1; j++){
for (int i=0; i < wordpair_list.size()-1; i++){
String wordA_1 = wordpair_list.get(i).getWordA();
String wordA_2 = wordpair_list.get(i+1).getWordA();
if (wordA_1.compareToIgnoreCase(wordA_2) < 0){
WordPair temp = wordpair_list.get(i);
wordpair_list.set(i,wordpair_list.get(i+1));
wordpair_list.set(i+1, temp);
}
}
}
That's for descending. All I do for ascending is swap the '<' in the if statement to '>'.
I think you are performing bubble sort. As others have pointed out, performing get() and set() operations are expensive with linked lists.
I am not conversant with Java, but it appears that you can use ListIterators to carry out bubble sort in O(N^2)
ListIterator listIterator(int index) Returns a list-iterator of the
elements in this list (in proper sequence), starting at the specified
position in the list. Throws IndexOutOfBoundsException if the
specified index is out of range (index < 0 || index >= size()).
For bubble sort, you just need to swap the adjacent elements, so you can iterate through the list like an array and keep swapping if needed.
Moreover, you can skip the section of the list that is already sorted. Take a look at a good bubble sort algorithm.
http://en.wikipedia.org/wiki/Bubble_sort
Besides the fact that this sort is already an O(N^2) algorithm, access (both get and set) to an item in a linked list by index is an O(N) operation, making your code O(N^3), in other words extremely slow.
Basically, you have two options:
copy the linked list into a temporary array, sort the array using your algorithm (array access by index is O(1), so the overall algorithm stays at O(N^2), unless you choose a better one), then create a new sorted linked list from the array.
use some other algorithm that does not need indexed access (for example, it is actually possible to implement insertion sort without indexed operations, because you can swap two adjacent items in the linked list, though you will have to use a linked-list implementation that gives you direct access to the "previous" and "next" links).
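The first option might be sketched like this, using plain strings as made-up sample data in place of the asker's WordPair objects (Collections.sort(list) does essentially the same array round-trip internally):

```java
import java.util.Arrays;
import java.util.LinkedList;

public class SortLinkedList {
    public static void main(String[] args) {
        LinkedList<String> words = new LinkedList<>(Arrays.asList("pear", "Apple", "banana"));

        String[] tmp = words.toArray(new String[0]);     // O(n) copy out of the list
        Arrays.sort(tmp, String.CASE_INSENSITIVE_ORDER); // O(n log n); indexed access is O(1)

        words.clear();
        words.addAll(Arrays.asList(tmp));                // O(n) rebuild of the list
        System.out.println(words); // [Apple, banana, pear]
    }
}
```

Reverse the comparator with Collections.reverseOrder(String.CASE_INSENSITIVE_ORDER) for descending order.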
First of all, this is not insertion sort; it is closer to bubble sort. Second, there is nothing wrong with your code, but as expected for this sorting algorithm it is quadratic. Thus for larger input it may take a long time to finish (e.g. for 200,000 elements it may take several minutes).
EDIT: since you are using a linked List, the complexity is even higher, up to cubic, because get and set on a linked list are not constant time. You may try to implement the algorithm with an array to avoid this added complexity.

removing duplicate strings from a massive array in java efficiently?

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings; the array contains millions or tens of millions of strings. The array is already prepopulated, so the optimization goal is only removing dups, not preventing dups from initially populating it!
I was thinking along the lines of doing a sort and then a binary search to get a log(n) search instead of an n (linear) search. This would give me n log n + n operations, which, although better than an unsorted (n^2) search, still seems slow. (I was also considering hashing, but I am not sure about the throughput.)
Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!
Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:
HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));
If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
ANALYSIS
Let's perform some analysis:
Using HashSet. Time complexity - O(n). Space complexity O(n). Note that it requires extra memory on the order of 8 * array size bytes for the references (a reference is 4-8 bytes), plus per-entry object overhead.
Quick Sort. Time - O(n*log n). Space O(log n) (the worst case O(n*n) and O(n) respectively).
Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n)
Heap Sort. Time O(n * log n). Space O(1). (but it is slower than 2 and 3).
In the case of Heap Sort you can throw away duplicates on the fly, so you'll save a final pass after sorting.
CONCLUSION
If time is your concern, and you don't mind allocating 8 * array.length bytes for a HashSet - this solution seems to be optimal.
If space is a concern - then QuickSort + one pass.
If space is a big concern - implement a heap that throws away duplicates on the fly. It's still O(n * log n) but without additional space.
I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in-place if needed (in this case in-place implementation is a bit harder than with normal mergesort because adjacent parts could contain gaps from the removed duplicates which also need to be closed when merging).
For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort
Creating a hash set to handle this task is way too expensive. In fact, the whole point of them telling you not to use the Collections API is probably that they don't want to hear the word "hash". That leaves the code below.
Note that you offered them binary search AFTER sorting the array: that makes no sense, because once the array is sorted the duplicates are adjacent and a single linear pass finds them all. That may be the reason your proposal was rejected.
OPTION 1:
public static void removeDuplicates(String[] input){
    Arrays.sort(input); // use mergesort/quicksort here: n log n
    for (int i = 1; i < input.length; i++) {
        if (input[i-1].equals(input[i])) // use equals, not ==, to compare Strings
            input[i-1] = null;
    }
}
OPTION 2:
public static String[] removeDuplicates(String[] input){
    if (input.length == 0) return input;
    Arrays.sort(input); // use mergesort here: n log n
    int size = 1;
    for (int i = 1; i < input.length; i++) {
        if (!input[i-1].equals(input[i])) // use equals, not !=, to compare Strings
            size++;
    }
    String[] output = new String[size];
    output[0] = input[0];
    int n = 1;
    for (int i = 1; i < input.length; i++)
        if (!input[i-1].equals(input[i]))
            output[n++] = input[i];
    // final step: either return output or copy output into input;
    // here I just return output
    return output;
}
OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array; if that is unacceptable, you must make a copy.
public static String[] removeDuplicates(String[] input){
    if (input.length == 0) return input;
    Arrays.sort(input); // use mergesort/quicksort here: n log n
    int outputLength = 1; // the last element of each run of duplicates always survives
    for (int i = 1; i < input.length; i++) {
        // equals is safer than ==; this assumes no nulls in the input
        if (input[i-1].equals(input[i]))
            input[i-1] = null;
        else
            outputLength++;
    }
    // check if there were zero duplicates
    if (outputLength == input.length)
        return input;
    String[] output = new String[outputLength];
    int idx = 0;
    for (int i = 0; i < input.length; i++)
        if (input[i] != null)
            output[idx++] = input[i];
    return output;
}
Do you need to keep them in an array? It would be faster to use a hash-based collection such as a Set, where each value is unique because of its hash value.
If you put all entries into a set collection type, you can use the
HashSet(int initialCapacity)
constructor to prevent resizing at runtime:
Set<String> mySet = new HashSet<>(Arrays.asList(someArray));
Arrays.asList() itself is O(1), since it merely wraps the array; copying the elements into the HashSet is O(n), and the copy constructor already sizes the internal table from the collection's size.
Since this is an interview question, I think they want you to come up with your own implementation instead of using the Set API.
Instead of sorting first and comparing again, you can build a binary tree and create an empty array to store the result.
The first element of the array becomes the root.
If the next element is equal to the current node, return. -> this removes the duplicate elements
If the next element is less than the node, compare it to the left child; else compare it to the right child.
Keep doing the above two steps until you reach the bottom of the tree; then you can create a new node, knowing it has no duplicate yet.
Insert this new node's value into the result array.
After traversing all elements of the original array, you get a new copy of the array with no duplicates, in the original order.
Traversing takes O(n) and searching the binary tree takes O(log n) (insertion should only take O(1) since you are just attaching a node, not re-allocating/balancing the tree), so the total should be O(n log n) on average; a degenerate, unbalanced tree makes it O(n^2) in the worst case.
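A rough, hedged sketch of that BST-based de-duplication (the class name BstDedup and the sample values are made up for illustration, and the tree is left unbalanced):

```java
import java.util.Arrays;

public class BstDedup {
    static class Node {
        String value;
        Node left, right;
        Node(String v) { value = v; }
    }

    // Returns true if value was newly inserted, false if it was a duplicate.
    static boolean insert(Node root, String value) {
        int cmp = value.compareTo(root.value);
        if (cmp == 0) return false;                 // duplicate: skip it
        Node child = (cmp < 0) ? root.left : root.right;
        if (child == null) {
            if (cmp < 0) root.left = new Node(value); else root.right = new Node(value);
            return true;
        }
        return insert(child, value);
    }

    static String[] dedup(String[] input) {
        if (input.length == 0) return input;
        Node root = new Node(input[0]);
        String[] out = new String[input.length];
        int n = 0;
        out[n++] = input[0];
        for (int i = 1; i < input.length; i++)
            if (insert(root, input[i]))
                out[n++] = input[i];                // keep first occurrence, original order
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(dedup(new String[]{"b", "a", "b", "c", "a"})));
        // [b, a, c]
    }
}
```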
O.K., if they want super speed, let's use the hash codes of the Strings as much as possible.
Loop through the array, get the hash code for each String, and record it in your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for non-negative hash codes and one for negative ones, and each will be huge.
Loop through the array again, with another BitSet where true means the String passes. If the hash code of the String was not yet marked in the first BitSet, you can mark the String as true (unique). Else, mark it as a possible duplicate, i.e. false. While you are at it, count the possible duplicates.
Collect all the possible duplicates into a big String[], named possibleDuplicates, and sort it.
Now go through the possible duplicates in the original array and binary-search in possibleDuplicates. If present, well, you are still stuck, because you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...

Find nearest number in unordered array

Given a large unordered array of long random numbers and a target long, what's the most efficient algorithm for finding the closest number?
@Test
public void findNearest() throws Exception {
    final long[] numbers = {90L, 10L, 30L, 50L, 70L};
    Assert.assertEquals("nearest", 10L, findNearest(numbers, 12L));
}
Iterate through the array of longs once. Store the current closest number and the distance to that number. Continue checking each number if it is closer, and just replace the current closest number when you encounter a closer number.
This gets you the best possible performance, O(n).
Building a binary tree as suggested by another answerer will take O(n log n). Of course, future searches will only take O(log n)... so it may be worth it if you do a lot of searches.
If you are a pro, you can parallelize this with OpenMP or a thread library, but I am guessing that is out of the scope of your question.
If you do not intend to do multiple such requests on the array, there is no better way than the brute-force linear-time check of each number.
If you will do multiple requests on the same array, first sort it and then do a binary search on it. This reduces the time per request to O(log(n)), but you still pay O(n*log(n)) for the sort, so it is only reasonable if the number of requests k is reasonably large, i.e. k*n >> (is a lot bigger than) n*log(n) + k*log(n).
If the array will change, then create a binary search tree and do a lower-bound request on it. This is again only reasonable if the number of nearest-number requests is relatively large compared with the number of array-change requests and the number of elements. As the cost of building the tree is O(n*log(n)) and the cost of updating it is O(log(n)), you need k*log(n) + n*log(n) + k*log(n) << (to be a lot smaller than) k*n.
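The binary-search-tree / lower-bound idea can be sketched with TreeSet, whose floor() and ceiling() methods return the nearest neighbor on each side in O(log n) (sample values invented for illustration):

```java
import java.util.Arrays;
import java.util.TreeSet;

public class NearestViaTree {
    public static void main(String[] args) {
        TreeSet<Long> set = new TreeSet<>(Arrays.asList(90L, 10L, 30L, 50L, 70L));
        long target = 12L;

        Long below = set.floor(target);   // greatest element <= target
        Long above = set.ceiling(target); // least element >= target
        long nearest = (above == null || (below != null && target - below <= above - target))
                ? below : above;
        System.out.println(nearest); // 10
    }
}
```

add() and remove() on the TreeSet are also O(log n), which is what makes this attractive when the array keeps changing.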
IMHO, I think you should use a Binary Heap (http://en.wikipedia.org/wiki/Binary_heap), which has an insertion time of O(log n), i.e. O(n log n) for the entire array. For me, the coolest thing about the binary heap is that it can be built inside your own array, without overhead. Take a look at the heapify section.
"Heapifying" your array makes it possible to get the biggest/smallest element in O(1).
If you build a binary search tree from your numbers and search against it, O(log n) would be the complexity in the worst case (for a balanced tree). In your case you won't search for equality; instead, you'll look for the smallest absolute difference.
I would check the difference between the numbers while iterating through the array and save the minimum difference seen so far.
If you plan to use findNearest multiple times, I would sort the array first (with a sorting algorithm of complexity n*log(n)), re-sorting after each change of values in the array, and then answer each query with binary search.
The time complexity of this job is O(n), n being the length of numbers.
final long[] numbers = {90L, 10L, 30L, 50L, 70L};
long tofind = 12L;
long delta = Long.MAX_VALUE;
int index = -1;
int i = 0;
while (i < numbers.length) {
    long tmp = Math.abs(tofind - numbers[i]); // primitive long, no boxing needed
    if (tmp < delta) {
        delta = tmp;
        index = i;
    }
    i++;
}
System.out.println(numbers[index]); // if index is not -1
But if you want to search many times with different values (such as 12L) against the same numbers array, you may sort the array first and binary-search against the sorted array.
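That sort-once, query-many approach might look like the sketch below; findNearestSorted is a hypothetical helper, and it relies on Arrays.binarySearch returning (-(insertion point) - 1) on a miss:

```java
import java.util.Arrays;

public class NearestAfterSort {
    static long findNearestSorted(long[] sorted, long target) {
        int pos = Arrays.binarySearch(sorted, target);
        if (pos >= 0) return sorted[pos];            // exact hit
        int ins = -pos - 1;                          // insertion point
        if (ins == 0) return sorted[0];              // target below all elements
        if (ins == sorted.length) return sorted[sorted.length - 1]; // above all
        long left = sorted[ins - 1], right = sorted[ins];
        return (target - left <= right - target) ? left : right;
    }

    public static void main(String[] args) {
        long[] numbers = {90L, 10L, 30L, 50L, 70L};
        Arrays.sort(numbers);                        // pay O(n log n) once
        System.out.println(findNearestSorted(numbers, 12L)); // 10
    }
}
```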
If your search is a one-off, you can partition the array like in quicksort, using the input value as pivot.
If you keep track - while partitioning - of the min item in the right half, and the max item in the left half, you should have it in O(n) and 1 single pass over the array.
I'd say it's not possible to do it in less than O(n) since it's not sorted and you have to scan the input at the very least.
If you need to do many subsequent search, then a BST could help indeed.
You could do it in the steps below:
Step 1: Sort the array
Step 2: Find the insertion index of the search element (e.g. with binary search)
Step 3: Based on that index, compare the numbers on its left and right sides and return the closer one
