java - Remove nearly duplicates from a List

I have a List of Tweet objects (a homegrown class) and I want to remove NEAR duplicates based on their text, using the Levenshtein distance. I have already removed the identical duplicates by hashing the tweets' texts, but now I want to remove texts that are the same apart from 2-3 differing characters. Since this is an O(n^2) approach, I have to check every single tweet text against all the others. Here's my code so far:
int distance;
for(Tweet tweet : this.tweets) {
distance = 0;
Iterator<Tweet> iter = this.tweets.iterator();
while(iter.hasNext()) {
Tweet currentTweet = iter.next();
distance = Levenshtein.distance(tweet.getText(), currentTweet.getText());
if(distance < 3 && (tweet.getID() != currentTweet.getID())) {
iter.remove();
}
}
}
The first problem is that the code throws ConcurrentModificationException at some point and never completes. The second one: can I do anything better than this double loop? The list contains nearly 400,000 tweets, so we're talking about 160 billion iterations!

This solution works for the question at hand (so far tested with the possible inputs), but the normal Set operations for removing duplicates won't work reliably unless you implement the full compare contract and return 1, 0 and -1.
Why don't you implement your own compare operation and use a Set, which can hold only distinct values? It is going to be O(n log(n)).
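Set<Tweet> set = new TreeSet<Tweet>(new Comparator<Tweet>() {
@Override
public int compare(Tweet first, Tweet second) {
int distance = Levenshtein.distance(first.getText(), second.getText());
// Treat texts fewer than 3 edits apart as "equal" so the set keeps only one of them.
if(distance < 3){
return 0;
}
// Never returns -1, so this does not satisfy the full Comparator contract.
return 1;
}
});
set.addAll(this.tweets);
this.tweets = new ArrayList<Tweet>(set);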

As for the ConcurrentModificationException: As the others pointed out, I was removing elements from a list that I was also iterating in the outer for-each. Changing the for-each into a normal for resolved the problem.
As for the O(n^2) approach: There's no "better" algorithm, regarding complexity, than an O(n^2) approach. What I'm trying to do is an "all-to-all" comparison to find nearly duplicate elements. Of course there are optimizations that reduce the effective n, and parallelization that processes sub-lists of the original list concurrently, but the complexity stays quadratic.
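One way to avoid the ConcurrentModificationException altogether is to never remove from the list being iterated, and instead build a separate list of tweets to keep. The following is only a sketch under assumptions: it works on plain strings rather than the Tweet class, NearDuplicateFilter and its methods are hypothetical names, and the levenshtein method is a stand-in for the Levenshtein.distance helper used in the question. It is still quadratic in the worst case, but it skips pairs whose lengths differ by at least the threshold, since the edit distance can never be smaller than the length difference.

```java
import java.util.ArrayList;
import java.util.List;

final class NearDuplicateFilter {

    // Classic two-row dynamic-programming Levenshtein distance
    // (stand-in for the question's Levenshtein.distance helper).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keeps the first text of every group of near duplicates.
    // A text is dropped when its distance to an already kept text is below threshold.
    static List<String> removeNearDuplicates(List<String> texts, int threshold) {
        List<String> kept = new ArrayList<>();
        for (String candidate : texts) {
            boolean duplicate = false;
            for (String k : kept) {
                // The distance is at least the length difference, so skip hopeless pairs cheaply.
                if (Math.abs(k.length() - candidate.length()) >= threshold) {
                    continue;
                }
                if (levenshtein(k, candidate) < threshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Calling NearDuplicateFilter.removeNearDuplicates(texts, 3) mirrors the distance < 3 test from the question. Because each candidate is compared only against the kept texts, the work shrinks as more near duplicates are found.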

Related

Combination Java Performance

I want to use this function with a large number of possibilities, like 700 integers, but the function takes too long to execute. Does anyone have an idea for improving the performance? Thank you :)
public static Set<Set<Integer>> combinations(List<Integer> groupSize, int k) {
Set<Set<Integer>> allCombos = new HashSet<Set<Integer>> ();
// base cases for recursion
if (k == 0) {
// There is only one combination of size 0, the empty team.
allCombos.add(new HashSet<Integer>());
return allCombos;
}
if (k > groupSize.size()) {
// There can be no teams with size larger than the group size,
// so return allCombos without putting any teams in it.
return allCombos;
}
// Create a copy of the group with one item removed.
List<Integer> groupWithoutX = new ArrayList<Integer> (groupSize);
Integer x = groupWithoutX.remove(groupWithoutX.size() - 1);
Set<Set<Integer>> combosWithoutX = combinations(groupWithoutX, k);
Set<Set<Integer>> combosWithX = combinations(groupWithoutX, k - 1);
for (Set<Integer> combo : combosWithX) {
combo.add(x);
}
allCombos.addAll(combosWithoutX);
allCombos.addAll(combosWithX);
return allCombos;
}

What features of Set are you going to need to use on the returned value?
If you only need some of them - perhaps just iterator() or contains(...) - then you could consider returning an Iterator which calculates the combinations on the fly.
There's an interesting mechanism to generate the nth combination of a lexicographically ordered set here.
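As a rough illustration of computing combinations on the fly, the sketch below keeps an array of k increasing indices and advances it to the next combination in lexicographic order on each call to next(). This is a plain successor-style iterator, not the nth-combination mechanism mentioned above, and CombinationIterator is a hypothetical name. It avoids holding all combinations in memory, but it does not reduce how many combinations there are.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class CombinationIterator<T> implements Iterator<List<T>> {
    private final List<T> items;
    private final int[] indices; // positions of the current combination, strictly increasing
    private boolean hasNext;

    CombinationIterator(List<T> items, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        this.items = new ArrayList<>(items);
        this.indices = new int[k];
        for (int i = 0; i < k; i++) {
            indices[i] = i;
        }
        // There is no combination when k exceeds the number of items.
        this.hasNext = k <= items.size();
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    @Override
    public List<T> next() {
        if (!hasNext) {
            throw new NoSuchElementException();
        }
        List<T> combo = new ArrayList<>(indices.length);
        for (int idx : indices) {
            combo.add(items.get(idx));
        }
        advance();
        return combo;
    }

    // Moves indices to the next combination in lexicographic order.
    private void advance() {
        int k = indices.length;
        int n = items.size();
        int i = k - 1;
        // Find the rightmost index that can still be incremented.
        while (i >= 0 && indices[i] == n - k + i) {
            i--;
        }
        if (i < 0) {
            hasNext = false;
            return;
        }
        indices[i]++;
        for (int j = i + 1; j < k; j++) {
            indices[j] = indices[j - 1] + 1;
        }
    }
}
```

A caller would loop with while (it.hasNext()) and process each combination immediately, so only one combination is alive at a time.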

Other data structure. You could try a BitSet instead of the Set<Integer>. If the integer values span a wide range (negative values, large gaps), use each value's index in groupSize instead.
Using indices instead of integer values has another advantage: all subsets can be enumerated as bit patterns in a for-loop (with a BigInteger as the set).
No data. Alternatively, make an iterator (or Stream) over all combinations and apply your processing methods to each one in turn.
Concurrency. Parallelism would only gain a factor of 4 to 8; ThreadPoolExecutor and Future, maybe.
Optimizing the algorithm itself. The set of sets would be better as a List, which makes adding a set much cheaper.
It also shows whether the algorithm erroneously creates identical sets.

Is it better to use arrays or a queue when merging two sorted arrays?

I'm working on a programming practice site that asked to implement a method that merges two sorted arrays. This is my solution:
public static int[] merge(int[] arrLeft, int[] arrRight){
int[] merged = new int[arrRight.length + arrLeft.length];
Queue<Integer> leftQueue = new LinkedList<>();
Queue<Integer> rightQueue = new LinkedList<>();
for(int i = 0; i < arrLeft.length ; i ++){
leftQueue.add(arrLeft[i]);
}
for(int i = 0; i < arrRight.length; i ++){
rightQueue.add(arrRight[i]);
}
int index = 0;
while (!leftQueue.isEmpty() || !rightQueue.isEmpty()){
int largerLeft = leftQueue.isEmpty() ? Integer.MAX_VALUE : leftQueue.peek();
int largerRight = rightQueue.isEmpty() ? Integer.MAX_VALUE : rightQueue.peek();
if(largerLeft > largerRight){
merged[index] = largerRight;
rightQueue.poll();
} else{
merged[index] = largerLeft;
leftQueue.poll();
}
index ++;
}
return merged;
}
But this is the official solution:
public static int[] merge(int[] arrLeft, int[] arrRight){
// Grab the lengths of the left and right arrays
int lenLeft = arrLeft.length;
int lenRight = arrRight.length;
// Create a new output array with the size = sum of the lengths of left and right
// arrays
int[] arrMerged = new int[lenLeft+lenRight];
// Maintain 3 indices, one for the left array, one for the right and one for
// the merged array
int indLeft = 0, indRight = 0, indMerged = 0;
// While neither array is empty, run a while loop to merge
// the smaller of the two elements, starting at the leftmost position of
// both arrays
while(indLeft < lenLeft && indRight < lenRight){
if(arrLeft[indLeft] < arrRight[indRight])
arrMerged[indMerged++] = arrLeft[indLeft++];
else
arrMerged[indMerged++] = arrRight[indRight++];
}
// Another while loop for when the left array still has elements left
while(indLeft < lenLeft){
arrMerged[indMerged++] = arrLeft[indLeft++];
}
// Another while loop for when the right array still has elements left
while(indRight < lenRight){
arrMerged[indMerged++] = arrRight[indRight++];
}
return arrMerged;
}
Apparently, none of the other solutions by users on the site made use of a queue either. I'm wondering whether using a Queue is less efficient. Could I be penalized for using a queue in an interview, for example?

As the question already states that the left and right input arrays are sorted, this gives you a hint that you should be able to solve the problem without requiring a data structure other than an array for the output.
In a real interview, it is likely that the interviewer will ask you to talk through your thought process while you are coding the solution. They may state that they want the solution implemented with certain constraints. It is very important to make sure that the problem is well defined before you start your coding. Ask as many questions as you can think of to constrain the problem as much as possible before starting.
When you are done implementing your solution, you could mention the time and space complexity of your implementation and suggest an alternative, more efficient solution.
For example, when describing your implementation you could talk about the following:
There is overhead when creating the queues
The big O notation / time and space complexity of your solution
You are unnecessarily iterating over every element of the left and right input array to create the queues before you do any merging
etc...
These types of interview questions are common when applying for positions at companies like Google, Microsoft, Amazon, and some tech startups. To prepare for such questions, I recommend you work through problems in books such as Cracking the Coding Interview. The book covers how to approach such problems, and the interview process for these kinds of companies.

Sorry to say but your solution with queues is horrible.
You are copying all elements to auxiliary dynamic data structures (which can be highly costly because of memory allocations), then back to the destination array.
A big "disadvantage" of merging is that it requires twice the storage space, as it cannot be done in-place (or at least not in a straightforward way). But you are making things much worse by adding extra copies and overhead, unnecessarily.
The right solution is to copy directly from source to destination, leading to simpler and much more efficient code.
Also note that using a sentinel value (Integer.MAX_VALUE) when one of the queues is exhausted only looks like a good idea, because it adds extra comparisons when you know the outcome in advance. It is much better to split the work into three loops, as in the reference code.
Lastly, your solution can fail when the data happens to contain Integer.MAX_VALUE.

java multithread loop with collecting results

Sorry for the limited code; I have almost no idea how to do this, and parts of the code are not real code, just an explanation of what I need. The base is:
List<Double> resultTopTen = new ArrayList<Double>();
List<Double> conditions = new ArrayList<Double>(); // this list can be very large (a million+ elements); it gets filled by different code
double loopResult;
for (int i = 0; i < conditions.size(); i++) { // multithread this
loopResult = conditions.get(i) + 5;
if (resultTopTen.size() < 10) {
resultTopTen.add(loopResult);
}
else {
// this part I don't know: if this loopResult belongs to the TOP 10 loopResults so far (just by size), replace the smallest one with the current one, so that resultTopTen is up to date at this point of the loop.
}
}
The loopResult = conditions.get(i) + 5; part is just an example. The real calculation is different, and in fact the values are not even doubles, so it is not possible to simply sort conditions and go from there.
The loop over conditions means I have to iterate through the input condition list, execute the calculation, and get a result for every condition in the list. It doesn't have to happen in order at all.
The multithreading part is the thing I really have no idea how to do. As the conditions list is really large, I would like to process it in parallel, because a simple loop in one thread won't fully utilize my computing resources. The trick is how to split the conditions and then collect the results. For simplicity, if I did it in 2 threads, I would split conditions in half and have one thread run the loop on the first half and another on the second; I would get 2 resultTopTen lists, which I could put together afterwards. But it would be much better to split the work into as many threads as system resources allow (for example, while CPU utilization and RAM usage stay below 90%). Is that possible?

Use parallel stream of Java 8.
static class TopN<T> {
final TreeSet<T> max;
final int size;
TopN(int size, Comparator<T> comparator) {
this.max = new TreeSet<>(comparator);
this.size = size;
}
void add(T n) {
max.add(n);
if (max.size() > size)
max.remove(max.last());
}
void combine(TopN<T> o) {
for (T e : o.max)
add(e);
}
}
public static void main(String[] args) {
List<Double> conditions = new ArrayList<>();
// add elements to conditions
TopN<Double> maxN = conditions.parallelStream()
.map(d -> d + 5) // some calculation
.collect(() -> new TopN<Double>(10, (a, b) -> Double.compare(a, b)),
TopN::add, TopN::combine);
System.out.println(maxN.max);
}
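Class TopN holds the top n items of type T.
As written, with an ascending comparator and max.remove(max.last()), this code prints the 10 smallest values of conditions after adding 5 to each element. Pass a reversed comparator to keep the 10 largest instead. Because the values live in a TreeSet, values that compare as equal are stored only once.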
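The question asks for the largest results, whereas TopN with an ascending comparator drops max.last() and therefore retains the smallest ones. A possible variant, sketched here under assumptions, keeps the N largest values in a min-heap so that the smallest retained value is the one evicted, and duplicates are preserved. LargestN and topTen are hypothetical names, and the d + 5 step is the placeholder calculation from the question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class LargestN<T> {
    private final PriorityQueue<T> heap; // min-heap: the smallest retained element is at the head
    private final Comparator<T> comparator;
    private final int limit;

    LargestN(int limit, Comparator<T> comparator) {
        this.limit = limit;
        this.comparator = comparator;
        this.heap = new PriorityQueue<>(comparator);
    }

    // Accumulator: keeps the value only if it belongs to the largest 'limit' seen so far.
    void add(T value) {
        if (heap.size() < limit) {
            heap.add(value);
        } else if (limit > 0 && comparator.compare(value, heap.peek()) > 0) {
            heap.poll();
            heap.add(value);
        }
    }

    // Combiner: merges the partial result produced by another thread.
    void combine(LargestN<T> other) {
        for (T value : other.heap) {
            add(value);
        }
    }

    List<T> sortedDescending() {
        List<T> result = new ArrayList<>(heap);
        result.sort(comparator.reversed());
        return result;
    }

    // Each worker thread fills its own LargestN; the stream framework combines them.
    static List<Double> topTen(List<Double> conditions) {
        return conditions.parallelStream()
                .map(d -> d + 5) // placeholder calculation from the question
                .collect(() -> new LargestN<Double>(10, Comparator.naturalOrder()),
                        LargestN::add, LargestN::combine)
                .sortedDescending();
    }
}
```

The parallel stream decides how to split the list and how many threads to use (by default the common fork-join pool, sized from the number of available processors), so no manual partitioning is needed.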

Let me simplify your question based on what I understand; please confirm or add to it:
Requirement: You want to find the top 10 results from a list called conditions.
Procedure: You want multiple threads to run your logic for finding the top 10 results and to accumulate their results into a single top 10.
Please also share the logic you want to implement to get the top 10 elements, or is it just the list in descending order and its first 10 elements?

Leetcode: Why this algorithm is slow?

So I am trying to solve this problem: http://oj.leetcode.com/problems/merge-intervals/
My solution is:
public class Solution {
public ArrayList<Interval> merge(ArrayList<Interval> intervals) {
// Start typing your Java solution below
// DO NOT write main() function
// ArrayList<Interval> result = new ArrayList<Interval>();
//First sort the intervals
Collections.sort(intervals,new Comparator<Interval>(){
public int compare(Interval interval1, Interval interval2) {
if(interval1.start > interval2.start) return 1;
if(interval1.start == interval2.start) return 0;
if(interval1.start < interval2.start) return -1;
return 42;
}
});
for(int i = 0; i < intervals.size() - 1; i++){
Interval currentInterval = intervals.get(i);
Interval nextInterval = intervals.get(i+1);
if(currentInterval.end >= nextInterval.start){
intervals.set(i,new Interval(currentInterval.start,nextInterval.end));
intervals.remove(i+1);
i--;
}
}
return intervals;
}
}
I have seen some blogs using exactly the same solution that get accepted, but mine is rejected because it takes too long. Can you enlighten me as to why it takes longer than expected?
Cheers
EDIT: solved. remove is too costly; using a new ArrayList to store the result is faster.

First you sort all the intervals; according to the Javadoc, this operation has complexity O(N*log(N)).
After that, you iterate over the ArrayList and sometimes remove elements from it.
But removing an element from an ArrayList has complexity O(N): the underlying implementation is a plain array, so removing an element from the middle requires shifting the entire right part of the array.
As you do that in a loop, the overall complexity of your algorithm is O(N^2).
I'd suggest using a LinkedList instead of an ArrayList in this case, traversed with a ListIterator, because get(i) on a LinkedList is itself O(N).
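The asker's edit reports that collecting into a new list is faster than removing in place. A minimal sketch of that approach follows, assuming an Interval class with int start and end fields; the nested Interval below is only a hypothetical stand-in for the judge's class, and MergeIntervals is an illustrative name. Note that it takes the larger of the two ends when merging, so an interval nested inside the previous one does not shrink the result.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeIntervals {

    // Minimal stand-in for the judge's Interval class (assumed shape).
    static final class Interval {
        int start;
        int end;

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    static List<Interval> merge(List<Interval> intervals) {
        // Sort a copy by start so the caller's list is left untouched: O(N log N).
        List<Interval> sorted = new ArrayList<>(intervals);
        sorted.sort((a, b) -> Integer.compare(a.start, b.start));

        // Single pass: either extend the last merged interval or append a new one.
        // Appending to the end of an ArrayList is amortized O(1), so no shifting occurs.
        List<Interval> result = new ArrayList<>();
        for (Interval current : sorted) {
            Interval last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && current.start <= last.end) {
                last.end = Math.max(last.end, current.end);
            } else {
                result.add(new Interval(current.start, current.end));
            }
        }
        return result;
    }
}
```

After the sort, the merge pass is linear, so the whole method stays at O(N log N).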

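You could simplify your comparator by using one computation instead of 3 comparisons:
Collections.sort(intervals,new Comparator<Interval>(){
public int compare(Interval interval1, Interval interval2) {
// Subtraction can overflow if the starts differ by more than Integer.MAX_VALUE;
// Integer.compare(interval1.start, interval2.start) avoids that.
return interval1.start - interval2.start;
}
});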

Dictionary of unknown size - find whether a word is in dictionary

Here is an interesting problem.
You are given an interface to a dictionary of unknown size, distribution, and content. It is sorted ascending.
It has just one method:
String getWord(long index) throws IndexOutOfBoundsException
Add one method to the API:
boolean isInDictionary(String word)
What would be the best implementation for this problem?

Here is my implementation
boolean isWordInTheDictionary(String word){
if (word == null){
return false;
}
// estimate the length of the dictionary array
long len=2;
String temp= getWord(len);
while(true){
len = len * 2;
try{
temp = getWord(len);
}catch(IndexOutOfBoundsException e){
// found upper bound; break from loop
break;
}
}
// Do a modified binary search using the estimated length
long beg = 0 ;
long end = len;
String tempWrd;
while(true){
System.out.println(String.format("beg: %s, end=%s, (beg+end)/2=%s ", beg,end,(beg+end)/2));
if(end - beg <= 1){
return false;
}
long idx = (beg+end)/2;
tempWrd = getWord(idx);
if(tempWrd == null){
end=idx;
continue;
}
if ( word.compareTo(tempWrd) > 0){
beg = idx;
}
else if(word.compareTo(tempWrd) < 0){
end= idx;
}else{
// found the word..
System.out.println(String.format("getword at index: %s, =%s", idx,getWord(idx)));
return true;
}
}
}
Let me know if this is correct

Let's suppose that your hypothetical data structure, with its single method, String getWord(long index), is based on a Dictionary that implements the usual Dictionary operations:
addition of pairs to the collection
removal of pairs from the collection
modification of the values of existing pairs
lookup of the value associated with a particular key
but the methods for all but the last have been hidden from you.
If that is the case, then your code definitely is not correct, because there is no reason to suppose that the dictionary stores values in any particular order, hence your binary search for items, using word.compareTo(), cannot be expected to work.
Also, you have no catch block for index numbers between the dictionary size and len, the power of two you found to be larger than the dictionary size (which itself need not be a power of two). So even if you changed to linear search instead of binary, you'd have an unhandled exception for words not in the dictionary.

No, the words inside the dictionary probably aren't sorted. So you have to iterate through the dictionary and check whether each word is the one you're looking for.
If it is sorted, your solution can be improved. The first loop only has to find the first entry that sorts after the word you're searching for.
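If the dictionary is sorted ascending, as the question states, the improvement above can be sketched as follows: double an index until it either runs past the end or reaches a word that is not smaller than the target, then binary search inside that range while treating out-of-range indices as "too large". DictionarySearch, the nested Dictionary interface, and wordOrNull are hypothetical names chosen for illustration, and the sketch assumes getWord never returns null for a valid index.

```java
final class DictionarySearch {

    // Assumed shape of the interface from the question.
    interface Dictionary {
        String getWord(long index) throws IndexOutOfBoundsException;
    }

    // Returns the word at index, or null when index is past the end.
    private static String wordOrNull(Dictionary dict, long index) {
        try {
            return dict.getWord(index);
        } catch (IndexOutOfBoundsException e) {
            return null;
        }
    }

    static boolean isInDictionary(Dictionary dict, String word) {
        if (word == null) {
            return false;
        }
        // Gallop: double hi until it is past the end or at a word >= the target.
        long lo = 0;
        long hi = 1;
        while (true) {
            String w = wordOrNull(dict, hi);
            if (w == null || w.compareTo(word) >= 0) {
                break;
            }
            lo = hi;
            hi *= 2; // overflow is ignored here; it would need about 2^62 entries
        }
        // Binary search on the closed range [lo, hi];
        // indices past the end are treated as larger than any word.
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            String w = wordOrNull(dict, mid);
            int cmp = (w == null) ? 1 : w.compareTo(word);
            if (cmp == 0) {
                return true;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }
}
```

Stopping the gallop at the first word that is not smaller than the target keeps the number of getWord calls logarithmic in the position of the word rather than in the size of the whole dictionary, and catching the exception in one helper handles every index between the real size and the estimated bound.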

duedl0r is correct: you can't assume that the dictionary will be ordered.
Without any other information, random search is probably the best algorithm you can choose (after having estimated the size, or during the estimation).
Just for correctness: in the second part of your algorithm you should check for exceptions and handle them because, as you said in the comment, your estimate is only an upper bound, so getWord may throw one.
Edit: just to give a better explanation.
Search in an unsorted list has a worst-case time complexity of O(n).
Randomized search has complexity O(k), where k is the number of iterations in the search, so you can decide k. But it is important to understand that randomized search does not guarantee success.
When n, the size of the dictionary, is very big, you can set k to a number some orders of magnitude lower than n and still have a high probability of finding the word.
