Multithreaded: Identifying duplicate objects

I'm trying to implement a method that finds duplicate objects in a List. The goal is to traverse the List and find the duplicates using multiple threads. So far I have used ExecutorService as follows.
ExecutorService executor = Executors.newFixedThreadPool(5);
for (int i = 0; i < jobs; i++) {
Runnable worker = new TaskToDo(jobs);
executor.execute(worker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
In the TaskToDo class I iterate through the list. When a duplicate is detected, one of the two elements is removed from the List. These are the problems I faced:
With multiple threads in the executor, the result is not as intended: some duplicate values still exist in the list. With a single thread in the executor it works perfectly. I also tried
List<String> list = Collections.synchronizedList(new LinkedList<String>()) but the same problem exists.
What is the best data structure I can use for removing duplicates with better performance?
Google turned up suggestions to use concurrent data structures, but it is difficult to figure out a correct approach.
I appreciate your help. Thanks in advance... :)
The following is the code that iterates through the list. Here the actual content of the files is compared.
for(int i = currentTemp; i < list.size() - 1; i++){
if(isEqual(list.get(currentTemp), list.get(i+1))){
synchronized (list) {
list.remove(i + 1);
i--;
}}}

With your current logic, you would have to synchronize at coarser granularity, otherwise you risk removing the wrong element.
for (int i = currentTemp; i < list.size() - 1; i++) {
synchronized (list) {
if (i + 1 < list.size() && isEqual(list.get(currentTemp), list.get(i+1))) {
list.remove(i + 1);
i--;
}
}
}
You see, the isEqual() check must be inside the synchronized block to ensure atomicity of the equivalence check with the element removal. Assuming most of your concurrent processing benefit would come from asynchronous comparison of list elements using isEqual(), this change nullifies any benefit you sought.
Also, checking list.size() outside the synchronized block isn't good enough, because list elements can be removed by other threads. And unless you have a way of adjusting your list index down when elements are removed by other threads, your code will unknowingly skip checking some elements in the list. The other threads are shifting elements out from under the current thread's for loop.
This task would be much better implemented using an additional set to keep track of the indexes that should be removed:
private volatile Set<Integer> indexesToRemove =
Collections.synchronizedSet(new TreeSet<Integer>(
new Comparator<Integer>() {
@Override public int compare(Integer i1, Integer i2) {
return i2.compareTo(i1); // sort descending for later element removal
}
}
));
The above should be declared at the same shared level as your list. Then the code for iterating through the list should look like this, with no synchronization required:
int size = list.size();
for (int i = currentTemp; i < size - 1; i++) {
if (!indexesToRemove.contains(i + 1)) {
if (isEqual(list.get(currentTemp), list.get(i+1))) {
indexesToRemove.add(i + 1);
}
}
}
And finally, after you have join()ed the worker threads back to a single thread, do this to de-duplicate your list:
for (Integer i: indexesToRemove) {
list.remove(i.intValue());
}
Because we used a descending-sorted TreeSet for indexesToRemove, we can simply iterate its indexes and remove each from the list.
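Putting the pieces above together, here is a minimal end-to-end sketch. It is an illustration, not the original poster's code: the class and method names are made up, equals() stands in for isEqual(), elements are assumed to be non-null, and a ConcurrentSkipListSet with a reversed comparator is swapped in for the synchronized TreeSet so the set can be iterated after the workers finish without extra locking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class IndexMarkingDedup {

    // Removes later duplicates from the list in place, keeping first occurrences.
    static <T> void dedupInPlace(List<T> list, int threads) {
        // Descending order, so removing one index never shifts an index still pending.
        ConcurrentSkipListSet<Integer> toRemove =
                new ConcurrentSkipListSet<>(Comparator.<Integer>reverseOrder());
        int size = list.size();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int start = 0; start < size; start++) {
            final int current = start;
            pool.execute(() -> {
                // Workers only read the list; the only shared writes go to the set.
                for (int j = current + 1; j < size; j++) {
                    if (!toRemove.contains(j) && list.get(current).equals(list.get(j))) {
                        toRemove.add(j);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Single-threaded from here on, so mutating the list is safe.
        for (int index : toRemove) {
            list.remove(index); // int argument selects remove(int), not remove(Object)
        }
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>(Arrays.asList("a", "b", "a", "c", "b", "a"));
        dedupInPlace(files, 4);
        System.out.println(files); // prints [a, b, c]
    }
}
```

Note that this still performs O(N^2) comparisons; it only spreads them over several threads.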

If your algorithm acts on enough data that you might really benefit from multiple threads, you run into another issue that tends to cancel out any performance benefit. Each thread has to scan the entire list to see whether the element it is working on is a duplicate, so the threads keep competing for the CPU caches as they access different parts of the list.
When the threads also write to memory that sits close together (for example while removing elements), they keep invalidating each other's cache lines, which is known as false sharing.
Even if False Sharing does not get you, you are de-duping the list in O(N^2) because for each element of the list, you re-iterate the entire list.
Instead, consider using a Set to initially collect the data. If you cannot do that, test the performance of adding the list elements to a Set. That should be a very efficient way to approach this problem.
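As a rough sketch of the Set suggestion: the snippet below assumes the elements implement equals() and hashCode() consistently (for file contents you would need a suitable key object), and the class and method names are made up for illustration. LinkedHashSet is used so that the first occurrence of each element keeps its position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class SetDedup {

    // Single pass over the input: each element is hashed once instead of
    // being compared against every other element.
    static <T> List<T> dedup(List<T> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(dedup(input)); // prints [a, b, c]
    }
}
```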

If you're trying to dedup a large number of files, you really ought to be using a hash-based structure. Concurrently modifying lists is dangerous, not least because indexes into the list will constantly be changing out from under you, and that's bad.
If you can use Java 8, my approach would look something like this. Let's assume you have a List<String> fileList.
Collection<String> deduplicatedFiles = fileList.parallelStream()
.map(FileSystems.getDefault()::getPath) // convert strings to Paths
.collect(Collectors.toConcurrentMap(
path -> {
try {
return ByteBuffer.wrap(Files.readAllBytes(path));
// read out the file contents and wrap in a ByteBuffer
// which is a suitable key for a hash map
} catch (IOException e) {
throw new RuntimeException(e);
}
},
path -> path.toString(), // in the values, convert back to string
(first, second) -> first) // resolve duplicates by choosing arbitrarily
.values();
That's the entire thing: it concurrently reads all the files, hashes them (though with an unspecified hash algorithm that may not be great), deduplicates them, and returns a collection of files with distinct contents.
If you're using Java 7, then what I'd do would be something like this.
CompletionService<Void> service = new ExecutorCompletionService<>(
        Executors.newFixedThreadPool(4));
final ConcurrentMap<ByteBuffer, String> unique = new ConcurrentHashMap<>();
for (final String file : fileList) {
    service.submit(new Runnable() {
        @Override public void run() {
            try {
                ByteBuffer buffer = ByteBuffer.wrap(Files.readAllBytes(
                        FileSystems.getDefault().getPath(file)));
                // keeps the first file seen for each distinct content
                unique.putIfAbsent(buffer, file);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }, null);
}
for (int i = 0; i < fileList.size(); i++) {
    service.take(); // blocks until one task has completed; throws InterruptedException
}
Collection<String> result = unique.values();
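Both variants rely on ByteBuffer comparing and hashing by content, which is what makes it usable as a map key. Here is a small in-memory sketch of that idea, with byte arrays standing in for the file reads so it runs without touching the file system. The class name, method name and sample data are made up for illustration, and the arrays must not be modified while they serve as keys.

```java
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ContentKeyDedup {

    // Returns one name per distinct content. Which name survives among
    // duplicates depends on thread scheduling.
    static Collection<String> distinctByContent(Map<String, byte[]> contentsByName, int threads) {
        ConcurrentMap<ByteBuffer, String> unique = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Map.Entry<String, byte[]> entry : contentsByName.entrySet()) {
            final String name = entry.getKey();
            final byte[] content = entry.getValue();
            // ByteBuffer.equals/hashCode look at the remaining bytes, not at identity.
            pool.execute(() -> unique.putIfAbsent(ByteBuffer.wrap(content), name));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return unique.values();
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", new byte[] {1, 2, 3});
        files.put("b.txt", new byte[] {4, 5});
        files.put("copy-of-a.txt", new byte[] {1, 2, 3});
        System.out.println(distinctByContent(files, 4).size()); // prints 2
    }
}
```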

Related

java 8 parallelStream().forEach Result data loss

There are two test cases which use parallelStream():
List<Integer> src = new ArrayList<>();
for (int i = 0; i < 20000; i++) {
src.add(i);
}
List<String> strings = new ArrayList<>();
src.parallelStream().filter(integer -> (integer % 2) == 0).forEach(integer -> strings.add(integer + ""));
System.out.println("=size=>" + strings.size());
=size=>9332
List<Integer> src = new ArrayList<>();
for (int i = 0; i < 20000; i++) {
src.add(i);
}
List<String> strings = new ArrayList<>();
src.parallelStream().forEach(integer -> strings.add(integer + ""));
System.out.println("=size=>" + strings.size());
=size=>17908
Why do I always lose data when using parallelStream?
What did I do wrong?

ArrayList isn't thread safe. You need to do
List<String> strings = Collections.synchronizedList(new ArrayList<>());
or
List<String> strings = new Vector<>();
to ensure all updates are synchronized, or switch to
List<String> strings = src.parallelStream()
.filter(integer -> (integer % 2) == 0)
.map(integer -> integer + "")
.collect(Collectors.toList());
and leave the list building to the Streams framework. Note that it's undefined whether the list returned by collect is modifiable, so if that is a requirement, you may need to modify your approach.
In terms of performance, Stream.collect is likely to be much faster than using Stream.forEach to add to a synchronized collection, since the Streams framework can handle collection of values in each thread separately without synchronization and combine the results at the end in a thread safe fashion.
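If a modifiable result is a requirement, one option is to name the collection type explicitly with Collectors.toCollection. The following is a minimal sketch; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class MutableCollect {

    // Collects into an ArrayList chosen by the caller, so the result is
    // guaranteed to be modifiable. Encounter order is preserved even
    // though the stream runs in parallel.
    static List<String> evenNumbersAsStrings(int upperExclusive) {
        return IntStream.range(0, upperExclusive)
                .parallel()
                .filter(i -> i % 2 == 0)
                .mapToObj(String::valueOf)
                .collect(Collectors.toCollection(ArrayList::new));
    }

    public static void main(String[] args) {
        List<String> strings = evenNumbersAsStrings(20000);
        strings.add("one more"); // safe: the list is a plain ArrayList
        System.out.println("=size=>" + strings.size()); // prints =size=>10001
    }
}
```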

ArrayList isn't thread-safe. While one thread sees a list with 30 elements, another might still see 29 and overwrite the 30th position (losing one element).
Another issue might arise when the array backing the list needs to be resized. A new, larger array is created (about 1.5 times the old capacity in OpenJDK) and the elements from the original array are copied into it. Other threads might have added elements in the meantime that the resizing thread has not seen, or multiple threads resize at once and only one of them wins.
When using multiple threads you need to either synchronize access to the list or use a thread-safe list (for example by wrapping it with Collections.synchronizedList or by using a CopyOnWriteArrayList, to mention two possible solutions). Even better would be to use the collect method on the stream to put everything into a list.

parallelStream combined with forEach is a dangerous combination if not used carefully.
Keep the following points in mind to avoid bugs:
If you have a pre-existing list to which you want to add objects from a parallelStream loop, wrap it with Collections.synchronizedList before looping and add to it through the wrapper.
If you have to create a new list, you can use a Vector, created outside the loop.
or
If you have to create a new list, simply use parallelStream and collect the output at the end.
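A short sketch of the pre-existing-list case, showing the wrapper approach next to a collect-then-addAll alternative. The class and method names are made up for illustration. With the wrapper, the order of the appended elements is not deterministic; with collect, it follows the stream's encounter order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class AppendFromParallelStream {

    // Every add goes through the synchronized wrapper, so no element is lost.
    static void appendWithWrapper(List<String> existing, int upperExclusive) {
        List<String> guarded = Collections.synchronizedList(existing);
        IntStream.range(0, upperExclusive)
                .parallel()
                .filter(i -> i % 2 == 0)
                .forEach(i -> guarded.add(String.valueOf(i)));
    }

    // The stream builds its own result; the shared list is touched once,
    // by the calling thread, after the parallel work is finished.
    static void appendWithCollect(List<String> existing, int upperExclusive) {
        existing.addAll(IntStream.range(0, upperExclusive)
                .parallel()
                .filter(i -> i % 2 == 0)
                .mapToObj(String::valueOf)
                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        strings.add("seed");
        appendWithWrapper(strings, 20000);
        System.out.println("=size=>" + strings.size()); // prints =size=>10001
    }
}
```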

You lose the benefits of using stream (and parallel stream) when you try to do mutation. As a general rule, avoid mutation when using streams. Venkat Subramaniam explains why. Instead, use collectors. Also try to get a lot accomplished within the stream chain. For example:
System.out.println(
IntStream.range(0, 200000)
.filter(i -> i % 2 == 0)
.mapToObj(String::valueOf)
.collect(Collectors.toList()).size()
);
You can run that in parallel by adding .parallel() to the stream chain.

Multiple threads in for loop

I have a method I need to call for each element in a list, then return this list to the caller in another class. I want to create a Thread for each element but am struggling to get my head around how to do this.
public List<MyList> threaded(List<Another> another) {
List<MyList> myList= new ArrayList<>();
Visibility visi = new Visibility();
Thread[] threads = new Thread[another.size()];
for (int i = 0; i < another.size(); i++) {
visi = test(another.get(i));
myList.add(visi);
}
return myList;
}
So I've defined an array of threads that matches the number of elements in the other list. Where I'm lost is how to use each of those threads in the loop and then return myList after all threads have finished.

This looks like a perfect use case for Collection.parallelStream():
public List<MyList> threaded(List<Another> another) {
    return another.parallelStream()
            .map(a -> test(a))
            .collect(Collectors.toList());
}
This calls test on each Another and collects the results as a List, using as many CPUs as you have available (up to the number of objects you have).
Yes, you could create a Thread for each one, but that is likely to be less efficient and much more complicated.
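For comparison, here is roughly what the explicit-threads route looks like when done with a fixed thread pool instead of one raw Thread per element. This is a sketch under assumptions: the generic helper mapConcurrently and its parameters are made up for illustration, and work stands in for the test method from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PerElementTasks {

    // Runs work on every input using a fixed pool and returns the results
    // in the same order as the inputs.
    static <T, R> List<R> mapConcurrently(List<T> inputs, Function<T, R> work, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (T input : inputs) {
                tasks.add(() -> work.apply(input));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until all tasks are done; futures come back in task order.
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> doubled = mapConcurrently(Arrays.asList(1, 2, 3), x -> x * 2, 2);
        System.out.println(doubled); // prints [2, 4, 6]
    }
}
```

The results list is only ever touched by the calling thread, which is why a plain ArrayList is enough here.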

Set vs List when need both unique elements and access by index

I need to keep a unique list of elements seen and I also need to pick random one from them from time to time. There are two simple ways for me to do this.
Keep elements seen in a Set - that gives me uniqueness of elements. When there is a need to pick random one, do the following:
elementsSeen.toArray()[random.nextInt(elementsSeen.size())]
Keep elements seen in a List - this way no need to convert to array as there is the get() function for when I need to ask for a random one. But here I would need to do this when adding.
if (elementsSeen.indexOf(element)==-1) {elementsSeen.add(element);}
So my question is which way would be more efficient? Is converting to array more consuming or is indexOf worse? What if attempting to add an element is done 10 or 100 or 1000 times more often?
I am interested in how to combine functionality of a list (access by index) with that of a set (unique adding) in the most performance effective way.

If using more memory is not a problem then you can get the best of both by using both list and set inside a wrapper:
public class MyContainer<T> {
private final Set<T> set = new HashSet<>();
private final List<T> list = new ArrayList<>();
public void add(T e) {
if (set.add(e)) {
list.add(e);
}
}
public T getRandomElement() {
return list.get(ThreadLocalRandom.current().nextInt(list.size()));
}
// other methods as needed ...
}
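If removal is among the "other methods as needed", one possible variation (not part of the answer above) is to store each element's list position in a map, so that remove can swap the last element into the freed slot instead of shifting the whole list. The class name and method set below are assumptions for illustration; the sketch is not thread-safe, and getRandomElement throws on an empty container.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class IndexedUniqueContainer<T> {
    private final Map<T, Integer> positions = new HashMap<>(); // element -> index in items
    private final List<T> items = new ArrayList<>();

    boolean add(T e) {
        if (positions.containsKey(e)) {
            return false; // already present, keep elements unique
        }
        positions.put(e, items.size());
        items.add(e);
        return true;
    }

    boolean remove(T e) {
        Integer pos = positions.remove(e);
        if (pos == null) {
            return false;
        }
        int lastIndex = items.size() - 1;
        T last = items.remove(lastIndex); // removing the tail does not shift anything
        if (pos != lastIndex) {
            items.set(pos, last);      // move the former tail into the freed slot
            positions.put(last, pos);  // and record its new index
        }
        return true;
    }

    T getRandomElement() {
        return items.get(ThreadLocalRandom.current().nextInt(items.size()));
    }

    int size() {
        return items.size();
    }

    public static void main(String[] args) {
        IndexedUniqueContainer<String> seen = new IndexedUniqueContainer<>();
        seen.add("a");
        seen.add("b");
        seen.add("a"); // ignored as a duplicate
        seen.remove("a");
        System.out.println(seen.getRandomElement()); // prints b
    }
}
```

The trade-off is that removal no longer preserves insertion order.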

HashSet and TreeSet both inherit from AbstractCollection (via AbstractSet), which provides the toArray() implementation shown below:
public Object[] toArray() {
// Estimate size of array; be prepared to see more or fewer elements
Object[] r = new Object[size()];
Iterator<E> it = iterator();
for (int i = 0; i < r.length; i++) {
if (! it.hasNext()) // fewer elements than expected
return Arrays.copyOf(r, i);
r[i] = it.next();
}
return it.hasNext() ? finishToArray(r, it) : r;
}
As you can see, it is responsible for allocating the space for an array, as well as creating an Iterator object for copying. So, for a Set, adding is O(1), but retrieving a random element will be O(N) because of the element copy operation.
A List, on the other hand, allows you quick access to a specific index in the backing array, but doesn't guarantee uniqueness. You would have to re-implement the add, remove and associated methods to guarantee uniqueness on insert. Adding a unique element will be O(N), but retrieval will be O(1).
So, it really depends on which area is your potential high usage point. Are the add/remove methods going to be heavily used, with random access used sparingly? Or is this going to be a container for which retrieval is most important, since few elements will be added or removed over the lifetime of the program?
If the former, I'd suggest using the Set with toArray(). If the latter, it may be beneficial for you to implement a unique List to take advantage of the fast retrieval. The significant downside is that add involves many edge cases, which the standard Java library takes great care to handle efficiently. Will your implementation be up to the same standards?

Write some test code and put in some realistic values for your use case. Neither of the methods is so complex that it's not worth the effort, if performance is a real issue for you.
I tried that quickly, based on the exact two methods you described, and it appears that the Set implementation will be quicker if you are adding considerably more than you are retrieving, due to the slowness of the indexOf method. But I really recommend that you do the tests yourself - you're the only person who knows what the details are likely to be.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
public class SetVsListTest<E> {
private static Random random = new Random();
private Set<E> elementSet;
private List<E> elementList;
public SetVsListTest() {
elementSet = new HashSet<>();
elementList = new ArrayList<>();
}
private void listAdd(E element) {
if (elementList.indexOf(element) == -1) {
elementList.add(element);
}
}
private void setAdd(E element) {
elementSet.add(element);
}
private E listGetRandom() {
return elementList.get(random.nextInt(elementList.size()));
}
@SuppressWarnings("unchecked")
private E setGetRandom() {
return (E) elementSet.toArray()[random.nextInt(elementSet.size())];
}
public static void main(String[] args) {
SetVsListTest<Integer> test;
List<Integer> testData = new ArrayList<>();
int testDataSize = 100_000;
int[] addToRetrieveRatios = new int[] { 10, 100, 1000, 10000 };
for (int i = 0; i < testDataSize; i++) {
/*
* Add 1/5 of the total possible number of elements so that we will
* have (on average) 5 duplicates of each number. Adjust this to
* whatever is most realistic
*/
testData.add(random.nextInt(testDataSize / 5));
}
for (int addToRetrieveRatio : addToRetrieveRatios) {
/*
* Test the list method
*/
test = new SetVsListTest<>();
long t1 = System.nanoTime();
for(int i=0;i<testDataSize; i++) {
// Use == 1 here because we don't want to get from an empty collection
if(i%addToRetrieveRatio == 1) {
test.listGetRandom();
} else {
test.listAdd(testData.get(i));
}
}
long t2 = System.nanoTime();
System.out.println(((t2-t1)/1000000L)+" ms for list method with add/retrieve ratio "+addToRetrieveRatio);
/*
* Test the set method
*/
test = new SetVsListTest<>();
t1 = System.nanoTime();
for(int i=0;i<testDataSize; i++) {
// Use == 1 here because we don't want to get from an empty collection
if(i%addToRetrieveRatio == 1) {
test.setGetRandom();
} else {
test.setAdd(testData.get(i));
}
}
t2 = System.nanoTime();
System.out.println(((t2-t1)/1000000L)+" ms for set method with add/retrieve ratio "+addToRetrieveRatio);
}
}
}
Output on my machine was:
819 ms for list method with add/retrieve ratio 10
1204 ms for set method with add/retrieve ratio 10
1547 ms for list method with add/retrieve ratio 100
133 ms for set method with add/retrieve ratio 100
1571 ms for list method with add/retrieve ratio 1000
23 ms for set method with add/retrieve ratio 1000
1542 ms for list method with add/retrieve ratio 10000
5 ms for set method with add/retrieve ratio 10000

You could extend HashSet and track the changes to it, maintaining a current array of all entries.
Here I keep a copy of the array and adjust it every time the set changes. For a more robust (but more costly) solution you could use toArray in your pick method.
class PickableSet<T> extends HashSet<T> {
private T[] asArray = (T[]) this.toArray();
private void dirty() {
asArray = (T[]) this.toArray();
}
public T pick(int which) {
return asArray[which];
}
@Override
public boolean add(T t) {
boolean added = super.add(t);
dirty();
return added;
}
@Override
public boolean remove(Object o) {
boolean removed = super.remove(o);
dirty();
return removed;
}
}
Note that this will not recognise changes to the set if removed by an Iterator - you will need to handle that some other way.

So my question is which way would be more efficient?
That is quite difficult to answer, because it depends on what you do more often: insert, or select at random.
We need to look at the Big O for each of the operations. In this case (best cases):
Set: Insert O(1)
Set: toArray O(n) (I'd assume)
Array: Access O(1)
vs
List: Contains O(n)
List: Insert O(1)
List: Access O(1)
So:
Set: Insert: O(1), Access O(n)
List: Insert: O(n), Access O(1)
So in the best case they are much of a muchness with Set winning if you insert more than you select, and List if the reverse is true.
Now the evil answer - Select one (the one that best represents the problem (so Set IMO)), wrap it well and run with it. If it is too slow then deal with it later, and when you do deal with it, look at the problem space. Does your data change often? No, cache the array.
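As a sketch of the "cache the array" idea: keep the Set as the source of truth and rebuild the array lazily, only when a pick follows a modification. The class and its methods are made up for illustration, and the sketch is not thread-safe.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class CachedPickSet<T> {
    private final Set<T> elements = new HashSet<>();
    private Object[] snapshot; // null means the cached array is stale

    boolean add(T e) {
        boolean changed = elements.add(e);
        if (changed) {
            snapshot = null; // invalidate, but do not rebuild yet
        }
        return changed;
    }

    boolean remove(T e) {
        boolean changed = elements.remove(e);
        if (changed) {
            snapshot = null;
        }
        return changed;
    }

    // O(n) only for the first pick after a change, O(1) afterwards.
    // Throws IllegalArgumentException if the set is empty.
    @SuppressWarnings("unchecked")
    T pickRandom(Random random) {
        if (snapshot == null) {
            snapshot = elements.toArray();
        }
        return (T) snapshot[random.nextInt(snapshot.length)];
    }

    public static void main(String[] args) {
        CachedPickSet<String> seen = new CachedPickSet<>();
        seen.add("a");
        seen.add("a"); // duplicate, the cache stays valid
        System.out.println(seen.pickRandom(new Random())); // prints a
    }
}
```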

It depends what you value more.
List implementations in Java normally make use of an array or a linked list. That means inserting and looking up by index is fast, but searching for a specific element requires looping through the list and comparing each element until the element is found.
Set implementations in Java mainly make use of an array, the hashCode method and the equals method. So a set is more taxing when you want to insert, but beats a list when it comes to looking for an element. As a set doesn't guarantee the order of the elements in the structure, you will not be able to get an element by index. You can use a sorted set, but this adds latency on insert due to the sorting.
If you are going to be working with indexes directly, then you may have to use a List, because the order in which elements appear in Set.toArray() can change as you add elements to the Set.
Hope this helps :)

Removing values in an arraylist that DO NOT match a value

I am having some trouble with removing values that do not match a given value. At the moment I am copying over values to a new list and trying to clear the original list - but this is inefficient.
This is my code:
int size = list.size();
ArrayList<String> newList;
int count = 0;
newList = new ArrayList<>();
for (int i=0; i<list.size(); i++){
if(list.get(i).getForename().equals(forename)){
newList.add(i, list);
}
}
list.clear();
Is there a way where I can just remove an item in the arraylist if it does NOT match the name?
EDIT:
It works, but then I might need a copy, because if I select another name from the dropdown it will be referring to the old one.
Thanks

A first thought would be to iterate over the list and remove each item as soon as you find that it does not match the value. But if you do that inside a for-each loop, it throws a ConcurrentModificationException, because you modify the list while iterating over it.
Another approach, still not efficient, is to iterate over the list, keep track of the indexes to remove, and remove them after the iteration. Remove from the highest index down, and make sure you call remove(int) rather than remove(Object):
ArrayList<Integer> indexList = new ArrayList<Integer>();
for (int i = 0; i < list.size(); i++) {
    if (!list.get(i).getForename().equals(forename)) {
        indexList.add(i);
    }
}
// Go backwards so that earlier removals do not shift the indexes still to be removed.
for (int k = indexList.size() - 1; k >= 0; k--) {
    list.remove(indexList.get(k).intValue()); // intValue() selects remove(int index)
}
indexList.clear();
Please note that this is not really efficient either, but maybe you were looking for a way to delete from the same list.

A simple solution is
while (list.contains(value)) {
list.remove(list.indexOf(value));
}

Depending on what you want, you might want to use streams instead (this seems to be what you actually want, since you don't really seem to want to delete elements from your list):
newList = list.stream()
.filter(e -> e.getForename().equals(forename))
.collect(Collectors.toList());
or, to perform your action directly on the matching elements:
list.stream()
.filter(e -> e.getForename().equals(forename))
.forEach(person -> doStuff(person));
Another way is to use an iterator, which avoids conflicts with modifications during iteration:
ListIterator<YourObject> iterator = list.listIterator();
while(iterator.hasNext()){
if(!iterator.next().getForename().equals(forename))
iterator.remove();
}
EDIT: Since the OP can't use lambdas and streams (because of the Java version), here is roughly what happens in the second stream (the forEach). I am not using the proper interfaces, since the OP can't do so either. The difference from streams is that a parallel stream can also split the work across several threads and hence might be faster (especially on multi-core processors and big lists):
interface Consumer<T>{ //this is normally given by the JAVA 8 API (which has one more default method)
void accept(T t);
}
Consumer<YourObject> doIt = new Consumer<YourObject>(){ //This is what the lambda expression actually does
@Override
public void accept(YourObject e) {
doStuff(e);
}
};
for(YourObject element : list){ //for-each loop, available since Java 1.5. Alternatively, use your old for loop with element=list.get(i);
if(!element.getForename().equals(forename)) //the filter, written the simple way
continue;
doIt.accept(element); //You could also use a method or expressions instead in this context.
//doStuff(element); //What actually the upper stream does.
}
You might want to look at the Oracle tutorial (this chapter) to get a feeling for when this design is appropriate: https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html (I have a strong feeling you might want to use it).

Assuming your List contains String objects, the following should be what you are looking for. It removes every entry that does not equal forename:
for (Iterator<String> it = list.iterator(); it.hasNext();) {
String foreName = it.next();
if (!forename.equals(foreName)) { // also removes null entries
it.remove();
}
}

try
for (int i=0; i<list.size();){
if(!list.get(i).getForename().equals(forename)){
list.remove(i);
}
else {
i++;
}
}
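For readers on Java 8 or later, the loop above can be expressed with Collection.removeIf, and working on a copy addresses the EDIT in the question about keeping the original list for the next selection. This is only a sketch: the Person class is a hypothetical stand-in for the element type (the question only shows that elements have getForename()), and the method name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveNonMatching {

    // Hypothetical element type for illustration.
    static final class Person {
        private final String forename;

        Person(String forename) {
            this.forename = forename;
        }

        String getForename() {
            return forename;
        }
    }

    // Returns a filtered copy; the source list is left untouched so it can
    // be filtered again with a different forename later.
    static List<Person> keepMatching(List<Person> source, String forename) {
        List<Person> copy = new ArrayList<>(source);
        copy.removeIf(p -> !forename.equals(p.getForename()));
        return copy;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(new Person("Ann"), new Person("Bob"), new Person("Ann"));
        System.out.println(keepMatching(people, "Ann").size()); // prints 2
        System.out.println(people.size());                      // prints 3
    }
}
```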

java intstream parallel loop omitting data

I have this piece of code:
ArrayList<ArrayList<Double>> results = new ArrayList<ArrayList<Double>>();
IntStream.range(0, 100).parallel().forEach(x ->{
for (int y = 0; y <100;y++){
for (int z = 0; z <100;z++){
for (int q = 0; q <100;q++){
results.add(someMethodThatReturnsArrayListDouble);
}
}
}
});
System.out.println(results.size());
After running this code, I always get a different results.size(), always a few short. Any idea why that is and how to fix it?

ArrayList is not thread-safe. If you try to add items to it from different threads (which is what a parallelized stream does), it is likely to break.
From the docs:
Note that this implementation is not synchronized. If multiple threads access an ArrayList instance concurrently, and at least one of the threads modifies the list structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more elements, or explicitly resizes the backing array; merely setting the value of an element is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the list. If no such object exists, the list should be "wrapped" using the Collections.synchronizedList method.
The easiest fix, in this case, would be to remove the call to parallel().

Your result list is not synchronized. There are multiple ways to solve your problem; the best would be to let the Java stream API handle the combining of the lists.
// boxed() is needed because IntStream.flatMap can only produce another IntStream
List<List<Double>> results = IntStream.range(0, 100).parallel().boxed().flatMap(x -> {
    List<List<Double>> partial = new ArrayList<>();
    for (int y = 0; y < 100; y++) {
        for (int z = 0; z < 100; z++) {
            for (int q = 0; q < 100; q++) {
                partial.add(someMethodThatReturnsArrayListDouble);
            }
        }
    }
    return partial.stream();
}).collect(Collectors.toList());
This collects the lists inside the lambda and returns them as a stream, to be combined at the end using Collectors.toList(), which is thread safe.
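A self-contained version of that idea is sketched below. The compute method is a hypothetical stand-in for someMethodThatReturnsArrayListDouble, the class and method names are made up, and the grid size is a parameter so the example stays small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelGridCollect {

    // Hypothetical placeholder for the real computation.
    static List<Double> compute(int x, int y, int z, int q) {
        return Arrays.asList((double) x, (double) y, (double) z, (double) q);
    }

    static List<List<Double>> run(int n) {
        return IntStream.range(0, n)
                .parallel()
                .boxed()
                .flatMap(x -> {
                    // Local to this invocation, so no other thread can see it.
                    List<List<Double>> partial = new ArrayList<>();
                    for (int y = 0; y < n; y++) {
                        for (int z = 0; z < n; z++) {
                            for (int q = 0; q < n; q++) {
                                partial.add(compute(x, y, z, q));
                            }
                        }
                    }
                    return partial.stream();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(10).size()); // prints 10000
    }
}
```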

Use Vector; it's a thread-safe implementation of List.
