Java Iterator Concurrency - java

I'm trying to loop over a Java iterator concurrently, but I'm not sure of the best way to do it.
Here is what I have now, without any concurrency:
Long l;
Iterator<Long> i = getUserIDs();
while (i.hasNext()) {
l = i.next();
someObject.doSomething(l);
anotheObject.doSomething(l);
}
There should be no race conditions between the things I'm doing on the non-iterator objects, so I'm not too worried about that. I'd just like to speed up the loop by not processing the iterator sequentially.
Thanks in advance.

One solution is to use an executor to parallelise your work.
Simple example:
ExecutorService executor = Executors.newCachedThreadPool();
Iterator<Long> i = getUserIDs();
while (i.hasNext()) {
final Long l = i.next();
Runnable task = new Runnable() {
public void run() {
someObject.doSomething(l);
anotheObject.doSomething(l);
}
};
executor.submit(task);
}
executor.shutdown();
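This submits one task per item in the iterator. A cached thread pool reuses idle threads and creates new ones when none are free, so with many items it can end up running a very large number of threads. You can bound the number of threads by using a different factory method on the Executors class (for example, newFixedThreadPool), or subdivide the work as you see fit (e.g. a different Runnable for each of the method calls).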
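As a rough sketch of the bounded variant mentioned above, the same loop can be run on a fixed-size pool. The class name BoundedIteratorDispatch, the processAll helper, the one-minute timeout and the LongConsumer standing in for the doSomething calls are all illustrative assumptions, not part of the original code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

class BoundedIteratorDispatch {

    /** Runs action for every ID on at most `threads` workers and returns once all tasks are done. */
    static void processAll(Iterator<Long> ids, int threads, LongConsumer action) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            while (ids.hasNext()) {
                final long id = ids.next();            // the iterator is only touched by this thread
                executor.submit(() -> action.accept(id));
            }
            executor.shutdown();                       // no new tasks; queued ones still run
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {   // timeout is an arbitrary choice
                throw new IllegalStateException("tasks did not finish within one minute");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } finally {
            executor.shutdown();                       // harmless if already shut down
        }
    }

    public static void main(String[] args) {
        AtomicLong sum = new AtomicLong();
        processAll(Arrays.asList(1L, 2L, 3L, 4L).iterator(), 2, sum::addAndGet);
        System.out.println(sum.get());                 // prints 10
    }
}
```

Note that exceptions thrown inside a task are captured in the Future returned by submit, so a real version would probably keep those futures and inspect them.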

I can offer two possible approaches:
Use a thread pool and dispatch the items received from the iterator to a set of processing threads. This will not accelerate the iterator operations themselves, since those would still happen in a single thread, but it will parallelize the actual processing.
Depending on how the iteration is created, you might be able to split the iteration process into multiple segments, each to be processed by a separate thread via a different Iterator object. For an example, have a look at the List.subList(int fromIndex, int toIndex) and List.listIterator(int index) methods.
This would allow the iterator operations to happen in parallel, but it is not always possible to segment the iteration like this, usually due to the simple fact that the items to be iterated over are not immediately available.
As a bonus trick, if the iteration operations are expensive or slow, such as those required to access a database, you might see a throughput improvement if you separate them out to a separate thread that will use the iterator to fill in a BlockingQueue. The dispatcher thread will then only have to access the queue, without waiting on the iterator object to retrieve the next item.
The most important advice in this case is this: "Use your profiler", usually to be followed by "Do not optimise prematurely". By using a profiler, such as VisualVM, you should be able to ascertain the exact cause of any performance issues, without taking shots in the dark.
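The BlockingQueue hand-off described above could look roughly like the following. This is only a sketch under assumptions: the class and method names are made up, Optional.empty() is used as an end-of-data marker (so the iterator must not return null), and the consumer simply collects the items where a real dispatcher would pass them to a thread pool.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class IteratorPrefetch {

    /** Drains a (possibly slow) iterator on a separate thread and consumes the items via a bounded queue. */
    static List<Long> drainThroughQueue(Iterator<Long> slowIterator, int capacity) {
        BlockingQueue<Optional<Long>> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                while (slowIterator.hasNext()) {
                    queue.put(Optional.of(slowIterator.next()));  // blocks while the queue is full
                }
                queue.put(Optional.empty());                      // end-of-data marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "iterator-producer");
        producer.start();

        List<Long> seen = new ArrayList<>();
        try {
            // The dispatcher only waits on the queue, never on the iterator itself.
            for (Optional<Long> item = queue.take(); item.isPresent(); item = queue.take()) {
                seen.add(item.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for items", e);
        }
        return seen;
    }
}
```

If the iterator can throw, a more careful version would also have to make sure the end marker is still delivered, otherwise the consumer waits forever.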

If you are using Java 7, you can use the new fork/join; see the tutorial.
Not only does it spread the subtasks across the threads, but a thread that finishes its own tasks earlier than the others also "steals" tasks from them.
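For illustration, a minimal fork/join task might look like this. It assumes the IDs are already available as an array (fork/join needs a data source it can split, which a plain Iterator is not), and it just sums them as a stand-in for real work. The class name and the threshold of 1000 are arbitrary choices.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumIdsTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;   // below this size, work sequentially (arbitrary value)

    private final long[] ids;
    private final int from;
    private final int to;

    SumIdsTask(long[] ids, int from, int to) {
        this.ids = ids;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += ids[i];                   // placeholder for the real per-item work
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumIdsTask left = new SumIdsTask(ids, from, mid);
        SumIdsTask right = new SumIdsTask(ids, mid, to);
        left.fork();                             // left half becomes available for stealing
        long rightSum = right.compute();         // right half runs in the current thread
        return left.join() + rightSum;
    }

    static long sum(long[] ids) {
        ForkJoinPool pool = new ForkJoinPool();  // defaults to one worker per available processor
        try {
            return pool.invoke(new SumIdsTask(ids, 0, ids.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

The task decides itself where to split; the pool's contribution is the scheduling and the work stealing between worker threads.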

Related

How to avoid congesting/stalling/deadlocking an executorservice with recursive callable

All the threads in an ExecutorService are busy with tasks that wait for tasks that are stuck in the queue of the executor service.
Example code:
ExecutorService es=Executors.newFixedThreadPool(8);
Set<Future<Set<String>>> outerSet=new HashSet<>();
for(int i=0;i<8;i++){
outerSet.add(es.submit(new Callable<Set<String>>() {
@Override
public Set<String> call() throws Exception {
Thread.sleep(10000); //to simulate work
Set<Future<String>> innerSet=new HashSet<>();
for(int j=0;j<8;j++) {
int k=j;
innerSet.add(es.submit(new Callable<String>() {
@Override
public String call() throws Exception {
return "number "+k+" in inner loop";
}
}));
}
Set<String> out=new HashSet<>();
while(!innerSet.isEmpty()) { //we are stuck in this loop because all the
for(Future<String> f:innerSet) { //callables in innerSet are stuck in the queue
if(f.isDone()) { //of es and can't start, since all the threads
out.add(f.get()); //in es are busy waiting for them to finish
}
}
}
return out;
}
}));
}
Is there any way to avoid this other than by making more thread pools for each layer or by having a thread pool that is not fixed in size?
A practical example would be if some callables are submitted to ForkJoinPool.commonPool() and then these tasks use objects that also submit to the commonPool in one of their methods.
You should use a ForkJoinPool. It was made for this situation.
Whereas your solution blocks a thread permanently while it's waiting for its subtasks to finish, the work stealing ForkJoinPool can perform work while in join(). This makes it efficient for these kinds of situations where you may have a variable number of small (and often recursive) tasks that are being run. With a regular thread-pool you would need to oversize it, to make sure that you don't run out of threads.
With CompletableFuture you need to handle a lot more of the actual planning/scheduling yourself, and it will be more complex to tune if you decide to change things. With FJP the only thing you need to tune is the number of threads in the pool; with CF you also need to think about the plain vs. Async variants of the chaining methods (e.g. thenApply vs. thenApplyAsync).
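To make the ForkJoinPool suggestion concrete, here is a sketch that mirrors the shape of the question's example (8 outer tasks, each spawning 8 inner ones) but runs on a pool that is deliberately smaller than the number of outer tasks. The class names and the runAll helper are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class NestedForkJoin {

    /** Inner task: a trivial unit of work. */
    static class Inner extends RecursiveTask<String> {
        private final int k;
        Inner(int k) { this.k = k; }
        @Override protected String compute() { return "number " + k + " in inner loop"; }
    }

    /** Outer task: forks 8 inner tasks, then joins them. */
    static class Outer extends RecursiveTask<List<String>> {
        @Override protected List<String> compute() {
            List<Inner> inner = new ArrayList<>();
            for (int j = 0; j < 8; j++) {
                Inner task = new Inner(j);
                task.fork();                 // queued on this worker's own deque
                inner.add(task);
            }
            List<String> out = new ArrayList<>();
            for (Inner task : inner) {
                out.add(task.join());        // a worker in join() can run pending subtasks itself
            }
            return out;
        }
    }

    /** Runs 8 outer tasks on a pool with the given parallelism and returns the number of results. */
    static int runAll(int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            List<Outer> outer = new ArrayList<>();
            for (int i = 0; i < 8; i++) {
                Outer task = new Outer();
                pool.submit(task);
                outer.add(task);
            }
            int total = 0;
            for (Outer task : outer) {
                total += task.join().size();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling runAll(2) completes with 64 results even though only two worker threads exist, which is the situation that stalls a fixed ExecutorService of the same size.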
I would recommend trying to decompose the work to use completion stages via CompletableFuture
CompletableFuture.supplyAsync(outerTask)
    .thenCompose(result -> CompletableFuture.allOf(innerTasks));
That way your outer task doesn’t hog the execution thread while processing inner tasks, but you still get a Future that resolves when the entire job is done. It can be hard to split those stages up if they’re too tightly coupled though.
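A fuller version of that idea might look like the sketch below. The names ComposedStages and run are hypothetical, and the outer stage simply yields the number of inner tasks to create. The point it tries to show is that no pool thread ever blocks waiting for an inner task, so even a single-threaded pool can finish the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class ComposedStages {

    /** Runs one outer stage and its inner tasks on a fixed pool of the given size. */
    static List<String> run(int poolSize) {
        ExecutorService es = Executors.newFixedThreadPool(poolSize);
        try {
            CompletableFuture<List<String>> all = CompletableFuture
                .supplyAsync(() -> 8, es)                          // outer stage: how many inner tasks to start
                .thenCompose(count -> {
                    List<CompletableFuture<String>> inner = new ArrayList<>();
                    for (int j = 0; j < count; j++) {
                        final int k = j;
                        inner.add(CompletableFuture.supplyAsync(
                                () -> "number " + k + " in inner loop", es));
                    }
                    // allOf completes when every inner future has completed,
                    // so the join() calls below return immediately.
                    return CompletableFuture
                        .allOf(inner.toArray(new CompletableFuture<?>[0]))
                        .thenApply(ignored -> inner.stream()
                                .map(CompletableFuture::join)
                                .collect(Collectors.toList()));
                });
            return all.join();                                     // only the calling thread waits
        } finally {
            es.shutdown();
        }
    }
}
```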
The approach you are suggesting rests on the assumption that the problem goes away once there are more threads than tasks. That will not work here as long as everything is submitted to a single fixed-size pool; you can try it and see. As you state in the comments in your code, this is a simple case of deadlock.
In that case, use two separate thread pools, one for the outer tasks and another for the inner ones. When a task in the inner pool completes, simply return its value to the outer task.
Or you can simply create a thread on the fly, do the work in it, and return the result to the outer task.

Using Threads in Loop

I have a for loop which needs to execute 36000 times
for(int i=0;i<36000;i++)
{
}
Is it possible to use multiple threads in order to execute the loop faster?
Please suggest how to do this.
If you want a more explicit method, you can use thread pools with Thread, Callable or Runnable. See my answers here for examples:
Java : a method to do multiple calculations on arrays quickly
Thread won't naturally exit at end of run()
I do not recommend using Java's Fork/Join framework, as it is not as great as it was hyped to be. Performance is pretty bad. Instead, I would use Java 8's map and parallel streams if you want to make it easy. You have several options using this method.
IntStream.range(1, 4)
.mapToObj(i -> "testing " + i)
.forEach(System.out::println);
You would want to call map(lambda), since Java 8 finally brings lambda functions. It is possible to feed the stream one huge list, but there will be a performance impact; IntStream.range will do what you want. Then you need to figure out which of the new functions you want to use, like filter, map, count, sum, reduce, etc. Streams are sequential by default, so you have to call parallel() on the stream to make it a parallel stream. See these links.
https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/
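Applied to the 36000-iteration loop from the question, a parallel stream could be sketched as follows. The work method is a made-up placeholder for the loop body and is assumed to be independent for each index.

```java
import java.util.stream.IntStream;

class ParallelLoop {

    /** Hypothetical stand-in for the loop body; must not depend on other iterations. */
    static long work(int i) {
        return (long) i * i;
    }

    /** Runs work(i) for i in [0, iterations) on a parallel stream and sums the results. */
    static long runParallel(int iterations) {
        return IntStream.range(0, iterations)
                .parallel()                      // streams are sequential unless asked otherwise
                .mapToLong(ParallelLoop::work)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(runParallel(36000));
    }
}
```

Whether this is actually faster than the plain loop depends on how expensive the body is; for very cheap bodies the coordination overhead can dominate.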
The classic method, which still has the best performance, is to do it yourself using a thread pool:
Basically, you would create a Runnable (does not return anything) or Callable (returns a result) object that will do some work on one of the threads in the pool. The pool will handle scheduling, which is great for us. Java has several options for the pool you use. You can create a Runnable/Callable in a loop, then submit that into the pool. The pool immediately returns a Future object that represents the task. You can add that Future to an ArrayList if you have many of these. After adding all the futures to the list, loop through them and call future.get(), which will wait for the end of execution. See the linked example above, which does not use a list, but does everything else I said.
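The steps above can be sketched like this. Splitting the range into one chunk per thread (rather than one task per iteration) is my own assumption to keep task overhead low, and the loop body just sums the indices as a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedLoop {

    /** Splits [0, iterations) into one chunk per thread, runs the chunks on a pool, and adds up the results. */
    static long run(int iterations, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            int chunk = (iterations + threads - 1) / threads;       // ceiling division
            for (int start = 0; start < iterations; start += chunk) {
                final int from = start;
                final int to = Math.min(iterations, start + chunk);
                Callable<Long> task = () -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) {
                        partial += i;                               // placeholder for the real loop body
                    }
                    return partial;
                };
                futures.add(pool.submit(task));                     // returns immediately with a Future
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get();                              // waits for that task to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, ChunkedLoop.run(36000, 4) creates four tasks of 9000 iterations each.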

Processing sub-streams of a stream in Java using executors

I have a program that processes a huge stream (not in the sense of java.util.stream, but rather InputStream) of data coming in through the network. The stream consists of objects, each having a sort of sub-stream identifier. Right now the whole processing is done in a single thread, but it takes a lot of CPU time and each sub-stream can easily be processed independently, so I'm thinking of multi-threading it.
However, each sub-stream needs to keep a lot of bulky state, including various buffers, hash maps and such. There is no particular reason to make it concurrent or synchronized since sub-streams are independent of each other. Moreover, each sub-stream requires that its objects are processed in the order they arrive, which means that there should probably be a single thread for each sub-stream (but possibly one thread processing multiple sub-streams).
I'm thinking of several approaches to this, but they are not quite elegant.
1. Create a single ThreadPoolExecutor for all tasks. Each task will contain the next object to process and the reference to a Processor instance which keeps all the state. That would ensure the necessary happens-before relationship thus ensuring that the processing thread will see the up-to-date state for this sub-stream. This approach has no way to make sure that the next object of the same sub-stream will be processed in the same thread, as far as I can see. Moreover, it needs some guarantee that objects will be processed in the order they come in, which will require additional synchronization of the Processor objects, introducing unnecessary delays.
2. Create multiple single-thread executors manually and a sort of hash-map that maps sub-stream identifiers to executors. This approach requires manual management of executors, creating them or shutting them down as new sub-streams begin or end, and distributing the tasks between them accordingly.
3. Create a custom executor that processes a special subclass of tasks each having a sub-stream ID. This executor would use it as a hint to use the same thread for executing this task as the previous one with the same ID. However, I don't see an easy way to implement such an executor. Unfortunately, it doesn't seem possible to extend any of the existing executor classes, and implementing an executor from scratch is kind of overkill.
4. Create a single ThreadPoolExecutor, but instead of creating a task for each incoming object, create a single long-running task for each sub-stream that would block in a concurrent queue, waiting for the next object. Then put objects in queues according to their sub-stream IDs. This approach needs as many threads as there are sub-streams because the tasks will be blocked. The expected number of sub-streams is about 30-60, so that may be acceptable.
5. Alternatively, proceed as in 4, but limit the number of threads, assigning multiple sub-streams to a single task. This is sort of a hybrid between 2 and 4. As far as I can see, this is the best approach of these, but it still requires some sort of manual sub-stream distribution between tasks and some way to shut the extra tasks down as sub-streams end.
What would be the best way to ensure that each sub-stream is processed in its own thread without a lot of error-prone code? So that the following pseudo-code will work:
// loop {
Item next = stream.read();
int id = next.getSubstreamID();
Processor processor = getProcessor(id);
SubstreamTask task = new SubstreamTask(processor, next, id);
executor.submit(task); // This makes sure that the task will
// be executed in the same thread as the
// previous task with the same ID.
// } // loop
I suggest having an array of single threaded executors. If you can devise a consistent hashing strategy for sub-streams, you can map sub-streams to individual threads. e.g.
final ExecutorService[] es = ...
public void submit(int id, Runnable run) {
es[(id & 0x7FFFFFFF) % es.length].submit(run);
}
The key could be a String or a long, as long as there is some way to identify the sub-stream. If you know a particular sub-stream is very expensive, you could assign it a dedicated thread.
The solution I finally chose looks like this:
private final Executor[] streamThreads
= new Executor[Runtime.getRuntime().availableProcessors()];
{
for (int i = 0; i < streamThreads.length; ++i) {
streamThreads[i] = Executors.newSingleThreadExecutor();
}
}
private final ConcurrentHashMap<SubstreamId, Integer>
threadById = new ConcurrentHashMap<>();
This code determines which executor to use:
Message msg = in.readNext();
SubstreamId msgSubstream = msg.getSubstreamId();
int exe = threadById.computeIfAbsent(msgSubstream,
id -> findBestExecutor());
streamThreads[exe].execute(() -> {
// processing goes here
});
And the findBestExecutor() function is this:
private int findBestExecutor() {
// Thread index -> substream count mapping:
final int[] loads = new int[streamThreads.length];
for (int thread : threadById.values()) {
++loads[thread];
}
// return the index of the minimum load
return IntStream.range(0, streamThreads.length)
.reduce((i, j) -> loads[i] <= loads[j] ? i : j)
.orElse(0);
}
This is, of course, not very efficient, but note that this function is only called when a new sub-stream shows up (which happens several times every few hours, so it's not a big deal in my case). My real code looks a bit more complicated because I have a way to determine whether two sub-streams are likely to finish simultaneously, and if they are, I try to assign them to different threads in order to maintain even load after they do finish. But since I never mentioned this detail in the question, I guess it doesn't belong to the answer either.

LinkedList Iterator throwing Concurrent Modification Exception

Is there a way to stop a ListIterator from throwing a ConcurrentModificationException? This is what I want to do:
Create a LinkedList with a bunch of objects that have a certain method that is to be executed frequently.
Have a set number of threads (say N) all of which are responsible for executing the said method of the objects in the LinkedList. For example, if there are k objects in the list, thread n would execute the method of the n-th object in the list, then move on to n+N-th object, then to n+2N-th, etc., until it loops back to the beginning.
The problem here lies in the retrieval of these objects. I would obviously be using a ListIterator to do this work. However, I predict this will not get very far, thanks to the ConcurrentModificationException that will be thrown according to the documentation. I want the list to be modifiable, and for the iterators to not care. In fact, it is expected that these objects will create and destroy other objects in the list.
I've thought of a few work-arounds:
Create and destroy a new iterator to retrieve the object at the given index. However, this is O(n), undesirable.
Use an ArrayList instead; however, this is also undesirable, since deletions are O(n) and there are problems with the list needing to expand (and perhaps contract?) from time to time.
Write my own LinkedList class. Don't want to.
Thus, my question. Is there a way to stop a ListIterator from throwing a ConcurrentModificationException?
You seem concerned with performance. Have you actually measured the performance hit of using an O(n) vs O(1) algorithm? Depending on what you are doing and how frequently you are doing it, it might be acceptable to simply use a CopyOnWriteArrayList which is thread safe. Its iterators are also thread safe.
The main performance drag is on mutative operations (set, add, remove...): a fresh copy of the underlying array is made each time.
However, the performance will be good enough for most applications. I would personally try using that, profile my application to check that the performance is good enough, and move on if it is. If it is not, you will need to find other ways.
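As a small illustration of why CopyOnWriteArrayList avoids the exception: its iterators work on a snapshot of the array taken when the iterator was created, so changes made during the loop neither break the iteration nor show up in it. The class name and the job strings below are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class SnapshotIteration {

    /** Iterates over the list while adding and removing elements; returns how many elements were visited. */
    static int visitWhileMutating(List<String> jobs) {
        int visited = 0;
        for (String job : jobs) {       // the iterator sees the list as it was when the loop started
            if (job.equals("a")) {
                jobs.add("d");          // no ConcurrentModificationException with CopyOnWriteArrayList
                jobs.remove("b");
            }
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> jobs = new CopyOnWriteArrayList<>(Arrays.asList("a", "b", "c"));
        System.out.println(visitWhileMutating(jobs));   // prints 3
        System.out.println(jobs);                       // prints [a, c, d]
    }
}
```

The flip side is visible here too: the loop still visits "b" after it was removed and never sees "d", which may or may not be acceptable for the use case in the question.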
Is there a way to stop a ListIterator from throwing a ConcurrentModificationException?
That you are asking this question this way shows a lack of understanding of how to properly use threads to increase the performance of your application.
The whole purpose of using threads is to divide processing and IO into separate runnable entities that can be executed in parallel -- independent of each other. If you are forking threads to all work on the same LinkedList then you most likely will have a performance loss or minimal gain since the overhead of the synchronization necessary to keep each of the threads' "view" of the LinkedList in sync would counter any gains due to parallel execution.
The question should not be "how do I stop ConcurrentModificationException", it should be "how can I use threads to improve the processing of a list of objects". That's the right question.
To process a collection of objects in parallel with a number of threads, you should be using an ExecutorService thread-pool. You create the pool with something like the following code. Each of the entries in your LinkedList (in this example Job) would then be processed by the threads in the pool in parallel.
// create a thread pool with 10 workers
ExecutorService threadPool = Executors.newFixedThreadPool(10);
// submit each of the objects in the list to the pool
for (Job job : jobLinkedList) {
    threadPool.submit(new MyJobProcessor(job));
}
// once we have submitted all jobs to the thread pool, it should be shutdown
threadPool.shutdown();
// wait for the thread-pool jobs to finish
threadPool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
synchronized (jobLinkedList) {
    // not sure this is necessary, but we need a memory barrier somewhere
}
...
// you wouldn't need this if Job implemented Runnable
public class MyJobProcessor implements Runnable {
    private Job job;
    public MyJobProcessor(Job job) {
        this.job = job;
    }
    public void run() {
        // process the job
    }
}
You could use one Iterator to scan the list, and use an Executor to do the work on each object by handing it off to a pool of threads. That's easy, although there's overhead in packaging up work units this way. You still have to be careful to modify the list only through the Iterator's own methods, but maybe that simplifies the problem.
Or can you perform your work in one pass, then list modification in the next?
Can you split into N lists?
Please see the answer from @assylias -- his advice is good. I would add that if you decide to write your own linked list class, you need to think very carefully about how to make it thread-safe.
Think about all the ways your list could get mangled if multiple threads tried to modify it simultaneously. Just locking 1 or 2 nodes is not enough -- as an example, take the following list:
A -> B -> C -> D
Imagine that one thread tries to remove B, just as another thread is removing C. To remove B, the link from A needs to "jump" over B to C. But what if C is no longer part of the list by that time? Likewise, to remove C, the link from B needs to be changed to jump to D, but what if B has already been removed from the list by that time? Similar issues arise when nodes are added simultaneously to nearby parts of the list.
If you have 1 lock per node, and you lock 3 nodes when doing a "remove" operation (the node to be removed, and the nodes before and after it), I think it will be thread-safe. You need to also think carefully about which nodes must be locked when adding nodes, and when traversing the list. To avoid deadlocks, you need to make sure to always acquire locks in a constant order, and when traversing the list, you need to use "hand-over-hand" locking (which precludes the use of ordinary Java monitors -- you need explicit lock objects).

what to use in multithreaded environment; Vector or ArrayList

I have this situation:
A web application with circa 200 concurrent requests (threads) needs to log something to the local filesystem. I have one class that all threads call, and that class internally stores messages in one list (Vector or ArrayList), which in turn will be written to the filesystem.
The idea is to return from the thread's call ASAP so the thread can do its job as fast as possible; what the thread wanted to log can be written to the filesystem later, as it is not so crucial.
So, that class removes the first element from that list and writes it to the filesystem, while at the same time 10 or 20 threads are appending new logs at the end of that list.
I would like to use ArrayList since it is not synchronized and therefore the threads' calls will take less time. The question is:
Am I risking deadlocks / data loss? Is it better to use Vector since it is thread safe? Is it slower to use Vector?
Actually both ArrayList and Vector are very bad choices here, not because of synchronization (which you would definitely need), but because removing the first element is O(n).
The perfect data structure for your purpose is the ConcurrentLinkedQueue: it offers both thread safety (without using synchronization), and O(1) adding and removing.
Are you limited to a particular (old) Java version? If not, please consider using java.util.concurrent.LinkedBlockingQueue for this kind of stuff. It's really worth looking at the java.util.concurrent.* package when dealing with concurrency.
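A minimal sketch of the queue-based logger, assuming a single writer thread and a Consumer<String> in place of the actual file writing (both are illustrative choices, as are the class and method names):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class AsyncLog {
    // Shutdown marker; compared by identity, so it can never collide with a real message.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread writer;

    /** The sink stands in for "write this line to the filesystem". */
    AsyncLog(Consumer<String> sink) {
        writer = new Thread(() -> {
            try {
                // take() blocks while the queue is empty, so the writer does not spin.
                for (String line = queue.take(); line != STOP; line = queue.take()) {
                    sink.accept(line);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        writer.start();
    }

    /** Called by request threads; only enqueues, so it returns quickly. */
    void log(String message) {
        queue.add(message);
    }

    /** Writes out everything queued so far, then stops the writer thread. */
    void close() {
        queue.add(STOP);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The queue here is unbounded, so if the request threads log faster than the writer can drain, memory use grows. Giving LinkedBlockingQueue a capacity and using offer would be one way to cap that, at the cost of dropping or delaying messages.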
Vector is worse than useless. Don't use it even when using multithreading. A trivial example of why it's bad is to consider two threads iterating over the list and removing elements at the same time. The methods size(), get(), remove() might all be synchronized but the iteration loop is not atomic so - kaboom. One thread is bound to try removing something which is not there, or skip elements because the size() changes.
Instead use synchronized() blocks where you expect two threads to access the same data.
private ArrayList myList;
void removeElement(Object e)
{
synchronized (myList) {
myList.remove(e);
}
}
Java 5 provides explicit Lock objects which allow more fine-grained control, such as being able to time out if a resource does not become available within some time period.
private final Lock lock = new ReentrantLock();
private ArrayList myList;
void removeElement(Object e) throws InterruptedException {
if (!lock.tryLock(1, TimeUnit.SECONDS)) {
// Timeout
throw new SomeException();
}
try {
myList.remove(e);
}
finally {
lock.unlock();
}
}
There is actually only a marginal difference in performance between a synchronized list and a Vector. (http://www.javacodegeeks.com/2010/08/java-best-practices-vector-arraylist.html)
