Java multithreaded parser

Java multithreaded parser - java

I am writing a multithreaded parser.
Parser class is as follows.
public class Parser extends HTMLEditorKit.ParserCallback implements Runnable {
private static List<Station> itemList = Collections.synchronizedList(new ArrayList<Item>());
private boolean h2Tag = false;
private int count;
private static int threadCount = 0;
public static List<Item> parse() {
for (int i = 1; i <= 1000; i++) { //1000 of the same type of pages that need to parse
while (threadCount == 20) { //limit the number of simultaneous threads
try {
Thread.sleep(50);
} catch (InterruptedException ex) {
ex.printStackTrace();
}
}
Thread thread = new Thread(new Parser());
thread.setName(Integer.toString(i));
threadCount++; //increase the number of working threads
thread.start();
}
return itemList;
}
public void run() {
//Here is a piece of code responsible for creating links based on
//the thread name and passed as a parameter remained i,
//connection, start parsing, etc.
//In general, nothing special. Therefore, I won't paste it here.
threadCount--; //reduce the number of running threads when current stops
}
private static void addItem(Item item) {
itenList.add(item);
}
//This method retrieves the necessary information after the H2 tag is detected
#Override
public void handleText(char[] data, int pos) {
if (h2Tag) {
String itemName = new String(data).trim();
//Item - the item on which we receive information from a Web page
Item item = new Item();
item.setName(itemName);
item.setId(count);
addItem(item);
//Display information about an item in the console
System.out.println(count + " = " + itemName);
}
}
#Override
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if (HTML.Tag.H2 == t) {
h2Tag = true;
}
}
#Override
public void handleEndTag(HTML.Tag t, int pos) {
if (HTML.Tag.H2 == t) {
h2Tag = false;
}
}
}
From another class parser runs as follows:
List<Item> list = Parser.parse();
All is good, but there is a problem. At the end of parsing in the final list "List itemList" contains 980 elements onto, instead of 1000. But in the console there is all of 1000 elements (items). That is, some threads for some reason did not call in the handleText method the addItem method.
I already tried to change the type of itemList to ArrayList, CopyOnWriteArrayList, Vector. Makes the method addItem synchronized, changed its call on the synchronized block. All this only changes the number of elements a little, but the final thousand can not be obtained.
I also tried to parse a smaller number of pages (ten). As the result the list is empty, but in the console all 10.
If I remove multi-threading, then everything works fine, but, of course, slowly. That's not good.
If decrease the number of concurrent threads, the number of items in the list is close to the desired 1000, if increase - a little distanced from 1000. That is, I think, there is a struggle for the ability to record to the list. But then why are synchronization not working?
What's the problem?

After your parse() call returns, all of your 1000 Threads have been started, but it is not guaranteed that they are finished. In fact, they aren't that's the problem you see. I would heavily recommend not write this by yourself but use the tools provided for this kind of job by the SDK.
The documentation Thread Pools and the ThreadPoolExecutor are e.g. a good starting point. Again, don't implement this yourself if you are not absolutely sure you have too, because writing such multi-threading code is pure pain.
Your code should look something like this:
ExecutorService executor = Executors.newFixedThreadPool(20);
List<Future<?>> futures = new ArrayList<Future<?>>(1000);
for (int i = 0; i < 1000; i++) {
futures.add(executor.submit(new Runnable() {...}));
}
for (Future<?> f : futures) {
f.get();
}

There is no problem with the code, it is working as you have coded. the problem is with the last iteration. rest all iterations will work properly, but during the last iteration which is from 980 to 1000, the threads are created, but the main process, does not waits for the other thread to complete, and then return the list. therefore you will be getting some odd number between 980 to 1000, if you are working with 20 threads at a time.
Now you can try adding Thread.wait(50), before returning the list, in that case your main thread will wait, some time, and may be by the time, other threads might finish the processing.
or you can use some syncronization API from java. Instead of Thread.wait(), use CountDownLatch, this will help you to wait for the threads to complete the processing, and then you can create new threads.

Related

How to prioritise waiting CompletableFutures by access time instead of creation time?

TL;DR: When several CompletableFutures are waiting to get executed, how can I prioritize those whose values i'm interested in?
I have a list of 10,000 CompletableFutures (which calculate the data rows for an internal report over the product database):
List<Product> products = ...;
List<CompletableFuture<DataRow>> dataRows = products
.stream()
.map(p -> CompletableFuture.supplyAsync(() -> calculateDataRowForProduct(p), singleThreadedExecutor))
.collect(Collectors.toList());
Each takes around 50ms to complete, so the entire thing finishes in 500sec. (they all share the same DB connection, so cannot run in parallel).
Let's say I want to access the data row of the 9000th product:
dataRows.get(9000).join()
The problem is, all these CompletableFutures are executed in the order they have been created, not in the order they are accessed. Which means I have to wait 450sec for it to calculate stuff that at the moment I don't care about, to finally get to the data row I want.
Question:
Is there any way to change this behaviour, so that the Futures I try to access get priority over those I don't care about at the moment?
First thoughts:
I noticed that a ThreadPoolExecutor uses a BlockingQueue<Runnable> to queue up entries waiting for an available Thread.
So I thought about using a PriorityBlockingQueue, to change the priority of the Runnable when I access its CompletableFuture but:
PriorityBlockingQueue does not have a method to reprioritize an existing element, and
I need to figure out a way to get from the CompletableFuture to the corresponding Runnable entry in the queue.
Before I go further down this road, do you think this sounds like the correct approach. Do others ever had this kind of requirement? I tried to search for it, but found exactly nothing. Maybe CompletableFuture is not the correct way of doing this?
Background:
We have an internal report which displays 100 products per page. Initially we precalculated all DataRows for the report, which took way to long if someone has that many products.
So first optimization was to wrap the calculation in a memoized supplier:
List<Supplier<DataRow>> dataRows = products
.stream()
.map(p -> Suppliers.memoize(() -> calculateDataRowForProduct(p)))
.collect(Collectors.toList());
This means that initial display of first 100 entries now takes 5sec instead of 500sec (which is great), but when the user switches to the next pages, it takes another 5sec for each single one of them.
So the idea is, while the user is staring at the first screen, why not precalculate the next pages in the background. Which leads me to my question above.

Interesting problem :)
One way is to roll out custom FutureTask class to facilitate changing priorities of tasks dynamically.
DataRow and Product are both taken as just String here for simplicity.
import java.util.*;
import java.util.concurrent.*;
public class Testing {
private static String calculateDataRowForProduct(String product) {
try {
// Dummy operation.
Thread.sleep(200);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("Computation done for " + product);
return "data row for " + product;
}
public static void main(String[] args) throws ExecutionException, InterruptedException {
PriorityBlockingQueue<Runnable> customQueue = new PriorityBlockingQueue<Runnable>(1, new CustomRunnableComparator());
ThreadPoolExecutor executor = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, customQueue);
List<String> products = new ArrayList<>();
for (int i = 0; i < 10; i++) {
products.add("product" + i);
}
Map<Integer, PrioritizedFutureTask<String>> taskIndexMap = new HashMap<>();
for (int i = 0; i < products.size(); i++) {
String product = products.get(i);
Callable callable = () -> calculateDataRowForProduct(product);
PrioritizedFutureTask<String> dataRowFutureTask = new PrioritizedFutureTask<>(callable, i);
taskIndexMap.put(i, dataRowFutureTask);
executor.execute(dataRowFutureTask);
}
List<Integer> accessOrder = new ArrayList<>();
accessOrder.add(4);
accessOrder.add(7);
accessOrder.add(2);
accessOrder.add(9);
int priority = -1 * accessOrder.size();
for (Integer nextIndex : accessOrder) {
PrioritizedFutureTask taskAtIndex = taskIndexMap.get(nextIndex);
assert (customQueue.remove(taskAtIndex));
customQueue.offer(taskAtIndex.set_priority(priority++));
// Now this task will be at the front of the thread pool queue.
// Hence this task will execute next.
}
for (Integer nextIndex : accessOrder) {
PrioritizedFutureTask<String> dataRowFutureTask = taskIndexMap.get(nextIndex);
String dataRow = dataRowFutureTask.get();
System.out.println("Data row for index " + nextIndex + " = " + dataRow);
}
}
}
class PrioritizedFutureTask<T> extends FutureTask<T> implements Comparable<PrioritizedFutureTask<T>> {
private Integer _priority = 0;
private Callable<T> callable;
public PrioritizedFutureTask(Callable<T> callable, Integer priority) {
super(callable);
this.callable = callable;
_priority = priority;
}
public Integer get_priority() {
return _priority;
}
public PrioritizedFutureTask set_priority(Integer priority) {
_priority = priority;
return this;
}
#Override
public int compareTo(#NotNull PrioritizedFutureTask<T> other) {
if (other == null) {
throw new NullPointerException();
}
return get_priority().compareTo(other.get_priority());
}
}
class CustomRunnableComparator implements Comparator<Runnable> {
#Override
public int compare(Runnable task1, Runnable task2) {
return ((PrioritizedFutureTask)task1).compareTo((PrioritizedFutureTask)task2);
}
}
Output:
Computation done for product0
Computation done for product4
Data row for index 4 = data row for product4
Computation done for product7
Data row for index 7 = data row for product7
Computation done for product2
Data row for index 2 = data row for product2
Computation done for product9
Data row for index 9 = data row for product9
Computation done for product1
Computation done for product3
Computation done for product5
Computation done for product6
Computation done for product8
There is one more scope of optimization here.
The customQueue.remove(taskAtIndex) operation has O(n) time complexity with respect to the size of the queue (or the total number of products).
It might not affect much if the number of products is less (<= 10^5).
But it might result in a performance issue otherwise.
One solution to that is to extend BlockingPriorityQueue and roll out functionality to remove an element from a priority queue in O(logn) rather than O(n).
We can achieve that by keeping a hashmap inside the PriorityQueue structure. This hashmap will keep a count of elements vs the index (or indices in case of duplicates) of that element in the underlying array.
Fortunately, I had already implemented such a heap in Python sometime back.
If you have more questions on this optimization, its probably better to ask a new question altogether.

You could avoid submitting all of the tasks to the executor at the start, instead only submit one background task and when it finishes submit the next. If you want to get the 9000th row submit it immediately (if it has not already been submitted):
static class FutureDataRow {
CompletableFuture<DataRow> future;
int index;
List<FutureDataRow> list;
Product product;
FutureDataRow(List<FutureDataRow> list, Product product){
this.list = list;
index = list.size();
list.add(this);
this.product = product;
}
public DataRow get(){
submit();
return future.join();
}
private synchronized void submit(){
if(future == null) future = CompletableFuture.supplyAsync(() ->
calculateDataRowForProduct(product), singleThreadedExecutor);
}
private void background(){
submit();
if(index >= list.size() - 1) return;
future.whenComplete((dr, t) -> list.get(index + 1).background());
}
}
...
List<FutureDataRow> dataRows = new ArrayList<>();
products.forEach(p -> new FutureDataRow(dataRows, p));
dataRows.get(0).background();
If you want you could also submit the next row inside the get method if you expect that they will navigate to the next page afterwards.
If you were instead using a multithreaded executor and you wanted to run multiple background tasks concurrently you could modify the background method to find the next unsubmitted task in the list and start it when the current background task has finished.
private synchronized boolean background(){
if(future != null) return false;
submit();
future.whenComplete((dr, t) -> {
for(int i = index + 1; i < list.size(); i++){
if(list.get(i).background()) return;
}
});
return true;
}
You would also need to start the first n tasks in the background instead of just the first one.
int n = 8; //number of active background tasks
for(int i = 0; i < dataRows.size() && n > 0; i++){
if(dataRows.get(i).background()) n--;
}

To answer my own question...
There is a surprisingly simple (and surprisingly boring) solution to my problem. I have no idea why it took me three days to find it, I guess it required the right mindset, that you only have when walking along an endless tranquilizing beach looking into the sunset on a quiet Sunday evening.
So, ah, it's a little bit embarrassing to write this, but when I need to fetch a certain value (say for 9000th product), and the future has not yet computed that value, I can, instead of somehow forcing the future to produce that value asap (by doing all this repriorisation and scheduling magic), I can, well, I can, ... simply ... compute that value myself! Yes! Wait, what? Seriously, that's it?
It's something like this: if (!future.isDone()) {future.complete(supplier.get());}
I just need to store the original Supplier alongside the CompletableFuture in some wrapper class. This is the wrapper class, which works like a charm, all it needs is a better name:
public static class FuturizedMemoizedSupplier<T> implements Supplier<T> {
private CompletableFuture<T> future;
private Supplier<T> supplier;
public FuturizedSupplier(Supplier<T> supplier) {
this.supplier = supplier;
this.future = CompletableFuture.supplyAsync(supplier, singleThreadExecutor);
}
public T get() {
// if the future is not yet completed, we just calculate the value ourselves, and set it into the future
if (!future.isDone()) {
future.complete(supplier.get());
}
supplier = null;
return future.join();
}
}
Now, I think, there is a small chance for a race condition here, which could lead to the supplier being executed twice. But actually, I don't care, it produces the same value anyway.
Afterthoughts:
I have no idea why I didn't think of this earlier, I was completely fixated on the idea, it has to be the CompletableFuture which calculates the value, and it has to run in one of these background threads, and whatnot, and, well, none of these mattered or were in any way a requirement.
I think this whole question is a classic example of Ask what problem you really want to solve instead of coming up with a half baked broken solution, and ask how to fix that. In the end, I didn't care about CompletableFuture or any of its features at all, it was just the easiest way that came to my mind to run something in the background.
Thanks for your help!

Concurrent checking if collection is empty

I have this piece of code:
private ConcurrentLinkedQueue<Interval> intervals = new ConcurrentLinkedQueue();
#Override
public void run(){
while(!intervals.isEmpty()){
//remove one interval
//do calculations
//add some intervals
}
}
This code is being executed by a specific number of threads at the same time. As you see, loop should go on until there are no more intervals left in the collection, but there is a problem. In the beginning of each iteration an interval gets removed from collection and in the end some number of intervals might get added back into same collection.
Problem is, that while one thread is inside the loop the collection might become empty, so other threads that are trying to enter the loop won't be able to do that and will finish their work prematurely, even though collection might be filled with values after the first thread will finish the iteration. I want the thread count to remain constant (or not more than some number n) until all work is really finished.
That means that no threads are currently working in the loop and there are no elements left in the collection. What are possible ways of accomplishing that? Any ideas are welcomed.
One way to solve this problem in my specific case is to give every thread a different piece of the original collection. But after one thread would finish its work it wouldn't be used by the program anymore, even though it could help other threads with their calculations, so I don't like this solution, because it's important to utilize all cores of the machine in my problem.
This is the simplest minimal working example I could come up with. It might be to lengthy.
public class Test{
private ConcurrentLinkedQueue<Interval> intervals = new ConcurrentLinkedQueue();
private int threadNumber;
private Thread[] threads;
private double result;
public Test(int threadNumber){
intervals.add(new Interval(0, 1));
this.threadNumber = threadNumber;
threads = new Thread[threadNumber];
}
public double find(){
for(int i = 0; i < threadNumber; i++){
threads[i] = new Thread(new Finder());
threads[i].start();
}
try{
for(int i = 0; i < threadNumber; i++){
threads[i].join();
}
}
catch(InterruptedException e){
System.err.println(e);
}
return result;
}
private class Finder implements Runnable{
#Override
public void run(){
while(!intervals.isEmpty()){
Interval interval = intervals.poll();
if(interval.high - interval.low > 1e-6){
double middle = (interval.high + interval.low) / 2;
boolean something = true;
if(something){
intervals.add(new Interval(interval.low + 0.1, middle - 0.1));
intervals.add(new Interval(middle + 0.1, interval.high - 0.1));
}
else{
intervals.add(new Interval(interval.low + 0.1, interval.high - 0.1));
}
}
}
}
}
private class Interval{
double low;
double high;
public Interval(double low, double high){
this.low = low;
this.high = high;
}
}
}
What you might need to know about the program: After every iteration interval should either disappear (because it's too small), become smaller or split into two smaller intervals. Work is finished after no intervals are left. Also, I should be able to limit number of threads that are doing this work with some number n. The actual program looks for a maximum value of some function by dividing the intervals and throwing away the parts of those intervals that can't contain the maximum value using some rules, but this shouldn't really be relevant to my problem.

The CompletableFuture class is also an interesting solution for these kind of tasks.
It automatically distributes workload over a number of worker threads.
static CompletableFuture<Integer> fibonacci(int n) {
if(n < 2) return CompletableFuture.completedFuture(n);
else {
return CompletableFuture.supplyAsync(() -> {
System.out.println(Thread.currentThread());
CompletableFuture<Integer> f1 = fibonacci(n - 1);
CompletableFuture<Integer> f2 = fibonacci(n - 2);
return f1.thenCombineAsync(f2, (a, b) -> a + b);
}).thenComposeAsync(f -> f);
}
}
public static void main(String[] args) throws Exception {
int fib = fibonacci(10).get();
System.out.println(fib);
}

You can use atomic flag, i.e.:
private ConcurrentLinkedQueue<Interval> intervals = new ConcurrentLinkedQueue<>();
private AtomicBoolean inUse = new AtomicBoolean();
#Override
public void run() {
while (!intervals.isEmpty() && inUse.compareAndSet(false, true)) {
// work
inUse.set(false);
}
}
UPD
Question has been updated, so I would give you better solution. It is more "classic" solution using blocking queue;
private BlockingQueue<Interval> intervals = new ArrayBlockingQueue<Object>();
private volatile boolean finished = false;
#Override
public void run() {
try {
while (!finished) {
Interval next = intervals.take();
// put work there
// after you decide work is finished just set finished = true
intervals.put(interval); // anyway, return interval to queue
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
UPD2
Now it seems better to re-write solution and divide range to sub-ranges for each thread.

Your problem looks like a recursive one - processing one task (interval) might produce some sub-tasks (sub intervals).
For that purpose I would use ForkJoinPool and RecursiveTask:
class Interval {
...
}
class IntervalAction extends RecursiveAction {
private Interval interval;
private IntervalAction(Interval interval) {
this.interval = interval;
}
#Override
protected void compute() {
if (...) {
// we need two sub-tasks
IntervalAction sub1 = new IntervalAction(new Interval(...));
IntervalAction sub2 = new IntervalAction(new Interval(...));
sub1.fork();
sub2.fork();
sub1.join();
sub2.join();
} else if (...) {
// we need just one sub-task
IntervalAction sub3 = new IntervalAction(new Interval(...));
sub3.fork();
sub3.join();
} else {
// current task doesn't need any sub-tasks, just return
}
}
}
public static void compute(Interval initial) {
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new IntervalAction(initial));
// invoke will return when all the processing is completed
}

I had the same problem, and I tested the following solution.
In my test example I have a queue (the equivalent of your intervals) filled with integers. For the test, at each iteration one number is taken from the queue, incremented and placed back in the queue if the new value is below 7 (arbitrary). This has the same impact as your interval generation on the mechanism.
Here is an example working code (Note that I develop in java 1.8 and I use the Executor framework to handle my thread pool.) :
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
public class Test {
final int numberOfThreads;
final BlockingQueue<Integer> queue;
final BlockingQueue<Integer> availableThreadsTokens;
final BlockingQueue<Integer> sleepingThreadsTokens;
final ThreadPoolExecutor executor;
public static void main(String[] args) {
final Test test = new Test(2); // arbitrary number of thread => 2
test.launch();
}
private Test(int numberOfThreads){
this.numberOfThreads = numberOfThreads;
this.queue = new PriorityBlockingQueue<Integer>();
this.availableThreadsTokens = new LinkedBlockingQueue<Integer>(numberOfThreads);
this.sleepingThreadsTokens = new LinkedBlockingQueue<Integer>(numberOfThreads);
this.executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(numberOfThreads);
}
public void launch() {
// put some elements in queue at the beginning
queue.add(1);
queue.add(2);
queue.add(3);
for(int i = 0; i < numberOfThreads; i++){
availableThreadsTokens.add(1);
}
System.out.println("Start");
boolean algorithmIsFinished = false;
while(!algorithmIsFinished){
if(sleepingThreadsTokens.size() != numberOfThreads){
try {
availableThreadsTokens.take();
} catch (final InterruptedException e) {
e.printStackTrace();
// some treatment should be put there in case of failure
break;
}
if(!queue.isEmpty()){ // Continuation condition
sleepingThreadsTokens.drainTo(availableThreadsTokens);
executor.submit(new Loop(queue.poll(), queue, availableThreadsTokens));
}
else{
sleepingThreadsTokens.add(1);
}
}
else{
algorithmIsFinished = true;
}
}
executor.shutdown();
System.out.println("Finished");
}
public static class Loop implements Runnable{
int element;
final BlockingQueue<Integer> queue;
final BlockingQueue<Integer> availableThreadsTokens;
public Loop(Integer element, BlockingQueue<Integer> queue, BlockingQueue<Integer> availableThreadsTokens){
this.element = element;
this.queue = queue;
this.availableThreadsTokens = availableThreadsTokens;
}
#Override
public void run(){
System.out.println("taking element "+element);
for(Long l = (long) 0; l < 500000000L; l++){
}
for(Long l = (long) 0; l < 500000000L; l++){
}
for(Long l = (long) 0; l < 500000000L; l++){
}
if(element < 7){
this.queue.add(element+1);
System.out.println("Inserted element"+(element + 1));
}
else{
System.out.println("no insertion");
}
this.availableThreadsTokens.offer(1);
}
}
}
I ran this code for check, and it seems to work properly. However there are certainly some improvement that can be made :
sleepingThreadsTokens do not have to be a BlockingQueue, since only the main accesses it. I used this interface because it allowed a nice sleepingThreadsTokens.drainTo(availableThreadsTokens);
I'm not sure whether queue has to be blocking or not, since only main takes from it and does not wait for elements (it waits only for tokens).
...
The idea is that the main thread checks for the termination, and for this it has to know how many threads are currently working (so that it does not prematurely stops the algorithm because the queue is empty). To do so two specific queues are created : availableThreadsTokens and sleepingThreadsTokens. Each element in availableThreadsTokens symbolizes a thread that have finished an iteration, and wait to be given another one. Each element in sleepingThreadsTokens symbolizes a thread that was available to take a new iteration, but the queue was empty, so it had no job and went to "sleep". So at each moment availableThreadsTokens.size() + sleepingThreadsTokens.size() = numberOfThreads - threadExcecutingIteration.
Note that the elements on availableThreadsTokens and sleepingThreadsTokens only symbolizes thread activity, they are not thread nor design a specific thread.
Case of termination : let suppose we have N threads (aribtrary, fixed number). The N threads are waiting for work (N tokens in availableThreadsTokens), there is only 1 remaining element in the queue and the treatment of this element won't generate any other element. Main takes the first token, finds that the queue is not empty, poll the element and sends the thread to work. The N-1 next tokens are consumed one by one, and since the queue is empty the token are moved into sleepingThreadsTokens one by one. Main knows that there is 1 thread working in the loop since there is no token in availableThreadsTokens and only N-1 in sleepingThreadsTokens, so it waits (.take()). When the thread finishes and releases the token Main consumes it, discovers that the queue is now empty and put the last token in sleepingThreadsTokens. Since all tokens are now in sleepingThreadsTokens Main knows that 1) all threads are inactive 2) the queue is empty (else the last token wouldn't have been transferred to sleepingThreadsTokens since the thread would have take the job).
Note that if the working thread finishes the treatment before all the availableThreadsTokens are moved to sleepingThreadsTokens it makes no difference.
Now if we suppose that the treatment of the last element would have generated M new elements in the queue then the Main would have put all the tokens from sleepingThreadsTokens back to availableThreadsTokens, and start to assign them treatments again. We put all the token back even if M < N because we don't know how much elements will be inserted in the future, so we have to keep all the thread available.

I would suggest a master/worker approach then.
The master process goes through the intervals and assigns the calculations of that interval to a different process. It also removes/adds as necessary. This way, all the cores are utilized, and only when all intervals are finished, the process is done. This is also known as dynamic work allocation.
A possible example:
public void run(){
while(!intervals.isEmpty()){
//remove one interval
Thread t = new Thread(new Runnable()
{
//do calculations
});
t.run();
//add some intervals
}
}
The possible solution you provided is known as static allocation, and you're correct, it will finish as fast as the slowest processor, but the dynamic approach will utilize all memory.

I've run into this problem as well. The way I solved it was to use an AtomicInteger to know what is in the queue. Before each offer() increment the integer. After each poll() decrement the integer. The CLQ has no real isEmpty() since it must look at head/tail nodes and this can change atomically (CAS).
This doesn't guarantee 100% that some thread may increment after another thread decrements so you need to check again before ending the thread. It is better than relying on while(...isEmpty())
Other than that, you may need to synchronize.

Multithreading arraylist of objects

My program has an arraylist of websites which I do I/O with image processing, scrape data from sites and update/insert into database. Right now it is slow because all of the I/O being done. I would like to speed this up by allowing my program to run with threads. Nothing is ever removed from the list and every website in the list is separate from each other so to me it seems okay to have instances looping through the list at the same time to speed this up.
Let's say my list is 10 websites, right now of course it's looping through position 0 through 9 until my program is done processing for all websites.
And let's say I want to have 3 threads looping through this list of 10 websites at once doing all the I/O and database updates in their own separate space at the same time but using the same list.
website.get(0) // thread1
website.get(1) // thread2
website.get(2) // thread3
Then say if thread2 reaches the end of the loop it first it comes back and works on the next position
website.get(3) // thread2
Then thread3 completes and gets the next position
website.get(4) // thread3
and then thread1 finally completes and works on the next position
website.get(5) // thread1
etc until it's done. Is this easy to set up? Is there somewhere I can find a good example of it being done? I've looked online to try to find somewhere else talking about my scenario but I haven't found it.

In my app, I use ExecutorService like this, and it works well:
Main code:
ExecutorService pool = Executors.newFixedThreadPool(3); //number of concurrent threads
for (String name : website) { //Your ArrayList
pool.submit(new DownloadTask(name, toPath));
}
pool.shutdown();
pool.awaitTermination(5, TimeUnit.SECONDS); //Wait for all the threads to finish, adjust as needed.
The actual class where you do the work:
private static class DownloadTask implements Runnable {
private String name;
private final String toPath;
public DownloadTask(String name, String toPath) {
this.name = name;
this.toPath = toPath;
}
#Override
public void run() {
//Do your parsing / downloading / etc. here.
}
}
Some cautions:
If you are using a database, you have to ensure that you don't have two threads writing to that database at the same time.
See here for more info.

As mentioned in other comments/answer you just need a thread pool executor with fixed size (say 3 as per your example) which runs 3 threads which iterate over the same list without picking up duplicate websites.
So apart from thread pool executor, you probably need to also need to correctly work out the next index in each thread to pick the element from that list in such a way that thread does not pick up same element from list and also not miss any element.
Hence i think you can use BlockingQueue instead of list which eliminates the index calculation part and guarantees that the element is correctly picked from the collection.
public class WebsitesHandler {
public static void main(String[] args) {
BlockingQueue<Object> websites = new LinkedBlockingQueue<>();
ExecutorService executorService = Executors.newFixedThreadPool(3);
Worker[] workers = new Worker[3];
for (int i = 0; i < workers.length; i++) {
workers[i] = new Worker(websites);
}
try {
executorService.invokeAll(Arrays.asList(workers));
} catch (InterruptedException e) {
e.printStackTrace();
}
}
private static class Worker implements Callable {
private BlockingQueue<Object> websites;
public Worker(BlockingQueue<Object> websites) {
this.websites = websites;
}
public String call() {
try {
Object website;
while ((website = websites.poll(1, TimeUnit.SECONDS)) != null) {
// execute the task
}
} catch (InterruptedException e) {
e.printStackTrace();
}
return "done";
}
}
}

I think you need to update yourself with latest version of java i.e Java8
And study about Streams API,That will definitely solve your problem

What are some good strategies for synchronizing in this multi-threaded situation?

I have two primary threads. One spawns new threads and the other listens for results, like so:
//Spawner
while(!done) {
spawnNewProcess(nextId, parameters);
myListener.listenFor(nextId);
nextId ++;
}
The spawnNewProcess method takes a widely variable amount of time. When it finishes, it will put a result object into a map that can be accessed by Id.
The listener thread runs like so:
//Listener
while(!done) {
for (int id : toListenFor) {
if (resultMap.contains(id)) {
result = resultMap.get(id);
toListenFor.remove(id);
process(result);
}
}
}
I can't change the spawnNewProcess method, nor how it stores results. What I want to do is set a maximum limit on how many can be going concurrently. My first inclination would be to just have a variable track that number. If the max would be exceeded, then the spawner waits. When a result comes back, the listener will notify it. Like this:
//Spawner2
AtomicInteger numSpawns = new AtomicInteger(0);
int maxSpawns = 10;
while(!done) {
if (numSpawns.intValue() >= maxSpawns) {
this.wait(0);
}
numSpawns.getAndIncrement;
spawnNewProcess(nextId, parameters);
myListener.listenFor(nextId);
nextId ++;
}
And the Listener be:
//Listener2
while(!done) {
for (int id : toListenFor) {
if (resultMap.contains(id)) {
result = resultMap.get(id);
toListenFor.remove(id);
numSpawns.getAndDecrement();
Spawner.notify();
process(result);
}
}
}
Will this work? Are there potential deadlocks that I'm missing? It wouldn't be a huge deal if somehow 11 or 9 spawns were running at the same time instead of 10. Or is there a much better way that I'm oblivious to?

Use a Semaphore.
import java.util.concurrent.Semaphore;
private Semaphore sem = new Semaphore(NUM_MAX_CONCURRENT);
// Spawner
while(!done) {
sem.acquire(); // added by corsiKa
spawnNewProcess(nextId, parameters);
myListener.listenFor(nextId);
nextId ++;
}
// listener
while(!done) {
for (int id : toListenFor) {
if (resultMap.contains(id)) {
result = resultMap.get(id);
toListenFor.remove(id);
sem.release(); // added by corsiKa
process(result);
}
}
}

To control the number of spawners running, use a Executors.newFixedThreadPool(size), which will always run no more than a fixed amount of tasks at once. Then wrap the spawning tasks in a Runnable and pass them to the ExecutorService.
while(!done) {
task = new Runnable() { public void run() {
spawnNewProcess(nextId, parameters);
} });
exec.submit(task);;
nextId ++;
}
To get the results back, use a SynchronousQueue or ConcurrentLinkedQueue, which will allow you to pass objects between threads without using lower-level concurrency objects.

Assigning a object to a field defined outside a synchronized block - is it thread safe?

Is there anything wrong with the thread safety of this java code? Threads 1-10 add numbers via sample.add(), and Threads 11-20 call removeAndDouble() and print the results to stdout. I recall from the back of my mind that someone said that assigning item in same way as I've got in removeAndDouble() using it outside of the synchronized block may not be thread safe. That the compiler may optimize the instructions away so they occur out of sequence. Is that the case here? Is my removeAndDouble() method unsafe?
Is there anything else wrong from a concurrency perspective with this code? I am trying to get a better understanding of concurrency and the memory model with java (1.6 upwards).
import java.util.*;
import java.util.concurrent.*;
public class Sample {
private final List<Integer> list = new ArrayList<Integer>();
public void add(Integer o) {
synchronized (list) {
list.add(o);
list.notify();
}
}
public void waitUntilEmpty() {
synchronized (list) {
while (!list.isEmpty()) {
try {
list.wait(10000);
} catch (InterruptedException ex) { }
}
}
}
public void waitUntilNotEmpty() {
synchronized (list) {
while (list.isEmpty()) {
try {
list.wait(10000);
} catch (InterruptedException ex) { }
}
}
}
public Integer removeAndDouble() {
// item declared outside synchronized block
Integer item;
synchronized (list) {
waitUntilNotEmpty();
item = list.remove(0);
}
// Would this ever be anything but that from list.remove(0)?
return Integer.valueOf(item.intValue() * 2);
}
public static void main(String[] args) {
final Sample sample = new Sample();
for (int i = 0; i < 10; i++) {
Thread t = new Thread() {
public void run() {
while (true) {
System.out.println(getName()+" Found: " + sample.removeAndDouble());
}
}
};
t.setName("Consumer-"+i);
t.setDaemon(true);
t.start();
}
final ExecutorService producers = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
final int j = i * 10000;
Thread t = new Thread() {
public void run() {
for (int c = 0; c < 1000; c++) {
sample.add(j + c);
}
}
};
t.setName("Producer-"+i);
t.setDaemon(false);
producers.execute(t);
}
producers.shutdown();
try {
producers.awaitTermination(600, TimeUnit.SECONDS);
} catch (InterruptedException e) {
e.printStackTrace();
}
sample.waitUntilEmpty();
System.out.println("Done.");
}
}

It looks thread safe to me. Here is my reasoning.
Everytime you access list you do it synchronized. This is great. Even though you pull out a part of the list in item, that item is not accessed by multiple threads.
As long as you only access list while synchronized, you should be good (in your current design.)

Your synchronization is fine, and will not result in any out-of-order execution problems.
However, I do notice a few issues.
First, your waitUntilEmpty method would be much more timely if you add a list.notifyAll() after the list.remove(0) in removeAndDouble. This will eliminate an up-to 10 second delay in your wait(10000).
Second, your list.notify in add(Integer) should be a notifyAll, because notify only wakes one thread, and it may wake a thread that is waiting inside waitUntilEmpty instead of waitUntilNotEmpty.
Third, none of the above is terminal to your application's liveness, because you used bounded waits, but if you make the two above changes, your application will have better threaded performance (waitUntilEmpty) and the bounded waits become unnecessary and can become plain old no-arg waits.

Your code as-is is in fact thread safe. The reasoning behind this is two part.
The first is mutual exclusion. Your synchronization correctly ensures that only one thread at a time will modify the collections.
The second has to do with your concern about compiler reordering. Youre worried that the compile can in fact re order the assigning in which it wouldnt be thread safe. You dont have to worry about it in this case. Synchronizing on the list creates a happens-before relationship. All removes from the list happens-before the write to Integer item. This tells the compiler that it cannot re order the write to item in that method.

Your code is thread-safe, but not concurrent (as in parallel). As everything is accessed under a single mutual exclusion lock, you are serialising all access, in effect access to the structure is single-threaded.
If you require the functionality as described in your production code, the java.util.concurrent package already provides a BlockingQueue with (fixed size) array and (growable) linked list based implementations. These are very interesting to study for implementation ideas at the very least.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java multithreaded parser - java

Related

How to prioritise waiting CompletableFutures by access time instead of creation time?

Concurrent checking if collection is empty

Multithreading arraylist of objects

What are some good strategies for synchronizing in this multi-threaded situation?

Assigning a object to a field defined outside a synchronized block - is it thread safe?

Categories

Resources