i have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavyweight, hence the queue to limit the number of objects in memory.
Multiple threads take objects from the queue, perform the work and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, e.g. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing references of the threads (Callables) to each other and checking thread.isDone() before a poll or offer, and then checking whether the element is null. I also check the size of the queue: as long as there are elements in it, they must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One of the ways to accomplish this would be to put a "dummy" or "poison" message on the queue as the last message, once you are sure that no more tasks are going to arrive on the queue, for example after putting the message related to the last row of the db query. So the producer puts a dummy message on the queue; the consumer, on receiving this dummy message, knows that no more meaningful work is expected in this batch.
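A minimal sketch of the poison-pill hand-off, assuming one producer and several consumers (the queue bound, item type and counts are illustrative). With multiple consumers, the producer puts one pill per consumer so each of them sees the end-of-stream marker:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class PoisonPillDemo {
    // Sentinel marking end-of-stream; never a real work item.
    private static final String POISON = "__POISON__";

    static int run(int items, int consumers) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4); // small bound keeps memory in check
        AtomicInteger processed = new AtomicInteger();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take();      // blocking take is safe: a pill is guaranteed
                        if (item.equals(POISON)) return; // end-of-stream, exit cleanly
                        processed.incrementAndGet();     // the "calculation" would happen here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }

        for (int i = 0; i < items; i++) queue.put("row-" + i); // producer: e.g. rows read from the db
        for (int i = 0; i < consumers; i++) queue.put(POISON); // one pill per consumer
        for (Thread t : workers) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100, 3)); // prints 100
    }
}
```

Because the pills are the last things put, blocking take/put never deadlocks here: every consumer eventually receives either a real item or its pill.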
Maybe you should take a look at CompletionService
It is designed to combine executor and queue functionality in one.
Tasks which have completed execution will be available from the completion service via
completionServiceInstance.take()
You can then again use another executor for step 3, i.e. filling the DB with the results, which you feed with the results taken from the completionServiceInstance.
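A small sketch of the CompletionService flow (the task, pool size and result handling are illustrative); completed tasks are consumed in completion order, not submission order:

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionServiceDemo {
    static int sumOfSquares(int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> ecs = new ExecutorCompletionService<>(pool);

        for (int i = 1; i <= n; i++) {
            final int x = i;
            ecs.submit(() -> x * x); // the "calculation" step, one task per input
        }

        int sum = 0;
        // take() blocks until the next *finished* task is available.
        for (int i = 0; i < n; i++) {
            sum += ecs.take().get(); // a writer thread would save each result to the db here
        }
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(4)); // 1 + 4 + 9 + 16 = 30
    }
}
```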
Related
I am trying to process an object list at the same time with different classes, but I am not sure if I am heading in the right direction. I have read up about ExecutorService and parallel streams, but I'm not sure if that is the correct way.
So to provide an example:
I have a publisher that collects data and places it in a list. The publisher has multiple subscribers linked to it, which need to process the data and store it in a map. The map is built up until all the data is processed, and is then stored in a database table. Each subscriber has their own table that needs to be populated in some form with the data provided. What I am trying to accomplish is distributing the list to the subscribers at the same time; once all subscribers have finished processing, the next set of data is supplied, and so forth until all the data has been processed for a date range.
If anyone has some suggestions what I can look at, that would be awesome.
I'd start this way:
Publisher
do not collect to a list; rather, make it an Observable and have all the subscribers observe it.
On each batch, initialise a CountDownLatch to the batch size, and wait until it is released.
Subscriber
should observe events of the type they are interested in on a different thread (only one per type)
and on finishing with the handling of an event - they should notify the CDL of the Publisher.
Publisher
after all the events are finished, and the CDL is released, should save the results to db
and then go on to the next batch
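A minimal sketch of the batch handshake described above. Here the latch counts one unit per (event, subscriber) pair; the names and the thread-pool choice are illustrative, not the only way to wire it up:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchPublisher {
    static List<String> processBatch(List<String> batch, int subscribers) throws InterruptedException {
        // One count per (event, subscriber) pair: the latch clears only when
        // every subscriber has handled every event of the batch.
        CountDownLatch latch = new CountDownLatch(batch.size() * subscribers);
        List<String> results = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(subscribers);

        for (int s = 0; s < subscribers; s++) {
            final int id = s;
            pool.submit(() -> {
                for (String event : batch) {
                    results.add("sub" + id + ":" + event); // subscriber-specific handling
                    latch.countDown();                     // notify the publisher's latch
                }
            });
        }

        latch.await();  // publisher blocks until the whole batch is handled
        pool.shutdown();
        return results; // here the publisher would save to the db, then start the next batch
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processBatch(List.of("a", "b"), 3).size()); // prints 6
    }
}
```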
Now, all this is just a suggestion, one of multiple possible solutions, and a very high-level one at that, with many implementation details left unsaid.
You should keep your eyes open as you work, and not be afraid to change routes if it is better.
I am working on a design of a program that will need to fetch results from a datastore and post those results to another system. The data that I am fetching is referenced by a UUID, and has other documents linked to it by UUIDs. I will be posting a lot of documents (>100K documents), so I would like to do this concurrently. I am thinking about the following design:
Get the list of documents from the datastore. Each document would have:
docId (UUID)
docData (json doc)
type1 (UUID)
type1Data (json)
type2 (UUID)
type2Data (json)
list<UUID> type3Ids
list of type3 data (json)
The only data that I get from my first call are the docIds. I was thinking of pushing these documents into a queue and having a set of workers (fetchers) make the relevant calls back to the datastore to retrieve the data.
retrieve the docData from the datastore, fill in the type1, type2 and type3 UUIDs
do a batch get to retrieve all the type1, type2 and type3 docs
Push the results into another queue for posting to other system
The second set of workers (posters) would read each document from the second queue and post the results to the second system.
One question that I have: should I create one FixedThreadPool(size X) or two FixedThreadPools(size X/2)? Is there a danger of starvation if there are a lot of jobs in the first queue, such that the second queue would not get started until the first queue was empty?
The fetchers will be making network calls to talk to the database, so they seem like they would be more IO-bound than CPU-bound. The posters will also make network calls, but they are in the cloud in the same VPC as where my code would run, so they would be fairly close together.
Blocking Queue
This is a pretty normal pattern.
If you have two distinct jobs to do, use two distinct thread pools and make their size configurable so you can size them as needed / test different values on the deployment server.
It is common to use a blocking queue (BlockingQueue built into Java 5 and later) with a bounded size (say, 1000 elements for an arbitrary example).
The blocking queue is thread-safe, so everything in the first thread pool writes to it as fast as they can, everything in the second thread pool reads as fast as it can. If the queue is full, the write just blocks, and if the queue is empty, the read just blocks - nice and easy.
You can tune the thread numbers and repeatedly run to narrow down the best configured size for each pool.
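As a sketch of the two-pool arrangement with a bounded hand-off queue (pool sizes, queue capacity, and the end-of-stream marker are illustrative; real fetchers and posters would make the network calls):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoPoolPipeline {
    private static final String DONE = "__DONE__"; // end-of-stream marker, one per poster

    static int run(int docs, int fetchers, int posters) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000); // bounded: put blocks when full
        ExecutorService fetchPool = Executors.newFixedThreadPool(fetchers);
        ExecutorService postPool  = Executors.newFixedThreadPool(posters);
        AtomicInteger posted = new AtomicInteger();

        for (int i = 0; i < docs; i++) {
            final int id = i;
            fetchPool.submit(() -> {
                try {
                    queue.put("doc-" + id); // a real fetcher would call the datastore here
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        for (int i = 0; i < posters; i++) {
            postPool.submit(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc.equals(DONE)) return;  // no more documents coming
                        posted.incrementAndGet();      // a real poster would POST to the other system
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }

        fetchPool.shutdown();
        fetchPool.awaitTermination(1, TimeUnit.MINUTES);   // all fetchers finished
        for (int i = 0; i < posters; i++) queue.put(DONE); // signal each poster
        postPool.shutdown();
        postPool.awaitTermination(1, TimeUnit.MINUTES);
        return posted.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(50, 4, 2)); // prints 50
    }
}
```

Because each pool has its own threads, posters start draining the queue as soon as the first document arrives; there is no starvation of the second stage.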
There is a program which is implemented using the producer-consumer pattern. The producer fetches data from the db based on a list of queries and puts it in an ArrayBlockingQueue. The consumer prepares an excel report based on the data in the queue. To increase performance, I want to have a dynamic number of producers and consumers: for example, when the producer is slow, have more producers; when the consumer is slow, have more consumers. How can I have dynamic producers and consumers?
If you do this, you must first ask yourself a couple of questions:
How will you make sure that multiple parallel producers put items in the queue in the correct order? This might or might not be possible - it depends on the kind of problem you are dealing with.
How will you make sure that multiple parallel consumers don't "steal" each other's items from the queue? Again, this depends on your problem, in some cases this might be desirable and in others it's forbidden. You didn't provide enough information, but typically if you prepare data for report, you will need to have a single consumer and wait until the report data is complete.
Is this actually going to achieve any speedup? Did you actually measure that the bottleneck is I/O bound on the producer side, or are you just assuming? If the bottleneck is CPU-bound, you will not achieve anything.
So, assuming that you need complete data for report (i.e. single consumer, which needs the full data), and that your data can be "sharded" to independent subsets, and that the bottleneck is in fact what you think it is, you could do it like this:
As multiple producers will be producing different parts of results, they will not be sequential. So a list is not a good option; you would need a data structure where you would store interim results and care about which ranges have been completed and which ranges are still missing. Possibly, you could use one list per producer as a buffer and have a "merge" thread which will write to a single output list for consumer.
You need to split input data to several input pieces (one per producer)
You need to somehow track the ordering and ensure that the consumer takes out pieces in correct order
You can start consumer at the moment the first output piece comes out
You must stop the consumer when the last piece is produced.
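Under those assumptions, a sketch of the sequence-number approach (names and types are illustrative): each producer tags its output piece with a sequence number, and draining a priority queue after all producers finish yields the pieces in input order. A streaming variant could start the consumer as soon as the first piece arrives, at the cost of more bookkeeping:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

public class OrderedMerge {
    static class Piece {
        final int seq; final String data;
        Piece(int seq, String data) { this.seq = seq; this.data = data; }
    }

    static List<String> run(int pieces, int producers) throws InterruptedException {
        // Producers finish out of order; tagging each piece with a sequence
        // number lets a single consumer restore the original input order.
        PriorityBlockingQueue<Piece> out =
                new PriorityBlockingQueue<Piece>(11, Comparator.comparingInt(p -> p.seq));
        ExecutorService pool = Executors.newFixedThreadPool(producers);
        for (int i = 0; i < pieces; i++) {
            final int seq = i;
            pool.submit(() -> out.put(new Piece(seq, "result-" + seq))); // one shard's result
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES); // all producers done: safe to drain

        List<String> ordered = new ArrayList<>();
        Piece p;
        while ((p = out.poll()) != null) ordered.add(p.data); // drains in sequence order
        return ordered;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(5, 3)); // [result-0, result-1, result-2, result-3, result-4]
    }
}
```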
In short, this is a kind of problem for which you should probably think about using something like MapReduce
I have many threads performing different operations on objects, and when nearly 50% of the task is finished I want to serialize everything (maybe I want to shut down my machine).
When I come back, I want to start from the point where I had left off.
How can this be achieved?
This is like saving state of objects of any game while playing.
Normally we save the state of an object and retrieve it back. But here we are storing the process's count/state.
For example:
I have a thread which is creating a salary excel sheet for 50 thousand employees.
Another thread is creating appraisal letters for the same 50 thousand employees.
Another thread is writing "Happy New Year" e-mails to 50 thousand employees.
so imagine multiple operations.
Now I want to shut down when about 50% of the task is finished: say salary excel-sheets have been written for 25-30 thousand employees, appraisal letters are done for 25-30 thousand, and so on.
When I come back the next day, I want to start the process from where I had left off.
This is like resume.
I'm not sure if this might help, but you can achieve this if the threads communicate via in-memory queues.
To serialize the whole application, what you need to do is disable consumption of the queues, and when all the threads are idle you'll reach a "safe-point" where you can serialize the whole state. You'll need to keep track of all the threads you spawn, to know whether they are idle.
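One possible sketch of that pause-and-wait-for-idle handshake (the flag, latch, and worker loop are all illustrative; a real version would serialize the remaining queue contents and counters once every worker reports idle):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class SafePointDemo {
    static int runAndPause(int items, int workers) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < items; i++) queue.put(i);

        AtomicBoolean consuming = new AtomicBoolean(true); // flip to disable consumption
        CountDownLatch idle = new CountDownLatch(workers);
        AtomicInteger processed = new AtomicInteger();

        for (int w = 0; w < workers; w++) {
            new Thread(() -> {
                while (consuming.get()) {
                    Integer item = queue.poll();                   // non-blocking, so the flag is re-checked
                    if (item != null) processed.incrementAndGet(); // real work happens here
                }
                idle.countDown(); // this worker has reached the safe-point
            }).start();
        }

        while (processed.get() < items / 2) Thread.onSpinWait(); // wait until ~50% done
        consuming.set(false); // request the safe-point
        idle.await();         // every worker is idle: safe to serialize queue + counters
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        int done = runAndPause(100_000, 2);
        System.out.println(done >= 50_000); // prints true
    }
}
```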
You might be able to do this with another technology (maybe a java agent?) that freezes the JVM and allows you to dump the whole state, but I don't know if this exists.
Well, it's not much different from saving the state of an object.
Just maintain separate queues for the different kinds of inputs, and on every launch (first launch or relaunch) check those queues; if they are not empty, resume your stopped process by starting a new process with the remaining data.
Say, for example, an app is sending messages and you quit the app with 10 messages remaining. Have a global queue which the app's senderMethod checks on every launch; in this case it will find 10 messages in the pending queue and continue sending the remaining ones.
Edit:
Basically, for all resumable processes, say pr1, pr2, ..., prN, maintain queues of inputs, say q1, q2, ..., qN. A queue should remove processed elements so that it contains only the pending inputs. As soon as you suspend the system, store these queues, and on relaunching restore them. Have a common routine, say resumeOperation, which calls all the resumable processes (pr1, pr2, ..., prN). It will trigger the execution of the methods with non-empty queues, which in turn replicates the resuming behaviour.
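As a rough sketch of the pending-queue idea, assuming plain Java serialization and a single queue of string inputs (the file name and item type are illustrative):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayDeque;
import java.util.Deque;

public class ResumableQueue {
    // Save the pending (not yet processed) items so a relaunch can pick up where we left off.
    static void checkpoint(Deque<String> pending, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayDeque<>(pending)); // ArrayDeque is Serializable
        }
    }

    @SuppressWarnings("unchecked")
    static Deque<String> restore(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) return new ArrayDeque<>(); // first launch: nothing pending
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (Deque<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("pending", ".ser");
        Deque<String> work = new ArrayDeque<>(java.util.List.of("emp-3", "emp-4", "emp-5"));
        checkpoint(work, f);                 // "shut down" mid-batch
        Deque<String> resumed = restore(f);  // next-day relaunch
        System.out.println(resumed.size()); // prints 3
    }
}
```

On relaunch, resumeOperation would call restore for each queue and restart the corresponding process with whatever is left.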
Java provides the java.io.Serializable interface to indicate serialization support in classes.
You don't provide much information about the task, so it's difficult to give an answer.
One way to think about a task is in terms of a general algorithm which can split in several steps. Each of these steps in turn are tasks themselves, so you should see a pattern here.
By cutting down each algorithms in small pieces until you cannot divide further you get a pretty good idea of where your task can be interrupted and recovered later.
The result of a task can be:
a success: the task returns a value of the expected type
a failure: somehow, something didn't turn right while doing computation
an interrupted computation: the work wasn't finished, but it may be resumed later, and the return value is the state of the task
(Note that the latter case could be considered a subcase of a failure; it's up to you to organize your protocol as you see fit.)
Depending on how you generate the interruption event (will it be a message passed from the main thread to the worker threads? Will it be an exception?), that event will have to bubble within the task tree, and trigger each task to evaluate if its work can be resumed or not, and then provide a serialized version of itself to the larger task containing it.
I don't think serialization is the correct approach to this problem. What you want is persistent queues, which you remove an item from when you've processed it. Every time you start the program you just start processing the queue from the beginning. There are numerous ways of implementing a persistent queue, but a database comes to mind given the scale of your operations.
We have a JMS queue of job statuses, and two identical processes pulling from the queue to persist the statuses via JDBC. When a job status is pulled from the queue, the database is checked to see if there is already a row for the job. If so, the existing row is updated with new status. If not, a row is created for this initial status.
What we are seeing is that a small percentage of new jobs are being added to the database twice. We are pretty sure this is because the job's initial status is quickly followed by a status update - one process gets one, another process the other. Both processes check to see if the job is new, and since it has not been recorded yet, both create a record for it.
So, my question is, how would you go about preventing this in a vendor-neutral way? Can it be done without locking the entire table?
EDIT: For those saying the "architecture" is unsound - I agree, but am not at liberty to change it.
Create a unique constraint on JOB_ID, and retry to persist the status in the event of a constraint violation exception.
That being said, I think your architecture is unsound: If two processes are pulling messages from the queue, it is not guaranteed they will write them to the database in queue order: one consumer might be a bit slower, a packet might be dropped, ..., causing the other consumer to persist the later messages first, causing them to be overridden with the earlier state.
One way to guard against that is to include sequence numbers in the messages, update the row only if the sequence number is as expected, and delay the update otherwise (this is vulnerable to lost messages, though ...).
Of course, the easiest way would be to have only one consumer ...
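The constraint-plus-retry idea can be sketched without a real database by letting a map's putIfAbsent stand in for the unique constraint on JOB_ID; with JDBC you would issue the INSERT and catch SQLIntegrityConstraintViolationException instead, then fall back to the UPDATE:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusUpserter {
    // Stand-in for the JOB table; putIfAbsent plays the role of the unique
    // constraint on JOB_ID (a real version would use JDBC and catch
    // SQLIntegrityConstraintViolationException on the INSERT).
    private final Map<String, String> table = new ConcurrentHashMap<>();

    void persist(String jobId, String status) {
        // Try the INSERT first; on a duplicate-key "violation", retry as an UPDATE.
        String existing = table.putIfAbsent(jobId, status); // INSERT INTO job VALUES (?, ?)
        if (existing != null) {
            table.put(jobId, status); // UPDATE job SET status = ? WHERE job_id = ?
        }
    }

    String statusOf(String jobId) { return table.get(jobId); }

    public static void main(String[] args) {
        StatusUpserter db = new StatusUpserter();
        db.persist("job-1", "QUEUED");   // first consumer: insert succeeds
        db.persist("job-1", "RUNNING");  // second consumer: duplicate key, falls back to update
        System.out.println(db.statusOf("job-1")); // prints RUNNING
    }
}
```

Note this sketch does not address the ordering problem described above; for that you would still need the sequence-number check before applying the UPDATE.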
JDBC connections are not thread safe, so there's nothing to be done about that.
"...two identical processes pulling from the queue to persist the statuses via JDBC..."
I don't understand this at all. Why two identical processes? Wouldn't it be better to have a pool of message queue listeners, each of which would handle messages landing on the queue? Each listener would have its own thread; each one would be its own transaction. A Java EE app server allows you to configure the size of the message listener pool to match the load.
I think a design that duplicates a process like this is asking for trouble.
You could also change the isolation level on the JDBC connection. If you make it SERIALIZABLE you'll get the strongest isolation guarantees, at the price of slower performance.
Since it's an asynchronous process, performance will only be an issue if you find that the listeners can't keep up with the messages landing on the queue. If that's the case, you can try increasing the size of the listener pool until you have adequate capacity to process the incoming messages.