I am working on a design of a program that will need to fetch results from a datastore and post those results to another system. The data that I am fetching is referenced by a UUID, and has other documents linked to it by UUIDs. I will be posting a lot of documents (>100K documents), so I would like to do this concurrently. I am thinking about the following design:
Get the list of documents from the datastore. Each document would have:
docId (UUID)
docData (json doc)
type1 (UUID)
type1Data (json)
type2 (UUID)
type2Data (json)
list<UUID> type3Ids
type3Data (list of json docs)
The only data that I get from my first call are the docIds. I was thinking of pushing these documents into a queue and having a set of workers (fetchers) make the relevant calls back to the datastore to retrieve the data.
retrieve the docData from the datastore, fill in the type1, type2 and type3 UUIDs
do a batch get to retrieve all the type1, type2 and type3 docs
Push the results into another queue for posting to other system
The second set of workers (posters) would read each document from the second queue and post the results to the second system.
One question that I have: should I create one FixedThreadPool(size X) or two FixedThreadPool(size X/2)? Is there a danger of starvation if there are a lot of jobs in the first queue, such that the second queue would not get started until the first queue was empty?
The fetchers will be making network calls to talk to the database, so they seem like they would be more IO bound than CPU bound. The posters will also make network calls, but they are in the cloud in the same VPC as where my code would run, so they would be fairly close together.
Blocking Queue
This is a pretty normal pattern.
If you have two distinct jobs to do, use two distinct thread pools and make their size configurable so you can size them as needed / test different values on the deployment server.
It is common to use a blocking queue (BlockingQueue built into Java 5 and later) with a bounded size (say, 1000 elements for an arbitrary example).
The blocking queue is thread-safe, so the threads in the first pool write to it as fast as they can and the threads in the second pool read from it as fast as they can. If the queue is full, the write just blocks, and if the queue is empty, the read just blocks - nice and easy.
You can tune the thread numbers and repeatedly run to narrow down the best configured size for each pool.
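A minimal sketch of the two-pool pipeline described above - the pool sizes, queue bound, and document count are arbitrary, and the put/take bodies stand in for the real datastore fetch and HTTP post:

```java
import java.util.concurrent.*;

public class FetchPostPipeline {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: if full, fetchers block; if empty, posters block.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        ExecutorService fetchers = Executors.newFixedThreadPool(4);
        ExecutorService posters = Executors.newFixedThreadPool(4);
        ConcurrentLinkedQueue<String> posted = new ConcurrentLinkedQueue<>();

        for (int i = 0; i < 10; i++) {
            final int id = i;
            fetchers.execute(() -> {
                try {
                    queue.put("doc-" + id); // stands in for the datastore fetch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (int i = 0; i < 10; i++) {
            posters.execute(() -> {
                try {
                    posted.add(queue.take()); // stands in for the HTTP post
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        fetchers.shutdown();
        posters.shutdown();
        posters.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(posted.size()); // 10
    }
}
```

Because the two pools are separate, you can tune fetcher and poster counts independently without one starving the other.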
Related
AWS newbie here.
I have a DynamoDB table and 2+ nodes of Java apps reading/writing from/to it. My use case is as follows: the app should fetch N items every X seconds based on a timestamp, process them, then remove them from the DB. Because the app may scale, other nodes might be reading from the DB at the same time, and I want to avoid processing the same items multiple times.
The question is: is there any way to implement something like a poll() method that fetches the item and immediately removes it (an atomic operation), as if the table were a queue? As far as I checked, the delete-item methods that DynamoDBMapper offers do not return the removed item's data.
Consistency is a weak spot of DDB, but that's the price to pay for its scalability.
You said it yourself, you're looking for a queue, so why not use one?
I suggest:
Create a lambda that:
Reads the items
Publishes them to an SQS FIFO queue with message deduplication
Deletes the items from the DB
Create an EventBridge schedule to run the Lambda every n minutes
Have your nodes poll that queue instead of DDB
For this to work you have to consider a few things regarding timings:
DDB will typically be consistent in under a second, but this isn't guaranteed.
SQS deduplication only works for 5 minutes.
EventBridge only supports minute level granularity, not seconds.
So you can run your Lambda as frequently as once a minute, but you can run your nodes as frequently (or infrequently) as you like.
If you run your Lambda less frequently than every 5 minutes, there is technically a chance of processing an item twice, but this is very unlikely to ever happen (it could technically also happen if DDB took more than 10 minutes to become consistent, but again, extremely unlikely).
My understanding is that you want to read and delete an item in an atomic manner; however, as we know, that is not possible with DynamoDB.
What is possible, however, is deleting the item and having its value returned, which is more of a delete-then-read. As you correctly pointed out, the Mapper client does not support ReturnValues, but the low-level clients do.
// Delete the item and get its old attribute values back in the same call
Key keyToDelete = new Key().withHashKeyElement(new AttributeValue("214141"));
DeleteItemRequest dir = new DeleteItemRequest()
        .withTableName("ABC")
        .withKey(keyToDelete)
        .withReturnValues("ALL_OLD");
DeleteItemResult result = client.deleteItem(dir);
Map<String, AttributeValue> deletedItem = result.getAttributes();
More info here: DeleteItemRequest
I am limited to a 1-core machine on AWS, but after measuring the time to complete all of my HTTP requests and check their results, two of them together take as long to fetch their data as the remaining fifty requests combined (roughly 2 minutes).
I don't want to bloat my code more than I have to, but I know parallelism and asynchrony can seriously cut down the execution time for this task. I want to launch the two big requests on their own threads so they can go out while the others are running, but I store the results of these http requests in a list currently.
Can you access different (guaranteed) elements of a list at the same time as long as the data is initialized beforehand? I've seen the concurrent list and parallel list, but the one isn't parallel, and the other reallocates the entire list on every modification, so neither is a particularly sane option.
What can I do in this situation?
There is no such thing as a concurrent list in Java. I'm assuming that you are referring to a concurrent hash set (using newSetFromMap) and your "parallel list" refers to a CopyOnWriteArrayList.
You most definitely can use the former option to store update data.
A better way to solve your problem of updating data asynchronously is simply to use a non-thread-safe collection in each worker thread and then, when you're done, push the results all at once into a thread-safe collection that aggregates all your requests.
So something like:
Set<Response> aggregate = Collections.newSetFromMap(...); // thread-safe aggregate
executor.execute(...); // submit the workers
...
// Inside each worker: collect locally, then publish once at the end
Set<Response> local = new HashSet<>();
populate(local);
aggregate.addAll(local); // single thread-safe bulk add
You might want to use various synchronizers if you need your response data ordered in a specific way, such as keeping all the responses from Request 1 together. If you only need to move one request from each worker, use a thread-safe transfer or a singleton collection.
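The collect-locally-then-bulk-add pattern above can be sketched end to end like this; the worker count and the string "responses" are made up for illustration:

```java
import java.util.*;
import java.util.concurrent.*;

public class AggregateDemo {
    public static void main(String[] args) throws InterruptedException {
        // Thread-safe set the workers publish into once they finish
        Set<String> aggregate = Collections.newSetFromMap(new ConcurrentHashMap<>());
        ExecutorService executor = Executors.newFixedThreadPool(4);

        for (int worker = 0; worker < 4; worker++) {
            final int id = worker;
            executor.execute(() -> {
                // Non-thread-safe local collection: no contention while working
                Set<String> local = new HashSet<>();
                for (int i = 0; i < 10; i++) {
                    local.add("worker-" + id + "-response-" + i); // stands in for an HTTP response
                }
                aggregate.addAll(local); // one thread-safe publish at the end
            });
        }
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(aggregate.size()); // 40
    }
}
```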
In my app I receive user data, put it into an ArrayBlockingQueue, and then write it to a database. Several threads are used for 'getting the data from the queue and putting it into the database'. Then an issue came up.
The database stores each user's current status, so the time sequence of the data is very important. But when using multiple threads to 'get and put', the order cannot be ensured.
So I came up with an idea, something like 'field grouping': for different users' data, multiple threads are fine and the order between users can be ignored; but each user's data must be handled by the same thread.
Now the question is, how can I do that?
Is the number of users limited? Then you can simply cache a worker per user.
// worker cache: one single-threaded executor per user key, so each user's
// data is always processed by the same thread (a plain Thread cannot be
// start()ed more than once, so an executor is used instead)
Map<String, ExecutorService> workerCache = new HashMap<String, ExecutorService>();
workerCache.put("primary_key", Executors.newSingleThreadExecutor());
// when processing the data
ExecutorService torun = workerCache.get(queue.peek());
torun.execute(task);
else
A Java thread has a name (Thread.setName()/getName()). You can use the name to identify a thread, but reuse is something you have to handle according to your business logic.
Try using PriorityBlockingQueue<E>, where <E> is comparable. Implement the logic such that each user's data is individually sorted by the required attributes. Also, use thread pools instead of managing threads discretely.
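One way to implement the 'same user, same thread' grouping described above is to hash each user key onto a fixed set of single-threaded executors; the shard count and user keys here are arbitrary placeholders:

```java
import java.util.*;
import java.util.concurrent.*;

public class UserAffinityDemo {
    public static void main(String[] args) throws InterruptedException {
        int shards = 3;
        // One single-threaded executor per shard: tasks for the same user
        // always land on the same shard, so per-user order is preserved.
        ExecutorService[] pool = new ExecutorService[shards];
        for (int i = 0; i < shards; i++) pool[i] = Executors.newSingleThreadExecutor();

        Map<String, List<Integer>> seen = new ConcurrentHashMap<>();
        String[] users = {"alice", "bob", "carol"};
        for (int seq = 0; seq < 5; seq++) {
            for (String user : users) {
                final int s = seq;
                int shard = Math.floorMod(user.hashCode(), shards);
                pool[shard].execute(() ->
                        seen.computeIfAbsent(user, k -> Collections.synchronizedList(new ArrayList<>())).add(s));
            }
        }
        for (ExecutorService e : pool) e.shutdown();
        for (ExecutorService e : pool) e.awaitTermination(10, TimeUnit.SECONDS);
        // Each user's updates were applied in submission order
        System.out.println(seen.get("alice")); // [0, 1, 2, 3, 4]
    }
}
```

Users on the same shard can delay each other, but no user's updates are ever reordered.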
I have a MySQL database with a large number of rows.
I want to initialize multiple Threads (each with its own database connection) in Java and read/print the data simultaneously.
How to partition data between multiple threads so as no two Threads read the same record? What strategies can be used?
It depends on what kind of work your threads are going to do. For example, I usually execute a single SELECT for some kind of large dataset, add tasks to a thread-safe task queue, and submit workers which pick up tasks from the queue to process. I usually write to the DB without synchronisation, but that depends on the size of the unit of work and on DB constraints (like unique keys etc.). Works like a charm.
Another method would be to simply run multiple threads and let them work on their own. I strongly advise against using fancy LIMIT/OFFSET tricks, however: they still require the DB to fetch more data rows than the query will actually return.
EDIT:
As you added a comment saying that you have the same data, then yes, my solution is what you are looking for:
Get dataset by single query
Add data to queue
Launch your threads (via executors or new threads)
Pick data from queue and process it.
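The steps above can be sketched like this; the row count and worker count are arbitrary, and the integers stand in for real result-set rows:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class QueuePartitionDemo {
    public static void main(String[] args) throws InterruptedException {
        // Rows from the single SELECT go into one shared thread-safe queue
        BlockingQueue<Integer> rows = new LinkedBlockingQueue<>();
        for (int id = 1; id <= 100; id++) rows.add(id); // stands in for the result set

        AtomicInteger processed = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            workers.execute(() -> {
                Integer row;
                // poll() returns null once the queue is drained, so workers stop cleanly
                while ((row = rows.poll()) != null) {
                    processed.incrementAndGet(); // each row is handed to exactly one worker
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(processed.get()); // 100
    }
}
```

Because poll() on the shared queue is atomic, no two workers ever receive the same row.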
If the large dataset has an integer primary key, one approach would be as follows:
Get the count of rows using the same SELECT query.
Divide the entire dataset into an equal number of partitions.
Assign each partition to a thread. Each thread runs its own SELECT query with a primary-key value range as the constraint.
Note the following issues with this approach:
You fire (number of threads + 1) queries at the database, so performance might be a problem.
The partitions may not be equal in size (as some ids will have been deleted).
Still, this approach is simple and guarantees that each row is processed by exactly one thread.
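A sketch of the range computation; the id bounds are placeholders (in practice you would take them from something like SELECT MIN(id), MAX(id)):

```java
import java.util.*;

public class RangePartition {
    // Split [minId, maxId] into n contiguous, non-overlapping ranges
    static List<long[]> partition(long minId, long maxId, int n) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long base = total / n, rem = total % n;
        long start = minId;
        for (int i = 0; i < n; i++) {
            long size = base + (i < rem ? 1 : 0); // spread the remainder over the first ranges
            ranges.add(new long[]{start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : partition(1, 100, 3)) {
            // each thread would run: SELECT ... WHERE id BETWEEN r[0] AND r[1]
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```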
You can use a singleton class to keep track of the rows already read, so every thread can get its next row number from that singleton.
Alternatively, you can use a static AtomicInteger in a common class. Each thread calls its getAndIncrement method to claim the next row, which partitions the data between the threads.
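A minimal sketch of the AtomicInteger variant; the row and thread counts are arbitrary:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicClaimDemo {
    // Shared counter: each getAndIncrement() hands out a unique row index
    static final AtomicInteger NEXT_ROW = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        int totalRows = 50;
        Set<Integer> claimed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> {
                int row;
                while ((row = NEXT_ROW.getAndIncrement()) < totalRows) {
                    claimed.add(row); // no two threads ever see the same row index
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(claimed.size()); // 50
    }
}
```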
I have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavyweight, hence the queue to limit the number of objects in memory.
Multiple threads take objects from the queue, perform the work and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, e.g. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing references to the threads (callables) to each other and checking thread.isDone() before a poll or offer, then checking whether the returned element is null. I also check the size of the queue; as long as there are elements in it, they must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One way to accomplish this is to put a "dummy" or "poison" message on the queue as the last message, once you are sure that no more tasks are going to arrive - for example after putting the message for the last row of the DB query. So the producer puts a dummy message on the queue, and a consumer receiving this dummy message knows that no more meaningful work is expected in this batch.
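A sketch of the poison-pill shutdown with several consumers; the sentinel value and the counts are arbitrary, and note the pill is re-inserted so every consumer eventually sees it:

```java
import java.util.concurrent.*;

public class PoisonPillDemo {
    private static final String POISON = "__POISON__"; // sentinel ending the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        int consumers = 3;
        CountDownLatch done = new CountDownLatch(consumers);

        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String msg = queue.take(); // safe: the pill guarantees wakeup
                        if (POISON.equals(msg)) {
                            queue.put(POISON); // re-insert so the other consumers see it too
                            break;
                        }
                        // process msg...
                    }
                    done.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (int row = 0; row < 20; row++) queue.put("row-" + row);
        queue.put(POISON); // no more rows: signal shutdown
        boolean allStopped = done.await(10, TimeUnit.SECONDS);
        pool.shutdown();
        System.out.println(allStopped); // true
    }
}
```

With the pill in place, plain blocking take/put can be used without deadlock, since every consumer is guaranteed to be woken up once the producer is finished.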
Maybe you should take a look at CompletionService
It is designed to combine an executor and queue functionality in one.
Tasks which have completed execution are available from the completion service via completionServiceInstance.take().
You can then use another executor for step 3, i.e. filling the DB with the results, feeding it with the results taken from the completionServiceInstance.
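A minimal sketch of that setup; the squaring stands in for the real calculation, and the DB write is just a comment:

```java
import java.util.concurrent.*;

public class CompletionServiceDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService calculators = Executors.newFixedThreadPool(4);
        // Wraps the executor plus an internal queue of finished results
        CompletionService<Integer> completion = new ExecutorCompletionService<>(calculators);

        int jobs = 10;
        for (int i = 0; i < jobs; i++) {
            final int input = i;
            completion.submit(() -> input * input); // the "calculation" step
        }

        int sum = 0;
        for (int i = 0; i < jobs; i++) {
            // take() blocks until the next finished task, in completion order
            Future<Integer> f = completion.take();
            sum += f.get(); // here you would hand the result to the DB-writer
        }
        calculators.shutdown();
        System.out.println(sum); // 285
    }
}
```

Because the loop runs exactly `jobs` times, the consumer knows when it is done without any poison message or isDone() checks.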