This question is an extension of another SO question of mine. Since that doesn't look possible, I am trying to execute chunks in parallel within parallel / partitioned slave steps.
An article says that simply specifying SimpleAsyncTaskExecutor as the task executor for a step will start executing its chunks in parallel.
@Bean
public Step masterLuceneIndexerStep() throws Exception {
    return stepBuilderFactory.get("masterLuceneIndexerStep")
            .partitioner(slaveLuceneIndexerStep())
            .partitioner("slaveLuceneIndexerStep", partitioner())
            .gridSize(Constants.PARTITIONER_GRID_SIZE)
            .taskExecutor(simpleAsyntaskExecutor)
            .build();
}
@Bean
public Step slaveLuceneIndexerStep() throws Exception {
    return stepBuilderFactory.get("slaveLuceneIndexerStep")
            .<IndexerInputVO, IndexerOutputVO>chunk(Constants.INDEXER_STEP_CHUNK_SIZE)
            .reader(luceneIndexReader(null))
            .processor(luceneIndexProcessor())
            .writer(luceneIndexWriter(null))
            .listener(luceneIndexerStepListener)
            .listener(lichunkListener)
            .throttleLimit(Constants.THROTTLE_LIMIT)
            .build();
}
If I specify .taskExecutor(simpleAsyntaskExecutor) for the slave step, the job fails. Having .taskExecutor(simpleAsyntaskExecutor) on the master step works OK, but then the chunks run serially while the partitioned steps run in parallel.
Is it possible to parallelize chunks of slaveLuceneIndexerStep()?
Basically, each chunk is writing Lucene indices to a single directory in sequential fashion and I want to further parallelize index writing process within each directory since Lucene IndexWriter is thread-safe.
I was able to launch parallel chunks from within a partitioned slave step as follows:
1. I first made sure my reader, processor and writer are thread-safe, so that those components can participate in parallel chunks without concurrency issues.
2. I kept the task executor for the master step as SimpleAsyncTaskExecutor, since the slave steps are long running and I want exactly N threads running at any point in time. I control N by setting the concurrencyLimit of that task executor.
3. I then set a ThreadPoolTaskExecutor as the task executor for the slave step. This pool is shared by all slave steps, so I set its core pool size to at least N so that each slave step gets at least one thread and starvation doesn't happen. You can increase the pool size as per system capacity; I used a thread pool because chunks are short-running tasks.
Using a thread pool also handles a case specific to my application: my partitioning is by client_id, so when the smaller clients are done the same threads automatically get reused by the bigger clients, and the asymmetry created by client_id partitioning (the amount of data to be processed per client varies a lot) gets evened out.
The master step's task executor simply starts all slave step threads and goes into WAITING state, while the slave step chunks get processed by the thread pool specified on the slave step.
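For reference, here is a simplified sketch of the two task executors described above; the bean names, thread-name prefixes and the reuse of Constants.PARTITIONER_GRID_SIZE as N are my own placeholders, not the exact code:

@Bean
public SimpleAsyncTaskExecutor simpleAsyntaskExecutor() {
    SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("master-");
    // start exactly N partition threads at a time (point 2 above)
    executor.setConcurrencyLimit(Constants.PARTITIONER_GRID_SIZE);
    return executor;
}

@Bean
public ThreadPoolTaskExecutor slaveChunkTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    // at least one thread per running slave step so no slave starves (point 3 above)
    executor.setCorePoolSize(Constants.PARTITIONER_GRID_SIZE);
    executor.setMaxPoolSize(Constants.PARTITIONER_GRID_SIZE * 2);
    executor.setThreadNamePrefix("slave-chunk-");
    return executor;
}

The slave step then takes .taskExecutor(slaveChunkTaskExecutor()) instead of the SimpleAsyncTaskExecutor.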
There are two important settings for controlling the concurrency level in the Java GCP PubSub consumer:
Parallel pull count
Number of executor threads
From the official example:
setParallelPullCount determines how many StreamingPull streams the subscriber will open to receive messages. It defaults to 1. setExecutorProvider configures an executor for the subscriber to process messages. Here, the subscriber is configured to open 2 streams for receiving messages, and each stream creates a new executor with 4 threads to help process the message callbacks. In total 2x4=8 threads are used for message processing.
So parallel pull count, if I'm not mistaken, directly refers to the number of Java executors (i.e. thread pools), and the number of executor threads sets the number of threads in each pool.
Normally I reason about separate thread pools as having different use cases or responsibilities, so we might for example have one unbounded cached thread pool for IO, a fixed thread pool for CPU-bound ops, a single (or low number) threaded pool for async IO notifications, and so on.
But what would be the benefit of having two or more thread pools with identical properties for consuming and processing pubsub messages, compared to simply having a single thread pool with maximum desired number of threads? For example, if I can spare a total of 8 threads on the subscriber, what would be the concrete reason for using 1x8 vs 2x4 combination? (a single pool of 8 threads, versus pull count=2 using 4 threads each)?
The setParallelPullCount option doesn't just refer to the number of Java Executors, it refers to the number of streams created that request messages from the server. The different streams could potentially return a different number of messages due to a variety of factors. One may want to increase parallel pull count in order to process more messages in a single client than can be transmitted on a single stream (10MB/s). This is independent of the choice of whether or not to share executors/thread pools.
Whether or not to share a thread pool across the streams would be handled by calling setExecutorProvider. If you set an ExecutorProvider that returns the same Executor on each call to getExecutor, then the streams share it. If you have it return a new Executor for each call, then they each have their own dedicated Executor. The default ExecutorProvider does the latter.
If one calls setParallelPullCount(X), then getExecutor gets called X times on the ExecutorProvider to get an Executor for each stream. The choice between a shared Executor across all of them or individual ones for each probably doesn't change much the vast majority of the time. If you are trying to keep the overall number of threads relatively low, then sharing a single Executor may help with that.
The choice between X Executors with Y threads and one Executor with X*Y threads really comes down to the ability to share such resources when the amount of data coming from each stream is vastly different, which probably isn't going to be the case most of the time. If it is, then a shared Executor means that a particularly saturated stream can "borrow" threads from an unsaturated one. On the other hand, using individual Executors means that in such a scenario, messages on the stream with fewer messages can still get through just as readily as messages on the saturated stream.
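To make the difference concrete, here is a rough sketch of the two configurations from the question (2x4 with per-stream executors versus 2 streams sharing one 8-thread pool); the project and subscription names and the receiver body are placeholders:

ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");
MessageReceiver receiver = (message, consumer) -> {
    // process the message, then ack
    consumer.ack();
};

// 2 streams x 4 threads each: the default-style InstantiatingExecutorProvider
// hands each stream its own new executor.
Subscriber twoByFour = Subscriber.newBuilder(subscription, receiver)
        .setParallelPullCount(2)
        .setExecutorProvider(
                InstantiatingExecutorProvider.newBuilder().setExecutorThreadCount(4).build())
        .build();

// 2 streams sharing a single 8-thread pool: FixedExecutorProvider returns the
// same executor on every getExecutor() call, so a saturated stream can "borrow"
// threads from an idle one.
ScheduledExecutorService shared = Executors.newScheduledThreadPool(8);
Subscriber sharedPool = Subscriber.newBuilder(subscription, receiver)
        .setParallelPullCount(2)
        .setExecutorProvider(FixedExecutorProvider.create(shared))
        .build();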
Context:
I am designing an application which will be consuming messages from various Amazon SQS queues. (More than 25 queues)
For this, I am thinking of creating a library to consume messages from the queues (call it MessageConsumer).
I want to dynamically allocate threads to receive/process messages from the different queues based on the traffic in each queue, to minimise waste of resources.
There are 2 ways I can go about it.
1) Have only one type of thread that polls the queues, receives messages and processes those messages, with one common thread pool for all queues.
2) Have separate polling and worker threads.
In the second case, I will have a common worker thread pool and a constant number of pollers per queue.
Edit:
To elaborate on the second case:
I am planning to have 1 continuously running thread per queue that checks that queue for the number of messages in it. Then some logic decides the number of polling threads required per queue, based on the number of messages in each queue and the priority of the queue.
I don't want polling threads running all the time because that may cause empty receives (sqs.receiveMessages()), so I will allocate the polling threads based on traffic.
The high-traffic queues will have more polling threads and hence more jobs being submitted to the worker thread pool.
Please point out any improvements or flaws in this design.
The recommended process is:
Workers poll the queue using long polling (which means the call will wait up to 20 seconds before returning an empty response)
They can request up to 10 messages per call to ReceiveMessage()
The worker processes the message(s)
The worker deletes the message from the queue
Repeat
If you wish to scale the number of workers, you can base this on the ApproximateNumberOfMessagesVisible metric in Amazon CloudWatch. If the number goes too high, add a worker. If it drops to zero (or below some threshold), remove a worker.
It is probably easiest to have each worker only poll one queue.
There is no need for "pollers". The workers do the polling themselves. This way, you can scale the workers independently, without needing some central "polling" service trying to manage it all. Simply launch a new Amazon EC2 instance, launch some workers and they start processing messages. When scaling in, just terminate the workers or even the instance -- again, no need to register/deregister workers with a central "polling" service.
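A minimal sketch of that worker loop with the AWS SDK for Java v2; the queue URL and the process() call are placeholders:

SqsClient sqs = SqsClient.create();
String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";

while (true) {
    // long polling: wait up to 20 seconds, receive up to 10 messages per call
    ReceiveMessageResponse response = sqs.receiveMessage(ReceiveMessageRequest.builder()
            .queueUrl(queueUrl)
            .maxNumberOfMessages(10)
            .waitTimeSeconds(20)
            .build());

    for (Message message : response.messages()) {
        process(message.body()); // placeholder for the actual work

        // delete only after the message has been processed successfully
        sqs.deleteMessage(DeleteMessageRequest.builder()
                .queueUrl(queueUrl)
                .receiptHandle(message.receiptHandle())
                .build());
    }
}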
I am using the GAE task queue to update bulk data in the Datastore. The number of records is around 1-2M. To do this I scheduled a cron job and configured a queue like this:
<queue>
    <name>queueName</name>
    <rate>20/s</rate>
    <bucket-size>300</bucket-size>
    <retry-parameters>
        <task-retry-limit>1</task-retry-limit>
    </retry-parameters>
    <max-concurrent-requests>800</max-concurrent-requests>
</queue>
Each task does the following (a rough sketch is included after the list):
Fetch 1500 records from the Datastore using a cursor.
If a next cursor exists, create a new task and push it onto the queue.
Process the 1500 fetched records, i.e. update all 1500 records back in the Datastore.
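Roughly, each task handler does this (simplified; the entity kind, worker URL and cursor handling are placeholders for my actual code):

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
FetchOptions options = FetchOptions.Builder.withLimit(1500);
if (cursorString != null) {
    options.startCursor(Cursor.fromWebSafeString(cursorString));
}

// 1. fetch 1500 records using the cursor
QueryResultList<Entity> batch =
        datastore.prepare(new Query("MyKind")).asQueryResultList(options);

// 2. if there could be a next page, enqueue the next task before processing
if (batch.size() == 1500) {
    Queue queue = QueueFactory.getQueue("queueName");
    queue.add(TaskOptions.Builder.withUrl("/tasks/update")
            .param("cursor", batch.getCursor().toWebSafeString()));
}

// 3. process the fetched records and write them back
for (Entity entity : batch) {
    // ... update entity properties ...
}
datastore.put(batch);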
The expected number of tasks is around 667, but I can only see 40 tasks in the logs.
In the logs, I can see the 40 tasks were added to the queue within 40 seconds. I am not getting any error in the logs.
Can anybody help me understand what is happening? Why am I not able to add all the tasks?
Thanks
In your approach the task enqueueing appears to be very tightly coupled with the task request processing, in the sense that the request for one such task in the queue needs to be processed in order to enqueue the next task. So you need to take a look at the task processing rate limits you may be hitting. The ones from your queue configuration are pretty generous, but there are others.
If you configured your app as threadsafe and your app design takes advantage of it, an instance of your app will be able to handle multiple requests concurrently, up to a maximum that depends on its max-concurrent-requests config and its processing latency. Without the threadsafe config that maximum is 1.
Once an instance hits the max number of task requests it can process concurrently, it won't start processing new tasks from the queue (so it won't reach the step that enqueues a new task) until it completes at least one of the tasks already in progress. The task enqueueing rate per app instance is thus effectively limited - each running instance can contribute to the overall number of tasks in the queue only a number equal to the max number of tasks it can process in parallel.
But your app is configured for automatic scaling, so once you manage to quickly "fill up" all your running instances, the scheduler will start new instances. As new instances start, they will be able to process more of the tasks in the queue and thus also enqueue new tasks, each contributing the above-mentioned amount to the total number of tasks in the queue.
But this growth in the number of enqueued tasks will be much slower than before the instances hit their max processing rate - it takes some time to measure how new instances help with the traffic and to determine whether more instances are needed. The overall growth in the number of tasks in the queue will have a "staircase" profile, with the height of a step being the max number of concurrent requests an instance can handle and the number of steps being the number of new instances started + 1.
Since you aren't seeing any actual task enqueueing errors, I can only suspect that you're hitting a rate limit in processing your enqueued tasks, or that processing somehow stops completely. There can be many reasons for it, including, for example:
hitting your app's daily budget (most likely due to the number of instance-hours)
hitting automatic scaling limits
You'd have to investigate your app from this perspective to pinpoint the culprit.
Side note: I assume this is on GAE, not on the development server (which doesn't respect the task queue configs and most likely can't get even close to GAE's parallel processing capability).
Use case:
I have a file with ids in it (approx 500k)
My application reads this file and processes the ids (the processing for each id is heavy), so overall it takes a lot of time and memory.
What we need to achieve is to scale out by increasing the number of processes (running the Java processes on separate boxes/machines) and dividing the entire list of ids into fixed-size batches, such that, say, 5 processes start processing items from the file and each picks up the next batch whenever it finishes its current one.
e.g. if the total number of items in the file is 100, my batch size is 5 and there are 3 processes in total, then the processing should look like
Process 1: 1-5
Process 2: 6-10
Process 3: 11-15
so that if Process 2 finishes before the other processes, it starts processing 16-20 and notifies the others, so that the next available process picks up items 21-25.
Kindly note that due to memory constraints we cannot do this using multiple threads on a single process/host.
Can someone please suggest solutions/references for how this can be achieved?
It sounds like you have a distributed computing problem. You have a set of "things to process", and want to do that processing across multiple machines. The simplest and most typical way to do that is to put those "things to process" into a distributed queue like Amazon SQS or RabbitMQ (a file won't work).
Have one process (and only one) be responsible for transferring the file to the distributed queue. If you can avoid the file entirely (and have whatever is writing to the file just write to the queue), do this instead.
Set up multiple hosts (consider Amazon EC2) to read from that queue and do the processing.
Make sure to delete the item from the queue after processing is complete (and set reasonable visibility timeouts based on how long processing should take) to avoid another worker host picking up the item when it shouldn't.
If you want, you can pull from the queue one at a time, or in batches. I suggest setting up a thread pool on each host to perform the poll/work/delete loop, so the amount of concurrency per host can be easily tweaked just by changing the thread pool size.
By using a distributed queue like this, items taken by one host will not be seen by other hosts (thus avoiding double-processing).
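On each host, the poll/work/delete loop can then be wrapped in a fixed-size thread pool, so per-host concurrency is just the pool size; QueueWorker here is a hypothetical Runnable implementing that loop:

int concurrencyPerHost = 5; // tune this per host capacity
ExecutorService pool = Executors.newFixedThreadPool(concurrencyPerHost);
for (int i = 0; i < concurrencyPerHost; i++) {
    // each QueueWorker loops: receive a batch -> process -> delete from the queue
    pool.submit(new QueueWorker(queueUrl));
}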
We have an application that needs to
nightly reprocess large amounts of data, and
reprocess large amounts of data on demand.
In both of these cases, around 10,000 quartz jobs get spawned and then run. In the nightly case, we have one quartz cron job that spawns the 10,000 jobs, each of which individually does the work of processing the data.
The issue we have is that we are running with around 30 threads, so naturally the quartz jobs misfire, and they continue to misfire until everything is processed. The processing can take up to 6 hours. Each of these 10,000 jobs pertains to a specific domain object, can be processed in parallel, and is completely independent. Each of the 10,000 jobs can take a variable amount of time (from half a second to a minute).
My question is:
Is there a better way to do this?
If not, what is the best way for us to schedule/setup our quartz jobs so that a minimal amount of time is spent thrashing and dealing with misfires?
A note about our architecture: we are running two clusters with three nodes apiece. The version of quartz is a bit old (2.0.1), and clustering is enabled in the quartz.properties file.
In both of these cases, around 10,000 quartz jobs get spawned
No need to spawn new quartz jobs. Quartz is a scheduler - not a task manager.
In the nightly reprocess - you need only that one quartz cron job to invoke some service responsible for managing and running the 10,000 tasks. In the "on demand" scenario, quartz shouldn't be involved at all. Just invoke that service directly.
How does the service manage 10,000 tasks?
Typically, when only one JVM is available, you'd just use some ExecutorService. Here, since you have 6 nodes at your disposal, you can easily use Hazelcast. Hazelcast is a Java library that lets you cluster your nodes and share resources efficiently between them. Hazelcast has a straightforward solution for distributing your ExecutorService, called the Distributed Executor Service. It's as easy as creating a Hazelcast IExecutorService and submitting tasks to members. Here's an example based on the documentation for invoking a task on a single member:
HazelcastInstance hz = Hazelcast.newHazelcastInstance();
IExecutorService executorService = hz.getExecutorService("default");
// pick the cluster member that should run the task
Member member = hz.getCluster().getMembers().iterator().next();
Callable<String> task = new Echo(input); // Echo is just some Callable
Future<String> future = executorService.submitToMember(task, member);
String echoResult = future.get();
I would do this by making use of a queue (RabbitMQ/ActiveMQ). The cron job (or whatever your on-demand trigger is) populates the queue with messages representing the 10,000 work instructions (i.e. the instruction to reprocess the data for a given domain object).
On each of your nodes you have a pool of executors which pull from the queue and carry out the work instructions. This way each executor is kept as busy as possible while there are still work items on the queue, so the overall processing is accomplished as quickly as possible.
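For the producer side, a rough sketch using the RabbitMQ Java client; the queue name and the source of the domain object ids are placeholders:

ConnectionFactory factory = new ConnectionFactory();
factory.setHost("rabbitmq-host");
try (Connection connection = factory.newConnection();
     Channel channel = connection.createChannel()) {
    // durable queue so the 10,000 work instructions survive a broker restart
    channel.queueDeclare("reprocess-tasks", true, false, false, null);
    for (String domainObjectId : idsToReprocess) {
        channel.basicPublish("", "reprocess-tasks",
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                domainObjectId.getBytes(StandardCharsets.UTF_8));
    }
}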
The best way is to use a cluster of Quartz instances. This will share the jobs between the cluster nodes:
http://quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering
I would use a scheduled quartz job to initiate the 10k tasks, but have it do so by appending the task details to a JMS queue (10k messages). That queue is monitored by a message-driven bean (Java EE EJB MDB). The MDB can run simultaneously on multiple nodes in your cluster, and each node can run multiple instances... don't reinvent the wheel for distributing the task load: let Java EE do it.
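As a sketch of what such a consumer could look like (the destination name is a placeholder, and the maxSession activation property for per-node concurrency is container-specific):

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "jms/reprocessQueue"),
    // how many consumers run in parallel per node; the property name varies by container
    @ActivationConfigProperty(propertyName = "maxSession", propertyValue = "10")
})
public class ReprocessTaskMdb implements MessageListener {
    @Override
    public void onMessage(Message message) {
        try {
            String domainObjectId = ((TextMessage) message).getText();
            // reprocess the data for this domain object
        } catch (JMSException e) {
            throw new RuntimeException(e); // let the container handle redelivery
        }
    }
}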