I'm looking into Java Batch, and found a couple of good examples. But, I'm not sure whether the independent jobs will run concurrently/in-parallel or sequentially. If they will run in-parallel, or concurrently, then how to set the limit on concurrency? Further, is there any queue maintained internally -- I know about JobRepository, but in which fashion jobs will be picked?
Note: This is not about defining flows with split to run concurrently on multiple threads, nor is it about partitioning the input in order to process a range of data in parallel.
I have a bunch of independent pieces of work that I need processes to perform. These pieces of work can be performed in any order, and they last long enough that processes sometimes fail when work is being performed.
I need to coordinate the assignment of these pieces of work, and Curator's DistributedQueue seems like it is almost what I want. I don't need the ordering it provides, though, so I am curious what level of overhead I am paying for that assuming I decline to have a single consumer (ie each process just consumes from the queue).
My main concern is how the lockPath() function on the queue builder actually works. I need the functionality it provides, because it is possible for processes to fail and I need to not be dropping the jobs they were supposed to be doing. But what I don't want is for only one process to be able to do any work at a time. If I use lockPath(), will the queue block for other processes while a process is consuming a message?
Also, if the queue seems like an unreasonable approach, is there another tool available to achieve what I want, or would I have to roll my own? I want to stay within the Curator / ZK environment but am open to alternatives within that.
(Note: I'm the main author of Apache Curator)
The documentation needs to be improved. The lock is used to make the queue entry retry-able, i.e. the entry in the queue is not removed until the consumer finishes. The lock assures that only one process is acting on the entry at a time; it does not block consumers from taking other entries. If you don't care about dropping queue entries on failure, you don't need to use the lock. With or without the lock, though, each consumer that you run processes queue entries. So, if you want concurrent processing of the queue, run multiple consumers (in the same JVM or in separate JVMs, it doesn't matter).
Here's a workflow engine I wrote that uses the Curator queue to do distributed work. Feel free to use it as it is open source: http://nirmataoss.github.io/workflow/
I am trying to process records in processor step using multiple processor classes. These classes can work in parallel. Currently I have written a multi threaded step where I
Set input and output row for a processor class
Submit it to Executor service
Get all future objects and collect final output
Now, as soon as I make my job parallel by adding a taskExecutor, I get issues: the input objects set in step 1 get overwritten in step 2, and the processors are called with the overwritten values. I tried to find out whether I can write a composite processor that delegates to multiple steps in parallel, but they work only in a sequential manner.
Any inputs would be greatly helpful. Thanks !
Welcome to concurrency. You can get yourself into a lot of trouble when you leave the path that keeps you in a safe, deterministic world. You can get rid of all your issues if you use pure functions: your functions should not have any side effects, and all your variables should be final. You'll find that you won't have any concurrency issues if you stick to this. In general, stay away from the low-level threading primitives that ship with Java. You should treat thread pools, executors, etc. as a resource. It is probably worth doing a bit of reading about concurrency, locks, and volatile variables, and about why these lower-level constructs are hard to use, and then looking at higher-order constructs such as promises, futures and actors.
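A minimal sketch of that advice, assuming the overwritten inputs came from setting them on a shared processor instance before submitting it. The names `Row`, `enrich` and `processAll` are made up for illustration and are not part of Spring Batch or of the asker's code. Each task captures its own immutable input, so nothing can be overwritten between submission and execution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class StatelessProcessing {

    // Immutable input: all fields are final, so no thread can overwrite them.
    static final class Row {
        final int id;
        final String payload;

        Row(int id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Pure function: the result depends only on the argument; no fields are touched.
    static String enrich(Row row) {
        return row.id + ":" + row.payload.toUpperCase();
    }

    // Each task captures its own row instead of reading it from a shared setter.
    static List<String> processAll(List<Row> rows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Row row : rows) {
                futures.add(pool.submit(() -> enrich(row)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // collected in submission order
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(1, "a"), new Row(2, "b"), new Row(3, "c"));
        System.out.println(processAll(rows, 4)); // prints [1:A, 2:B, 3:C]
    }
}
```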
In our current Java project, we need to batch process a huge set of records. Once, this processing is done, it must start again and process all records again. This processing must be parallelized as well as distributed among multiple nodes.
The records themselves are stored in a database. Using some id range (e.g. 1-10000) for identifying a batch would be sufficient.
From a high level perspective, I see the following steps:
A sub task processes one batch of records.
A master task checks if any sub task is still running. If not, create one sub task for each batch of records.
We use MongoDB quite heavily and thought of persisting sub tasks in it. Then, each node can pick up sub tasks that are not done yet, do the processing, and mark them as done. Once there are no undone sub tasks, the master task creates all the sub tasks again. This would probably work, but we are looking for a solution in which we don't need to do the heavy synchronization work ourselves.
Could this be a possible use-case for akka?
Can akka-persistence be used to synchronize the processing among different nodes?
Are there any other Java/JVM frameworks suited for this job?
Your question is way too broad for SO's format. Please read this guide in the future before asking, and don't ask your group members to vote your question up just to inflate what is obviously an ill-posed question ( ͡° ͜ʖ ͡°).
Anyways:
1) Yes, you can implement your requirements in Akka. In particular, since you mentioned multiple nodes, you are looking at the akka-cluster module (for inter-node communication), and you might also need akka-cluster-sharding (in case you want to keep all data in memory, not only during processing).
2) No, I would strongly recommend against that. While you could technically force your problem into using akka-persistence for synchronizing the tasks, the goal of akka-persistence is simply to make an actor's state persistent. Akka itself in its basic form is enough for handling all your synchronization issues. Simply have a master actor create a worker for every subtask and monitor its completion.
3) Yes. Note that the answer to this question is always yes no matter which job.
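To make the master/worker shape from point 2 concrete, here is a rough single-JVM sketch. It deliberately uses plain `java.util.concurrent` as a stand-in, not Akka, so it says nothing about distributing work across nodes. The class and method names are invented for illustration, and the per-record work is a placeholder counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

final class MasterWorkerSketch {

    // Splits the inclusive id range [first, last] into batches of at most batchSize ids.
    static List<long[]> splitIntoBatches(long first, long last, long batchSize) {
        List<long[]> batches = new ArrayList<>();
        for (long start = first; start <= last; start += batchSize) {
            batches.add(new long[] { start, Math.min(start + batchSize - 1, last) });
        }
        return batches;
    }

    // One "master" pass: create a sub task per batch, then wait until all are done.
    // Returns the number of records covered in this pass.
    static long runOnePass(long first, long last, long batchSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicLong processed = new AtomicLong();
        try {
            List<Future<?>> subTasks = new ArrayList<>();
            for (long[] batch : splitIntoBatches(first, last, batchSize)) {
                subTasks.add(pool.submit(() -> {
                    // Placeholder for the real per-record work.
                    processed.addAndGet(batch[1] - batch[0] + 1);
                }));
            }
            for (Future<?> subTask : subTasks) {
                subTask.get(); // blocks until this sub task completes
            }
            return processed.get();
        } catch (Exception e) {
            throw new IllegalStateException("sub task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The real master would loop indefinitely; two passes are enough to show the cycle.
        for (int pass = 1; pass <= 2; pass++) {
            System.out.println("pass " + pass + ": " + runOnePass(1, 10_000, 1_000, 4) + " records");
        }
    }
}
```

In an actor-based version the pool would be replaced by worker actors and the `Future.get()` loop by completion messages sent back to the master.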
I'm writing a simple utility which accepts a collection of Callable tasks, and runs them in parallel. The hope is that the total time taken is little over the time taken by the longest task. The utility also adds some error handling logic - if any task fails, and the failure is something that can be treated as "retry-able" (e.g. a timeout, or a user-specified exception), then we run the task directly.
I've implemented this utility around an ExecutorService. There are two parts:
submit() all the Callable tasks to the ExecutorService, storing the Future objects.
in a for-loop, get() the result of each Future. In case of exceptions, do the "retry-able" logic.
I wrote some unit tests to ensure that using this utility is faster than running the tasks in sequence. For each test, I'd generate a certain number of Callable's, each essentially performing a Thread.sleep() for a random amount of time within a bound. I experimented with different timeouts, different number of tasks, etc. and the utility seemed to outperform sequential execution.
But when I added it to the actual system which needs this kind of utility, I saw results that were very variable - sometimes the parallel execution was faster, sometimes it was slower, and sometimes it was faster, but still took a lot more time than the longest individual task.
Am I just doing it all wrong? I know ExecutorService has invokeAll() but that swallows the underlying exceptions. I also tried using a CompletionService to fetch task results in the order in which they completed, but it exhibited more or less the same behavior. I'm reading up now on latches and barriers - is this the right direction for solving this problem?
I wrote some unit tests to ensure that using this utility is faster than running the tasks in sequence. For each test, I'd generate a certain number of Callable's, each essentially performing a Thread.sleep() for a random amount of time within a bound
Yeah this is certainly not a fair test since it is using neither CPU nor IO. I certainly hope that parallel sleeps would run faster than serial. :-)
But when I added it to the actual system which needs this kind of utility, I saw results that were very variable
Right. Whether or not a threaded application runs faster than a serial one depends a lot on a number of factors. In particular, IO bound applications will not improve in performance since they are bound by the IO channel and really cannot do concurrent operations because of this. The more processing that is needed by the application, the larger the win is to convert it to be multi-threaded.
Am I just doing it all wrong?
Hard to know without more details. You might consider playing around with the number of threads that are running concurrently. If you have a ton of jobs to process, you should not be using Executors.newCachedThreadPool(); instead, tune the size you pass to Executors.newFixedThreadPool(...) according to the number of CPUs your architecture has.
You also may want to see if you can isolate the IO operations in a few threads and the processing to other threads. Like one input thread reading from a file and one output thread (or a couple) writing to the database or something. So multiple sized pools may do better for different types of tasks instead of using a single thread-pool.
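A small sketch of that split, assuming a hypothetical blocking `load` step and a CPU-heavy `crunch` step (both are stand-ins, not anything from the question). The I/O pool is kept small, while the processing pool is sized from `Runtime.getRuntime().availableProcessors()`. The pool sizes here are starting points to measure against, not recommendations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SplitPools {

    // Hypothetical stand-in for a blocking read (file, socket, database).
    static String load(int id) {
        return "record-" + id;
    }

    // Hypothetical stand-in for CPU-heavy processing.
    static int crunch(String record) {
        return record.length();
    }

    static List<Integer> run(int jobs) {
        // A few threads for I/O; one thread per core for computation.
        ExecutorService ioPool = Executors.newFixedThreadPool(2);
        ExecutorService cpuPool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<CompletableFuture<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                final int id = i;
                pending.add(CompletableFuture
                        .supplyAsync(() -> load(id), ioPool)           // runs on the I/O pool
                        .thenApplyAsync(SplitPools::crunch, cpuPool)); // runs on the CPU pool
            }
            List<Integer> results = new ArrayList<>();
            for (CompletableFuture<Integer> f : pending) {
                results.add(f.join()); // join() rethrows failures as unchecked exceptions
            }
            return results;
        } finally {
            ioPool.shutdown();
            cpuPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(5)); // prints [8, 8, 8, 8, 8]
    }
}
```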
tried using a CompletionService to fetch task results in the order in which they completed
If you are retrying operations, using a CompletionService is exactly the way to go. As jobs finish and throw exceptions (or return failure), they can be restarted and put back into the thread-pool immediately. I don't see any reason why your performance problems would be because of this.
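Roughly, that could look like the sketch below. It assumes that "retry-able" means the task threw a `TimeoutException`. The `RetryingRunner` class, the `maxRetries` parameter and the choice of wrapping permanent failures in `IllegalStateException` are all illustrative decisions, not part of the asker's utility.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class RetryingRunner {

    // Runs all tasks in parallel. A task that fails with TimeoutException is
    // resubmitted immediately, up to maxRetries times; other failures propagate.
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads, int maxRetries) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CompletionService<T> completion = new ExecutorCompletionService<>(pool);
        Map<Future<T>, Callable<T>> inFlight = new HashMap<>();
        Map<Callable<T>, Integer> failures = new HashMap<>();
        List<T> results = new ArrayList<>();
        try {
            for (Callable<T> task : tasks) {
                inFlight.put(completion.submit(task), task);
            }
            while (!inFlight.isEmpty()) {
                Future<T> done = completion.take(); // next task to finish, in completion order
                Callable<T> task = inFlight.remove(done);
                try {
                    results.add(done.get());
                } catch (ExecutionException e) {
                    int failed = failures.merge(task, 1, Integer::sum);
                    if (e.getCause() instanceof TimeoutException && failed <= maxRetries) {
                        inFlight.put(completion.submit(task), task); // retry right away
                    } else {
                        throw new IllegalStateException("task failed permanently", e.getCause());
                    }
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Callable<String> flaky = () -> {
            if (calls.incrementAndGet() == 1) {
                throw new TimeoutException("first attempt times out");
            }
            return "flaky-ok";
        };
        Callable<String> steady = () -> "steady-ok";
        // Results arrive in completion order, so the order of the list may vary.
        System.out.println(runAll(List.of(flaky, steady), 2, 3));
    }
}
```

Because `take()` hands back whichever task finished first, a slow task no longer delays the retry of a fast-failing one, which is the main difference from looping over the futures in submission order.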
Multi-threaded programming doesn't come for free. It has an overhead. The overhead can easily exceed any performance gain, and it usually makes your code more complex.
Additional threads give access to more CPU power (assuming you have spare CPUs), but in general they won't make your HDD spin faster, give you more network bandwidth, or speed up something which is not CPU bound.
Multiple threads can help give you a greater share of an external resource.
Our company has a batch application which runs every day. It mostly does database-related jobs, for example importing data from a file into a database table.
There are 20+ tasks defined in that application, and each one may or may not depend on other ones.
The application executes tasks one by one; the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think it's too long, so I think maybe I can improve performance by multi-threading.
Since there are dependencies between tasks, I think it is not good (or at least not easy) to make the tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
For example: we have a task defined as "ImportBizData", which copies data into a database table from a data file (usually containing 100,0000+ rows). I wonder whether it is worth using multi-threading for that?
As I know only a little about multi-threading, I hope someone can provide some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contain a small, but not too small, "unit of work". Push those into a queue.
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the thread in step #1 waits for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue.
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus, inserting data in a database is a very complex task (allocating space, updating indexes, checking foreign keys).
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you.
You can quickly change the number of threads for each step to tweak performance
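The three-stage layout described above can be sketched with two bounded `BlockingQueue`s. This is only an illustration: the "parsing" is a trim-and-uppercase placeholder, the "upload" just appends to a list instead of using JDBC, and the sentinel-based shutdown is one of several possible ways to stop the stages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ImportPipeline {

    // Unique sentinel object (compared by identity) that tells a stage to stop.
    private static final String POISON = new String("EOF");

    // Reader -> parsers -> uploader, connected by bounded queues.
    static List<String> run(List<String> lines, int parserThreads) {
        BlockingQueue<String> raw = new ArrayBlockingQueue<>(100);
        BlockingQueue<String> parsed = new ArrayBlockingQueue<>(100);
        List<String> uploaded = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(parserThreads + 2);
        CountDownLatch parsersDone = new CountDownLatch(parserThreads);

        // Stage 1: a single reader pushes lines, then one sentinel per parser.
        pool.submit(() -> {
            for (String line : lines) raw.put(line);
            for (int i = 0; i < parserThreads; i++) raw.put(POISON);
            return null;
        });

        // Stage 2: parsers transform lines until they receive a sentinel.
        for (int i = 0; i < parserThreads; i++) {
            pool.submit(() -> {
                try {
                    for (String line = raw.take(); line != POISON; line = raw.take()) {
                        parsed.put(line.trim().toUpperCase()); // stand-in for real parsing
                    }
                } finally {
                    parsersDone.countDown();
                }
                return null;
            });
        }

        // Stage 3: one uploader drains the second queue until it sees the sentinel.
        Future<?> uploader = pool.submit(() -> {
            for (String row = parsed.take(); row != POISON; row = parsed.take()) {
                uploaded.add(row); // stand-in for a JDBC insert
            }
            return null;
        });

        try {
            parsersDone.await();  // all parsers have finished
            parsed.put(POISON);   // so the uploader can be told to stop
            uploader.get();
            return new ArrayList<>(uploaded);
        } catch (Exception e) {
            throw new IllegalStateException("pipeline failed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // With more than one parser, the output order is not guaranteed.
        System.out.println(run(List.of(" a ", "b", " c"), 2));
    }
}
```

The bounded queues also provide back-pressure: if the uploader is the slow stage, the parsers and the reader block instead of filling memory.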
Multi-threading may be of help. If the lines are uncorrelated, you could start two processes, one reading the even lines and the other the odd lines, get your DB connections from a connection pool (DBCP), and analyze the performance. But first I would investigate whether JDBC is the best approach at all: databases normally have optimized solutions for imports like this. These solutions may also temporarily switch off constraint checking on your table and turn it back on later, which is also great for performance. As always, it depends on your requirements.
Also, you may want to check out Spring Batch, which is designed for batch processing.
As far as I know, the JDBC-ODBC Bridge uses synchronized methods to serialize all calls to ODBC, so using multiple threads won't give you any performance boost unless it boosts your application itself.
I am not all that familiar with JDBC, but regarding the multi-threading part of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into pieces that are independent of one another and then putting their outputs back together in some way. If you don't know the underlying dependencies between tasks, you might end up with really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from the true values. Multi-threading is tricky business: fun to learn in a way (at least I think so), but a pain in the neck when things go south.
Here are a couple of links that might prove useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting effort to getting into multi-threading I can recommend GOETZ, BRIAN: JAVA CONCURRENCY, amazing book really..
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Use SQL Loader (Oracle) for uploading data into the database (very fast), OR any similar bulk-load tool for your database.
STEP2:
Run each upload process in a different thread for unrelated tasks, and in a single thread for related tasks.
P.S. You could identify the inter-related jobs in your application and categorize them into groups, then run each group in a different thread.
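A minimal sketch of that grouping idea, assuming each task can be represented as a `Runnable` and that tasks within a group must keep their order. The `GroupedTasks` class is hypothetical, and real upload steps would replace the print statements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class GroupedTasks {

    // Runs each group on its own thread; tasks inside a group run in list order.
    static void runGroups(List<List<Runnable>> groups) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, groups.size()));
        try {
            List<Future<?>> running = new ArrayList<>();
            for (List<Runnable> group : groups) {
                running.add(pool.submit(() -> group.forEach(Runnable::run)));
            }
            for (Future<?> f : running) {
                f.get(); // wait for the group and surface any failure
            }
        } catch (Exception e) {
            throw new IllegalStateException("a task group failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Runnable> related = List.of(
                () -> System.out.println("load customers"),
                () -> System.out.println("load orders (needs customers)"));
        List<Runnable> unrelated = List.of(
                () -> System.out.println("load exchange rates"));
        runGroups(List.of(related, unrelated));
    }
}
```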
Links to run you up:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
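For drivers other than Oracle's, a broadly comparable effect is available through standard JDBC batching (`addBatch` / `executeBatch`), which is a different mechanism from the Oracle-specific `setExecuteBatch`. The sketch below also commits once per chunk, as suggested in an earlier answer. The table `biz_data` and its columns are hypothetical, and the caller is assumed to supply an open `Connection`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class ChunkedInsert {

    // Inserts rows using JDBC batching and commits once per chunk.
    // Returns the number of commits issued. Table and column names are hypothetical.
    static int insertInChunks(Connection conn, List<String[]> rows, int chunkSize)
            throws SQLException {
        conn.setAutoCommit(false);
        int commits = 0;
        int pending = 0;
        String sql = "INSERT INTO biz_data (id, name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == chunkSize) {
                    ps.executeBatch(); // send the whole chunk in one round trip
                    conn.commit();     // keep undo/rollback segments small
                    commits++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final partial chunk
                ps.executeBatch();
                conn.commit();
                commits++;
            }
        }
        return commits;
    }
}
```

The best chunk size depends on the driver and the database, so it is worth measuring a few values rather than assuming one.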
If Multi threading would complicate your work, you could go with Async messaging. I'm not fully aware of what your needs are, so, the following is from what I am seeing currently.
Create a file reader java whose purpose is to read the biz file and put messages into the JMS queue on the server. This could be plain Java with static void main()
Consume the JMS messages in message-driven beans (you can set a limit on the number of beans to be created in the pool, 50 or 100 depending on the need). If you have multiple servers, well and good: your job is now split across multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads anywhere in the process. JMS is ideal because your data is within a transaction: if something fails before you send an ack to the server, the message will be resent to a consumer. The load will be split between the servers without you doing anything special like multi-threading.
Also, spring is providing spring-batch which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios