Our Spring web application uses Spring Batch with Quartz to carry out complex jobs. Most of these jobs run in the scope of a transaction, because if one part of a complex job fails, we want any previous database work to be rolled back. We would then investigate the problem, deploy a fix, and restart the servers.
It's getting to be an issue because some of these jobs do a HUGE amount of processing and can take a long time to run. As execution time starts to surpass the one-hour mark, we find ourselves unable to deploy fixes for other problems to production, because we don't want to interrupt a vital job.
I have been reading up on the Reactor implementation as a solution to our problems. We can do a small bit of processing, publish an event, and have other systems do the appropriate action as needed. Sweet!
The only question I have is: what is the best way to handle failure? If I publish an event and a Consumer fails to carry out some critical piece of functionality, will it be retried at a later time?
What if an event is published, and before all the appropriate consumers that listen for it can handle it appropriately, the server shuts down for a deployment?
I only started using Reactor recently, so I may have some misconceptions about it; however, I'll try to answer you.
Reactor is a library that helps you develop non-blocking code with back-pressure support, which may help you scale your application without consuming a lot of resources.
Reactor's fluent style could easily replace Spring Batch; however, Reactor by itself doesn't provide any way to handle transactions, and neither does Spring in a reactive context. With the current JDBC implementation, database access will always be blocking, since there is no driver-level support for non-blocking processing. There are discussions about how to handle transactions reactively, but as far as I know there is no final decision on the matter.
You can still use transactions, but remember that you won't get non-blocking processing, since you need to update/delete/insert/commit in the same thread, or manually propagate the transactional context to the new thread and block the main thread.
So I believe Reactor won't help you solve your performance issues, and another kind of approach may be needed.
My recommendation is:
- Use parallel processing in Spring Batch (see the sketch after this list)
- Find the optimal chunk size
- Review your indexes (not just creating them, but also deleting unused ones)
- Review your queries
- Avoid unneeded transformations
- And even more important: profile it! The bottleneck can be something you'd never suspect
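To illustrate the first point, here is a minimal sketch of a multi-threaded, chunk-oriented step in the Spring Batch 4 builder style, inside a @Configuration class. The MyRecord type and the reader/writer beans are placeholders, and the reader must be thread-safe for this to be correct:

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    @Bean
    public Step parallelStep(StepBuilderFactory steps,
                             ItemReader<MyRecord> reader,   // must be thread-safe
                             ItemWriter<MyRecord> writer) {
        SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("batch-");
        executor.setConcurrencyLimit(4);
        return steps.get("parallelStep")
                .<MyRecord, MyRecord>chunk(1000)  // each chunk is its own transaction
                .reader(reader)
                .writer(writer)
                .taskExecutor(executor)           // chunks are processed concurrently
                .throttleLimit(4)
                .build();
    }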
Spring Batch allows you to break large transactions up into multiple smaller transactions when you use chunk-oriented processing. If a chunk fails, its transaction rolls back, but all previous chunks' transactions will already have committed. By default, when you restart the job, it starts again from where it failed; so if it had already processed 99 chunks successfully and the 100th chunk failed, restarting the job starts from the 100th chunk and continues.
If you have a long-running job and want to deploy a new version, you can stop the job and it will stop after processing the current chunk. You can then restart the job from where it was stopped. It helps to have a GUI to view, launch, stop, and restart your jobs; you can use Spring Batch Admin or Spring Cloud Data Flow for that if you want an out-of-the-box GUI.
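For illustration, stopping before a deployment and restarting afterwards might look like this with Spring Batch's JobOperator (assuming jobOperator is injected, and jobExecutionId comes from your own tracking or the job repository):

    // Signals the running execution to stop once the current chunk commits.
    jobOperator.stop(jobExecutionId);
    // ... deploy the new version, then resume from the last committed chunk:
    jobOperator.restart(jobExecutionId);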
Context of my question:
I use a proprietary database (the target database) and I cannot reveal its name (you may not know it even if I did).
Here, I usually need to update records using Java. (The number of records varies from 20,000 to 40,000.)
Each update transaction takes one or two seconds on this DB, so you can see that the total execution time would run into hours. No batch-execution functions are available in this database's API. Because of this, I am thinking of using Java's multithreading: instead of processing all the records in a single thread, I want to create a thread for every 100 records. We know that Java can run these threads in parallel.
But I want to know how the DB processes these threads sharing the same connection. I could find this out by running a trial program and comparing time intervals, but I feel that could be deceiving to some extent. I know that you don't have much information about the database; you can answer this question assuming the DB is MS SQL/MySQL.
Please suggest any other Java feature I could use to make this program execute faster, if not multithreading.
It is not recommended to use a single connection with multiple threads; you can read about the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure the threads start and stop successfully within a transaction. If one of them fails, you have to make sure to roll back the changes. So: first get the count, create cursor ranges, and for each range start a thread that executes the update on that range. One thing to watch for is to not close the connection after executing each partition individually, but to close it only when the whole transaction is complete and the DB has committed.
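If you can relax the single-connection constraint, a rough sketch of the range partitioning with one pooled connection per thread might look like this. The table name, column names, pool size, and the DataSource are all assumptions; note that each range commits independently here, so a failed range has to be retried rather than rolling back the others:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import javax.sql.DataSource;

    public class RangeUpdater {
        private final DataSource dataSource; // pooled: one connection per thread

        public RangeUpdater(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public void updateAll(int total, int rangeSize) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<?>> results = new ArrayList<>();
            for (int start = 0; start < total; start += rangeSize) {
                final int from = start;
                final int to = Math.min(start + rangeSize, total);
                results.add(pool.submit(() -> updateRange(from, to)));
            }
            for (Future<?> f : results) {
                f.get(); // surfaces the first failure, if any
            }
            pool.shutdown();
        }

        private Void updateRange(int from, int to) throws SQLException {
            try (Connection con = dataSource.getConnection()) {
                con.setAutoCommit(false);
                try (PreparedStatement ps = con.prepareStatement(
                        "UPDATE records SET status = 'DONE' WHERE id >= ? AND id < ?")) {
                    ps.setInt(1, from);
                    ps.setInt(2, to);
                    ps.executeUpdate();
                    con.commit();   // each range commits independently
                } catch (SQLException e) {
                    con.rollback(); // undo this range on failure
                    throw e;
                }
            }
            return null;
        }
    }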
If you have an option to use Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.
I have an architectural question about how to make big tasks both transactional and scalable in Java/Java EE.
The general challenge
I have a web application (Tomcat right now, but that should not limit the solution space, so just take this to illustrate what I'd like to achieve). This web application is distributed over several (virtual and physical) nodes, connected to a central DBMS (MySQL in this case, but again, this should not limit the solution...) and able to handle some 1000s of users, serving pages, doing stuff, just as you'd expect from your average web-based information system.
Now, there are some tasks which affect a larger portion of data and the system should be optimized to carry out these tasks reasonably fast. (Faster than processing everything sequentially, that is). So I'd make the task parallel and distribute it over several (or all) nodes:
(Note: the data portions which are processed are independent, so there are no database or locking conflicts here).
The problem is, I'd like the (whole) task to be transactional. So if one of the parallel subtasks fails, I'd like to have all other tasks rolled back as a result. Otherwise the system would be in a potentially inconsistent state from a domain perspective.
Current implementation
As I said, the current implementation uses Tomcat and MySQL. The nodes use JMS to communicate: there is a JMS server to which a dispatcher sends a message for each subtask; executors take tasks from the message queue, execute them, and post the results to a result queue, from which the dispatcher collects them. The dispatcher blocks and waits for all results to come in, and if everything is fine, it terminates with an OK status.
The problem here is that all the executors have their own local transaction context, so the picture would look like this:
If for some reason one of the subtasks fails, the local transaction is rolled back and the dispatcher gets an error result. (There is some failsafe mechanism here, which tries to repeat the failed transaction, but let's assume that, for some reason, the one task cannot be completed.)
The problem is that the system is now in a state where all transactions but one are already committed and completed. And because I cannot get that one final transaction to finish successfully, I cannot get out of this state.
Possible solutions
These are the thoughts which I have followed so far:
I could somehow implement a domain-specific rollback mechanism myself. Because the dispatcher knows which tasks have been carried out, it could revert their effects explicitly (e.g. storing old values somewhere and reverting the already committed values back to the previous ones). Of course, in this case, I must guarantee that no other process changes anything in between, so I'd also have to set the system to a read-only state while the big operation is running.
More or less, I'd need to simulate a transaction in business logic ...
I could choose not to parallelize and do everything on a single node in one big transaction (but as stated at the beginning, I need to speed up processing, so this is not an option...)
I have tried to find out about XATransactions or distributed transactions in general, but this seems to be an advanced Java EE feature, which is not implemented in all Java EE servers, and which would not really solve that basic problem, because there does not seem to be a way to transfer a transaction context over to a remote node in an asynchronous call. (e.g. section 4.5.3 of EJB Specification 3.1: "Client transaction context does not propagate with an asynchronous method invocation. From the Bean Developer’s view, there is never a transaction context flowing in from the client.")
The Question
Am I overlooking something? Is it not possible to distribute a task asynchronously over several nodes and at the same time have a (shared) transactional state which can be rolled back as a whole?
Thanks for any pointers, hints, propositions ...
If you want to distribute your application as described, JTA is your friend in a Java EE context. Since it's part of the Java EE spec, you should be able to use it in any compliant container. As with all implementations of the spec, there are differences in the details or configuration, as for example with JPA, but in real life it's uncommon to change application servers often.
But without knowing the details and complexity of your problem, my advice is to rethink whether you really need to distribute the task execution for one use case, or whether it is possible, and better, to keep everything belonging to that one use case within one node, even though you might need several nodes for the overall application. If you really do need several nodes to meet your requirements, then I'd go for distributed tasks that do not write directly to the database, but instead return their results, with the one component that initiated the tasks committing or rolling them back.
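A rough sketch of that last idea, with an ExecutorService standing in for the JMS dispatcher/executors (Partition, Result, compute, and persistAllInOneTransaction are all placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class Dispatcher {
        public void runBigTask(List<Partition> partitions) throws Exception {
            ExecutorService workers = Executors.newFixedThreadPool(4);
            List<Future<Result>> futures = new ArrayList<>();
            for (Partition p : partitions) {
                futures.add(workers.submit(() -> compute(p))); // no DB writes here
            }
            List<Result> results = new ArrayList<>();
            for (Future<Result> f : futures) {
                results.add(f.get()); // a failed subtask aborts before any write
            }
            persistAllInOneTransaction(results); // single, ordinary local transaction
            workers.shutdown();
        }

        private Result compute(Partition p) { /* pure computation */ return new Result(); }

        private void persistAllInOneTransaction(List<Result> results) {
            // placeholder: open one transaction, write all results, commit
        }

        static class Partition {}
        static class Result {}
    }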
And don't forget to measure first, before over-engineering the architecture. Try to keep it simple at first, assuming that one node can handle it, and then write a stress test that tries to break your system, to learn the maximum load it can handle with the given architecture.
I am supposed to design a component that achieves the following tasks using multithreading in Java, as the files are huge and numerous and the task has to happen in a very short window:
- Read multiple CSV/XML files and save all the data to a database
- Read the database and write the data to separate CSV & XML files according to the transaction types. (Each file may contain different types of records, like file header, batch header, batch footer, file footer, different transactions, and a checksum record.)
I am very new to multithreading and am doing some research on Spring Batch in order to use it for the above tasks.
Please let me know whether you would suggest traditional multithreading in Java or Spring Batch. The input sources are multiple here, and the output targets are also multiple.
I would recommend going with something from a framework rather than writing the whole threading part yourself. I've quite successfully used Spring's tasks and scheduling for scheduled jobs that involved fetching data from the DB, doing some processing, sending emails, and writing data back to the database.
Spring Batch is ideal for implementing your requirement. First of all, you can use the built-in readers and writers to simplify your implementation - there is very good support for parsing CSV files, XML files, reading from a database via JDBC, etc. You also get the benefit of features like retrying in case of failure, skipping invalid input, and restarting the whole job if something fails in between - the framework tracks the status and restarts from where it left off. Implementing all this by yourself is very complex, and doing it well requires a lot of effort.
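For instance, wiring the CSV-to-database direction with the builder API might look roughly like this, inside a @Configuration class (the file path, column names, SQL, and the Transaction class are invented for illustration):

    import javax.sql.DataSource;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.core.io.FileSystemResource;

    @Bean
    public FlatFileItemReader<Transaction> csvReader() {
        return new FlatFileItemReaderBuilder<Transaction>()
                .name("csvReader")
                .resource(new FileSystemResource("input/transactions.csv"))
                .delimited()
                .names("id", "type", "amount")    // CSV columns, in order
                .targetType(Transaction.class)    // mapped onto bean properties
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<Transaction> dbWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .dataSource(dataSource)
                .sql("INSERT INTO transactions (id, type, amount) "
                        + "VALUES (:id, :type, :amount)")
                .beanMapped()                     // bind :params from bean properties
                .build();
    }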
Once you implement your batch jobs with Spring Batch, it gives you simple ways of parallelizing them. A single step can be run in multiple threads - this is mostly a configuration change. If you have multiple steps to be performed, you can configure that as well. There is also support for distributing the processing over multiple machines if required. Most of the work to achieve parallelism is done by Spring Batch.
I would strongly suggest that you prototype a couple of your most complex scenarios with Spring Batch. If that works out, you can go ahead with Spring Batch. Implementing it on your own, especially when you are new to multithreading, is a sure recipe for disaster.
I need a mechanism for implementing asynchronous job scheduling in Java and was looking at the Quartz Scheduler, but it seems that it does not offer the necessary functionality.
More specifically, my application, which runs across different nodes, has a web-based UI, via which users schedule several different jobs. When a job is completed (sometime in the future), it should report back to the UI, so that the user is informed of its status. Until then, the user should have the option of editing or cancelling a scheduled job.
An implementation approach would be to have a Scheduler thread running constantly in the background in one of the nodes and collecting JobDetail definitions for job execution.
In any case, there are two questions (applicable to either a single-node or a multi-node scenario):
Does Quartz allow a modification or a cancellation of an already scheduled job?
How can a "callback" mechanism be implemented so that a job execution result is reported back to the UI?
Any code examples, or pointers, are greatly appreciated.
Does Quartz allow a modification or a cancellation of an already scheduled job?
You can "unschedule" a job:
scheduler.unscheduleJob(triggerKey("trigger1", "group1"));
Or delete a job:
scheduler.deleteJob(jobKey("job1", "group1"));
As described in the docs:
http://quartz-scheduler.org/documentation/quartz-2.x/cookbook/UnscheduleJob
Note, the main difference between the two is that unscheduling a job removes the given trigger, while deleting the job removes the job itself along with all of its triggers.
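You can also modify (rather than cancel) an already scheduled job by replacing its trigger with rescheduleJob. The trigger/group names below are placeholders, and the builder methods come from static imports of TriggerBuilder, TriggerKey, and SimpleScheduleBuilder:

    // Build a replacement trigger, then swap it in for the old one.
    Trigger replacement = newTrigger()
            .withIdentity("trigger1", "group1")
            .startNow()
            .withSchedule(simpleSchedule().withIntervalInHours(1).repeatForever())
            .build();

    scheduler.rescheduleJob(triggerKey("trigger1", "group1"), replacement);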
How can a "callback" mechanism be implemented so that a job execution result is reported back to the UI?
Typically, in the web world, you poll the web server for changes. Depending upon your web framework, there may be a component available (Push, Pull, Poll?) that makes this easy. You will also need to store some state about your job on the server: once the job finishes, you might update a value in the database or possibly in memory. This, in turn, gets picked up by the polling and displayed to the user.
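One way to capture that state is a Quartz JobListener that records each job's outcome somewhere your polling endpoint can read; the in-memory map below is just a stand-in for a database table or cache:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;
    import org.quartz.JobKey;
    import org.quartz.listeners.JobListenerSupport;

    public class StatusListener extends JobListenerSupport {
        private final Map<JobKey, String> statuses = new ConcurrentHashMap<>();

        @Override
        public String getName() {
            return "statusListener";
        }

        @Override
        public void jobWasExecuted(JobExecutionContext context,
                                   JobExecutionException jobException) {
            // Record the outcome when the job finishes.
            statuses.put(context.getJobDetail().getKey(),
                    jobException == null ? "COMPLETED" : "FAILED");
        }

        public String statusOf(JobKey key) {
            return statuses.getOrDefault(key, "PENDING"); // read by the polling UI
        }
    }

Register it with scheduler.getListenerManager().addJobListener(new StatusListener()); to observe all jobs.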
To complete the answer of @johncarl, Quartz also has:
- handling of events before and after execution, via listener interfaces (JobListener for per-job events, SchedulerListener for scheduler-level events)
- grouping of your jobs
- support with the Enterprise Edition
- and, most importantly, cron expressions; here's the documentation
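As a small illustrative example (the names and schedule are made up), a cron trigger that fires at 2:30 AM every weekday:

    import static org.quartz.CronScheduleBuilder.cronSchedule;
    import static org.quartz.TriggerBuilder.newTrigger;
    import org.quartz.Trigger;

    // Fires at 02:30 every Monday through Friday (sec min hour dom month dow).
    Trigger trigger = newTrigger()
            .withIdentity("nightlyTrigger", "group1")
            .withSchedule(cronSchedule("0 30 2 ? * MON-FRI"))
            .build();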
I have quite an update-intensive application and was wondering if it is a good idea to have a daemon thread that holds a Hibernate session, takes update instructions, and groups those updates into a batch (which I guess is more efficient).
Thank you in advance.
If your use cases trigger many updates, but those updates don't have to be executed right when they are triggered (i.e. the data they would add is not needed on the spot), then you can - and in fact it is advisable to - store them and execute them at a later time (at night, or whenever the application is less heavily hit). You don't necessarily need a daemon (or any other kind of) thread for that: just store your update requests somewhere, and have a cron job or similar be triggered at certain intervals to execute them. You can use Quartz Scheduler for this; it will deal with all the underlying threads/scheduling.
Also, regarding batching multiple queries vs. executing them individually: normally Hibernate takes care of that (if you issue multiple update requests in the same transaction, Hibernate will batch them together and execute them at the latest possible time - usually when the transaction ends), so this shouldn't be a concern for you. What MIGHT become a concern, depending on your number of update requests, is Hibernate itself. At some point you might want to consider using JDBC and batch-executing those updates manually to gain performance.
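A minimal sketch of that manual JDBC batching (the table, columns, and the Update value object are invented; the connection is assumed to come from your pool):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    class BatchUpdater {
        // Placeholder value object for a pending update.
        static class Update { long id; long balance; }

        // Groups many updates into one round trip with addBatch/executeBatch.
        static void flushUpdates(Connection con, List<Update> updates) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE accounts SET balance = ? WHERE id = ?")) {
                for (Update u : updates) {
                    ps.setLong(1, u.balance);
                    ps.setLong(2, u.id);
                    ps.addBatch();     // queue the statement, don't execute yet
                }
                ps.executeBatch();     // one round trip for the whole group
                con.commit();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            }
        }
    }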