I have an architectural question about how to handle big tasks in a way that is both transactional and scalable in Java/Java EE.
The general challenge
I have a web application (Tomcat right now, but that should not limit the solution space, so just take this to illustrate what I'd like to achieve). This web application is distributed over several (virtual and physical) nodes, connected to a central DBMS (MySQL in this case, but again, this should not limit the solution...) and able to handle some 1000s of users, serving pages, doing stuff, just as you'd expect from your average web-based information system.
Now, there are some tasks which affect a larger portion of data and the system should be optimized to carry out these tasks reasonably fast. (Faster than processing everything sequentially, that is). So I'd make the task parallel and distribute it over several (or all) nodes:
(Note: the data portions which are processed are independent, so there are no database or locking conflicts here).
The problem is, I'd like the (whole) task to be transactional. So if one of the parallel subtasks fails, I'd like to have all other tasks rolled back as a result. Otherwise the system would be in a potentially inconsistent state from a domain perspective.
Current implementation
As I said, the current implementation uses Tomcat and MySQL. The nodes communicate via JMS: a dispatcher sends a message to a JMS server for each subtask; executors take tasks from the message queue, execute them, and post the results to a result queue, from which the dispatcher collects them. The dispatcher blocks and waits for all results to come in, and if everything is fine, it terminates with an OK status.
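For illustration, a rough sketch of that dispatcher side, assuming plain JMS with placeholder queue names and String payloads (the executor side and the fail-safe retry are omitted):

```java
import java.util.List;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class Dispatcher {

    private final ConnectionFactory connectionFactory; // obtained from JNDI or the JMS provider

    public Dispatcher(ConnectionFactory connectionFactory) {
        this.connectionFactory = connectionFactory;
    }

    /** Returns true only if every executor reported success for its subtask. */
    public boolean runBigTask(List<String> subtaskPayloads) throws Exception {
        Connection connection = connectionFactory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue taskQueue = session.createQueue("taskQueue");     // placeholder queue names
            Queue resultQueue = session.createQueue("resultQueue");

            // One message per subtask; executors on other nodes consume and process these.
            MessageProducer producer = session.createProducer(taskQueue);
            for (String payload : subtaskPayloads) {
                producer.send(session.createTextMessage(payload));
            }

            // Block until a result has arrived for every subtask (or a timeout hits).
            MessageConsumer consumer = session.createConsumer(resultQueue);
            for (int i = 0; i < subtaskPayloads.size(); i++) {
                TextMessage result = (TextMessage) consumer.receive(60_000); // ms
                if (result == null || !"OK".equals(result.getText())) {
                    return false; // one subtask failed; its siblings have already committed locally
                }
            }
            return true;
        } finally {
            connection.close();
        }
    }
}
```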
The problem here is that all the executors have their own local transaction context, so the picture would look like this:
If for some reason one of the subtasks fails, the local transaction is rolled back and the dispatcher gets an error result. (There is some fail-safe mechanism here which tries to repeat the failed transaction, but let's assume that, for some reason, the one task cannot be completed.)
The problem is that the system is now in a state where all transactions but one are already committed and completed. And because I cannot get that one final transaction to finish successfully, I cannot get out of this state.
Possible solutions
These are the thoughts which I have followed so far:
I could somehow implement a domain-specific rollback mechanism myself. Because the dispatcher knows which tasks have been carried out, it could revert their effects explicitly (e.g. by storing old values somewhere and reverting already committed values back to the previous ones). Of course, in this case I must guarantee that no other process changes anything in between, so I'd also have to keep the system in a read-only state for as long as the big operation is running.
More or less, I'd need to simulate a transaction in business logic ...
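A minimal sketch of what such a domain-level rollback could look like (all types and names here are illustrative, and it relies on the read-only window mentioned above):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical compensation log kept by the dispatcher while the big operation runs.
public class CompensationLog {

    /** A single reversible change: knows how to re-apply the previously stored values. */
    public interface CompensatingAction {
        void revert();
    }

    private final Deque<CompensatingAction> completedSteps = new ArrayDeque<>();

    /** Called after each subtask commits, with the captured old values. */
    public synchronized void recordCompleted(CompensatingAction undo) {
        completedSteps.push(undo);
    }

    /** Called when one subtask cannot be completed: undo the others in reverse order. */
    public synchronized void rollBackAll() {
        while (!completedSteps.isEmpty()) {
            completedSteps.pop().revert(); // e.g. write the old values back to the database
        }
    }
}
```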
I could choose not to parallelize and do everything on a single node in one big transaction (but as stated at the beginning, I need to speed up processing, so this is not an option...)
I have tried to find out about XA transactions and distributed transactions in general, but this seems to be an advanced Java EE feature which is not implemented in all Java EE servers, and which would not really solve the basic problem, because there does not seem to be a way to transfer a transaction context over to a remote node in an asynchronous call. (e.g. section 4.5.3 of EJB Specification 3.1: "Client transaction context does not propagate with an asynchronous method invocation. From the Bean Developer’s view, there is never a transaction context flowing in from the client.")
The Question
Am I overlooking something? Is it not possible to distribute a task asynchronously over several nodes and at the same time have a (shared) transactional state which can be rolled back as a whole?
Thanks for any pointers, hints, propositions ...
If you want to distribute your application as described, JTA is your friend in a Java EE context. Since it's part of the Java EE spec, you should be able to use it in any compliant container. As with all implementations of the spec, there are differences in the details and configuration, as with JPA for example, but in real life it's very uncommon to change application servers often.
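For illustration, a minimal sketch of demarcating a JTA transaction programmatically in a compliant container (the JNDI name, table, and batch id are placeholders, and the data source would have to be XA-capable for a true distributed commit):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

import javax.annotation.Resource;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

public class BigTaskBean {

    @Resource
    private UserTransaction userTransaction;

    @Resource(lookup = "java:comp/env/jdbc/AppDS") // placeholder JNDI name of an XA data source
    private DataSource dataSource;

    public void runTransactionally() throws Exception {
        userTransaction.begin();
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE items SET processed = 1 WHERE batch_id = ?")) {
            ps.setLong(1, 42L); // placeholder batch id
            ps.executeUpdate();
            // ... more work enlisted in the same JTA transaction ...
            userTransaction.commit();   // all enlisted resources commit together
        } catch (Exception e) {
            userTransaction.rollback(); // or everything rolls back together
            throw e;
        }
    }
}
```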
But without knowing the details and complexity of your problem, my advice is to rethink whether you really need to distribute the task execution for one use case, or whether it would be possible, and better, to keep everything belonging to that one use case within one node, even though you might need several nodes for the overall application. If you really have to use several nodes to fulfill your requirements, then I'd go for distributed tasks which do not write directly to the database, but give back their results, which are then committed or rolled back by the one component that initiated the tasks.
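A minimal sketch of that second idea, with placeholder types: the workers only compute, and the initiating component writes all results in a single local transaction:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ResultCollectingCoordinator {

    /** Placeholder for whatever a subtask computes without touching the database. */
    public static class SubtaskResult { }

    public void runBigTask(List<Callable<SubtaskResult>> subtasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // or remote workers behind a queue
        try {
            List<Future<SubtaskResult>> futures = new ArrayList<>();
            for (Callable<SubtaskResult> subtask : subtasks) {
                futures.add(pool.submit(subtask));
            }

            List<SubtaskResult> results = new ArrayList<>();
            for (Future<SubtaskResult> future : futures) {
                results.add(future.get()); // a failed subtask throws here, before anything is written
            }

            persistAllInOneTransaction(results); // single local transaction: commit or roll back as a whole
        } finally {
            pool.shutdown();
        }
    }

    private void persistAllInOneTransaction(List<SubtaskResult> results) {
        // Write every result inside one database transaction owned by this component.
    }
}
```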
And don't forget to measure first, before over-engineering the architecture. Try to keep it simple at first, assuming that one node can handle it, and then write a stress test that tries to break your system, to learn the maximum load it can handle with the given architecture.
Related
Context of my question:
I use a proprietary database (the target database) and I cannot reveal its name (you may not know it even if I did).
Here, I usually need to update the records using Java. (The number of records varies from 20,000 to 40,000.)
Each update transaction takes one or two seconds on this DB, so the total execution time would be measured in hours. No batch execution functions are available in this database's API. For this reason I am thinking of using Java's multi-threading: instead of processing all the records in a single process, I want to create a thread for every 100 records. We know that Java can run these threads in parallel.
But I want to know how the DB processes these threads sharing the same connection. I could find this out by running a trial program and comparing time intervals, but I feel that may be deceptive to some extent. I know you don't have much information about the database; you can answer this question assuming the DB is MS SQL/MySQL.
Please suggest any other feature in Java I could use to make this program execute faster, if not multi-threading.
It is not recommended to use a single connection with multiple threads; you can read about the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure the threads start and stop successfully within a single transaction. If one of them fails, you have to roll back all the changes. So, first get the record count, build cursor ranges, and for each range start a thread that executes on that range. One thing to watch out for is not to close the connection after executing the individual partitions, but to close it only once the transaction is complete and committed to the DB.
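A minimal sketch of that idea, assuming plain JDBC and a hypothetical records table keyed by a numeric id (driver URL, table, and column names are illustrative only; note that most JDBC drivers serialize concurrent work on a single connection):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedUpdate {

    static final int RANGE_SIZE = 100; // records per thread, as in the question

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; replace with the real driver/URL.
        try (Connection con = DriverManager.getConnection("jdbc:mydb://host/db", "user", "pw")) {
            con.setAutoCommit(false); // one transaction spanning all ranges

            int totalRecords = 40_000; // would normally come from a COUNT(*) query
            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<?>> results = new ArrayList<>();

            for (int start = 0; start < totalRecords; start += RANGE_SIZE) {
                final int from = start, to = start + RANGE_SIZE;
                results.add(pool.submit((Callable<Void>) () -> {
                    // All threads share the same connection, as described above.
                    try (PreparedStatement ps = con.prepareStatement(
                            "UPDATE records SET status = ? WHERE id >= ? AND id < ?")) {
                        ps.setString(1, "PROCESSED");
                        ps.setInt(2, from);
                        ps.setInt(3, to);
                        ps.executeUpdate();
                    }
                    return null;
                }));
            }

            try {
                for (Future<?> f : results) {
                    f.get(); // propagate any failure from the worker threads
                }
                con.commit();   // only commit once every range succeeded
            } catch (Exception e) {
                con.rollback(); // one failed range rolls back everything
                throw e;
            } finally {
                pool.shutdown();
            }
        }
    }
}
```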
If you have an option to use Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.
I have written a single Kafka consumer (using Spring Kafka), that reads from a single topic and is a part of a consumer group. Once a message is consumed, it will perform all downstream operations and move on to the next message offset. I have packaged this as a WAR file and my deployment pipeline pushes this out to a single instance. Using my deployment pipeline, I could potentially deploy this artifact to multiple instances in my deployment pool.
However, when I want multiple consumers as part of my infrastructure, I am unsure about the following:
I can define multiple instances in my deployment pool and have this WAR running on all of those instances. That would mean all of them are listening to the same topic, are part of the same consumer group, and will divide the partitions among themselves. The downstream logic works as is. This works perfectly fine for my use case; however, I am not sure if this is the optimal approach to follow.
Reading online, I came across resources here and here, where people define a single consumer thread but internally create multiple worker threads. There are also examples where we could define multiple consumer threads that do the downstream logic. Thinking about these approaches and mapping them to deployment environments, we could achieve the same result (as my theoretical solution above could), but with fewer machines.
Personally, I think my solution is simple and scalable but might not be optimal, while the second approach might be optimal. I wanted to know your experiences, suggestions, or any other metrics/constraints I should consider. Also, I am thinking that with my theoretical solution I could employ bare-bones, simple machines as Kafka consumers.
While I know I haven’t posted any code, please let me know if I need to move this question to another forum. If you need specific code examples, I can provide them too, but I didn’t think they were important in the context of my question.
Your existing solution is best. Handing off to another thread will cause problems with offset management. Spring Kafka allows you to run multiple threads in each instance, as long as you have enough partitions.
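A minimal sketch of that per-instance concurrency, assuming Spring Kafka's ConcurrentKafkaListenerContainerFactory (topic, group id, and concurrency value are placeholders):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafka;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@EnableKafka
@Configuration
public class ConsumerConfiguration {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Runs up to 3 listener threads in this instance; effective parallelism is
        // still capped by the number of partitions on the topic.
        factory.setConcurrency(3);
        return factory;
    }
}

// Listener using the factory above; "my-topic" and "my-group" are placeholders.
class MyListener {

    @KafkaListener(topics = "my-topic", groupId = "my-group")
    public void onMessage(String message) {
        // downstream operations for each consumed record
    }
}
```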
If your current approach works, just stick to it. It's the simple and elegant way to go.
You would only go to approach 2 if you cannot, for some reason, increase the number of partitions but need a higher level of parallelism. But then you have ordering and race conditions to worry about. If you ever need to go that route, I'd recommend the akka-stream-kafka library, which provides facilities to handle offset commits correctly, to do what you need in parallel, and then to merge back into a single stream preserving the original ordering, etc. Otherwise, these things are error-prone to do yourself.
Let's start off by saying that I'm using version 3.1.2 of Axon Framework with tracking event processors enabled for both @EventHandlers and Sagas.
The current default behaviour for creating event processors for Sagas, as I see it, is to create a single tracking event processor per Saga. This works quite well at microservice scale, but might turn out to be a problem in big monolithic applications which implement a lot of Sagas. Since I'm writing such an application, I want better control over the number of running threads, which in turn gives me better control over the usage of the database connection pool, context switching, and memory usage. Ideally, I would like to have as many tracking event processors as CPU cores, where each event processor executes multiple Sagas and/or @EventHandlers.
I have already figured out that I can do this for @EventHandlers via either the @ProcessingGroup annotation or the EventHandlingConfiguration::assignHandlersMatching method, but SagaConfiguration does not seem to expose a similar API. In fact, the most specific SagaConfiguration::trackingSagaManager method is hardcoded to create a new TrackingEventProcessor object, which makes me think that what I'm trying to achieve is currently impossible. So here's my question: is there some non-straightforward way that I'm missing which would let me execute multiple Sagas in the context of a single event processor?
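For reference, a minimal sketch of the grouping that already works for plain event handlers, assuming Axon 3.x's configuration API (processor name and package are placeholders):

```java
import org.axonframework.config.EventHandlingConfiguration;
import org.axonframework.config.ProcessingGroup;
import org.axonframework.eventhandling.EventHandler;

// Option 1: assign this handler to a named processor via the annotation.
@ProcessingGroup("sharedProcessor")
class AccountProjection {

    @EventHandler
    public void on(Object event) {
        // update some read model
    }
}

// Option 2: assign handlers to a processor programmatically.
class ProcessorSetup {

    void configure(EventHandlingConfiguration eventHandlingConfiguration) {
        // Every handler in this (placeholder) package ends up in one tracking processor.
        eventHandlingConfiguration.assignHandlersMatching(
                "sharedProcessor",
                handler -> handler.getClass().getPackage().getName().startsWith("com.example.projections"));
    }
}
```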
I can confirm that it is (currently) not possible to have multiple Sagas managed by a single EventProcessor. Added to that, I'm still weighing the pros and cons of doing so, as your scenario doesn't sound too weird at first glance.
I recommend dropping a feature request on the AxonFramework GitHub page. That way we (1) document this idea/desire and (2) have a good place to discuss whether or not to implement it.
Our Spring web application uses Spring Batch with Quartz to carry out complex jobs. Most of these jobs run in the scope of a transaction, because if one part of the complex system fails we want any previous database work to be rolled back. We would then investigate the problem, deploy a fix, and restart the servers.
It's getting to be an issue because some of these jobs do a HUGE amount of processing and can take a long time to run. As execution time starts to surpass the 1 hour mark, we find ourselves unable to deploy fixes to production for other problems because we don't want to interrupt a vital job.
I have been reading up on the Reactor implementation as a solution to our problems. We can do a small bit of processing, publish an event, and have other systems do the appropriate action as needed. Sweet!
The only question I have is, what is the best way to handle failure? If I publish an event and a Consumer fails to conduct some critical functionality, will it restart at a later time?
What if an event is published, and before all the appropriate consumers that listen for it can handle it appropriately, the server shuts down for a deployment?
I just started to use Reactor recently, so I may have some misconceptions about it; however, I'll try to answer you.
Reactor is a library which helps you develop non-blocking code with back-pressure support, which may help you scale your application without consuming a lot of resources.
The fluent style of Reactor can easily replace Spring Batch; however, Reactor by itself doesn't provide any way to handle transactions, and neither does Spring in this case. With the current JDBC implementation it will always be blocking, since there is no support at the driver level for non-blocking processing. There are discussions about how to handle transactions, but as far as I know there is no final decision on the matter.
You can always use transactions, but remember that you will not get non-blocking processing, since you need to update/delete/insert/commit in the same thread, or manually propagate the transactional context to the new thread and block the main thread.
So I believe Reactor won't solve your performance issues, and another kind of approach may be needed.
My recommendation is:
- Use parallel processing in Spring Batch (see the sketch after this list)
- Find the optimal chunk size
- Review your indexes (not only creating them, but also dropping unneeded ones)
- Review your queries
- Avoid unneeded transformations
- And even more important: profile it! The bottleneck may be something you would never suspect.
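A minimal sketch of the first point, assuming Spring Batch's builder API; the step name, reader, writer, chunk size, and record type are placeholders to be tuned for your job:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

class ParallelStepConfig {

    Step parallelStep(StepBuilderFactory steps,
                      ItemReader<MyRecord> reader,
                      ItemWriter<MyRecord> writer) {
        return steps.get("parallelStep")
                .<MyRecord, MyRecord>chunk(500)              // chunk size: measure to find the optimum
                .reader(reader)                              // must be thread-safe for multi-threaded steps
                .writer(writer)
                .taskExecutor(new SimpleAsyncTaskExecutor()) // process chunks in parallel threads
                .throttleLimit(8)                            // cap concurrent chunk executions
                .build();
    }

    static class MyRecord { /* placeholder domain type */ }
}
```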
Spring Batch allows you to break large transactions up into multiple smaller transactions when you use chunk-oriented processing. If a chunk fails, its transaction rolls back, but all previous chunks' transactions will already have committed. By default, when you restart the job, it starts again from where it failed; so if it had already processed 99 chunks successfully and the 100th chunk failed, restarting the job starts from the 100th chunk and continues.
If you have a long-running job and want to deploy a new version, you can stop the job and it will stop after processing the current chunk. You can then restart the job from where it was stopped. It helps to have a GUI to view, launch, stop, and restart your jobs; you can use Spring Batch Admin or Spring Cloud Data Flow if you want an out-of-the-box GUI.
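The same stop/restart can also be done programmatically; a minimal sketch, assuming Spring Batch's JobOperator is available as a bean (the execution ids would come from your own tracking or the job repository):

```java
import org.springframework.batch.core.launch.JobOperator;

class DeploymentHelper {

    private final JobOperator jobOperator;

    DeploymentHelper(JobOperator jobOperator) {
        this.jobOperator = jobOperator;
    }

    // Before deploying: ask the running execution to stop after the current chunk commits.
    void stopForDeployment(long runningExecutionId) throws Exception {
        jobOperator.stop(runningExecutionId);
    }

    // After deploying: restart from the last successfully committed chunk.
    long resumeAfterDeployment(long stoppedExecutionId) throws Exception {
        return jobOperator.restart(stoppedExecutionId);
    }
}
```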
I plan to implement a GAE app only for my own usage.
The application will get its data using URL Fetch service, updating it every x minutes (using Scheduled tasks). Then it will serve that information to me when I request it.
I have barely started to look into GAE, but I have one main question that I have not been able to clear up: can state be maintained in GAE between different requests without using JDO/JPA and the datastore?
As I am the only user, I guess I could keep the info in a servlet subclass and so avoid having to deal with the datastore... but my concern is that, as this app will get very few requests, if the instance is moved to disk or whatever (I don't know yet if that has a specific name), it will lose its state?
I am not concerned about having to restart the whole app and start collecting data from scratch from time to time, that is ok.
If this is an app for your own use, and you're double-extra sure that you won't be making it multi-user, and you're not concerned about the possibility that you might be using it from two browsers at once, you can skip using sessions and use a known key for storing information in memcache.
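A minimal sketch of that known-key approach, assuming the App Engine Java memcache API; the key, value type, and expiration are placeholders, and anything stored this way can be evicted at any time:

```java
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class FetchedDataCache {

    private static final String KEY = "latest-fetch"; // single-user app: one well-known key

    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    // Called from the scheduled task after the URL Fetch completes.
    public void store(String fetchedPayload) {
        // Keep it for an hour; the cache may still evict it earlier.
        memcache.put(KEY, fetchedPayload, Expiration.byDeltaSeconds(3600));
    }

    // Called from the request-serving servlet; null means "no data yet, fetch again".
    public String load() {
        return (String) memcache.get(KEY);
    }
}
```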
If your reason for avoiding the datastore is concern over performance, then I strongly recommend testing that assumption. You may be pleasantly surprised.
You could use the HTTP session to maintain state between requests, but that will use the datastore itself (although you won't have to write any code to get this behaviour).
You might also consider using the Cache API (i.e. memcache); it's JSR 107, I think, which Google provides an implementation of. The cache is shared between instances, but it can be emptied at any time. If you're happy with that behaviour, this may be an option. Looking at your requirements, it may be the most feasible one if you don't want to write your own persistence code.
You could store data in a static field on your class or in an application-scoped object, but when your instance spins down or traffic switches to another instance, the data would be lost, as your classes would need to be loaded into the new instance.
Or you could serialize the state to the client and send it back in with each request.
The most robust option is persistence to the datastore - the JPA code is trivial. Perhaps you should reconsider?