I have quite an update-intensive application and was wondering whether it is a good idea to have a daemon thread that holds a Hibernate session, takes update instructions, and groups those updates into a batch (which I guess is more efficient).
Thank you in advance.
If your use cases trigger many updates, but those updates do not have to be executed the moment they are triggered (that is, the data they add is not needed on the spot), you can store them and execute them later, for example at night or when the application is under less load; in fact, this is advisable. You don't necessarily need a daemon thread (or any thread of your own) for that: store your update requests somewhere, and have a cron job or something similar run at fixed intervals to execute them. You can use Quartz Scheduler for this; it deals with all the underlying threads and scheduling.
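A minimal sketch of the "store now, execute later" idea, using only the JDK. The class name `UpdateBuffer` and the `sink` callback are hypothetical; the sink stands in for whatever actually applies a batch (for example, one Hibernate transaction per batch). The queue here is in memory, so pending updates are lost if the process dies; a table or a message queue would be needed if that matters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

/** Collects update requests from any thread and hands them to a sink in batches. */
final class UpdateBuffer<T> {
    private final Queue<T> pending = new ConcurrentLinkedQueue<>();
    private final int maxBatchSize;
    private final Consumer<List<T>> sink; // hypothetical: applies one batch to the database

    UpdateBuffer(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Called by request threads; only enqueues and never touches the database. */
    void submit(T update) {
        pending.add(update);
    }

    /** Called by the scheduler; drains the queue in batches and returns how many updates were flushed. */
    int flush() {
        int total = 0;
        List<T> batch = new ArrayList<>(maxBatchSize);
        T next;
        while ((next = pending.poll()) != null) {
            batch.add(next);
            if (batch.size() == maxBatchSize) {
                sink.accept(batch);
                total += batch.size();
                batch = new ArrayList<>(maxBatchSize);
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch);
            total += batch.size();
        }
        return total;
    }
}
```

A scheduler then only has to call `flush()` periodically, for instance from a Quartz job or via `ScheduledExecutorService.scheduleWithFixedDelay(buffer::flush, ...)`.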
Regarding batching multiple queries versus executing them individually: Hibernate normally takes care of that. If you issue multiple update requests in the same transaction, Hibernate will group them and execute them at the latest possible time, usually when the transaction ends (whether they are actually sent as a JDBC batch depends on the `hibernate.jdbc.batch_size` setting), so this shouldn't be a concern for you. What MIGHT become a concern, depending on your number of update requests, is Hibernate itself. At some point you might want to consider using plain JDBC and batch-executing those updates manually to gain performance.
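For reference, manual batching with plain JDBC could look roughly like the sketch below. It uses only standard `java.sql` calls (`addBatch`, `executeBatch`, `commit`, `rollback`); the class name `JdbcBatchUpdater`, the row representation as `Object[]`, and the batch size are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class JdbcBatchUpdater {
    /**
     * Runs the same parameterized statement once per row, sending rows to the
     * driver in groups of batchSize, all inside a single transaction.
     * Returns the number of executeBatch() calls that were made.
     */
    static int updateInBatches(Connection conn, String sql, List<Object[]> rows, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        int roundTrips = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int inBatch = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
                if (++inBatch == batchSize) {
                    ps.executeBatch();
                    roundTrips++;
                    inBatch = 0;
                }
            }
            if (inBatch > 0) { // send the final, partially filled batch
                ps.executeBatch();
                roundTrips++;
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // undo everything if any batch fails
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return roundTrips;
    }
}
```

How much this gains over Hibernate depends on the driver and the database, so it is worth measuring before switching.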
Context of My question:
I use a proprietary database (the target database) and I cannot reveal its name (you probably would not know it even if I did).
Here, I usually need to update records using Java. (The number of records varies from 20000 to 40000.)
Each update transaction takes one or two seconds on this DB, so the total execution time runs into hours. No batch execution functions are available in this database's API. So I am thinking of using Java's multi-threading: instead of processing all the records in a single process, I want to create a thread for every 100 records. We know that Java can run these threads in parallel.
But I want to know how the DB processes these threads when they share the same connection. I could find out by running a trial program and comparing time intervals, but I feel that may be misleading to some extent. I know that you don't have much information about the database. You can just answer this question assuming the DB is MS SQL/MySQL.
Please suggest any other Java feature I could use to make this program run faster, if not multi-threading.
It is not recommended to use a single connection with multiple threads; you can read about the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure all threads start and finish successfully within one transaction. If one of them fails, you have to make sure to roll back the changes. So, first get the record count, split it into cursor ranges, and for each range start a thread that performs the updates on that range. One thing to watch out for: do not close the connection after an individual partition has been executed; close it only when the whole transaction is complete and has been committed.
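The partition-and-dispatch step can be sketched with a bounded thread pool from the JDK. This is an illustration under assumptions, not a statement about how the proprietary database behaves: `ChunkedParallelRunner` and the `worker` callback are hypothetical names, and the worker is where each chunk's updates would be issued (ideally on its own connection, given the caveat above). A fixed pool is used instead of one thread per 100 records so the number of concurrent updates stays under control.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ChunkedParallelRunner {
    /** Splits items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            chunks.add(items.subList(from, Math.min(from + chunkSize, items.size())));
        }
        return chunks;
    }

    /** Runs worker on every chunk in a bounded pool; returns the indexes of the chunks that threw. */
    static <T> List<Integer> run(List<T> items, int chunkSize, int threads, Consumer<List<T>> worker)
            throws InterruptedException {
        List<List<T>> chunks = partition(items, chunkSize);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> chunk : chunks) {
                futures.add(pool.submit(() -> worker.accept(chunk)));
            }
            List<Integer> failed = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    futures.get(i).get(); // wait for chunk i to finish
                } catch (ExecutionException e) {
                    failed.add(i); // the worker threw; the caller decides whether to retry or roll back
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

If the returned list is non-empty, the caller knows exactly which ranges need a retry or a rollback.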
If you have an option to use Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.
Our Spring web application uses Spring Batch with Quartz to carry out complex jobs. Most of these jobs run in the scope of a transaction, because if one part of the complex system fails we want any previous database work to be rolled back. We would then investigate the problem, deploy a fix, and restart the servers.
It's getting to be an issue because some of these jobs do a HUGE amount of processing and can take a long time to run. As execution time starts to surpass the 1 hour mark, we find ourselves unable to deploy fixes to production for other problems because we don't want to interrupt a vital job.
I have been reading up on the Reactor implementation as a solution to our problems. We can do a small bit of processing, publish an event, and have other systems do the appropriate action as needed. Sweet!
The only question I have is, what is the best way to handle failure? If I publish an event and a Consumer fails to conduct some critical functionality, will it restart at a later time?
What if an event is published, and before all the appropriate consumers that listen for it can handle it appropriately, the server shuts down for a deployment?
I only started using Reactor recently, so I may have some misconceptions about it; however, I'll try to answer.
Reactor is a library which helps you to develop non-blocking code with back-pressure support which may help you to scale your application without consuming a lot of resources.
Reactor's fluent style can easily replace Spring Batch. However, Reactor by itself does not provide any way to handle transactions, nor does Spring, and with the current JDBC implementation database access will always be blocking, since there is no driver-level support for non-blocking processing. There are discussions about how to handle transactions, but as far as I know no final decision has been made on this matter.
You can always use transactions, but remember that you will not get non-blocking processing, since you need to update/delete/insert/commit in the same thread, or manually propagate the transactional context to the new thread and block the main thread.
So I believe Reactor won't help you solve your performance issues, and a different approach may be needed.
My recommendation is:
- Use parallel processing in Spring Batch
- Find the optimal chunk number
- Review your indexes (consider not only creating indexes but also dropping ones you do not need)
- Review your queries
- Avoid unneeded transformations
- And even more important: profile it! The bottleneck may be somewhere you would never have guessed
Spring Batch allows you to break large transactions up into multiple smaller transactions when you use chunk-oriented processing. If a chunk fails, its transaction rolls back, but the transactions of all previous chunks will already have been committed. By default, when you restart the job, it starts again from where it failed, so if it had already processed 99 chunks successfully and the 100th chunk failed, restarting the job starts from the 100th chunk and continues.
If you have a long-running job and want to deploy a new version, you can stop the job, and it will stop after processing the current chunk. You can then restart the job from where it was stopped. It helps to have a GUI to view, launch, stop and restart your jobs. You can use Spring Batch Admin or Spring Cloud Data Flow for that if you want an out-of-the-box GUI.
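To make the chunk-and-restart behaviour concrete, here is a plain-Java simulation of the idea. It is not the Spring Batch API: `CheckpointedJob` is a hypothetical class, the in-memory `nextIndex` field stands in for the job metadata a real batch framework would persist, and each call to the chunk processor stands in for one chunk transaction.

```java
import java.util.List;
import java.util.function.Consumer;

/** Simulates chunk-oriented processing with restart from the last committed chunk. */
final class CheckpointedJob<T> {
    private final List<T> items;
    private final int chunkSize;
    private int nextIndex = 0; // hypothetical checkpoint; a real job would persist this

    CheckpointedJob(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.items = items;
        this.chunkSize = chunkSize;
    }

    /** Processes the remaining chunks; returns true when finished, false if a chunk failed. */
    boolean run(Consumer<List<T>> chunkProcessor) {
        while (nextIndex < items.size()) {
            int end = Math.min(nextIndex + chunkSize, items.size());
            try {
                chunkProcessor.accept(items.subList(nextIndex, end)); // one "transaction"
            } catch (RuntimeException e) {
                return false; // only this chunk is lost; earlier chunks stay committed
            }
            nextIndex = end; // the checkpoint advances only after the chunk succeeded
        }
        return true;
    }

    /** Number of items whose chunks have completed successfully. */
    int committedItems() {
        return nextIndex;
    }
}
```

Calling `run` again after a failure resumes at the failed chunk instead of starting over, which is the property that makes stopping a job for a deployment tolerable.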
In my database, I have many records in a certain table that need to be processed from time to time by my Java Spring app.
There is a boolean flag on each row of that table saying whether the record is currently being processed.
What I'm looking at is deploying my Java Spring app multiple times on different servers, all accessing the same shared DB: the same app duplicated behind a load balancer, etc.
But only one java app instance at a time can process a given DB record of that particular table.
What are the different approaches to enforce that constraint?
I can think of some unique queue that would dispatch those processing tasks to different java running instances making sure that the same DB record is not processed simultaneously by two different java instances. But that sounds quite complicated for what it is. Maybe there is something simpler? Anything else? Thanks in advance.
You can use a locking strategy to enforce exclusive access to particular records in your table. There are two approaches that can meet this requirement: optimistic locking and pessimistic locking. Take a look at the Hibernate docs.
Additionally, there's another issue you should think about. With the current approach, if a server crashes while processing a record and never completes it, the record stays in the "incomplete" state and will not be processed by any other instance. One possible solution that comes to mind is to store the 'node id' of the server that took responsibility for processing, instead of a state flag.
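One simple way to combine both ideas is a conditional UPDATE that writes the node id only if nobody owns the record yet. This is a sketch of that technique in plain JDBC rather than Hibernate's optimistic or pessimistic locking API; the table name `work_item` and the columns `id` and `owner_node` are hypothetical. It assumes the database applies a single-row UPDATE atomically and that the statement is committed (for example with auto-commit on).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class RecordClaimer {
    // Hypothetical table and column names, for illustration only.
    private static final String CLAIM_SQL =
            "UPDATE work_item SET owner_node = ? WHERE id = ? AND owner_node IS NULL";

    /** Returns true if this node now owns the record, false if another node claimed it first. */
    static boolean tryClaim(Connection conn, long recordId, String nodeId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(CLAIM_SQL)) {
            ps.setString(1, nodeId);
            ps.setLong(2, recordId);
            // Exactly one updated row means the WHERE clause still matched, i.e. the record was unowned.
            return ps.executeUpdate() == 1;
        }
    }
}
```

Because the owner is recorded, records claimed by a node that later crashed can be found and released, which a plain boolean flag does not allow.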
I have an architectural question about how to handle big tasks both transactional and scalable in Java/Java EE.
The general challenge
I have a web application (Tomcat right now, but that should not limit the solution space, so just take this to illustrate what I'd like to achieve). This web application is distributed over several (virtual and physical) nodes, connected to a central DBMS (MySQL in this case, but again, this should not limit the solution...) and able to handle some 1000s of users, serving pages, doing stuff, just as you'd expect from your average web-based information system.
Now, there are some tasks which affect a larger portion of data and the system should be optimized to carry out these tasks reasonably fast. (Faster than processing everything sequentially, that is). So I'd make the task parallel and distribute it over several (or all) nodes:
(Note: the data portions which are processed are independent, so there are no database or locking conflicts here).
The problem is, I'd like the (whole) task to be transactional. So if one of the parallel subtasks fails, I'd like to have all other tasks rolled back as a result. Otherwise the system would be in a potentially inconsistent state from a domain perspective.
Current implementation
As I said, the current implementation uses Tomcat and MySQL. The nodes use JMS to communicate: there is a JMS server to which a dispatcher sends a message for each subtask, and executors take tasks from the message queue, execute them, and post the results to a result queue from which the dispatcher collects them. The dispatcher blocks and waits for all results to come in, and if everything is fine, it terminates with an OK status.
The problem here is that all the executors have their own local transaction context, so the picture would look like this:
If for some reason one of the subtasks fails, the local transaction is rolled back and the dispatcher gets an error result. (There is some failsafe mechanism here, which tries to repeat the failed transaction, but let's assume for some reason, the one task cannot be completed).
The problem is that the system is now in a state where all transactions but one are already committed and completed. And because I cannot get the one final transaction to finish successfully, I cannot get out of this state.
Possible solutions
These are the thoughts which I have followed so far:
I could somehow implement a domain-specific rollback mechanism myself. Because the distributor knows which tasks have been carried out, it could revert the effects explicitly (e.g. storing old values somewhere and revert already committed values back to the previous values). Of course, in this case, I must guarantee that no other process changes something in between, so I'd also have to set the system to a read-only state, as long as the big operation is running.
More or less, I'd need to simulate a transaction in business logic ...
I could choose not to parallelize and do everything on a single node in one big transaction (but as stated at the beginning, I need to speed up processing, so this is not an option...)
I have tried to find out about XATransactions or distributed transactions in general, but this seems to be an advanced Java EE feature, which is not implemented in all Java EE servers, and which would not really solve that basic problem, because there does not seem to be a way to transfer a transaction context over to a remote node in an asynchronous call. (e.g. section 4.5.3 of EJB Specification 3.1: "Client transaction context does not propagate with an asynchronous method invocation. From the Bean Developer’s view, there is never a transaction context flowing in from the client.")
The Question
Am I overlooking something? Is it not possible to distribute a task asynchronously over several nodes and at the same time have a (shared) transactional state which can be rolled back as a whole?
Thanks for any pointers, hints, propositions ...
If you want to distribute your application as described, JTA is your friend in Java EE context. Since it's part of the Java EE spec, you should be able to use it in any compliant container. As with all implementations of the spec, there are differences in the details or configuration, as for example with JPA, but in real life it's very uncommon to change application servers very often.
But without knowing the details and complexity of your problem, my advice is to rethink if you really need to share the task execution for one use case, or if it's not possible and better to have at least everything belonging to that one use case within one node, even though you might need several nodes for the overall application. In case you really have to use several nodes to fulfill your requirements, then I'd go for distributed tasks which do not write directly to the database, but give back results and then commit/rollback them in the one component which initiated the tasks.
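The last suggestion, subtasks that only compute results while the initiating component does all the writing, can be sketched as follows. This is a single-JVM illustration using `CompletableFuture`; in the distributed setup the results would arrive over a messaging layer instead. `ResultCollectingCoordinator` and `transactionalWriter` are hypothetical names, and the writer stands in for code that applies all results in one local transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Consumer;
import java.util.function.Supplier;

final class ResultCollectingCoordinator {
    /**
     * Runs all subtasks in parallel. Subtasks only compute results; nothing is
     * written until every subtask has succeeded, and then one writer applies them all.
     * Returns true if the results were handed to the writer, false if any subtask failed.
     */
    static <R> boolean runAll(List<Supplier<R>> subtasks, Consumer<List<R>> transactionalWriter) {
        List<CompletableFuture<R>> futures = new ArrayList<>();
        for (Supplier<R> task : subtasks) {
            futures.add(CompletableFuture.supplyAsync(task));
        }
        List<R> results = new ArrayList<>();
        try {
            for (CompletableFuture<R> future : futures) {
                results.add(future.join()); // rethrows a subtask failure as CompletionException
            }
        } catch (CompletionException e) {
            return false; // nothing has been written, so there is nothing to roll back
        }
        transactionalWriter.accept(results); // hypothetical: one local transaction for all results
        return true;
    }
}
```

The trade-off is that the write phase is sequential again, so this only helps when computing the results is the expensive part.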
And don't forget to measure first, before over-engineering the architecture. Try to keep it simple at first, assuming that one node can handle it, and then write a stress test that tries to break your system, to learn the maximum load it can handle with the given architecture.
Here is my requirement:
A date is inserted into a DB table with each record. Two weeks
before that particular date, a separate record should be entered into a
different table.
My initial solution was to put up a SQL schedule job, but my client insisted on it being handled through java.
What is the best approach for this?
What are the pros and cons of using SQL schedule job and Java scheduling for this task?
Ask yourself the question: to what domain does this piece of work belong? If it's required for data integrity, then it's obviously the DBMS' problem and would probably best be handled there. If it's part of the business domain rather than the data, or might require information or processing that's not available or natural to the DBMS, it's probably best made external.
I'd say, use the best tool for the job. Having stuff handled by the database using whatever features it offers is often nice. For example, a log table that keeps "snapshots" of status updates of records in another table is something I typically like to have a trigger for, taking that responsibility out of my app's hands.
But that's something that's available in practically any DBMS. There's the possibility that other databases won't offer the job scheduling capacities you require. If it's conceivable that some day you'll be switching to a different DBMS, you'll then be forced to do it in Java anyway. That's the advantage of the Java approach: you've got the functionality independently of the database. If you're using pure JDBC with standard SQL queries, you've got a fully portable solution.
Both approaches seem valid. Consider what induces the least work and worries. If it's done in Java you'll need to make sure that process is running or scheduled. That's some external dependency. If it's in the database, you'll be sure the job is done as long as the DB is up.
Well, first off, if you want to do it in Java, you can use the Timer for a simple basic repetitive job, or Quartz for more advanced stuff.
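Whichever scheduler is used, the date logic itself is small. A minimal sketch with `java.time`, assuming the target dates have already been loaded into a map keyed by record id (the class name `ReminderPlanner` and that map are hypothetical):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReminderPlanner {
    /** The day on which the follow-up record should be created: two weeks before the target date. */
    static LocalDate dueDate(LocalDate targetDate) {
        return targetDate.minusWeeks(2);
    }

    /** Ids whose follow-up record is due on or before today, in the map's iteration order. */
    static List<Long> dueIds(Map<Long, LocalDate> targetDatesById, LocalDate today) {
        List<Long> due = new ArrayList<>();
        for (Map.Entry<Long, LocalDate> entry : targetDatesById.entrySet()) {
            if (!dueDate(entry.getValue()).isAfter(today)) {
                due.add(entry.getKey());
            }
        }
        return due;
    }
}
```

A daily job could call `dueIds` and insert the follow-up records. Checking "on or before today" rather than "exactly today" means a missed run is caught up on the next one, but the job then has to skip records that were already handled.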
Personally, I also think it would be better to have the same entity (application) deal with all related database actions. In other words, if your Java app is reading from and writing to the DB, it should, for consistency, also handle the scheduled reads and writes. As a plus, this makes it easier to synchronize your actions: if you want to check that a scheduled job is running, has started, or has finished, you can do that much more easily when everything is done in Java than when a different process (like the SQL scheduler) is doing it.