process 1 table in hibernate with multithreads - java

I am working on an application using Spring + Hibernate.
I have to fetch a set of records from one table (those with status flag 0), process each of them (which generates data for other tables), and then set the status flag to 1.
The problem is that all of this is done by a single thread and is very slow. What I want is, say, 10 threads, each of which takes one record, processes it, saves it, and is done. I expect that to speed the process up about 10 times.
Please look at the pictures. Any advice on how to do this?
[Figure: current situation]
[Figure: desired situation]

Wrap the process method in a Runnable and execute it using a TaskExecutor.
You can tune TaskExecutor parameters such as the thread pool size using the Spring task namespace or annotations, whichever you prefer.
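A minimal sketch of that idea, assuming plain java.util.concurrent instead of Spring's TaskExecutor so that it stays self-contained. The STATUS map and processRecord are hypothetical stand-ins for the status column and the Hibernate work; neither comes from the question.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRecordProcessor {

    // Hypothetical stand-in for the status column: record id -> status flag.
    static final Map<Integer, Integer> STATUS = new ConcurrentHashMap<>();

    // Hypothetical per-record work. In the real application this would load the
    // entity, generate data for the other tables and set the flag to 1.
    static void processRecord(int id) {
        STATUS.put(id, 1);
    }

    // Submits one Runnable per record to a fixed-size pool and waits for completion.
    static void processAll(List<Integer> ids, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int id : ids) {
            pool.execute(() -> processRecord(id));
        }
        pool.shutdown(); // no new tasks; already submitted ones still run
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> ids = List.of(1, 2, 3, 4, 5);
        ids.forEach(id -> STATUS.put(id, 0));
        processAll(ids, 10);
        System.out.println(STATUS);
    }
}
```

Each record should be processed in its own Hibernate session and transaction, since sessions are not meant to be shared between threads.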


java, quartz and multiple tasks triggered at certain times saved in a database

I'm building a system where users can set a future date (down to the hour and minute) in a calendar. At that date a trigger calls a certain task, unique to every user.
Every user can set a different date. The system will have 10k+ users from the start, and a user can create more than one trigger.
So assuming I have 10k users and each user creates 3 triggers on average, that gives 30k triggers with 30k different dates.
All dates are saved in a database.
I'm new to Quartz. Can this be done in a more optimized way?
I was thinking about a job that runs every minute, fetches the tasks that are supposed to run in the next hour, and removes them from the database.
Do you have any better ideas? Has anyone used Quartz for a large number of triggers?
You have the schedule backed by the database. If I understand the idea, you want Quartz to load all the upcoming tasks and execute them in the future.
This is a problematic approach:
Synchronization issues: I assume that users can edit, remove and add tasks in the database. You would have to query the database periodically to refresh the state of the Quartz jobs, remove some jobs, edit others, and so on. This may not be trivial. The state of the program would be a long-lived cache that needs to be synchronised often.
Performance and scalability issues: Even if the proposed solution is OK for 30k tasks, it may not be OK for 70k or 700k tasks. Your approach is not easy to scale: adding a new machine would require an additional layer of synchronisation to decide which machine should actually execute which job (as all of them have all the tasks).
What I would propose:
Add a "stage" column to the Tasks table (new, queued, running, finished, failed).
Divide your solution into several components. (Initially they can run on a single machine, but it will be easy to scale.)
Task Finder: Executed periodically (once every few seconds). Scans the database for tasks that are "new" and due soon. Sends the tasks found to the Message Queue and marks them as "queued" in the db. Marking as "queued" has to be done carefully, as there can be multiple Task Finders. (In addition, it may look for tasks that were marked as "queued" or "running" more than N minutes ago and are neither "finished" nor "failed"; these probably need to be re-run.)
Message Queue: Connector between Task Finder and Task Executor.
Task Executor: Listens to the Message Queue and processes the tasks it receives. Marks each task as "running" initially and as "finished" or "failed" later on.
With this approach you can have:
multiple Task Executors on multiple machines
multiple Task Finders on multiple machines
no Single Point of Failure: even if one of the Task Finders or Task Executors fails, some tasks will be delayed, but they will be picked up and run afterwards.
This may not address all the scenarios but would be a good starting point.
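A rough sketch of how the Task Finder can claim a task safely when several finders run at once. Everything here is an in-memory stand-in chosen for illustration: the STAGES and DUE maps play the role of the Tasks table and a BlockingQueue plays the role of the Message Queue. The conditional replace(id, NEW, QUEUED) mimics a conditional update along the lines of "UPDATE ... SET stage = 'queued' WHERE id = ? AND stage = 'new'", which only one finder can win.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class TaskPipelineSketch {

    enum Stage { NEW, QUEUED, RUNNING, FINISHED, FAILED }

    // Assumed in-memory stand-ins for the Tasks table and the message broker.
    static final Map<Long, Stage> STAGES = new ConcurrentHashMap<>();
    static final Map<Long, Instant> DUE = new ConcurrentHashMap<>();
    static final BlockingQueue<Long> QUEUE = new LinkedBlockingQueue<>();

    // Task Finder: queue every task that is due and still NEW.
    // The conditional replace succeeds for exactly one caller per task.
    static int findAndQueue(Instant now) {
        int claimed = 0;
        for (Map.Entry<Long, Instant> e : DUE.entrySet()) {
            boolean due = !e.getValue().isAfter(now);
            if (due && STAGES.replace(e.getKey(), Stage.NEW, Stage.QUEUED)) {
                QUEUE.add(e.getKey());
                claimed++;
            }
        }
        return claimed;
    }

    // Task Executor: take one queued task, run it, record the outcome.
    static void executeNext(Runnable work) throws InterruptedException {
        Long id = QUEUE.take();
        STAGES.put(id, Stage.RUNNING);
        try {
            work.run();
            STAGES.put(id, Stage.FINISHED);
        } catch (RuntimeException ex) {
            STAGES.put(id, Stage.FAILED);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Instant now = Instant.now();
        STAGES.put(1L, Stage.NEW); DUE.put(1L, now.minusSeconds(5));   // already due
        STAGES.put(2L, Stage.NEW); DUE.put(2L, now.plusSeconds(3600)); // due in an hour
        System.out.println("claimed: " + findAndQueue(now));
        System.out.println("claimed on second scan: " + findAndQueue(now));
        executeNext(() -> System.out.println("running task"));
        System.out.println(STAGES);
    }
}
```

The second scan claims nothing because the due task is no longer "new", which is the property that makes multiple finders safe.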
I don't see why you need Quartz here at all. As far as I remember, Quartz is best used to schedule internal backend processes, not user-defined tasks obtained from a db.
Just process the trigger when it is created: save a row to your tasks table with start_date based on the trigger, and every second select all incomplete tasks with start_date < sysdate. If the job is repeating, calculate the next execution time and insert a new task row or update the previous one accordingly.
As Sam pointed out there are some nice topics addressing the same problem:
Quartz Performance
Quartz FAQ
In a system like the one described, handling this number of triggers should mostly not be a problem. But in my experience it is better to create something like a "JobChecker". If you let your users create their own triggers, it could really break Quartz in some cases. For example, if 5000 users create an event for the exact same time, Quartz will have a hard time handling them correctly. (This situation is unlikely to occur often, but it is possible, as your specification does not exclude it.) Quartz has difficulties only when a lot of triggers should fire at the same time.
My recommendation is to create one job that runs every hour/minute etc. and handles all the user-set events. This is similar to a cron job in bash. With this kind of processing your system will be pretty stable even if the number of "triggers" increases dramatically. Basically, your line of thought is correct if you strive for scalability.

Executorservice exception handling in java

I am using Java's ExecutorService and want to understand the design perspective.
If something goes wrong in one of the batches, what is the best approach to handle it?
I am creating a fixed thread pool like this:
ExecutorService pool = Executors.newFixedThreadPool(10);
I am also using invokeAll() to invoke all the callables, which returns Future objects.
Here is my scenario:
I have 1000 records coming from an XML file and I want to save them into the DB.
I created 10 batches, each containing 100 records.
The batches start processing (say batch1, batch2, batch3... batch10), and let's say one batch (batch7) hits an error while parsing a particular record from the XML and cannot save it into the DB.
So my question is how I can handle this situation ?
How I can get/store failed batch information (batch7 above) ?
I mean, if there is any error in any of batch should i stop all other batches ?
Or where i can store information for failed batch and how I can take it for further processing once error corrected ?
The handler that processes the records should have a variable that stores the batch number.
Ideally the handler should also have finite retry logic for a small set of database errors.
Once the retries are exhausted, human intervention is warranted, and the handler should exit by throwing an exception that carries the batch number. The executor should then normally call shutdown(), which stops accepting new tasks but lets already submitted batches finish. If your logic demands stopping the process immediately, call shutdownNow() instead. Ideally your design should be resilient to such failures and let the other batches continue their work even if one fails. Hope this helps.
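A small sketch of tracking the failed batch number with invokeAll(), under the assumption that one failing batch should not stop the others. saveBatch is a hypothetical placeholder that fails for batch 7 to simulate the bad record; retry logic is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchFailureTracking {

    // Hypothetical batch work: throws for batch 7 to simulate an unparsable record.
    static int saveBatch(int batchNo) {
        if (batchNo == 7) {
            throw new IllegalStateException("unparsable record in batch " + batchNo);
        }
        return 100; // number of records saved
    }

    // Runs all batches and returns the numbers of the batches that failed.
    static List<Integer> runBatches(int batchCount) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 1; i <= batchCount; i++) {
                final int batchNo = i;
                tasks.add(() -> saveBatch(batchNo));
            }
            // invokeAll blocks until every task is done and returns the futures
            // in the same order as the task list.
            List<Future<Integer>> futures = pool.invokeAll(tasks);
            List<Integer> failed = new ArrayList<>();
            for (int i = 0; i < futures.size(); i++) {
                try {
                    futures.get(i).get();
                } catch (ExecutionException e) {
                    failed.add(i + 1); // batch numbers start at 1
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failed batches: " + runBatches(10));
    }
}
```

The returned list is what you would persist (file or table) so that the failed batches can be re-run once the input is corrected.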
You should use CompletableFuture to do this
Use CompletableFuture.runAsync(..) to start a process asynchronously; it returns a future. On this future, you can use the thenAccept(..) or thenRun(..) methods to do something when the process completes.
There is also a method, exceptionally(..), to do something when an exception is thrown.
By default the work runs on a default executor (normally the common ForkJoinPool), but you can pass your own executor if necessary.
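A short sketch of that chain, assuming each batch is a Runnable and the outcome is reported as a plain string. runBatch and the message format are made up for illustration; thenApply is used in place of thenRun only so that the stage can carry a result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncBatchSketch {

    // Runs one batch on the given executor and maps success or failure to a message.
    static CompletableFuture<String> runBatch(int batchNo, Runnable work, Executor pool) {
        return CompletableFuture.runAsync(work, pool)
                .thenApply(v -> "batch " + batchNo + " ok")
                .exceptionally(ex -> {
                    // The failure may arrive wrapped in a CompletionException.
                    Throwable cause = ex.getCause() != null ? ex.getCause() : ex;
                    return "batch " + batchNo + " failed: " + cause.getMessage();
                });
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> good = runBatch(1, () -> { }, pool);
            CompletableFuture<String> bad = runBatch(7, () -> {
                throw new IllegalStateException("bad record");
            }, pool);
            System.out.println(good.join());
            System.out.println(bad.join());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because exceptionally(..) turns the failure into a normal value, join() does not throw here and the other batches are unaffected.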
So my question is how I can handle this situation ?
It all depends on your requirement.
How I can get/store failed batch information (batch7 above) ?
You can store it either in a file or database.
I mean, if there is any error in any of batch should i stop all other batches ?
This depends on your business use case. If you are required to stop batch processing after even a single batch failure, you have to stop the remaining batches. Otherwise you can continue with the next set of batches.
Or where i can store information for failed batch and how I can take it for further processing once error corrected ?
This also depends on your requirements and design. You may have to inform the source about the problematic XML file so that they can correct it and send it back to you. Once you receive the new copy, you have to push the new file for processing. This can be manual or automated, depending on your design.

java jdbc design pattern : handle many inserts

I would like to ask for some advice concerning my problem.
I have a batch job that does some computation (in a multithreaded environment) and does some inserts into a table.
I would like to do something like a batch insert: once I get a query, wait until I have, for instance, 1000 queries, and then execute them as a batch insert (instead of one by one).
I was wondering if there is any design pattern for this.
I have a solution in mind, but it's a bit complicated:
build a method that will receive the queries
add them to a list (the string and/or the statements)
do not execute until the list has 1000 items
The problem: how do I handle the end?
What I mean is, when do I execute the last 999 queries, since I'll never get to 1000?
What should I do?
I'm thinking of a thread that wakes up every 5 minutes and checks the number of items in the list. If it wakes up twice and the number is the same, it executes the existing queries.
Does anyone have a better idea?
Your database driver needs to support batch inserting. See this.
Have you established your system is choking on network traffic because there is too much communication between the service and the database? If not, I wouldn't worry about batching, until you are sure you need it.
You mention that in your plan you want to check every 5 minutes. That's an eternity. If you are going to get 1000 items in 5 minutes, you shouldn't need batching. That's ~ 3 a second.
Assuming you do want to batch, have a process wake up every 2 seconds and commit whatever is queued up. Don't wait five minutes. It might commit 0 rows, it might commit 10...who cares...With this approach, you don't need to worry that your arbitrary threshold hasn't been met.
I'm assuming that the inserts come in one at a time. If your incoming data comes in n at once, I would just commit once per incoming request, no matter how many inserts it contains. If your messages arrive through some sort of messaging system, it's asynchronous anyway, so you shouldn't need to worry about batching. Under high load, the incoming messages just wait until there is capacity to handle them.
Add a commit kind of method to that API that will be called to confirm all items have been added. Also, the optimum batch size is somewhere in the range 20-50. After that the potential gain is outweighed by the bookkeeping necessary for a growing number of statements. You don't mention it explicitly, but of course you must use the dedicated batch API in JDBC.
If you need to keep track of many writers, each in its own thread, then you'll also need a begin kind of method and you can count how many times it was called, compared to how many times commit was called. Something like reference-counting. When you reach zero, you know you can flush your statement buffer.
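A minimal sketch of such a buffer, assuming rows arrive as strings and the actual write is delegated to a caller-supplied sink (hypothetical; in a real application it would wrap PreparedStatement.addBatch() and executeBatch()). It flushes when the batch size is reached and also exposes flush() as the "commit kind of method" for the leftover rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class InsertBuffer {

    private final int batchSize;
    private final Consumer<List<String>> sink; // hypothetical writer for one batch
    private final List<String> pending = new ArrayList<>();

    public InsertBuffer(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Called by the worker threads; flushes as soon as a full batch is available.
    public synchronized void add(String row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Writes whatever is pending, even if the batch is not full. Safe to call
    // when nothing is pending, so it can be invoked at the end of the job or by a timer.
    public synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        InsertBuffer buffer = new InsertBuffer(2,
                batch -> System.out.println("writing " + batch.size() + " rows: " + batch));
        for (int i = 1; i <= 5; i++) {
            buffer.add("row" + i);
        }
        buffer.flush(); // handles the incomplete last batch
    }
}
```

For the time-based variant, a ScheduledExecutorService can call the same method, for example scheduler.scheduleAtFixedRate(buffer::flush, 2, 2, TimeUnit.SECONDS).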
I have faced this situation many times. As I understand your problem, you are building a batch of 1000 or more INSERT queries, but they all insert into the same table again and again.
In that case you can write the insert query like this:
INSERT INTO table1 VALUES('4','India'),('5','Odisha'),('6','Bhubaneswar')
It is executed only once, with multiple rows of values. So it is better to keep all the values in a collection (ArrayList, List, etc.) and finally build one query like the one above and execute it once.
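A small sketch of building such a multi-row statement with placeholders, so that the values can be bound through a PreparedStatement instead of being concatenated into the SQL. buildSql is a made-up helper; the table name is assumed to be a trusted constant, not user input.

```java
import java.util.Collections;

public class MultiRowInsert {

    // Builds "INSERT INTO <table> VALUES (?, ?), (?, ?), ..." for the given shape.
    static String buildSql(String table, int columns, int rows) {
        if (columns < 1 || rows < 1) {
            throw new IllegalArgumentException("columns and rows must be positive");
        }
        String row = "(" + String.join(", ", Collections.nCopies(columns, "?")) + ")";
        return "INSERT INTO " + table + " VALUES "
                + String.join(", ", Collections.nCopies(rows, row));
    }

    public static void main(String[] args) {
        // Three rows of two columns each, matching the example above.
        System.out.println(buildSql("table1", 2, 3));
    }
}
```

The values are then bound in row order with setString(1, ...), setString(2, ...) and so on before a single executeUpdate().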
You can also use the JDBC transaction methods (commit(), rollback(), etc.).
Hope it helps.
All the best.

Records in DB that should trigger timed Events = How to implement efficiently?

This is a very interesting problem: I have a large number of records in a database, each with an associated "trigger time" (a date in the future). When this time is reached, the record should fire a specific action. Multiple threads will update this time for a record, so it is not fixed; it can be changed by different threads...
I can of course query over and over for records that have "timed out". In the end I would have to write a loop that does nothing but query (via SQL) whether a record has timed out. But surely such a polling loop against a DB is no good!?
Threads: Another approach is to keep all of them in memory, for example with the "Executor Framework" or by using Quartz, as threads. This would be logical from a Java perspective and most likely very well timed. But then I would have thousands of threads...
What better approaches are there to solve this problem? Any suggestions/ideas are welcome, so I can do further research on them.
Thanks very much!!
Depending on the database, some have 'notifications' (I am thinking of Postgres here). It allows you to start up a process and let other things in PG notify you when they happen.
I.e. in this scenario, when a record is changed with a time-out you could have a trigger notifying your timing process (that sits on a totally different DB connection) and it could then insert #at records, or cron entries, or whatever it is you need to do on your side to manage and execute the actions.
In the newest releases of PG, you can send data along with the notify , i.e. you can send the PK value of the record that changed.
Clients --> [Postgres] -----------<> Record Monitor client ---- > process records()
| |
records_table |
| |
\_ timing_Trigger() --/
on_update/insert/delete notify RecordMonitorClientOfChange.
A super crappy diagram of what I would do.

How to share data between scheduled jobs

I am writing a scheduler which grabs XML data and inserts it into a MySQL DB. Simple, isn't it? But here is the logic I am trying to work out. NOTE: I want to run this in a Windows environment; in future it might be configured for other platforms.
The scheduler should run every 5 minutes.
The script should fetch the conditions/configuration for what to parse and which data fields to collect from the XML; these conditions are stored in a MySQL table.
This table also defines a delay after which the script should check the XML fields for differences.
So the script does both: it runs every 5 minutes to collect the XML, and it checks for differences against the MySQL table after every configured delay.
The script then reads and parses the XML data fields and collects only those data fields that are defined in the MySQL table above.
The collected data is inserted into the MySQL DB only when there is a change in state, and this state is defined in the MySQL table.
Because of the delay, I am not sure how I should store the configuration in the script so that it is shared between the scheduled runs.
Is there any way to use a static variable in the code to store this data, so that it is shared between different jobs or different schedules?
Basically, how should I implement this? What is a better approach in terms of performance?
Thanks for your time.
One suggestion is to run the Java code as a Windows service (?), so that some common data could be shared between different jobs. Does that make sense?
Java Service Wrapper
Concurrency is the answer. Try creating a thread pool or an ExecutorService and pausing certain threads for 5 minutes; you could even use synchronization if several threads will be working with the same resource.
Remember that using more threads does not always finish the job faster, e.g. 3 threads: 2 min,
5 threads: 6 min.
*Read a tutorial about threads
*Create a few simple threads that wait for 5 minutes
*Read some tutorials about thread pools, synchronization and sharing resources (the script part)
*Test to find the optimal setup
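One possible shape for this, as a sketch only: a single long-running JVM with a ScheduledExecutorService, where state shared between runs lives in a static concurrent map. runOnce, the field names and the INSERTED list (a stand-in for the MySQL inserts) are all hypothetical. The demo uses a 100 ms period so that it ends quickly; the real interval would be 5 minutes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedStateJob {

    // State shared by every run of the job; it lives as long as the JVM does.
    static final Map<String, String> LAST_SEEN = new ConcurrentHashMap<>();

    // Hypothetical stand-in for rows inserted into MySQL.
    static final List<String> INSERTED = Collections.synchronizedList(new ArrayList<>());

    // One run of the job: record the field only if its value changed since the last run.
    static boolean runOnce(String field, String value) {
        String previous = LAST_SEEN.put(field, value);
        boolean changed = !value.equals(previous);
        if (changed) {
            INSERTED.add(field + "=" + value);
        }
        return changed;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Demo period of 100 ms; use 5 and TimeUnit.MINUTES for the real schedule.
        scheduler.scheduleAtFixedRate(() -> runOnce("state", "UP"), 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdownNow();
        System.out.println(INSERTED);
    }
}
```

This only works while the process stays up, which is why running it as a service rather than as a script started fresh every 5 minutes matters for the shared state.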
