I am supposed to design a component that achieves the following tasks using multithreading in Java, as the files are huge and numerous and the work has to happen in a very short window:
Read multiple CSV/XML files and save all the data in a database
Read the database and write the data to separate CSV and XML files according to the transaction types. (Each file may contain different types of records, like a file header, batch header, batch footer, file footer, different transactions, and a checksum record.)
I am very new to multithreading and am doing some research on Spring Batch in order to use it for the above tasks.
Please let me know which you would suggest: traditional multithreading in Java, or Spring Batch. Both the input sources and the output sources are multiple here.
I would recommend going with a framework rather than writing the whole threading part yourself. I've quite successfully used Spring's task and scheduling support for scheduled tasks that involved reading data from the DB, doing some processing, sending emails, and writing data back to the database.
Spring Batch is ideal for implementing your requirement. First of all, you can use the built-in readers and writers to simplify your implementation; there is very good support for parsing CSV files, XML files, reading from a database via JDBC, etc. You also get the benefit of features like retrying in case of failure, skipping invalid input, and restarting the whole job if something fails in between: the framework tracks the status and restarts from where it left off. Implementing all this yourself is very complex, and doing it well requires a lot of effort.
Once you implement your batch jobs with Spring Batch, it gives you simple ways to parallelize them. A single step can be run in multiple threads; it is mostly a configuration change, as in the sketch below. If you have multiple steps to be performed, you can configure that as well. There is also support for distributing the processing over multiple machines if required. Most of the work needed to achieve parallelism is done by Spring Batch.
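For illustration, here is a minimal sketch of such a multi-threaded step in Java config (Spring Batch 4 style; `TransactionRecord` and the bean names are placeholders I made up, not anything from your project):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class BatchConfig {

    // TransactionRecord is a hypothetical type for one parsed CSV/XML record.
    @Bean
    public Step importStep(StepBuilderFactory steps,
                           ItemReader<TransactionRecord> reader,
                           ItemWriter<TransactionRecord> writer) {
        return steps.get("importStep")
                // commit every 100 items
                .<TransactionRecord, TransactionRecord>chunk(100)
                // in a multi-threaded step the reader must be thread-safe
                // (e.g. JdbcPagingItemReader); FlatFileItemReader needs synchronizing
                .reader(reader)
                .writer(writer)
                // this one line turns the step multi-threaded
                .taskExecutor(new SimpleAsyncTaskExecutor())
                // cap the number of concurrent threads
                .throttleLimit(4)
                .build();
    }
}
```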
I would strongly suggest that you prototype a couple of your most complex scenarios with Spring Batch. If that works out, you can go ahead with Spring Batch. Implementing it on your own, especially when you are new to multithreading, is a sure recipe for disaster.
Context of my question:
I use a proprietary database (the target database) and I cannot reveal its name (you may not know it even if I did).
Here, I usually need to update records using Java. (The number of records varies from 20,000 to 40,000.)
Each update transaction takes one or two seconds on this DB, so you see that the execution time would run into hours. No batch-execution functions are available in this database's API. Because of this, I am thinking of using Java's multithreading: instead of processing all the records in a single thread, I want to create a thread for every 100 records. We know that Java can run these threads in parallel.
But I want to know how the DB processes these threads sharing the same connection. I could find this out by running a trial program and comparing time intervals, but I feel that may be deceiving to some extent. I know that you don't have much information about the database; you can just answer this question assuming the DB is MS SQL Server/MySQL.
Please also suggest any other Java feature I can utilize to make this program execute faster, if not multithreading.
It is not recommended to use a single connection with multiple threads; you can read about the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure the threads start and stop successfully within a transaction; if one of them fails, you have to make sure the changes are rolled back. So: first get the record count, build cursor ranges, and for each range start a thread that executes the updates on that range. One thing to look out for is not to close the connection after executing each partition individually, but to close it only when the whole transaction is complete and committed to the DB.
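Here is a minimal sketch of that range-splitting idea, but following the recommendation above to give each worker thread its own connection, so each range commits or rolls back independently. The URL, table, and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedUpdater {

    private static final int RANGE_SIZE = 100;  // records per task, as in the question
    private static final int POOL_SIZE = 8;     // tune to what the DB can handle

    public static void updateAll(String url, String user, String pw,
                                 int totalRecords) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int start = 0; start < totalRecords; start += RANGE_SIZE) {
            final int from = start, to = Math.min(start + RANGE_SIZE, totalRecords);
            tasks.add(() -> {
                // One connection per worker: sharing a single connection across
                // threads serializes the statements anyway and risks driver bugs.
                try (Connection con = DriverManager.getConnection(url, user, pw);
                     PreparedStatement ps = con.prepareStatement(
                             "UPDATE records SET status = ? WHERE id >= ? AND id < ?")) {
                    con.setAutoCommit(false);
                    ps.setString(1, "PROCESSED");
                    ps.setInt(2, from);
                    ps.setInt(3, to);
                    ps.executeUpdate();
                    con.commit(); // commit (or roll back) per range
                }
                return null;
            });
        }
        for (Future<Void> f : pool.invokeAll(tasks)) {
            f.get(); // propagate any per-range failure
        }
        pool.shutdown();
    }
}
```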
If you have an option to use Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.
I have to load txt files into Oracle tables. Currently the process is done using bash scripting, SQL*Loader, and command-line tools for validation.
I'm trying to find more robust alternatives. The two options I came up with are Luigi (a Python framework) and Spring Batch.
I made a little POC using Spring Batch, but I believe it has a lot of boilerplate code and might be overkill. I also prefer Python over Java. The good thing about Batch is the job-tracking schema that comes out of the box with the framework.
Files contain from 200k to 1M records. No transformations are performed, only data-type and length validations. The first steps of the job consist of checking the header, trailer, and some dates, querying a parameters table, and truncating the staging table.
Could you give me some pros and cons of each framework for this use case?
I would argue they are not equivalent technologies. Luigi is more of a workflow/process management framework that can help organize and orchestrate many different batch jobs.
The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else. https://luigi.readthedocs.io/en/stable/
Spring Batch gives you a reusable framework for structuring a batch job. It gives you a lot of things out of the box, like being able to read input from text files and write output to databases.
A lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management https://spring.io/projects/spring-batch
You could theoretically run Spring Batch jobs with Luigi.
Based on the brief description of your use case, it sounds like the bread and butter of what inspired Spring Batch in the first place. In fact, their 15-minute demo application covers the use case of reading from a file and loading records into a JDBC database: https://spring.io/guides/gs/batch-processing/.
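For a sense of what that guide's demo boils down to, here is a minimal sketch of its two core beans (the `Person` type, CSV layout, and SQL mirror the guide's sample data, not your schema):

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.ClassPathResource;

public class FileToDbConfig {

    @Bean
    public FlatFileItemReader<Person> reader() {
        BeanWrapperFieldSetMapper<Person> mapper = new BeanWrapperFieldSetMapper<>();
        mapper.setTargetType(Person.class); // bind CSV columns onto a Person bean
        return new FlatFileItemReaderBuilder<Person>()
                .name("personItemReader")
                .resource(new ClassPathResource("sample-data.csv"))
                .delimited()
                .names(new String[] {"firstName", "lastName"}) // CSV column order
                .fieldSetMapper(mapper)
                .build();
    }

    @Bean
    public JdbcBatchItemWriter<Person> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Person>()
                .sql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)")
                .dataSource(dataSource)
                .beanMapped() // fill named parameters from Person's getters
                .build();
    }
}
```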
I want to develop a program that reads data from a database and writes it to a file.
For better performance, I want to use multithreading.
The solution I plan to implement is based on these assumptions:
it is not necessary to use multiple threads to read from the database, because the concurrency would have to be managed by the DBMS (and similarly for writing to the file), given that each element read from the database is deleted in the same transaction.
use the producer-consumer model: one thread to read the data (the main program) and another thread to write the data to the file.
For the implementation I will use the Executor framework: a thread pool (size = 1) to represent the consumer thread.
Do these assumptions make for a good solution?
Does this problem require a multithreaded solution?
it is not necessary to use multiple threads to read from the database, because the concurrency would have to be managed by the DBMS
Ok. So you want one thread that is reading from the database.
Do these assumptions make for a good solution? Does this problem require a multithreaded solution?
Your solution will work, but as others have mentioned, there are questions about the performance improvement (if any). Threaded programs gain speed because they can make use of the multiple processors (or cores) in your computer. In your case, if the threads are blocked by the database or by the file system, the performance improvement may be minimal, if there is any at all. If you were doing a lot of processing of the data, then having multiple threads handle the task would work well.
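For what it's worth, here is a minimal sketch of the producer-consumer plan from the question, using a `BlockingQueue` and the size-1 executor as the consumer. The JDBC read-and-delete part is stubbed out with a loop, and the file name and row format are placeholders:

```java
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DbToFilePipeline {

    private static final String POISON = "__END__"; // sentinel telling the consumer to stop

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1_000);
        ExecutorService consumer = Executors.newSingleThreadExecutor(); // the size-1 pool

        // Consumer thread: takes rows off the queue and writes them to the file.
        consumer.submit(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("out.txt"))) {
                String line;
                while (!(line = queue.take()).equals(POISON)) {
                    out.write(line);
                    out.newLine();
                }
            }
            return null;
        });

        // Producer (main thread): replace this loop with the JDBC
        // read-and-delete transaction described in the question.
        for (int i = 0; i < 10; i++) {
            queue.put("row-" + i);
        }
        queue.put(POISON); // signal end of data

        consumer.shutdown();
        consumer.awaitTermination(1, TimeUnit.MINUTES); // let the writer drain the queue
    }
}
```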
This is more of a comment:
For your first assumption: you should post the DB part on https://dba.stackexchange.com/ .
A simple search returned:
https://dba.stackexchange.com/questions/2918/about-single-threaded-versus-multithreaded-databases-performance - so you need to check whether your read action is complex enough and whether multithreading even serves your need for the DB connection.
Also, your program seems to do a sequential read and write. I don't think you even need multithreading unless you want multiple writes to the same file at the same time.
You should have a look at Spring Batch, http://projects.spring.io/spring-batch/, which relates to the JSR 352 spec.
This framework comes with pretty good patterns for managing ETL-related operations, including multi-threaded processing, data partitioning, etc. (a sketch of a partitioner follows below).
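To give a flavor of the data-partitioning support, here is a minimal sketch of a `Partitioner` that slices an id range into pieces for parallel worker steps (Spring Batch 4 package names; the id bounds are hard-coded placeholders, where real code would query min/max from the table):

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long min = 1, max = 1_000_000;                  // placeholder bounds
        long sliceSize = (max - min + 1) / gridSize;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            // each worker step reads minId/maxId from its own ExecutionContext
            ctx.putLong("minId", min + i * sliceSize);
            ctx.putLong("maxId", (i == gridSize - 1) ? max
                                                     : min + (i + 1) * sliceSize - 1);
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}
```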
Disclaimer: I am a noob in Spring. What I am asking may be very "odd" as I don't even know what I don't know.
I am trying to create a batch data movement/manipulation tool (may I say, an ETL tool) using Java. Someone suggested I check out spring-batch, which I really liked, as it has many libraries for data reading/writing and processing.
But my trouble is: my data sources (flat file or table) are not fixed. There is a frontend where the user selects which flat file or database table(s) they want to load, and the program loads it automatically. This means the usual things like:
Source / target entity structures
source or target database URL/DSN
Job parameters etc.
are not predetermined in my case; they are determined at runtime. But so far, every spring-batch example I have seen configures this information in XML. I can't do that, as it would make the information static.
My question is: if I do not want to use the Spring container (and all its XML-based bean configuration) but still want to use spring-batch to take advantage of its batch-processing libraries, is that possible/viable?
No, you need the Spring container, with its XML- or annotation-based bean configuration, to use Spring Batch. However, what you are trying to do is achievable; you just need to find a way to make it configurable by using parameters in Spring Batch. You can take any example from the internet and start working on making it configurable.
For instance, you can use Spring's file reader by simply writing a custom mapper, saving yourself the effort of creating and maintaining the file-reading logic.
You can have a writer whose query you build dynamically at runtime, based on your table and file.
The examples show everything in XML to make them simple to understand; however, if you explore a little, almost everything can be done at runtime, as the sketch below shows.
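For example, here is a minimal sketch of building a `FlatFileItemReader` programmatically from user-supplied metadata, mapping each row to a name/value map instead of a fixed bean. The class, method, and parameter names are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.core.io.FileSystemResource;

public class DynamicReaderFactory {

    // filePath and columns come from the frontend at runtime, not from XML.
    public FlatFileItemReader<Map<String, Object>> createReader(String filePath,
                                                                List<String> columns) {
        return new FlatFileItemReaderBuilder<Map<String, Object>>()
                .name("dynamicReader")
                .resource(new FileSystemResource(filePath))
                .delimited()
                .names(columns.toArray(new String[0])) // column names chosen at runtime
                // map each row to a name -> value map instead of a fixed bean
                .fieldSetMapper(fieldSet -> {
                    Map<String, Object> row = new LinkedHashMap<>();
                    for (String col : columns) {
                        row.put(col, fieldSet.readString(col));
                    }
                    return row;
                })
                .build();
    }
}
```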
We are to replace a legacy scheduler implementation written in PL/SQL in a large enterprise environment, and I'm evaluating Spring Batch for the job.
Although the domain is pretty close to the one provided by Spring Batch, there are a few subtleties we have to address before deciding to embark on Spring Batch.
One of them has to do with jobs. In our domain, a job can be one of two things: a shell script or a stored procedure. Job and step definitions are stored in a table in the database. Before each execution, they are instantiated into another table, similar to the job_execution table in Spring Batch.
There are hundreds of jobs and thousands of steps. Keeping and managing these in XML files doesn't seem like a viable option. We'd like to continue managing job descriptions in tables.
So what are our options to CRUD job definitions at runtime and in a database?
AFAIK you're out of luck. The following suggestions are hardly worth the effort for you, in my opinion:
Reduce your jobs to a limited set of predefined, parameterized templates that are filled from the DB at (parallel) execution time (a sketch of this follows after the list).
Develop a tool that creates the configuration from the DB on request,
e.g. Spring configuration from database
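To make the first suggestion concrete, here is a minimal sketch of launching a predefined job template with parameters read from your own definition tables. The table, column, and bean names are placeholders; the shell-script template could, for instance, wrap Spring Batch's SystemCommandTasklet:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.jdbc.core.JdbcTemplate;

public class TemplateJobRunner {

    private final JobLauncher jobLauncher;
    private final Job shellScriptJob; // one of the predefined job templates
    private final JdbcTemplate jdbc;

    public TemplateJobRunner(JobLauncher jobLauncher, Job shellScriptJob, JdbcTemplate jdbc) {
        this.jobLauncher = jobLauncher;
        this.shellScriptJob = shellScriptJob;
        this.jdbc = jdbc;
    }

    public void runFromDefinition(long jobDefId) throws Exception {
        // read the job definition row your legacy scheduler keeps in the DB
        String command = jdbc.queryForObject(
                "SELECT command FROM job_definitions WHERE id = ?", String.class, jobDefId);
        JobParameters params = new JobParametersBuilder()
                .addString("command", command)
                .addLong("run.ts", System.currentTimeMillis()) // make each run a new JobInstance
                .toJobParameters();
        jobLauncher.run(shellScriptJob, params);
    }
}
```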