I want to develop a program that reads data from the database and written into file.
For a better performance, I want to use multithreading.
The solution I plan to implement is based on these assumptions:
it is not necessary to put multiple threads to read from the database because there is a concurrency problem to be managed by the DBMS (similarly to the writing into the file). Given that each read element from the database will be deleted in the same transaction.
Using the model producer-consumer: a thread to read the data (main program). and another thread to write the data in the file.
For implementation I will use the executor framework: a thread pool (size=1) to represent the consumer thread.
Can these assumptions make a good solution ?
Is this problem requires a solution based on multithreading?
it is not necessary to put multiple threads to read from the database because there is a concurrency problem to be managed by the DBMS
Ok. So you want one thread that is reading from the database.
Can these assumptions make a good solution ? Is this problem requires a solution based on multithreading?
Your solution will work but as mentioned by others, there are questions about the performance improvements (if any). Threading programs work because you can make use of the multiple processor (or core) hardware on your computer. In your case, if the threads are blocked by the database or blocked by the file-system, the performance improvement may be minimal if at all. If you were doing a lot of processing of the data, then having multiple threads handle the task would work well.
This is more of a comment:
For your first assumption: You should post the db part on https://dba.stackexchange.com/ .
A simple search returned :
https://dba.stackexchange.com/questions/2918/about-single-threaded-versus-multithreaded-databases-performance - so you need to check if your read action is complex enough and if multithread even serves your need for db connection.
Also, your program seems to be sequential read and write. I dont think you even need multithreading unless you want multiple writes on the same file at the same time.
You should have a look at Spring Batch, http://projects.spring.io/spring-batch/, which relates to JSR 352 specs.
This framework comes with pretty good patterns to manage ETL related operations, including multi-threaded processing, data partitioning, etc.
Related
Context of My question:
I use a proprietary Database (target database) and I can not reveal the name of the DB (you may not know even If I reveal the name).
Here, I usually need to update the records using java. (The number of records vary from 20000 to 40000)
Each update transaction is taking one or two seconds for this DB. So, you see that the execution time would be in hours. There are no Batch execution functions are available for this Database API. For this, I am thinking to use Java multi-threaded feature, instead of executing all the records in single process I want to create a thread for every 100 records. We know that Java can make these threads run parallelly.
But, I want to know how does the DB process these threads sharing the same connection? I can find this by running a trail program and compare time intervals. I feel that it may be deceiving to some extent. I know that you don't have much information about the database. You can just answer this question assuming the DB as MS SQL/MySQL.
Please suggest me if there is any other feature in java I can utilize to make this program execute faster if not multi-threading.
It is not recommended to use single connection with multiple threads, you can read the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure threads start and stop successfully within a transaction. If one of them fails you have to make sure to rollback the changes. So, first get the count, make cursor ranges and for each range start a thread that will execute that on that range. One thing to look for is to not close the connection after executing the partitions individually, but to close it when the transaction is complete and the db is committed.
If you have an option to use Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.
In my database, i have many records of a certain table that need to be processed from time to time by my java spring app.
There is a boolean flag, on each row of that table saying whether a given record is currently being processed.
What I'm looking at is having my java spring app deployed multiple times on different servers, all accessing the same shared DB, the same app duplicated with some load balancer, etc.
But only one java app instance at a time can process a given DB record of that particular table.
What are the different approaches to enforce that constraint?
I can think of some unique queue that would dispatch those processing tasks to different java running instances making sure that the same DB record is not processed simultaneously by two different java instances. But that sounds quite complicated for what it is. Maybe there is something simpler? Anything else? Thanks in advance.
you can use the locking strategies to enforce the exclusiveness of access to the particular records in you table. there are 2 different approaches that can be applied to reach this requirement. optimistic locking or pessimistic locking, take a look at hibernate docs
additionally, there's another issue that you should think of. with current approach, if the server would crash during the time when it was processing a certain record and eventually would not succeed to complete, then this record would stay in "incomplete" state and would not be processed by others. one possible solution that come to my mind is to use the 'node id' of server that took responsibility for processing instead of state flag.
I need to implement some kind of inter-process mutex in Java. I'm considering using the FileLock API as recommended in this thread. I'll basically be using a dummy file and locking it in each process.
Is this the best approach? Or is something like this built in the standard API (I can't find it).
For more details see below:
I have written an application which reads some input files and updates some database tables according to what it finds in them (it's more complex, but business logic is irrelevant here).
I need to ensure mutual exclusion between multiple database updates. I tried to implement this with LOCK TABLE, but this is unsupported by the engine I'm using. So, I want to implement the locking support in the application code.
I went for the FileLock API approach and implemented a simple mutex based on:
FileChannel.lock
FileLock.release
All the processes use the same dummy file for acquiring locks.
Why bother using files, if you have database on your hands? Try using database locking like this (https://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html).
By the way, what database engine do you use? It might help if you list it.
I am supposed to design a component which should achieve the following tasks by using multiThreading in Java as the files are huge/multiple and the task has to happen in a very short window:
Read multiple csv/xml files and save all the data in database
Read the database and write the data in separate files csv & xmls as per the txn types. (Each file may contain different types of records life file header, batch header, batchfooter, file footer, different transactions, and checksum record)
I am very new to multithreading & doing some research on Spring Batch in order to use it for the above tasks.
Please let me know what you suggest to use traditional multithreading in Java or Spring Batch. The input sources are multiple here and output sources are also multiple.
I would recommend going with something from framework rather than writing whole threading part yourself. I've quite successfully used Sping's tasks and scheduling for scheduled tasks that involved reaching data from DB, doing some processing, and sending emails, writing data back to database).
Spring Batch is ideal to implement your requirement. First of all you can use the builtin readers and writers to simplify your implementation - there is very good support for parsing CSV files, XML files, reading from database via JDBC etc. You also get the benefit of features like retrying in case of failure, skipping input that is invalid, restarting the whole job if something fails in between - the framework will track the status & restart from where it left off. Implementing all this by yourself is very complex and doing it well requires a lot of effort.
Once you implement your batch jobs with spring batch it gives you simple ways of parallelizing it. A single step can be run in multiple threads - it is mostly a configuration change. If you have multiple steps to be performed you can configure that as well. There is also support for distributing the processing over multiple machines if required. Most of the work to achieve parallelism is done by Spring Batch.
I would strongly suggest that you prototype a couple of your most complex scenarios with Spring Batch. If that works out you can go ahead with Spring Batch. Implementing it on your own especially when you are new to multi threading is a sure recipe for disaster.
Here is my requirement:
a date is inserted in to a db table with each record. Two weeks
before that particulate date, a separate record should be entered to a
different table.
My initial solution was to put up a SQL schedule job, but my client insisted on it being handled through java.
What is the best approach for this?
What are the pros and cons of using SQL schedule job and Java scheduling for this task?
Ask yourself the question: to what domain does this piece of work belong? If it's required for data integrity, then it's obviously the DBMS' problem and would probably best be handled there. If it's part of the business domain rather than the data, or might require information or processing that's not available or natural to the DBMS, it's probably best made external.
I'd say, use the best tool for the job. Having stuff handled by the database using whatever features it offers is often nice. For example, a log table that keeps "snapshots" of status updates of records in another table is something I typically like to have a trigger for, taking that responsibility out of my app's hands.
But that's something that's available in practically any DBMS. There's the possibility that other databases won't offer the job scheduling capacities you require. If it's conceivable that some day you'll be switching to a different DBMS, you'll then be forced to do it in Java anyway. That's the advantage of the Java approach: you've got the functionality independently of the database. If you're using pure JDBC with standard SQL queries, you've got a fully portable solution.
Both approaches seem valid. Consider what induces the least work and worries. If it's done in Java you'll need to make sure that process is running or scheduled. That's some external dependency. If it's in the database, you'll be sure the job is done as long as the DB is up.
Well, first off, if you want to do it in Java, you can use the Timer for a simple basic repetitive job, or Quartz for more advanced stuff.
Personally I also think that it would be better to have the same entity (application) deal with all related database actions. In other words, if your Java app is reading/writing to/from the db, it should be consistent and also deal with scheduled reading/writings. And as a plus, this way you can synchronize your actions easier, like, if you want to make sure that a scheduled job is running, has started, has finished, you can do that a lot easier if all is done in Java as compared with having a different process (like the SQL Scheduler) doing it.