Are multithreaded database transactions via connection-pooling threadsafe? - java

Example Scenario:
Using a threadpool in java where each thread gets a new connection from the connectionpool and then all threads proceed to do some db transaction in parallel. For example inserting 100 values into the same table.
Will this somehow mess with the table/database or is it entirely safe without any kind of synchronization required between the threads?
I find it hard to find reliable information about this subject. From what I gather DB engines handle this on their own/if at all (PostgresQL apparently since version 9.X). Are there any well written articles explaining this further?
Bonus question: Is there even a point to make use of parallel transactions when the DB runs on a single hdd?

As long as the database itself is conforming to ACID you are fine (although every now and then someone finds a bug in some really strange situation).
To the bonus question: for PostgreSQL it totally does make sense as long as you have some time for collecting concurrent transactions (increase value for commit_delay), which then can help combining disk I/O's into batches. There are also other parameters for transaction throughput tuning, most of which can be pretty dangerous if Durability is one of your major concerns.
Also, please keep in mind that the database client also needs to do some work between database calls which, when executed sequentially, will just add idle time for the database. So even here, parallelism helps (as long as you have actual resources for it (CPU, ...).

Related

Multi threaded transactional inserts with spring data jdbc

I'm using NamedParameterJdbcTemplate. I need to inserts data to 5 different tables within a transaction.
The sequential execution of inserts take long time & I need to optimize the time taken for inserts.
One possible option is make all inserts parallel using threads. As far as I understood transaction is not propagate to multi threads.
How can I improve time taken for this operation within a transaction boundary ?
I don't think what you are trying to do can possibly work.
As far as I know a database transaction is always bound to a single connection.
And the JDBC connection API is blocking, i.e. you can only execute a single statement at a time. So even when you share the Spring transaction across multiple threads you'll still execute your SQL sequential.
I therefore see the following options which might be combined available to you:
Tune your database/SQL: batched inserts, disabled constraints, adding or removing indexes and so one might have a effect on the execution time.
Drop the transactional constraint.
If you can break your process into multiple processes you might be able to run them in parallel and actually gaining performance.
Tune/parallelise the part happening in your Java application so you can do other stuff while your SQL statements are running.
To decide which approach is most promising we'd need to know more about your actual scenario.

Single transaction across multiple threads solution

As I understand it, all transactions are Thread-bound (i.e. with the context stored in ThreadLocal). For example if:
I start a transaction in a transactional parent method
Make database insert #1 in an asynchronous call
Make database insert #2 in another asynchronous call
Then that will yield two different transactions (one for each insert) even though they shared the same "transactional" parent.
For example, let's say I perform two inserts (and using a very simple sample, i.e. not using an executor or completable future for brevity, etc.):
#Transactional
public void addInTransactionWithAnnotation() {
addNewRow();
addNewRow();
}
Will perform both inserts, as desired, as part of the same transaction.
However, if I wanted to parallelize those inserts for performance:
#Transactional
public void addInTransactionWithAnnotation() {
new Thread(this::addNewRow).start();
new Thread(this::addNewRow).start();
}
Then each one of those spawned threads will not participate in the transaction at all because transactions are Thread-bound.
Key Question: Is there a way to safely propagate the transaction to the child threads?
The only solutions I've thought of to solve this problem:
Use JTA or some XA manager, which by definition should be able to do
this. However, I ideally don't want to use XA for my solution
because of it's overhead
Pipe all of the transactional work I want performed (in the above example, the addNewRow() function) to a single thread, and do all of the prior work in the multithreaded fashion.
Figuring out some way to leverage InheritableThreadLocal on the Transaction status and propagate it to the child threads. I'm not sure how to do this.
Are there any more solutions possible? Even if it's tastes a little bit of like a workaround (like my solutions above)?
The JTA API has several methods that operate implicitly on the current Thread's Transaction, but it doesn't prevent you moving or copying a Transaction between Threads, or performing certain operations on a Transaction that's not bound to the current (or any other) Thread. This causes no end of headaches, but it's not the worst part...
For raw JDBC, you don't have a JTA Transaction at all. You have a JDBC Connection, which has its own ideas about transaction context. In which case, the transaction is Connection bound, not thread bound. Pass the Connection around and the tx goes with it. But Connections aren't necessarily threadsafe and are probably a performance bottleneck anyhow, so sharing one between multiple concurrent threads doesn't really help you. You likely need multiple Connections that think they are in the same Transaction, which means you need XA, since that's how the db identifies such cases. At which point you're back to JTA, but now with a JCA in the picture to handle the Connection management properly. In short, you've reinvented the JavaEE application server.
For frameworks that layer on JDBC e.g. ORMs like Hibernate, you have an additional complication: their abstractions are not necessarily threadsafe. So you can't have a Session that is bound to multiple Threads concurrently. But you can have multiple concurrent Sessions that each participate in the same XA transaction.
As usual it boils down to Amdahl's law. If the speedup you get from using multiple Connections per tx to allow for multiple concurrent Threads to share the db I/O work is large relative to what you get from batching, then the overhead of XA is worthwhile. If the speedup is in local computation and the db I/O is a minor concern, then a single Thread that handles the JDBC Connection and offloads non-IO computation work to a Thread pool is the way to go.
First, a clarification: if you want to speed up several inserts of the same kind, as your example suggests, you will probably get the best performance by issuing the inserts in the same thread and using some type of batch inserting. Depending on your DBMS there are several techniques available, look at:
Efficient way to do batch INSERTS with JDBC
What's the fastest way to do a bulk insert into Postgres?
As for your actual question, I would personally try to pipe all the work to a worker thread. It is the simplest option as you don't need to mess with either ThreadLocals or transaction enlistment/delistment. Furthermore, once you have your units of work in the same thread, if you are smart you might be able to apply the batching techniques above for better performance.
Lastly, piping work to worker threads does not mean that you must have a single worker thread, you could have a pool of workers and achieve some parallelism if it is really beneficial to your application. Think in terms of producers/consumers.

Mutliple SQL Query Request - Is Better with Single or Multiple Connection?

I created an application, which deals with multiple database table at a same time. At present I created a single connection for the process and trying to execute query like select query for multiple tables parallel.
Each table may have hundreds of thousands or millions of records.
I have a connection and multiple statements that are executing parallel in threads.
I want to find out is there any better solution or approach?
I am thinking that if I use connection pool of for example 10 connections and run multiple thread (less than 10) to execute select query. Will this increase my application's performance?
Is my first approach okay?
Is it not a good approach to execute multiple statement same time (parallel) on the database?
In this forum link mentioned that single connection is better.
Databases are designed to run multiple parallel queries. Using a pool will almost certainly enhance your throughput if you are experiencing latency not caused by the database.
If the latency is caused by the database then parallelising may not help - and may even make it worse. Obviously it depends on the kind of query you are running.
I understand from your question that you are using a single Connection object and sharing it across threads. Each of those threads then executes it own statement. I will attempt to respond to your queries in reverse order.
Is it not good approach to execute multiple statement same time
(parallel) on the database?
This is not really a relevant point for this question. Almost all databases should be able to run queries in parallel. And if it cannot then either of your approaches would be almost identical for a concurrency benefit perspective.
Is my first approach Okay?
If you are just doing SELECTs it may not cause issues but you have to very cautious about sharing a Connection object. A number of transactional attributes such as autoCommit and isolation are set on the Connection object - this would mean all those would be shared by all your statements. You have to understand how that works in your case.
See the following links for more information
Is MySQL Connector/JDBC thread safe?
https://db.apache.org/derby/docs/10.2/devguide/cdevconcepts89498.html
Bottomline is if you can use a Connection pool, please do so.
Will this increase my application's performance ?
The best way to check this is to try it out. Theoretical analysis for performance in a multithreaded environment and with database functions rarely gets you accurate results. But then again, considering point 2 it seems you should just go with Connection pool.
EDIT
I just realized what I am thinking as the concern here and what your concern actually is may be different. I was thinking purely from sharing the Connection object perspective to avoid creating additional Connection objects [either pooled or new].
For performance of getting all the data from the database either way (assuming the the 1st way doesn't pose a problem) should be almost identical. In fact even if you create a new Connection object in each thread the overhead of that should typically be insignificant compared to querying millions of records.

Tutorial about Using multi-threading in jdbc

Our company has a Batch Application which runs every day, It does some database related jobs mostly, import data into database table from file for example.
There are 20+ tasks defined in that application, each one may depends on other ones or not.
The application execute tasks one by one, the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think it's too long, so I think maybe I can improve performance by multi-threading.
I think as there is dependency between tasks, it not good (or it's not easy) to make tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
for example : we have a task defined as "ImportBizData", which copy data into a database table from a data file(usually contains 100,0000+ rows). I wonder is that worth to use multi-threading?
As I know a little about multi-threading, I hope some one provide some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contains a small, but not too small "unit of work". Push those into a queue
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the threads in step #1 wait for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus inserting data in a database is a very complex task (allocating space, updating indexes, checking foreign keys)
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance
Multi threading may be of help, if the lines are uncorrelated, you may start off two processes one reading even lines, another uneven lines, and get your db connection from a connection pool (dbcp) and analyze performance. But first I would investigate whether jdbc is the best approach normally databases have optimized solutions for imports like this. These solutions may also temporarily switch of constraint checking of your table, and turn that back on later, which is also great for performance. As always depending on your requirements.
Also you may want to checkout springbatch which is designed for batch processing.
As far as I know,the JDBC Bridge uses synchronized methods to serialize all calls to ODBC so using mutliple threads won't give you any performance boost unless it boosts your application itself.
I am not all that familiar with JDBC but regarding the multithreading bit of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into bits that are independent of one another and in some way putting them back together (their output that is). If you dont know the underlying dependencies between tasks you might end up having really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from true values. Multi-threading is tricky business, in a way fun to learn (at least I think so) but pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting effort to getting into multi-threading I can recommend GOETZ, BRIAN: JAVA CONCURRENCY, amazing book really..
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Using SQL Loader(Oracle) for uploading data into database(very fast) OR any similar bulk update tools for your database.
STEP2:
Running each uploading process in a different thread(for unrelated tasks) and in a single thread for related tasks.
P.S. You could identify different inter-related jobs in your application and categorize them in groups; and running each group in different threads.
Links to run you up:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
If Multi threading would complicate your work, you could go with Async messaging. I'm not fully aware of what your needs are, so, the following is from what I am seeing currently.
Create a file reader java whose purpose is to read the biz file and put messages into the JMS queue on the server. This could be plain Java with static void main()
Consume the JMS messages in the Message driven beans(You can set the limit on the number of beans to be created in the pool, 50 or 100 depending on the need) if you have mutliple servers, well and good, your job is now split into multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads in the whole process, JMS is ideal because your data is within a transaction, if something fails before you send an ack to the server, the message will be resent to the consumer, the load will be split between the servers without you doing anything special like multi threading.
Also, spring is providing spring-batch which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios

Using multiple statement objects in a multi-threaded application

Im developing a multi-threaded application in which different threads are required to update the database concurrently. Hence,i passed a new statement object to each thread, while creating it(to avoid locking,if i send a single object). My doubts are :
Is there a limit on the number of statement objects that could be obtained from a single jdbc connection ? would the database connection fail if i create too many statement objects ?
If i close the statement properly before the thread dies out,what would be the number of threads that can be spawned at a time (on a system with 512Mb RAM) ?
Wouldn't the driver update the database while keeping the data consistent,no matter how many statement objects i use to update the db parallelly ? i use mysql.
Practically the number of statement objects you would be able to create should suffice your needs. Then again, how much is "too many" in your case?
The number of threads that can be created depends on a lot of factors. Do realize that these threads you create would be "OS level" threads and not real threads (assuming you have a dual core system, that would make it 2 hardware threads or 4 if hyper-threading is available). Profiling you would be of prime importance here to determine how many threads can be created before your system slows down to a crawl.
This would depend on the locking mechanism used by the database. What are you aiming for; high integrity or high performance? Read this.
IMO, you would be better off looking up Connection objects from a connection pool in each of those threads rather than trying to pass around "statement" objects.
Although I am not a java programmer, sharing a single connection between multiple threads is a bad idea. What happens when 2 threads are trying to write on the same socket? - so - each thread must have its own db connection
Yes, the data should be consistent in the DB if many threads are writing at the same time - anyway, you will have to take care in code of managing the transactions correctly - and of course, use InnoDB as the storage engine for MySQL because MyISAM does not permit transactions
that's probably up to the jdbc implementation, but, in general, just about everything has limits.
who knows. in practice, probably thousands. however, that many probably won't increase your performance.
yes, you should be able to share 1 connection across multiple threads, however, many jdbc implementations perform poorly in this scenario. better to have a connection per thread (for some reasonable number of connections/threads).

Categories