I'm developing a multi-threaded application in which different threads need to update the database concurrently. Hence, I pass a new Statement object to each thread when creating it (to avoid the locking I would get if I shared a single object). My doubts are:
Is there a limit on the number of Statement objects that can be obtained from a single JDBC connection? Would the database connection fail if I create too many Statement objects?
If I close the statement properly before the thread dies, how many threads can be spawned at a time (on a system with 512 MB RAM)?
Wouldn't the driver keep the data consistent while updating the database, no matter how many Statement objects I use to update the DB in parallel? I use MySQL.
Practically the number of statement objects you would be able to create should suffice your needs. Then again, how much is "too many" in your case?
The number of threads that can be created depends on a lot of factors. Do realize that the threads you create are OS-level threads, not hardware threads (a dual-core system has 2 hardware threads, or 4 if hyper-threading is available). Profiling would be of prime importance here to determine how many threads can be created before your system slows down to a crawl.
This would depend on the locking mechanism used by the database. What are you aiming for; high integrity or high performance? Read this.
IMO, you would be better off looking up Connection objects from a connection pool in each of those threads rather than trying to pass around "statement" objects.
Although I am not a Java programmer, sharing a single connection between multiple threads is a bad idea. What happens when 2 threads try to write to the same socket? So each thread must have its own DB connection.
Yes, the data should be consistent in the DB if many threads are writing at the same time. You will still have to manage the transactions correctly in your code and, of course, use InnoDB as the storage engine for MySQL, because MyISAM does not support transactions.
that's probably up to the jdbc implementation, but, in general, just about everything has limits.
who knows. in practice, probably thousands. however, that many probably won't increase your performance.
yes, you should be able to share 1 connection across multiple threads, however, many jdbc implementations perform poorly in this scenario. better to have a connection per thread (for some reasonable number of connections/threads).
Example Scenario:
Using a threadpool in java where each thread gets a new connection from the connectionpool and then all threads proceed to do some db transaction in parallel. For example inserting 100 values into the same table.
Will this somehow mess with the table/database or is it entirely safe without any kind of synchronization required between the threads?
I find it hard to find reliable information about this subject. From what I gather, DB engines handle this on their own, if at all (PostgreSQL apparently since version 9.X). Are there any well-written articles explaining this further?
Bonus question: Is there even a point to make use of parallel transactions when the DB runs on a single hdd?
As long as the database itself is conforming to ACID you are fine (although every now and then someone finds a bug in some really strange situation).
To the bonus question: for PostgreSQL it totally does make sense as long as you have some time for collecting concurrent transactions (increase value for commit_delay), which then can help combining disk I/O's into batches. There are also other parameters for transaction throughput tuning, most of which can be pretty dangerous if Durability is one of your major concerns.
Also, please keep in mind that the database client also needs to do some work between database calls which, when executed sequentially, just adds idle time for the database. So even here, parallelism helps (as long as you have actual resources for it: CPU, ...).
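The scenario from the question (a thread pool where every task borrows its own pooled connection and inserts a row) can be sketched roughly as below. This is only an illustration under assumptions: the DataSource is whatever pool you already use, and the table `samples` with column `val` is made up for the example. No synchronization between the tasks is needed in the Java code; the database serializes the inserts itself.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.sql.DataSource;

class ParallelInsertSketch {

    // Inserts one value using a connection borrowed only for this call.
    // The table and column names are hypothetical.
    static int insertValue(DataSource pool, int value) throws SQLException {
        try (Connection conn = pool.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO samples (val) VALUES (?)")) {
            ps.setInt(1, value);
            return ps.executeUpdate(); // row count reported by the driver
        } // closing a pooled connection hands it back to the pool
    }

    // Runs `count` inserts on `threads` worker threads and returns the
    // total number of rows reported as inserted.
    static int insertInParallel(DataSource pool, int count, int threads)
            throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int value = i;
                tasks.add(() -> insertValue(pool, value));
            }
            int total = 0;
            // invokeAll blocks until every task has finished
            for (Future<Integer> result : executor.invokeAll(tasks)) {
                total += result.get(); // rethrows a task failure, if any
            }
            return total;
        } finally {
            executor.shutdown();
        }
    }
}
```

A call such as `ParallelInsertSketch.insertInParallel(dataSource, 100, 4)` would run the 100 inserts on 4 threads; whether that is faster than a single thread is something to measure rather than assume.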
I've seen two ways to deal with database connections:
1) Connection pool
2) Bind connection to a thread (when we have fixed and constant threads count)
But I don't understand the purpose of using #2. What are the advantages of the second approach over the first one?
If you're working with a single thread or a very small set of threads (that need database functionality), binding a connection to a thread would act like a poor man's connection pool. Instead of checking out a connection from the pool every time you use it, you would just use the single connection bound to the thread. This would allow for quick execution of database queries, even with code that hasn't been very well designed.
However in many cases you're not working with a single thread or a small set of threads. As soon as you're developing an application with even dozens of simultaneous users, you're better off working with a connection pool as it will become impossible to dedicate a connection to every thread (see next paragraph).
Some people also have the misunderstanding that a connection pool can and should have a lot of connections (100 or more), even though it's often more advantageous to have fewer. Since all of the connections use the database's resources, the effect is similar to having a store with a single cash register. It's not more efficient to have 10 doors to the store instead of 1, since it will just fill up with customers but the payments won't happen any faster.
I created an application which deals with multiple database tables at the same time. At present I create a single connection for the process and try to execute queries, such as SELECT queries on multiple tables, in parallel.
Each table may have hundreds of thousands or millions of records.
I have a connection and multiple statements that are executing parallel in threads.
I want to find out whether there is a better solution or approach.
I am thinking of using a connection pool of, for example, 10 connections and running multiple threads (fewer than 10) to execute the SELECT queries. Will this increase my application's performance?
Is my first approach okay?
Is it not a good approach to execute multiple statement same time (parallel) on the database?
This forum link mentions that a single connection is better.
Databases are designed to run multiple parallel queries. Using a pool will almost certainly enhance your throughput if you are experiencing latency not caused by the database.
If the latency is caused by the database then parallelising may not help - and may even make it worse. Obviously it depends on the kind of query you are running.
I understand from your question that you are using a single Connection object and sharing it across threads. Each of those threads then executes its own statement. I will attempt to respond to your queries in reverse order.
Is it not good approach to execute multiple statement same time
(parallel) on the database?
This is not really a relevant point for this question. Almost all databases should be able to run queries in parallel. And if it cannot then either of your approaches would be almost identical for a concurrency benefit perspective.
Is my first approach Okay?
If you are just doing SELECTs it may not cause issues, but you have to be very cautious about sharing a Connection object. A number of transactional attributes, such as autoCommit and the isolation level, are set on the Connection object, which means they are shared by all statements created from it. You have to understand how that works in your case.
See the following links for more information
Is MySQL Connector/JDBC thread safe?
https://db.apache.org/derby/docs/10.2/devguide/cdevconcepts89498.html
Bottom line: if you can use a Connection pool, please do so.
Will this increase my application's performance ?
The best way to check this is to try it out. Theoretical analysis for performance in a multithreaded environment and with database functions rarely gets you accurate results. But then again, considering point 2 it seems you should just go with Connection pool.
EDIT
I just realized what I am thinking as the concern here and what your concern actually is may be different. I was thinking purely from sharing the Connection object perspective to avoid creating additional Connection objects [either pooled or new].
As for the performance of getting all the data from the database, either way (assuming the 1st way doesn't pose a problem) should be almost identical. In fact, even if you create a new Connection object in each thread, the overhead of that should typically be insignificant compared to querying millions of records.
I have an application that processes lots of data in files and puts this data into a database. It has been single threaded; so I create a database connection, create prepared statements on that connection, and then reuse these statements while processing the data. I might process thousands of files and can reuse the same prepared statements over and over but only updating the values. This has been working great, however ...
It has come to the point where it is taking too long to process the files, and since they are all independent, I'd like to process them concurrently. The problem is that each file might use, say, 10 prepared statements. So now for each file I'm making a new database connection (even though they are pooled), setting up these 10 prepared statements, and then closing them and the connection down for each file; so this is happening thousands and thousands of times instead of just a single time before.
I haven't actually done any timings but I'm curious if this use of connections and prepared statements is the best way? Is it really expensive to set up these prepared statements over and over again? Is there a better way to do this? I've read that you don't want to share connections between threads but maybe there's a better solution I haven't thought of?
if this use of connections and prepared statements is the best way? Is it really expensive to set up these prepared statements over and over again?
You can reuse the connections and prepared statements over and over again for sure. You do not have to re-create them, and for the connections, you certainly do not have to reconnect to the database server every time. You should be using a database connection pool at the very least. Also, you cannot use a prepared statement in multiple threads at the same time. And I also think that for most database connections, you cannot use the same connection in different threads.
That said, it might make sense to do some profiler runs because threading database code typically provides minimal speed increase because you are often limited by the database server IO and not by the threads. This may not be true if you are mixing queries and inserts and transactions. You might get some concurrency if you are making a remote connection to a database.
To improve the speed of your database operations, consider turning off auto-commit before a large number of transactions or otherwise batching up your requests if you can.
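As a rough sketch of that advice, the method below turns auto-commit off, sends rows to the driver with the standard JDBC addBatch()/executeBatch() calls, and commits once at the end. The table `records`, its column `payload`, and the batch size are assumptions made up for the illustration; how much batching helps depends on the driver and the database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInsertSketch {

    // Inserts all values in one transaction, sending them to the driver in
    // batches of `batchSize`. Returns the number of batches executed.
    // The table and column names are hypothetical.
    static int insertAll(Connection conn, List<String> values, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit at the end, not one per row
        int batches = 0;
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO records (payload) VALUES (?)")) {
            int pending = 0;
            for (String value : values) {
                ps.setString(1, value);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    batches++;
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final, partially filled batch
                ps.executeBatch();
                batches++;
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // discard everything done in this transaction
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit); // restore caller's setting
        }
        return batches;
    }
}
```

The same prepared statement is reused for every row, so only the parameter values change between iterations.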
I advise you to use the c3p0 library; see http://www.mchange.com/projects/c3p0/
Enhanced performance is the purpose of Connection and Statement pooling, especially if you are acquiring an unpooled Connection for each client access; this is the major goal of the c3p0 library.
This part is taken from C3P0 Doc about threads and heavy load:
numHelperThreads and maxAdministrativeTaskTime help to configure the behavior of DataSource thread pools. By default, each DataSource has only three associated helper threads. If performance seems to drag under heavy load, or if you observe via JMX or direct inspection of a PooledDataSource, that the number of "pending tasks" is usually greater than zero, try increasing numHelperThreads. maxAdministrativeTaskTime may be useful for users experiencing tasks that hang indefinitely and "APPARENT DEADLOCK" messages.
In addition, I recommend you use Executor and ExecutorService (in java.util.concurrent) to pool your threads.
It looks like the following:
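import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// numberOfThreadsNeeded is an int you choose; runnable is your task
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreadsNeeded);
// ExecutorService executor = Executors.newCachedThreadPool(); // Or this one
executor.execute(runnable);
// ... submit further tasks the same way ...
executor.shutdown(); // stop accepting new tasks; submitted ones still run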
I have a Java program consisting of about 15 methods, and these methods get invoked very frequently during the execution of the program. At the moment, I am creating a new connection in every method and invoking statements on it (the database is set up on another machine on the network).
What I would like to know is: should I create only one connection in the main method and pass it as an argument to all the methods that require a connection object, since that would significantly reduce the number of connection objects in the program, instead of creating and closing connections very frequently in every method?
I suspect I am not using the resources very efficiently with the current design, and there is a lot of scope for improvement, considering that this program might grow a lot in the future.
Yes, you should consider re-using connections rather than creating a new one each time. The usual procedure is:
make some guess as to how many simultaneous connections your database can sensibly handle (e.g. start with 2 or 3 per CPU on the database machine until you find out that this is too few or too many-- it'll tend to depend on how disk-bound your queries are)
create a pool of this many connections: essentially a class that you can ask for "the next free connection" at the beginning of each method and then "pass back" to the pool at the end of each method
your getFreeConnection() method needs to return a free connection if one is available, else either (1) create a new one, up to the maximum number of connections you've decided to permit, or (2) if the maximum are already created, wait for one to become free
I'd recommend the Semaphore class to manage the connections; I actually have a short article on my web site on managing a resource pool with a Semaphore with an example I think you could adapt to your purpose
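The steps above can be sketched as a small generic pool guarded by a Semaphore. This is a minimal illustration, not the article's code: the class name `BoundedPool` and its methods are invented here, and the factory stands in for whatever opens a JDBC connection in your application (for example a call to DriverManager.getConnection). It does no validation or retirement of stale connections.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class BoundedPool<T> {
    private final Semaphore permits; // one permit per allowed checkout
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger();

    BoundedPool(int maxSize, Supplier<T> factory) {
        this.permits = new Semaphore(maxSize, true); // fair: waiters served in order
        this.factory = factory;
    }

    // Blocks while maxSize resources are checked out, then returns an idle
    // resource if there is one, or creates a new one otherwise.
    T acquire() throws InterruptedException {
        permits.acquire();
        T resource = idle.poll();
        if (resource == null) {
            try {
                resource = factory.get();
                created.incrementAndGet();
            } catch (RuntimeException e) {
                permits.release(); // do not leak the permit if creation fails
                throw e;
            }
        }
        return resource;
    }

    // Returns a resource to the pool. It is queued before the permit is
    // released so that a woken waiter always finds it.
    void release(T resource) {
        idle.offer(resource);
        permits.release();
    }

    int createdCount() {
        return created.get();
    }
}
```

Callers would pair `acquire()` with `release()` in a try/finally block so that a connection is returned even when a query throws.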
A couple of practical considerations:
For optimum performance, you need to be careful not to "hog" a connection while you're not actually using it to run a query. If you take a connection from the pool once and then pass it to various methods, you need to make sure you're not accidentally doing this.
Don't forget to return your connections to the pool! (try/finally is your friend here...)
On many systems, you can't keep connections open 'forever': the O/S will close them after some maximum time. So in your 'return a connection to the pool' method, you'll need to think about 'retiring' connections that have been around for a long time (build in some mechanism for remembering, e.g. by having a wrapper object around an actual JDBC Connection object that you can use to store metrics such as this)
You may want to consider using prepared statements.
Over time, you'll probably need to tweak the connection pool size
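The wrapper object mentioned above for retiring long-lived connections could look roughly like this. The names `PooledEntry` and `shouldRetire` are hypothetical, and the maximum age is something you would have to choose for your own environment; the current time is passed in explicitly so the check is easy to test.

```java
import java.time.Duration;
import java.time.Instant;

class PooledEntry<T> {
    private final T resource;        // e.g. the actual JDBC Connection
    private final Instant createdAt; // when the resource was opened

    PooledEntry(T resource, Instant createdAt) {
        this.resource = resource;
        this.createdAt = createdAt;
    }

    T resource() {
        return resource;
    }

    // True once the resource has existed for at least maxAge; the pool's
    // "return" method can then close it instead of keeping it.
    boolean shouldRetire(Instant now, Duration maxAge) {
        return Duration.between(createdAt, now).compareTo(maxAge) >= 0;
    }
}
```

In the pool's "return a connection" method you would call `shouldRetire(Instant.now(), maxAge)` and close the underlying connection when it returns true.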
You can either pass in the connection or better yet use something like Jakarta Database Connection Pooling.
http://commons.apache.org/dbcp/
You should use a connection pool for that.
That way you could ask for a connection, and when you are finished with it, release it and return it to the pool.
If another thread wants a new connection and that one is in use, a new one could be created. If no other thread is using a connection the same could be re-used.
This way you can leave your app somehow the way it is ( and not passing the connection all around ) and still use the resources properly.
Unfortunately, first-class connection pools are not very easy to use in standalone applications (they are the default in application servers). Probably a microcontainer (such as Spring) or a good framework (such as Hibernate) could let you use one.
It's not too hard to code one from scratch, though.
:)
This google search will help you to find more about how to use one.
Skim through
Many JDBC drivers do connection pooling for you, so there is little advantage in doing additional pooling in this case. I suggest you check the documentation for your JDBC driver.
Another approach to connection pools is to
Have one connection for all database access with synchronised access. This doesn't allow concurrency but is very simple.
Store the connections in a ThreadLocal variable (override initialValue()) This works well if there is a small fixed number of threads.
Otherwise, I would suggest using a connection pool.
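The ThreadLocal option can be sketched as follows. It uses ThreadLocal.withInitial, which does the same job as overriding initialValue(). The class `PerThreadResource` is invented for this illustration, and the factory is assumed to be whatever opens a connection in your application; note that nothing here closes the connection when a thread ends, which real code would need to handle.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class PerThreadResource<T> {
    private final AtomicInteger created = new AtomicInteger();
    private final ThreadLocal<T> holder;

    PerThreadResource(Supplier<T> factory) {
        // The initializer runs at most once per thread, on that thread's
        // first call to get().
        this.holder = ThreadLocal.withInitial(() -> {
            created.incrementAndGet();
            return factory.get();
        });
    }

    // Returns the calling thread's own resource, creating it on first use.
    T get() {
        return holder.get();
    }

    // Number of resources created so far across all threads.
    int createdCount() {
        return created.get();
    }
}
```

With a small fixed number of threads this gives each thread one long-lived connection and no contention, which is why it only suits the case described above.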
If your application is single-threaded, or does all its database operations from a single thread, it's ok to use a single connection. Assuming you don't need multiple connections for any other reason, this would be by far the simplest implementation.
Depending on your driver, it may also be feasible to share a connection between threads - this would be ok too, if you trust your driver not to lie about its thread-safety. See your driver documentation for more info.
Typically the objects below "Connection" cannot safely be used from multiple threads, so it's generally not advisable to share ResultSet, Statement objects etc between threads - by far the best policy is to use them in the same thread which created them; this is normally easy because those objects are not generally kept for too long.