I have an application that processes lots of data in files and puts this data into a database. It has been single threaded; so I create a database connection, create prepared statements on that connection, and then reuse these statements while processing the data. I might process thousands of files and can reuse the same prepared statements over and over but only updating the values. This has been working great, however ...
It has come to the point where it is taking too long to process the files, and since they are all independent, I'd like to process them concurrently. The problem is that each file might use, say, 10 prepared statements. So now for each file I'm making a new database connection (even though they are pooled), setting up these 10 prepared statements, and then closing them and the connection down for each file; so this is happening thousands and thousands of times instead of just a single time before.
I haven't actually done any timings but I'm curious if this use of connections and prepared statements is the best way? Is it really expensive to set up these prepared statements over and over again? Is there a better way to do this? I've read that you don't want to share connections between threads but maybe there's a better solution I haven't thought of?
if this use of connections and prepared statements is the best way? Is it really expensive to set up these prepared statements over and over again?
You can reuse the connections and prepared statements over and over again for sure. You do not have to re-create them, and for the connections, you certainly do not have to reconnect to the database server every time. You should be using a database connection pool at the very least. Also, you cannot use a prepared statement in multiple threads at the same time, and I also think that for most JDBC drivers, you cannot use the same connection in different threads.
That said, it might make sense to do some profiler runs first: threading database code typically provides only a minimal speed increase, because you are often limited by database server I/O rather than by the threads. This may not be true if you are mixing queries, inserts, and transactions. You might get some concurrency if you are making a remote connection to a database.
To improve the speed of your database operations, consider turning off auto-commit before a large number of transactions, or otherwise batching up your requests if you can.
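A minimal sketch of that batching pattern; the dataSource, the Record type, and the table and column names are assumptions for illustration:

// Batch many inserts into one round trip; commit once at the end.
try (Connection conn = dataSource.getConnection()) {
    conn.setAutoCommit(false);                       // take manual control of commits
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO file_data (file_name, value) VALUES (?, ?)")) {
        for (Record r : records) {                   // hypothetical Record type
            ps.setString(1, r.getFileName());
            ps.setInt(2, r.getValue());
            ps.addBatch();                           // queue instead of executing now
        }
        ps.executeBatch();                           // send the whole batch at once
    }
    conn.commit();
}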
I advise you to use the C3P0 API. Check it out: http://www.mchange.com/projects/c3p0/

Enhanced performance is the purpose of Connection and Statement pooling, especially if you are acquiring an unpooled Connection for each client access; this is the major goal of the c3p0 library.

This part is taken from the C3P0 documentation about threads and heavy load:
numHelperThreads and maxAdministrativeTaskTime help to configure the behavior of DataSource thread pools. By default, each DataSource has only three associated helper threads. If performance seems to drag under heavy load, or if you observe via JMX or direct inspection of a PooledDataSource, that the number of "pending tasks" is usually greater than zero, try increasing numHelperThreads. maxAdministrativeTaskTime may be useful for users experiencing tasks that hang indefinitely and "APPARENT DEADLOCK" messages.
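For example, a minimal C3P0 setup along those lines (the driver class, JDBC URL, and credentials are placeholders) could look like this:

import com.mchange.v2.c3p0.ComboPooledDataSource;

ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass("com.mysql.jdbc.Driver");          // may throw PropertyVetoException
cpds.setJdbcUrl("jdbc:mysql://localhost:3306/mydb");   // placeholder URL
cpds.setUser("user");                                  // placeholder credentials
cpds.setPassword("password");
cpds.setMaxPoolSize(20);                               // cap on pooled connections
cpds.setNumHelperThreads(6);                           // raise if "pending tasks" stays above zero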
In addition, I recommend you use Executor and ExecutorService (in java.util.concurrent) to pool your threads.

Like the following:
// Create a fixed-size pool; tune the size based on measurements.
int numberOfThreadsNeeded = 10;
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreadsNeeded);
// ExecutorService executor = Executors.newCachedThreadPool(); // Or this one
executor.execute(runnable);
// ... submit the rest of your tasks, then shut the pool down:
executor.shutdown();
Related
Due to some previous questions that I've had answered about the synchronous nature of MySQL, I'm starting to question the reason people use connection pools, and whether in my scenario I should move to a pool.
Currently my application keeps a single connection active. There's only a single connection, statement, and result set being used in my application that's recycled. All of my database tasks are placed in a queue and executed back to back on a separate thread. One thread for database queries, one connection for database access. In the event that the connection has an issue, it will dispose of the connection and create a new one.

From my understanding, regardless of how many queries are sent to MySQL to be processed, they will all be processed synchronously in the order they are received. It does not matter if these queries are coming from a single source or multiple; they will be executed in the order received.

With this being said, what's the point in having multiple connections and threads to smash queries into the database's processing queue, when regardless it's going to process them one by one anyway? A query is not going to execute until the query before it has completed processing, and likewise in my scenario where I'm not using a pool, the next query is not going to be executed until the previous query has completed processing.
Now you may say:
The amount of time spent on processing the results provided by the MySQL query will increase the amount of time between queries being executed.
That's obviously correct, which is why I have a worker thread that handles the results of a query. When a query is completed, I convert the results into Map<> format and release the statement/resultset from memory and start processing the next query. The Map<> is sent off to a separate Worker thread for processing, so it doesn't congest the query execution thread.
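A rough sketch of the kind of conversion described, with a hypothetical helper name:

// Hypothetical helper: copy a ResultSet into a list of maps so the
// statement/result set can be released before the worker thread runs.
static List<Map<String, Object>> toMaps(ResultSet rs) throws SQLException {
    ResultSetMetaData meta = rs.getMetaData();
    int cols = meta.getColumnCount();
    List<Map<String, Object>> rows = new ArrayList<>();
    while (rs.next()) {
        Map<String, Object> row = new HashMap<>();
        for (int i = 1; i <= cols; i++) {
            row.put(meta.getColumnLabel(i), rs.getObject(i));
        }
        rows.add(row);
    }
    return rows;
}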
Can anyone tell me if the way I'm doing things is alright, and if I should take the time to move to a pool of connections rather than a persistent connection. The most important thing is why. I'm starting this thread strictly for informational purposes.
EDIT: 4/29/2016
I would like to add that I know what a connection pool is, however I'm more curious about the benefits of using a pool over a single persistent connection when the table locks out requests from all connections during query processing to begin with.
Just trying this StackOverflow thing out, but here goes.

Most of the time, a connection to a database sits idle. When you execute an INSERT or UPDATE query on a connection, it locks the table, preventing concurrent edits. While this is good, preventing data overwriting or corruption, it means that no other connection may make edits while the first connection/query is still running.

However, starting a new connection takes time, and in larger infrastructures trying to shave off all excess time wastage, this is not good. As such, connection pools are a whole group of connections kept in the idle state, ready for the next query.
Lastly, if you are running a small project, there's usually no reason for a connection pool but if you are running a large site with UPDATEs and INSERTs flying around every millisecond, a connection pool reduces overhead time.
Slightly related answer:
A pool can do additional "connection health checks" (by examining SQL exception codes) and refresh connections to reduce memory usage (see the note on "maxLifeTime" in the answer). But all those things might not outweigh the simpler approach of using one connection.
Another factor to consider is (blocking) network I/O times. Consider this (rough) scenario:
client prepares query --> client sends data over the network
--> server receives data from the network --> server executes query, prepares results
--> server sends data over the network --> client receives data from the network
--> client prepares resultset
If the database is local (on the same machine as the client) then network times are barely noticeable.
But if the database is remote, network I/O times can become measurable and impact performance.
Assuming the isolation level is at "read committed", running select-statements in parallel could become faster. In my experience, using 4 connections at the same time instead of 1 generally improves performance (or throughput). This does depend on your specific situation: if MySQL is indeed just mostly waiting on locks to get released, adding additional connections will not do much in terms of speed. And likewise, if the client is single-threaded, the client may not actually perceive any noticeable speed improvements.

This should be easy enough to test though: compare execution times for one program with 1 thread using 1 connection to execute an X-amount of select-queries (i.e. re-use your current program) with another program using 4 threads, each thread using 1 separate connection, to execute the same X-amount of select-queries divided by the 4 threads (or just run the first program 4 times in parallel).
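A rough sketch of the 4-thread variant; the JDBC URL, credentials, and select query are placeholders:

// 4 threads, each with its own connection, splitting the queries between them.
// Call from a method that declares throws InterruptedException.
int threads = 4;
int queriesPerThread = totalQueries / threads;       // X divided by 4
ExecutorService pool = Executors.newFixedThreadPool(threads);
long start = System.nanoTime();
for (int t = 0; t < threads; t++) {
    pool.submit(() -> {
        try (Connection conn = DriverManager.getConnection(url, user, pass)) {
            for (int i = 0; i < queriesPerThread; i++) {
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(selectQuery)) { // your select here
                    while (rs.next()) { /* consume the row */ }
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
System.out.println("Elapsed ms: " + (System.nanoTime() - start) / 1_000_000);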
One note on connection pools (like HikariCP): the pool must ensure no transaction remains open when a connection is returned to the pool, and this could mean a "rollback" is sent each time a connection is returned to the pool (closed) when auto-commit is off and no "commit" or "rollback" was sent previously. This in turn can increase network I/O times instead of reducing them. So make sure to test with auto-commit on, or make sure to always send a commit or rollback after your query or set of queries is done.
A connection pool and a persistent connection are not the same thing. One is about limiting the number of SQL connections; the other is about the issues of a single pipe.
The problem is generally the time taken to transfer the SQL output to the client rather than the query execution time. So if you open two CLI SQL clients and fire two queries, one with a large output and one with a small output (in that sequence), the smaller one finishes first while the larger one is still scrolling its output.

The point here is that multiple connections do solve problems for cases like the above.

When you have multiple front-end requests asking for queries, you may prefer persistent connections because they give you the benefit of multiplexing over different connections (large versus small outputs) and prevent the overhead of session setup/teardown.
Connection pool APIs have inbuilt error checks and handling but most APIs still expect you to manually declare if you want a persistent connection or not.
So in effect there are 3 variables: pool, persistence, and config parameters via the API. One has to make a mix and match of pool size, persistence, and number of connections to suit one's environment.
I created an application which deals with multiple database tables at the same time. At present I create a single connection for the process and try to execute queries, such as SELECT queries, on multiple tables in parallel.
Each table may have hundreds of thousands or millions of records.
I have one connection and multiple statements that are executing in parallel on separate threads.
I want to find out is there any better solution or approach?
I am thinking of using a connection pool of, for example, 10 connections and running multiple threads (fewer than 10) to execute SELECT queries. Will this increase my application's performance?
Is my first approach okay?
Is it not a good approach to execute multiple statements at the same time (in parallel) on the database?

This forum link mentions that a single connection is better.
Databases are designed to run multiple parallel queries. Using a pool will almost certainly enhance your throughput if you are experiencing latency not caused by the database.
If the latency is caused by the database then parallelising may not help - and may even make it worse. Obviously it depends on the kind of query you are running.
I understand from your question that you are using a single Connection object and sharing it across threads. Each of those threads then executes its own statement. I will attempt to respond to your queries in reverse order.
Is it not a good approach to execute multiple statements at the same time (in parallel) on the database?
This is not really a relevant point for this question. Almost all databases should be able to run queries in parallel. And if one cannot, then either of your approaches would be almost identical from a concurrency-benefit perspective.
Is my first approach okay?
If you are just doing SELECTs it may not cause issues, but you have to be very cautious about sharing a Connection object. A number of transactional attributes such as autoCommit and isolation are set on the Connection object; this means all of those would be shared by all your statements. You have to understand how that works in your case.
See the following links for more information
Is MySQL Connector/JDBC thread safe?
https://db.apache.org/derby/docs/10.2/devguide/cdevconcepts89498.html
Bottom line: if you can use a Connection pool, please do so.
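As an illustration, setting up a pool with HikariCP (mentioned in another answer above) and borrowing a connection per thread might look roughly like this; the URL, credentials, and query are placeholders:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
config.setUsername("user");                            // placeholder credentials
config.setPassword("password");
config.setMaximumPoolSize(10);                         // at most 10 live connections
HikariDataSource ds = new HikariDataSource(config);

// In each worker thread: borrow a connection, use it, return it on close().
try (Connection conn = ds.getConnection();
     PreparedStatement ps = conn.prepareStatement(query)) { // your query here
    // execute and process results
}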
Will this increase my application's performance?
The best way to check this is to try it out. Theoretical analysis of performance in a multithreaded environment and with database functions rarely gets you accurate results. But then again, considering point 2, it seems you should just go with a Connection pool.
EDIT
I just realized that what I am thinking of as the concern here and what your concern actually is may be different. I was thinking purely from the perspective of sharing the Connection object to avoid creating additional Connection objects [either pooled or new].

For performance of getting all the data from the database, either way (assuming the first approach doesn't pose a problem) should be almost identical. In fact, even if you create a new Connection object in each thread, the overhead of that should typically be insignificant compared to querying millions of records.
I have some data to be read from multiple SQL Server databases (like 200). There will be around 10 tables in each of these databases that I need to read the data from. How can I do this in the best possible way using Java?
Thanks in advance
Concurrency to the rescue.
To achieve the best throughput for your heavy workload, write your application as multithreaded from the start, then you can speed it up or throttle it back, depending on performance constraints.
ExecutorService is a nice way to break down tasks in a scalable way. I would suggest you define each database-import task as a Callable, and then 'invoke' all the tasks from an ExecutorService.
I'd do something like this:
// Assumes YourCallableImportJobs implements Callable<T> for some result type T.
List<YourCallableImportJobs> work = yourFactory.getAllWork();

// This variable can be used to tweak performance.
// Begin with a low number and then ramp it up if it's too slow.
int nThreads = 10;
ExecutorService service = Executors.newFixedThreadPool(nThreads);
List<Future<T>> futures = service.invokeAll(work);
You can poll the Futures to check when the work is done...
Finally, if you want concurrent access to each database (particularly for your destination database), I recommend using a connection pooling mechanism such as C3P0. This means that you don't spend too much time opening and closing connections. (You could even break down each import into individual queries; this is where connection pooling would help as well.)
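For instance, each import task might look roughly like this; the class name, the DataSource, and the query are assumptions:

// Hypothetical import task: reads one source table through a pooled connection.
class ImportJob implements Callable<Integer> {
    private final DataSource ds;   // e.g. a C3P0 ComboPooledDataSource
    private final String query;    // the SELECT for one source table

    ImportJob(DataSource ds, String query) {
        this.ds = ds;
        this.query = query;
    }

    @Override
    public Integer call() throws Exception {
        int rows = 0;
        try (Connection conn = ds.getConnection();
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(query)) {
            while (rs.next()) {
                rows++;            // process and write to the destination here
            }
        }
        return rows;
    }
}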
Hope this helps
Maintain a queue of database connections with the IP addresses of those databases, and use multithreading to connect to each of the databases. As the work for a database finishes, close the connection to that database and remove it from the queue.
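A rough sketch of that queue-draining approach, assuming a list of JDBC URLs for the databases and placeholder credentials:

// Worker threads drain a queue of database URLs; each worker opens,
// uses, and closes one connection per database.
BlockingQueue<String> urls = new LinkedBlockingQueue<>(jdbcUrls); // one URL per database
ExecutorService pool = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
    pool.submit(() -> {
        String url;
        while ((url = urls.poll()) != null) {   // next database, or null when done
            try (Connection conn = DriverManager.getConnection(url, user, pass)) {
                // read the ~10 tables of this database here
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    });
}
pool.shutdown();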
I'm developing a multi-threaded application in which different threads are required to update the database concurrently. Hence, I passed a new statement object to each thread while creating it (to avoid locking, if I sent a single object). My doubts are:
Is there a limit on the number of statement objects that can be obtained from a single JDBC connection? Would the database connection fail if I create too many statement objects?

If I close each statement properly before its thread dies out, what would be the number of threads that can be spawned at a time (on a system with 512 MB RAM)?

Wouldn't the driver update the database while keeping the data consistent, no matter how many statement objects I use to update the DB in parallel? I use MySQL.
Practically the number of statement objects you would be able to create should suffice your needs. Then again, how much is "too many" in your case?
The number of threads that can be created depends on a lot of factors. Do realize that the threads you create would be "OS level" threads and not hardware threads (assuming you have a dual-core system, that would make it 2 hardware threads, or 4 if hyper-threading is available). Profiling would be of prime importance here to determine how many threads can be created before your system slows down to a crawl.
This would depend on the locking mechanism used by the database. What are you aiming for: high integrity or high performance? Read this.
IMO, you would be better off looking up Connection objects from a connection pool in each of those threads rather than trying to pass around "statement" objects.
Although I am not a Java programmer, sharing a single connection between multiple threads is a bad idea. What happens when two threads try to write to the same socket? So each thread must have its own DB connection.
Yes, the data should be consistent in the DB if many threads are writing at the same time. In any case, you will have to take care in code to manage the transactions correctly, and of course use InnoDB as the storage engine for MySQL, because MyISAM does not support transactions.
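A minimal sketch of that per-thread transaction handling; the URL, credentials, table, and values are made up:

// Each thread: its own connection and explicit transaction boundaries.
try (Connection conn = DriverManager.getConnection(url, user, pass)) {
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(
            "UPDATE accounts SET balance = ? WHERE id = ?")) {
        ps.setBigDecimal(1, newBalance);
        ps.setLong(2, accountId);
        ps.executeUpdate();
        conn.commit();             // make the change visible to other connections
    } catch (SQLException e) {
        conn.rollback();           // undo the partial work on failure
        throw e;
    }
}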
That's probably up to the JDBC implementation, but, in general, just about everything has limits.

Who knows; in practice, probably thousands. However, that many probably won't increase your performance.

Yes, you should be able to share 1 connection across multiple threads; however, many JDBC implementations perform poorly in this scenario. It is better to have a connection per thread (for some reasonable number of connections/threads).
What is a Connection object in JDBC? How is this Connection maintained (I mean, is it a network connection)? Are they TCP/IP connections? Why is it a costly operation to create a Connection every time? Why do these connections become stale after some time, so that I need to refresh the pool? Why can't I use one connection to execute multiple queries?
These connections are TCP/IP connections. To avoid the overhead of creating a new connection every time, there are connection pools that expand and shrink dynamically. You can use one connection for multiple queries. I think you mean that you release it to the pool. If you do that, you might get back the same connection from the pool. In that case it just doesn't matter whether you do one or multiple queries.

The cost of a connection is the connect itself, which takes some time. And the database prepares some things, like sessions, for every connection; that would have to be done every time. Connections become stale for multiple reasons. The most prominent is a firewall in between: connection problems could lead to connection resets, or there could be simple timeouts.
To add to the other answers:
Yes, you can reuse the same connection for multiple queries. This is even advisable, as creating a new connection is quite expensive.
You can even execute multiple queries concurrently. You just have to use a new java.sql.Statement/PreparedStatement instance for every query. Statements are what JDBC uses to keep track of ongoing queries, so each parallel query needs its own Statement. You can and should reuse Statements for consecutive queries, though.
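For example, reusing one PreparedStatement for consecutive queries on the same connection; the DataSource, query, and ids are placeholders:

// One connection, one PreparedStatement reused for consecutive queries.
try (Connection conn = ds.getConnection();
     PreparedStatement ps = conn.prepareStatement(
             "SELECT name FROM users WHERE id = ?")) {
    for (long id : ids) {          // consecutive, not concurrent
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("name"));
            }
        }
    }
}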
The answer to your questions is that they are implementation-defined. A JDBC connection is an interface that exposes methods; what happens behind the scenes can be anything that delivers the interface. For example, consider the Oracle internal JDBC driver, used for supporting Java stored procedures. Simultaneous queries are not only possible on that, they are more or less inevitable, since each request for a new connection returns the one and only connection object. I don't know for sure whether it uses TCP/IP internally, but I doubt it.
So you should not assume implementation details, without being clear about precisely which JDBC implementation you are using.
Since I cannot comment yet, I will post an answer just to comment on Vinegar's answer. The situation with setAutoCommit() returning to its default state upon returning the connection to the pool is not mandatory behaviour and should not be taken for granted; the same goes for the closing of statements and result sets. You may read that they will be closed automatically with the closing of the connection, but don't take that for granted either, since leaving them open will eat up your resources on some versions of JDBC drivers.
We had a serious problem with a DB2 database on AS400: guys needing transactional isolation were calling connection.setAutoCommit(false), and after finishing the job they returned the connection to the pool (JNDI) without calling connection.setAutoCommit(old_state). So when another thread got this connection from the pool, its inserts and updates were not committed, and nobody could figure out why for a long time...
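A defensive pattern for that situation, restoring the flag before the connection goes back to the pool; a sketch, assuming a JNDI-backed dataSource:

Connection conn = dataSource.getConnection();  // borrowed from the pool
boolean oldAutoCommit = conn.getAutoCommit();
try {
    conn.setAutoCommit(false);
    // ... transactional work ...
    conn.commit();
} catch (SQLException e) {
    conn.rollback();
    throw e;
} finally {
    conn.setAutoCommit(oldAutoCommit);         // restore before returning to the pool
    conn.close();                              // hands the connection back to the pool
}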