Because of some previous questions I've had answered about the synchronous nature of MySQL, I'm starting to question why people use connection pools, and whether I should move to a pool in my scenario.
Currently my application keeps a single connection active. There is only a single connection, statement, and result set being used in my application, and it is recycled. All of my database tasks are placed in a queue and executed back to back on a separate thread: one thread for database queries, one connection for database access. In the event that the connection has an issue, the application disposes of the connection and creates a new one.
From my understanding, regardless of how many queries are sent to MySQL to be processed, they will all be processed synchronously in the order they are received. It does not matter whether these queries come from a single source or multiple sources; they are executed in the order received.
With that being said, what's the point in having multiple connections and threads pushing queries into the database's processing queue, when it is going to process them one by one anyway? A query is not going to execute until the query before it has completed processing, and likewise, in my scenario where I'm not using a pool, the next query is not going to be executed until the previous query has completed processing.
Now you may say:
The amount of time spent on processing the results provided by the MySQL query will increase the amount of time between queries being executed.
That's obviously correct, which is why I have a worker thread that handles the results of a query. When a query completes, I convert the results into Map<> format, release the statement/result set from memory, and start processing the next query. The Map<> is sent off to a separate worker thread for processing, so it doesn't congest the query execution thread.
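Roughly, the setup looks like this (a simplified sketch; class and method names here are just illustrative, not my actual code):

// Simplified sketch of the single-connection, single-thread setup described above.
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

class DatabaseWorker implements Runnable {
    private final BlockingQueue<String> queryQueue = new LinkedBlockingQueue<>();
    private final ExecutorService resultWorker = Executors.newSingleThreadExecutor();
    private Connection connection; // the single, recycled connection

    void submit(String sql) { queryQueue.add(sql); }

    public void run() {
        while (true) {
            try {
                String sql = queryQueue.take();                    // next task, in order
                try (Statement st = connection.createStatement();
                     ResultSet rs = st.executeQuery(sql)) {
                    List<Map<String, Object>> rows = toMaps(rs);   // convert before releasing
                    resultWorker.execute(() -> process(rows));     // hand off to the worker thread
                }
            } catch (SQLException e) {
                reconnect();                                       // dispose and create a new connection
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    private List<Map<String, Object>> toMaps(ResultSet rs) throws SQLException { /* ... */ return new ArrayList<>(); }
    private void process(List<Map<String, Object>> rows) { /* consume the results */ }
    private void reconnect() { /* close and re-open the connection */ }
}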
Can anyone tell me if the way I'm doing things is all right, and whether I should take the time to move to a pool of connections rather than a single persistent connection? The most important thing is why. I'm starting this thread strictly for informational purposes.
EDIT: 4/29/2016
I would like to add that I know what a connection pool is; however, I'm more curious about the benefits of using a pool over a single persistent connection when the table locks out requests from all connections during query processing to begin with.
Just trying this Stack Overflow thing out, but:
Most of the time, a connection to a database is idle. When you execute a query on a connection to INSERT into or UPDATE a table, it locks the table, preventing concurrent edits. While this is good, preventing data overwriting or corruption, it means that no other connection may make edits while the first connection/query is still running.
However, starting a new connection takes time, and in larger infrastructures that try to shave off all excess time wastage, this is not good. As such, a connection pool is a whole group of connections left in the idle state, ready for the next query.
Lastly, if you are running a small project, there's usually no reason for a connection pool, but if you are running a large site with UPDATEs and INSERTs flying around every millisecond, a connection pool reduces overhead time.
Slightly related answer:
a pool can do additional "connection health checks" (by examining SQL exception codes) and refresh connections to reduce memory usage (see the note on "maxLifeTime" in the answer). But all those things might not outweigh the simpler approach of using one connection.
Another factor to consider is (blocking) network I/O times. Consider this (rough) scenario:
client prepares query --> client sends data over the network
--> server receives data from the network --> server executes query, prepares results
--> server sends data over the network --> client receives data from the network
--> client prepares resultset
If the database is local (on the same machine as the client) then network times are barely noticeable.
But if the database is remote, network I/O times can become measurable and impact performance.
Assuming the isolation level is at "read committed", running select-statements in parallel could become faster.
In my experience, using 4 connections at the same time instead of 1 generally improves performance (or throughput).
This does depend on your specific situation: if MySQL is indeed just mostly waiting on locks to get released,
adding additional connections will not do much in terms of speed.
And likewise, if the client is single-threaded, the client may not actually perceive any noticeable speed improvements.
This should be easy enough to test though: compare execution times of one program with 1 thread using 1 connection to execute an X amount of select-queries (i.e. re-use your current program) against another program using 4 threads, each thread using 1 separate connection, to execute the same X amount of select-queries divided over the 4 threads (or just run the first program 4 times in parallel).
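A rough sketch of such a comparison (the JDBC URL, credentials and query are placeholders) might look like this:

// Rough benchmark sketch: 1 connection/1 thread vs. 4 connections/4 threads.
import java.sql.*;
import java.util.concurrent.*;

public class PoolVsSingleBenchmark {
    static final String URL = "jdbc:mysql://localhost:3306/test?user=app&password=secret";
    static final String QUERY = "SELECT * FROM some_table WHERE id = 42";
    static final int TOTAL_QUERIES = 10_000;

    public static void main(String[] args) throws Exception {
        System.out.println("1 connection:  " + run(1) + " ms");
        System.out.println("4 connections: " + run(4) + " ms");
    }

    static long run(int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try (Connection con = DriverManager.getConnection(URL);
                     PreparedStatement ps = con.prepareStatement(QUERY)) {
                    for (int i = 0; i < TOTAL_QUERIES / threads; i++) {
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) { /* drain the result set */ }
                        }
                    }
                } catch (SQLException e) { throw new RuntimeException(e); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return (System.nanoTime() - start) / 1_000_000;
    }
}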
One note on the connection pool (like HikariCP): the pool must ensure no transaction remains open when a connection is returned to the pool, and this can mean a "rollback" is sent each time a connection is returned to the pool (closed) when auto-commit is off and no "commit" or "rollback" was sent previously. This in turn can increase network I/O times instead of reducing them. So make sure to test with auto-commit on, or make sure to always send a commit or rollback after your query or set of queries is done.
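For example, a minimal HikariCP setup along those lines might look like this (the JDBC URL, credentials and sizes are placeholders, not recommendations):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Minimal HikariCP sketch; URL, credentials and sizes are placeholders.
class PoolSetup {
    static HikariDataSource createPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/test");
        config.setUsername("app");
        config.setPassword("secret");
        config.setMaximumPoolSize(4);        // a handful of connections is often plenty
        config.setAutoCommit(true);          // avoids the extra rollback when connections are returned
        config.setMaxLifetime(30 * 60_000L); // ms; periodically refresh connections (the "maxLifeTime" note above)
        return new HikariDataSource(config);
    }
}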
Connection pool and persistent connection are not the same thing.
One is about limiting the number of SQL connections; the other is about the issues of a single pipe.
The problem is generally the time taken to transfer the SQL output to the client rather than the query execution time itself. So if you open two CLI SQL clients and fire two queries, one with a large output and one with a small output (in that sequence), the smaller one finishes first while the larger one is still scrolling its output.
The point here is that multiple connections do solve problems for cases like the one above.
When you have multiple front-end requests asking for queries, you may prefer persistent connections because they give you the benefit of multiplexing over different connections (large versus small outputs) and avoid the overhead of session setup/teardown.
Connection pool APIs have built-in error checks and handling, but most APIs still expect you to manually declare whether you want a persistent connection or not.
So in effect there are three variables: pool, persistence and config parameters via the API. One has to mix and match pool size, persistence and number of connections to suit one's environment.
Related
I am working on a Java web application that uses Weblogic to connect to an Informix database. In the application we have multiple threads creating records in a table.
It happens pretty often that it fails and the following error is thrown:
java.sql.SQLException: Could not do a physical-order read to fetch next row....
Caused by: java.sql.SQLException: ISAM error: record is locked.
I am assuming that both threads are trying to insert or update when the record is locked.
I did some research and found that there is an option to tell the database that, instead of throwing an error, it should wait for the lock to be released:
SET LOCK MODE TO WAIT;
SET LOCK MODE TO WAIT 17;
I don't think that there is an option in JDBC to use this setting. How do I go about using this setting in my Java web app?
You can always just send that SQL straight up: use createStatement() and execute that exact SQL.
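A minimal sketch of that (assuming you already have a Connection in hand) might be:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch: send the lock-mode statement right after the connection is opened.
class LockModeSetup {
    static void enableLockWait(Connection connection) throws SQLException {
        try (Statement st = connection.createStatement()) {
            st.execute("SET LOCK MODE TO WAIT 17"); // wait up to 17 seconds for a lock
        }
    }
}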
The more 'normal' / modern approach to this problem is a combination of MVCC, the transaction level 'SERIALIZABLE', retry, and random backoff.
I have no idea if Informix is anywhere near that advanced, though. Modern DBs such as Postgres are (mysql does not count as modern for the purposes of MVCC/serializable/retry/backoff, and transactional safety).
Doing MVCC/Serializable/Retry/Backoff in raw JDBC is very complicated; use a library such as JDBI or JOOQ.
MVCC: A mechanism whereby transactions are shallow clones of the underlying data. 2 separate transactions can both read and write to the same records in the same table without getting in each other's way. Things aren't 'saved' until you commit the transaction.
SERIALIZABLE: A transaction level (also called isolation level), settable with jdbcDbObj.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE); - the safest level. If you know how version control systems work: You're asking the database to aggressively rebase everything so that the entire chain of commits is ordered into a single long line of events: Each transaction acts as if it was done after the previous transaction was completed. The simplest way to implement this level is to globally lock all the things. This is, of course, very detrimental to multithread performance. In practice, good DB engines (such as postgres) are smarter than that: Multiple threads can simultaneously run transactions without just being frozen and waiting for locks; the DB engine instead checks if the things that the transaction did (not just writing, also reading) are conflict-free with simultaneous transactions. If yes, it's all allowed. If not, all but one of the simultaneous transactions throw a retry exception. This is the only level that lets you do this sequence of events safely:
Fetch the balance of isaace's bank account.
Fetch the balance of rzwitserloot's bank account.
subtract €10,- from isaace's number, failing if the balance is insufficient.
add €10,- to rzwitserloot's number.
Write isaace's new balance to the db.
Write rzwitserloot's new balance to the db.
commit the transaction.
Any level less than SERIALIZABLE will silently fail the job; if multiple threads do the above simultaneously, no SQLExceptions occur but the sum of the balance of isaace and rzwitserloot will change over time (money is lost or created – in between steps 1 & 2 vs. step 5/6/7, another thread sets new balances, but these new balances are lost due to the update in 5/6/7). With serializable, that cannot happen.
RETRY: The way smart DBs solve the problem is by failing (with a 'retry' error) all but one transaction, by checking whether any SELECT done by the entire transaction would have been affected by transactions that have been committed to the db after this transaction was opened. If the answer is yes (some selects would have gone differently), the transaction fails. The point of this error is to tell the code that ran the transaction to just.. start from the top and do it again. Most likely this time there won't be a conflict and it will work. The assumption is that conflicts CAN occur but usually do not occur, so it is better to assume 'fair weather' (no locks, just do your stuff), check afterwards, and try again in the exotic scenario that it conflicted, vs. trying to lock rows and tables. Note that for example ethernet works the same way (assume fair weather, recover errors afterwards).
BACKOFF: One problem with retry is that computers are too consistent: If 2 threads get in the way of each other, they can both fail, both try again, just to fail again, forever. The solution is that the threads twiddle their thumbs for a random amount of time, to guarantee that at some point, one of the two conflicting retriers 'wins'.
In other words, if you want to do it 'right' (see the bank account example) but also relatively 'fast' (no global locking), get a DB that can do this and use JDBI or JOOQ; otherwise, you'd have to write code that runs all DB work in a lambda block, catches the SQLException, checks the SqlState to see if it indicates that you should retry (SQLSTATE codes are DB-engine specific), and if so, reruns that lambda after waiting an exponentially increasing amount of time that also includes a random factor. That's fairly complicated, which is why I strongly advise you to rely on JOOQ or JDBI to take care of this for you.
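To illustrate what that hand-rolled approach involves, here is a rough sketch; the SQLSTATE value "40001" (serialization failure) is typical for Postgres and may differ for other engines:

import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;
import javax.sql.DataSource;

// Rough sketch of SERIALIZABLE transaction + retry + random backoff.
class SerializableRetry {
    interface TransactionBody { void run(Connection con) throws SQLException; }

    static void runSerializable(DataSource ds, TransactionBody body) throws Exception {
        long backoffMillis = 10;
        while (true) {
            try (Connection con = ds.getConnection()) {
                con.setAutoCommit(false);
                con.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
                try {
                    body.run(con);
                    con.commit();
                    return;
                } catch (SQLException e) {
                    con.rollback();
                    if (!"40001".equals(e.getSQLState())) throw e; // not a retryable conflict
                }
            }
            // exponential backoff with a random factor so two conflicting retriers don't collide forever
            Thread.sleep(backoffMillis + ThreadLocalRandom.current().nextLong(backoffMillis));
            backoffMillis *= 2;
        }
    }
}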
If you aren't ready for that level of DB usage, just make a statement and send "SET LOCK MODE TO WAIT 17;" as an SQL statement straight up, at the start of opening any connection. If you're using a connection pool, there is usually a place where you can configure SQL statements to be run on connection start.
The Informix JDBC driver does allow you to automatically set the lock wait mode when you connect to the server.
Simply pass the following parameter via the DataSource or connection URL:
IFX_LOCK_MODE_WAIT=17
The values for JDBC are
(-1) Wait forever
(0) not wait (default)
(> 0) wait this many seconds
See https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.jdbc.doc/ids_jdbc_040.htm
Connection conn = DriverManager.getConnection(
    "jdbc:Informix-sqli://cleo:1550:IFXHOST=cleo;PORTNO=1550;user=rdtest;password=my_passwd;IFX_LOCK_MODE_WAIT=17;");
I've seen two ways to deal with database connections:
1) Connection pool
2) Bind connection to a thread (when we have fixed and constant threads count)
But I don't understand what the purpose of using #2 is. What are the advantages of the second behaviour over the first one?
If you're working with a single thread or a very small set of threads (that need database functionality), binding a connection to a thread would act like a poor man's connection pool. Instead of checking out a connection from the pool every time you use it, you would just use the single connection bound to the thread. This would allow for quick execution of database queries, even with code that hasn't been very well designed.
However in many cases you're not working with a single thread or a small set of threads. As soon as you're developing an application with even dozens of simultaneous users, you're better off working with a connection pool as it will become impossible to dedicate a connection to every thread (see next paragraph).
Some people also have the misunderstanding that a connection pool can and should have a lot of connections (100 or more), even though it's often more advantageous to have fewer. Since all of the connections use the database's resources, the effect is similar to having a store with a single cash register. It's not more efficient to have 10 doors to the store instead of 1, since it will just fill up with customers but the payments won't happen any faster.
I have a MySQL database with ~8,000,000 records. Since I need to process them all, I use a BlockingQueue whose Producer reads from the database and puts 1000 records in the queue. The Consumer is the processor that takes records from the queue.
I am writing this in Java; however, I'm stuck on figuring out how I can (in a clean, elegant way) read from my database and 'suspend' reading once the BlockingQueue is full. After this, control is handed to the Consumer until there are free spots available again in the BlockingQueue. From there on, the Producer should continue reading records from the database.
Is it clean/elegant/efficient to keep my database connection open so it can read continuously? Or should I, once control shifts from Producer to Consumer, close the connection, store the id of the record read so far, and later open the connection again and start reading from that id? The latter does not seem really good to me since my database will have to open/close connections a lot! However, the former is not so elegant in my opinion either.
With persistent connections:
You cannot build transaction processing effectively
User sessions on the same connection are impossible
The applications are not scalable.
With time you may need to extend it and it will require management/tracking of persistent connections
If the script, for whatever reason, could not release the lock on the table, then any following scripts will block indefinitely and one should restart the db server.
When using transactions, a transaction block will also carry over to the next script (using the same connection) if script execution ends before the transaction block completes, etc.
Persistent connections do not give you any functionality that you couldn't get with non-persistent connections.
Then why use them at all?
The only possible reason is performance: use them when the overhead of creating a link to your MySQL server is high. And that depends on many factors, like:
Database type
Whether the MySQL server is on the same machine and, if not, how far away it is (it might even be outside your local network/domain)
How heavily the machine on which MySQL sits is loaded by other processes
One can always replace persistent connections with non-persistent connections. It might change the performance of the script, but not its behavior!
Commercial RDBMSs might be licensed by the number of concurrently open connections, and here persistent connections can do you a disservice.
If you are using a bounded BlockingQueue by passing a capacity value in the constructor, then the producer will block when it attempts to call put() until the consumer removes an item by calling take().
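A rough producer/consumer sketch along those lines (the query and record type are placeholders) might be:

import java.sql.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough sketch: a bounded queue makes the producer block in put() when it is full,
// so the single database connection can simply stay open while reading.
public class RecordPipeline {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

    void produce(Connection con) throws SQLException, InterruptedException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT payload FROM records")) { // placeholder query
            while (rs.next()) {
                queue.put(rs.getString(1)); // blocks while the queue is full
            }
        }
    }

    void consume() throws InterruptedException {
        while (true) {
            String record = queue.take(); // blocks while the queue is empty
            // process(record) ...
        }
    }
}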
It would help to know more about when or how the program is going to execute in order to decide how to deal with database connections. Some easy choices are: have the producer and all consumers get an individual connection, have a connection pool for all consumers while the producer holds its own connection, or have all producers and consumers use a connection pool.
You can minimize the number of connections by using something such as Spring to manage your connection pool and transactions; however, that is only necessary in some execution situations.
What are the common guidelines/advice for configuring, in Java, an HTTP connection pool to support a huge number of concurrent HTTP calls to the same server? I mean:
max total connections
max default connection per route
reuse strategy
keep alive strategy
keep alive duration
connection timeout
....
(I am using Apache http components 4.3, but I am available to explore new solutions)
In order to be more clear, this is my situation:
I developed a REST resource that needs to perform about 10 http calls to AWS CloudSearch in order to obtain search results to be collected in a final result (that I really cannot obtain through a single query).
The whole operation must take less than 0.25 seconds, so I run the HTTP calls in parallel in 10 different threads.
During a benchmarking test, I noticed that with few concurrent requests (5), my objective is reached. But when increasing concurrent requests to 30, there is a tremendous degradation of performance due to the connection time, which takes about 1 second. With few concurrent requests, instead, the connection time is about 150 ms (to be more precise, the first connection takes 1 second; all the following connections take about 150 ms). I can ensure that CloudSearch returns its response in less than 15 ms, so there is a problem somewhere in my connection pool.
Thank you!
The number of threads/connections that is best for your implementation depends on that implementation (which you did not post), but here are some guidelines as requested:
If those threads never block at all, you should have as many threads as cores (Runtime.getRuntime().availableProcessors(); this includes hyper-threaded cores), simply because more than 100% CPU usage isn't possible.
If your threads rarely block, cores * 2 is a good start for benchmarking.
If your threads frequently block, you absolutely need to benchmark your application with various settings to find the best solution for your implementation, OS and hardware.
Now the most optimal case is obviously the first one, but to get to this one, you need to remove blocking from your code as much as you can. Java can do this for IO operations if you use the NIO package in non-blocking mode (which is not how the Apache package does it).
Then you have 1 thread that waits on a selector and wakes up as soon as any data is ready to be sent or read. This thread then only copies the data from its source to its destination and returns to the selector. In the case of a read (incoming data), this destination is a blocking queue, on which as many threads as there are cores wait. One of those threads then pulls out the received data and processes it, now without any blocking.
You can then use the length of the blocking queue to adjust how many parallel requests are reasonable for your task and hardware.
The first connection takes >1 second because it actually has to look up the address via DNS. All other connections are put on hold for the moment, as there is no sense in doing this twice. You can circumvent that either by calling the IP directly (probably not good if you talk to a load balancer) or by "warming up" the connections with an initial request. Any new connection afterwards will use the cached DNS result but still needs to perform other initializations, so reusing connections as much as you can will reduce latency a lot. With NIO this is a very easy task.
In addition there are HTTP multi-requests: you make one connection but request several URLs over it and get several responses over "the same line". This massively reduces connection overhead, but it needs to be supported by the server.
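As a concrete starting point for the settings listed in the question, a minimal HttpComponents 4.3 configuration might look like this; the numbers are placeholders to benchmark against, not recommendations:

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Minimal HttpComponents 4.3 pool sketch.
class SearchHttpClientFactory {
    static CloseableHttpClient createClient() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(30);                // max total connections
        cm.setDefaultMaxPerRoute(30);      // all calls go to the same host, so per-route == total

        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(200)    // ms to establish the connection
                .setSocketTimeout(200)     // ms to wait for data
                .build();

        return HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultRequestConfig(requestConfig)
                .build();
    }
}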
I have an application that processes lots of data in files and puts this data into a database. It has been single threaded; so I create a database connection, create prepared statements on that connection, and then reuse these statements while processing the data. I might process thousands of files and can reuse the same prepared statements over and over but only updating the values. This has been working great, however ...
It has come to the point where it is taking too long to process the files, and since they are all independent, I'd like to process them concurrently. The problem is that each file might use, say, 10 prepared statements. So now for each file I'm making a new database connection (even though they are pooled), setting up these 10 prepared statements, and then closing them and the connection down for each file; so this is happening thousands and thousands of times instead of just a single time before.
I haven't actually done any timings but I'm curious if this use of connections and prepared statements is the best way? Is it really expensive to set up these prepared statements over and over again? Is there a better way to do this? I've read that you don't want to share connections between threads but maybe there's a better solution I haven't thought of?
if this use of connections and prepared statements is the best way? Is it really expensive to set up these prepared statements over and over again?
You can reuse the connections and prepared statements over and over again, for sure. You do not have to re-create them, and for the connections, you certainly do not have to reconnect to the database server every time. You should be using a database connection pool at the very least. Also, you cannot use a prepared statement in multiple threads at the same time. And I also think that for most database connections, you cannot use the same connection in different threads.
That said, it might make sense to do some profiler runs because threading database code typically provides minimal speed increase because you are often limited by the database server IO and not by the threads. This may not be true if you are mixing queries and inserts and transactions. You might get some concurrency if you are making a remote connection to a database.
To improve the speed of your database operations, consider turning off auto-commit before a large number of transactions, or otherwise batch up your requests if you can.
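For example, a sketch of batching with auto-commit turned off might look like this (the table and columns are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch: reuse one prepared statement per thread, batch the inserts, commit once per file.
class FileInserter {
    static void insertFile(Connection con, List<String[]> rows) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO measurements (name, value) VALUES (?, ?)")) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();   // send the whole batch in one round trip
            con.commit();        // one commit per file instead of one per row
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}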
I advise you to use the c3p0 API. Check it out: http://www.mchange.com/projects/c3p0/
Enhanced performance is the purpose of Connection and Statement pooling, especially if you are acquiring an unpooled Connection for each client access; this is the major goal of the c3p0 library.
This part is taken from C3P0 Doc about threads and heavy load:
numHelperThreads and maxAdministrativeTaskTime help to configure the behavior of DataSource thread pools. By default, each DataSource has only three associated helper threads. If performance seems to drag under heavy load, or if you observe via JMX or direct inspection of a PooledDataSource, that the number of "pending tasks" is usually greater than zero, try increasing numHelperThreads. maxAdministrativeTaskTime may be useful for users experiencing tasks that hang indefinitely and "APPARENT DEADLOCK" messages.
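A small sketch of that configuration (the JDBC URL, credentials and numbers are placeholders) might be:

import com.mchange.v2.c3p0.ComboPooledDataSource;

// Sketch of a c3p0 pooled DataSource.
class C3p0Setup {
    static ComboPooledDataSource createPooledDataSource() {
        ComboPooledDataSource cpds = new ComboPooledDataSource();
        cpds.setJdbcUrl("jdbc:mysql://localhost:3306/test");
        cpds.setUser("app");
        cpds.setPassword("secret");
        cpds.setMaxPoolSize(10);
        cpds.setNumHelperThreads(6);            // raise this if "pending tasks" stays above zero
        cpds.setMaxAdministrativeTaskTime(60);  // seconds; guards against tasks that hang
        return cpds;
    }
}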
In addition, I recommend you use Executor and ExecutorService (in java.util.concurrent) to pool your threads.
It looks like the following:
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreadsNeeded);
// ExecutorService executor = Executors.newCachedThreadPool(); // Or this one
executor.execute(runnable);
// ... submit more tasks, then shut the pool down when finished
executor.shutdown();