I am developing an HTTP application server using Netty 4 and JDBC (with BoneCP for connection pooling).
So far, I am doing all the work (everything involving database connections, HttpAsyncClient and so on) in one handler, and I close all I/O after each job is finished.
As far as I know, Netty performs well as long as nothing is blocking the worker thread.
However, I read that JDBC connections perform blocking I/O.
Is there a good practice to use JDBC with Netty to improve scalability and performance?
As you may know, Netty provides EventExecutorGroup to run handlers on a separate thread pool. Blocking calls (e.g. JDBC operations) should be done on those threads instead of the one running the event loop, so that the main event loop is not blocked and stays responsive.
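A minimal sketch of that wiring, assuming Netty 4's DefaultEventExecutorGroup (JdbcHandler is a hypothetical handler doing the blocking JDBC work):

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class ServerInitializer extends ChannelInitializer<SocketChannel> {

    // Size the group roughly to the DB pool: more threads than pooled
    // connections just means more threads blocked waiting for one.
    private static final EventExecutorGroup JDBC_EXECUTOR =
            new DefaultEventExecutorGroup(16);

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline().addLast(new HttpServerCodec());
        // Handlers added with a separate EventExecutorGroup run on that
        // group's threads, so blocking JDBC calls inside them do not
        // stall the event loop. JdbcHandler is a placeholder.
        ch.pipeline().addLast(JDBC_EXECUTOR, "jdbc", new JdbcHandler());
    }
}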
Make sure you have enough database connections: obviously your workers will block waiting for a connection if your pool runs out, either waiting for a new connection to be created (if the pool is allowed to grow) or waiting for one to be returned. Otherwise, use general best practices: tune your reads with setFetchSize() and your writes by using batching, minimize your round trips, and fetch only the data you need. Do you have specific code (or a query) that is slow?
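To illustrate the fetch-size and batching advice, a small sketch (con, pending, Event and the table are placeholder names):

// Reads: stream rows in chunks instead of materializing everything at once.
try (PreparedStatement ps = con.prepareStatement(
        "SELECT id, payload FROM events")) {
    ps.setFetchSize(500);                 // hint: rows per round trip
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { /* process each row */ }
    }
}

// Writes: batch many statements into a single round trip.
con.setAutoCommit(false);
try (PreparedStatement ps = con.prepareStatement(
        "INSERT INTO events (id, payload) VALUES (?, ?)")) {
    for (Event e : pending) {             // Event/pending are placeholders
        ps.setLong(1, e.id);
        ps.setString(2, e.payload);
        ps.addBatch();
    }
    ps.executeBatch();
}
con.commit();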
Related
If we use a connection pooling framework or the Tomcat JDBC pool, how costly is it to open and close a DB connection?
Is it a good practice to frequently open and close the DB connection whenever DB operations are required?
Or can the same connection be carried across different methods for DB operations?
A JDBC connection goes over the network, usually via TCP/IP and optionally with SSL. You can read this post to find out why establishing one is expensive.
You can use a single connection across multiple methods for different DB operations, because for each DB operation you create a new Statement to execute anyway.
Connection pooling avoids the overhead of creating Connections during a request and should be used whenever possible. Hikari is one of the fastest.
The answer is: it's almost always recommended to re-use DB connections. That's the whole reason why connection pools exist, and not only for performance but also for DB stability. For instance, if you don't limit the number of connections and mistakenly open hundreds of them, the DB might go down. And if DB connections don't get closed for some reason (an out-of-memory error, a shutdown, an unhandled exception, etc.), you would have a bigger issue: not only would this affect your application, it could also drag down other services using the shared DB. A connection pool contains such catastrophes.
What people don't realize is that behind a simple ORM API there are often hundreds of raw SQL statements. Imagine running those SQLs without a connection pool: we are talking about a very large overhead.
I couldn't fathom running a commercial DB application without using Connection Pools.
Some good resources on this topic:
https://www.cockroachlabs.com/blog/what-is-connection-pooling/
https://stackoverflow.blog/2020/10/14/improve-database-performance-with-connection-pooling/
Whether the maintenance (opening, closing, testing) of the database connections in a connection pool affects the working performance of the application depends on the implementation of the pool and, to some extent, on the underlying hardware.
A pool can be implemented to run in its own thread, or to initialise all connections during startup (of the container), or both. If the hardware provides enough cores, the working thread (the "business payload") will not be affected by the activities of the pool at all.
Other connection pools are implemented to create a new connection only on demand (a connection is requested, but currently there is none available in the pool) and within the thread of the caller. In this case, the creation of that connection reduces the performance of the working thread – this time! It should not happen too often, otherwise your application needs too many connections and/or does not return them fast enough.
But whether you really need a database connection pool at all depends on the kind of application you have!
If we talk about a typical server application that is intended to run forever and to serve a permanently changing crowd of multiple clients at the same time, it will definitely benefit from a connection pool.
If we talk about a tool type application that starts, performs a more or less linear task in a defined amount of time, and terminates when done, then using a connection pool for the database connection(s) may cause more overhead than it provides advantages. For such an application it might be better to keep the connection open for the whole runtime.
From the RDBMS's point of view, the two make no difference: in both cases the connections are seen as open.
If performance is a key parameter, it's better to switch to the Hikari connection pool. If you are using Spring Boot, then Hikari is used by default and you do not need to add any dependency. The beautiful thing about the Hikari connection pool is that its entire lifecycle is managed for you and you do not have to do anything.
Also, it is always recommended to close the connection so that it returns to the pool and other threads can use it, especially in multi-tenant environments. The best way to do this is with "try with resources", so the connection is always closed:
try (Connection con = dataSource.getConnection()) {
    // your code here
}
To create your data source you can pass the credentials, for example (using Spring Boot's DataSourceBuilder):
DataSource dataSource = DataSourceBuilder.create()
.driverClassName(JDBC_DRIVER)
.url(url)
.username(username)
.password(password)
.build();
Link: https://github.com/brettwooldridge/HikariCP
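Outside Spring Boot you can configure Hikari directly; a minimal sketch (pool size and credentials are illustrative):

import javax.sql.DataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl(url);
config.setUsername(username);
config.setPassword(password);
config.setMaximumPoolSize(10);  // size to your workload, not "as big as possible"
DataSource dataSource = new HikariDataSource(config);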
If you want to know the answer in your case, just write two implementations (one with a pool, one without) and benchmark the difference.
Exactly how costly it is depends on so many factors that it is hard to tell without measuring.
But in general, a pool will be more efficient.
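A rough way to measure it yourself; a sketch, where url/user/pass and pooledDataSource stand in for your own setup:

// Unpooled: every iteration pays the full TCP + authentication handshake.
long t0 = System.nanoTime();
for (int i = 0; i < 100; i++) {
    try (Connection c = DriverManager.getConnection(url, user, pass)) { }
}
System.out.println("unpooled: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

// Pooled: after warm-up, getConnection() is mostly a cheap hand-off.
long t1 = System.nanoTime();
for (int i = 0; i < 100; i++) {
    try (Connection c = pooledDataSource.getConnection()) { }
}
System.out.println("pooled: " + (System.nanoTime() - t1) / 1_000_000 + " ms");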
The cost is always a question of impact.
Consider the following environment: a web application where each UI transaction (a user click) occupies a thread on the web server, and that thread is coupled to one connection/thread on the database. Now scale the request rate:
10 connections per 60000 ms (1 min), i.e. about 0.167 connections/s
10 connections per 1000 ms (1 s) => 10 connections/s
10 connections per 100 ms (0.1 s) => 100 connections/s
10 connections per 10 ms (0.01 s) => 1000 connections/s
I have worked in even bigger environments. And believe me: the more you exceed 100 conn/s by factors of 10, the more pain you will feel without a clean connection pool.
The more connections you create per second, the higher the latency and the bigger the impact on the database. And the more bandwidth you eat recreating, over and over, a new "water pipeline" just to drop a few drops of water from one side to the other.
Now getting back to the question: fetching an existing connection from a connection pool is a matter of microseconds, or at most a few milliseconds. Taken on its own, that is no real impact at all.
If there is a network in between, creating a new connection instead will take in the order of 10¹ to 10² ms.
Also consider the impact on your web server: each user blocks a thread, memory and a network connection, so this load hits the web server too. In high-load environments (e.g. a reverse proxy such as Apache in front of Tomcat, or Tomcat alone) you typically run into web-server thread-pool issues if connections get exhausted or take too long (10¹ to 10² ms) to create.
Now consider the database as well.
Each open connection is typically mapped to a thread on the DB, so the DB can use thread-based caches for prepared statements and reuse pre-calculated access plans, which makes access to the data very fast.
You lose this option if you have to recreate the connection over and over again.
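This is why reusing a prepared statement on a long-lived connection pays off; a sketch (the users table and ids collection are placeholders):

// Prepare once, execute many times: the DB can reuse its parsed
// statement and access plan across executions on this connection.
try (PreparedStatement ps = con.prepareStatement(
        "SELECT name FROM users WHERE id = ?")) {
    for (long id : ids) {
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) { /* use rs.getString("name") */ }
        }
    }
}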
But as said, at up to 10 connections per second you should not face any bigger issue without a connection pool, except for the additional delay on the first access to the DB.
If you get into higher levels, you will have to manage your resources better and avoid any useless I/O delay like recreating connections.
Experience hints:
Using a connection pool costs you nothing. Whenever I saw issues with a connection pool in my performance-tuning projects, it was a matter of bad configuration.
You can configure:
a connection check that validates each connection (use a real SQL statement that accesses a real DB field), so that on every new access the connection gets checked and, if defective, kicked out of the pool
a maximum lifetime for connections, so that you get a new connection after a defined time
All of this together ensures that even if your admins are doing crap and do not inform you (killing connections/threads on the DB), the pool gets rebuilt quickly and the impact stays very low. Read the docs of your connection pool; a sketch of these two settings follows below.
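With HikariCP, for example, the two hints above map to settings like these (the values are illustrative, not recommendations):

import com.zaxxer.hikari.HikariConfig;

HikariConfig config = new HikariConfig();
// Validation query run on connections before they are handed out
// (modern drivers can use JDBC4 isValid() instead).
config.setConnectionTestQuery("SELECT 1");
// Retire each connection after 30 minutes so killed/stale ones
// get replaced automatically.
config.setMaxLifetime(30 * 60 * 1000L);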
Is one connection pool better than the other?
A clear no. It only starts to matter if you get into the high end, into distributed environments/clusters, or into cloud-based environments. If you already have a connection pool and it is still maintained, stick with it and become a pro on your connection pool settings.
I have read a lot of material trying to clearly understand the gains that a non-blocking web application server like Jetty can or can't offer.
So far what I understand (in part by referring to this: How do Jetty and other containers leverage NIO while sticking to the Servlet specification?) is that with a non-blocking I/O model, a web server like Jetty runs a single selector thread (or one per CPU core) that determines which connections are ready for some I/O. Connections with I/O ready are dispatched to an internal thread pool that processes the request.
I can see how such an architecture could allow you to serve many more connections with far fewer resources. However, what I am not clear about is this:
If I wrote a servlet that ran a long-running database operation using a standard JDBC driver performing blocking I/O, wouldn't the handler thread dispatched from the pool to handle this request block?
And if requests came in faster than database requests were fulfilled, wouldn't the handler thread pool be exhausted at some point?
So with an application such as this, is there any benefit to running on a non-blocking Jetty web server? Is the non-blocking benefit only truly accrued if the servlet itself uses another layer of non-blocking access to the database? Or is there something I am missing?
Please do explain if there's some magic through which Jetty pays less of a price for blocking database operations than, say, a blocking web server.
P.S.: For contrast, I read about Node.js here - How the single threaded non blocking IO model works in Node.js - which seems to suggest that Node uses libuv underneath and applies other techniques to translate all blocking operations in code (such as database access and sleep()) into event callbacks, ensuring the event loop and the internal thread pool never get stuck in a blocking call. It's still a little gobbledygook to me, but assuming that's true for Node, can Jetty promise the same? That too for servlets etc. that are not written in a non-blocking way?
I have implemented a server in Java; upon receiving data from some client, it simply forwards the data to all other clients (including the sender). I'm happy with my OO design: I wrap all sockets in classes that provide 'callbacks', which are called when some data is ready (or when the socket closes). Using this design I could easily implement a simple TLV protocol to receive packets atomically: the callback is not called until a full packet is received.
Now, I use the blocking I/O calls of the java.io package on the socket streams (and make them appear 'asynchronous' through those callbacks). So I use threads inside my socket wrapper classes: when a socket is opened, that function returns a Runnable implementation that, when run, makes the blocking calls to the InputStream, buffers data and eventually calls the callback.
=> In a client application, I simply launch this Runnable in a Thread instance, because it's just one thread.
=> In my server, I submit all Runnable implementations I get upon creating new sockets (i.e. when accepting new clients) to a ThreadPoolExecutor. (FYI: the callbacks of the sockets simply put the received packets in a BlockingQueue. A single, separate (non-pooled) "dispatcher" Thread instance constantly takes the packets from this queue and writes them to all sockets currently connected to the server.)
QUESTION: This all works great; however, I'm unsure about my use of the ThreadPoolExecutor, because the Runnable instances submitted are almost always blocking. Will the ThreadPoolExecutor react to this, or will the pooled threads simply block? Because if all pooled threads are blocking while executing their Runnable and a new Runnable is then submitted, what happens? Suspend the new Runnable? That's not good, because then the newly connected client would have zero responsiveness until some older client disconnects. If, by contrast, the thread pool chooses to spawn a new thread to handle the Runnable, then I actually get a thread-per-client scenario.
I want the thread pool to 'preempt' the blocking threads and use them to handle other sockets, like an operating system that suspends I/O-bound processes and doesn't schedule them again until their I/O is complete. Is that at all possible, or will I have to rewrite everything using NIO in order to do this? (If NIO is required, could you point out where I should start reading?)
Thanks in advance!
About the ThreadPoolExecutor: it depends. An Executors.newCachedThreadPool() will just create new threads for new Runnables (see also this question and the accepted answer), but you will end up with a thread-per-client scenario.
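To illustrate the difference, a sketch (the pool size is arbitrary):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Cached pool: reuses an idle thread if one exists, otherwise creates a
// new one. With always-blocking Runnables this degenerates into one
// thread per client.
ExecutorService cached = Executors.newCachedThreadPool();

// Fixed pool: bounded at 32 threads. If all of them are blocked, newly
// submitted Runnables wait in the queue, so new clients get no service
// until an older client disconnects.
ExecutorService fixed = Executors.newFixedThreadPool(32);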
NIO prevents the thread-per-client scenario (if there are many clients sending relatively small messages with pauses in between; see also (the summary of) this article). I advise against trying to build your own NIO clone.
Implementing NIO from the ground up is not easy; a tutorial can be found here. It might be easier to use an NIO server framework like Netty.
Another alternative is to use a technology designed to handle many clients that send and receive small messages. It takes some time to learn and set up, but I managed to get a Tomcat WebSockets server talking to a Jetty WebSocket client pretty quickly. A rewrite to use this technology could be less work.
Fundamentally, this question is about: can the same DB connection be used across multiple processes (as different map-reduce jobs really are different, independent processes)?
I know this is a slightly trivial question, but it would be great if somebody could answer this as well: what happens if the maximum number of connections to the DB (which is preconfigured on the server hosting the DB) has been exhausted and a new process tries to get a new connection? Does it wait for some time, and if so, is there a way to set a timeout for this wait period? I am talking about a Postgres DB in this particular case, and the language used to talk to the DB is Java.
To give you some context: I have multiple map-reduce jobs (about 40 reducers) running in parallel, each wanting to update a Postgres DB. How do I efficiently manage these DB reads/writes from these processes? Note: the DB is hosted on a separate machine, independent of where the map-reduce jobs run.
Connection pooling is one option, but it can be very inefficient at times, especially at several reads/writes per second.
Can the same DB connection be used across multiple processes
No, not in any sane or reliable way. You could use a broker process, but then you'd be one step away from reinventing a connection pool anyway.
What happens in case the maximum number of connections to the DB (which is preconfigured on the server hosting the DB) has been exhausted and a new process tries to get a new connection?
The connection attempt fails with SQLSTATE 53300 too_many_connections. If it waited, the server could exhaust other limits and begin to have issues servicing existing clients.
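From Java you can detect this case through the SQLState of the exception; a sketch (url/user/pass are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

try (Connection con = DriverManager.getConnection(url, user, pass)) {
    // ... do the work ...
} catch (SQLException e) {
    if ("53300".equals(e.getSQLState())) {
        // too_many_connections: back off and retry, or fail fast
    } else {
        throw e;
    }
}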
For a problem like this you'd usually use tools like C3P0 or DBCP that do in-JVM pooling, but this won't work when you have multiple JVMs.
What you need to do is to use an external connection pool like PgBouncer or PgPool-II to maintain a set of lightweight connections from your workers. The pooler then has a smaller number of real server connections and shares those between the lightweight connections from clients.
Connection pooling is typically more efficient than not pooling, because it allows you to optimise the number of active PostgreSQL worker processes to the hardware and workload, providing admission control for work.
An alternative is to have a writer process with one or more threads (one connection per thread) that takes finished work from the reduce workers and writes to the DB, so the reduce workers can get on with their next unit of work. You'd need a way to tell the reduce workers to wait if the writer gets too far behind. There are several Java queueing system implementations that would be suitable for this, or you could use JMS.
See IPC Suggestion for lots of small data
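A minimal in-JVM sketch of that pattern, using a bounded queue for the back-pressure (the CSV-line payload is a placeholder for your unit of work):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded queue: put() blocks the reducers when the writer falls behind.
BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

// Reducer side (blocks when the queue is full):
//   queue.put(csvLine);

// Single writer thread, owning the one DB connection:
Thread writer = new Thread(() -> {
    try {
        while (true) {
            String line = queue.take();
            // accumulate lines into a batch and write it to PostgreSQL
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // shut down cleanly
    }
});
writer.start();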
It's also worth optimizing how you write to PostgreSQL as much as possible (a small bulk-load sketch follows the list), using:
Prepared statements
A commit_delay
synchronous_commit = 'off' if you can afford to lose a few transactions if the server crashes
Batching work into bigger transactions
COPY or multi-valued INSERTs to insert blocks of data
Decent hardware with a useful disk subsystem, not some Amazon EC2 instance with awful I/O or a RAID 5 box with 5400rpm disks
A proper RAID controller with battery backed write-back cache to reduce the cost of fsync(). Most important if you can't do big batches of work or use a commit delay; has less impact if your fsync rate is lower because of batching and group commit.
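For instance, bulk loading through the PostgreSQL JDBC driver's CopyManager; a sketch (the results table and CSV payload are placeholders):

import java.io.StringReader;
import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// COPY moves blocks of rows in one stream instead of one INSERT per row.
CopyManager copy = con.unwrap(PGConnection.class).getCopyAPI();
long rows = copy.copyIn(
        "COPY results (id, value) FROM STDIN WITH (FORMAT csv)",
        new StringReader("1,foo\n2,bar\n"));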
See:
http://www.postgresql.org/docs/current/interactive/populate.html
http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/
We have a web application that generates some 3-5 parallel threads every five seconds to connect to a JMS/JNDI connection pool. We wait for the first batch of parallel threads to complete before creating the next batch of parallel threads. During this process we use a lot of network traffic, and the connection threads just hang. Eventually we have to manually ask the operations team to kill the connection threads to free up connections.
The questions I wanted to ask you are:
Obviously we are doing something wrong, as we are holding up connection resources.
When we wait for the parallel threads to respond before sending the second batch of requests, does this design not resonate with industry best practices?
Finally, what options and recommendations do you have for this scenario, i.e. multiple threads connecting to a JMS/JNDI connection?
Thanks for your inputs
You need to adjust your connection pool parameters. It sounds like you're using only 3-5 connections for your service, which seems very reasonable to me. A JMS service should be able to handle thousands of connections. Either your pool's default limit is too low, or your JMS server is configured to allow too few connections.
Are you sure that's what the other users are blocking on? It seems strange to me.
I'm almost sure that you would be all right with a single connection factory. Just make sure you clean up/close sessions properly. We use Spring's SingleConnectionFactory.
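A minimal sketch of that setup (the target factory would come from your JMS vendor or a JNDI lookup):

import javax.jms.ConnectionFactory;
import org.springframework.jms.connection.SingleConnectionFactory;

// Wrap the vendor/JNDI factory so all sessions share one connection.
ConnectionFactory target = lookupJmsConnectionFactory();  // placeholder
SingleConnectionFactory factory = new SingleConnectionFactory(target);
factory.setReconnectOnException(true);  // recreate the shared connection on errors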