Passing around DB connection objects between multiple map-reduce jobs

Passing around DB connection objects between multiple map-reduce jobs - java

Fundamentally, this question is about: Can the same DB connection be used across multiple processes (as different map-reduce jobs are in real different independent processes).
I know that this is a little trivial question but it would be great if somebody can answer this as well: What happens in case if the maximum number of connections to the DB(which is preconfigured on the server hosting the DB) have exhausted and a new process tries to get a new connection? Does it wait for sometime, and if yes, is there a way to set a timeout for this wait period. I am talking in terms of a PostGres DB in this particular case and the language used for talking to the DB is java.
To give you a context of the problem, I have multiple map-reduce jobs (about 40 reducers) running in parallel, each wanting to update a PostGres DB. How do I efficiently manage these DB read/writes from these processes. Note: The DB is hosted on a separate machine independent of where the map reduce job is running.
Connection pooling is one option but it can be very inefficient at times especially for several reads/writes per second.

Can the same DB connection be used across multiple processes
No, not in any sane or reliable way. You could use a broker process, but then you'd be one step away from inventing a connection pool anyway.
What happens in case if the maximum number of connections to the
DB(which is preconfigured on the server hosting the DB) have exhausted
and a new process tries to get a new connection?
The connection attempt fails with SQLSTATE 53300 too_many_connections. If it waited, the server could exhaust other limits and begin to have issues servicing existing clients.
For a problem like this you'd usually use tools like C3P0 or DBCP that do in-JVM pooling, but this won't work when you have multiple JVMs.
What you need to do is to use an external connection pool like PgBouncer or PgPool-II to maintain a set of lightweight connections from your workers. The pooler then has a smaller number of real server connections and shares those between the lightweight connections from clients.
Connection pooling is typically more efficient than not pooling, because it allows you to optimise the number of active PostgreSQL worker processes to the hardware and workload, providing admission control for work.
An alternative is to have a writer process with one or more threads (one connection per thread) that takes finished work from the reduce workers and writes to the DB, so the reduce workers can get on to their next unit of work. You'd need to have a way to tell the reduce workers to wait if the writer got too far behind. There are several Java queueing system implementations that would be suitable for this, or you could use JMS.
See IPC Suggestion for lots of small data
It's also worth optimizing how you write to PostgreSQL as much as possible, using:
Prepared statements
A commit_delay
synchronous_commit = 'off' if you can afford to lose a few transactions if the server crashes
Batching work into bigger transactions
COPY or multi-valued INSERTs to insert blocks of data
Decent hardware with a useful disk subsystem, not some Amazon EC2 instance with awful I/O or a RAID 5 box with 5400rpm disks
A proper RAID controller with battery backed write-back cache to reduce the cost of fsync(). Most important if you can't do big batches of work or use a commit delay; has less impact if your fsync rate is lower because of batching and group commit.
See:
http://www.postgresql.org/docs/current/interactive/populate.html
http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/

Related

How costly is opening and closing of a DB connection in Connection Pool?

If we use any connection pooling framework or Tomcat JDBC pool then how much it is costly to open and close the DB connection?
Is it a good practice to frequently open and close the DB connection whenever DB operations are required?
Or same connection can be carried across different methods for DB operations?

Jdbc Connection goes through the network and usually works over TCP/IP and optionally with SSL. You can read this post to find out why it is expensive.
You can use a single connection across multiple methods for different db operations because for each DB operations you would need to create a Statement to execute.
Connection pooling avoids the overhead of creating Connections during a request and should be used whenever possible. Hikari is one of the fastest.

The answer is - its almost always recommended to re-use DB Connections. Thats the whole reason why Connection Pools exist. Not only for the performance, but also for the DB stability. For instance, if you don't limit the number of connections and mistakenly open 100s of DB connections, the DB might go down. Also lets say if DB connections don't get closed due to some reason (Out of Memory error / shut down / unhandled exception etc), you would have a bigger issue. Not only would this affect your application but it could also drag down other services using the common DB. Connection pool would contain such catastrophes.
What people don't realize that behind the simple ORM API there are often 100s of raw SQLs. Imagine running these sqls independent of connection pools - we are talking about a very large overhead.
I couldn't fathom running a commercial DB application without using Connection Pools.
Some good resources on this topic:
https://www.cockroachlabs.com/blog/what-is-connection-pooling/
https://stackoverflow.blog/2020/10/14/improve-database-performance-with-connection-pooling/

Whether the maintenance (opening, closing, testing) of the database connections in a DBConnection Pool affects the working performance of the application depends on the implementation of the pool and to some extent on the underlying hardware.
A pool can be implemented to run in its own thread, or to initialise all connections during startup (of the container), or both. If the hardware provides enough cores, the working thread (the "business payload") will not be affected by the activities of the pool at all.
Other connection pools are implemented to create a new connection only on demand (a connection is requested, but currently there is none available in the pool) and within the thread of the caller. In this case, the creation of that connection reduces the performance of the working thread – this time! It should not happen too often, otherwise your application needs too many connections and/or does not return them fast enough.
But whether you really need a Database Connection Pool at all depends from the kind of your application!
If we talk about a typical server application that is intended to run forever and to serve a permanently changing crowd of multiple clients at the same time, it will definitely benefit from a connection pool.
If we talk about a tool type application that starts, performs a more or less linear task in a defined amount of time, and terminates when done, then using a connection pool for the database connection(s) may cause more overhead than it provides advantages. For such an application it might be better to keep the connection open for the whole runtime.
Taking the RDBMS view, both does not make a difference: in both cases the connections are seen as open.

If you have performance as a key parameter then better to switch to the Hikari connection pool. If you are using spring-boot then by default Hikari connection pool is used and you do not need to add any dependency. The beautiful thing about the Hikari connection pool is its entire lifecycle is managed and you do not have to do anything.
Also, it is always recommended to close the connection and let it return to the connection pool so that other threads can use it, especially in multi-tenant environments. The best way to do this is using "try with resources" and that connection is always closed.
try(Connection con = datasource.getConnection()){
// your code here.
}
To create your data source you can pass the credentials and create your data source for example:
DataSource dataSource = DataSourceBuilder.create()
.driverClassName(JDBC_DRIVER)
.url(url)
.username(username)
.password(password)
.build();
Link: https://github.com/brettwooldridge/HikariCP

If you want to know the answer in your case, just write two implementations (one with a pool, one without) and benchmark the difference.
Exactly how costly it is, depends on so many factors that it is hard to tell without measuring
But in general, a pool will be more efficient.

The costly is always a definition of impact.
Consider, you have following environment.
A web application with assuming a UI-transaction (user click) and causes a thread on the webserver. This thread is coupled to one connection/thread on the database
10 connections per 60000ms / 1min or better to say 0.167 connections/s
10 connections per 1000ms / 1sec => 10 connections/s
10 connections per 100ms / 0.1sec => 100 connections/s
10 connections per 10ms / 0.01sec => 1000 connections/s
I have worked in even bigger environments.
And believe me the more you exceed the 100 conn/s by 10^x factors the more pain you will feel without having a clean connection pool.
The more connections you generate in 1 second the higher latency you generate and the higher impact is it for the database. And the more bandwidth you will eat for recreating over and over a new "water pipeline" for dropping a few drops of water from one side to the other side.
Now getting back, if you have to access a existing connection from a connection pool it is a matter of micros or few ms to access the database connection. So considering one, it is no real impact at all.
If you have a network in between, it will grow to probably x10¹ to x10² ms to create a new connection.
Considering now the impact on your webserver, that each user blocks a thread, memory and network connection it will impact also your webserver load. Typically you run into webserver (e.g. revProxy apache + tomcat, or tomcat alone) thread pools issues on high load environments, if the connections get exhausted or they need too long time (10¹, 10² millis) to create
Now considering also the database.
If you have open connection, each connection is typically mapped to a thread on a DB. So the DB can use thread based caches to make prepared statements and to reuse pre-calculated access plan to make the accesses to data on database very fast.
You may loose this option if you have to recreate the connection over and over again.
But as said, if you are in up to 10 connections per second you shall not face any bigger issue without a connection pool, except the first additional delay to access the DB.
If you get into higher levels, you will have to manage the resources better and to avoid any useless IO-delay like recreating the connection.
Experience hints:
it does not cost you anything to use a connection pool. If you have issues with the connection pool, in all my previous performance tuning projects it was a matter of bad configuration.
You can configure
a connection check to check the connection (use a real SQL to access a real db field). so on every new access the connection gets checked and if defective it gets kicked from the connection pool
you can define a lifetime of a connections, so that you get new connection after a defined time
=> all this together ensure that even if your admins are doing crap and do not inform you (killing connection / threads on DB) the pool gets quickly rebuilt and the impact stays very low. Read the docs of the connection pool.
Is one connection pool better as the other?
A clear no, it is only getting a matter if you get into high end, or into distributed environments/clusters or into cloud based environments. If you have one connection pool already and it is still maintained, stick to it and become a pro on your connection pool settings.

Java many connections to API server with threads

I have created a custom web API for submitting data to a MYSQL server for my Java application. Within my Java application I need to add/update 200 rows, and I already have it making these requests one at a time.
This can be pretty time consuming, so can I create threads for all
of these different connections?
Should I limit the number of maximum connections made at a time? Maybe like 10 at a time?
Will this cause any issues with mysql
possibly adding rows at exactly the same time? (No 2 rows would ever
be needed to be changed at a single point in time)

Inserting records in multiple connections will potentially speed up the 200 inserts, but you can only know for sure by testing and measuring after introducing multiple threads. I will also suggest trying JDBC batch and sending all 200 inserts to the database in one go (if that is possible in your implementation), as that might provide a performance boost by saving round trips to the database.
To create a connection pool, look at HikariCP which is a JDBC connection pool implementation. It will allow you to specify the min/max concurrent connections along with other settings. Your worker threads then can request connections from the pool and perform the inserts.
Inserting multiple records concurrently could have issues at mySQL level if it acquires a table lock for each insert. In that case you would not get a speed improvement with multiple threads and might need some tuning at the database level to work around it. Here's a good article that should help: High rate insertion with mySQL

Does H2 spawn a new thread for each remote connection? Then, is there a limit?

I am developing an online mobile game. I have several server machines running numerous instances of a Java socket server application.
Player data has to be stored somewhere (their profiles, items etc). I want to use the H2 database for this purpose.
Now, here's the tricky part: I want all the player data to be stored in the same H2 database. That is, all my server applications will access the data by remotely connecting to one particular machine over TCP, out of convenience.
The thing is, we are expecting a very large amount of clients on launch. For each client, a connection to the H2 database is created. The obvious concern here is whether one single H2 database process can handle so many connections concurrently.
From the website:
There is no limit on the number of database open concurrently per
server, or on the number of open connections.
Given the above fact, in theory, if our server machine has enough resources (memory, space, CPUs, etc), then yes, the H2 database should be able to handle as many concurrent connections as our resources allow.
But there is something unclear to me:
Does the H2 process create a thread for each remote connection? I ask this because I once read that in Windows (our VPS' OS), a thread is stored as a short type, and hence the max amount of threads an application can spawn is roughly 32,000 (I don't know the math they used to get that number). In that case, then the H2 process does have a limit of concurrent connections - which is troubling because I do indeed expect more than 32,000 clients connected.
Of course, it would seem wise to discard the idea of having one single H2 database for all my clients. But I'd like to know if the above statement is correct: can H2 handle more than 32,000 remote database connections?

Let take this by parts:
"Does the H2 process create a thread for each remote connection?"
An application should normally use one connection per thread. An H2 database synchronizes access to the same connection, but other databases may not do this.
"can H2 handle more than 32,000 remote database connections?"
If you want to access the same database at the same time from different processes or computers, you need to use the client / server mode. The JdbcConnectionPool class has the default maximum number of connections set to 10, but it provides a setter to change it if you want. In theory, you can set it to Integer.MAX_VALUE, but I don't think this is wise. Why? For starters, the synchronization point made on the previous section. Another point to consider is if your application opens and closes connections a lot (for example, for each request), you should consider using a connection pool. Opening a DB connection is very slow.
"Of course, it would seem wise to discard the idea of having one single H2 database for all my clients"
It might be, but you have to keep in mind that the number of open database is limited by the memory available. If you are running on a powerful server, it might be a good option to consider. Then again, it might not.

JMS Multiple threads processing

We have a web application that is generating some 3-5 parallel threads every five seconds to connect to a JMS/JNDI connection pool. We wait for the first batch of parallel threads to complete before creating next batch of parallel threads. During this process we are using a lot of network traffic and connection threads are just hanging. Eventually we manually call operations team to kill the connection threads to free up connections.
Question I wanted to ask you is:
Obviously we are doing something wrong as we are holding up connection resources
When we wait for parallel threads to respond before sending second batch of requests,Does this design not resonate well with industry best practices?
Finally what are the options and recommendations you have for this scenario i.e. multiple threads connecting to JMS/JMDI connection
Thanks for your inputs

You need to adjust your connection pool parameters. It sounds like you're using up only 3-5 connections for your service, which seems very reasonable to me. A JMS service should be able to handle thousands of connections. Either your pool's default limit is too low, or your JMS server is configured with too few allowed connections.
Are you sure that's what the other users are blocking on? It seems strange to me.

I'm almost sure that you would be alright with single connection factory. Just make sure clean up/close session properly. We uses spring's SingleConnectionFactory.

Java database connection

I've been approached by a programmer that has background in Oracle forms and is moving into the Java realm. He asked me a question that I didn't have a good answer for. Instead of answering that; I always do it that way or that's how I was taught. I figured I'd do some research.
Question: With Java multi-thread capabilities; why don't you setup a JDBC connection to the database for each user, each on it's own thread? Instead of setting up a connection pool and applying security to which users can access the pool?

A connection pool scales better. If you have a dedicated connection per user, you will need 50 connections for 50 users. With a pool, you can probably do with something like 10 - 20 connections to handle the 50 users (depending on the use case). Now take a look at a bigger group and think about handling 500, 5000 or 50000 users and you will see that the 1 connection per user model does not scale.
With a connection pool (and a thread pool), each request will still be handled by one thread and by one database connection, but they will be taken from a pool instead of a dedicated one per user.

Because the overhead of maintaining a lot of threads is not efficient. What happens when there are more users than your application server or database can maintain? What happens when there's a lot of contention on a database table?
By using a connection pool, you're not having to "connect" to the database again from scratch so you get a big performance boost from re-using database connections from a pool.
Something like Apache's DBCP is a good place to start if you're not using an application server.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.