Cassandra Datastax Driver - Connection Pool - java

I'm trying to understand the connection pooling in Datastax Cassandra Driver, so I can better use it in my web service.
I have version 1.0 of the documentation. It says:
The Java driver uses connections asynchronously, so multiple requests can be submitted on the same connection at the same time.
What do they mean by connection? When connecting to a cluster, we have a Builder, a Cluster and a Session. Which one of them is the connection?
For example, there is this parameter:
maxSimultaneousRequestsPerConnection - number of simultaneous requests on all connections
to a host after which more connections are created.
So, these connections are automatically created, in the case of connection pooling (which is what I would expect). But what exactly are the connections? Cluster objects? Sessions?
I'm trying to decide what to keep 'static' in my web service. For the moment, I decided to keep the Builder static, so for every call I create a new Cluster and a new Session. Is this ok? If the Cluster is the Connection, then it should be ok. But is it? Now, the logger says, for every call:
2013:12:06 12:05:50 DEBUG Cluster:742 - Starting new cluster with contact points
2013:12:06 12:05:50 DEBUG ControlConnection:216 - [Control connection] Refreshing node list and token map
2013:12:06 12:05:50 DEBUG ControlConnection:219 - [Control connection] Refreshing schema
2013:12:06 12:05:50 DEBUG ControlConnection:147 - [Control connection] Successfully connected to...
So, it connects to the cluster every time? That's not what I want; I want to reuse connections.
So, the connection is actually the Session? If this is the case, I should keep the Cluster static, not the Builder.
What method should I call, to be sure I reuse connections, whenever possible?

The accepted answer (at the time of this writing) is giving the correct advice:
As long as you use the same Session object, you [will] be reusing connections.
However, some parts were originally oversimplified. I hope the following provides insight into the scope of each object type and their respective purposes.
Builder ≠ Cluster ≠ Session ≠ Connection ≠ Statement
A Cluster.Builder is used to configure and create a Cluster
A Cluster represents the entire Cassandra ring
A ring consists of multiple nodes (hosts), and the ring can support one or more keyspaces. You can query a Cluster object about cluster-level (ring-level) properties.
I also think of it as the object that represents the calling application to the ring. You communicate your application's needs (e.g. encryption, compression, etc.) to the builder, but it is the Cluster object that actually establishes communication with the C* ring. If your application uses more than one set of authentication credentials for different users/purposes, you likely have different Cluster objects even if they connect to the same ring.
A Session itself is not a connection, but it manages them
A session may need to talk to all nodes in the ring, which cannot be done over a single TCP connection except in the special case of a ring that contains exactly one node. The Session manages a connection pool, and that pool will generally have at least one connection for each node in the ring.
This is why you should re-use Session objects as much as possible. An application does not directly manage or access connections.
A Session is accessed from the Cluster object; it is usually "bound" to a single keyspace at a time, which becomes the default keyspace for the statements executed from that session. A statement can use a fully-qualified table name (e.g. keyspacename.tablename) to access tables in other keyspaces, so it's not required to use multiple sessions to access data across keyspaces. Using multiple sessions to talk to the same ring increases the total number of TCP connections required.
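To illustrate, here is a minimal sketch (DataStax Java driver 2.x/3.x style; the contact point and keyspace names are placeholders) of building one Cluster and one Session at startup and reusing them:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

// Built once at application startup and shared by all requests
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")             // placeholder contact point
        .build();
Session session = cluster.connect("my_keyspace"); // default keyspace for this Session

// A fully-qualified table name still reaches another keyspace through the same Session
ResultSet rs = session.execute("SELECT * FROM other_keyspace.some_table");

// Only on application shutdown (never per request):
// session.close();
// cluster.close();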
A Statement executes within a Session
Statements can be prepared or not, and each one either mutates data or queries it (and in some cases, both). The fastest, most efficient statements need to communicate with at most one node, and a Session from a topology-aware Cluster should contact only that node (or one of its peers) on a single TCP connection. The least efficient statements must touch all replicas (a majority of nodes), but that will be handled by the coordinator node on the ring itself, so even for these statements the Session will only use a single connection from the application.
Also, versions 2 and 3 of the Cassandra binary protocol used by the driver multiplex requests on each connection. So while a single statement requires at least one TCP connection, that single connection can potentially service up to 128 (protocol v2) or 32,768 (protocol v3) asynchronous requests simultaneously.
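As a rough illustration of that multiplexing (reusing the session from the earlier sketch; the table name is invented), many asynchronous statements can be in flight at once over the pool's few connections:

import com.datastax.driver.core.ResultSetFuture;
import java.util.ArrayList;
import java.util.List;

List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    // Requests are submitted without waiting; they share the pooled connections
    futures.add(session.executeAsync("SELECT value FROM demo.counters WHERE id = " + i));
}
for (ResultSetFuture f : futures) {
    f.getUninterruptibly(); // block for each result here; real code would attach callbacks instead
}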

You are right, the connection is actually in the Session, and the Session is the object you should give to your DAOs to write into Cassandra.
As long as you use the same Session object, you should be reusing connections (you can see the Session as being your connection pool).
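For example, a hedged sketch of handing that shared Session to a DAO (the class, table and column names are invented; passing bind values straight to execute() requires driver 2.0 or later):

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.UUID;

public class UserDao {
    private final Session session; // the one shared, long-lived Session

    public UserDao(Session session) {
        this.session = session;
    }

    public Row findById(UUID id) {
        // The Session picks a pooled connection for each request
        return session.execute("SELECT * FROM users WHERE id = ?", id).one();
    }
}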
Edit (2017/4/10): I clarified this answer following William Price's one.
Please be aware that this answer is 4 years old, and Cassandra has changed a fair bit in the meantime!

Just an update for the community: you can tune the connection pool in the following way (the Cluster must already have been built before this call, otherwise it throws a NullPointerException):
private static Cluster cluster; // initialized elsewhere via Cluster.builder()...build()
cluster.getConfiguration().getPoolingOptions().setMaxConnectionsPerHost(HostDistance.LOCAL, 100);
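Equivalently (a sketch assuming driver 2.x/3.x; the contact point is a placeholder), the pooling options can be passed to the builder when the Cluster is created:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

PoolingOptions poolingOptions = new PoolingOptions()
        .setMaxConnectionsPerHost(HostDistance.LOCAL, 100);

Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")            // placeholder contact point
        .withPoolingOptions(poolingOptions)
        .build();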

Related

How can I request a specific connection from com.mchange.v2.c3p0.ComboPooledDataSource?

Problem:
Program uses com.mchange.v2.c3p0.ComboPooledDataSource to connect to Sybase server
Program executes 2 methods, runSQL1() and runSQL2(), in sequence
runSQL1() executes SQL which creates a #temptable
SELECT * INTO #myTemp FROM TABLE1 WHERE X=2
runSQL2() executes SQL which reads from this #temptable
SELECT * FROM #myTemp WHERE Y=3
PROBLEM: runSQL2() gets handed a different DB connection from the pool than the one handed to runSQL1().
However, Sybase #temptables are connection-specific, therefore runSQL2() fails when it can't find the table.
The most obvious solution I can think of (aside from the degenerate one of making the pool size 1, at which point we don't even need a pool) is to somehow remember which specific connection from the pool was used by runSQL1(), and have runSQL2() request that same connection.
Is there a way to do this in com.mchange.v2.c3p0.ComboPooledDataSource?
If possible, I'd like an answer which is concurrency-safe (in other words, if the connection used in runSQL1() is being used by another thread, runSQL2()'s call to get a connection will wait until that connection is released by the other thread).
However, if that's impossible, I'm OK with an answer which assumes that the DB connections I care about all happen in one single thread, and therefore any connection requested by runSQL2() will be 100% available if it was available to runSQL1().
I'm also welcoming of any solutions that address the problem some other way, as long as they don't involve "stop using #temptables" as part of the solution.
The easiest and most obvious way to do that is to request a connection from the pool yourself and then run both runSQL1() and runSQL2() with that same connection. The usage pattern suggested in the question goes against the general design principles of connection pool managers, as it would effectively promote them to some kind of transaction manager.
There are Java frameworks that can help with the above. For example, in Spring, @Transactional or TransactionTemplate can be used to demarcate transaction boundaries, and it will guarantee that a single connection is used by a single thread (or, more precisely, according to the transaction propagation annotations). Spring can use many transaction managers, but probably the simplest would be DataSourceTransactionManager, and it can also be configured to use c3p0 as the DataSource.
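A minimal sketch of the first suggestion, assuming a ComboPooledDataSource instance named cpds is already configured (the SQL is taken from the question):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Check out ONE connection and run both steps on it, so the #temptable stays visible
try (Connection conn = cpds.getConnection();
     Statement st = conn.createStatement()) {
    st.execute("SELECT * INTO #myTemp FROM TABLE1 WHERE X=2");                // runSQL1()
    try (ResultSet rs = st.executeQuery("SELECT * FROM #myTemp WHERE Y=3")) { // runSQL2()
        while (rs.next()) {
            // process the row...
        }
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}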

Should I have a single database connection or one connection per task? [duplicate]

This question already has answers here:
How to manage db connections on server?
I have a Java server and PostgreSQL database.
There is a background process that queries the database (it inserts some rows) 2-3 times per second, and there is a servlet that queries the database once per request (it also inserts a row).
I am wondering: should I have separate Connection instances for them, or share a single Connection instance between them?
Also, does this even matter? Or does the PostgreSQL JDBC driver internally send all requests through a unified pool anyway?
One more thing: should I make a new Connection instance for every servlet request thread, or share one Connection instance across all servlet threads and keep it open for the entire uptime?
By separate I mean every thread creates its own Connection instance like this:
Connection connection = DriverManager.getConnection(url, user, pw);
If you use a single connection and share it, only one thread at a time can use it and the others will block, which will severely limit how much your application can get done. Using a connection pool means that the threads can have their own database connections and can make concurrent calls to the database server.
See the postgres documentation, "Chapter 10. Using the Driver in a Multithreaded or a Servlet Environment":
A problem with many JDBC drivers is that only one thread can use a Connection at any one time --- otherwise a thread could send a query while another one is receiving results, and this could cause severe confusion.
The PostgreSQL™ JDBC driver is thread safe. Consequently, if your application uses multiple threads then you do not have to worry about complex algorithms to ensure that only one thread uses the database at a time.
If a thread attempts to use the connection while another one is using it, it will wait until the other thread has finished its current operation. If the operation is a regular SQL statement, then the operation consists of sending the statement and retrieving any ResultSet (in full). If it is a fast-path call (e.g., reading a block from a large object) then it consists of sending and retrieving the respective data.
This is fine for applications and applets but can cause a performance problem with servlets. If you have several threads performing queries then each but one will pause. To solve this, you are advised to create a pool of connections. Whenever a thread needs to use the database, it asks a manager class for a Connection object. The manager hands a free connection to the thread and marks it as busy. If a free connection is not available, it opens one. Once the thread has finished using the connection, it returns it to the manager which can then either close it or add it to the pool. The manager would also check that the connection is still alive and remove it from the pool if it is dead. The down side of a connection pool is that it increases the load on the server because a new session is created for each Connection object. It is up to you and your applications' requirements.
As per my understanding, you should defer this task to the container and let it manage connection pooling for you.
You're using Servlets, which run in a Servlet container, and all major Servlet containers that I'm aware of provide connection pool management.
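For illustration, a hedged sketch of borrowing a connection from a container-managed pool inside a servlet; the JNDI name jdbc/MyDB, the visits table, and the in-scope request variable are all assumptions that depend on your container configuration:

import javax.naming.InitialContext;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;

try {
    DataSource ds = (DataSource) new InitialContext().lookup("java:comp/env/jdbc/MyDB");
    // Per request: borrow a connection, use it, and return it to the pool via close()
    try (Connection conn = ds.getConnection();
         PreparedStatement ps = conn.prepareStatement("INSERT INTO visits (path) VALUES (?)")) {
        ps.setString(1, request.getRequestURI()); // 'request' is the HttpServletRequest (assumed in scope)
        ps.executeUpdate();
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}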
See Also
Best way to manage database connection for a Java servlet

Elegant/efficient way reading millions of records in MySQL Database, Java

I have a MySQL database with ~8,000,000 records. Since I need to process them all, I use a BlockingQueue: the Producer reads from the database and puts 1000 records at a time into the queue, and the Consumer takes records from the queue and processes them.
I am writing this in Java, but I'm stuck on how to read from my database in a clean, elegant way and 'suspend' reading once the BlockingQueue is full. At that point control is handed to the Consumer until free spots become available again in the BlockingQueue, after which the Producer should continue reading records from the database.
Is it clean/elegant/efficient to keep my database connection open so it can read continuously? Or should I, once control shifts from Producer to Consumer, close the connection, store the id of the last record read, and later reopen the connection and resume reading from that id? The latter doesn't seem great, since my database would have to open and close connections a lot; but the former isn't very elegant either, in my opinion.
With persistent connections:
You cannot build transaction processing effectively
Per-user sessions are impossible on the same connection
The application is not scalable
Over time you may need to extend it, and that will require management/tracking of the persistent connections
If a script, for whatever reason, cannot release a lock on a table, then all following scripts will block indefinitely and the DB server has to be restarted
When using transactions, an open transaction block will also carry over to the next script (using the same connection) if script execution ends before the transaction block completes, etc.
Persistent connections do not let you do anything that you cannot do with non-persistent connections.
Then why use them at all?
The only possible reason is performance: use them when the overhead of creating a connection to your MySQL server is high. That depends on many factors, such as:
Database type
Whether the MySQL server is on the same machine and, if not, how far away it is (it might even be outside your local network/domain)
How heavily the machine that MySQL sits on is loaded by other processes
You can always replace persistent connections with non-persistent ones. It might change the performance of the script, but not its behavior!
A commercial RDBMS might be licensed by the number of concurrently open connections, and here persistent connections can work against you.
If you are using a bounded BlockingQueue by passing a capacity value in the constructor, then the producer will block when it attempts to call put() until the consumer removes an item by calling take().
It would help to know more about when or how the program is going to execute in order to decide how to deal with database connections. Some easy choices are: have the producer and each consumer get an individual connection, have a connection pool for all consumers while the producer holds its own connection, or have all producers and consumers use a connection pool.
You can minimize the number of connections by using something such as Spring to manage your connection pool and transactions; however, that is only necessary in some execution scenarios.
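A rough sketch of the "producer holds one connection" option with a bounded queue; the queue capacity, the SQL, url/user/pw, and the MyRecord/mapRow() helpers are placeholders (note that with MySQL Connector/J, true streaming may additionally require a fetch size of Integer.MIN_VALUE or useCursorFetch=true):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

BlockingQueue<List<MyRecord>> queue = new ArrayBlockingQueue<>(10); // bounded: put() blocks when full

Runnable producer = () -> {
    try (Connection conn = DriverManager.getConnection(url, user, pw);
         Statement st = conn.createStatement()) {
        st.setFetchSize(1000);                                // hint: stream rows instead of loading all of them
        try (ResultSet rs = st.executeQuery("SELECT * FROM records")) {
            List<MyRecord> batch = new ArrayList<>();
            while (rs.next()) {
                batch.add(mapRow(rs));                        // mapRow() is an assumed helper
                if (batch.size() == 1000) {
                    queue.put(batch);                         // blocks here until the consumer catches up
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch);
            }
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
};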

Saving a connection in Java EE's session vs. a connection pool

I know that Java EE's session object can store complex objects, like a connection to a database.
I'm pondering how to implement a certain application for a programming exercise, built with Java EE. My first option is to use a connection pool, which is very easy with Java EE.
I'm wondering, out of curiosity and also to properly justify the decision, what the pros and cons are of creating a connection to the database whenever a client starts a session and storing it there, versus using a connection pool.
Thanks a lot.
A resource pool optimises the handling of the resource (your database connection) in a way your system can cope with, although you can still run out of resources if you have a lot of open connections.
That is more likely to happen if you store your database connection in the session context. Web applications don't need to be connected to the database all the time; the connection can be established at the beginning of an operation and closed at the end. Using a resource pool, you return your connection to the pool when you no longer need it, so a new user (a session, in the web paradigm) can reuse the resource you have already released instead of creating a new one.
The pool will also handle the scenario in which some resources have been idle for a long time (no one has used them in a specific amount of time) and release those resources.
When you store a database connection in the session you never release the resource; you keep a permanent reference to it that lasts as long as the user session does. You may not face any issues with that in the short term, especially if there are really few users connected at the same time, but in real-world applications you will definitely run into them.
Thus, storing a database connection in the session context is considered a bad practice.
EDIT: I forgot to mention that you should only store Serializable objects in the session, so that if the application server decides to passivate a session, it can be persisted and restored when the server reactivates it. A database connection is not Serializable.
Using a connection pool allows you to maximize the utilization of your connections. This means fewer connections = less memory = fewer sockets, etc. The reason a pool is better than saving a connection in the session is: what happens if someone drops out unexpectedly? If you keep a connection in your session, you risk keeping that connection alive for a long time, possibly indefinitely.

shared DB connection vs private DB connections

Trying to figure out how to manage/use long-living DB connections. I have very little experience with this, as I have only used a DB with small systems (up to some 150 concurrent users, each with their own DB user/pass, so there were up to 150 long-living DB connections at any time) or web pages (each page request gets its own DB connection that lasts for less than a second, so the number of concurrent DB connections isn't huge).
This time there will be a Java server and a Flash client. Java connects to PostgreSQL. Connections are expected to be long-living, i.e., they're expected to start when the Flash client connects to the Java server and to end when the Flash client disconnects. Would it be better to share a single connection between all users (clients) or to create a private connection for every client? Or would some other solution be better?
*) Single/shared connection:
(+) pros
only one DB connection for whole system
(-) cons:
transactions can't be used (e.g., "user1.startTransaction(); user1.updateBooks(); user2.updateBooks(); user1.rollback();" to a single shared connection would rollback changes that are done by user2)
long queries of one user might affect other users (not sure about this, though)
*) Private connections:
(+) pros
no problems with transactions :)
(-) cons:
a huge number of concurrent connections might be required, i.e., if there are 10000 users online, 10000 DB connections are required, which seems like too high a number :) I don't know anything about the expected number of users though, as we are still researching and planning.
One solution would be to introduce timeouts, i.e., if a DB connection is not used for 15/60/900(?) seconds, it gets disconnected; when the user needs the DB again, it gets reconnected. This seems like a good solution to me, but I would like to know what reasonable limits might be, e.g., what the max number of concurrent DB connections might be, what timeout should be used, etc.
Another solution would be to group queries into two "types": one type that can safely use a single shared long-living connection (e.g., "update user set last_visit = now() where id = :user_id"), and another type that needs a private short-living connection (e.g., something that can potentially do some heavy work or use transactions). This solution does not seem very appealing to me, though if that's the way it should be done, I could try it...
So... What do other developers do in such cases? Are there any other reasonable solutions?
I don't use long-lived connections. I use a connection pool to manage connections, and I keep them only for as long as it takes to perform an operation: get the connection, perform my SQL operation, return the connection to the pool. It's much more scalable and doesn't suffer from transaction problems.
Let the container manage the pool for you - that's what it's for.
By using a single connection, you also get very low performance because the database server will only allocate one connection to you.
You definitely need a connection pool. If your app runs inside an application server, use the container's pool. Otherwise, you can use a connection pool library like c3p0.
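As a sketch only, configuring c3p0's ComboPooledDataSource for PostgreSQL might look like this; the URL, credentials and pool sizes are placeholders to adapt, and exception handling is omitted:

import com.mchange.v2.c3p0.ComboPooledDataSource;
import java.sql.Connection;

ComboPooledDataSource pool = new ComboPooledDataSource();
pool.setDriverClass("org.postgresql.Driver");               // throws a checked PropertyVetoException
pool.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb");   // placeholder URL
pool.setUser("dbuser");
pool.setPassword("dbpass");
pool.setMinPoolSize(5);
pool.setMaxPoolSize(50);
pool.setMaxIdleTime(300);                                   // seconds before idle connections are culled

// Per operation: borrow a connection, use it, and return it to the pool by closing it
try (Connection conn = pool.getConnection()) {
    // run the SQL for this operation...
}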
