Production-grade Cassandra client configuration for a Java application

My application runs at more than 100 transactions per second in production. I would like to know what configuration should be used to achieve this.
In the non-prod environment I am using the cluster with DCAwareRoundRobinPolicy (the DC-aware load balancing policy) and a consistency level of LOCAL_QUORUM.
All remaining configuration is left at the defaults.
Is the default configuration enough, or do I need to specify all the connection options such as pooling options, socket options, consistency level, etc.?
PS: Cassandra version 3. Please suggest how to scale this.

The Java driver defaults are quite good, especially for that load. You should use the DCAware/TokenAware load balancing policy, which is the default. You may tune the connection pooling to allow more "in-flight" requests per connection. You should have only a single instance of the Session class per application, to avoid opening too many connections to the cluster. The real performance gain comes from using asynchronous operations and from a lower consistency level such as LOCAL_ONE (but this is application specific).
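To make this concrete, here is a minimal sketch of such a setup with the DataStax Java driver 3.x. The contact point, local DC name, keyspace and pool sizes are hypothetical placeholders, not recommendations for your cluster:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class CassandraClient {

        // One Cluster/Session per application; both are thread-safe and expensive to create.
        private static final Session SESSION = buildSession();

        private static Session buildSession() {
            // Allow more "in-flight" requests per connection to hosts in the local DC.
            PoolingOptions pooling = new PoolingOptions()
                    .setConnectionsPerHost(HostDistance.LOCAL, 1, 2)
                    .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")                 // placeholder contact point
                    .withLoadBalancingPolicy(new TokenAwarePolicy(
                            DCAwareRoundRobinPolicy.builder()
                                    .withLocalDc("DC1")          // placeholder local DC name
                                    .build()))
                    .withPoolingOptions(pooling)
                    .build();

            return cluster.connect("my_keyspace");               // placeholder keyspace
        }

        public static Session getSession() {
            return SESSION;
        }
    }

For throughput, prefer session.executeAsync(...) over execute(...) where the application can tolerate it, and measure before tuning further; the defaults already handle a load of a few hundred requests per second comfortably.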

Related

Some questions about connection pooling in a servlet application

I'm new to Java web application design and I'm surprised by how many things I don't know.
In particular, I'm having trouble understanding how servlet containers manage resources like connection pooling classes.
Assuming that I choose a pooling library (let's say c3p0), I read that there are many ways to use and manage the connection pooling classes.
For instance, in many examples I saw that a certain class (let's say ComboPooledDataSource) is instantiated in the init() method of the servlet, and here I'm getting a bit confused. I mean, I think that a connection pooling system must exist and have a separate life with respect to all the servlets that will need a connection, otherwise it makes no sense. So I think that the class in question may be a thread that is started once by the first servlet that calls the init() method and then continues to exist until someone interrupts it. Is that correct? If not, how does it work?
Anyway, once I start this class, is it shared among all the servlets in the context (I mean all the servlets that access it in their init() methods)?
Other examples set up the connection pooling system as a resource, for instance by defining it in context.xml, and then any servlet that wants a connection just has to access it via JNDI (is JNDI correct?). What I understood (or think I understood) is that in this case the thread that will run the pooling system is started when the application starts, and each servlet can access it whenever it wants. Is that correct?
In this case, can I modify the connection pooling system's properties from a servlet or a background thread at runtime? (For example, if I want to change the number of connections as a function of statistics on the number of requests, and so on.)
If I want to create different pools (for example, to subdivide database access across N different databases, or to access the database using different usernames), do I need to create as many resources as there are different kinds of connection?
Is one of these two ways "better", or are they equivalent?
It comes down to ease of use of the webapp and (Tomcat) server maintenance.
You describe 2 use cases:
a connection pool used by one webapp
a connection pool shared between webapps
The first case is suitable for "ease of use": you can drop the war-file in any Tomcat server and it will work (e.g. Jenkins delivers such a war-file) since the webapp contains all the code needed to access a database. No need to change the configuration of the Tomcat server and restart it.
If I can, I like to deliver this kind of war-file since fewer things can go wrong (e.g. no clashes with configuration for other webapps). You can take it a step further by delivering Tomcat embedded with an application (so you don't need a Tomcat server to start with; an example project is here).
Note that opening a database connection pool (preferably via ServletContextListener.contextInitialized, not a servlet) and closing it (via contextDestroyed) does not involve starting and stopping a thread. The pool implementation may decide to start a background thread (e.g. to remove idle and/or abandoned connections), but it does not have to.
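For illustration, here is a minimal sketch of that lifecycle with c3p0, assuming a Servlet 3.0 container; the driver class, JDBC URL and credentials are hypothetical placeholders:

    import com.mchange.v2.c3p0.ComboPooledDataSource;

    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;
    import java.beans.PropertyVetoException;

    // The pool lives for the whole webapp, not per servlet.
    @WebListener
    public class PoolLifecycleListener implements ServletContextListener {

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            ComboPooledDataSource ds = new ComboPooledDataSource();
            try {
                ds.setDriverClass("org.postgresql.Driver");        // placeholder driver
            } catch (PropertyVetoException e) {
                throw new IllegalStateException(e);
            }
            ds.setJdbcUrl("jdbc:postgresql://localhost/mydb");     // placeholder URL
            ds.setUser("app");                                     // placeholder credentials
            ds.setPassword("secret");
            ds.setMaxPoolSize(20);

            // Store the pool in the ServletContext so every servlet can reach it.
            sce.getServletContext().setAttribute("dataSource", ds);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) {
            ComboPooledDataSource ds =
                    (ComboPooledDataSource) sce.getServletContext().getAttribute("dataSource");
            if (ds != null) {
                ds.close();   // dispose of the pooled connections on undeploy
            }
        }
    }

Servlets then fetch the DataSource from the ServletContext, borrow a connection, and return it by closing it; the pool itself is shared by all servlets in the webapp.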
The second case is suitable for several webapps that share common ground. For example, if several webapps all running in the same Tomcat server use the same database, it saves time and effort if the Tomcat server already has a connection pool available to the different webapps (via JNDI). The Tomcat server can contain the JDBC driver software and connection pool software in the "lib" directory, and configuration is done once in the server's context.xml. If the database changes or different (patched) software components are required, only the Tomcat server needs to be updated and all the webapps can remain the same. In this case, updating the Tomcat server instead of each webapp is a lot easier.
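In this second case the webapp does not create the pool at all; it only looks up the container-managed DataSource. A minimal, hypothetical lookup (the JNDI name jdbc/MyDB is a placeholder matching whatever is declared in context.xml):

    import javax.naming.InitialContext;
    import javax.naming.NamingException;
    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.SQLException;

    public class SharedPoolClient {

        // Look up the container-managed pool declared in the server's context.xml.
        static DataSource lookup() throws NamingException {
            InitialContext ctx = new InitialContext();
            return (DataSource) ctx.lookup("java:comp/env/jdbc/MyDB");   // placeholder JNDI name
        }

        static void doWork() throws NamingException, SQLException {
            DataSource ds = lookup();
            try (Connection conn = ds.getConnection()) {
                // use the connection; close() returns it to the shared pool
            }
        }
    }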
The second case is also more suitable for monitoring: there is a good chance you can monitor the connection pool via JMX. This kind of monitoring may not be available in the first case.
Changing "the number of connections as a function of the statistics on the number of requests" is not needed with connection pools: you set a maximum number of connections to use and a timeout to remove idle connections. The connection pool will then grow and shrink with the number of requests.
Subdividing the database access to N different databases would require a different connection pool for each database, unless this is "read-only" in which case you could also use a load-balancer.
Access using different usernames is a bit tricky. I read in another question (which I cannot find anymore, sorry) that you can change the schema during runtime, but changing username/password might require a new connection in which case connection pooling is out.
Are you using Tomcat to run your servlets? Tomcat 7 has a new pooling solution that you could investigate. Tomcat also includes the DBCP libraries; you can use them by setting factory="org.apache.tomcat.dbcp.dbcp.BasicDataSourceFactory". The DBCP pools are self-managed and you can define the configuration values in the context.xml file; I am not aware of a way to change these values on the fly. I believe the Tomcat 7 pool uses the same configuration as DBCP, with a couple of added options to make the conversion between the two simple. We use DBCP in all our applications and have not experienced problems, so I have not used the Tomcat 7 pooling. I am thinking you would need to create many pools to handle most of your requirements.

App Server Clustering vs Terracotta

I've heard the term "clustering" used for application servers like GlassFish, as well as with Terracotta; and I'm trying to understand what the word clustering implies when used in conjunction with application servers, and when used in conjunction with Terracotta.
My understanding is:
If a GlassFish server is clustered, then it means we have multiple physical/virtual machines, each with their own JRE/JVM running separate instances of GlassFish. However, since they are clustered, they will all communicate through their admin server ("DAS"), and have the same apps deployed to all of them. They will effectively act (to the end user) as if they are a single app server - but now with load balancing, failover/redundancy and scalability added into the mix.
Terracotta is, essentially, a product that makes multiple JVMs, running on different physical/virtual machines, act as if they are a single JVM.
Thus, if my understanding is correct, the following are implied:
You cluster app servers when you want load balancing and failover tolerance
You use Terracotta when any particular JVM is too small to contain your application and you need more "horsepower"
Thus, technically, if you have a GlassFish cluster of, say, 5 server instances, each of those 5 instances could actually be an array/cluster of Terracotta instances, meaning each GlassFish server instance is really a single GlassFish instance living across the JVMs of multiple machines.
If any of these assertions/assumptions are untrue, please correct me! If I have gone way off-base and clearly don't understand clustering and/or the very purpose of Terracotta, please point me in the right direction!
Terracotta enables you to have shared state across all your nodes (it's stateful). Basically, it creates a shared memory space between different JVMs. This is useful when all the nodes in a cluster need access to the same objects.
If your application is stateless and you just need load balancing and failover, you can use a solution like JGroups. In this scenario each node just handles requests and has little awareness of the other nodes. Objects in memory are not shared across nodes; each JVM runs on its own and knows nothing about the other JVMs. This often works nicely for request/response-type applications. A web server serving content (without sessions) does this, for example.
Dealing with a stateless cluster is often simpler than dealing with a stateful cluster, because the nodes know almost nothing about each other, which results in fewer things that can go wrong.
GlassFish sits somewhere in the middle of these concepts. Objects in memory within GlassFish are visible to all nodes, but the frontend (the HTTP connectors) works statelessly.
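As a concrete illustration of the stateless case mentioned above, here is a minimal sketch with JGroups 3.x/4.x in which each node only learns the cluster membership; no object state is shared. The cluster name is a placeholder:

    import org.jgroups.JChannel;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class StatelessNode extends ReceiverAdapter {

        @Override
        public void viewAccepted(View view) {
            // Called whenever nodes join or leave; no shared object state is involved.
            System.out.println("Cluster members: " + view.getMembers());
        }

        public static void main(String[] args) throws Exception {
            JChannel channel = new JChannel();        // default UDP protocol stack
            channel.setReceiver(new StatelessNode());
            channel.connect("stateless-cluster");     // placeholder cluster name
            Thread.sleep(60_000);                     // stay in the cluster briefly for the demo
            channel.close();
        }
    }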
So to answer your questions:
1) Yes, those are the two most obvious reasons. However, sometimes people only want failover, or only load balancing, or both. Not all clustering solutions fix both of these problems.
2) Yes, although technically speaking Terracotta only solves the shared-memory part, not the CPU part. However, by solving the memory part it effectively solves the CPU part as well, since you can now just add JVMs to the shared memory space.
3) I don't know if that's practically possible, but as a thought experiment: yes.
Clustering can mean one of the following:
Multiple instances can be managed as one. Deploy an application to the cluster, it is deployed to all instances in the cluster. Make a configuration change, and that change will be pushed to all nodes in the cluster. GlassFish supports this out of the box.
Service Availability. If any one instance fails, the application is available on another instance. Without high availability enabled, any instance failure also results in session loss for any session being managed by that instance. GlassFish supports this out of the box.
High availability. If any one instance fails, the application is available on another instance, and there is no session loss because a session replica is also maintained on another instance. GlassFish supports this. You will have to choose either #2 or #3 in any one cluster.
What you are asking about IMHO is really #3, because it is the only real case where Terracotta - in the context of high availability clustering - will offer value w/GlassFish. GlassFish already offers built-in high availability, so there had better be a very good reason to add Terracotta to the solution because it will complicate the deployment architecture.
The primary reason I can think of for adding Terracotta is that you may want to offload session management to a data grid and free up GlassFish to run business logic. This may be due to more frequent garbage collection or to wanting to manage more users per GlassFish instance. However, I'm not sure that Terracotta can do this seamlessly. With GlassFish's built-in HA clustering, replicating sessions is seamless (no application logic modifications). You may have to write code to put/get data into/from a Terracotta cache; I'll let you research that :-) Oracle GlassFish Server also integrates (seamlessly) with Coherence to solve this problem: you can separate session management into a Coherence data grid without modifying your application code.
Unless you know for a fact up front that your application must scale to a very large number of concurrent users, start with built-in HA clustering, run tests, and go from there.
Hope this helps.

Using an application managed connection pool in a Java EE application server

Is it safe to run a database connection pool (like Commons DBCP or c3p0) as part of an application deployed to an application server like Glassfish or Websphere? Are there any additional steps over a standalone application that should be taken to ensure safety or performance?
Update, clarification of reason - the use case I have in mind may need new data sources to be defined at runtime by skilled end users - changing the data sources is part of the application's functionality, if you like. I don't think I can create and use container-managed pools on the fly?
AFAIK it works, but it will of course escape the app server's management features.
Also, I'm not entirely sure how undeployment or redeployment happens and whether the connections are correctly disposed of. But that can be considered a minor safety detail: if disposed of improperly, the connections will simply time out, I guess. I'm also not entirely sure whether it works for an XA data source that integrates with the distributed transaction manager.
That said, using the app server's pool is usually a matter of configuring the JNDI name in a configuration file. Then you get monitoring, configuration from the admin console, load management, etc. for free.
In fact, you can create container-managed data sources programmatically, depending on the AS you use.
For example, WebLogic has an extensive management API that is used, among other things, by their own WLST (WebLogic Scripting Tool) to configure servers via scripts. This is, of course, a Java API. It has methods to create and configure data sources too.
Another route is JMX-based configuration. All modern application servers expose themselves as JMX containers. You can create data sources via JMX too.
All you need is to grant your application admin privileges (i.e. provide it with a username/password).
The benefit of a container-managed DS is that it can be clustered. Also, it can be managed by a human being using the standard AS UI.
If that doesn't work for you, then sure, you can create application-managed data sources at any time and in any number. Just keep in mind that each will be bound to a specific managed server (unless you implement manual clustering of its definition).
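As a minimal sketch of that last approach (and of the questioner's use case of defining data sources at runtime), assuming Commons DBCP 2 on the classpath; the class and the pool settings are hypothetical:

    import org.apache.commons.dbcp2.BasicDataSource;

    import javax.sql.DataSource;
    import java.sql.SQLException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Application-managed pools created on the fly; note they escape the
    // app server's monitoring, clustering and admin console entirely.
    public class RuntimeDataSourceRegistry {

        private final Map<String, BasicDataSource> pools = new ConcurrentHashMap<>();

        public DataSource create(String name, String url, String user, String password) {
            BasicDataSource ds = new BasicDataSource();
            ds.setUrl(url);
            ds.setUsername(user);
            ds.setPassword(password);
            ds.setMaxTotal(10);       // keep per-pool limits modest
            pools.put(name, ds);
            return ds;
        }

        public void remove(String name) throws SQLException {
            BasicDataSource ds = pools.remove(name);
            if (ds != null) {
                ds.close();           // dispose of connections explicitly, e.g. on undeploy
            }
        }
    }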
I don't see why you'd want to. Why not use the connection pool that the app server provides for you?
Update: I don't believe it's possible to create new pools on the fly without having to bounce the app server, but I could be wrong. If that's correct, I don't believe that Commons DBCP or C3P0 will help.

When should I use the JDBC Persistence Adapter in ActiveMQ?

Reading the ActiveMQ documentation (we are using the 5.3 release), I found a section about the possibility of using a JDBC persistence adapter with ActiveMQ.
What are the benefits? Does it provide any gain in performance or reliability? When should I use it?
In my opinion, you would use JDBC persistence if you wanted to have a failover broker and you could not use the file system. The JDBC persistence was significantly slower (during our tests) than journaling to the file system. For a single broker, the journaled file system is best.
If you are running two brokers in an active/passive failover, the two brokers must have access to the same journal / data store so that the passive broker can detect and take over if the primary fails. If you are using the journaled file system, then the files must be on a shared network drive of some sort, using NFS, WinShare, iSCSI, etc. This usually requires a higher-end NAS device if you want to eliminate the file share as a single point of failure.
The other option is that you can point both brokers to the database, which most applications already have access to. The tradeoff is usually simplicity at the price of performance, as the journaled JDBC persistence was slower in our tests.
We run ActiveMQ in an active/passive broker pair with journaled persistence via an NFS mount to a dedicated NAS device, and it works very well for us. We are able to process over 600 msgs/sec through our system with no issues.
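For reference, a broker can be pointed at a shared database programmatically as well as in activemq.xml. Here is a minimal, hypothetical sketch using the embedded BrokerService API with a Commons DBCP 2 data source (the JDBC URL, credentials and port are placeholders):

    import org.apache.activemq.broker.BrokerService;
    import org.apache.activemq.store.jdbc.JDBCPersistenceAdapter;
    import org.apache.commons.dbcp2.BasicDataSource;

    public class JdbcPersistedBroker {

        public static void main(String[] args) throws Exception {
            // Shared database that both brokers in an active/passive pair point at.
            BasicDataSource ds = new BasicDataSource();
            ds.setUrl("jdbc:mysql://dbhost/activemq");   // placeholder database URL
            ds.setUsername("amq");                       // placeholder credentials
            ds.setPassword("secret");

            JDBCPersistenceAdapter jdbc = new JDBCPersistenceAdapter();
            jdbc.setDataSource(ds);

            BrokerService broker = new BrokerService();
            broker.setBrokerName("broker1");
            broker.setPersistenceAdapter(jdbc);
            broker.addConnector("tcp://0.0.0.0:61616");  // placeholder transport connector
            broker.start();
        }
    }

With JDBC master/slave, the passive broker runs the same configuration; whichever broker obtains the database lock first becomes the master.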
The use of journaled JDBC seems better than plain JDBC persistence, since journaling is much faster than JDBC persistence alone. It is also better than plain journaled persistence because you have an additional backup of the messages in the database. Journaled JDBC has the further advantage that the data in the journal is later persisted to the database, where it can be accessed by developers when needed.
However, when you are using a master/slave ActiveMQ topology with journaled JDBC, you might end up losing messages, since there may be messages in the journal that have not yet made it into the database.
If you have a redelivery plugin policy in place and use a master/slave setup, the scheduler is used for the redelivery.
As of today, the scheduler can only be set up on a file-based store, not on JDBC. If you do not pay attention to that, you will take all messages that are in redelivery out of the HA scenario and keep them local to the broker.
https://issues.apache.org/jira/browse/AMQ-5238 is an issue in Apache issue tracker that asks for a JDBC persistence adapter for schedulerdb. You can place a vote for it to make it happen.
Actually, even with the top AMQ HA solution, LevelDB + ZooKeeper, the scheduler is taken out of the game and is documented to create issues (see http://activemq.apache.org/replicated-leveldb-store.html, at the end of the page).
In a JDBC scenario, therefore, how to set up the data store for the redelivery policy can be considered unsafe and unsupported, or at least not clearly documented.

Session replication across JVMs in WebSphere

We have an infrastructure set up in which the web servers are clustered and the application servers are not. The web servers route requests to the application servers based on a round-robin policy.
In this scenario, the session data available on one application server is not available on the other application server. Is there any way the session data from the first application server can be made available on the second application server? The two application servers are physically separate boxes in different cells.
One approach could be to use the database; are there any other means of accomplishing this session replication?
In WebSphere there are essentially two ways to replicate session data:
Persisting to a database
Memory-To-Memory transfers
Which one is appropriate for your needs is highly dependent on your application scenario:
How important is the persistence of your session data, when all your application servers go down?
How many session objects do you have at any one time simultaneously?
In a database you can store many sessions without much trouble; the other option is always a question of how much memory is available.
I would go with the database, if you already got one set up, which all application servers use anyway.
Here is the link to the WebSphere Information Center with the necessary details.
One obvious solution is to enable clustering of your application servers. I assume from the way you worded your question you have rejected this option. Another option is to change the routing used by the web servers to use session affinity (requests for the same session go to the same app server).
Other that that, I'd second the answer by dertoni.
Maybe you can look at Terracotta. It's a caching framework which can cache sessions and runs on a separate server.
There are two options for session clustering within WebSphere: memory-to-memory replication or the database. If you have large session objects you are best off using the database, because it allows you to offload stale sessions to disk; if they are then requested again, they can be retrieved from the database. If you use memory-to-memory replication, those sessions need to stay in memory not just on your target server but also on the other servers in the replication group. With large sessions this can lead to an out-of-memory condition.
Database session handling is also very customisable and didn't noticeably affect performance in the environments where I have used it.
Don't forget Oracle Coherence.
