Tomcat In-Memory Caching - replication on Cluster - java

I have a question on Tomcat clustering. I have a Java application in which we have implemented in-memory caching: when Tomcat starts, it loads a few objects from the database, and these objects are stored in Tomcat's memory as static objects. Whenever we update something from the application, it writes to the database and also updates the object in memory.
My question is: if we implement clustering in Tomcat with 2 or more nodes, will those cached objects also be shared? Is that possible? I don't think it is. HttpSession objects can be shared using the session replication provided by Tomcat's DeltaManager or BackupManager, but can the in-memory objects also be shared?
Additionally, what happens to batch jobs that are running? Will they run multiple times, since there will be multiple Tomcat instances in the cluster and each would trigger the job? That would be a problem as well.
Any thoughts/ideas?

If you save something in memory, it will not be replicated unless you implement something specifically to send it to the other machines. Each JVM keeps its memory independent of the others.
In general, if you want replicated caching, a good solution is Ehcache (http://www.ehcache.org/).
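A minimal sketch of what using Ehcache (2.x API) could look like, assuming a cache named sharedObjects is defined in an ehcache.xml on the classpath; the replication itself (e.g. RMI-based, via a cacheEventListenerFactory such as RMICacheReplicatorFactory) would be configured declaratively in that file, not in code:

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class CacheBootstrap {
    public static void main(String[] args) {
        // ehcache.xml holds the replication settings (e.g. an
        // RMICacheReplicatorFactory listener on each cache).
        CacheManager manager = CacheManager.create(
                CacheBootstrap.class.getResource("/ehcache.xml"));

        // "sharedObjects" is a hypothetical cache name.
        Cache cache = manager.getCache("sharedObjects");

        // Puts/updates are propagated to the other nodes by the replicator.
        cache.put(new Element("customer:42", "cached value"));

        Element hit = cache.get("customer:42");
        System.out.println(hit == null ? "miss" : hit.getObjectValue());

        manager.shutdown();
    }
}
```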
With regard to batch jobs, it depends on the library you use, but generally, if you use an established library (like Quartz, http://www.quartz-scheduler.org/), it should be capable of making sure that only one instance runs the job. You may need to configure it for that.
The important thing is to test to make sure that any solution you put in place actually does what you expect it to do.
Good luck!

Whenever you move to a cluster or a cluster-like topology, you need to revise your application's solution design/architecture to make sure it supports multi-instance execution.
Data cached in memory by a given Tomcat instance WILL NOT be shared across instances in the cluster. You will need to move such data outside the Tomcat instances into a shared cache; Redis seems to be a popular option these days.
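As a rough sketch of what "moving the data out" could look like with Redis and the Jedis client (host, port and key names are placeholders; a production setup would use a connection pool such as JedisPool):

```java
import redis.clients.jedis.Jedis;

public class SharedCacheExample {
    public static void main(String[] args) {
        // Placeholder host/port for the shared Redis instance.
        Jedis jedis = new Jedis("redis.internal.example.com", 6379);

        // Every Tomcat node in the cluster reads and writes the same value,
        // because the state lives outside the individual JVMs.
        jedis.set("product:42:name", "The Hobbit");
        System.out.println(jedis.get("product:42:name"));

        jedis.close();
    }
}
```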
Job execution probably needs to be revised and made configuration-driven. Create a Boolean flag your app can read, and kick off batch processing only if it is set. Select the node within the cluster you want the job to run on and set the flag to true there; set it to false on all other nodes, as sketched below. Note that Quartz, unless configured for clustering with its JDBC job store, WILL NOT ensure/control/manage that only one instance of a job runs in a cluster.
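A minimal sketch of that flag-driven approach; the table and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class BatchJobGate {

    private final DataSource dataSource;

    public BatchJobGate(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Runs the batch only on the node whose flag is set to true. */
    public void runIfDesignatedNode(String nodeId, Runnable batch) throws Exception {
        // Hypothetical table: node_config(node_id, run_batch).
        String sql = "SELECT run_batch FROM node_config WHERE node_id = ?";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, nodeId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next() && rs.getBoolean("run_batch")) {
                    batch.run(); // this node is the designated batch runner
                }
            }
        }
    }
}
```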

Related

Cassandra Java Client (Hector): Using the same cluster and keyspace objects throughout the application

I want to know if it is okay to initialize the cluster and keyspace objects just once, i.e. at the start of the application, and use them throughout the application, or whether I should create a new object every time I have to interact with the data store.
There may be multiple parallel reads and writes.
It's definitely OK to use only one cluster/keyspace object in your entire application. Not doing so will probably degrade performance.
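A minimal sketch of holding a single Cluster/Keyspace pair for the whole application via Hector's HFactory; the cluster name, keyspace name and host list are placeholders:

```java
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public final class CassandraClient {

    // Initialized once at application startup; names/hosts are placeholders.
    private static final Cluster CLUSTER = HFactory.getOrCreateCluster(
            "MyCluster", new CassandraHostConfigurator("cass1:9160,cass2:9160"));

    private static final Keyspace KEYSPACE =
            HFactory.createKeyspace("MyKeyspace", CLUSTER);

    private CassandraClient() {
    }

    // Reused for all reads/writes across the application.
    public static Keyspace keyspace() {
        return KEYSPACE;
    }
}
```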

Infinispan Operational modes

I have recently started looking into Infinispan as our caching layer, and I have read through its operation modes, described below.
Embedded mode: This is when you start Infinispan within the same JVM as your applications.
Client-server mode: This is when you start a remote Infinispan instance and connect to it using a variety of different protocols.
Firstly, I am confused about which of the two modes above will be best suited to my application.
I have a very simple use case: client-side code calls our REST service through the service's main VIP, the request gets load-balanced to an individual service server where our service is deployed, and the service then queries the Cassandra database to retrieve the data based on the user id. The picture below should make everything clear.
Suppose, for example, that a client is looking for data for userId = 123. It will call our REST service using the main VIP, get load-balanced to one of our four service servers, say Service1, and Service1 will then call the Cassandra database to get the record for userId = 123 and return it to the client.
Now we are planning to cache the data using Infinispan, since compaction is killing our performance, so that our reads can get a boost. So I started looking into Infinispan and stumbled upon the two modes mentioned above. I am not sure what the best way to use Infinispan is in our case.
Secondly, what I am expecting from the Infinispan cache, supposing I go with embedded mode, is that it should look something like this.
If yes, then how will the Infinispan caches interact with each other? It is possible that at some point we will be looking for data for userIds that live in another service instance's Infinispan cache, right? So what will happen in that scenario? Will Infinispan take care of those things as well? If yes, then what configuration setup do I need to make sure this works correctly?
Pardon my ignorance if I am missing anything. Any clear information on my two questions above would help.
With regard to your second image: yes, the architecture will look exactly like that.
If yes, then how Infinispan cache will interact with each other?
Please, take a look here: https://docs.jboss.org/author/display/ISPN/Getting+Started+Guide#GettingStartedGuide-UsingInfinispanasanembeddeddatagridinJavaSE
Infinispan will manage this using the JGroups protocol, sending messages between nodes. The cluster will be formed and the nodes will join it. After that you will see the expected behaviour of entries being replicated across the relevant nodes.
And that brings us to your next question:
It is possible that at some point we will be looking for data for userIds that live in another service instance's Infinispan cache, right? So what will happen in that scenario? Will Infinispan take care of those things as well?
Infinispan was developed for exactly this scenario, so you don't need to worry about it at all. If you have, for example, 4 nodes and set distribution mode with numOwners=2, your cached data will live on exactly 2 nodes at any moment. When you issue a GET command on a non-owner node, the entry will be fetched from an owner.
You can also set the clustering mode to replication, where all nodes contain all entries. Please read more about the modes here: https://docs.jboss.org/author/display/ISPN/Clustering+modes and choose what is best for your use case.
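A minimal sketch of an embedded, distributed cache using the Infinispan 5.x-era programmatic API that the linked guide documents; the cache name and the numOwners value are just examples:

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class EmbeddedGridExample {
    public static void main(String[] args) {
        // Clustered transport (JGroups) comes with the default clustered builder.
        GlobalConfigurationBuilder global =
                GlobalConfigurationBuilder.defaultClusteredBuilder();

        // Distribution mode: each entry lives on exactly 2 of the nodes.
        Configuration config = new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)
                .hash().numOwners(2)
                .build();

        DefaultCacheManager manager =
                new DefaultCacheManager(global.build(), config);
        Cache<String, String> cache = manager.getCache("users"); // placeholder name

        // A GET on a non-owner node fetches the entry from an owner transparently.
        cache.put("userId:123", "record for user 123");
        System.out.println(cache.get("userId:123"));

        manager.stop();
    }
}
```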
Additionally, when you add a new node to the cluster, a state transfer will take place and synchronize/rebalance entries across the cluster. Non-blocking state transfer is already implemented, so your cluster will still be capable of serving responses during the joining phase. See: https://community.jboss.org/wiki/Non-BlockingStateTransferV2
The same applies when nodes are removed or crash: an automatic rebalancing process kicks in, so entries that after a crash live on only one node (with numOwners=2) are replicated until they again live on 2 nodes, according to the numOwners property of distribution mode.
To sum it up, your cluster stays up to date, and it does not matter which node you ask for a particular entry; if the node does not contain it, the entry is fetched from an owner.
If yes, then what configuration setup do I need to make sure this works correctly?
The aforementioned getting started guide is full of examples, and you can also find sample configuration files in the Infinispan distribution: ispn/etc/config-samples/*
I would also suggest taking a look at this source: http://refcardz.dzone.com/refcardz/getting-started-infinispan where you can find even more basic and very quick configuration examples.
This source also provides information relevant to your first question: "Should I use embedded mode or remote client-server mode?" From my point of view, using a remote cluster is the more enterprise-ready solution (see: http://howtojboss.com/2012/11/07/data-grid-why/). Your caching layer is then easily scalable, highly available and fault tolerant, and it is independent of your database layer and application layer because it simply sits between them.
And you could be interested about this feature as well: https://docs.jboss.org/author/display/ISPN/Cache+Loaders+and+Stores
I think the newest Infinispan release supports a special compatibility mode for users interested in accessing Infinispan in multiple ways.
Follow the link below to configure your cache environment to support both embedded and remote access:
Interoperability between Embedded and Remote Server Endpoints

App Server Clustering vs Terracotta

I've heard the term "clustering" used for application servers like GlassFish, as well as with Terracotta; and I'm trying to understand what the word clustering implies when used in conjunction with application servers, and when used in conjunction with Terracotta.
My understanding is:
If a GlassFish server is clustered, then it means we have multiple physical/virtual machines, each with their own JRE/JVM running separate instances of GlassFish. However, since they are clustered, they will all communicate through their admin server ("DAS"), and have the same apps deployed to all of them. They will effectively act (to the end user) as if they are a single app server - but now with load balancing, failover/redundancy and scalability added into the mix.
Terracotta is, essentially, a product that makes multiple JVMs, running on different physical/virtual machines, act as if they are a single JVM.
Thus, if my understanding is correct, the following are implied:
You cluster app servers when you want load balancing and failover tolerance
You use Terracotta when any particular JVM is too small to contain your application and you need more "horsepower"
Thus, technically, if you have a GlassFish cluster of, say, 5 server instances, each of those 5 instances could itself be an array/cluster of Terracotta instances, meaning each GlassFish server instance actually lives across the JVMs of multiple machines.
If any of these assertions/assumptions are untrue, please correct me! If I have gone way off-base and clearly don't understand clustering and/or the very purpose of Terracotta, please point me in the right direction!
Terracotta enables you to have shared state across all your nodes (it's stateful). Basically, it creates a shared memory space between different JVMs. This is useful when the nodes in a cluster all need access to the same objects.
If your application is stateless and you just need load balancing and failover, you can use a solution like JGroups. In this scenario each node just handles requests and knows little about the other nodes. Objects in memory are not shared across nodes; each JVM runs on its own and has no idea about the other JVMs. This often works nicely for request/response type applications. A web server serving content (without sessions) is an example.
Dealing with a stateless cluster is often simpler than dealing with a stateful one, because in a stateless cluster the nodes know almost nothing about each other, which leaves fewer things that can go wrong.
GlassFish sits a bit in the middle of the above concepts: objects in memory within GlassFish are visible to all nodes, but the frontend (HTTP connectors) works statelessly.
So to answer your questions:
1) Yes, those are the two most obvious reasons. However, sometimes people want only failover, only load balancing, or both. Not all clustering solutions address both of these problems.
2) Yes. Although technically speaking Terracotta only solves the shared-memory part, not the CPU part. By solving the memory part, though, it effectively solves the CPU part as well, since you can now just add JVMs to the shared memory space.
3) I don't know if that's practically possible, but as a thought experiment: yes.
Clustering can mean one of the following:
Multiple instances can be managed as one. Deploy an application to the cluster, it is deployed to all instances in the cluster. Make a configuration change, and that change will be pushed to all nodes in the cluster. GlassFish supports this out of the box.
Service Availability. If any one instance fails, the application is available on another instance. Without high availability enabled, any instance failure also results in session loss for any session being managed by that instance. GlassFish supports this out of the box.
High availability. If any one instance fails, the application is available on another instance, and there is no session loss because a session replica is also maintained on another instance. GlassFish supports this. You will have to choose either #2 or #3 in any one cluster.
What you are asking about IMHO is really #3, because it is the only real case where Terracotta - in the context of high availability clustering - will offer value w/GlassFish. GlassFish already offers built-in high availability, so there had better be a very good reason to add Terracotta to the solution because it will complicate the deployment architecture.
The primary reason I can think of for adding Terracotta is that you may want to offload session management to a data grid and free up GlassFish to run business logic. This may be due to more frequent garbage collection or wanting to manage more users per GlassFish instance. However, I'm not sure that Terracotta can do this seamlessly. With GlassFish's built-in HA clustering, replicating sessions is seamless (no application logic modifications). You may have to write code to put/get data from a Terracotta cache; I'll let you research that :-) Oracle GlassFish Server also integrates (seamlessly) with Coherence to solve this problem: you can separate session management into a Coherence data grid without modifying your application code.
Unless you know for a fact up front that your application must scale to a very large number of concurrent users, start with built-in HA clustering, run tests, and go from there.
Hope this helps.

Quartz clustering setup when running in tomcat on multiple machines

I have a webapp which will run on 2 different machines. From the webapp it is possible to "order" jobs to be executed at specific times by Quartz. Quartz runs inside the webapp, and thus on each of the two machines.
I am using JDBC datastore to persist the jobs, triggers etc.
However, the idea is that only one of the machines will run jobs; the other will only use Quartz to schedule jobs. Thus, the scheduler will only be started (scheduler.start()) on one of the machines.
In the documentation it says
Never run clustering on separate machines, unless their clocks are synchronized using some form of time-sync service (daemon) that runs very regularly (the clocks must be within a second of each other). See http://www.boulder.nist.gov/timefreq/service/its.htm if you are unfamiliar with how to do this.
Never start (scheduler.start()) a non-clustered instance against the same set of database tables that any other instance is running (start()ed) against. You may get serious data corruption, and will definitely experience erratic behavior.
And I'm not sure that the two machines on which my webapp is running have their clocks synchronized.
My question is this: should I still run Quartz in clustered mode for this setup, when only one of the Quartz instances will be started and run jobs, while the other instance is only used to schedule jobs for the first instance to execute?
What about simply starting the scheduler on one node only and accessing it remotely from the other machine? You can schedule jobs using RMI/JMX, or you can use the RemoteScheduler adapter.
Basically, instead of having two clustered schedulers, where one is working and the other only accesses the shared database, you have a single scheduler (server) and access it from the other machine, scheduling and monitoring jobs via its API.
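A rough sketch of the client side of that setup using Quartz's RMI proxy properties (host, port and scheduler name are placeholders; the server node would set org.quartz.scheduler.rmi.export=true under the same scheduler name):

```java
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class RemoteSchedulerClient {

    public static Scheduler connect() throws Exception {
        Properties props = new Properties();
        // Must match the remote scheduler's instance name.
        props.put("org.quartz.scheduler.instanceName", "MyScheduler");
        // Proxy all API calls to the remote (started) scheduler over RMI.
        props.put("org.quartz.scheduler.rmi.proxy", "true");
        props.put("org.quartz.scheduler.rmi.registryHost", "scheduler-host");
        props.put("org.quartz.scheduler.rmi.registryPort", "1099");
        return new StdSchedulerFactory(props).getScheduler();
    }
}
```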
If you will never call the start() method on the second node, then you shouldn't need to worry about clock synchronization.
However, you will want to set the isClustered config property to true to make sure that table-based locking is used when data is inserted/updated/deleted by both nodes.
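A sketch of the relevant JDBC-JobStore properties, set programmatically here for illustration; the data source details are placeholders, and the same keys would normally live in quartz.properties:

```java
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class ClusteredSchedulerFactory {

    public static Scheduler create() throws Exception {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "MyScheduler");
        props.put("org.quartz.scheduler.instanceId", "AUTO"); // unique per node
        props.put("org.quartz.threadPool.class",
                "org.quartz.simpl.SimpleThreadPool");
        props.put("org.quartz.threadPool.threadCount", "5");
        props.put("org.quartz.jobStore.class",
                "org.quartz.impl.jdbcjobstore.JobStoreTX");
        props.put("org.quartz.jobStore.isClustered", "true"); // DB row locking
        props.put("org.quartz.jobStore.dataSource", "quartzDS");
        // Placeholder connection details for the shared job-store database.
        props.put("org.quartz.dataSource.quartzDS.driver", "com.mysql.jdbc.Driver");
        props.put("org.quartz.dataSource.quartzDS.URL", "jdbc:mysql://dbhost/quartz");
        props.put("org.quartz.dataSource.quartzDS.user", "quartz");
        props.put("org.quartz.dataSource.quartzDS.password", "secret");
        return new StdSchedulerFactory(props).getScheduler();
    }
}
```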

Need help with java web app design to perform background tasks

I have a local web app that is installed on a desktop PC, and it needs to regularly sync with a remote server through web services.
I have a "transactions" table that stores transactions that have been processed locally and need to be sent to the remote server, and this table also contains transactions that have retrieved from the remote server (that have been processed remotely) and need to be peformed locally (they have been retrieved using a web service call)... The transactions are performed in time order to ensure they are processed in the right order.
Examples of the types of transactions are "loans" and "returns" of items from a store, for example a video rental store. Something may have been loaned locally and returned remotely, or vice versa, or any sequence of loan/return events.
There is also other information that is retrieved from the remote server to update the local records.
When the user performs the tasks locally, I update the local db in real time and add the transaction to the table for background processing with the remote server.
What is the best approach for processing the background tasks? I have tried using a thread created in an HttpSessionListener, calling interrupt() when the session is removed, but I don't think that is the safest approach. I have also tried using a session attribute as a locking mechanism, but that isn't ideal either.
I was also wondering how you know when a thread has completed its run, so as to avoid launching another thread at the same time, or whether a thread has died before completing.
I have come across another suggestion, the Quartz scheduler, though I haven't read up on that approach in detail yet. I am going to purchase a copy of Java Concurrency in Practice, but I wanted some ideas on the best approach before I get stuck into it.
BTW I'm not using a web app framework.
Thanks.
The safest approach would be to create an application-wide thread pool managed by the container. How to do that depends on the container used. If your container doesn't support it (e.g. Tomcat), or you want to stay container-independent, the basic approach is to implement ServletContextListener, create the thread pool using the ExecutorService API (available since Java 1.5) on startup, and kill the thread pool on shutdown, as sketched below. If you aren't on Java 1.5 yet, or want more abstraction, you can also use Spring's TaskExecutor.
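A minimal sketch of that listener approach (the attribute name and single-thread pool size are arbitrary choices); it would be registered as a <listener> in web.xml:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class BackgroundTaskManager implements ServletContextListener {

    private ExecutorService executor;

    public void contextInitialized(ServletContextEvent event) {
        // A single worker keeps the sync transactions in time order.
        executor = Executors.newSingleThreadExecutor();
        // Expose the pool so servlets can submit background tasks.
        event.getServletContext().setAttribute("backgroundExecutor", executor);
    }

    public void contextDestroyed(ServletContextEvent event) {
        // Interrupt running tasks and reject new ones on shutdown.
        executor.shutdownNow();
    }
}
```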
There was once a Java EE proposal for concurrency utilities, but it has not yet made it into Java EE 6.
Related questions:
What is the recommend way of spawning threads from a servlet?
Background timer task in a JSP web application
It's better to go with the Quartz scheduling framework, because it has most of the features you need for scheduling: storing jobs in a database, concurrency handling, etc.
Please try this solution:
Create a table which stores a flag ('Y' or 'N') mapped to some identifiable field, with a default value of 'N'.
While making the loan itself, schedule a job for the corresponding return, which executes only if the flag is 'Y'.
On return, change the flag to 'Y', which then fires the process you wanted to run.
