I have an App Engine project which persists the data inside a Cloud SQL instance, using EclipseLink as JPA persistence manager.
Due to the nature of App Engine (a multi-instance environment) we have some concerns about how to synchronize the JPA cache between the instances.
Each JPA instance runs inside a single App Engine instance, so App Engine's Memcache service is not used (besides, EclipseLink does not "know" what App Engine Memcache is or how to use it).
Here is a simple scenario example:
- Instance A read object 1: value="A"
- Instance B read object 1: value="A"
- Instance A write object 1: value="B"
- JPA cache of Instance A is evicted due to write operation
- Instance A read object 1: value="B" (the value is retrieved from the database because cache has been evicted after write operation)
- Instance B read object 1: value="A" (no write operation has been performed, the cache is still valid so the value has not been updated)
Searching around for this kind of behaviour, I found several articles that discuss it [1] [2] [3] [4].
I quote:
unless the database is modified directly by other applications, or by
the same application on other servers in a clustered environment
Given the nature of App Engine, we can consider it "other servers in a clustered environment", so this seems to be exactly our case.
Of course the proper way to handle this problem would be to build a cache layer for JPA on top of the App Engine Memcache service, but from my searches I understand that EclipseLink does not allow developing a custom cache layer.
I'm willing to build something that bridges EclipseLink and the App Engine Memcache, but I cannot find any reference to the proper "hooks" for doing so.
From the documentation there are a few suggestions on how to handle this:
- Disable the shared cache: this is not a suitable option, due to the loss of application performance.
- Use a distributed cache (such as Oracle TopLink Grid with Oracle Coherence): I would like to use the App Engine Memcache service, but as I understand it there is no EclipseLink "hook" we can use.
- Use cache coordination (synchronizing the caches, as discussed in this example): the provided methods do not seem usable in the App Engine environment.
Is there a known solution for properly handling this cache scenario?
The scenario here is very clear: when a write operation is made on one instance, the JPA caches of all the other instances need to be "notified" as well, so they can evict the affected entries.
[1] https://wiki.eclipse.org/EclipseLink/Examples/JPA/CacheCoordination
[2] http://www.eclipse.org/eclipselink/documentation/2.5/concepts/cache011.htm
[3] https://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Caching/Coordination
[4] https://wiki.eclipse.org/EclipseLink/Examples/JPA/Caching#Caching_in_Clustered_Environments
I took a look at the EclipseLink (EL) source to see if there is any easy way to extend the cache coordination mechanism to work with GAE.
EL supports JMS and RMI by default, and the cache coordination is built around remoting: EL sends commands (org.eclipse.persistence.sessions.coordination.Command) that are executed against the AbstractSession on every host in the cluster.
I don't think there is any way you could use Memcache for caching, because the commands, like MergeChangeSetCommand, always operate on the AbstractSession.
It is possible to build your own cache coordination protocol. This is done by extending org.eclipse.persistence.sessions.coordination.TransportManager and setting eclipselink.cache.coordination.protocol=com.example.MyTransportManager, but the DiscoveryManager uses multicast, which is typically not available in the cloud. If you could discover all your GAE instances (and send data directly to each node), I think it would be possible to create an HTTP-based cache coordination solution. On AWS it is possible to ask the load balancer for the list of nodes; this is how we get around the multicast problem when we need to use Hazelcast for intra-node communication.
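The HTTP fan-out idea can be sketched with JDK classes alone. This is a hypothetical illustration, not EclipseLink's API: the /evict endpoint, the query format, and how peer URLs are discovered are all assumptions, and in a real deployment the evictFn callback would wrap something like JPA's emf.getCache().evict(entityClass, id).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.List;
import java.util.function.BiConsumer;

public class EvictionBridge {

    // After a local write, tell every other instance to drop the entity
    // from its shared cache. peerUrls would come from instance discovery,
    // which GAE does not provide out of the box.
    public static void broadcastEvict(List<String> peerUrls, String entity, String id) throws IOException {
        for (String peer : peerUrls) {
            URL url = new URL(peer + "/evict?entity=" + entity + "&id=" + id);
            HttpURLConnection c = (HttpURLConnection) url.openConnection();
            c.setRequestMethod("POST");
            c.getResponseCode(); // block until the peer has processed the eviction
            c.disconnect();
        }
    }

    // Each instance runs this endpoint; evictFn is the local eviction hook
    // (e.g. wrapping emf.getCache().evict(entityClass, id)).
    public static HttpServer startReceiver(int port, BiConsumer<String, String> evictFn) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/evict", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // entity=Foo&id=1
            String[] parts = query.split("&");
            evictFn.accept(parts[0].split("=")[1], parts[1].split("=")[1]);
            exchange.sendResponseHeaders(200, -1); // 200, empty body
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

The missing piece on GAE remains discovery: without a way to enumerate the running instances, there is nowhere to send the eviction messages.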
Related
I am developing Java-based jobs with some business logic, each running in its own JVM, and I want a separate cache containing frequently accessed data from the database, also running in its own JVM. The jobs need to access this cache instead of hitting the database.
Which cache can I use? Ehcache, Hazelcast or Coherence?
How will the jobs access this cache? Basically, how will I expose cache operations (mostly fetch operations) to the jobs?
I only have some experience with EhCache, which served as a cache layer for Hibernate (or any other ORM) and is transparent to fetch operations, meaning that you don't have to explicitly activate it every time you run a DB query. EhCache inspects each query you run, and if it sees that the same query with the same parameters was run previously, it will hit the cache, unless invalidated. However, in its default configuration EhCache runs on the same JVM as your Java application.
There is another solution, since you wish to run the cache on a separate machine. A database called Redis (https://redis.io/) is a great tool for building fast caches. It is an in-memory, key-value NoSQL store that runs as a standalone server, which fits your requirement of a cache in its own process. I highly recommend you try it out (I am not affiliated with Redis in any way :) )
We're trying to horizontally scale a JPA based application, but have encountered issues with the second level cache of JPA. We've looked at several solutions (EhCache, Terracotta, Hazelcast) but couldn't seem to find the right solution. Basically what we want to achieve is to have multiple application servers all pointing to a single cache server that serves as the JPA's second level cache.
From a non-Java perspective, it would look like several PHP servers all pointing to one centralised memcache server as their cache service. Is this currently possible with Java?
Thanks
This is in response to the comment above.
Terracotta will be deployed on its own server.
Each of the app servers will have Terracotta drivers, which store/retrieve data to/from the Terracotta server.
The Ehcache API, present in the application WAR, will invoke the Terracotta drivers to store data in the Terracotta server.
The Hibernate API will maintain the L1 cache; in addition, it will use the Ehcache API to save/retrieve data to/from the L2 cache, blissfully unaware of how the Ehcache API performs the task.
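As a rough sketch of how the Ehcache side of this setup is wired (Ehcache 2.x-style configuration; the host, port and cache name here are placeholders):

```xml
<ehcache>
  <!-- Points the Ehcache API at the standalone Terracotta server -->
  <terracottaConfig url="terracotta-host:9510"/>

  <!-- An L2 cache region whose entries are stored in the Terracotta server -->
  <cache name="com.example.domain.Customer" maxEntriesLocalHeap="1000">
    <terracotta/>
  </cache>
</ehcache>
```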
I have recently started looking into Infinispan as our caching layer, after reading through the Infinispan operation modes mentioned below.
Embedded mode: This is when you start Infinispan within the same JVM as your applications.
Client-server mode: This is when you start a remote Infinispan instance and connect to it using a variety of different protocols.
Firstly, I am now confused about which of the above two modes is best suited to my application.
I have a very simple use case: we have client-side code that calls our REST service using the service's main VIP; the call is load-balanced to an individual service server where our service is deployed, which then interacts with the Cassandra database to retrieve the data based on the user id. The picture below makes everything clear.
Suppose, for example, a client is looking for some data for userId = 123. It will call our REST service using the main VIP and get load-balanced to any of our four service servers; suppose it gets load-balanced to Service1. Then Service1 will call the Cassandra database to get the record for userId = 123 and return it to the client.
Now we are planning to cache the data using Infinispan, as compaction is killing our performance, so that our read performance gets a boost. So I started looking into Infinispan and stumbled upon the two modes mentioned above. I am not sure what the best way to use Infinispan is in our case.
Secondly, what I expect from the Infinispan cache is that, if I go with embedded mode, it should look something like this.
If yes, then how will the Infinispan caches interact with each other? It might be possible that at some point we will be looking for data for userIds that live in another service instance's Infinispan cache, right? So what will happen in that scenario? Will Infinispan take care of that as well? If yes, what configuration setup do I need to make sure this works correctly?
Pardon my ignorance if I am missing anything. Any clear information on my two questions above would help.
With regards to your second image: yes, the architecture will look exactly like that.
If yes, then how will the Infinispan caches interact with each other?
Please, take a look here: https://docs.jboss.org/author/display/ISPN/Getting+Started+Guide#GettingStartedGuide-UsingInfinispanasanembeddeddatagridinJavaSE
Infinispan will manage this using the JGroups protocol, sending messages between nodes. The cluster will be formed and the nodes will be clustered; after that you will see entries replicated across the nodes as expected.
And here we go to your next question:
It might be possible that at some point we will be looking for data for userIds that live in another service instance's Infinispan cache, right? So what will happen in that scenario? Will Infinispan take care of that as well?
Infinispan was developed for this scenario, so you don't need to worry about it at all. If you have, for example, 4 nodes and set distribution mode with numOwners=2, your cached data will live on exactly 2 nodes at any moment. When you issue a GET on a non-owner node, the entry will be fetched from an owner.
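As an illustrative sketch (Infinispan 5.x-style XML; the cluster and cache names are placeholders), a distributed cache with two owners per entry can be declared like this:

```xml
<infinispan>
  <global>
    <!-- Nodes with the same cluster name discover each other via JGroups -->
    <transport clusterName="service-cluster"/>
  </global>
  <namedCache name="users">
    <!-- Each entry lives on exactly 2 of the clustered nodes -->
    <clustering mode="distribution">
      <hash numOwners="2"/>
      <sync/>
    </clustering>
  </namedCache>
</infinispan>
```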
You can also set the clustering mode to replication, where all nodes contain all entries. Please read more about the modes here: https://docs.jboss.org/author/display/ISPN/Clustering+modes and choose what is best for your use case.
Additionally, when you add a new node to the cluster, a state transfer will take place to synchronize/rebalance entries across the cluster. Non-blocking state transfer is already implemented, so your cluster will still be capable of serving responses during the joining phase. See: https://community.jboss.org/wiki/Non-BlockingStateTransferV2
The same applies to removed or crashed nodes in your cluster: an automatic rebalancing process ensures that, for example, entries (numOwners=2) that live on only one node after a crash are replicated so that they again live on 2 nodes, according to the numOwners property of distribution mode.
To sum it up, your cluster stays up to date, and it does not matter which node you ask for a particular entry: if that node does not contain it, the entry will be fetched from an owner.
If yes, what configuration setup do I need to make sure this works correctly?
The aforementioned Getting Started guide is full of examples, plus you can find configuration file examples in the Infinispan distribution: ispn/etc/config-samples/*
I would suggest you take a look at this source too: http://refcardz.dzone.com/refcardz/getting-started-infinispan where you can find even more basic and very quick configuration examples.
This source also provides decision-related information for your first question: "Should I use embedded mode or remote client-server mode?" From my point of view, using a remote cluster is the more enterprise-ready solution (see: http://howtojboss.com/2012/11/07/data-grid-why/). Your caching layer is then easily scalable, highly available and fault tolerant, and is independent of your database layer and application layer because it simply sits between them.
And you could be interested about this feature as well: https://docs.jboss.org/author/display/ISPN/Cache+Loaders+and+Stores
I think the newest Infinispan release supports a special compatibility mode for users interested in accessing Infinispan in multiple ways.
Follow the link below to configure your cache environment to support both embedded and remote access:
Interoperability between Embedded and Remote Server Endpoints
I have a requirement where I need to create a third-party application which will test an object's presence in Oracle Coherence.
Scenario: our main application uses Oracle Coherence to store some data; now I have to create a separate application (which will be running on a different server, outside the Coherence cluster). This application will detect whether a particular object is present in Coherence or not. We have no plans to run Coherence on this machine too.
Can a third-party application (which is not part of the Coherence cluster) connect to Coherence and fetch data? If yes, then how? Can I get some pointers for doing this?
There are multiple ways you can do it.
1) Use Coherence Extend - this allows any application to interact with Coherence without being part of the Coherence cluster.
Refer to http://docs.oracle.com/cd/E14526_01/coh.350/e14509/configextend.htm
This option is supported only if the third-party application is written in Java, .NET or C++:
http://coherence.oracle.com/display/COH35UG/Coherence+Extend#CoherenceExtend-Typesofclients
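For illustration, a client-side cache configuration for Coherence Extend might look roughly like this (the scheme name, host and port are placeholders):

```xml
<cache-config>
  <caching-scheme-mapping>
    <cache-mapping>
      <!-- Route every cache name through the Extend connection -->
      <cache-name>*</cache-name>
      <scheme-name>extend-scheme</scheme-name>
    </cache-mapping>
  </caching-scheme-mapping>
  <caching-schemes>
    <remote-cache-scheme>
      <scheme-name>extend-scheme</scheme-name>
      <initiator-config>
        <tcp-initiator>
          <remote-addresses>
            <!-- Address of a proxy node inside the Coherence cluster -->
            <socket-address>
              <address>coherence-proxy-host</address>
              <port>9099</port>
            </socket-address>
          </remote-addresses>
        </tcp-initiator>
      </initiator-config>
    </remote-cache-scheme>
  </caching-schemes>
</cache-config>
```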
2) Use the REST API - the newer/latest versions of Coherence expose cache data management through REST APIs. Refer to http://docs.oracle.com/cd/E24290_01/coh.371/e22839/rest_intro.htm
This option places no restriction on the client/third-party technology, as it is based on XML/JSON over HTTP.
Using REST you can check the presence of a cache key as shown below.
GET Operation
GET http://{host}:{port}/cacheName/key
Returns a single object from the cache based on a key. A 404 (Not Found) message is returned if the object with the specified key does not exist.
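A minimal presence check against that endpoint, sketched in plain Java (the base URL and cache name would be replaced with your own):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class CoherenceRestCheck {

    // GET {baseUrl}/{cacheName}/{key}: 200 means the key exists,
    // 404 means it does not.
    public static boolean isKeyPresent(String baseUrl, String cacheName, String key) throws IOException {
        URL url = new URL(baseUrl + "/" + cacheName + "/" + key);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        int code = conn.getResponseCode(); // does not throw on 404
        conn.disconnect();
        return code == HttpURLConnection.HTTP_OK;
    }
}
```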
I created such a tool some time back using the C++ API.
https://github.com/actsasflinn/coherence-tool
I also wrapped the C++ API in a Ruby binding for scripting purposes.
https://github.com/actsasflinn/ruby-coherence
Either of these can run standalone outside of the cluster and rely on the TCP proxy method of communicating with a cluster.
We have a system (Java web application) that's been in active development / maintenance for a long time now (something like ten years).
What we're looking at doing is implementing a RESTful API to the web app. This web application, using Jersey, will be a separate project with the intent that it should be able to run alongside the main application or deployed in the cloud.
Because of the nature and age of our application, we've had to implement a (somewhat) comprehensive caching layer on top of the database (Postgres) to help keep load down. For the RESTful API, the idea is that GET requests will go to the cache first instead of the database, to keep load off the database.
The cache will be populated in a way to help ensure that most things registered API users will need should be in there.
If there is a cache miss, the needed data should be retrieved from the database (also being entered into the cache in the process).
Obviously, this should remain transparent to the RESTful endpoint methods in my code. We've come up with the idea of creating a 'Broker' to handle communication with the DB and the cache. The REST layer will simply pass across ids (if retrieving) or populated Java objects (if inserting/updating) and the broker will take care of retrieving, updating, invalidating, etc.
There is also the issue of extensibility. To begin with, the API will live alongside the rest of our servers, so access to the database won't be an issue; however, if we deploy to the cloud, we're going to need a different Broker implementation that communicates with the system (namely the database) in a different manner, potentially through an internal API.
I already have a rough idea on how I can implement this but it struck me that is probably a problem for which a suitable pattern could exist. If I could follow an established pattern as opposed to coming up with my own solution, that'll probably be a better choice. Any ideas?
Ehcache has an implementation of just such a cache that it calls a SelfPopulatingCache.
Requests are made to the cache, not to the database. Then if there is a cache miss Ehcache will call the database (or whatever external data source you have) on your behalf.
You just need to implement a CacheEntryFactory which has a single method:
Object createEntry(Object key) throws Exception;
So as the name suggests, Ehcache implements this concept with a pretty standard factory pattern...
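The idea can be sketched with plain JDK types (this is an illustration of the pattern, not Ehcache's actual classes); the Function here stands in for CacheEntryFactory.createEntry:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Read-through cache: callers only ever ask the cache; on a miss the
// cache invokes the factory (e.g. a database lookup) and keeps the result.
public class SelfPopulating<K, V> {
    private final ConcurrentMap<K, V> store = new ConcurrentHashMap<>();
    private final Function<K, V> factory;

    public SelfPopulating(Function<K, V> factory) {
        this.factory = factory;
    }

    public V get(K key) {
        // Miss -> factory runs once for this key -> result is cached.
        return store.computeIfAbsent(key, factory);
    }
}
```

In Ehcache itself you would instead wrap an Ehcache instance in a SelfPopulatingCache together with your CacheEntryFactory, but the contract seen by callers is the same.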
There's no pattern. Just hide the initial DB services behind interfaces, build tests around their intended behavior, then switch in an implementation that uses the caching layer. I guess dependency injection would be the best thing to help you do that?
Sounds like the decorator pattern will suit your need: http://en.wikipedia.org/wiki/Decorator_pattern.
You can create a DAO interface for data access, something like:
Value get(long id);
First create a direct DB implementation, then create a cache implementation which calls an underlying DAO instance - in this case the DB implementation.
The cache implementation will try to get the value from its own managed cache first, and fall back to the underlying DAO on a miss.
So both your old application and the REST layer will only see the DAO interface, without knowing any implementation details, and in the future you can change the implementation freely.
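A minimal sketch of that layering, with the database stubbed by a map for illustration (ValueDao and the class names are made up for the example):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface ValueDao {
    String get(long id);
}

// Direct DB implementation (the database call is stubbed by a map here).
class DbValueDao implements ValueDao {
    private final Map<Long, String> db;
    DbValueDao(Map<Long, String> db) { this.db = db; }
    public String get(long id) { return db.get(id); }
}

// Cache decorator: same interface, tries its own cache first and falls
// back to the underlying DAO on a miss, caching the result.
class CachingValueDao implements ValueDao {
    private final ValueDao delegate;
    private final ConcurrentHashMap<Long, String> cache = new ConcurrentHashMap<>();
    CachingValueDao(ValueDao delegate) { this.delegate = delegate; }
    public String get(long id) { return cache.computeIfAbsent(id, delegate::get); }
}
```

Callers only ever hold a ValueDao reference, so swapping the caching decorator in or out is a one-line change: ValueDao dao = new CachingValueDao(new DbValueDao(db));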
The best design pattern for transparently caching HTTP requests is to use an HTTP cache.