Share file storage index with multiple open applications in Java


I'm writing an HTTP Cache library for Java, and I'm trying to use that library in the same application which is started twice. I want to be able to share the cache between those instances.
What is the best solution for this? I also want to be able to write to that same storage, and it should be available for both instances.
Now I have a memory-based index of the files available to the cache, and this is not shareable over multiple VMs. It is serialized between startups, but this won't work for a shared cache.
According to the HTTP Spec, I can't just map files to URIs as there might be a variation of the same payload based on the request. I might, for instance, have a request that varies on the 'accept-language' header: In that case I would have a different file for each subsequent request which specifies a different language.
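To make the problem concrete, here is a minimal sketch of a cache key that combines the URI with the values of the request headers named in the response's Vary header. The class name `VaryKey` and its method are hypothetical, made up for illustration, and not part of any existing library; a real cache would also need to handle `Vary: *` and header value normalization.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class VaryKey {
    /**
     * Builds a cache key from the request URI plus the values of the
     * request headers listed in the response's Vary header.
     */
    static String of(String uri, List<String> varyHeaders, Map<String, String> requestHeaders) {
        // Header names are case-insensitive, so normalize them to lower case.
        Map<String, String> lower = new TreeMap<>();
        for (Map.Entry<String, String> e : requestHeaders.entrySet()) {
            lower.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue().trim());
        }
        // Sort the Vary names so the key does not depend on their order.
        Map<String, String> selected = new TreeMap<>();
        for (String name : varyHeaders) {
            String n = name.trim().toLowerCase(Locale.ROOT);
            selected.put(n, lower.getOrDefault(n, ""));
        }
        StringBuilder key = new StringBuilder(uri);
        for (Map.Entry<String, String> e : selected.entrySet()) {
            key.append('|').append(e.getKey()).append('=').append(e.getValue());
        }
        return key.toString();
    }
}
```

With a key like this, two requests for the same URI with different 'accept-language' values map to different index entries, and therefore to different files.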
Any Ideas?

First, are you sure you want to write your own cache when there are several around? Things like:
ehcache
jboss cache
memcached
The first two are written in Java and the third can be accessed from Java. The first two also handle distributed caching, which is the general case of what you are asking for, I think. When they start up, they look to connect to other members so that they maintain a consistent cache across instances. Changes to one are reflected across instances. They can be set up to connect via multicast or with specific lists of servers specified.
Memcached typically works in a slightly different manner in that it is running externally to the Java processes you are running, so that all Java instances that start up will be talking to a common service. You can set up memcached to work in a distributed manner, but it does so by hashing keys so that the server you want to connect to can be determined by what it is you are looking for.
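The idea of picking a server by hashing the key can be sketched as follows. This is only an illustration using simple modulo hashing; the class `ServerSelector` is hypothetical, and real memcached clients typically use more elaborate schemes such as consistent hashing so that adding a server does not remap most keys.

```java
import java.util.ArrayList;
import java.util.List;

final class ServerSelector {
    private final List<String> servers;

    ServerSelector(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.servers = new ArrayList<>(servers); // defensive copy
    }

    /** Returns the server responsible for the given key. */
    String serverFor(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(index);
    }
}
```

Every Java instance that is configured with the same server list computes the same server for a given key, so no coordination between the instances is needed.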
Doing a true distributed cache with consistent content is very hard to do well, which is why I suggest looking at an existing library. If you want to do it yourself, it would still help to look at those listed to see how they go about it and consider using something like JGroups as your underlying mechanism.

I think you should have a look at the WebDAV specifications. WebDAV is an HTTP extension for sharing, editing, storing, and versioning resources on a server. There is an implementation as an Apache module, which gives you a quick start.
So instead of implementing your own cache server, you might be better off with a local Apache + mod_dav instance that is available to both of your applications.
Extra bonus: since WebDAV is a specified protocol, you get interoperability with lots of tools for free.

Related

In a distributed Java web application, how to share a value between all servlets on all machines?

If I have a distributed java web application deployed in a cluster and I have say 10 servlets & 10 JSPs running the show, and if I want to share some data, say a variable or a simple POJO between all the threads of all the servlets on all the machines, what is the way to do it?
No framework like Spring/Struts is used and let's say I'm only using the basic Servlets and JSPs. Usually we think about ServletConfig, ServletContext, HttpSession and HttpServletRequest objects to store information which needs to be passed/shared from one component to another. ServletContext has the largest scope because it's accessible from all the servlets and JSPs in the web app. But in the case of a distributed application I guess one ServletContext object would be created per JVM, so even for a single web app every machine in the cluster will have a different Java object for ServletContext, correct? So in such a scenario what should be done to share a POJO between all the servlets on all the machines of a single web app?
If it's not possible using plain Servlets and JSP, do any frameworks make it possible? Would appreciate any input. Many thanks!

In a distributed architecture, it is useful to think beyond objects and think about "services". There are several possible solutions for this but all of them would include some form of service you could access from any of your 10 nodes.
So, you could for example create an 11th machine and host an API for putting and getting objects (values/maps/etc?). That would create a shareable region between the nodes.
However, this opens a whole world of possible issues if not done correctly, because you need to think about synchronization, deadlocks, dirty reads and other concurrency concerns in a cross-JVM mindset.
Also, many systems synchronize their nodes via the database, but this approach has somewhat fallen out of favor compared with the more recent "microservices" approach, where persistence is distributed, not monolithic.

You are using Spring already, so maybe the Spring Session project is the right choice for you: http://projects.spring.io/spring-session/. It is certainly the easiest one to run.

You can use Hazelcast, a framework like memcached but with auto-discovery for clustering. I used it for session and cache sharing on my Amazon cluster and it works like a charm.
http://hazelcast.com/use-cases/caching/
But if you want to keep it simple, you can always use memcached, as I said before.
http://memcached.org/

Sharing things between servers is:
error prone
sometimes complicated
The most common thing to want is user session data across a load balanced cluster of servers. If someone is talking to one server, then gets load balanced to a different server, you want to keep their session going. Tomcat Clusters does this, and it's already built in.
https://tomcat.apache.org/tomcat-7.0-doc/cluster-howto.html
The last time I played with that, it was touchy; don't count on session replication always working in any servlet container, and you'll be better off. Also, session replication is crazy expensive; once you're past a few machines, the cost (in RAM) of having all session data everywhere... starts to add up quickly, and you can't add more users easily anymore.
Wanting to share things between multiple JVMs is a code smell; if you can architect around it, do so. But other than clustering, you have the two normal options:
a database. Tried, true, tested; keep details that need to change there.
an in-memory store. If it gets called on every request, and/or must be really fast for whatever reason, just consider keeping it in memory; memcached is a multi-machine in-memory key-value-store that does just this.

The simplest solution is ConcurrentHashMap https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ConcurrentHashMap.html
If you want to scale your application - you will need something like hazelcast - http://hazelcast.com/
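For the single-JVM case, a minimal sketch of the ConcurrentHashMap approach might look like this. The holder class `SharedValues` is a hypothetical name. Note the limitation stated in the comment: the map is shared by all threads in one JVM only, so every machine in a cluster still has its own independent copy.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class SharedValues {
    // One map per JVM: every servlet thread in this process sees the same
    // entries, but other JVMs in the cluster have their own separate map.
    private static final ConcurrentMap<String, Object> VALUES = new ConcurrentHashMap<>();

    private SharedValues() {}

    static void put(String key, Object value) {
        VALUES.put(key, value);
    }

    static Object get(String key) {
        return VALUES.get(key);
    }

    /** Atomically creates the value on first access; later callers get the same object. */
    static Object getOrCreate(String key, Function<String, Object> factory) {
        return VALUES.computeIfAbsent(key, factory);
    }
}
```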

Sharing a java object across a cluster

My requirement is to share a java object across a cluster.
I am confused about
whether to write an EJB and share the Java objects across the cluster
or
to use a third-party library such as Infinispan, memcached or Terracotta, or
what about JCache?
with the constraints that
I can't change any of my source code to be specific to any application server (such as implementing WebLogic's singleton services).
I can't offer two builds for cluster and non-cluster environments.
Performance should not be degraded.
I am looking only for open source if I need to use a third-party library.
It needs to work in WebLogic, WebSphere, JBoss and Tomcat too.
Can anyone come up with the best option with these constraints in mind?

It can depend on the use case of the objects you want to share in the cluster.
I think it really comes down to the following options, from most complex to least complex.
Distributed caching
http://www.ehcache.org
Distributed caching is good if you need to ensure that an object is accessible from a cache on every node. I have used Ehcache for distribution quite successfully; there is no need to set up a Terracotta server unless you need the scale, you can just point the instances at each other via RMI. It works synchronously or asynchronously depending on requirements. Cache replication is also handy if nodes go down, because the cache is redundant and you don't lose anything. It is a good fit if you need to make sure that the object has been updated across all the nodes.
Clustered Execution/data distribution
http://www.hazelcast.com/
Hazelcast is also a nice option as it provides a way of executing Java classes across a cluster. This is more useful if you have an object that represents a unit of work that needs to be performed and you don't care so much where it gets executed.
It is also useful for distributed collections, i.e. a distributed map or queue.
Roll your own RMI/JGroups
You can write your own client/server, but I think you will start to run into issues that the bigger frameworks solve if the requirements of the objects you're dealing with start to get complex. Realistically, Hazelcast is really simple and should eliminate the need to roll your own.

It's not open source, but Oracle Coherence would easily solve this problem.
If you need an implementation of JCache, the only one that I'm aware of being available today is Oracle Coherence; see: http://docs.oracle.com/middleware/1213/coherence/develop-applications/jcache_part.htm
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

This is just an idea; you would need to check the exact implementation.
It will degrade performance, but I don't see how that can be avoided.
It is not easy to implement. You might consider load balancing instead of clustering.
You might consider RMI and/or a dynamic proxy:
extract an interface from your objects.
use RMI to access the real object (from all cluster nodes, even the one that actually holds the object).
to add RMI to existing code, you might use a dynamic proxy (again, I'm not sure about the implementation).
*A dynamic proxy can wrap any object and run some pre and post tasks on each method invocation. In this case it could use the original object for the RMI invocation.
You will need connectivity between the cluster nodes in order to propagate the RMI object.
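The dynamic proxy part of this idea can be sketched with `java.lang.reflect.Proxy`. The `Counter` interface and the classes below are hypothetical examples; the handler here simply calls a local object, and the RMI lookup that would replace that call in a cluster is left out.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

interface Counter {                 // hypothetical extracted interface
    int increment();
}

final class LocalCounter implements Counter {
    private int value;
    @Override
    public int increment() { return ++value; }
}

final class LoggingHandler implements InvocationHandler {
    private final Object target;
    final List<String> log = new ArrayList<>();

    LoggingHandler(Object target) { this.target = target; }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        log.add("before " + method.getName());      // pre task
        try {
            // In the RMI variant, this call would go to a remote stub instead.
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause();                     // rethrow the target's own exception
        } finally {
            log.add("after " + method.getName());   // post task
        }
    }

    /** Creates a proxy that implements iface and routes every call through handler. */
    static <T> T wrap(Class<T> iface, InvocationHandler handler) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }
}
```

Callers only see the `Counter` interface, so existing code does not need to know whether the object behind it is local or remote.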

Client side caching in GWT

We have a GWT client which receives quite a lot of data from our servers. Logically, I want to cache the data on the client side, sparing the server from unnecessary requests.
As of today I have left it up to my models to handle the caching of data, which doesn't scale very well. It has also become a problem since different developers in our team develop their own "caching" functionality, which floods the project with duplication.
I'm thinking about how one could implement a "single point of entry", that handles all the caching, leaving the models clueless about how the caching is handled.
Does anyone have any experience with client side caching in GWT? Is there a standard approach that can be implemented?

I suggest you look into gwt-presenter and the CachingDispatchAsync. It provides a single point of entry for executing remote commands and therefore a perfect opportunity for caching.
A recent blog post outlines a possible approach.

You might want to take a look at the Command Pattern; Ray Ryan held a talk at Google IO about best practices in GWT, here is a transcript: http://extgwt-mvp4g-gae.blogspot.com/2009/10/gwt-app-architecture-best-practices.html
He proposes the use of the Command Pattern using Action and Response/Result objects which are thrown in and out the service proxy. These are excellent objects to encapsulate any caching that you want to perform on the client.
Here's an excerpt: "I've got a nice unit of currency for implementing caching policies. May be whenever I see the same GET request twice, I'll cache away the response I got last time and just return that to myself immediately. Not bother with a server-side trip."
In a fairly large project, I took another direction. I developed a DtoCache object which essentially held a reference to each AsyncCallback that was expecting a response from a service call in a waiting queue. Once the DtoCache received the objects from the server, they were cached inside the DtoCache. The cached result was henceforth returned to all queued and newly created AsyncCallbacks for the same service call.
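A rough sketch of that DtoCache idea in plain Java follows. It is a reconstruction from the description above, not the original code: `Loader` is a hypothetical stand-in for the asynchronous service call, `Consumer` stands in for GWT's AsyncCallback, failure handling is omitted, and it assumes single-threaded use as in browser code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class DtoCache<K, V> {
    /** Hypothetical stand-in for an asynchronous service call. */
    interface Loader<K, V> {
        void load(K key, Consumer<V> onResult);
    }

    private final Loader<K, V> loader;
    private final Map<K, V> cached = new HashMap<>();
    private final Map<K, List<Consumer<V>>> waiting = new HashMap<>();

    DtoCache(Loader<K, V> loader) { this.loader = loader; }

    void get(K key, Consumer<V> callback) {
        if (cached.containsKey(key)) {           // already loaded: answer immediately
            callback.accept(cached.get(key));
            return;
        }
        List<Consumer<V>> queue = waiting.get(key);
        if (queue != null) {                     // call in flight: join the waiting queue
            queue.add(callback);
            return;
        }
        queue = new ArrayList<>();
        queue.add(callback);
        waiting.put(key, queue);
        loader.load(key, value -> {              // first request triggers the service call
            cached.put(key, value);
            for (Consumer<V> c : waiting.remove(key)) {
                c.accept(value);                 // deliver to every queued callback
            }
        });
    }
}
```

Only the first request for a key reaches the server; callbacks that arrive while the call is in flight are queued, and later requests are served from the cache.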

For an already-fully-built, very sophisticated caching engine for CRUD operations, consider Smart GWT. This example demonstrates the ability to do client-side operations adaptively (when the cache allows it) while still supporting paging for large datasets:
http://www.smartclient.com/smartgwt/showcase/#grid_adaptive_filter_featured_category
This behavior is exposed via the ResultSet class if you need to put your own widgets on top of it:
http://www.smartclient.com/smartgwtee/javadoc/com/smartgwt/client/data/ResultSet.html

There are two levels of caching:
Caching during one browser session.
Caching across browser sessions, i.e. the cached data should still be available after the browser is restarted.
What to cache: depending on your application, you may want to cache
Protected data for a particular user
Public static (or semi-static, i.e. rarely changing) data
How to cache:
For the first caching level, you can use GWT code as suggested in the other answers, or write your own.
For the second one, you must use the browser's caching features. The standard approach is to put your data inside HTML (either static HTML files or pages generated dynamically by a JSP/servlet, for example). Your application then uses the techniques described at http://code.google.com/webtoolkit/doc/latest/DevGuideCodingBasicsOverlay.html to read the data.

I thought Itemscript was kind of neat. It's a RESTful JSON database that works on both the client (GWT) and server.
Check it out!
-JP

What architecture? Distribute content building across a cluster

I am building a content-serving application consisting of a cluster of two types of node, ContentServers and ContentBuilders.
The idea is to always serve fresh content. Content is fresh if it was built recently, i.e. now - Content.buildTime < MAX_AGE.
Requirements:
*ContentServers will only have to look up content and serve it (e.g. from a distributed cache or similar), with no waiting for anything to be built except on the first request for each item of Content.
*ContentBuilders should be load balanced, should rebuild Content just before it expires, should only build content that is actually being requested. The built content should be quickly retrievable by all ContentServers
What architecture should I use? I'm currently thinking a distributed cache (EhCache maybe) to hold the built content and a messaging queue (JMS/ActiveMQ maybe) to relay the Content requests to builders though I would consider any other options/suggestions. How can I be sure that the ContentBuilders will not build the same thing at the same time and will only build content when it nears expiry?
Thanks.

Honestly I would rethink your approach and I'll tell you why.
I've done a lot of work on distributed high-volume systems (financial transactions specifically). With your solution, if the volume is sufficiently high (and I'll assume it is, or you wouldn't be contemplating a clustered solution; you can get an awful lot of power out of one off-the-shelf box these days), you will kill yourself with remote calls (i.e. calls for data from another node).
I will speak about Tangosol/Oracle Coherence here because it's what I've got the most experience with, although Terracotta will support some or most of these features and is free.
In Coherence terms what you have is a partitioned cache where, if you have n nodes, each node holds 1/n of the total data. Typically you have at least one level of redundancy, and that redundancy is spread as evenly as possible, so each of the other n-1 nodes holds 1/(n-1) of a given node's backup data.
The idea in such a solution is to make sure as many of the cache hits as possible are local (to the same cluster node). Also, with partitioned caches in particular, writes are relatively expensive (and get more expensive the more backup nodes you have for each cache entry), although write-behind caching can minimize this, and reads are fairly cheap (which is what you want given your requirements).
So your solution is going to ensure that every cache hit will be to a remote node.
Also consider that generating content is undoubtedly much more expensive than serving it, which I'll assume is why you came up with this idea because then you can have more content generators than servers. It's the more tiered approach and one I'd characterize as horizontal slicing.
You will achieve much better scalability if you can vertically slice your application. By that I mean that each node is responsible for storing, generating and serving a subset of all the content. This effectively eliminates internode communication (excluding backups) and allows you to adjust the solution by simply giving each node a different sized subset of the content.
Ideally, whatever scheme you choose for partitioning your data should be reproducible by your Web server so it knows exactly which node to hit for the relevant data.
Now you might have other reasons for doing it the way you're proposing but I can only answer this in the context of available information.
I'll also point you to a summary of grid/cluster technologies for Java I wrote in response to another question.

You may want to try Hazelcast. It is open source, peer2peer, distributed/partitioned map and queue with eviction support. Import one single jar, you are good to go! Super simple.

If the content building can be parallelized (builder 1 does 1..1000, builder 2 does 1001..2000) then you could create a configuration file to pass this information. A ContentBuilder will be responsible for monitoring its area for expiration.
If this is not possible, then you need some sort of manager to orchestrate the content building. This manager can also play the role of the load balancer. The manager can be bundled together with a ContentBuilder or be a node of its own.
I think that the ideas of the distributed cache and the JMS messaging are good ones.

It sounds like you need some form of distributed cache, distributed locking and messaging.
Terracotta gives you all three - a distributed cache, distributed locking and messaging, and your programming model is just Java (no JMS required).
I wrote a blog about how to ensure that a cache only ever populates its contents once and only once here: What is a memoizer and why you should care about it.
I am in agreement with Cletus: if you need high performance you will need to consider partitioning. However, unlike most solutions, Terracotta will work just fine without partitioning until you need it, and when you do apply partitioning it will just divvy up the work according to your partitioning algorithm.
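The memoizer idea mentioned above can be sketched in plain Java as follows. This is not the code from the blog post: it is a single-JVM illustration with a hypothetical `Memoizer` class, and making the map cluster-wide (for example through Terracotta) is outside what this sketch shows.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class Memoizer<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> builder;

    Memoizer(Function<K, V> builder) { this.builder = builder; }

    V get(K key) {
        CompletableFuture<V> fresh = new CompletableFuture<>();
        // Only one caller per key wins the race to register its future.
        CompletableFuture<V> existing = cache.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing.join();          // someone else builds: wait for their result
        }
        try {
            V value = builder.apply(key);    // this thread builds the content exactly once
            fresh.complete(value);
            return value;
        } catch (RuntimeException e) {
            cache.remove(key, fresh);        // allow a later retry after a failed build
            fresh.completeExceptionally(e);
            throw e;
        }
    }
}
```

Because the future is registered before the build starts, two callers asking for the same key at the same time never build it twice; the second one blocks until the first has finished.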

Strategy for Offline/Online data synchronization

My requirement is that I have a server J2EE web application and a client J2EE web application. Sometimes the client can go offline. When the client comes back online, it should be able to synchronize changes in both directions. I should also be able to control which rows/tables need to be synchronized based on some filters/rules. Are there any existing Java frameworks for doing this? If I need to implement it on my own, what strategies can you suggest?
One solution in my mind is maintaining sql logs and executing same statements at other side during synchronization. Do you see any problems with this strategy?

There are a number of Java libraries for data synchronizing/replication. Two that I'm aware of are daffodil and SymmetricDS. In a previous life I foolishly implemented (in Java) my own data replication process. It seems like the sort of thing that should be fairly straightforward, but if the data can be updated in multiple places simultaneously, it's hellishly complicated. I strongly recommend you use one of the aforementioned projects to try and bypass dealing with this complexity yourself.

The biggest issue with synchronization is when the user edits something offline and it is edited online at the same time. You need to merge the two changed pieces of data, or provide UI that lets the user say which version is correct. If you eliminate the possibility of both being edited at the same time, then you don't have to solve this sticky problem.
The method is usually to add a field 'modified' to all tables, and compare the client's modified field for a given record in a given row, against the server's modified date. If they don't match, then you replace the server's data.
Be careful with autogenerated keys - you need to make sure your data integrity is maintained when you copy from the client to the server. Strictly running the SQL statements again on the server could put you in a situation where the autogenerated key has changed, and suddenly your foreign keys are pointing to different records than you intended.
Often when importing data from another source, you keep track of the primary key from the foreign source as well as your own personal primary key. This makes determining the changes and differences between the data sets easier for difficult synchronization situations.
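The 'modified' field comparison described above might be sketched like this. The `Row` and `ModifiedSync` classes are hypothetical, the tables are modeled as in-memory maps, and the sketch deliberately ignores the hard parts mentioned in this answer: concurrent edits on both sides and remapping of autogenerated keys.

```java
import java.util.Map;

final class Row {
    final long id;
    final String data;
    final long modified;   // timestamp of the last change

    Row(long id, String data, long modified) {
        this.id = id;
        this.data = data;
        this.modified = modified;
    }
}

final class ModifiedSync {
    /**
     * Copies every client row whose 'modified' value differs from the
     * server's copy (or that the server lacks) and returns how many rows
     * were replaced.
     */
    static int push(Map<Long, Row> client, Map<Long, Row> server) {
        int replaced = 0;
        for (Row c : client.values()) {
            Row s = server.get(c.id);
            if (s == null || s.modified != c.modified) {
                server.put(c.id, c);   // the client's version wins
                replaced++;
            }
        }
        return replaced;
    }
}
```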

Your synchronizer needs to identify when data can just be updated and when a human being needs to mediate a potential conflict. I have written a paper that explains how to do this using logging and algebraic laws.

What is best suited as the client-side data store in your application? You can choose an embedded database like SQLite, a message queue, some object store, or (if none of these can be used because it is a web application) files/documents saved on the client using HTML5 storage APIs such as Web SQL Database, IndexedDB or localStorage.
Check the paper Gold Rush: Mobile Transaction Middleware with Java-Object Replication. Microsoft's documentation of occasionally connected systems describes two approaches: service-oriented (or message-oriented) and data-oriented. Gold Rush takes the former approach. The latter approach uses database merge replication.
