Monitoring Changes of a Bean to build deltas? - java

I have several beans in my application which get updated regularly through the usual setter methods. I want to synchronize these beans with a remote application which has the same bean classes. In my case, bandwidth matters, so I have to keep the amount of transferred bytes as low as possible. My idea was to create deltas of the state changes and transfer them instead of the whole objects. Currently I plan to write the protocol to transfer those changes myself, but I am not bound to that and would prefer an existing solution.
Is there already a solution for this problem out there? And if not, how could I easily monitor those state changes in a generalized way? AOP?
Edit: This problem is not caching related, even though it may seem so at first. The data must be replicated from a central server to several clients (about 4 to 10) over the internet. The client is a standalone desktop application.

This sounds remarkably similar to JBossCache running in POJO mode.
This is a distributed, delta-based cache that breaks Java objects down into a tree structure and only transmits changes to the bits of the tree that change.
Should be a perfect fit for you.

I like your idea of creating deltas and sending them.
A simple Map could hold the delta for one object; serializing it would give you the actual message to send.
To reduce the number of messages, which would otherwise kill your performance, you should group the deltas for all objects and send them as a whole, so you could have another collection or map to contain them.
To monitor all changes to many beans, AOP seems like a good solution.
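If you control the bean classes, a lower-tech alternative to AOP is the standard java.beans property-change mechanism. Here is a minimal sketch of collecting the changed properties of one bean into a delta map that can then be serialized and sent; the class and property names are made up for illustration:

    import java.beans.PropertyChangeEvent;
    import java.beans.PropertyChangeListener;
    import java.beans.PropertyChangeSupport;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical bean: every setter fires a change event.
    class UserBean {
        private final PropertyChangeSupport pcs = new PropertyChangeSupport(this);
        private String name;

        public void addPropertyChangeListener(PropertyChangeListener l) {
            pcs.addPropertyChangeListener(l);
        }

        public void setName(String name) {
            String old = this.name;
            this.name = name;
            pcs.firePropertyChange("name", old, name); // no event if the value did not change
        }
    }

    // Collects the latest value of each changed property; serialize the drained
    // map as the delta message, then start collecting again.
    class DeltaCollector implements PropertyChangeListener {
        private final Map<String, Object> delta = new HashMap<>();

        @Override
        public synchronized void propertyChange(PropertyChangeEvent evt) {
            delta.put(evt.getPropertyName(), evt.getNewValue());
        }

        public synchronized Map<String, Object> drain() {
            Map<String, Object> snapshot = new HashMap<>(delta);
            delta.clear();
            return snapshot;
        }
    }

Grouping the deltas of several beans is then just a map from a bean id to such a per-bean delta map.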
EDIT : see Skaffmann's answer.
Using an existing cache technology could be better.
Many problems could already have solutions implemented...

Related

Server design suggestions to support multiple workflows managed by multiple teams

Looking for input on a design problem. We are redesigning our existing server, which has served us well so far but won't scale in the future.
Current design: It is one server (we run multiple instances of the same server) which handles many workflows. Let's call these workflows A, B, C, D. Until now, we have had one development team working on this server, which made handling releases easy. Performance is also decent because we leverage in-memory caching.
Future design: We now have multiple teams (each team handling one workflow: Team A handles workflow A, Team B handles workflow B, and so on). With this new team structure and the current design, we are unable to scale our releases (since it's one server, only one team can release at any given time, which reduces overall team efficiency). There is a need for isolation so teams can release their changes to workflows independently of each other. Also, we expect more workflows to be on-boarded onto this server.
Any design ideas on how we can solve this problem of ever-increasing workflows?
My current solution: Split the server into four servers so each team can manage its workflow individually. The disadvantage of this approach is code management: most of these workflows share a common code base. Also, splitting the server causes us to lose out on the cache (which is not an issue with the current design).
Look forward to hearing your suggestions.
Splitting into different workflow servers, one per team, makes complete sense. Some advantages that come to mind are:
Independent releases like you mentioned.
Crashes, memory leaks, or resource hogging in workflow A won't affect workflow B.
Each workflow server can be scaled independently. A popular workflow A could be scaled out to more servers, while a rarely used workflow B could run on just one server.
There could be more, just pointing out the obvious ones supporting the split.
How to handle the disadvantages: let us try to understand with the example of a library management system. Let us say we need workflows for a member borrowing a book, a member returning a book, and registering a new member.
Most of these workflow share common code base
To resolve this we identify the core common part; in my example I will take the definitions of Book (id, name, field) and Member (id, name, email). Besides the definitions, I can also have common functions that work on them, like serialisers, parsers, and validators.
Now my workflows will depend on this common repo. The borrow-a-book workflow will be completely different from the add-a-member workflow, but they will use the same building blocks (see the sketch below).
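As a very rough illustration of that common repo (all names here are invented for the example, and each type would live in its own file), the shared module could contain nothing more than the domain types and the functions every workflow reuses:

    // Shared "common" module; workflow servers depend on this artifact,
    // never on each other.
    public final class Book {
        public final String id;
        public final String name;
        public Book(String id, String name) { this.id = id; this.name = name; }
    }

    public final class Member {
        public final String id;
        public final String name;
        public final String email;
        public Member(String id, String name, String email) {
            this.id = id; this.name = name; this.email = email;
        }
    }

    // Example of a common function: a validator both the borrow-a-book
    // and add-a-member workflows can reuse.
    public final class MemberValidator {
        public static boolean isValid(Member m) {
            return m.id != null && m.email != null && m.email.contains("@");
        }
    }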
the server causes us to lose out on cache
Exactly what needs to be cached, and how the cache behaves, is very important.
A fairly static cache (say, a member cache) can be set up on a distributed cache like Redis. Say there is a workflow which identifies the close deadlines for borrowed books and sends reminder emails to those members. Once the member ids are identified, their emails can be looked up in the Redis cache.
A workflow can have a personalised cache as well. For example, when searching for books in the library by name, the result can be cached in memory in the workflow server only, with a TTL, and can be served if the same query arrives in the near future.
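A minimal sketch of such a per-workflow cache (the TTL and names are illustrative; in practice a library like Guava or Caffeine would do this better):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // In-memory search-result cache local to one workflow server, with a fixed TTL.
    public class SearchResultCache<V> {
        private static final long TTL_MILLIS = 60_000; // assumed 1-minute freshness

        private static final class Entry<T> {
            final T value;
            final long expiresAt;
            Entry(T value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
        }

        private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();

        public void put(String query, V result) {
            entries.put(query, new Entry<>(result, System.currentTimeMillis() + TTL_MILLIS));
        }

        public V get(String query) {
            Entry<V> e = entries.get(query);
            if (e == null || e.expiresAt < System.currentTimeMillis()) {
                entries.remove(query);
                return null; // caller falls back to the real search
            }
            return e.value;
        }
    }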
To conclude, the disadvantages you have are nothing but design challenges. I hope that with this random example I was able to give you a few points to ponder. Depending on your actual use case, my answer might be completely irrelevant. If so, sincere apologies. :)

Distributed Metrics

I have been working on a single-box application that uses Codahale Metrics heavily for instrumentation. Right now we are moving to the cloud, and I have the questions below on how I can monitor metrics when the application is distributed.
Is there a metrics reporter that can write metrics data to Cassandra?
When and how does the aggregation happen if there are records per server in the database?
Can I define the time interval at which the metrics data gets saved into the database?
Are there any inbuilt frameworks that are available to achieve this?
Thanks a bunch and appreciate all your help.
I am answering your questions first, but I think you are misunderstanding how to use Metrics.
You can google this fairly easily. I don't know of any (I also don't understand what you would do with it in Cassandra). You would normally use something like Graphite for that. In any case, a reporter implementation is very straightforward and easy to write.
That question does not make much sense. Why would you aggregate over two different servers? They are independent. Each of your monitored instances should be standalone. Aggregation happens on the receiving side (e.g. Graphite).
You can; see 1. Write a reporter and configure it accordingly.
Not that I know of.
Now to metrics in general:
I think you have the wrong idea. You can monitor X servers, that is not a problem at all, but you should not aggregate on the client side (or the database side). How would that even work? Restarts zero the clients, which essentially means you would need to track the state of each of your servers for the aggregation to work at all. And how do you manage outages?
The way you should monitor your servers with metrics:
create a namespace
io.my.server.{hostname}.my.metric
Now you have X different namespaces, but they all share a common prefix. That means you have grouped them.
Send them to your preferred monitoring solution.
There are heaps out there. I do not understand why you want this to be Cassandra; what kind of advantage do you gain from that? http://graphite.wikidot.com/ for example is a graphing solution. Your applications can automatically submit data there (there is a Java Graphite reporter you can use). See http://graphite.wikidot.com/screen-shots for how it looks.
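For example, with the Codahale/Dropwizard Metrics library plus its metrics-graphite module, the per-host namespace and the reporting interval from your question 3 look roughly like this (the Graphite host and port are assumptions):

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    public class MetricsBootstrap {
        public static void main(String[] args) throws Exception {
            MetricRegistry registry = new MetricRegistry();

            // Every metric is prefixed with the host name, so each server reports
            // into its own namespace, e.g. io.my.server.host1.my.metric.request
            String hostname = InetAddress.getLocalHost().getHostName();

            Graphite graphite = new Graphite(
                    new InetSocketAddress("graphite.example.com", 2003)); // assumed endpoint
            GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                    .prefixedWith("io.my.server." + hostname)
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build(graphite);
            reporter.start(1, TimeUnit.MINUTES); // how often data is pushed out

            registry.meter("my.metric.request").mark(); // example metric
        }
    }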
The main point is that Graphite (and all or most providers) knows how to handle your namespaces. Also look at Zabbix, for example, which can do the same thing.
Aggregations
Now the aggregation happens on the receiving side. Your provider knows how to do that, and you can define rules.
For example, you could wildcard alerts like:
io.my.server.{hostname}.my.metric.count > X
Graphite (I believe) even supports operations, e.g.:
sum(io.my.server.{hostname}.my.metric.request), which would sum up ALL your hosts' requests.
That is where the aggregation happens. At that point, your servers are again standalone (as they should be) and have no dependency on each other or on any monitoring database. They simply report their own metrics (which is what they should do), and you, as the consumer of those metrics, are responsible for defining the right alerts/aggregations/formulas on the receiving end.
Aggregating this on the server side would involve:
Discover all other servers
Monitor their state
Receive/send metrics back and forth
Synchronise what they report etc
That just sounds like a nightmare for maintenance :) I hope that gives you some insight/ideas.
(Disclaimer: I am neither a Metrics dev nor a Graphite dev; this is just how I have done this in the past and the approach I still use.)
Edit:
With your comment in mind, here are my two fave solutions on what you want to achieve:
DB
You can use the DB and store timestamps, e.g. for the start message and the end message.
This is not really a metrics thing, so maybe not preferred. As per your question you could write your own reporter for it, but it would get complicated with regard to upserts/updates etc. I think option 2 is easier and has more potential.
Logs
This is, I think, what you need. Your servers independently log Start/Stop/Pause etc., whatever it is you want to report on. You then set up Logstash and collect those logs.
Logstash allows you to track these events over time and create metrics from them, see:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html
Or:
https://github.com/logstash-plugins/logstash-filter-elapsed
The first one uses actual metrics. The second one is a different plugin that just measures times between start/stop events.
This is the option with the most potential because it does not rely on any particular format, data store, or anything else. You even get Kibana for plotting out of the box if you use the entire ELK stack.
Say you wanted to measure your messages. You can just look at the logs; there are no application changes involved. The solution does not even touch your application (storing your reporting data manually takes up threads and processing in your application, so if you need to stay real-time this will drag your overall performance down); it is a completely standalone solution. Later on, when you want to measure other metrics, you can easily add to your Logstash configuration and start collecting them.
I hope this helps

Sharing a java object across a cluster

My requirement is to share a java object across a cluster.
I am confused about whether to write an EJB and share the Java objects across the cluster, or to use a third party such as Infinispan, memcached or Terracotta, or what about JCache?
with the constraints that:
I can't change any of my source code to be specific to any application server (such as implementing WebLogic's singleton services).
I can't offer two builds for cluster and non-cluster environments.
Performance should not be downgraded.
I am looking only for an open-source third party if I need to use one.
It needs to work on WebLogic, WebSphere, JBoss, and Tomcat too.
Can anyone come up with the best option with these constraints in mind?
It can depend on the use case of the objects you want to share in the cluster.
I think it comes down to the following options, from most complex to least complex:
Distributed caching
http://www.ehcache.org
Distributed caching is good if you need to ensure that an object is accessible from a cache on every node. I have used Ehcache to distribute quite successfully; there is no need to set up a Terracotta server unless you need the scale, you can just point instances at each other via RMI. It also works synchronously or asynchronously depending on requirements. Cache replication is handy too: if nodes go down, the cache is redundant and you don't lose anything. Good if you need to make sure that the object has been updated across all the nodes.
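A rough sketch of the calling side with Ehcache 2.x; whether the cache is replicated over RMI is decided purely by the peer-provider and event-listener settings in ehcache.xml, not by this code, and the cache name is made up:

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class SharedCacheExample {
        public static void main(String[] args) {
            // Picks up ehcache.xml from the classpath, including any RMI replication config.
            CacheManager manager = CacheManager.getInstance();
            Cache shared = manager.getCache("sharedObjects");

            shared.put(new Element("user:42", "some serializable value"));

            Element hit = shared.get("user:42");
            Object value = (hit != null) ? hit.getObjectValue() : null;
            System.out.println(value);
        }
    }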
Clustered Execution/data distribution
http://www.hazelcast.com/
Hazelcast is also a nice option, as it provides a way of executing Java classes across a cluster. This is more useful if you have an object that represents a unit of work that needs to be performed and you don't care so much where it gets executed.
It is also useful for distributed collections, i.e. a distributed map or queue.
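A minimal Hazelcast sketch of the distributed collections (map and queue names are illustrative; every node that runs this joins the same cluster):

    import java.util.Map;
    import java.util.Queue;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class ClusterSharedState {
        public static void main(String[] args) {
            // Starts an embedded member; peers discover each other (multicast by default).
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // Distributed map: entries are partitioned across all members.
            Map<String, String> shared = hz.getMap("shared-objects");
            shared.put("config:featureX", "enabled");

            // Distributed queue: units of work that any member may take and execute.
            Queue<String> work = hz.getQueue("work-items");
            work.offer("rebuild-report-42");
        }
    }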
Roll your own RMI/JGroups
You can write your own client/server, but I think you will start to run into the issues that the bigger frameworks solve once the requirements of the objects you're dealing with start to get complex. Realistically, Hazelcast is really simple and should eliminate the need to roll your own.
It's not open source, but Oracle Coherence would easily solve this problem.
If you need an implementation of JCache, the only one that I'm aware of being available today is Oracle Coherence; see: http://docs.oracle.com/middleware/1213/coherence/develop-applications/jcache_part.htm
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
It is just an idea; you might want to check the exact implementation.
It will downgrade performance but I don't see how it is possible to avoid it.
It is not an easy one to implement; maybe you should consider load balancing instead of clustering.
You might consider RMI and/or a dynamic proxy:
Extract an interface for your objects.
Use RMI to access the real object (from all cluster nodes, even the one that actually holds the object).
In order to create RMI access for existing code you might use a dynamic proxy (again, not sure about the implementation).
*A dynamic proxy can wrap any object and perform some pre- and post-processing on each method invocation; in this case it might use the original object for the RMI invocation (see the sketch after this list).
You will need connectivity between the cluster nodes in order to propagate the RMI object.
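A sketch of the dynamic-proxy part (the interface and handler names are invented; the handler is where you would hook in logging or delegate to an RMI stub instead of the local object):

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;

    // Hypothetical interface extracted from the shared object.
    interface AccountService {
        int getBalance(String accountId);
    }

    // Wraps the real (possibly remote) implementation and adds behaviour
    // before and after every call.
    class RemotingHandler implements InvocationHandler {
        private final Object target; // in a real setup this could be an RMI stub

        RemotingHandler(Object target) { this.target = target; }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            System.out.println("before " + method.getName());
            Object result = method.invoke(target, args);
            System.out.println("after " + method.getName());
            return result;
        }
    }

    public class ProxyDemo {
        public static void main(String[] args) {
            AccountService real = id -> 100; // stand-in for the real implementation
            AccountService proxy = (AccountService) Proxy.newProxyInstance(
                    AccountService.class.getClassLoader(),
                    new Class<?>[] { AccountService.class },
                    new RemotingHandler(real));
            System.out.println(proxy.getBalance("42"));
        }
    }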

Is it possible to save state between requests in GAE/java

I plan to implement a GAE app only for my own usage.
The application will get its data using URL Fetch service, updating it every x minutes (using Scheduled tasks). Then it will serve that information to me when I request it.
I have barely started to look into GAE, but I have one main question that I am not able to resolve. Can state be maintained in GAE between different requests without using JDO/JPA and the datastore?
As I am the only user, I guess I could keep the info in a servlet subclass and so avoid having to deal with the Datastore... but my concern is that, as this app will receive very few requests, if the instance is moved to disk or whatever (I don't know yet if it has some specific name), will it lose its state?
I am not concerned about having to restart the whole app and start collecting data from scratch from time to time, that is ok.
If this is an app for your own use, and you're double-extra sure that you won't be making it multi-user, and you're not concerned about the possibility that you might be using it from two browsers at once, you can skip using sessions and use a known key for storing information in memcache.
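A small sketch with the low-level App Engine memcache API; the key name and expiry are illustrative, and cached values must be serializable:

    import com.google.appengine.api.memcache.Expiration;
    import com.google.appengine.api.memcache.MemcacheService;
    import com.google.appengine.api.memcache.MemcacheServiceFactory;

    // Helper used from the servlet.
    public class FeedCache {
        private static final String KEY = "latest-feed";
        private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

        public void store(String fetchedData) {
            cache.put(KEY, fetchedData, Expiration.byDeltaSeconds(15 * 60));
        }

        public String load() {
            // Returns null if the entry expired or memcache evicted it;
            // the caller then refetches via URL Fetch.
            return (String) cache.get(KEY);
        }
    }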
If your reason for avoiding the datastore is concern over performance, then I strongly recommend testing that assumption. You may be pleasantly surprised.
You could use the http session to maintain state between requests, but that will use the datastore itself (although you won't have to write any code to get this behaviour).
You might also consider using the Cache API (i.e. memcache). It's JSR 107, I think, which Google provides an implementation of. The cache is shared between instances, but it can be emptied at any time. If you're happy with that behaviour, this may be an option. Looking at your requirements, this may be the most feasible option if you don't want to write your own persistence code.
You could store data in a static field on your class or in an application-scoped object, but if you do that, then when your instance spins down or a request switches to another instance, the data is lost, because your classes have to be loaded into the new instance.
Or you could serialize the state to the client and send it back in with each request.
The most robust option is persistence to the datastore - the JPA code is trivial. Perhaps you should reconsider?

What architecture? Distribute content building across a cluster

I am building a content-serving application composed of a cluster with two types of node: ContentServers and ContentBuilders.
The idea is to always serve fresh content. Content is fresh if it was built recently, i.e. Content.buildTime < MAX_AGE.
Requirements:
*ContentServers will only have to look up content and serve it up (e.g. from a distributed cache or similar); no waiting for anything to be built except on the first request for each item of Content.
*ContentBuilders should be load balanced, should rebuild Content just before it expires, and should only build content that is actually being requested. The built content should be quickly retrievable by all ContentServers.
What architecture should I use? I'm currently thinking a distributed cache (EhCache maybe) to hold the built content and a messaging queue (JMS/ActiveMQ maybe) to relay the Content requests to builders though I would consider any other options/suggestions. How can I be sure that the ContentBuilders will not build the same thing at the same time and will only build content when it nears expiry?
Thanks.
Honestly I would rethink your approach and I'll tell you why.
I've done a lot of work on distributed high-volume systems (financial transactions specifically), and if the volume is sufficiently high (I'll assume it is, or you wouldn't be contemplating a clustered solution; you can get an awful lot of power out of one off-the-shelf box these days), then your solution will kill you with remote calls (i.e. calls for data from another node).
I will speak about Tangosol/Oracle Coherence here because it's what I've got the most experience with, although Terracotta will support some or most of these features and is free.
In Coherence terms, what you have is a partitioned cache where, if you have n nodes, each node holds 1/n of the total data. Typically you have redundancy of at least one level, and that redundancy is spread as evenly as possible, so each of the other n-1 nodes holds 1/(n-1) of the backups.
The idea in such a solution is to try to make sure as many of the cache hits as possible are local (to the same cluster node). Also, with partitioned caches in particular, writes are relatively expensive (and get more expensive the more backup nodes you have for each cache entry), although write-behind caching can minimize this, and reads are fairly cheap, which is what you want given your requirements.
So your solution is going to ensure that every cache hit will be to a remote node.
Also consider that generating content is undoubtedly much more expensive than serving it, which I'll assume is why you came up with this idea, because then you can have more content generators than servers. It's the more tiered approach, and one I'd characterize as horizontal slicing.
You will achieve much better scalability if you can vertically slice your application. By that I mean that each node is responsible for storing, generating and serving a subset of all the content. This effectively eliminates internode communication (excluding backups) and allows you to adjust the solution by simply giving each node a different sized subset of the content.
Ideally, whatever scheme you choose for partitioning your data should be reproducible by your Web server so it knows exactly which node to hit for the relevant data.
Now you might have other reasons for doing it the way you're proposing but I can only answer this in the context of available information.
I'll also point you to a summary of grid/cluster technologies for Java I wrote in response to another question.
You may want to try Hazelcast. It is an open source, peer-to-peer, distributed/partitioned map and queue with eviction support. Import one single jar and you are good to go! Super simple.
If the content building can be parallelized (builder 1 does 1..1000, builder 2 does 1001..2000) then you could create a configuration file to pass this information. A ContentBuilder will be responsible for monitoring its area for expiration.
If this is not possible, then you need some sort of manager to orchestrate the content building. This manager can also play the role of the load balancer. The manager can be bundled together with a ContentBuilder or be a node of its own.
I think that the ideas of the distributed cache and the JMS messaging are good ones.
It sounds like you need some form of distributed cache, distributed locking and messaging.
Terracotta gives you all three - a distributed cache, distributed locking and messaging, and your programming model is just Java (no JMS required).
I wrote a blog about how to ensure that a cache only ever populates its contents once and only once here: What is a memoizer and why you should care about it.
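The core of that populate-once pattern, in its single-JVM form (following the Memoizer from Java Concurrency in Practice), looks roughly like this; across the cluster, a clustered map plus distributed locking plays the same role:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.FutureTask;

    // A FutureTask is registered atomically, so only the first caller for a key
    // runs the expensive build; concurrent callers block on the same future
    // instead of rebuilding the content.
    public class Memoizer<K, V> {
        private final ConcurrentMap<K, FutureTask<V>> cache = new ConcurrentHashMap<>();

        public V get(K key, Callable<V> builder) throws ExecutionException, InterruptedException {
            FutureTask<V> task = cache.get(key);
            if (task == null) {
                FutureTask<V> newTask = new FutureTask<>(builder);
                task = cache.putIfAbsent(key, newTask);
                if (task == null) { // we won the race: run the build here
                    task = newTask;
                    task.run();
                }
            }
            return task.get();
            // (a production version would also remove the entry if the build fails)
        }
    }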
I am in agreement with Cletus: if you need high performance you will need to consider partitioning. However, unlike most solutions, Terracotta will work just fine without partitioning until you need it, and when you then apply partitioning it will just divvy up the work according to your partitioning algorithm.
