What architecture? Distribute content building across a cluster

What architecture? Distribute content building across a cluster - java

I am building an content serving application composing of a cluster of two types of node, ContentServers and ContentBuilders.
The idea is to always serve fresh content. Content is fresh if it was built recently, i.e. Content.buildTime < MAX_AGE.
Requirements:
*ContentServers will only have to lookup content and serve it up (e.g. from a distributed cache or similar), no waiting for anything to be built except on first request for each item of Content.
*ContentBuilders should be load balanced, should rebuild Content just before it expires, should only build content that is actually being requested. The built content should be quickly retrievable by all ContentServers
What architecture should I use? I'm currently thinking a distributed cache (EhCache maybe) to hold the built content and a messaging queue (JMS/ActiveMQ maybe) to relay the Content requests to builders though I would consider any other options/suggestions. How can I be sure that the ContentBuilders will not build the same thing at the same time and will only build content when it nears expiry?
Thanks.

Honestly I would rethink your approach and I'll tell you why.
I've done a lot of work on distributed high-volume systems (financial transactions specifically) and your solution--if the volume is sufficiently high (and I'll assume it is or you wouldn't be contemplating a clustered solution; you can get an awful lot of power out of one off-the-shelf box these days)--then you will kill yourself with remote calls (ie calls for data from another node).
I will speak about Tangosol/Oracle Coherence here because it's what I've got the most experience with, although Terracotta will support some or most of these features and is free.
In Coherence terms what you have is a partitioned cache where if you have n nodes, each node possesses 1/n of the total data. Typically you have redundancy of at least one level and that redundancy is spread as evenly as possible so each of the other n-1 nodes possesses 1/n-1 of the backup nodes.
The idea in such a solution is to try and make sure as many of the cache hits as possible are local (to the same cluster node). Also with partitioned caches in particular, writes are relatively espensive (and get more expensive with the more backup nodes you have for each cache entry)--although write-behind caching can minimize this--and reads are fairly cheap (which is what you want out of your requirements).
So your solution is going to ensure that every cache hit will be to a remote node.
Also consider that generating content is undoubtedly much more expensive than serving it, which I'll assume is why you came up with this idea because then you can have more content generators than servers. It's the more tiered approach and one I'd characterize as horizontal slicing.
You will achieve much better scalability if you can vertically slice your application. By that I mean that each node is responsible for storing, generating and serving a subset of all the content. This effectively eliminates internode communication (excluding backups) and allows you to adjust the solution by simply giving each node a different sized subset of the content.
Ideally, whatever scheme you choose for partitioning your data should be reproducible by your Web server so it knows exactly which node to hit for the relevant data.
Now you might have other reasons for doing it the way you're proposing but I can only answer this in the context of available information.
I'll also point you to a summary of grid/cluster technologies for Java I wrote in response to another question.

You may want to try Hazelcast. It is open source, peer2peer, distributed/partitioned map and queue with eviction support. Import one single jar, you are good to go! Super simple.

If the content building can be parallelized (builder 1 does 1..1000, builder 2 does 1001..2000) then you could create a configuration file to pass this information. A ContentBuilder will be responsible for monitoring its area for expiration.
If this is not possible, then you need some sort of manager to orchestrate the content building. This manager can also play the role of the load balancer.The manager can be bundled together with a ContentBuilder or be a node of it's own.
I think that the ideas of the distributed cache and the JMS messaging are good ones.

It sounds like you need some form of distributed cache, distributed locking and messaging.
Terracotta gives you all three - a distributed cache, distributed locking and messaging, and your programming model is just Java (no JMS required).
I wrote a blog about how to ensure that a cache only ever populates its contents once and only once here: What is a memoizer and why you should care about it.
I am in agreement with Cletus - if you need high performance you will need to consider partitioning however unlike most solutions, Terracotta will work just fine without partitioning until you need it, and then when you apply partitioning it will just divy up the work according to your partitioning algorithm.

Related

Solr as primary database [duplicate]

My team is working with a third party CMS that uses Solr as a search index. I've noticed that it seems like the authors are using Solr as a database of sorts in that each document returned contains two fields:
The Solr document ID (basically a classname and database id)
An XML representation of the entire object
So basically it runs a search against Solr, download the XML representation of the object, and then instantiate the object from the XML rather than looking it up in the database using the id.
My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.
Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?
EDIT: When I say "XML representation" - I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.

Yes, you can use SOLR as a database but there are some really serious caveats :
SOLR's most common access pattern, which is over http doesnt respond particularly well to batch querying. Furthermore, SOLR does NOT stream data --- so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when you design large scale data access patterns with SOLR.
Although SOLR performance scales horizontally (more machines, more cores, etc..) as well as vertically (more RAM, better machines, etc), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.
Developers who are used to using relational databases will often run into problems when they use the same DAO design patterns in a SOLR paradigm, because of the way SOLR uses filters in queries. There will be a learning curve for developing the right approach to building an application that uses SOLR for part of its large queries or statefull modifications.
The "enterprisy" tools that allow for advanced session management and statefull entities that many advanced web-frameworks (Ruby, Hibernate, ...) offer will have to be thrown completely out the window.
Relational databases are meant to deal with complex data and relationships - and they are thus accompanied by state of the art metrics and automated analysis tools. In SOLR, I've found myself writing such tools and manually stress-testing alot, which can be a time sink.
Joining : this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In SOLR, there aren't any robust methods for joining data across indices.
Resiliency : For high availability, SolrCloud uses a distributed file system underneath (i.e. HCFS). This model is quite different then that of a relational database, which usually does resiliency using slaves and masters, or RAID, and so on. So you have to be ready to provide the resiliency infrastructure SOLR requires if you want it to be cloud scalable and resistent.
That said - there are plenty of obvious advantages to SOLR for certain tasks : (see http://wiki.apache.org/solr/WhyUseSolr) -- loose queries are much easier to run and return meaningful results. Indexing is done as a matter of default, so most arbitrary queries run pretty effectively (unlike a RDBMS, where you often have to optimize and de-normalize after the fact).
Conclusion: Even though you CAN use SOLR as an RDBMS, you may find (as I have) that there is ultimately "no free lunch" - and the cost savings of super-cool lucene text-searches and high-performance, in-memory indexing, are often paid for by less flexibility and adoption of new data access workflows.

It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving this using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as default response format is largely irrelevant, you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.

This was probably done for performance reasons, if it doesn't cause any problems I would leave it alone. There is a big grey area of what should be in a traditional database vs a solr index. Ive seem people do similar things to this (usually key value pairs or json instead of xml) for UI presentation and only get the real object from the database if needed for updates/deletes. But all reads just go to Solr.

I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.

I had similar idea, in my case to store some simple json data in Solr, using Solr as a database. However, a BIG caveat that changed my mind was the Solr upgrade process.
Please see https://issues.apache.org/jira/browse/LUCENE-9127.
Apparently, there has been in the past (pre v6) the recommendation to re-index documents after major version upgrades (not just use IndexUpdater) although you did not have to do this to maintain functionality (I cannot vouch for this myself, this is from what I have read). Now, after you have upgraded 2 major versions but did not re-index (actually, fully delete docs then the index files themselves) after the first major version upgrade, your core is now not recognized.
Specifically in my case, I started with Solr v6. After upgrade to v7, I ran IndexUpdater so index is now at v7. After upgrade to v8, the core would not load. I had no idea why - my index was at v7, so that satisfies the version-minus-1 compatibility statement from Solr, right? Well, no - wrong.
I did an experiment. I started fresh from v6.6, created a core and added some documents. Upgraded to v7.7.3 and ran IndexUpdater, so index for that core is now at v7.7.3. Upgraded to v8.6.0, after which the core would not load. Then I repeated the same steps, except after running IndexUpdater I also re-indexed the documents. Same problem. Then I again repeated everything, except I did not just re-index, I deleted the docs from the index and deleted the index files and then re-indexed. Now, when I arrived in v8.6.0, my core was there and everything OK.
So, the takeaway for the OP or anyone else contemplating this idea (using Solr as db) is that you must EXPECT and PLAN to re-index your documents/data from time to time, meaning you must store them somewhere else anyway (a previous poster alluded to this idea), which sort of defeats the concept of a database. Unless of course your Solr core/index will be short-lived (not last more than one major version Solr upgrade), you never intend to upgrade Solr more than 1 version, or the Solr devs change this upgrade limitation. So, as an index for data stored elsewhere (and readily available for re-indexing when necessary), Solr is excellent. As a database for the data itself, it strongly "depends".

Adding to #Jayunit100 response, using solar as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between what you write and when you can read it back.

Sharing a java object across a cluster

My requirement is to share a java object across a cluster.
I get Confused
whether to write an EJB and share the java objects across the cluster
or
to use any third party such as infinispan or memecached or terracotta or
what about JCache?
with the constraint that
I can't change any of my source code with specific to any application
server (such as implementing the weblogic's singleton services).
I can't offer two builds for cluster and non cluster environment.
Performance should not be downgraded.
I am looking for only open source third party if I need to use it.
It need to work in weblogic , Websphere , Jbos and Tomcat too.
Can any one come up with the best option with these constraints in mind.

It can depend on the use case of the objects you want to share in the cluster.
I think it comes down to really the following options in most complex to least complex
Distributed cacheing
http://www.ehcache.org
Distributed cacheing is good if you need to ensure that an object is accessible from a cache on every node. I have used ehache to distribute quite successfully, no need to setup a terracotta server unless you need the scale, can just point instances together via rmi. Also works synchronously and asynchronously depending on requirements. Also cache replication is handy if nodes go down so cache is actually redundant and dont lose anything. Good if you need to make sure that the object has been updated across all the nodes.
Clustered Execution/data distribution
http://www.hazelcast.com/
Hazelcast is also a nice option as provides a way of executing java classes across a cluster. This is more useful if you have an object that represents a unit of work that needs to be performed and you dont care so much where it gets executed.
Also useful for distributed collections, i.e. a distributed map or queue
Roll your own RMI/Jgroups
Can write your own client/server but I think you will start to run into issues that the bigger frameworks solve if the requirements of the objects your dealing with starts to get complex. Realistically Hazelcast is really simple and should really eliminate the need to roll your own.

It's not open source, but Oracle Coherence would easily solve this problem.
If you need an implementation of JCache, the only one that I'm aware of being available today is Oracle Coherence; see: http://docs.oracle.com/middleware/1213/coherence/develop-applications/jcache_part.htm
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

It is just an idea. you might want to check the exact implementation.
It will downgrade performance but I don't see how it is possible to avoid it.
It not an easy one to implement. might be you should consider load balance instead of clustering.
you might consider RMI and/or dynamic-proxy.
extract interface of your objects.
use RMI to access the real object (from all clusters even the one that actually holds the object)
in order to create RMI for an existing code you might use dynamic-proxy (again..not sure about implementation)
*dynamic proxy can wrap any object and do some pre and post task on each method invocation. in this case it might use the original object for RMI invocation
you will need connectivity between clusters in order to propogate the RMI object.

Using Solr search index as a database - is this "wrong"?

My team is working with a third party CMS that uses Solr as a search index. I've noticed that it seems like the authors are using Solr as a database of sorts in that each document returned contains two fields:
The Solr document ID (basically a classname and database id)
An XML representation of the entire object
So basically it runs a search against Solr, download the XML representation of the object, and then instantiate the object from the XML rather than looking it up in the database using the id.
My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.
Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?
EDIT: When I say "XML representation" - I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.

It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving this using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as default response format is largely irrelevant, you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.

This was probably done for performance reasons, if it doesn't cause any problems I would leave it alone. There is a big grey area of what should be in a traditional database vs a solr index. Ive seem people do similar things to this (usually key value pairs or json instead of xml) for UI presentation and only get the real object from the database if needed for updates/deletes. But all reads just go to Solr.

I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.

Adding to #Jayunit100 response, using solar as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between what you write and when you can read it back.

Monitoring Changes of a Bean to build deltas?

I have several Beans in my Application which getting updated regularly by the usual setter methods. I want to synchronize these beans with a remote application which has the same bean classes. In my case, bandwidth matters, so i have to keep the amount of transferred bytes as low as possible. My idea was to create deltas of the state changes and transfer them instead of the whole Objects. Currently, I want to write the protocol to transfer those changes by myself but I'm not bound to it and would prefer an existing solution.
Is there already a solution for this Problem out there? And if not, how could I easily monitor those state changes in an generalized way? AOP?
Edit: This problem is not caching related even it may first seem so. The data must be replicated from a central server to several clients (about 4 to 10) over the internet. The client is a standalone desktop application.

This sounds remarkably similar to JBossCache running in POJO mode.
This is a distributed, delta-based cache that breaks down java objects into a tree structure, and only transmits changes to the bits of the tree that changes.
Should be a perfect fit for you.

I like your idea of creating deltas and sending them.
A simple Map could handle the delta for one object. Serialization could simply get you the effective message send.
To reduce the number of messages that would kill your performance, you should group your deltas for all objects and send them as a whole. So you could have others collections or maps to contain this.
To monitor all changes to many beans, AOP seem like a good solution.
EDIT : see Skaffmann's answer.
Using an existing cache technology could be better.
Many problems could already have solutions implemented...

Share file storage index with multiple open applications in Java

I'm writing an HTTP Cache library for Java, and I'm trying to use that library in the same application which is started twice. I want to be able to share the cache between those instances.
What is the best solution for this? I also want to be able to write to that same storage, and it should be available for both instances.
Now I have a memory-based index of the files available to the cache, and this is not shareable over multiple VMs. It is serialized between startups, but this won't work for a shared cache.
According to the HTTP Spec, I can't just map files to URIs as there might be a variation of the same payload based on the request. I might, for instance, have a request that varies on the 'accept-language' header: In that case I would have a different file for each subsequent request which specifies a different language.
Any Ideas?

First, are you sure you want to write your own cache when there are several around? Things like:
ehcache
jboss cache
memcached
The first two are written in Java and the third can be accessed from Java. The first two also handle distributed caching, which is the general case of what you are asking for, I think. When they start up, they look to connect to other members so that they maintain a consistent cache across instances. Changes to one are reflected across instances. They can be set up to connect via multicast or with specific lists of servers specified.
Memcached typically works in a slightly different manner in that it is running externally to the Java processes you are running, so that all Java instances that start up will be talking to a common service. You can set up memcached to work in a distributed manner, but it does so by hashing keys so that the server you want to connect to can be determined by what it is you are looking for.
Doing a true distributed cache with consistent content is very hard to do well, which is why I suggest looking at an existing library. If you want to do it yourself, it would still help to look at those listed to see how they go about it and consider using something like JGroups as your underlying mechanism.

I think you should have a look at the WebDav-Specifications. It's an HTTP extension for sharing/editing/storing/versioning resources on a server. There exists an implementation as an Apache module, wich allows you a swift start using them.
So instead of implementing your own cache server implementation, you might be better off with a local Apache + mod-dav instance that is available to both of your applications.
Extra bonus: Since WebDav is a specified protocoll you get the interoperability with lots of tools for free.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.