Java framework for distributed system


I am looking for a library (or a combination of libraries) to build a Java distributed system made of several applications exchanging data through several pairwise connections (no MapReduce). So far I have explored the existing libraries and could only rule out what I found. Here are my requirements:
Easy discovery of systems at runtime (possibly through a central server/directory)
Lightweight and low latency messages (no CORBA, RMI, SOAP, etc.)
Decentralized communications (nothing Linda-like)
Easy enough to use and learn (no JXTA)
Compatible with GPL license (so GPL, BSD, etc.)
Do you have any suggestions? Thanks in advance.

Are you familiar with JGroups? You could use it to design your own architecture. They provide easy-to-use multicast abstraction.

I'm a big fan of JGroups, but I recently discovered Hazelcast and will probably give it a try. It might be what you're looking for.

You might want to take a peek at Terracotta ( http://www.terracotta.org/ )

You could take a look at Jade if you like the multi-agent paradigm: http://jade.tilab.com/

I think Apache River (formerly Jini) should at least be mentioned. It never received too much attention, probably also because it had (don't know if it still has) a rather steep learning curve. Anyhow, it is under active development:
http://river.apache.org/

JBoss, ok, ok, it is not a framework but they have a number of projects that sound just like what you want.

You may use Redisson - distributed and scalable Java data structures (BitSet, BloomFilter, Set, SortedSet, Map, ConcurrentMap, List, Queue, Deque, BlockingQueue, BlockingDeque, ReadWriteLock, Semaphore, Lock, AtomicLong, CountDownLatch, Publish / Subscribe, RemoteService, ExecutorService, LiveObjectService, ScheduledExecutorService) on top of high performance Redis server.

Related

What difficulties should I expect if I write a NoSQL db using golang but want to run Hadoop mapreduce on it?

I would like to build a distributed NoSQL database or key-value store using golang, to learn golang and practice the distributed systems knowledge I learned at school. The target use case I can think of is running MapReduce on top of it, and implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph and Amazon S3.
My question is, what difficulties should I expect when integrating such a NoSQL database with Hadoop? Or when integrating with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs) if I use golang to build the system?

Ok, I'm not much of a Hadoop user so I'll give you some more general lessons learned about the issues you'll face:
Protocol. If you're going with REST, Go will be fine, but expect some gotchas in the default HTTP library's defaults (not expiring idle keepalive connections, not necessarily knowing when a reader has closed a stream). But if you want something more compact, know that: a. the Thrift implementation for Go, last I checked, was lacking and relatively slow. b. Go has great support for RPC but it might not play well with other languages. So you might want to check out protobuf, or work on top of the redis protocol or something like that.
GC. Go's GC is very simplistic (STW, not generational, etc.). If you plan on heavy memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
Besides slices, lists and maps, you won't have a lot of built in data structures to work with, like a Set type. There are tons of good implementations of them out there, but you'll have to do some digging up.
Take the time to learn concurrency patterns and interface patterns in Go. It's a bit different than other languages, and as a rule of thumb, if you find yourself struggling with a pattern from other languages, you're probably doing it wrong. A good talk about Go concurrency is this one IMHO http://www.youtube.com/watch?v=QDDwwePbDtw
A few projects you might want to have a look at:
Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go. https://github.com/golang/groupcache and check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
InfluxDB, which includes a Go-based implementation of the great Raft algorithm: https://github.com/influxdb/influxdb
My own humble (pretty dead) project, a redis compliant database that's based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the redis protocol. https://bitbucket.org/dvirsky/boilerdb

Process communication in Java, JavaSpaces

Two Java programs have to communicate with each other. To do that I found two possibilities:
Using Sockets
Using JavaSpaces
After looking into the descriptions, I found that JavaSpaces is apparently the better solution. Sadly, I can't get it to run. Every tutorial routes me to another installation process, to other files and so on.... :(
How do I install JavaSpaces, where can I download it, etc.?
If someone offers me a better solution, I'll be thankful too (JavaSpaces seems to be from 2005).
These are the websites I found so far:
http://www.jroller.com/matsh/entry/intreagued_by_javaspaces_try_blitz
(Installation description, not working...)
http://www.jini.org/wiki/Main_Page
(Download links are broken)
http://www.jarvana.com/jarvana/inspect/com/sun/jini/jini-starterkit/2.1/jini-starterkit-2.1.zip?folder=jini2_1/
(Download of jini starter kit)

For a quick start using GigaSpaces, a commercial JavaSpaces product (with a community edition available), see http://www.gigaspaces.com/wiki/display/XAP8/Data+Grid+Quick+Start
Also see http://replay.waybackmachine.org/20070202031207/http://www.theserverside.com/tt/articles/article.tss%3Fl%3DUsingJavaSpaces and http://www.theserverside.com/news/thread.tss?thread_id=42164 and http://www.enigmastation.com/?page_id=425
JavaSpaces is great, IMO (I'm biased, as I work for GigaSpaces... but then again, I work for GigaSpaces because I think the underlying technology is great.) - it's got a very simple API but the transaction model is actually pretty strong, and it's very fast. It's simpler and stronger than JMS, and has a simpler deployment/connection model.
If you're GigaSpaces-averse for some reason ("yikes, someone makes money from this") you can look into Blitz as well.

In Feb 2009 another user on SO mentioned that GigaSpaces is a mature version of JavaSpaces.
Looking at that older question, I'm starting to believe that JavaSpaces is dead..!?

Have you also considered using something like RMI, where the fact that you are invoking a method on a remote system becomes transparent?
Or JMS where you just send and read messages -- and the infrastructure routes them to the right place/process?
Or how about another approach where you have a network cache (e.g. memcached) where both processes can put and get items to/from the cache -- thus allowing for inter-process communication to a certain extent?
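
For comparison, the plain-sockets option from the question can be sketched with nothing but the JDK. This is only a minimal sketch under stated assumptions: the class names LineEchoServer and LineClient, the "echo: " prefix and the one-line-per-message protocol are made up for illustration, and both ends run in one JVM here just to keep the example self-contained. In a real setup the two halves would live in the two programs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical "program A": accepts one client and echoes each line back. */
final class LineEchoServer implements AutoCloseable {
    private final ServerSocket serverSocket;

    LineEchoServer() {
        try {
            // Port 0 asks the OS for any free port; loopback keeps the demo local.
            serverSocket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Thread acceptThread = new Thread(this::serveOneClient, "echo-server");
        acceptThread.setDaemon(true);
        acceptThread.start();
    }

    int port() {
        return serverSocket.getLocalPort();
    }

    private void serveOneClient() {
        try (Socket client = serverSocket.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println("echo: " + line); // autoflush sends the reply immediately
            }
        } catch (IOException e) {
            // The server socket was closed or the client vanished; this sketch just stops.
        }
    }

    @Override
    public void close() {
        try {
            serverSocket.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical "program B": sends one line and waits for one line in reply. */
final class LineClient {
    static String request(int port, String message) {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println(message);
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Usage would be something like creating a LineEchoServer, passing its port() to LineClient.request(port, "hello"), and closing the server afterwards. Everything beyond that (framing, reconnects, several clients) is exactly the plumbing that RMI, JMS or a shared cache would take off your hands.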

How to distribute and maintain a Java/Scala program over a Linux cluster?

I have a small cluster of Linux machines and an account on all of them.
I have ssh access to all of them with no-password login.
How can I use actors or Scala's other concurrency abstractions to achieve distribution?
What is the simplest path?
Can some library distribute the processes for me?
The computers are unreliable; they go on and off whenever they (students) feel like it.
Can some library distribute the processes for me and watch for ready computers?
Can I avoid bash scripts?

I'd use Akka in your place. It is a distributed computing platform for Scala and Java, based on Erlang's Actor model. In particular, the let-it-fail philosophy it inherits from Erlang is well suited to an environment where the nodes might go off at any time.

You could use Scala's built-in support for remote actors. Browse the web or see Scala remote actors for additional information.
You could also take a look at GridGain, a very easy to use grid computing framework for Java and Scala.

What you are looking for is a grid.
For a free one take a look at http://www.jppf.org/
JavaSpaces allows you to create distributed Data Structures that facilitate Distributed Computing. Not cheap, but look at GigaSpaces for a robust implementation.
This book gave me an eye opening experience of the possibilities.

Space-based architecture?

One chapter in Pragmatic Programmer recommends looking at a blackboard/space-based architecture + a rules engine as a more flexible alternative to a traditional workflow system.
The project I'm working on currently uses a workflow engine, but I'd like to evaluate alternatives. I really feel like an SBA would be a better solution to our business problems, but I'm worried about a total lack of community support/user base/vendors/options.
JavaSpaces is dead, and the JINI spin-off Apache River seems to be on life support. SemiSpace looks perfect, but it's a one-man show. The only viable solution seems to be GigaSpaces.
I'd like to hear your thoughts on space based architecture and any experiences you've had with real world implementations.

Why do you regard JavaSpaces as dead, beyond the fact that the Jini 2.1 release was some time ago (October 2005)? Having used that, I'd suggest that it indicates a mature and complete technology set rather than something abandoned and defunct.
For another implementation of JavaSpaces, take a look at Blitz JavaSpaces. That's maintained and enhanced more regularly (latest release July 2008) and offers a more performant and manageable JavaSpace implementation than the default outrigger supplied by Sun.

GigaSpaces is a successful commercial implementation of JavaSpaces -- so, I wouldn't say JavaSpaces is dead.
You might take a look at Java Shared Data Toolkit (also this article) to see if it meets your requirements.

Space-based architecture is a distributed-computing architecture for achieving linear scalability of stateful, high-performance applications using the tuple space paradigm. With a space-based architecture, applications are built out of a set of self-sufficient units, known as processing units.
Example: GigaSpaces
Here is the reference for GigaSpaces:
https://docs.gigaspaces.com/latest/overview/space-based-architecture.html
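
To make the tuple space idea a bit more concrete, here is a toy sketch in plain Java. It is not the JavaSpaces or GigaSpaces API and it is not distributed at all: TinySpace, write and take are hypothetical names for an in-memory, single-JVM illustration of the paradigm, where producers write tuples into a shared space and consumers take them out by matching a template (null acts as a wildcard here, which is an assumption of this sketch).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy in-memory tuple space; hypothetical, single JVM only. */
final class TinySpace {
    private final List<Object[]> tuples = new ArrayList<>();

    /** Puts a tuple into the space and wakes up any waiting takers. */
    synchronized void write(Object... tuple) {
        tuples.add(tuple.clone());
        notifyAll();
    }

    /**
     * Removes and returns the first tuple matching the template, waiting up to
     * timeoutMillis for one to appear. A null field in the template matches
     * anything. Returns null if nothing matched in time.
     */
    synchronized Object[] take(long timeoutMillis, Object... template) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            for (Iterator<Object[]> it = tuples.iterator(); it.hasNext();) {
                Object[] candidate = it.next();
                if (matches(template, candidate)) {
                    it.remove();
                    return candidate;
                }
            }
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return null;
            }
            try {
                wait(remaining); // releases the monitor until a write or the timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    private static boolean matches(Object[] template, Object[] tuple) {
        if (template.length != tuple.length) {
            return false;
        }
        for (int i = 0; i < template.length; i++) {
            if (template[i] != null && !template[i].equals(tuple[i])) {
                return false;
            }
        }
        return true;
    }
}
```

A worker would call something like space.take(1000, "order", null) to grab any pending "order" tuple, process it, and write a result tuple back. The point of the pattern is that producers and consumers only know the space and the tuple shape, never each other, which is what lets processing units be added or removed independently.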

While it doesn't support the JavaSpaces API, I'd suggest looking at Oracle Coherence for a distributed and reliable "live" data store that can drive event-based workflow. Deutsche Bank, for example, successfully replaced a "SBA" (Space Based Architecture) with an event-driven system built on Coherence for their FX trading, because of both reliability and performance issues.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

Any experience using Terracotta open source?

Does anybody have experience using the open source offering from Terracotta as opposed to their enterprise offering? Specifically, I'm interested in whether it is worth the effort to use Terracotta without the enterprise tools to manage your cluster.
Over-simplified usage summary: we're a small startup with limited budget that needs to process millions of records and scale for hundreds-of-thousands of page views per day.

I am in the process of integrating Terracotta with my project (a sensor node network simulator). About three weeks ago I found out about Terracotta from one of my colleagues, and now my application takes advantage of grid computing using Terracotta. Below I summarize some essential points of my experience with Terracotta.
The Terracotta site contains pretty detailed documentation. This article is probably a good starting point for a developer: Concept and Architecture Guide
When you are stuck with a problem and find no answer in the documentation, the Terracotta community forum is a good place to ask questions. It seems that Terracotta developers check it on a regular basis and are pretty responsive.
Even though Terracotta runs under the JVM and it is advertised that making your application run in a cluster is only a matter of configuration, you should be prepared to introduce some serious changes in your application to make it perform reasonably well. E.g. I had to completely rewrite the synchronization logic of my application.
Good integration with Eclipse.
The Admin Console is a great tool and it helped me a lot in tweaking my application to perform decently under Terracotta. It collects every performance metric you can think of from servers and clients. It certainly has some GUI related issues, but who does not :-)
Prefer standard Java synchronization primitives (synchronized/wait/notify) over java.util.concurrent.* citizens. I found that standard primitives provide higher flexibility (they can be configured to be a read or write cluster lock, or even not a lock at all) and are easier to track in the Admin Console (you see the class name of the object being locked rather than e.g. some ReentrantLock).
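
As an illustration of the last point, this is roughly what "standard primitives" means in plain Java. It is a minimal sketch with a made-up Mailbox class: a one-slot hand-off guarded by the object's own monitor (synchronized/wait/notifyAll) instead of a ReentrantLock with Conditions. No Terracotta configuration is shown, and how such a class would actually be clustered is an assumption that depends entirely on your Terracotta setup.

```java
/** Hypothetical one-slot mailbox using only intrinsic locks. */
final class Mailbox<T> {
    private T item; // null means the mailbox is empty

    /** Blocks until the slot is free, then stores the value. */
    synchronized void put(T value) {
        if (value == null) {
            throw new IllegalArgumentException("null is reserved for 'empty'");
        }
        try {
            while (item != null) {
                wait(); // releases the monitor on this Mailbox while waiting
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to put", e);
        }
        item = value;
        notifyAll(); // wake up any thread blocked in take()
    }

    /** Blocks until a value is available, then removes and returns it. */
    synchronized T take() {
        try {
            while (item == null) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to take", e);
        }
        T value = item;
        item = null;
        notifyAll(); // wake up any thread blocked in put()
        return value;
    }
}
```

The lock here is the Mailbox instance itself, so a monitoring tool that reports locks by the class of the locked object would show Mailbox rather than an anonymous lock object, which matches the tracking benefit described above.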
Hope that helps.

At the moment, the Terracotta enterprise tools provide only a few features beyond the open source version around things like visualization and management (like the ability to kick a client out of the cluster). That will continue to diverge and the enterprise tools are likely to boast more operator-level functionality around things like managing and monitoring, but you can certainly manage and tune an app even with the open source tools.
The enterprise license also gives you things like support, indemnification, etc which may or may not be as important to you as the tooling.
I would urge you to try it for yourself. If you'd like to see an example of a real app using Terracotta, you should check out this reference web app that was just released:
The Examinator

You may want to take a look at JBossCache/PojoCache, which is an in-memory distributed caching solution. The difference is that it uses a simple API to propagate objects across your 'cluster' of caches, whereas Terracotta works at the classloading/JVM level.
(They don't actually have their own JVM, but they modify classes as they are loaded to allow them to be 'clusterable')
Our company had a lot of luck with JBossCache, I'd recommend checking it out.

Update
What I see in the OP message is "well, I don't really know what we need (thus the lack of detailed requirements), but maybe some enterprisey tool will magically solve all our problems, known and unforeseen? That would be awesome!"
With an architectural approach like this it's not gonna fly. No success stories from Terracotta would change that.
OSS is beneficial when the community around it can replace the commercial support. Suppose the guy has a problem in production. The community cannot help -- it's too small for an obscure product like this. Servers are down, business is in danger. You see? You need a commercial license up-front. No money? Well, then you're not a business, and probably not gonna become one (if nobody's willing to invest in it).
Sorry for interrupting your day-dreaming.
IMHO:
Terracotta is a clustering solution. Clustering is required for large, enterprise-grade applications. Large applications mean big budgets. Big budgets mean you can afford a commercial license from Terracotta.
To put it in another way: if you don't have budget to buy it, it's probably not beneficial for your project.
