How to distribute and maintain a Java/Scala program over a Linux cluster?

I have a small cluster of Linux machines and an account on all of them.
I have passwordless ssh access to all of them.
How can I use actors or another of Scala's concurrency abstractions to achieve distribution?
What is the simplest path?
Can some library distribute the processes for me?
The computers are unreliable: they can go on and off whenever the students feel like it.
Can some library distribute the processes for me and watch for ready computers?
Can I avoid bash scripts?

I'd use Akka in your place. It is a distributed computing platform for Scala and Java, based on Erlang's actor model. The let-it-fail philosophy it inherits from Erlang is particularly well suited to an environment where nodes might go off at any time.
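Whichever framework you pick, something still has to notice which machines are currently switched on. A minimal sketch of that check in plain Java (no Akka involved), assuming every machine runs sshd on port 22; the host names and the 500 ms timeout are placeholders, not something from the question:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClusterProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isUp(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable: treat the machine as down.
            return false;
        }
    }

    /** Filters the given hosts down to those that currently accept connections. */
    static List<String> reachable(List<String> hosts, int port, int timeoutMillis) {
        List<String> up = new ArrayList<>();
        for (String host : hosts) {
            if (isUp(host, port, timeoutMillis)) {
                up.add(host);
            }
        }
        return up;
    }

    public static void main(String[] args) {
        // Hypothetical host names; replace with the real cluster machines.
        List<String> hosts = Arrays.asList("node01", "node02", "node03");
        System.out.println("Up: " + reachable(hosts, 22, 500));
    }
}
```

This only tells you that a machine answers on the ssh port at that moment; a node can still disappear a second later, which is exactly the case the let-it-fail approach is meant to absorb.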

You could use Scala's built-in support for remote actors. Browse the web or see Scala remote actors for additional information.
You could also take a look at GridGain, a very easy-to-use grid computing framework for Java and Scala.

What you are looking for is a grid.
For a free one take a look at http://www.jppf.org/
JavaSpaces allows you to create distributed Data Structures that facilitate Distributed Computing. Not cheap, but look at GigaSpaces for a robust implementation.
This book gave me an eye opening experience of the possibilities.

What difficulties should I expect if I write a NoSQL db using golang but want to run Hadoop mapreduce on it?

I would like to build a distributed NoSQL database or key-value store in Go, to learn the language and to practice the distributed systems knowledge I've learnt at school. The target use case I can think of is running MapReduce on top of it, implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph and Amazon S3.
My question is: what difficulties should I expect when integrating such a NoSQL database with Hadoop, or with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs), if I use Go to build the system?
Ok, I'm not much of a Hadoop user so I'll give you some more general lessons learned about the issues you'll face:
Protocol. If you're going with REST, Go will be fine, but expect some gotchas in the default HTTP library's defaults (idle keep-alive connections are not expired, and you won't necessarily know when a reader has closed a stream). If you want something more compact, know that: a. the Thrift implementation for Go, last I checked, was lacking and relatively slow; b. Go has great support for RPC, but it might not play well with other languages. So you might want to check out protobuf, or work on top of the Redis protocol or something like that.
GC. Go's GC is very simplistic (STW, not generational, etc). If you plan on heavy memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
Besides slices, lists and maps, you won't have a lot of built-in data structures to work with, such as a Set type. There are tons of good implementations out there, but you'll have to do some digging.
Take the time to learn concurrency patterns and interface patterns in Go. They are a bit different from other languages, and as a rule of thumb, if you find yourself struggling with a pattern from another language, you're probably doing it wrong. A good talk about Go concurrency is this one IMHO http://www.youtube.com/watch?v=QDDwwePbDtw
A few projects you might want to have a look at:
Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go. https://github.com/golang/groupcache and check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
InfluxDB, which includes a Go-based implementation of the great Raft algorithm: https://github.com/influxdb/influxdb
My own humble (pretty dead) project, a redis compliant database that's based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the redis protocol. https://bitbucket.org/dvirsky/boilerdb

cloud computing simulation environment for project

I am a final-year engineering student doing a project in cloud computing. I am confident about the concept, but I don't know how to simulate it in a cloud. For a PG (postgraduate) student, which cloud computing simulation environment is easy to use? Kindly give your valuable suggestions. (I am currently implementing the concept in Java.)
Try taking a look at OpenShift. It's free and very easy to use if you're familiar with Unix/Git. I host my blog there on a Java/Unix/MySQL stack and have been very satisfied.
Firstly, I recommend that you understand the difference between IaaS and PaaS. Wikipedia is always a good place to find this information. You could then compare the two cloud computing models.
You will see that with PaaS it is much easier to start with a service, since you don't need to install or configure anything. Usually a single button makes a specific service available, and deploying your application takes only a few steps.
You should look for the "How to start" guides of different PaaS providers. You can start with this How to start tutorial and then look for similar guides and compare the most important providers. You will see that it is really easy to start working with this cloud model.
Agreed: PaaS might be a good starting point. I don't have any experience with Java, though; a quick Google search suggests http://www.cloudbees.com/ might be worth a look.
If you want to go a bit deeper, you should try out Amazon's EC2. I believe they have done a very good job, plus they offer a free tier for one year.
If you want to build cloud computing simulations in Java, take a look at CloudSim Plus. It is a modern, full-featured, highly extensible and easier-to-use Java 8 Framework for Modeling and Simulation of Cloud Computing Infrastructures and Services.
It is an actively maintained, totally redesigned, better organized and extensively documented project. It has a large number of exclusive features and is the only cloud simulation framework available at Maven Central.
Some of its main characteristics and features include:
Vertical VM scaling, which performs on-demand up and down allocation of VM resources such as RAM, bandwidth and PEs (CPUs).
Horizontal VM scaling, allowing dynamic creation of VMs according to an overload condition. Such a condition is defined by a predicate that can check the usage of different VM resources such as CPU, RAM or BW.
Parallel execution of simulations, allowing several simulations to be run simultaneously, in an isolated way, inside a multi-core computer.
Listeners to enable simulation monitoring.
Classes and interfaces to allow implementation of heuristics such as Tabu Search, Simulated Annealing, Ant Colony Systems and so on. See an example using Simulated Annealing here.
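To make the "overload condition defined by a predicate" idea from the horizontal scaling item concrete, here is a rough sketch in plain Java. It deliberately does not use the CloudSim Plus API: the `VmUsage` class, the `overloaded` helper and the thresholds are all hypothetical names invented for illustration.

```java
import java.util.function.Predicate;

public class OverloadPredicateSketch {

    /** Hypothetical usage snapshot of one VM; each value is a fraction in [0, 1]. */
    static final class VmUsage {
        final double cpu;
        final double ram;
        final double bw;

        VmUsage(double cpu, double ram, double bw) {
            this.cpu = cpu;
            this.ram = ram;
            this.bw = bw;
        }
    }

    /** Builds a predicate that is true when CPU or RAM usage exceeds its limit. */
    static Predicate<VmUsage> overloaded(double cpuLimit, double ramLimit) {
        return vm -> vm.cpu > cpuLimit || vm.ram > ramLimit;
    }

    public static void main(String[] args) {
        Predicate<VmUsage> isOverloaded = overloaded(0.7, 0.8);
        VmUsage busy = new VmUsage(0.95, 0.40, 0.10);
        VmUsage idle = new VmUsage(0.10, 0.20, 0.05);
        // A scaler would create a new VM whenever the predicate holds.
        System.out.println("busy overloaded? " + isOverloaded.test(busy));
        System.out.println("idle overloaded? " + isOverloaded.test(idle));
    }
}
```

In the real framework the predicate would receive the simulator's own VM object rather than a hand-made snapshot; check the CloudSim Plus documentation for the actual class and method names.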

How to integrate Java with nodejs for handling CPU-heavy tasks?

I am trying to pick the right web technology for both I/O-heavy and CPU-heavy tasks. Node.js is perfect for handling large load and it can also be scaled out. However, I am stuck on the CPU-heavy part. Is it possible to integrate another technology (e.g. Java) into Node, so that it runs my algorithms in other threads and I can then use the results in Node again? Is there any existing solution? Any other suggestions would be very welcome.
You can integrate Node.js with Java using node-java.
As mentioned in a previous answer, you can use node-java which is an npm module that talks to Java. You can also use J2V8 which wraps Node.js as a Java library and provides a Node.js API in Java.
The answer is lambda architecture.
Node.js is nice by itself: it handles fast queries in a lightweight manner, without doing any extra computations on the data.
The CPU-heavy tasks can easily be delegated to specialized components based on the JVM (well, the most famous ones are on the JVM). This is nicely implemented using message brokers and microservices.
The result is an event-based architecture, where Node.js can be hooked up to databases like Cassandra or MongoDB and to cluster computing frameworks like Apache Spark (not necessarily, though; it depends on the problem) to handle the CPU-heavy parts of the system. Lightweight containers are the icing on the cake, providing nice isolated runtime environments for each of the components to live in.
That's my conclusion so far regarding this question.
I think the suggestions above more or less eliminate the need to wrap Node under Java or another JVM-based solution for CPU-heavy tasks.
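As a small illustration of delegating CPU-heavy work to a separate JVM component, here is a sketch of a tiny Java service that Node.js could call. It uses plain HTTP via the JDK's built-in `com.sun.net.httpserver` instead of a message broker, purely to keep the example short. The port 8080, the `/primes` path and the prime-counting workload are all made-up stand-ins.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class CpuWorker {

    /** Stand-in for a CPU-heavy algorithm: counts primes up to n by trial division. */
    static int countPrimes(int n) {
        int count = 0;
        for (int i = 2; i <= n; i++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= i; d++) {
                if (i % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                count++;
            }
        }
        return count;
    }

    /** Starts the service; GET /primes?n=1000 answers with the count as plain text. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/primes", exchange -> {
            int status = 200;
            String reply;
            try {
                String query = exchange.getRequestURI().getQuery(); // expected: "n=1000"
                if (query == null || !query.startsWith("n=")) {
                    throw new IllegalArgumentException("missing n");
                }
                reply = String.valueOf(countPrimes(Integer.parseInt(query.substring(2))));
            } catch (IllegalArgumentException e) { // also covers NumberFormatException
                status = 400;
                reply = "usage: /primes?n=<integer>";
            }
            byte[] body = reply.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        // One worker thread per core, so requests are computed in parallel.
        server.setExecutor(Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()));
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080); // hypothetical port
    }
}
```

On the Node.js side, any HTTP client will do; for example, `fetch('http://localhost:8080/primes?n=1000000')` in a recent Node version returns the result without blocking the event loop. In a larger system the same worker would more likely consume jobs from a message broker, as described above.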
Node.js is based on the V8 JavaScript engine, which is written in C++.
It is therefore possible to write fully native addons for Node.js in C++. Check out some of these resources:
https://github.com/nodejs/node-addon-api
https://github.com/nodejs/node-addon-examples

moving java class bytecode from jvm to jvm

So I have a server JVM and a client JVM. The client communicates with the server by sending serialized Java objects over TCP. Normally, the server would have the classes of the objects it receives on its classpath, in order to deserialize the objects properly.
But what I'm looking for is some way to avoid that, i.e., have the client "somehow" send the class bytecode over the wire, on demand. This would of course require recursing down the class tree (in case any members of the original class were themselves objects of other classes that the server didn't know about).
So I was wondering about any technologies out there that do this sort of thing.
Thx.
RMI includes the notion of a "class server." Sounds like you're pretty much reinventing that, so consider looking into using all or part of RMI. Here's a tutorial.
RMI has the ability to dynamically download entire class file definitions over the wire on demand.
Even if you don't use (or want to use) RMI, the technologies underlying the classloading may be of interest, and they're standard Java.
You are asking about Code Mobility. Also the area of grid computing is somewhat relevant.
Take a look at Mobility-RPC, it's a library which does exactly what you ask at the same level of granularity (class-level).
Security is something to bear in mind. But I'd also remember that SQL sent to databases, bash commands executed over SSH, Business rules engines, Adobe Flash, Java Applets, RMI as described above, ActiveX, JavaScript, Hadoop/grid computing frameworks - all of these are examples of remote code execution in widespread use. Like everything, turning the security dial to the max is going to limit your options. But all of the above are used to good effect when properly firewalled or sandboxed.
In this instance, it sounds like you want something to eliminate a minor hassle, and you're not (say) designing a full-blown distributed application. So based on what you've said, despite myself being somewhat of a code mobility proponent, I'd say code mobility is probably overkill in this case. (But useful in others.)
Regarding grid computing, take a look at GridGain, and Hadoop. GridGain is a pure (CPU-centric) grid computing framework, whereas Hadoop is more a data mining/data warehouse platform with its own replicated distributed file system (HDFS).
Both GridGain and Hadoop transmit user-defined Java code implementing tasks/jobs to remote worker nodes. Last time I checked, they did this by transferring user-supplied jar files to the relevant nodes. I think the GridGain ClassLoader is more sophisticated than Hadoop's however (but less sophisticated than Mobility-RPC's). Hadoop basically starts a new JVM for every job, not especially efficient (but not exactly the bottleneck given the IO load!).
Mobility-RPC is somewhat different because it doesn't expect the remote machine to be a worker node at all, it could be any application running the library. So it's more like RPC or ad-hoc task/object transfer.
This sounds like a very bad idea. Basically, it would mean you allow your client to send code to the server which is executed directly inside the current process. Something like this is generally considered a serious vulnerability, namely arbitrary code execution, which is one of the worst vulnerabilities you could ever have.
Building a system design based on that is, well, not so smart.
Create a classloader that loads classes from a stream. See the JarFileClassLoader example for details.
This, of course, will become hugely problematic, particularly if any of the classes use reflection and don't directly name an implementation in the bytecode, in addition to potential security issues; you'll need to look into secure classloaders.
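A minimal sketch of such a classloader, assuming the bytes of exactly one class arrive on the stream and that the caller already knows the class name. `StreamClassLoader` and `defineFromStream` are names chosen here for illustration, and the sketch does nothing about the security concerns mentioned above:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamClassLoader extends ClassLoader {

    public StreamClassLoader(ClassLoader parent) {
        super(parent);
    }

    /** Reads the stream to the end and defines a class from the bytes read. */
    public Class<?> defineFromStream(String className, InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        byte[] bytes = buffer.toByteArray();
        // defineClass turns raw bytecode into a Class owned by this loader.
        return defineClass(className, bytes, 0, bytes.length);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a socket stream: this class's own bytecode from the classpath.
        String resource = StreamClassLoader.class.getSimpleName() + ".class";
        try (InputStream in = StreamClassLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                System.out.println("bytecode not found on the classpath");
                return;
            }
            StreamClassLoader loader = new StreamClassLoader(StreamClassLoader.class.getClassLoader());
            Class<?> copy = loader.defineFromStream(StreamClassLoader.class.getName(), in);
            System.out.println(copy.getName() + " defined by " + copy.getClassLoader());
        }
    }
}
```

In a real client/server setup the `InputStream` would come from the TCP connection, and classes referenced by the loaded class would have to be fetched the same way, typically by overriding `findClass` to request missing bytecode from the client.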
If plain RMI does not meet your requirements, take a look at mobile agent frameworks in Java (e.g., Aglets).

Java framework for distributed system

I am looking for a library (or a combination of libraries) to build a Java distributed system, made of several applications exchanging data through several pairwise connections (no MapReduce). So far I have explored the existing libraries, but I could only discard what I've found. Here are my requirements:
Easy discovery of systems at runtime (possibly through a central server/directory)
Lightweight, low-latency messages (no CORBA, RMI, SOAP, etc.)
Decentralized communications (nothing Linda-like)
Easy enough to use and learn (no JXTA)
Compatible with GPL license (so GPL, BSD, etc.)
Do you have any suggestions? Thanks in advance.
Are you familiar with JGroups? You could use it to design your own architecture. It provides an easy-to-use multicast abstraction.
I'm a big fan of JGroups, but I recently discovered Hazelcast and will probably give it a try. It might be what you're looking for.
You might want to take a peek at Terracotta ( http://www.terracotta.org/ )
You could take a look at Jade if you like the multi-agent paradigm: http://jade.tilab.com/
I think Apache River (formerly Jini) should at least be mentioned. It never received too much attention, probably also because it had (don't know if it still has) a rather steep learning curve. Anyhow, it is under active development:
http://river.apache.org/
JBoss, ok, ok, it is not a framework but they have a number of projects that sound just like what you want.
You may use Redisson - distributed and scalable Java data structures (BitSet, BloomFilter, Set, SortedSet, Map, ConcurrentMap, List, Queue, Deque, BlockingQueue, BlockingDeque, ReadWriteLock, Semaphore, Lock, AtomicLong, CountDownLatch, Publish / Subscribe, RemoteService, ExecutorService, LiveObjectService, ScheduledExecutorService) on top of high performance Redis server.
