I have a project which briefly is as follows: Create an application that can accept tasks written in Java that perform some kind of computation and run the tasks on multiple machines* (the tasks are separate and have no dependency on one another).
*the machines could be running different OSs (mainly Windows and Ubuntu)
My question is, should I be using a distributed system like Apache Mesos for this?
The first thing I looked into was Java P2P libraries/ frameworks and the only one I could find was JXTA (https://jxta.kenai.com/) which has been abandoned by Oracle.
Then I looked into Apache Mesos (http://mesos.apache.org/), which seems like a good fit: an underlying system that runs on multiple machines and lets them share resources while processing tasks. I have spent a little while trying to get it running locally as an example, but it seems slightly complicated and takes forever to get working.
If I should use Mesos, would I then have to develop a Framework for my project that takes all of my java tasks or are there existing solutions out there?
To test it on a small scale locally, would you install it on your machine, set that up as a master, create a VM, install it there as a slave, and somehow point the slave at the master? The documentation and examples don't show exactly how to hook up a slave on the network to a master.
Thanks in advance, any help or suggestions would be appreciated.
You can definitely use Mesos for the task you have described. You do not need to develop a framework from scratch; instead, you can use a scheduler like Marathon for long-running tasks, or Chronos for one-off or recurring tasks.
For a real-life setup you would definitely want more than one machine, but if you're only interested in experimenting you can run everything (Mesos master, Mesos slave, and the frameworks) on a single machine. The Examples section of the Mesos Getting Started guide demonstrates how to do that.
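Because the tasks are independent, each one can be packaged as a self-contained program that a scheduler starts as an ordinary command. Here is a minimal sketch, assuming a hypothetical task class `SumOfSquaresTask` and a hypothetical jar name `tasks.jar`; neither the names nor the computation come from Mesos, Marathon or Chronos:

```java
import java.util.concurrent.Callable;

// Hypothetical example task: a self-contained computation with no
// dependency on any other task, so it can run on any machine with a JVM.
class SumOfSquaresTask implements Callable<Long> {
    private final int n;

    SumOfSquaresTask(int n) {
        this.n = n;
    }

    // Computes 1^2 + 2^2 + ... + n^2.
    @Override
    public Long call() {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Entry point so a scheduler can start the task as a plain command,
    // for example: java -cp tasks.jar SumOfSquaresTask 1000
    // (tasks.jar is an assumed name for the packaged tasks).
    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        System.out.println(new SumOfSquaresTask(n).call());
    }
}
```

Keeping the task behind a plain `main` method means the same jar runs unchanged on Windows and Ubuntu, and the scheduler only needs to know the command line.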
Related
I am trying to wrap my head around Apache Mesos and need clarification on a few items.
My understanding of Mesos is that it is an executable that gets installed on every physical/VM server ("node") in a cluster, and then provides a Java API (somehow) that treats each individual node as a collective pool of computing resources (CPU/RAM/etc.). Hence, to programs coding against the Java API, they only see 1 single set of resources, and don't have to worry about how/where the code is deployed.
So for one, I could be fundamentally wrong in my understanding here (in which case, please correct me!). But if I'm on target, then how does the Java API (provided by Mesos) allow Java clients to tap into these resources?!? Can someone give a concrete example of Mesos in action?
Update
Take a look at my awful drawing below. If I understand the Mesos architecture correctly, we have a cluster of 3 physical servers (phys01, phys02 and phys03). Each of these physicals is running an Ubuntu host (or whatever). Through a hypervisor, say, Xen, we can run 1+ VMs.
I am interested in Docker & CoreOS, so I'll use those in this example, but I'm guessing the same could apply to other non-container setups.
So on each VM we have CoreOS. Running on each CoreOS instance is a Mesos executable/server. All Mesos nodes in a cluster see everything underneath them as a single pool of resources, and artifacts can be arbitrarily deployed to the Mesos cluster and Mesos will figure out which CoreOS instance to actually deploy them to.
Running on top of Mesos is a "Mesos framework" such as Marathon or Kubernetes. Running inside Kubernetes are various Docker containers (C1 - C4).
Is this understanding of Mesos more or less correct?
Your summary is almost right, but it does not reflect the essence of what Mesos represents. The vision of Mesosphere, the company behind the project, is to create a "Datacenter Operating System", with Mesos as its kernel, analogous to the kernel of a normal OS.
The API is not limited to Java, you can use C, C++, Java/Scala, or Python.
Once you have set up your Mesos cluster as you describe in your question and want to use its resources, you usually do this through a framework instead of running your workload directly on it. This doesn't mean it is complicated: here is a very small example in Scala which demonstrates this. Frameworks exist for multiple popular distributed data processing systems such as Apache Spark and Apache Cassandra. There are other frameworks such as Chronos, a cron at data center level, or Marathon, which allows you to run Docker-based applications.
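To make the division of labour concrete: Mesos offers the free resources of its nodes to a framework, and the framework's scheduler decides which offers to use for which tasks. The following is a toy model of that idea in plain Java. All of the classes (`Offer`, `Job`, `ToyScheduler`) are hypothetical and are not the Mesos API; a real framework would implement the scheduler interface from the Mesos libraries instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, NOT the Mesos API.
// An offer describes the free resources on one node.
class Offer {
    final String node;
    final double cpus;
    final int memMb;

    Offer(String node, double cpus, int memMb) {
        this.node = node;
        this.cpus = cpus;
        this.memMb = memMb;
    }
}

// A job states the resources it needs in order to run.
class Job {
    final String name;
    final double cpus;
    final int memMb;

    Job(String name, double cpus, int memMb) {
        this.name = name;
        this.cpus = cpus;
        this.memMb = memMb;
    }
}

class ToyScheduler {
    // Assigns each job to the first offer that still has enough CPU and
    // memory left. Jobs that fit nowhere are left out of the result,
    // i.e. they stay pending until more resources are offered.
    static Map<String, String> place(List<Offer> offers, List<Job> jobs) {
        double[] cpusLeft = new double[offers.size()];
        int[] memLeft = new int[offers.size()];
        for (int i = 0; i < offers.size(); i++) {
            cpusLeft[i] = offers.get(i).cpus;
            memLeft[i] = offers.get(i).memMb;
        }
        Map<String, String> placement = new LinkedHashMap<>();
        for (Job job : jobs) {
            for (int i = 0; i < offers.size(); i++) {
                if (job.cpus <= cpusLeft[i] && job.memMb <= memLeft[i]) {
                    cpusLeft[i] -= job.cpus;
                    memLeft[i] -= job.memMb;
                    placement.put(job.name, offers.get(i).node);
                    break;
                }
            }
        }
        return placement;
    }
}
```

The code that submits jobs never names a machine; it only states resource needs, which is the sense in which the cluster looks like a single pool.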
Update:
Yes, Mesos will take care of placement in the cluster, as that's what a kernel does: scheduling and management of limited resources. The setup you have sketched raises several obvious questions, however.
Layers below mesos:
Installing Mesos on CoreOS is possible but, I think, cumbersome. This is not a typical scenario for running Mesos; usually it is moved to the lowest possible layer (directly on top of Ubuntu in your case). So I hope you have good reasons for running CoreOS and a hypervisor.
Layers above mesos:
Kubernetes is available as a framework, and Mesosphere seems to be putting a lot of effort into it. There is, however, no question that the two partly overlap in functionality, especially with regard to scheduling. If you want to schedule basic container-based workloads, you might be better off with Marathon or, in the future, maybe Aurora. So here too I hope you have good reasons for this particular arrangement.
Sidenote: Kubernetes is similar to Marathon, but takes a broader approach and is pretty opinionated.
What are the recommended strategies for building a Java application that will run on the desktop, not in a browser? The characteristics of the application would be:
1. Multiple application instances would be running on different machines
2. Applications must communicate in real time (if one user makes changes, the data in the other applications must be refreshed)
Do you perhaps want to create a networking application, based on sockets and so on? I implemented that scenario some time ago and am working on something similar for my job. It is not complex at all; I will answer according to the two issues that concern you.
Multiple application instances would be run on different machines.
If you are going to install an instance of the application on people's desktops, I'd suggest being very careful with paths: do not hard-code any path, since resource loading will be dynamic.
Check carefully which network architecture your application will be installed in. Maybe it is just a LAN, or maybe it will run in a big network accessed through a VPN, etc.
Once you are sure your application works on different machines without any path or resource-loading conflicts, you can build your jar using Maven, Ant, etc.
Also, if you want to go further, you can create an installer with any installer-creation tool and provide a launcher: an executable or batch file (.exe or .bat) for Windows, or a shell script (.sh) for Linux distributions. But these are only suggestions for the installation stage.
On the other hand, if you want to run the application as a Java desktop application but launch it from a URL, take a look at JNLP.
Applications must communicate in real time (if one user makes changes, the others will be able to see them)
If you want to do that, you will need, for sure, a server to provide and store information. The server can be a physical machine set up in the office or a remote one.
You have two options here:
Use Java networking: create an application that works as a server that provides and saves the information (it must handle concurrency, since many people will perform transactions or queries against it). Look at how to create a basic client-server application using sockets to understand how it works; after that you should have no trouble adding the complexity your environment demands.
Alternatively, develop a REST-based Java application and have your client application connect to the machine (or machines, if you plan to implement load balancing) and consume those REST services. You can take a look at the Jersey libraries to implement this. Make sure to add security to these web services and restrict server access to the network in which your application instances will run.
Well, that's what I can tell you about the scenario you are trying to implement, based on what I've done and what I'm doing now.
If you need further information, reply in the comments and I will be glad to help.
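For the first option, a basic client-server exchange over TCP sockets can be sketched as follows. This is only a starting point under stated assumptions: the class name `EchoDemo`, the `ACK:` reply format and the `price=42` message are made up for illustration, and the server handles a single client, whereas a real server would accept many clients concurrently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class EchoDemo {
    // Starts a thread that accepts ONE client and answers every line it
    // receives with "ACK: <line>" until the client disconnects.
    static Thread serveOnce(ServerSocket server) {
        Thread t = new Thread(() -> {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(new OutputStreamWriter(
                         client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.println("ACK: " + line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        t.start();
        return t;
    }

    // Client side: sends one line and returns the server's one-line reply.
    static String send(String host, int port, String message) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                     socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(message);
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        // Port 0 asks the OS for any free port.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread serverThread = serveOnce(server);
            System.out.println(send("127.0.0.1", server.getLocalPort(), "price=42"));
            serverThread.join();
        }
    }
}
```

To refresh other clients when one user makes a change, the server would additionally keep the open connections and write the update to each of them; that part is left out here.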
Regards and happy coding :)
You want to look into using sockets, TCP or UDP, and also figure out whether you want a central authoritative server (what if two users change the same thing in different ways, whose data is saved?).
Read this article from Oracle: Java Custom Networking
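One possible answer to the "whose data is saved?" question is to let the central server keep a version number per item and reject writes that were based on a stale version, so the first write wins and the second user has to reload. A minimal sketch, assuming a hypothetical in-memory `VersionedStore` class (other policies, such as last write wins or merging, are equally valid):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical authoritative store: every key carries a version number
// that increases by one on each accepted write.
class VersionedStore {
    private static final class Entry {
        final String value;
        final long version;

        Entry(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Entry> data = new HashMap<>();

    // Returns the current version of the key, or 0 if it was never written.
    synchronized long version(String key) {
        Entry e = data.get(key);
        return e == null ? 0 : e.version;
    }

    // Returns the current value of the key, or null if it was never written.
    synchronized String get(String key) {
        Entry e = data.get(key);
        return e == null ? null : e.value;
    }

    // Applies the write only if the caller last saw the current version.
    // Returns false for a stale write, which leaves the data untouched.
    synchronized boolean update(String key, String value, long expectedVersion) {
        long current = version(key);
        if (expectedVersion != current) {
            return false;
        }
        data.put(key, new Entry(value, current + 1));
        return true;
    }
}
```

If two users both read version 0 and then write, the first `update` succeeds and the second returns `false`; the second client can then fetch the new value and retry.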
I need to develop a Java platform to download and process information from Twitter. The basic idea is to have a centralized controller that generates tasks (basically an id and keywords) and sends these tasks to remote workers (one per computer). I need to receive a status report periodically to know the status of both the task and the worker. I'll have at least 60 workers (ten times more in the near future).
My initial idea was to use RMI, but I need to communicate in both directions and I don't feel comfortable with RMI. The other approach was to use SSLSockets to send serialized objects, but I would have to handle a lot of errors and add a lot of code to monitor tasks and workers. Some people told me to use a framework like Spring Batch, GigaSpaces or Quartz.
What do you think would be the best option for this project? So far I've read a lot of good things about GigaSpaces, but I can't find a good tutorial on how to implement it, and Quartz seems promising. What do you think? Is it worth using any of them?
It's not easy to tell you to go for a technology based on your question. GigaSpaces is certainly up to the job but so is Spring Batch. Quartz is just the scheduling part of your question and not so much the remoting and the distribution of workload.
GigaSpaces is a fully fledged application platform for scenarios where parallelism, high throughput and scalability are factors. Spring Batch can definitely also do the job but, unlike GigaSpaces, it is not an application platform, so you would still need to deploy your application somewhere.
However, GigaSpaces is a commercial product (a free version is available). Other frameworks that can help you, such as Storm (http://storm-project.net/) and Hazelcast (www.hazelcast.com), also come to mind.
So without clarifying your use case it's hard to give a single answer. It all depends on what exactly you want and how you want to use it, now and in the future.
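Whichever framework is chosen, the controller needs some bookkeeping of task status. The sketch below models only that bookkeeping inside a single JVM, with a thread pool standing in for the remote workers; the `Controller` class and its `Status` values are hypothetical, and the remoting layer (sockets, RMI, or a framework) is deliberately left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical controller: a thread pool stands in for remote workers.
class Controller {
    enum Status { QUEUED, RUNNING, DONE, FAILED }

    // Latest known status per task id; safe to read while workers update it.
    private final Map<Integer, Status> statuses = new ConcurrentHashMap<>();
    private final ExecutorService workers;

    Controller(int workerCount) {
        this.workers = Executors.newFixedThreadPool(workerCount);
    }

    // Queues a task and records every status change it goes through.
    void submit(int taskId, Runnable work) {
        statuses.put(taskId, Status.QUEUED);
        workers.submit(() -> {
            statuses.put(taskId, Status.RUNNING);
            try {
                work.run();
                statuses.put(taskId, Status.DONE);
            } catch (RuntimeException e) {
                statuses.put(taskId, Status.FAILED);
            }
        });
    }

    // What a periodic status report would read.
    Status statusOf(int taskId) {
        return statuses.get(taskId);
    }

    // Stops accepting tasks and waits for the running ones to finish.
    boolean shutdownAndWait(long seconds) throws InterruptedException {
        workers.shutdown();
        return workers.awaitTermination(seconds, TimeUnit.SECONDS);
    }
}
```

In a distributed version the `statuses.put` calls would become messages from the worker to the controller, and a missing message within some timeout would mark the worker itself as suspect.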
Given information about the machines in a cluster (IP address/machine name) and a Java program to run, is there a software manager available that would execute this program and return the output along with the runtime on each of the machines?
Currently I am using a shell script to do this, but I couldn't get the time taken (in seconds) to run the Java program back. It would be good if there were some distributed program execution manager like the one I described above.
Instead of writing your own script, you could simply use something like tentakel or shmux to run your application in parallel on multiple nodes. You can run tentakel as
tentakel 'time <your application name>'
to get the output and the time it takes for the application to run.
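If the Java program can be modified, another option is to measure the elapsed time inside the program and print it together with the output, so that whatever tool collects the output also collects the timing. A minimal sketch, assuming a hypothetical helper class `TimedRun`; the sleep stands in for the real computation:

```java
class TimedRun {
    // Runs the task and returns the elapsed wall-clock time in seconds.
    static double secondsTaken(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        double seconds = secondsTaken(() -> {
            try {
                Thread.sleep(200); // placeholder for the real work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Printed on stdout so it travels with the rest of the output.
        System.out.printf("elapsed: %.3f s%n", seconds);
    }
}
```

Unlike `time`, this excludes JVM startup, which may or may not be what you want to measure.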
I like to use Hudson for stuff like that. It was originally written for performing software builds and tests, but is more generic than that. It is basically a controller for managing jobs and executions, along with a client to deploy on nodes. Hadoop is another option if you have the flexibility to rewrite your app for a specific distributed computing framework.
I don't understand your question very well. What "runtime" do you want to get back? What clustering solution are you using? For distributed communication in Java I would recommend JGroups. For a distributed JVM, check Terracotta.
I have encountered many different ways to turn a Java program into a Windows Service or a *nix daemon, such as Java Service Wrapper, Apache Commons Daemon, and so on. Barring licensing concerns (such as JSW's GPL or pay dual-license), and more advanced features, which one would you recommend? All I intend to do is convert a simple Java program into a service; I don't need anything fancy, just something that runs as a service or a daemon, so I can start it or stop it in the service manager, or it runs for the lifetime of my *nix uptime.
EDIT:
I've decided to make this community wiki. I didn't start this question with an intention to find an answer for a problem I really had. I was just doing some reading and researching and chanced upon this question, so I was looking for recommendations and the like. Sorry for not doing this sooner or doing this at first. I didn't know what community wiki was for when I first started, and I completely forgot about this question until now. Many thanks for the answers!
I've used JavaService for years and have been very happy with it. Very simple.
That said, we're switching to JSW for the next major release - its multi-platform support is awesome. Also, having all of the params in a .conf file vs the Windows registry is a major plus. But if you're only looking at Windows, JavaService might be a good way to go. (no experience with Apache Commons Daemon)
On Unix, I tried and quite liked daemontools when I set up a VPS to run Tomcat instances.
Using daemontools, I could write a fairly simple start script and have the Tomcat process run as part of my regular system startup routines. I was running several different Tomcats under different user IDs, to support private JVMs for a couple of sites.
Of course, this is all possible with a SysV style init script that runs jsvc, but having tried the former I found it much easier to set up the daemontools alternative. Also, I was using daemontools across the board for a VPS to try to reduce resource usage as much as possible. The biggest downside to daemontools was I couldn't find a way to indicate a dependency between services easily, but it hasn't caused problems in the end as nothing falls over just because it takes a few extra seconds for the database to start.