I am trying to wrap my head around Apache Mesos and need clarification on a few items.
My understanding of Mesos is that it is an executable that gets installed on every physical/VM server ("node") in a cluster, and then provides a Java API (somehow) that treats the individual nodes as one collective pool of computing resources (CPU/RAM/etc.). Hence, programs coded against the Java API see only a single set of resources, and don't have to worry about how/where the code is deployed.
So for one, I could be fundamentally wrong in my understanding here (in which case, please correct me!). But if I'm on target, then how does the Java API (provided by Mesos) allow Java clients to tap into these resources?!? Can someone give a concrete example of Mesos in action?
Update
Take a look at my awful drawing below. If I understand the Mesos architecture correctly, we have a cluster of 3 physical servers (phys01, phys02 and phys03). Each of these physical servers runs an Ubuntu host (or whatever). Through a hypervisor, say, Xen, we can run 1+ VMs.
I am interested in Docker & CoreOS, so I'll use those in this example, but I'm guessing the same could apply to other non-container setups.
So on each VM we have CoreOS. Running on each CoreOS instance is a Mesos executable/server. All Mesos nodes in a cluster see everything underneath them as a single pool of resources, and artifacts can be arbitrarily deployed to the Mesos cluster and Mesos will figure out which CoreOS instance to actually deploy them to.
Running on top of Mesos is a "Mesos framework" such as Marathon or Kubernetes. Running inside Kubernetes are various Docker containers (C1 - C4).
Is this understanding of Mesos more or less correct?
Your summary is almost right, but it does not reflect the essence of what Mesos represents. The vision of Mesosphere, the company behind the project, is to create a "Datacenter Operating System", and Mesos is its kernel, analogous to the kernel of a normal OS.
The API is not limited to Java, you can use C, C++, Java/Scala, or Python.
If you have set up your Mesos cluster as you describe in your question and want to use your resources, you usually do this through a framework instead of running your workload on it directly. This doesn't mean that it is complicated: here is a very small example in Scala which demonstrates this. Frameworks exist for several popular distributed data processing systems such as Apache Spark and Apache Cassandra. There are other frameworks such as Chronos, a cron at datacenter level, or Marathon, which allows you to run Docker-based applications.
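To make the resource-offer cycle concrete, here is a toy model in plain Java. It is not the real org.apache.mesos API; the class names and node names are made up for illustration. The idea it sketches is Mesos's two-level scheduling: the master offers each node's spare resources to a framework scheduler, which decides where a task fits.

```java
import java.util.List;
import java.util.Optional;

// Toy model of Mesos's two-level scheduling: the master offers each
// node's free resources to a framework scheduler, which accepts or
// declines. Illustrative only -- not the real org.apache.mesos API.
public class OfferModel {
    record Offer(String node, double cpus, double memGb) {}

    // A "framework" that accepts the first offer big enough for a task.
    static Optional<String> placeTask(List<Offer> offers, double cpus, double memGb) {
        return offers.stream()
                .filter(o -> o.cpus() >= cpus && o.memGb() >= memGb)
                .map(Offer::node)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Offer> offers = List.of(
                new Offer("coreos-1", 0.5, 1.0),
                new Offer("coreos-2", 4.0, 8.0));
        // A task needing 2 CPUs / 4 GB fits only the second node.
        System.out.println(placeTask(offers, 2.0, 4.0).orElse("unplaced")); // coreos-2
    }
}
```

A real framework implements this decision inside the scheduler callback that Mesos invokes with offers; declined resources are re-offered to other frameworks.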
Update:
Yes, Mesos will take care of placement in the cluster, as that's what a kernel does -- scheduling and management of limited resources. The setup you have sketched raises several obvious questions, however.
Layers below Mesos:
Installing Mesos on CoreOS is possible but, I think, cumbersome. This is not a typical scenario for running Mesos -- usually it is moved to the lowest possible layer (above Ubuntu in your case). So I hope you have good reasons for running CoreOS and a hypervisor.
Layers above Mesos:
Kubernetes is available as a framework, and Mesosphere seems to put much effort into it. It is, however, beyond question that the two partly overlap in terms of functionality -- especially with regard to scheduling. If you want to schedule basic container-based workloads, you might be better off with Marathon or, in the future, maybe Aurora. So here too I hope you have good reasons for this particular arrangement.
Sidenote: Kubernetes is similar to Marathon, but with a broader approach, and it is pretty opinionated.
Related
I have a project which briefly is as follows: Create an application that can accept tasks written in Java that perform some kind of computation and run the tasks on multiple machines* (the tasks are separate and have no dependency on one another).
*the machines could be running different OSs (mainly Windows and Ubuntu)
My question is, should I be using a distributed system like Apache Mesos for this?
The first thing I looked into was Java P2P libraries/ frameworks and the only one I could find was JXTA (https://jxta.kenai.com/) which has been abandoned by Oracle.
Then I looked into Apache Mesos (http://mesos.apache.org/), which seems to me like a good fit: an underlying system that can run on multiple machines and lets them share resources while processing tasks. I have spent a little while trying to get it running locally as an example; however, it seems rather complicated and takes a long time to get working.
If I should use Mesos, would I then have to develop a Framework for my project that takes all of my java tasks or are there existing solutions out there?
To test it on a small scale locally, would you install it on your machine, set that up as a master, create a VM, install it on that and make that a slave, somehow pointing the slave at the master? The documentation and examples don't show exactly how to hook up a slave on the network to a master.
Thanks in advance, any help or suggestions would be appreciated.
You can definitely use Mesos for the task that you have described. You do not need to develop a framework from scratch, instead you can use a scheduler like Marathon in case you have long running tasks, or Chronos for one-off or recurring tasks.
For a real-life setup you definitely would want to have more than one machine, but you might as well just run everything (Mesos master, Mesos slave, and the frameworks) off of a single machine if you're only interested in experimenting. The Examples section of the Mesos Getting Started Guide demonstrates how to do that.
I am creating a (semi) big data analysis app. I am utilizing Apache Mahout. I am concerned that with Java I am limited to 4 GB of memory. This 4 GB limitation seems somewhat wasteful given the memory modern computers have at their disposal. As a solution, I am considering using something like RMI or some form of MapReduce. (I, as of yet, have no experience with either.)
First off: is it plausible to have multiple JVMs running on one machine and have them talk to each other? And if so, am I heading in the right direction with the two ideas alluded to above?
Furthermore,
In attempt to keep this an objective question, I will avoid asking "Which is better" and instead will ask:
1) What are the key differences (not necessarily in how they work internally, but in how they would be used by me, the user)?
2) Are there drawbacks or benefits to one or the other and are there certain situations where one or the other is used?
3) Is there another alternative that is more specific to my needs?
Thanks in advance
First, re the 4 GB limit, check out Understanding max JVM heap size - 32-bit vs 64-bit. On a 32-bit system, 4 GB is the maximum, but on a 64-bit system the limit is much higher.
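You can check the effective ceiling on your own JVM directly; the `-Xmx` flag is what actually sets it:

```java
// Prints the maximum heap the running JVM will attempt to use.
// Launch with e.g. `java -Xmx8g MaxHeap` on a 64-bit JVM to go
// well past the ~4 GB ceiling of 32-bit JVMs.
public class MaxHeap {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("max heap: %.1f GB%n", maxBytes / 1e9);
    }
}
```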
It is a common configuration to have multiple JVMs running and communicating on the same machine. Two good examples are IBM WebSphere and Oracle's WebLogic application servers. They run the administrative console in one JVM, and it is not unusual to have three or more "working" JVMs under its control.
This allows each JVM to fail independently without impacting overall system responsiveness. Recovery is transparent to end users because some of the "working" JVMs are still doing their thing while the support team is frantically trying to fix things.
You mentioned both RMI and MapReduce, but in a manner that implies that they fill the same slot in the architecture (communication). I think that it is necessary to point out that they fill different slots - RMI is a communications mechanism, but MapReduce is a workload management strategy. The MapReduce environment as a whole typically depends on having a (any) communication mechanism, but is not one itself.
For the communications layer, some of your choices are RMI, web services, bare sockets, MQ, shared files, and the infamous "sneaker net". To a large extent I recommend shying away from RMI because it is relatively brittle. It works as long as nothing unexpected happens, but in a busy production environment it can present challenges at unexpected times. With that said, there are many stable and performant large-scale systems built around RMI.
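For reference, the basic RMI mechanics fit in a few lines. This sketch runs both ends in a single JVM just to show the round trip; in production the client and server would run in separate JVMs or on separate machines. The port (1099) and the service name ("echo") are arbitrary choices for this example.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Minimal single-JVM RMI round trip: export a remote object, bind it
// in a local registry, look it up, and invoke it through the stub.
public class RmiDemo {
    public interface Echo extends Remote {
        String echo(String s) throws RemoteException;
    }

    public static class EchoImpl implements Echo {
        public String echo(String s) { return "echo: " + s; }
    }

    public static void main(String[] args) throws Exception {
        EchoImpl server = new EchoImpl();
        Echo stub = (Echo) UnicastRemoteObject.exportObject(server, 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("echo", stub);

        Echo client = (Echo) registry.lookup("echo");
        System.out.println(client.echo("hello")); // echo: hello

        UnicastRemoteObject.unexportObject(server, true); // let the JVM exit
    }
}
```

The brittleness mentioned above shows up in the parts this sketch glosses over: firewalls, stub/class versioning between JVMs, and failure handling when the remote end goes away mid-call.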
The direction the world is going this week for cross-tier communication is SOA on top of something like Spring Integration or Fuse. SOA abstracts the mechanics of communication out of the equation, allowing you to hook things up on the fly (more or less).
MapReduce (MR) is a way of organizing batched work. The MR algorithm itself is essentially: turn the input data into a bunch of maps on input, then reduce them to the minimum amount necessary to produce an output. The MR environment is typically governed by a workload manager which receives jobs and parcels out the work in the jobs to its "worker bees" scattered around the network. The communications mechanism may be defined by the MR library, or by the container(s) it runs in.
Does this help?
I am an engineering final-year student. I am doing a project in cloud computing. I have a confident idea about the concept, but I don't know how to simulate the concept in the cloud. At PG student level, which cloud computing simulation environment is easy to use? Kindly give your valuable suggestions. (I am currently implementing the concept in Java.)
Try taking a look at OpenShift; it's free and very easy to use if you're familiar with Unix/Git. I host my blog there on a Java/Unix/MySQL stack and have been very satisfied.
Firstly, I recommend that you understand the difference between an IaaS and a PaaS. Wikipedia is always a good place to find this information. Maybe you could compare both cloud computing models.
You will see that on a PaaS it is much easier to start with a service, since you don't need to install or configure anything. Usually you just need a button to make a specific service available, and not many steps to deploy your application.
You should look at the "How to start" guides of different PaaS providers. You can start with this How to start tutorial, and after that look for similar guides and compare the most important providers. You will see that it is really easy to start working in this cloud model.
Agreed: PaaS might be a good starting point. I don't have any experience with Java, though; from a quick Google search, http://www.cloudbees.com/ might be something.
If you want to go a bit deeper, you should try out Amazon's EC2. I believe they have done a very good job, plus they offer a free tier for one year.
If you want to build cloud computing simulations in Java, take a look at CloudSim Plus. It is a modern, full-featured, highly extensible and easier-to-use Java 8 Framework for Modeling and Simulation of Cloud Computing Infrastructures and Services.
It is an actively maintained, totally re-designed, better organized and largely documented project. It has a large number of exclusive features and is the only cloud simulation framework available at maven central.
Some of its main characteristics and features include:
Vertical VM scaling, which performs on-demand up and down allocation of VM resources such as RAM, bandwidth and PEs (CPUs).
Horizontal VM scaling, allowing dynamic creation of VMs according to an overload condition. Such a condition is defined by a predicate that can check usage of different VM resources such as CPU, RAM or BW.
Parallel execution of simulations, allowing several simulations to be run simultaneously, in an isolated way, on a multi-core computer.
Listeners to enable simulation monitoring.
Classes and interfaces to allow implementation of heuristics such as Tabu Search, Simulated Annealing, Ant Colony Systems and so on. See an example using Simulated Annealing here.
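The overload condition used for horizontal scaling is, at bottom, just a boolean test over resource usage. A toy illustration in plain Java (the `Vm` record here is a stand-in, not the real CloudSim Plus `Vm` class, and the 70% threshold is an arbitrary choice):

```java
import java.util.function.Predicate;

// Toy illustration of the kind of overload predicate that triggers
// horizontal VM scaling. The Vm record is a stand-in for illustration,
// not the real CloudSim Plus Vm class.
public class OverloadCheck {
    record Vm(double cpuUsage, double ramUsage) {} // usage as fractions 0..1

    // Scale out when CPU or RAM utilization crosses 70%.
    static final Predicate<Vm> OVERLOADED =
            vm -> vm.cpuUsage() > 0.7 || vm.ramUsage() > 0.7;

    public static void main(String[] args) {
        System.out.println(OVERLOADED.test(new Vm(0.9, 0.4))); // true
        System.out.println(OVERLOADED.test(new Vm(0.3, 0.2))); // false
    }
}
```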
I was wondering if there is a precedent for using PID controller type mechanisms to manage computation resources (see http://en.wikipedia.org/wiki/PID_controller).
By computational resources I mean:
Spare Threads, Spare Processes, Queue Lengths, etc.
For example, in apache.conf you can specify the number of spare servers, minimum servers, etc.
The question I have is how you control the spawning of new servers or the contraction of your resource pool.
The same could apply to spawning nodes on, say, an Amazon grid if your load increases beyond some level.
As a response to this question I am interested in:
Whether there is a yes, no, or maybe answer to this question
If there are accessible examples of where this is used in the open source world
If there are libraries that implement PID control in java, python, etc. for this purpose.
Thanks.
According to this research article, the thread pool in the .NET framework seems to have one. I also found articles on load balancing Apache web servers using autonomous control, controlling memory footprint in DB2, etc.
The code here is a Java implementation used in an open source project.
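For the flavor of it, here is a minimal PID controller sketch that sizes a worker pool from queue length. The gains are illustrative and untuned, and a production controller would also need anti-windup and output clamping; this only shows the basic proportional/integral/derivative bookkeeping.

```java
// Minimal PID controller sketch sizing a worker pool from queue length.
// Gains (kp, ki, kd) are illustrative, not tuned; real deployments need
// anti-windup and output limits. Setpoint: keep the work queue near zero.
public class PoolPid {
    final double kp, ki, kd;
    double integral, lastError;

    PoolPid(double kp, double ki, double kd) {
        this.kp = kp; this.ki = ki; this.kd = kd;
    }

    // error = observed queue length - desired queue length
    int adjustment(double error, double dtSeconds) {
        integral += error * dtSeconds;
        double derivative = (error - lastError) / dtSeconds;
        lastError = error;
        return (int) Math.round(kp * error + ki * integral + kd * derivative);
    }

    public static void main(String[] args) {
        PoolPid pid = new PoolPid(0.5, 0.1, 0.05);
        int pool = 4;
        for (double queueLength : new double[]{20, 12, 6, 2}) { // queue draining
            pool = Math.max(1, pool + pid.adjustment(queueLength, 1.0));
            System.out.println("pool size -> " + pool);
        }
    }
}
```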
I'm using JSP + Struts2 + Tomcat 6 + Hibernate + MySQL as my J2EE development environment. The first phase of the project has finished and it's up and running on a single server. Due to the growing scale of the website, it's predicted that we're going to face some performance issues in the future.
So we want to distribute the application across several servers. What are my options here?
Before optimizing anything, you should find out where your bottleneck is (services, database, ...). If you do not do this, the optimization will be a waste of time and money.
The right optimization then depends on your use case. For example, if you have a read-only application and the bottleneck is both the Java server and the database, then you can set up two database servers and two Java servers.
Hardware is very important too. Maybe the easiest way is to upgrade the hardware, but this will only work if the hardware is the bottleneck.
You can use any J2EE application server that supports clustering (e.g. WebLogic, WebSphere, JBoss, Tomcat). You are already using Tomcat so you may want use their clustering solution. Note that each offering provides different levels of clustering support so you should do some research before picking a particular app server (make sure it is the right clustering solution for your needs).
Also, porting code from a standalone to a cluster environment often requires a non-negligible amount of development work. Among many other things, you'll need to make sure that your application doesn't rely on any local files on the file system (a bad J2EE practice anyway), and that state (HTTP sessions or stateful EJBs, if any) gets properly propagated to all nodes in your cluster. As a general rule: the more stateless, the smoother the transition to a cluster environment.
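One concrete consequence of session propagation: anything you put in an HTTP session must be serializable so the container can replicate it to other nodes. A quick way to catch violations early is a serialization round trip in plain Java, mimicking what the container does (the `Cart` class here is a made-up example of a session attribute):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Session attributes must survive serialization for a container to
// replicate them across cluster nodes. This round trip mimics what
// the container does; Cart is a made-up example attribute class.
public class SessionCheck {
    static class Cart implements Serializable {
        int items;
        Cart(int items) { this.items = items; }
    }

    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(obj);
            ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()));
            return (T) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException("not replicable: " + obj, e);
        }
    }

    public static void main(String[] args) {
        Cart copy = roundTrip(new Cart(3));
        System.out.println("items after round trip: " + copy.items); // 3
    }
}
```

If an attribute holds a non-serializable field (a database connection, a thread, etc.), this round trip fails, and so would session replication in the cluster.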
As you are using Tomcat, I'd recommend taking a look at mod_cluster. But I suggest you consider a real application server, like JBoss AS. Also, make sure to run some performance tests and understand where the bottleneck of your application is. Throwing more application servers at it is ineffective if, for instance, the bottleneck is the database.