Is there a way to package Storm and Cassandra into a single executable jar, so that running the jar deploys a single-node Storm and Cassandra that serve the program?
Thanks.
I think you are somewhat confused about how the Storm architecture works, unless you really are planning to run Storm in local mode, in which case I'm not sure why you would bother if you have a Cassandra cluster: local mode is only meant for testing and in many cases will perform worse than an actual cluster. You can get better performance locally by simply writing multithreaded code, without pulling in all of the Storm machinery, which is really intended to aid robust stream processing over a possibly unreliable cluster.
To me it sounds like what you probably really mean to do is have each Cassandra node also be a Storm supervisor node running one (or more) workers. You will also need a Nimbus server and a Zookeeper cluster somewhere to make the whole thing go.
Given all this, I suppose it's theoretically possible to have it all in one jar, but that seems like more trouble than it's worth. Cassandra nodes and Storm supervisors are already dead simple to set up, and there's no reason they can't run together on the same server, so I would recommend against it.
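If you do want to see what the single-jar approach would look like, here is a rough, hedged sketch that starts an embedded Cassandra node and a local-mode Storm topology in one JVM. The `cassandra.yaml` path and topology name are made up for illustration, the `org.apache.storm` package names assume Storm 1.x/2.x, and `CassandraDaemon.activate()` assumes Cassandra's server jar is on the classpath:

```java
import org.apache.cassandra.service.CassandraDaemon;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class SingleJvmDemo {
    public static void main(String[] args) throws Exception {
        // Point Cassandra at a config file; this path is hypothetical.
        System.setProperty("cassandra.config", "file:conf/cassandra.yaml");
        CassandraDaemon cassandra = new CassandraDaemon();
        cassandra.activate();   // starts a single embedded Cassandra node in this JVM

        TopologyBuilder builder = new TopologyBuilder();
        // ... register your spouts and bolts here ...

        // LocalCluster runs Nimbus, a supervisor and Zookeeper in-process --
        // fine for testing, but not what you want in production.
        LocalCluster storm = new LocalCluster();
        storm.submitTopology("demo-topology", new Config(), builder.createTopology());
    }
}
```

Again, this is the testing-oriented setup described above, not a substitute for real Cassandra and Storm clusters.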
Further, I'm not clear on your use case, but it's hard to imagine a situation where you'd actually want to do this. The only things that come to mind are either (a) your Cassandra workload is very disk heavy with no real computation happening on the nodes or (b) you have over-provisioned physical hardware that you are looking to take up slack capacity on. Otherwise I think you'd almost certainly be better off with separate machines for Storm and Cassandra.
Related
I was wondering if there is any way to simulate workers on my local machine. I know Spark can be set up locally and everything can be fully tested that way, but what I really want is to simulate multiple workers, so I can see how work is repartitioned, how load is distributed, and how the DAG behaves.
I can also think of other ways this would help me, for example debugging and tracing data transformations. What I want is to develop my program in a testing fashion that finds an optimal approach; I don't want to rely only on the big theory behind shuffling and expensive operations. Or are we doomed to work this out by trial and error on real clusters? Thanks!
Yes, this is possible; please have a look at this link: https://sparkbyexamples.com/spark/apache-spark-installation-on-windows/
Also, if you set up the Spark history server on your local machine, you can trace transformations, the DAG, etc.
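To make that concrete, here is a minimal sketch (in Java, assuming a standard Spark dependency) of running with the `local[N]` master, which simulates N worker threads in one JVM, together with the event-log settings the history server reads. The app name, log directory, and data are illustrative:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSimulation {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("local-simulation")
                .setMaster("local[4]")                           // 4 worker threads in this JVM
                .set("spark.eventLog.enabled", "true")           // write event logs ...
                .set("spark.eventLog.dir", "/tmp/spark-events"); // ... where the history server can read them

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // 8 partitions, so you can watch how work is split across the 4 threads
            JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 8);
            long evens = data.filter(x -> x % 2 == 0).count();
            System.out.println("even numbers: " + evens);
        }
    }
}
```

If you need separate executor processes rather than threads, Spark also has a `local-cluster[N, cores, memoryMB]` master, though it is mostly used by Spark's own test suite and is not really documented for end users.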
I'm just starting to test Hazelcast Jet on our cluster. Starting an instance isn't too difficult. BUT: doing so requires a full (albeit small) jar to be executed, including a full-sized (and probably not optimized) JVM.
Are there alternatives? Like pre-optimized standalone "only start this instance to use a cluster node" packages?
My googling hasn't turned up any results, so maybe someone else has had more success?
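For reference, "starting an instance" programmatically is only a couple of lines; here is a sketch against the standalone Jet 3.x/4.x API (in newer Hazelcast releases Jet ships inside the main Hazelcast distribution instead, so the entry point differs):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;

public class JetMember {
    public static void main(String[] args) {
        // Starts a Jet member in this JVM; it discovers and joins other
        // members according to the cluster/network configuration.
        JetInstance jet = Jet.newJetInstance();
        System.out.println("Members in cluster: "
                + jet.getHazelcastInstance().getCluster().getMembers().size());
    }
}
```

The JVM itself is still required either way; the code above only shows how thin the member-startup layer is.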
I have a project which briefly is as follows: Create an application that can accept tasks written in Java that perform some kind of computation and run the tasks on multiple machines* (the tasks are separate and have no dependency on one another).
*the machines could be running different OSs (mainly Windows and Ubuntu)
My question is, should I be using a distributed system like Apache Mesos for this?
The first thing I looked into was Java P2P libraries/frameworks, and the only one I could find was JXTA (https://jxta.kenai.com/), which has been abandoned by Oracle.
Then I looked into Apache Mesos (http://mesos.apache.org/), which seems to me like a good fit: an underlying system that can run on multiple machines and lets them share resources while processing tasks. I have spent a little while trying to get it running locally as an example; however, it seems slightly complicated and takes forever to get working.
If I should use Mesos, would I then have to develop a Framework for my project that takes all of my java tasks or are there existing solutions out there?
To test it on a small scale locally, would you install it on your machine, set that as the master, create a VM, install it on that, and make that a slave, somehow routing your slave to that master? The documentation and examples don't show exactly how to hook up a slave on the network to a master.
Thanks in advance, any help or suggestions would be appreciated.
You can definitely use Mesos for the task you have described. You do not need to develop a framework from scratch; instead, you can use a scheduler like Marathon for long-running tasks, or Chronos for one-off or recurring tasks.
For a real-life setup you would definitely want more than one machine, but you might as well run everything (Mesos master, Mesos slave, and the frameworks) off a single machine if you're only interested in experimenting. The Examples section of the Mesos Getting Started Guide demonstrates how to do that.
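As a hedged sketch of what using Marathon looks like from Java: you describe the task as an app definition and POST it to Marathon's `/v2/apps` REST endpoint. The host, port, jar path, and resource numbers below are assumptions for illustration:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitToMarathon {
    public static void main(String[] args) throws Exception {
        // Hypothetical app definition: run a jar with 0.5 CPUs and 256 MB, 3 instances.
        String appJson = "{"
                + "\"id\": \"/java-task\","
                + "\"cmd\": \"java -jar /opt/tasks/mytask.jar\","
                + "\"cpus\": 0.5, \"mem\": 256, \"instances\": 3"
                + "}";

        // Marathon typically listens on port 8080; adjust for your setup.
        URL url = new URL("http://localhost:8080/v2/apps");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(appJson.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Marathon responded with HTTP " + conn.getResponseCode());
    }
}
```

Marathon then asks Mesos for resources and keeps the requested number of instances running across the slaves, so you never schedule onto individual machines yourself.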
I am trying to wrap my head around Apache Mesos and need clarification on a few items.
My understanding of Mesos is that it is an executable that gets installed on every physical/VM server ("node") in a cluster, and then provides a Java API (somehow) that treats the individual nodes as one collective pool of computing resources (CPU/RAM/etc.). Hence, programs coding against the Java API see only a single set of resources and don't have to worry about how/where the code is deployed.
So for one, I could be fundamentally wrong in my understanding here (in which case, please correct me!). But if I'm on target, then how does the Java API (provided by Mesos) allow Java clients to tap into these resources?!? Can someone give a concrete example of Mesos in action?
Update
Take a look at my awful drawing below. If I understand the Mesos architecture correctly, we have a cluster of 3 physical servers (phys01, phys02 and phys03). Each of these physical servers is running an Ubuntu host (or whatever). Through a hypervisor, say, Xen, we can run 1+ VMs.
I am interested in Docker & CoreOS, so I'll use those in this example, but I'm guessing the same could apply to other non-container setups.
So on each VM we have CoreOS. Running on each CoreOS instance is a Mesos executable/server. All Mesos nodes in a cluster see everything underneath them as a single pool of resources, and artifacts can be arbitrarily deployed to the Mesos cluster and Mesos will figure out which CoreOS instance to actually deploy them to.
Running on top of Mesos is a "Mesos framework" such as Marathon or Kubernetes. Running inside Kubernetes are various Docker containers (C1 - C4).
Is this understanding of Mesos more or less correct?
Your summary is almost right, but it does not reflect the essence of what Mesos represents. The vision of Mesosphere, the company behind the project, is to create a "Datacenter Operating System", and Mesos is the kernel of it, in analogy to the kernel of a normal OS.
The API is not limited to Java, you can use C, C++, Java/Scala, or Python.
If you have set up your Mesos cluster as you describe in your question and want to use your resources, you usually do this through a framework instead of running your workload directly on it. That doesn't mean it is complicated; there is a very small example in Scala which demonstrates this. Frameworks exist for multiple popular distributed data processing systems such as Apache Spark and Apache Cassandra. There are other frameworks such as Chronos, a cron at datacenter level, or Marathon, which allows you to run Docker-based applications.
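To give an idea of the shape of the framework API, here is a minimal sketch in Java (assuming the `org.apache.mesos` Java bindings; the framework name and master address are placeholders). It registers with the master and simply declines every resource offer, which is the point where a real framework would launch tasks instead:

```java
import java.util.List;

import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

public class NoOpScheduler implements Scheduler {

    @Override
    public void registered(SchedulerDriver driver, Protos.FrameworkID id, Protos.MasterInfo master) {
        System.out.println("Registered with framework id " + id.getValue());
    }

    @Override
    public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
        // A real framework would build TaskInfos here and call launchTasks().
        for (Protos.Offer offer : offers) {
            driver.declineOffer(offer.getId());
        }
    }

    // The remaining callbacks are required by the interface but unused in this sketch.
    @Override public void reregistered(SchedulerDriver d, Protos.MasterInfo m) { }
    @Override public void offerRescinded(SchedulerDriver d, Protos.OfferID id) { }
    @Override public void statusUpdate(SchedulerDriver d, Protos.TaskStatus s) { }
    @Override public void frameworkMessage(SchedulerDriver d, Protos.ExecutorID e, Protos.SlaveID s, byte[] data) { }
    @Override public void disconnected(SchedulerDriver d) { }
    @Override public void slaveLost(SchedulerDriver d, Protos.SlaveID id) { }
    @Override public void executorLost(SchedulerDriver d, Protos.ExecutorID e, Protos.SlaveID s, int status) { }
    @Override public void error(SchedulerDriver d, String message) { System.err.println(message); }

    public static void main(String[] args) {
        Protos.FrameworkInfo framework = Protos.FrameworkInfo.newBuilder()
                .setUser("")                 // empty string lets Mesos use the current user
                .setName("noop-framework")
                .build();
        // "localhost:5050" assumes a Mesos master running locally.
        MesosSchedulerDriver driver =
                new MesosSchedulerDriver(new NoOpScheduler(), framework, "localhost:5050");
        driver.run();
    }
}
```

This is exactly the plumbing that Marathon, Chronos, and the other frameworks implement for you, which is why most users never write a scheduler themselves.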
Update:
Yes, Mesos will take care of placement in the cluster, as that's what a kernel does -- scheduling and management of limited resources. The setup you have sketched raises several obvious questions, however.
Layers below Mesos:
Installing Mesos on CoreOS is possible but cumbersome, I think. This is not a typical scenario for running Mesos -- usually it goes on the lowest possible layer (on top of Ubuntu, in your case). So I hope you have good reasons for running CoreOS and a hypervisor.
Layers above Mesos:
Kubernetes is available as a framework, and Mesosphere seems to be putting a lot of effort into it. There is, however, without question some overlap in functionality -- especially with regard to scheduling. If you want to schedule basic container-based workloads, you might be better off with Marathon or, in the future, maybe Aurora. So here, too, I hope you have good reasons for this particular arrangement.
Sidenote: Kubernetes is similar to Marathon, but with a broader approach and a fairly opinionated design.
I'm developing a highly scalable application, so I decided to use Hazelcast for it. I have one frontend server, which puts messages onto queues for the nodes. Every node in the cluster updates its workload in a distributed map from a background thread, so the frontend server chooses which queue (every node has its own message queue) to put each message in. My question is: is Hazelcast suitable for such a design (we need workload distribution and load balancing), or are there better alternatives? I like Hazelcast for its simplicity and nice design.
Hazelcast is great; it's very lightweight and easy to use. However, it's still in development and there are a few issues when using it.
If you look here: http://code.google.com/p/hazelcast/issues/list you can see that there are some bugs with the queue data structure when using transactions. Overall, it provides what it advertises and basically gives you a distributed cache for free.
I have first-hand experience with Hazelcast. The version we went to production with was 1.9.4. We recently upgraded to 2.2, and now 2.3 is the latest. I am quite pleased with it. What you are describing is a pretty good use case for Hazelcast. I had a similar use case where each node has its own queue and messages are pushed to the appropriate queue based on which node the client was connected to. It worked great and the business loved it.
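For what it's worth, the routing side of that design is only a few lines with the plain Hazelcast API. Here is a hedged sketch of the frontend's side; the map and queue names and the idea of picking the least-loaded node are assumptions for illustration, and the `com.hazelcast.core` imports reflect the older API (newer versions moved `IMap`/`IQueue` to other packages):

```java
import java.util.Map;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.IQueue;

public class FrontendRouter {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Each node's background thread publishes its current load here: node id -> load.
        IMap<String, Integer> workload = hz.getMap("workload");

        // Pick the least-loaded node and push the message onto that node's queue.
        String target = workload.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no nodes have registered yet"));

        IQueue<String> queue = hz.getQueue("messages-" + target);
        queue.put("message for node " + target);
    }
}
```

The distributed map and per-node queues are plain Hazelcast structures, so the load-balancing policy stays entirely in your own code.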