I was wondering if there is any way to simulate workers on my local machine. I know that Spark can be set up locally and fully tested that way. However, I would like to actually simulate multiple workers, so that I can see how work is partitioned, how loads are distributed, and how the DAG behaves.
I can also think of other ways this would help me, for example debugging and tracing data transformations. What I want is to develop my program in a test-driven way that leads to an optimal design, without relying only on the theory behind shuffling and expensive operations. Or are we doomed to trial and error on real clusters? Thanks!
Yes, this is possible; please have a look at this link: https://sparkbyexamples.com/spark/apache-spark-installation-on-windows/
Also, if you set up the Spark history server on your local machine, you can trace transformations, the DAG, etc.
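For a concrete illustration, here is a minimal sketch in Java (assuming Spark 2.x+ with spark-sql on the classpath) that runs the driver and several executor threads in one JVM via the local[N] master, and writes event logs so the history server can show the jobs, stages and shuffles; the event-log path is a placeholder:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LocalSimulation {
    public static void main(String[] args) {
        // "local[4]" runs the driver plus 4 executor threads in a single JVM,
        // which is enough to observe how work is partitioned and shuffled.
        SparkSession spark = SparkSession.builder()
                .appName("local-worker-simulation")
                .master("local[4]")
                // Event logs let the history server reconstruct jobs, stages and the DAG.
                .config("spark.eventLog.enabled", "true")
                .config("spark.eventLog.dir", "file:///tmp/spark-events") // placeholder path
                .getOrCreate();

        Dataset<Row> df = spark.range(0, 1_000_000).toDF("id");

        // repartition() forces a shuffle, so its cost shows up in the UI / history server.
        long count = df.repartition(8).filter("id % 2 = 0").count();
        System.out.println("count = " + count);

        spark.stop();
    }
}
```

As far as I know, Spark also has a `local-cluster[workers,cores,memMB]` master that its own test suite uses to simulate separate executor processes, but it is not officially documented for end users, so treat it as experimental.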
Is there a way to package Storm and Cassandra into an executable jar so that, as part of running the jar, a single-node Storm and Cassandra will be deployed and serve the program? Thanks.
I think you are somewhat confused about how the Storm architecture works, unless you really are planning to run Storm in local mode, in which case I'm not sure why you're bothering to do that if you have a Cassandra cluster: local mode is only meant for testing and in many cases will perform worse than an actual cluster would. You can get better performance locally by just writing multithreaded code without pulling in all the Storm functionality, which is really intended to aid in robust stream processing over a possibly unreliable cluster.
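For reference, this is roughly what running a topology in local mode looks like; a minimal sketch, assuming a recent Storm release (`org.apache.storm` packages; older releases used `backtype.storm`), where `MySpout` and `MyBolt` are placeholders for your own components:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class LocalTopologyRunner {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new MySpout(), 1);   // MySpout: your spout implementation (placeholder)
        builder.setBolt("process", new MyBolt(), 2)     // MyBolt: your bolt implementation (placeholder)
               .shuffleGrouping("events");

        Config conf = new Config();
        conf.setDebug(true);

        // LocalCluster simulates Nimbus, a supervisor and workers inside this one JVM;
        // it is intended for testing only, not for production performance.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("local-test", conf, builder.createTopology());

        Thread.sleep(30_000);   // let the topology run for a while
        cluster.killTopology("local-test");
        cluster.shutdown();
    }
}
```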
To me it sounds like what you probably really mean to do is have each Cassandra node also be a Storm supervisor node running one (or more) workers. You will also need a Nimbus server and a ZooKeeper cluster somewhere to make the whole thing go.
Given all this, I suppose it's theoretically possible to have it all in one jar, but that seems like more trouble than it's worth. Cassandra nodes and Storm supervisors are already dead simple to set up, and there's no reason they can't run together on the same server, so I would recommend against it.
Further, I'm not clear on your use case, but it's hard to imagine a situation where you'd actually want to do this. The only things that come to mind are either (a) your Cassandra workload is very disk heavy with no real computation happening on the nodes or (b) you have over-provisioned physical hardware that you are looking to take up slack capacity on. Otherwise I think you'd almost certainly be better off with separate machines for Storm and Cassandra.
I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead of starting a job. I'm looking for a solution that can be used this way: jobs that can easily be scaled across multiple machines but that do not require much input data. What is more, I want to run machine learning jobs in real time, e.g. applying a previously trained neural network.
What libraries/technologies I can use for this purposes?
You are right, Hadoop is designed for batch-type processing.
Reading the question, I thought about the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
(from: InfoQ post)
However, I have not worked with it yet, so I really cannot say much about it in practice.
Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
Given that you want a real-time response in the "seconds" range, I recommend something like this:
Set up a batch processing model for pre-computing as much as possible. Essentially, try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even every 15 minutes.
Use a real-time system to do the last few things that cannot be precomputed.
For this you should look at either the previously mentioned S4 or the recently announced Twitter Storm.
Sometimes it pays to go really simple and store the precomputed values all in memory and simply do the last aggregation/filter/sorting/... steps in memory. If you can do that you can really scale because each node can run completely independently of all others.
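As a toy illustration of that last point, here is a sketch in Java (all names are made up) where each node keeps precomputed per-key scores in memory, loaded from the batch job, and only the final filter/sort happens at request time:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class PrecomputedStore {
    // Filled by the nightly/hourly batch job; key -> precomputed score.
    private final Map<String, Long> precomputed = new ConcurrentHashMap<>();

    public void load(Map<String, Long> batchOutput) {
        precomputed.putAll(batchOutput);
    }

    // The only work done at request time: filter and sort entirely in memory.
    public List<String> topKeys(long minScore, int limit) {
        return precomputed.entrySet().stream()
                .filter(e -> e.getValue() >= minScore)
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Because each node only reads its own in-memory copy, nodes stay completely independent and you can scale by simply adding more of them.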
Perhaps having a NoSQL backend for your realtime component helps.
There are lots of those available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ...
It all depends on your real application.
Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively though.
What you're trying to do would be a better fit for HPCC, as it has both the back-end data processing engine (equivalent to Hadoop) and the front-end real-time data delivery engine, eliminating the need to increase complexity through third-party components. And a nice thing about HPCC is that both components are programmed using the exact same language and programming paradigms.
Check them out at: http://hpccsystems.com
I am looking for something that will make it easy to run (correctly coded) embarrassingly parallel JVM code on a cluster (so that I can use Clojure + Incanter).
I have used Parallel Python in the past to do this. We have a new PBS cluster and our admin will soon set up IPython nodes that use PBS as the backend. Both of these systems make it almost a no-brainer to run certain types of code in a cluster.
I made the mistake of using Hadoop in the past (Hadoop is just not suited to the kind of data that I use) - the latency meant that even small runs took 1-2 minutes to execute.
Is JPPF or Gridgain better for what I need? Does anyone here have any experience with either? Is there anything else you can recommend?
Check out cascalog - http://github.com/nathanmarz/cascalog
Clojure is reported to work on Terracotta, subject to some patching.
Look at Skandium
Edit: The above link is no longer live, so here is the GitHub link: https://github.com/mleyton/Skandium
I suggest you look at Skandium; alternative licenses to the GPL can be negotiated with the developers upon request.
What's a good method for assigning work to a set of remote machines? Consider an example where the task is very CPU and RAM intensive, but doesn't actually process a large dataset. The language of choice would be Java. I was thinking Hadoop would be a good option, but the dataset passed between remote machines is fairly small, and Hadoop seems to focus mainly on the distribution of data rather than distribution of work.
What are some good technologies that can help?
EDIT: I'm mainly interested in load balancing. There will be a series of jobs with a small (< 3MB) dataset, but significant processing and memory needs.
MPI would probably be a good choice; there's even a Java implementation.
MPI may be part of your answer, but looking at the question, I'm not sure if it addresses the portion of the problem you care about.
MPI provides a communication layer between processing components. It is low level, requiring you to do a fair amount of work, but from what I saw in an introductory presentation, it also comes with some common matrix data manipulation functions.
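To make that concrete, here is a minimal sketch in Java, assuming the MPJ Express implementation of MPI (its API mirrors the MPI standard; the method names below follow MPJ Express conventions), where rank 0 hands out one small work unit to every other rank:

```java
import mpi.MPI;

public class MpiWorkHandout {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);

        int rank = MPI.COMM_WORLD.Rank();   // this process's id
        int size = MPI.COMM_WORLD.Size();   // total number of processes

        if (rank == 0) {
            // Rank 0 sends one integer "work unit" to every other rank.
            for (int dest = 1; dest < size; dest++) {
                int[] work = { dest * 10 };
                MPI.COMM_WORLD.Send(work, 0, 1, MPI.INT, dest, 0);
            }
        } else {
            int[] work = new int[1];
            MPI.COMM_WORLD.Recv(work, 0, 1, MPI.INT, 0, 0);
            System.out.println("rank " + rank + " received work unit " + work[0]);
        }

        MPI.Finalize();
    }
}
```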
In your question, you seem to be more interested in the load balancing/job processing aspects of the problem. If that really is your focus, a small program hosted in a servlet or an RMI server might be sufficient. Let each client go to the server for its next unit of work and then submit the results back (you might even be able to use a database/file share, but pay attention to locking issues). In other words, a pull mechanism versus a push mechanism.
This approach is fairly simple to implement and gives you the advantage of scaling up by just running more distributed clients. Load balancing isn't too important if you intend to allow your process to take full control of the machine. You can experiment with running multiple clients on a machine that has multiple cores to see if you can improve overall throughput for the node. A multi-threaded client would be more efficient, but can increase complexity depending on the structure of the code you are using to solve the problem.
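A stripped-down sketch of that pull mechanism in Java, here using the JDK's built-in `com.sun.net.httpserver` instead of RMI or a servlet (all names and the port are made up): the server holds a queue of work units, and each client simply GETs `/next-work`, runs the job at full CPU, and asks again.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WorkServer {
    public static void main(String[] args) throws Exception {
        // Units of work the clients will pull; here just job ids.
        BlockingQueue<String> jobs = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) {
            jobs.add("job-" + i);
        }

        // Clients GET /next-work; the body "NONE" means there is nothing left.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/next-work", exchange -> {
            String job = jobs.poll();
            if (job == null) {
                job = "NONE"; // tells the client there is no work left
            }
            byte[] body = job.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Work server listening on port 8080");
    }
}
```

Results could be posted back to a second endpoint (omitted here), and adding throughput is then just a matter of starting more clients.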
I have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations across multiple input parameter configurations and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 min depending on parameters and randomness.
I've been reading about Hadoop and wondering if it can help me run lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
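In outline, that mapper/identity-reducer shape would look something like the sketch below (Java, new `org.apache.hadoop.mapreduce` API), where each input line is one parameter configuration and `runSimulation` is only a placeholder for the existing simulation code:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text paramLine, Context context)
            throws IOException, InterruptedException {
        // One input line = one parameter configuration for the simulation.
        String params = paramLine.toString();

        // runSimulation is a placeholder for the existing Java simulation; it would
        // read the shared CSVs (e.g. shipped to each node via the distributed cache)
        // and return the boolean plus the few output numbers as a string.
        String result = runSimulation(params);

        context.write(new Text(params), new Text(result));
    }

    private String runSimulation(String params) {
        // placeholder: call into the real simulation here
        return "true,0.42";
    }
}
```

With the reducer left as the identity (or the job configured with zero reducers), each output record would be one simulation result.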
The thing I'm worried about is HDFS, which seems to be meant for huge files, not a smattering of small CSV files (none of which would be big enough to even reach the minimum recommended block size of 64MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.
Is Hadoop the wrong tool for me?
I see a number of answers here that are basically saying, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"
Hadoop is a fantastic framework for building a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here are the top 5 reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):
Access: I can lease Hadoop clusters through Amazon Elastic MapReduce, and I don't have to invest any time or energy in administering a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node get run on another node.
Upgradeable: Hadoop is a rather generic map-reduce engine with a great distributed file system, so if you later have problems that involve large data, you don't have to migrate to a new solution. Hadoop gives you a simulation platform that will also scale to a large-data platform for (nearly) free!
Support: Being open source and used by so many companies, the resources for Hadoop, both online and off, are numerous. Many of those resources are written with the assumption of "big data", but they are still useful for learning to think in a map-reduce way.
Portability: I have built analysis on top of proprietary engines using proprietary tools which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.
Hadoop can be made to perform your simulation if you already have a Hadoop cluster, but it's not the best tool for the kind of application you are describing. Hadoop is built to make working on big data possible, and you don't have big data -- you have big computation.
I like Gearman (http://gearman.org/) for this sort of thing.
While you might be able to get by using MapReduce with Hadoop, it seems like what you're doing might be better suited for a grid/job scheduler such as Condor or Sun Grid Engine. Hadoop is more suited for doing something where you take a single (very large) input, split it into chunks for your worker machines to process, and then reduce it to produce an output.
Since you are already using Java, I suggest taking a look at GridGain which, I think, is particularly well suited to your problem.
Simply put, though Hadoop may solve your problem here, it's not the right tool for your purpose.