I have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations, across multiple input parameter configurations, and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 min depending on parameters and randomness.
I've been reading about Hadoop and wondering if it can help me run lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
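To make that concrete, here is a rough, untested sketch of what I imagine the mapper would look like (runSimulation is just a placeholder for my own code; the CSVs would presumably have to be shipped to each node somehow, e.g. via Hadoop's distributed cache):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: each input line is one parameter configuration; the mapper runs one
// simulation and emits "params -> result". A reducer isn't really needed (map-only job).
public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String params = value.toString();
        String result = runSimulation(params);   // placeholder for my actual simulation
        context.write(new Text(params), new Text(result));
    }

    // Placeholder: the real implementation would load the cached CSVs once
    // (e.g. in setup()) and run the stochastic simulation for these parameters.
    private String runSimulation(String params) {
        return "true,0.0,0.0";
    }
}
```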
The thing I'm worried about is HDFS, which seems to be meant for huge files, not a smattering of small CSV files (none of which would be big enough to even make up the minimum recommended block size of 64MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.
Is Hadoop the wrong tool for me?
I see a number of answers here that are basically saying, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"
Hadoop is a fantastic framework for building a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here are the top five reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):
Access: I can lease Hadoop clusters through Amazon Elastic MapReduce, so I don't have to invest any time and energy into the administration of a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node get run on another node.
Upgradeable: Being a rather generic MapReduce engine with a great distributed file system, Hadoop means that if you later have problems that involve large data, and you're already used to Hadoop, you don't have to migrate to a new solution. So Hadoop gives you a simulation platform that will also scale to a large-data platform for (nearly) free!
Support: Being open source and used by so many companies, the resources for Hadoop, both online and off, are numerous. Many of those resources are written with the assumption of "big data", but they are still useful for learning to think in a MapReduce way.
Portability: I have built analyses on top of proprietary engines using proprietary tools, which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack, I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.
Hadoop can be made to perform your simulation if you already have a Hadoop cluster, but it's not the best tool for the kind of application you are describing. Hadoop is built to make working on big data possible, and you don't have big data -- you have big computation.
I like Gearman (http://gearman.org/) for this sort of thing.
While you might be able to get by using MapReduce with Hadoop, it seems like what you're doing might be better suited for a grid/job scheduler such as Condor or Sun Grid Engine. Hadoop is more suited for doing something where you take a single (very large) input, split it into chunks for your worker machines to process, and then reduce it to produce an output.
Since you are already using Java, I suggest taking a look at GridGain which, I think, is particularly well suited to your problem.
Simply said, though Hadoop may solve your problem here, it's not the right tool for your purpose.
Related
I'm learning a bit more about Hadoop and its applications, and I understand it is geared toward massive datasets and large files. Let's say I had an application in which I was processing a relatively small number of files (say 100k), which isn't a huge number for something like Hadoop/HDFS. However, it does take a significant amount of time to run on a single machine, so I'd like to distribute the process.
The problem can be broken down into a map-reduce style problem (e.g. each of the files can be processed independently and then I can aggregate the results). I'm open to using infrastructure such as Amazon EC2, but I'm not so sure what technologies to explore for actually aggregating the results of the process. Seems like Hadoop might be a bit overkill here.
Can anyone provide guidance on this type of problem?
First off, you may want to reconsider your assumption that you can't combine files. Even images can be combined; you just need to figure out how to do that in a way that allows you to break them out again in your mappers. Combining them with some sort of sentinel value or magic number between them might make it possible to turn them into one giant file.
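For example, here is a plain-Java sketch (not tied to any particular Hadoop input format) that length-prefixes each file instead of using a sentinel, which makes splitting them back out unambiguous:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: pack many small files into one big file, each record being
// [name length][name bytes][content length][content bytes], so a custom
// reader (or mapper) can split them out again later.
public class SmallFilePacker {
    public static void pack(List<Path> inputs, Path output) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(output.toFile()))) {
            for (Path p : inputs) {
                byte[] name = p.getFileName().toString().getBytes(StandardCharsets.UTF_8);
                byte[] content = Files.readAllBytes(p);
                out.writeInt(name.length);
                out.write(name);
                out.writeInt(content.length);
                out.write(content);
            }
        }
    }
}
```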
Other options include HBase, where you could store the images in cells. HBase also has a built-in TableMapper and TableReducer, and can store the results of your processing alongside the raw data in a semi-structured way.
EDIT: As for the "is Hadoop overkill" question, you need to consider the following:
Hadoop adds at least one machine of overhead (the HDFS NameNode). You typically don't want to store data or run jobs on that machine, since it is a SPOF (single point of failure).
Hadoop is best suited for processing data in batch, with relatively high latency. As #Raihan mentions, there are several other FOSS distributed compute architectures that may serve your needs better if you need realtime or low-latency results.
100k files isn't all that few. Even if they are only 100KB each, that's 10GB of data.
Other than the above, Hadoop is a relatively low-overhead way of approaching distributed computing problems. It has a huge, helpful community behind it, so you can get help quickly if you need it. And it is focused on running on cheap hardware and a free OS, so there really isn't any significant overhead.
In short, I'd try it before you discard it for something else.
I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead connected with starting a job. I'm looking for a solution which could be used this way: jobs which can easily be scaled onto multiple machines but which do not require much input data. What's more, I want to run machine learning jobs, e.g. using a previously created neural network, in real time.
What libraries/technologies can I use for these purposes?
You are right, Hadoop is designed for batch-type processing.
Reading the question, I thought about the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
(from: InfoQ post)
However, I have not worked with it yet, so I really cannot say much about it in practice.
Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
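For a feel of the programming model, the public word-count example boils down to roughly the sketch below (untested, and derived only from those examples; the backtype.storm package names match the early 0.x releases, so adjust for your version):

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    // Dummy spout: in a real topology this would read from a queue, a log, etc.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("the quick brown fox"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: split each sentence into words and emit them downstream.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");

        // Run locally for a few seconds, then shut down.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```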
Given the fact that you want a real-time response in the "seconds" range, I recommend something like this:
Setup a batched processing model for pre-computing as much as possible. Essentially try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even 15 minutes.
Use a real-time system to do the last few things that cannot be precomputed.
For this you should look at using either the mentioned S4 or the recently announced Twitter Storm.
Sometimes it pays to go really simple and store the precomputed values all in memory and simply do the last aggregation/filter/sorting/... steps in memory. If you can do that you can really scale because each node can run completely independently of all others.
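As a toy illustration of that last point (plain Java, all names invented), the realtime layer can simply fold recent deltas into the batch-precomputed values in memory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: counts precomputed by the nightly/hourly batch live in memory;
// the realtime path only folds in the deltas that arrived since the last batch.
public class LastMileAggregator {
    private final Map<String, Long> precomputed = new ConcurrentHashMap<>();
    private final Map<String, Long> recentDeltas = new ConcurrentHashMap<>();

    public void loadBatchResult(String key, long count) {
        precomputed.put(key, count);
    }

    public void onRealtimeEvent(String key) {
        recentDeltas.merge(key, 1L, Long::sum);
    }

    public long currentCount(String key) {
        return precomputed.getOrDefault(key, 0L)
             + recentDeltas.getOrDefault(key, 0L);
    }
}
```

Because each node holds only its own precomputed slice plus its own deltas, the nodes stay completely independent of each other.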
Perhaps having a NoSQL backend for your realtime component helps.
There are lots of those available: mongodb, redis, riak, cassandra, hbase, couchdb, ...
It all depends on your real application.
Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively though.
What you're trying to do would be a better fit for HPCC, as it has both the back-end data processing engine (equivalent to Hadoop) and the front-end real-time data delivery engine, eliminating the need to increase complexity through third-party components. And a nice thing about HPCC is that both components are programmed using the exact same language and programming paradigms.
Check them out at: http://hpccsystems.com
Is there any automatic tool that can transform legacy uniprocessor programs for the cloud, meaning that the target program is ready to execute in the cloud (e.g. programs written for Hadoop)? If not, what are the best practices when doing such transformations (maybe a total rewrite) manually? Also, how can I know/evaluate whether a legacy program (or programming task) is suitable for cloud computing?
Example: suppose I have a WordCount program written solely with standard Java library (e.g. HashMap), how can I transform it to one written with Hadoop like the one provided in the sample code of the Hadoop distribution?
Is there any automatic tool that can transform legacy uniprocessor programs for the cloud?
I don't think there is an automatic tool that can transform a legacy uniprocessor program for the cloud.
If the legacy program is written using the MapReduce paradigm, then it should be somewhat easy to run in a cloud using Hadoop with some modifications. If not, then the problem has to be thought of in a MapReduce way and rewritten for Hadoop using Java or some other language that supports reading/writing to STDIN/STDOUT.
Also, if the language in which the legacy program was written can read/write to the STDIN/STDOUT then you can use Hadoop Streaming.
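To make the WordCount example from the question concrete: the rewrite essentially turns the HashMap loop into a mapper that emits (word, 1) pairs and a reducer that sums them, along the lines of the sample that ships with Hadoop (sketch only, driver code omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The HashMap's key becomes the map output key; every occurrence emits a 1.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The HashMap's "put(word, oldCount + 1)" becomes a reducer that sums the 1s per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

You still need a small driver class to set the input/output paths and submit the job, much like the sample in the Hadoop distribution.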
Also, how can I know/evaluate whether a legacy program (or programming task) is suitable for cloud computing?
If the processing can happen in parallel independently and the data can also be split across more than one machine, then it might be a suitable candidate for Hadoop.
HDFS (Hadoop Distributed File System) is designed for high throughput at the cost of high latency. If the requirement is for low latency, then you might consider HBase.
Also, HDFS is designed for large files (GB, TB and PB). If the legacy application has too many small files, then an alternative approach has to be considered.
Some more things to consider.
Hadoop runs straight out of the box with some minimum configuration changes, but to run it efficiently a lot of parameters have to be tweaked, and sometimes it's necessary to get straight into the code.
Also, try a POC (proof of concept): start with something small to solve the problem area and evaluate the pros and cons.
I'd suggest buying the book 'Hadoop: The Definitive Guide'.
Like any concurrent application, it has to be able to do multiple independent things at the same time. If you want this to be faster, the time you save has to outweigh the overhead that distributing the application adds.
In the example of the word count, your bottleneck is likely to be how fast the file can be read from disk. To distribute the word count efficiently you have to have copies of the file (or portions of the file) on each machine. This, of course, can take much longer than it saves.
However, if file access is not your bottleneck, you can break the file(s) into portions so that each thread or node can count the words in its portion and then sum the results to get the total.
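A single-machine sketch of that split-then-sum pattern (plain Java, details made up) might look like this:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: split the lines of a file into portions, count words in each
// portion on its own thread, then sum the partial counts ("reduce").
public class ParallelWordCount {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int chunk = (lines.size() + threads - 1) / threads;
        List<Future<Long>> partials = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunk) {
            final List<String> portion = lines.subList(i, Math.min(i + chunk, lines.size()));
            partials.add(pool.submit(() -> {
                long count = 0;
                for (String line : portion) {
                    count += line.trim().isEmpty() ? 0 : line.trim().split("\\s+").length;
                }
                return count;
            }));
        }

        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get();          // sum the partial results
        }
        pool.shutdown();
        System.out.println("Total words: " + total);
    }
}
```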
There are lots of folks looking for magic tools to convert programs implemented using serial computing methods, into ones that are highly parallel.
Mostly this doesn't work, as the parallelism isn't easily found in the code either a) because it isn't there, or b) because the analysis required to see it is beyond the present technology of the tools.
If the parallelism can be found by a tool, or simply marked as present by a programmer ("annotations", "directives", see OpenMP) for example, there are tools that can automatically insert parallelism directives.
These tools are mostly found in the Fortran space (to support supercomputing tasks). There are some research tools for Java; lots of universities are doing "Java" + "parallelism" because it's a hot topic in a "cool" [meaning "available"] language. I doubt you'll find a university tool that really works for this; they only do demos.
I'd guess you're stuck, and you'll have to do this yourself.
I have to do a class project for data mining subject. My topic will be mining stackoverflow's data for trending topics.
So, I have downloaded the data from here but the data set is so huge (posts.xml is 3gb in size), that I cannot process it on my machine.
So, what do you suggest, is going for AWS for data processing a good option or not worth it?
I have no prior experience on AWS, so how can AWS help me with my school project? How would you have gone about it?
UPDATE 1
So, my data processing will be in 3 stages:
Convert XML (from so.com dump) to .ARFF (for weka jar),
Mine the data using algos in weka,
Convert the output to GraphML format which will be read by prefuse library for visualization.
So, where does AWS fit in here? I suppose there are two features in AWS which can help me:
EC2 and
Elastic MapReduce,
but I am not sure how MapReduce works and how I can use it in my project. Can I?
You can consider EC2 (the part of AWS you would be using for doing the actual computations) as nothing more than a way to rent computers programmatically or through a simple web interface. If you need a lot of machines and you intend to use them for a short period of time, then AWS is probably good for you. However, there's no magic bullet. You will still have to pick the right software to install on them, load the data either in EBS volumes or S3 and all the other boring details.
Also be advised that EC2 instances and storage are relatively expensive. Be prepared to pay 5-10x more than you would pay if you actually owned the machine/disks and used it for say 3 years.
Regarding your problem, I sincerely doubt that a modern computer is not able to process a 3 gigabyte xml file. In fact, I just indexed all of stack overflow's posts.xml in SOLR on my workstation and it all went swimmingly. Are you using a SAX-like parser? If not, that will help you more than all the cloud services combined.
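For what it's worth, a SAX-style pass over posts.xml needs only the standard library; a rough sketch follows (the dump stores one <row .../> element per post, with attributes such as Id and Body, but check the attribute names against your copy of the dump):

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: stream through posts.xml without loading it all into memory.
public class PostsXmlScanner {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new DefaultHandler() {
            private long rows = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("row".equals(qName)) {
                    rows++;
                    String body = attrs.getValue("Body");   // null if the attribute is absent
                    // ... hand the post body to the next stage (e.g. ARFF conversion) here ...
                }
            }

            @Override
            public void endDocument() {
                System.out.println("Parsed " + rows + " posts");
            }
        });
    }
}
```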
Sounds like an interesting project, or at least a great excuse to get in touch with new technology -- I wish there had been stuff like that when I went to school.
In most cases AWS offers you a barebone server, so the obvious question is, have you decided how you want to process your data? E.g. -- do you just want to run a shell script on the .xml's or do you want to use hadoop, etc.?
The beauty of AWS is that you can get all the capacity you need -- on demand. E.g., in your case you probably don't need multiple instances just one beefy instance. And you don't have to pay for a root server for an entire month or even a week if you need the server only for a few hours.
If you let us know a little bit more on how you want to process the data, maybe we can help further.
I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool here, in which data transformation happens outside of the source and destination data sources (databases). Hence, the source data source is used purely for extraction and the destination purely for loading.
So, this act of transformation today takes, say, about X hours for a million records. I would like to address a scenario where I would have a billion records, but I would want the work done in the same X hours. So, here is the need: for my product to scale out (adding more commodity machines) based on the scale of data. As you can see, I am only worried about the ability to distribute my product's transformation functionality to different machines, thereby leveraging CPU power from all these machines.
I started looking for options and I came across Apache Hadoop and then eventually the concept of MapReduce. I was pretty successful in setting up Hadoop quickly without running into issues in cluster mode, and was happy to run a word-count demo too. Soon, I realized that to implement my own MapReduce model, I would have to redefine my product's transformation functionality in terms of MAP and REDUCE functions.
Here's when the trouble began. I read a copy of Hadoop: The Definitive Guide, and I understood that many of the common use cases of Hadoop are in scenarios where one is faced with:
Unstructured data on which one would like to perform aggregation/sorting or something of that kind.
Unstructured text that needs to be mined
etc!
Here is my scenario: I extract from a database and load into a database (which has structured data), and my sole purpose is to bring more CPUs into play, in a reliable manner, and thereby distribute my transformation. And redefining my transformation to fit a Map and Reduce model is a huge challenge in itself. So here are my questions:
1. Have you used Hadoop in ETL scenarios? If yes, could you be specific about how you handled MapReducing your transformation? Have you used Hadoop purely for leveraging extra CPU power?
2. Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
3. My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?
If you want to scale out a processing problem over a lot of systems, you must do two things:
Make sure you can process the information in independent parts.
There should be NO shared resource that is needed among these parts.
If there are dependencies then these will be the limit in your horizontal scalability.
So if you are starting from a relational model, then the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases, but it is a pain in the ... when trying to scale out.
The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focused around the part you want to do the processing around. Then you can distribute them over a huge cluster and, after the processing has been completed, you use the results.
If you cannot do such a jump you're in trouble.
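A contrived sketch of that jump (all names invented): pre-join orders with their customers so every record is self-contained and can be processed on any node without lookups:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: pre-join ("denormalize") orders with their customer data so each
// resulting record can be transformed on any node without touching a shared DB.
public class Denormalizer {

    public record Customer(long id, String name, String country) {}
    public record Order(long id, long customerId, double amount) {}
    public record FlatOrder(long orderId, double amount, String customerName, String country) {}

    public static List<FlatOrder> denormalize(List<Order> orders, Map<Long, Customer> customersById) {
        return orders.stream()
                .map(o -> {
                    Customer c = customersById.get(o.customerId());
                    return new FlatOrder(o.id(), o.amount(), c.name(), c.country());
                })
                .collect(Collectors.toList());
    }
}
```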
So coming back to your questions:
# Have you used Hadoop in ETL scenarios?
Yes, the input being Apache logfiles, and the loading and transformation consisted of parsing, normalizing and filtering these loglines. The result wasn't put in a normal RDBMS!
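In the same spirit, a transformation that only needs extra CPU often needs no reduce phase at all; a map-only job (a sketch, class names invented) distributes the transform and writes the results straight to the output:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: each mapper transforms one record; setNumReduceTasks(0) makes it a
// map-only job, so the transformed records are written straight to the output.
public class TransformJob {

    public static class TransformMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String transformed = transform(value.toString());  // your transformation here
            context.write(NullWritable.get(), new Text(transformed));
        }

        // Placeholder for the product's actual transformation logic.
        private String transform(String record) {
            return record.toUpperCase();
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "transform");
        job.setJarByClass(TransformJob.class);
        job.setMapperClass(TransformMapper.class);
        job.setNumReduceTasks(0);                 // no reduce phase needed
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The transformed records land in the usual part-* files in the output directory, one per mapper.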
# Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
MapReduce is a very simple processing model that will work great for any processing problem you are able to split into a lot of smaller, 100% independent parts. The MapReduce model is so simple that, as far as I know, any problem that can be split into independent parts can be written as a series of MapReduce steps.
HOWEVER: It is important to note that at this moment only BATCH oriented processing can be done with Hadoop. If you want "realtime" processing you are currently out of luck.
At this moment I don't know of a better model for which an actual implementation exists.
# My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?
Yep, that is the most common application.
MapReduce is "one" solution for "some" class of problems. It does not solve all the distributed systems problems - think about large TPS systems as the ones in banks or telecoms or telco signaling - there MR might be ineffective. But for the non real-time data processing MR performs awesome and you might consider it for massive ETL.
I cannot answer #1, as I haven't used MapReduce in ETL scenarios. However, I can say that MapReduce is not a "universal answer" for distributed computing; it's a useful tool for handling certain types of situations, where data is structured in a certain way. Think of it like a hashtable: very useful for certain situations, but not an "ultimate algorithm" by any definition of the term.
My personal understanding is that MapReduce is particularly useful for large quantities of "understructured" data; that is, it's useful for imposing some structure (basically, effectively providing a "first order" operation on large unstructured datasets). However, for datasets that are very large and relatively "tightly bound" (i.e. strong association between disparate data elements), it's (in my understanding) not a great solution.