I have to do a class project for my data mining course. My topic will be mining Stack Overflow's data for trending topics.
So, I have downloaded the data from here, but the data set is so huge (posts.xml is 3 GB in size) that I cannot process it on my machine.
So, what do you suggest: is going with AWS for data processing a good option, or is it not worth it?
I have no prior experience with AWS, so how can AWS help me with my school project? How would you have gone about it?
UPDATE 1
So, my data processing will be in 3 stages:
Convert the XML (from the so.com dump) to .ARFF (for the weka jar),
Mine the data using algorithms in weka,
Convert the output to GraphML format, which will be read by the prefuse library for visualization.
So, where does AWS fit in here? I suppose there are two features in AWS which can help me:
EC2 and
Elastic MapReduce,
but I am not sure how MapReduce works or how I can use it in my project. Can I?
You can consider EC2 (the part of AWS you would be using for the actual computations) as nothing more than a way to rent computers programmatically or through a simple web interface. If you need a lot of machines and you intend to use them for a short period of time, then AWS is probably good for you. However, there's no magic bullet. You will still have to pick the right software to install on them, load the data into EBS volumes or S3, and handle all the other boring details.
Also be advised that EC2 instances and storage are relatively expensive. Be prepared to pay 5-10x more than you would pay if you actually owned the machine/disks and used them for, say, 3 years.
Regarding your problem, I sincerely doubt that a modern computer is unable to process a 3 gigabyte XML file. In fact, I just indexed all of Stack Overflow's posts.xml into Solr on my workstation and it all went swimmingly. Are you using a SAX-like parser? If not, that will help you more than all the cloud services combined.
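If it helps, here is a minimal sketch of streaming posts.xml with a SAX handler to count tag occurrences instead of loading the whole file into memory. It assumes the dump's row/Tags attribute layout, which you should double-check against your download:

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Streams posts.xml row by row, so the 3 GB file never has to fit in memory.
public class TagCounter extends DefaultHandler {

    private final Map<String, Integer> tagCounts = new HashMap<String, Integer>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (!"row".equals(qName)) {
            return;
        }
        String tags = attrs.getValue("Tags");   // assumed attribute, e.g. "<java><hadoop>"
        if (tags == null) {
            return;
        }
        for (String tag : tags.replace(">", " ").replace("<", " ").trim().split("\\s+")) {
            if (tag.isEmpty()) {
                continue;
            }
            Integer count = tagCounts.get(tag);
            tagCounts.put(tag, count == null ? 1 : count + 1);
        }
    }

    public Map<String, Integer> getTagCounts() {
        return tagCounts;
    }

    public static void main(String[] args) throws Exception {
        TagCounter handler = new TagCounter();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new java.io.File("posts.xml"), handler);
        System.out.println(handler.getTagCounts().size() + " distinct tags");
    }
}
```

From the counts (or whatever features you extract here) you can then emit the .ARFF file for weka in the same pass.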
Sounds like an interesting project, or at least a great excuse to get in touch with new technology -- I wish there had been stuff like that when I went to school.
In most cases AWS offers you a bare-bones server, so the obvious question is: have you decided how you want to process your data? E.g., do you just want to run a shell script on the .xml files, or do you want to use Hadoop, etc.?
The beauty of AWS is that you can get all the capacity you need -- on demand. E.g., in your case you probably don't need multiple instances, just one beefy instance. And you don't have to pay for a root server for an entire month, or even a week, if you need the server only for a few hours.
If you let us know a little bit more on how you want to process the data, maybe we can help further.
I am creating an image/graphics-intensive application on Android. Thus I have decided to keep images on the server side and fetch them in batches as needed for each user. Apart from this, I would like to manage some minor user data on the backend for any future extension to the app or dynamic loading of some content.
For this I am looking for the easiest, but not a very rigid, back-end solution. After some research I have narrowed it down to the options below (in order of priority):
Amazon SDK for Android: it looks like this provides a lot of pre-built components, but I am not sure how flexible it is when doing some custom back-end coding/feature implementation.
Parse: easy to understand and use, but not flexible when it comes to custom feature development.
Amazon EC2 Java backend: I will have to do all the server-side coding from scratch here, but this will provide complete independence in feature implementation. Though I would love to find some code samples related to user management, backend DB management, and Java RESTful web services.
Any suggestions or pointers that you guys have on the above choices would be great.
Thanks in advance!
I have been using Parse, but I haven't explored the other two. So this may not be a comprehensive answer, but I will try to give you some pointers based on my experience with Parse.
I have been doing Android development for quite some time now, but I do not have any significant expertise (I would say very minimal) on the backend. Also, you mentioned you wish to build a graphics/image-intensive application, whereas the application I use Parse for is more about user data and minimal images (though it requires an extensive relational database).
Parse makes it really simple to create the backend structure, and the client SDK is also very powerful. Their APIs are very straightforward and don't require you to worry about writing complex queries, caching them, and saving the data. Given my background as mentioned above, I would say there is no learning curve involved in getting started with the development. You can simply start building your app right away!
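To give a flavour, here is a rough sketch of saving and querying an object with the Parse Android SDK; the "Photo" class and its fields are invented for illustration, and the exact callback signatures may vary slightly between SDK versions:

```java
import java.util.List;

import android.util.Log;

import com.parse.FindCallback;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;

public class PhotoRepository {

    // Saving a record -- Parse creates the "Photo" class on the fly, no schema setup needed.
    public void savePhotoMetadata() {
        ParseObject photo = new ParseObject("Photo");
        photo.put("owner", "alice");
        photo.put("batch", 3);
        photo.saveInBackground();
    }

    // Querying it back, asynchronously, without writing any SQL or caching code yourself.
    public void loadPhotosFor(String owner) {
        ParseQuery<ParseObject> query = ParseQuery.getQuery("Photo");
        query.whereEqualTo("owner", owner);
        query.findInBackground(new FindCallback<ParseObject>() {
            @Override
            public void done(List<ParseObject> photos, ParseException e) {
                if (e == null) {
                    Log.d("PhotoRepository", "Fetched " + photos.size() + " photos");
                }
            }
        });
    }
}
```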
Also, Parse uses AWS S3 on the backend, along with MongoDB. So I believe computation on the server side should not be a problem. Server-side logic can be implemented using Parse Cloud Code (requires some JavaScript). But if you plan to write some complex algorithms, I am not very sure how much of that can be done.
Parse's documentation for Android is quite good and gets you through most of the development; the documentation for iPhone development is extensive.
As far as the cost structure goes, it allows 1 million free API requests per month, which is very much sufficient for quite a number of users. In your case, storage should be more of a concern: Parse allows 1 GB for free and charges some 20 cents per GB above that.
Hope this helps!
I am looking for the easiest, but not a very rigid, back-end solution
Have you considered App Engine? Here's a tutorial about how to get App Engine working for you fast.
You can store up to 5 GB of blob storage for free, which should be more than enough for experimenting. If you go over, you can pay $0.13/GB/month for extra blob storage, which is more than reasonable.
I don't know what kind of app you are doing, but I'll propose one approach.
Use https://imageshack.com/ for images.
Create your user-data application with a lightweight web service (REST+JSON)
and expose it on Heroku (https://www.heroku.com/) with your preferred language/platform.
It could be Java or Ruby.
Using ImageShack for images will save cloud space for you, and the service is quite fast.
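If you pick Java, the web service itself can stay very small. A minimal JAX-RS sketch (the resource, fields, and in-memory storage are purely illustrative; on Heroku you would plug in a real database and a JSON provider such as Jackson):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// A toy "user data" resource: stores users in memory and serves them as JSON.
@Path("/users")
public class UserResource {

    private static final Map<String, User> USERS = new ConcurrentHashMap<String, User>();

    @GET
    @Path("/{name}")
    @Produces(MediaType.APPLICATION_JSON)
    public User get(@PathParam("name") String name) {
        return USERS.get(name);
    }

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    public void create(User user) {
        USERS.put(user.getName(), user);
    }

    // Simple bean that the JSON provider serializes for you.
    public static class User {
        private String name;
        private int score;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getScore() { return score; }
        public void setScore(int score) { this.score = score; }
    }
}
```

The Android client then only needs an HTTP library and a JSON parser to talk to it.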
I am looking for an open source solution to store and monitor some application performances.
To be more precise, I use several Java components in the software I develop and I would like to gather performance statistics for each of these components in order to figure out on what I need to focus to keep fast processing.
The idea would be to send a message to a repository to store some timestamps (every time a Java component starts or ends) and to have a web interface to browse the timestamps and do some analytics on top of them.
These needs seem really basic but unfortunately I haven't found anything on the web, probably because I don't know the right terminology for this kind of tools.
Could someone recommend me such a tool?
Thanks in advance!
Adrien
What you described sounds like RRDtool, which stores time-series data. To access it from Java, there is java-rrd.
I also get the impression that you are looking for a whole solution rather than just the data back-end. If so, check out the following open source cluster monitoring systems: Cacti, Ganglia, and Graphite. They all have web interfaces. Cacti and Ganglia have RRD-like back-ends, while Graphite has its own whisper database, etc.
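Whichever back-end you choose, the hook inside your Java components can be tiny. A rough sketch of the "send a timestamp when a component starts or ends" idea from the question; the collector URL and the payload format are placeholders for whatever tool you pick:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Wraps a unit of work and reports its start/end timestamps to a collector.
public class ComponentTimer {

    private static final String COLLECTOR_URL = "http://monitoring.example.com/timestamps";

    public static void timed(String componentName, Runnable work) {
        long start = System.currentTimeMillis();
        try {
            work.run();
        } finally {
            long end = System.currentTimeMillis();
            report(componentName + "," + start + "," + end);
        }
    }

    private static void report(String line) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(COLLECTOR_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(line.getBytes("UTF-8"));
            out.close();
            conn.getResponseCode();   // force the request to be sent
            conn.disconnect();
        } catch (Exception e) {
            // Monitoring should never break the component being measured.
        }
    }
}
```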
I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead involved in starting a job. I'm looking for a solution that can be used this way: jobs that can be easily scaled across multiple machines but that do not require much input data. What is more, I want to run machine learning jobs, e.g. applying a previously trained neural network, in real time.
What libraries/technologies can I use for these purposes?
You are right, Hadoop is designed for batch-type processing.
Reading the question, I thought about the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.
(from: InfoQ post)
However, I have not worked with it yet, so I really cannot say much about it in practice.
Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
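For a feel of the programming model, here is a minimal topology using the backtype.storm packages of the early releases; the spout and the "scoring" logic are stand-ins for your own event source and pre-trained model:

```java
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class ScoringTopology {

    // Placeholder event source; in practice this would read from a queue, log, etc.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("event-" + System.currentTimeMillis()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event"));
        }
    }

    // A bolt that applies some pre-trained model to each incoming event.
    public static class ScoringBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String event = tuple.getString(0);
            double score = event.length();   // stand-in for a real model evaluation
            collector.emit(new Values(event, score));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event", "score"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 2);
        builder.setBolt("scoring", new ScoringBolt(), 4).shuffleGrouping("events");

        LocalCluster cluster = new LocalCluster();   // in-process cluster for testing
        cluster.submitTopology("scoring", new Config(), builder.createTopology());
    }
}
```

Scaling is then mostly a matter of raising the parallelism hints and submitting to a real cluster instead of LocalCluster.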
Given the fact that you want a real-time response in the "seconds" range, I recommend something like this:
Setup a batched processing model for pre-computing as much as possible. Essentially try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even 15 minutes.
Use a real-time system to do the last few things that cannot be precomputed.
For this you should look at using either the aforementioned S4 or the recently announced Twitter Storm.
Sometimes it pays to go really simple and store the precomputed values all in memory and simply do the last aggregation/filter/sorting/... steps in memory. If you can do that you can really scale because each node can run completely independently of all others.
Perhaps having a NoSQL backend for your real-time component would help.
There are lots of those available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ...
It all depends on your real application.
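To make the "keep precomputed values in memory and only aggregate/filter/sort at request time" point above concrete, here is a toy sketch; the data layout and scoring are invented:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Precomputed scores are loaded by the hourly/daily batch job;
// each request only filters and sorts in memory, so every node can serve independently.
public class PrecomputedStore {

    private final Map<String, Double> scoreByItem = new ConcurrentHashMap<String, Double>();

    // Called when the batch job finishes.
    public void load(Map<String, Double> freshScores) {
        scoreByItem.putAll(freshScores);
    }

    // Called per request: only the cheap last-second filtering/sorting happens here.
    public List<String> topItems(double minScore, int limit) {
        List<Map.Entry<String, Double>> candidates = new ArrayList<Map.Entry<String, Double>>();
        for (Map.Entry<String, Double> e : scoreByItem.entrySet()) {
            if (e.getValue() >= minScore) {
                candidates.add(e);
            }
        }
        Collections.sort(candidates, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(limit, candidates.size()); i++) {
            top.add(candidates.get(i).getKey());
        }
        return top;
    }
}
```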
Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively, though.
What you're trying to do would be a better fit for HPCC, as it has both the back-end data processing engine (equivalent to Hadoop) and the front-end real-time data delivery engine, eliminating the need to add complexity through third-party components. And a nice thing about HPCC is that both components are programmed using the exact same language and programming paradigms.
Check them out at: http://hpccsystems.com
I have a number of rather large binary files (fixed-length records, the layout of which is described in another, textual, file). Data files can get as big as 6 GB. Layout files (COBOL copybooks) are small, usually less than 5 KB.
All data files are concentrated in a GNU/Linux server (although they were generated in a mainframe).
I need to provide the testers with the means to edit those binary files. There is a free product called RecordEdit (http://record-editor.sourceforge.net/), but it has two severe drawbacks:
It forces the testers to download the huge files through SFTP, only to upload them once again every time a slight change has been made. Very inefficient.
It loads the entire file into working memory, rendering it useless for all but the relatively small data files.
What I have in mind is a client/server architecture based on Java:
The server would run a permanent process, listening for editing-oriented requests coming from the client. Such requests would include things like (see the rough interface sketch below):
return the list of available files
lock a certain file for editing
modify this data in that record
return the n-th page of records
and so on…
The client could take any form (RCP-based on the desktop, which is my first candidate; ncurses on the same server; a web application in between…) as long as it is able to send requests to the server.
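Such a request set could be captured in a small Java interface. This is just a rough cut to make the idea concrete (names invented, record contents kept as raw bytes):

```java
import java.util.List;

// A rough cut of the edit-oriented requests the server would accept.
// Record contents are raw byte arrays here; the copybook decoding would sit on top of this.
public interface RecordEditService {

    List<String> listAvailableFiles();

    // Returns a lock token, or throws if someone else is already editing the file.
    String lockFile(String fileName);

    void unlockFile(String fileName, String lockToken);

    // A page is 'pageSize' consecutive fixed-length records starting at page * pageSize.
    List<byte[]> readPage(String fileName, int page, int pageSize);

    void updateRecord(String fileName, String lockToken, long recordNumber, byte[] newContents);
}
```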
I've been exploring NIO (because of its buffers) and MINA (because of its protocol transparency) in order to implement the scheme. However, before advancing this endeavor any further, I would like to collect your expert opinions.
Is mine a reasonable way to frame the problem?
Is it feasible to do it using the language and frameworks I'm thinking of? Is it convenient?
Do you know of any patterns, blueprints, success cases, or open projects that resemble or have to do with what I'm trying to do?
As I see it, the tricky thing here is decoding the files on the server. Once you've written that, it should be pretty easy.
I would suggest that whatever you use client-side should basically upload a 'diff' of the person's changes.
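Applying such a diff on the server is cheap once you know the record layout, because fixed-length records make the offset arithmetic trivial. A rough sketch (method and parameter names invented; the record length would come from the copybook):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Applies one "diff" coming from the client: overwrite record N with new bytes.
// Only recordLength bytes are touched, so the 6 GB file is never loaded into memory.
public class RecordPatcher {

    public static void patchRecord(String path, int recordLength, long recordNumber, byte[] newRecord)
            throws IOException {
        if (newRecord.length != recordLength) {
            throw new IllegalArgumentException("record must be exactly " + recordLength + " bytes");
        }
        RandomAccessFile file = new RandomAccessFile(path, "rw");
        try {
            file.seek(recordNumber * recordLength);   // fixed-length records => simple offset math
            file.write(newRecord);
        } finally {
            file.close();
        }
    }
}
```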
Might it make sense to make something that acts like a database (or use an existing database) for this data? Or is there just too much of it?
Depending on how many people need to do this, the quick-and-dirty solution is to run the program via X forwarding -- that eliminates a number of the issues, as long as that server has quite a lot of RAM free.
Is mine a reasonable way to frame the problem?
IMO, yes.
Is it feasible to do it using the language and frameworks I'm thinking of?
I think so. But there are other alternatives. For example:
Put the records into a database, and access by a key consisting of a filename + a record number. Could be a full RDBMS, or a more lightweight solution.
Implement it as a RESTful web service with a UI implemented in HTML + JavaScript.
Implement using a scalable distributed file-system.
Also, from your description there doesn't seem to be a pressing need for a highly scalable / transport-independent layer ... unless you need to support hundreds of simultaneous users.
Is it convenient?
Convenient for whom? If you are talking about you, the developer, it depends on whether you are already familiar with those frameworks.
Have you considered using a distributed file system like OpenAFS? That should be able to handle very large files. Then you can write a client-side app for editing the files as if they are local.
I have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100 MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations across multiple input parameter configurations and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 min depending on parameters and randomness.
I've been reading about Hadoop and wondering if it can help me run lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
The thing I'm worried about is HDFS, which seems to be meant for huge files, not a smattering of small CSV files (none of which would be big enough to even reach the minimum recommended block size of 64 MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.
Is Hadoop the wrong tool for me?
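For what it's worth, the "map runs the simulation, reducer is the identity" idea could look roughly like this; the parameter handling and the simulation call itself are placeholders:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file with one parameter configuration per line.
// Each map call runs one simulation and emits "<parameters> -> <result>"; the reducer can be
// the identity (or omitted), so the job output is simply the collected results.
public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text parameterLine, Context context)
            throws IOException, InterruptedException {
        String params = parameterLine.toString();
        String result = runSimulation(params);            // placeholder for the real simulation
        context.write(new Text(params), new Text(result));
    }

    private String runSimulation(String params) {
        // ... load the ~100 MB of CSVs (e.g. shipped via the distributed cache) and run the model ...
        return "true,0.42";                               // made-up output format
    }
}
```

With something like NLineInputFormat, each parameter line could become its own map task, and the shared CSV inputs could be distributed to every node rather than stored as HDFS-sized files.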
I see a number of answers here that basically say, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"
Hadoop is a fantastic framework for building a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here are the top 5 reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):
Access: I can lease Hadoop clusters through Amazon Elastic MapReduce, and I don't have to invest any time or energy in the administration of a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node get run on another node.
Upgradeable: Being a rather generic MapReduce engine with a great distributed file system, Hadoop means that if you later have problems involving large data, you don't have to migrate to a new solution. So Hadoop gives you a simulation platform that will also scale to a large-data platform for (nearly) free!
Support: Being open source and used by so many companies, the resources for Hadoop, both online and offline, are numerous. Many of those resources are written with the assumption of "big data", but they are still useful for learning to think in a map reduce way.
Portability: I have built analysis on top of proprietary engines using proprietary tools which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.
Hadoop can be made to perform your simulation if you already have a Hadoop cluster, but it's not the best tool for the kind of application you are describing. Hadoop is built to make working on big data possible, and you don't have big data -- you have big computation.
I like Gearman (http://gearman.org/) for this sort of thing.
While you might be able to get by using MapReduce with Hadoop, it seems like what you're doing might be better suited for a grid/job scheduler such as Condor or Sun Grid Engine. Hadoop is more suited for doing something where you take a single (very large) input, split it into chunks for your worker machines to process, and then reduce it to produce an output.
Since you are already using Java, I suggest taking a look at GridGain which, I think, is particularly well suited to your problem.
Simply said, though Hadoop may solve your problem here, it's not the right tool for your purpose.