Hierarchical MapReduce - java

I was wondering if it is possible to define a hierarchical MapReduce job.
In other words, I would like to have a MapReduce job whose mapper phase launches a different MapReduce job. Is this possible? Do you have any recommendations on how to do it?
I want to do this in order to have an additional level of parallelism/distribution in my program.
Thanks,
Arik.

The Hadoop: The Definitive Guide book contains a lot of recipes related to MapReduce job chaining, including sample code and detailed explanations. See especially the chapter called something like 'advanced API usage'.
I personally succeeded in replacing a complex MapReduce job with several HBase tables used as sources, via a handmade TableInputFormat extension. The result was an input format that combines the source data with minimal reduction, so the job was transformed into a single mapper step. So I recommend you look in this direction too.
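Launching a full MapReduce job from inside a mapper is generally discouraged; the usual alternatives are chaining jobs from the driver, or fanning work out inside each task. A plain-Java sketch of that second, nested-parallelism idea (no Hadoop classes here, this only illustrates the concept with thread pools; the partitioning and the squaring "inner job" are made up for illustration):

```java
import java.util.*;
import java.util.concurrent.*;

public class NestedParallelism {
    // Outer "mapper" task: each one fans its input out to an inner pool,
    // mimicking a nested level of parallelism inside a single map task.
    static int processPartition(List<Integer> partition) throws Exception {
        ExecutorService inner = Executors.newFixedThreadPool(2);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int value : partition) {
            futures.add(inner.submit(() -> value * value)); // the inner "job"
        }
        int sum = 0;
        for (Future<Integer> f : futures) sum += f.get();
        inner.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Two outer "map tasks", each with its own slice of the input.
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3, 4));
        ExecutorService outer = Executors.newFixedThreadPool(2);
        List<Future<Integer>> results = new ArrayList<>();
        for (List<Integer> p : partitions) {
            results.add(outer.submit(() -> processPartition(p)));
        }
        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        outer.shutdown();
        System.out.println(total); // 1 + 4 + 9 + 16 = 30
    }
}
```

In real Hadoop code the equivalent of the outer loop is the driver submitting jobs one after another and checking each one's completion status before starting the next.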

You should try Cascading. It allows you to define pretty complex jobs with multiple steps.

I guess you need the Oozie tool. Oozie helps in defining workflows using an XML file.

Related

Producing an Oozie workflow using Java

Does someone have an idea whether it is possible to write Oozie workflows using Java? I'm currently defining them using XML, and I think we could improve reusability if I were able to use OOP. I would be happy if I could generate the XML from Java classes, or even better, if I could run the workflow from Java.
The solutions I have in mind right now are:
Create Java POJOs from the Oozie XSD (using JAXB), code the workflow in Java,
and then output the result as XML.
Use the oozie-core library and figure out, without documentation, how to use ActionExecutor.
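The first option can also be approximated without JAXB, using the JDK's built-in DOM API to emit the XML. A minimal sketch; the element names (workflow-app, start, end) follow the Oozie workflow schema, but this builder class and its inputs are hypothetical and omit actions, transitions, and the schema namespace:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class WorkflowBuilder {
    // Builds a skeletal workflow document from plain Java values.
    public static String build(String name, String firstAction) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element wf = doc.createElement("workflow-app");
        wf.setAttribute("name", name);
        doc.appendChild(wf);

        Element start = doc.createElement("start");
        start.setAttribute("to", firstAction);
        wf.appendChild(start);

        Element end = doc.createElement("end");
        end.setAttribute("name", "done");
        wf.appendChild(end);

        // Serialize the DOM tree back to an XML string.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("my-flow", "first-step"));
    }
}
```

JAXB with classes generated from the XSD would give you typed POJOs instead of raw elements, which scales better once workflows grow beyond a handful of nodes.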
Also, if the idea doesn't make any sense I would like to know. Thank you!

Using hadoop for data analytics

I have a question regarding the implementation of Hadoop in one of my projects. Basically the requirement is that we receive a bunch of logs on a daily basis containing information about videos (when each was played, when it stopped, which user played it, etc.).
What we have to do is analyze these files and return stats data in response to an HTTP request.
Example request: http://somesite/requestData?startDate=someDate&endDate=anotherDate. Basically this request asks for the count of all videos played within a date range.
My question is can we use hadoop to solve this?
I have read in various articles that Hadoop is not real-time. So to approach this scenario, should I use Hadoop in conjunction with MySQL?
What I have thought of doing is to write a MapReduce job that stores the count for each video for each day in MySQL. The Hadoop job can be scheduled to run, say, once a day. The MySQL data can then be used to serve the requests in real time.
Is this approach correct? Is Hive useful for this in any way? Please provide some guidance on this.
Yes, your approach is correct: you can create the per-day data with an MR job or Hive and store it in MySQL for serving in real time.
However, newer versions of Hive, when configured with Tez, can provide decent query performance. You could try storing your per-day data in Hive and serving it directly from there. If the query is a simple select, it should be fast enough.
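The aggregation that daily job performs boils down to a grouped count keyed by (date, videoId). A plain-Java sketch of the logic the mapper/reducer pair would implement (the log line format and field positions here are hypothetical):

```java
import java.util.*;

public class DailyVideoCounts {
    // Each log line: "date,videoId,event" (hypothetical format).
    // Returns play counts keyed by "date/videoId", mirroring what the
    // reducer would emit and the job would then write to MySQL.
    static Map<String, Integer> countPlays(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            String[] f = line.split(",");
            if (!"PLAY".equals(f[2])) continue;   // count only play events
            String key = f[0] + "/" + f[1];       // the map-output key
            counts.merge(key, 1, Integer::sum);   // the "reduce" step
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
                "2014-01-01,v1,PLAY",
                "2014-01-01,v1,STOP",
                "2014-01-01,v1,PLAY",
                "2014-01-02,v2,PLAY");
        System.out.println(countPlays(logs));
        // {2014-01-01/v1=2, 2014-01-02/v2=1}
    }
}
```

In the real job, the key construction happens in the mapper and the summation in the reducer; the date-range HTTP query then becomes a simple indexed range scan over the MySQL (or Hive) table.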
Deciding to use Hadoop is an investment, as you'll need clusters and development/operational effort.
For a Hadoop solution to make sense, your data must be big. Big, as in terabytes of data, coming in real fast, possibly without proper catalog information. If you can store/process your data in your current environment, run your analysis there.
Assuming your aim is not educational, I strongly recommend you to reconsider your choice of Hadoop. Unless you have real big data, it'll only cost you more effort.
On the other hand, if you really need a distributed solution, I think your approach of daily runs is correct, except that there are better alternatives to writing a raw MapReduce job, such as Hive, Pig or Spark.

Most simple "Word hunter" example for GAE MapReduce Library

Up until now I have not been able to find a really concise working example for the GAE MapReduce library.
Most are plain snippets, if not just ideas. The example that Google provides is not easy to understand, consisting of pipelined jobs, etc.
I have data in the datastore, and I need to check each entity using MapReduce to see if a given word exists; if it does, I need to update the boolean wordFound field to true on the same datastore entity.
How can I easily do this with GAE MapReduce library?
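Whatever the surrounding GAE MapReduce plumbing looks like, the per-entity work inside the mapper is just a word check plus a flag update. A plain-Java sketch of exactly that part (the GAE mapper class and datastore put calls are omitted; a HashMap stands in for the entity, and only wordFound comes from the question, the text field name is made up):

```java
import java.util.*;

public class WordHunter {
    // True if `word` occurs as a whole word in `text`, case-insensitively.
    static boolean containsWord(String text, String word) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.equals(word.toLowerCase())) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for one datastore entity: some text plus the flag.
        Map<String, Object> entity = new HashMap<>();
        entity.put("text", "the quick brown fox");
        entity.put("wordFound", false);

        // This is the body a map(...) call would run per entity;
        // in GAE you would then re-put the mutated entity.
        if (containsWord((String) entity.get("text"), "fox")) {
            entity.put("wordFound", true);
        }
        System.out.println(entity.get("wordFound")); // true
    }
}
```

A map-only job suffices here, since each entity is checked and updated independently with no cross-entity aggregation.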

Does mahout work real time or does it pre-process the data based on the algorithm rules?

I am trying to build a recommendation engine, and for that I am thinking of using Apache Mahout, but I am unable to make out whether Mahout processes the data in real time, or pre-processes the data when the server is idle and stores the results somewhere in a database.
Also, does anyone have any idea what approach sites like Amazon and Netflix follow?
Either/or, but not both. There are parts inside, from an older project, that are essentially real-time at moderate scale. There are also Hadoop-based implementations, which are all offline. The two are not related.
I am a primary creator of these parts, and if you want a system that does both together, I suggest you look at my current project Myrrix (http://myrrix.com)

Are there any samples for appengine Java report generation?

We are using AppEngine and the datastore for our application where we have a moderately large table of information containing a list with entries.
I would like to summarize the list of entries in a report specifying how many times each one appears e.g. normally in SQL I would just use a select distinct for a column, then loop over every entry and just use select count(x) where value = valueOfEntry.
While the count portion is easily done, the distinct problem is "a problem". The only solution I could find remotely close to this is MapReduce and most of the samples are based on Python. There is this blog entry which is very helpful but somewhat outdated since it predated the reduce portion. Then there is the video here and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, but it requires so many hoops. Is there no sample or existing reporting engine I can just plug into AppEngine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of app engine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of the mapreduce for java. Although it is incomplete, it shares most of the architecture with the python version. In that document, there's also a pointer to a complete java sample mapreduce app, that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output Class. But you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update another entry in an EntityDistinct data model.
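That write-time alternative amounts to maintaining a running counter keyed by the distinct value, updated alongside every entity write instead of computed after the fact. A plain-Java sketch of the idea (the class, method names and "videoA"/"videoB" values are hypothetical; in AppEngine each counter would itself be a datastore entity, updated transactionally or via sharded counters to avoid contention):

```java
import java.util.*;

public class EntityDistinctCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Called alongside every entity write: bumps the counter for that
    // value, mirroring an update to an EntityDistinct record.
    void onWrite(String value) {
        counts.merge(value, 1L, Long::sum);
    }

    long countFor(String value) {
        return counts.getOrDefault(value, 0L);
    }

    public static void main(String[] args) {
        EntityDistinctCounter counter = new EntityDistinctCounter();
        counter.onWrite("videoA");
        counter.onWrite("videoB");
        counter.onWrite("videoA");
        System.out.println(counter.countFor("videoA")); // 2
        System.out.println(counter.countFor("videoB")); // 1
    }
}
```

With this in place, the "select distinct + count" report becomes a plain scan over the pre-aggregated counter entities, with no MapReduce run needed at report time.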
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest you look again at BigQuery. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, which is a Python project that loads logs into BigQuery using MapReduce. Or you could also have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
Related to the friction: remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. Remember you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1
