I understand that a reducer pulls map output over HTTP. But since each map task merges all its spills into one file, how can a reduce task pull its intermediate data from the map task? Just a piece of that file?
The output of a map task is sorted by partition number, and each partition number corresponds to one reducer. When a reducer pulls the output, the file pointer is offset to the starting position of that reducer's partition and reading begins from there. Of course, a partition-number-to-file-offset table is maintained on the mapper side to make this possible.
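Conceptually it works like a small per-partition index kept next to the merged map output. Here is a minimal sketch of the idea only (these are not Hadoop's actual internal classes; Hadoop keeps a similar table in an index file written alongside the map output):

import java.io.IOException;
import java.io.RandomAccessFile;

public class MapOutputReader {
    // One entry per reduce partition: where that partition starts in the
    // merged map output file and how many bytes it occupies.
    static class IndexEntry {
        final long startOffset;
        final long length;
        IndexEntry(long startOffset, long length) {
            this.startOffset = startOffset;
            this.length = length;
        }
    }

    // Serve only the slice belonging to the requesting reducer.
    static byte[] readPartition(String mapOutputFile, IndexEntry entry) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(mapOutputFile, "r")) {
            raf.seek(entry.startOffset);                  // jump to the partition's start
            byte[] buf = new byte[(int) entry.length];    // assumes the slice fits in memory
            raf.readFully(buf);                           // read exactly that partition
            return buf;
        }
    }
}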
I am working on a Map/Reduce algorithm where I am trying to merge two or more trees in a single reducer (I will try to fine-tune the number of trees merged in one reducer later). I am trying to implement this algorithm using N reducer rounds.
I have tried to solve this problem using ChainReducer, but it allows only one reducer to be defined (I could probably still build that chain using a loop). Moreover, I would like to define custom logic that specifies when to emit the result.
Here's a diagram of my algorithm's architecture:
You can make use of job control, which lets you execute a number of MapReduce jobs in sequence. In your case there are three reducer phases and only one mapper phase, so you can chain three MapReduce jobs, and for the jobs that need only the reducer action you can use identity mappers.
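A rough sketch of that chaining with the JobControl API is below (class names such as TreeMapper and TreeMergeReducer are placeholders for your own code, input/output path setup is omitted, and a third round would just be one more ControlledJob following the same pattern):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TreeMergeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Round 1: your real mapper plus the first tree-merge reducer.
        Job round1 = Job.getInstance(conf, "tree-merge-round-1");
        // round1.setMapperClass(TreeMapper.class);         // placeholder for your mapper
        // round1.setReducerClass(TreeMergeReducer.class);  // placeholder for your reducer

        // Round 2: identity mapper, second reducer round. The base Mapper class
        // simply passes records through, so it serves as the identity mapper.
        Job round2 = Job.getInstance(conf, "tree-merge-round-2");
        round2.setMapperClass(Mapper.class);
        // round2.setReducerClass(TreeMergeReducer.class);

        ControlledJob cj1 = new ControlledJob(round1, null);
        ControlledJob cj2 = new ControlledJob(round2, null);
        cj2.addDependingJob(cj1);   // round 2 starts only after round 1 succeeds

        JobControl control = new JobControl("tree-merge");
        control.addJob(cj1);
        control.addJob(cj2);

        new Thread(control).start();   // JobControl implements Runnable
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}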
I need to compute an aggregate over an HBase table.
Say I have this HBase table: 'metadata', with column family M and column n.
Here the metadata object has a list of strings:
class metadata {
    List<String> tags;
}
I need to compute the count of tags, for which I was thinking of using either MapReduce or a scan over HBase directly.
The result has to be returned on the fly. So which one should I use in this scenario: scan over HBase and compute the aggregate, or MapReduce?
MapReduce is ultimately going to scan HBase and compute the count anyway.
What are the pros and cons of using either of these?
I suspect you're not aware of the pros and cons of HBase: it's not suited for computing real-time aggregations over large datasets.
Let's start by saying that MapReduce is a scheduled job in itself; you won't be able to return the response on the fly. Expect no less than 15 seconds for the TaskTracker to initialize the job.
In the end, the MapReduce job will do exactly the same thing: an HBase scan. The difference between performing the scan right away and going through MapReduce is just the parallelization and data locality, which excel when you have millions or billions of rows. If your query only needs to read a few thousand consecutive rows to aggregate them, sure, you could just do a scan and it will probably have an acceptable response time, but for larger datasets it's just going to be impossible to do at query time.
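For that small-dataset case, a direct scan-and-count could look roughly like this (the table name, family M, and qualifier n come from your question; how the tag list is parsed is an assumption about how you serialize it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectScanCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metadata"))) {

            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("M"), Bytes.toBytes("n")); // family M, qualifier n
            scan.setCaching(1000); // fetch rows in batches to cut down on RPCs

            long tagCount = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("M"), Bytes.toBytes("n"));
                    // Assumes the tags are serialized as a comma-separated string;
                    // adapt the parsing to however the metadata object is actually stored.
                    if (value != null) {
                        tagCount += Bytes.toString(value).split(",").length;
                    }
                }
            }
            System.out.println("total tags: " + tagCount);
        }
    }
}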
HBase is best suited for handling tons of atomic reads and writes; that way, you can maintain those aggregations in real time, no matter how many pre-aggregated counters you need or how many requests you're going to receive: with a proper row key design and split policy you can scale to satisfy the demand.
Think of it as a word count: you could store all the words in a list and count them at query time when requested, or you could process that list at insert time and store the number of times each word is used, per document, as a global counter, and in daily, monthly, yearly, per-country, and per-author tables (or even column families).
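As an illustration of the insert-time approach, here is a sketch that keeps a per-tag counter with HBase's atomic increments (the tag_counts table and the c family are made-up names for this example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TagCounter {
    // Atomically bump the per-tag counter at write time, so reading the
    // aggregate later is a single Get instead of a full scan.
    public static void recordTag(Connection connection, String tag) throws Exception {
        try (Table counters = connection.getTable(TableName.valueOf("tag_counts"))) {
            counters.incrementColumnValue(
                    Bytes.toBytes(tag),       // row key: the tag itself
                    Bytes.toBytes("c"),       // column family
                    Bytes.toBytes("count"),   // qualifier
                    1L);                      // increment amount
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            recordTag(connection, "hadoop");
        }
    }
}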
I'm learning Hadoop using the book Hadoop in Practice, and while reading chapter 1 I came across this diagram:
From the Hadoop docs (http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html):
1. Shuffle
Reducer is input the grouped output of a Mapper. In this phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
2. Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
While I understand that shuffle and sort happen at the same time, it's not clear to me how the framework decides which reducer receives which mapper output. From the docs, it seems that each reducer has a way to know which map output to collect, but I can't understand how.
So my question is: given the mapper output above, is the final result always the same for each reducer? If so, what are the steps to achieve this result?
Thanks for any clarifications!
It is the Partitioner that decides how to distribute the output of mappers to different reducers.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
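For reference, a partitioner that mirrors what the default HashPartitioner does would look roughly like this; because every map task applies the same function, reducer i always fetches partition i from every map output:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what the default hash partitioning does: the key's hash, masked to
// stay non-negative, modulo the number of reduce tasks. Identical keys from
// different mappers therefore always land in the same numbered partition.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

If the default key hashing isn't what you want, you plug your own logic in with job.setPartitionerClass(KeyHashPartitioner.class).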
I need to anonymize GBs of data consisting of thousands of files. Doing this normally takes forever; hence, I plan to use an already installed pseudo-distributed Hadoop cluster on our server.
Anonymization needs to be done on a couple of columns for each record in every file, and these anonymized columns are to be stored in a hash map.
Ideally, I would like a mapper instance to process each file and produce a corresponding anonymized output file. In addition, mappers should emit the anonymized columns as key-value pairs, which a reducer would aggregate into a single file.
Is the above process possible to achieve in the Hadoop framework? If not, is there a better way to do this? Any help or suggestion is appreciated. Thanks.
Check out MultipleOutputs. It allows you to define multiple file names for the output of the Mapper or Reducer.
As for the anonymization, just make sure the file names you want are anonymized and that the mappers output anonymized keys: context.write(anonymized(key), value);
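A rough sketch of such a mapper follows (anonymize and anonymizedColumn are placeholders for your own logic, and you would anonymize the output file name too if the input names themselves are sensitive):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class AnonymizeMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String anonymizedLine = anonymize(line.toString());   // your anonymization logic

        // Name the side output after the input split so each input file gets a
        // corresponding anonymized output file (slashes create subdirectories).
        String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
        out.write(new Text(anonymizedLine), new Text(""), "anonymized/" + inputFile);

        // Also emit the anonymized column as a key for the reducer to aggregate.
        context.write(new Text(anonymizedColumn(line.toString())), new Text(inputFile));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();   // required, otherwise the side files may be incomplete
    }

    // Placeholders: substitute your real hashing/masking functions here.
    private String anonymize(String record) { return record; }
    private String anonymizedColumn(String record) { return record; }
}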
I'm sorry that I don't have a deep understanding of HBase and Hadoop MapReduce, but I think you can help me find the way to use them, or perhaps suggest the frameworks I need.
Part I
There is a 1st stream of records that I have to store somewhere. They should be accessible by keys derived from the records themselves. Several records can have the same key, and there are quite a lot of them. I have to delete old records after a timeout.
There is also a 2nd stream of records, which is very intensive too. For each record (argument-record) I need to: get all records from the 1st stream with that argument-record's key, find the first corresponding record, delete it from the 1st stream's storage, and return the result (res1) of merging those two records.
Part II
The 3rd stream of records is like the 1st. Records should be accessible by keys (different from those in Part I). As usual, several records will have the same key, but there are not as many of them as in the 1st stream. I have to delete old records after a timeout.
For each res1 (argument-record) I have to: get all records from the 3rd stream with that record's other key, map these records with res1 as a parameter, and reduce them into a result. The 3rd stream's records should stay unmodified in storage.
Records with the same key should preferably be stored on the same node, and the procedures that fetch records by key and act on a given argument-record should preferably run on the node where those records are.
Are HBase and Hadoop MapReduce applicable in my case? And what should such an app look like (the basic idea)? If the answer is no, are there frameworks for building such an app?
Please ask questions if it's not clear what I want.
I am referring to the storage backend technologies. The front end accepting records can be stateless and therefore trivially scalable.
We have streams of records and we want to join them on the fly. Some of the records should be persisted, while some (as far as I understood, the 1st stream) are transient.
If we take scalability and persistence out of the equation, it can be implemented in a single Java process using a HashMap for randomly accessible data and a TreeMap for data we want to keep sorted.
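Something along these lines, just to fix the idea (Record and merge are placeholders for your real record type and merge logic):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the join, before worrying about scale-out.
public class StreamJoiner {
    static class Record { String key; long timestamp; String payload; }

    // Stream 1: transient records, grouped by key for O(1) lookup by stream-2 records.
    private final Map<String, Deque<Record>> stream1 = new HashMap<>();

    // Stream 3: records we want to keep sorted by key.
    private final TreeMap<String, Deque<Record>> stream3 = new TreeMap<>();

    void addStream1(Record r) {
        stream1.computeIfAbsent(r.key, k -> new ArrayDeque<>()).add(r);
    }

    // A stream-2 record arrives: take the first matching stream-1 record,
    // remove it from storage, and return the merged result (res1).
    Record onStream2(Record arg) {
        Deque<Record> candidates = stream1.get(arg.key);
        if (candidates == null || candidates.isEmpty()) {
            return null;                        // no match yet
        }
        Record first = candidates.pollFirst();  // delete it from stream-1 storage
        return merge(first, arg);
    }

    private Record merge(Record a, Record b) { return a; }  // placeholder merge
}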
Now let's see how it can be mapped onto NoSQL technologies to gain the scalability and performance we need.
HBase is a distributed sorted map, so it can be a good candidate for stream 2. If we use our key as the HBase table key, we will gain data locality for records with the same key.
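For example, a sketch of the write path could use a composite row key so several records can share a logical key and still sit next to each other in the same region (the stream_records table and the d family are made-up names for this sketch):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SortedStreamStore {
    public static void store(Connection connection, String key, long ts, byte[] record)
            throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("stream_records"))) {
            // Composite row key: the logical key plus a timestamp, so several records
            // can share the same key yet stay adjacent (and sorted) in one region,
            // which gives the data locality mentioned above.
            byte[] rowKey = Bytes.add(Bytes.toBytes(key), Bytes.toBytes(ts));
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("record"), record);
            table.put(put);
        }
    }
}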
MapReduce on top of HBase is also available.
Stream 1 looks like transient, randomly accessed data. I think it does not make sense to pay the price of persistence for those records, so a distributed in-memory hashtable should do, for example http://memcached.org/. The storage element there would probably be the list of records sharing a key.
I am still not 100% sure about the 3rd stream's requirements, but the need for a secondary index (if it is known beforehand) can be implemented at the application level as another distributed map.
In a nutshell, my suggestion is to pick HBase for data you want to persist and store sorted, and to consider more lightweight solutions for transient (but still considerably large) data.