I'm learning Hadoop using the book Hadoop in Practice, and while reading chapter 1 I came across this diagram:
From the Hadoop docs (http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html):
1. Shuffle
Reducer is input the grouped output of a Mapper. In this phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
2. Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
While I understand that shuffle and sorting happen at the same time, it's not clear to me how the framework decides which reducer receives which mapper output. From the docs, it seems that each reducer has a way to know which map output to collect, but I can't understand how.
So my question is: given the mapper outputs above, is the final result always the same for each reducer? If so, what are the steps to achieve this result?
Thanks for any clarifications!
It is the Partitioner that decides how to distribute the output of mappers to different reducers.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
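For illustration, here is a minimal sketch of such a partitioner in the style of the default HashPartitioner; the Text/IntWritable types and the class name are just placeholders for your own key/value classes:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a key-based partitioner, equivalent in spirit to the default HashPartitioner:
// every map task computes the same partition index for a given key, so all values for
// that key are fetched by the same reducer during the shuffle.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is non-negative, then map into [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

This also answers the "is the result always the same" part of the question: for a given partitioner and a given number of reduce tasks, yes; change either one and the assignment of keys to reducers can change.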
I'm learning spring-batch. I'm currently working with biological data that look like this:
interface Variant {
    public String getChromosome();
    public int getPosition();
    public Set<String> getGenes();
}
(A Variant is a position on the genome which may overlap some genes.)
I've already written some ItemReaders/ItemWriters.
Now I would like to run some analysis per gene. That is, I would like to split my workflow for each gene (gene1, gene2, ... geneN) to compute some statistics about all the variants linked to one gene.
What is the best way to implement a Partitioner for this (is it the correct class anyway)? All the examples I've seen use some 'indexes' or a finite gridSize. Furthermore, must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and let Spring Batch run no more than gridSize jobs in parallel? And how can I join the data at the end?
Thanks!
EDIT: or maybe I should look at MultiResourceItemWriter?
When using Spring Batch's partitioning capabilities, there are two main classes involved, the Partitioner and the PartitionHandler.
Partitioner
The Partitioner interface is responsible for dividing up the data to be processed into partitions. It has a single method Partitioner#partition(int gridSize) that is responsible for analyzing the data that is to be partitioned and returning a Map with one entry per partition. The gridSize parameter is really just a piece of input into the overall calculation that can be used or ignored. For example, if the gridSize is 5, I may choose to return exactly 5 partitions, I may choose to overpartition and return some multiple of 5, or I may analyze the data and realize that I only need 3 partitions and completely ignore the gridSize value.
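To make that contract concrete for the gene use case, here is a hedged sketch that creates one partition per gene and ignores gridSize entirely; the geneNames list is a placeholder for however you actually look the genes up, and the "gene" key is an arbitrary name you would read back in a step-scoped reader:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Sketch only: one partition per gene, so each worker step processes the variants of one gene.
public class GenePartitioner implements Partitioner {

    private final List<String> geneNames; // assumed to be looked up elsewhere (file, database, ...)

    public GenePartitioner(List<String> geneNames) {
        this.geneNames = geneNames;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (String gene : geneNames) {
            ExecutionContext context = new ExecutionContext();
            // A step-scoped reader can pick this up via #{stepExecutionContext['gene']}
            context.putString("gene", gene);
            partitions.put("partition-" + gene, context);
        }
        return partitions;
    }
}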
PartitionHandler
The PartitionHandler is responsible for the delegation of the partitions returned by the Partitioner to workers. Within the Spring ecosystem, there are three provided PartitionHandler implementations, a TaskExecutorPartitionHandler that delegates the work to threads internal to the current JVM, a MessageChannelPartitionHandler that delegates work to remote workers listening on some form of messaging middleware, and a DeployerPartitionHandler out of the Spring Cloud Task project that launches new workers dynamically to execute the provided partitions.
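As a sketch of how the two pieces fit together with the JVM-local option (the step names, the gridSize value, and the GenePartitioner bean from the sketch above are placeholders; supplying a TaskExecutor here makes the builder use a TaskExecutorPartitionHandler behind the scenes):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedStepConfig {

    // Master step: hands each gene partition to the worker step and runs them on local threads.
    @Bean
    public Step masterStep(StepBuilderFactory stepBuilderFactory,
                           GenePartitioner genePartitioner,
                           Step workerStep) {
        return stepBuilderFactory.get("masterStep")
                .partitioner("workerStep", genePartitioner)
                .step(workerStep)
                .gridSize(4)                                  // a hint only, see above
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}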
With all the above laid out, to answer your specific questions:
What is the best way to implement a Partitioner for this (is it the correct class anyway)? That typically depends on the data you're partitioning and the store it's in. Without further insight into how you are storing the gene data, I can't really comment on what the best approach is.
Must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and let Spring Batch run no more than gridSize jobs in parallel? You can return as many items in the Map as you see fit. As mentioned above, the gridSize is really meant as a guide.
How can I join the data at the end? A partitioned step is expected to have each partition processed independently of the others. If you want some form of join at the end, you'll typically do that in a step after the partition step.
Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job searches large datasets for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over sorted data and check whether adjacent values (for the same key) are increasing in at least N consecutive observations, without resorting to the groupByKey method?
I have designed an algorithm to do it with reduceByKey, but there is one problem: reduce seems to ignore data ordering and yields completely wrong results at the end.
Any ideas appreciated.
There are a few ways you can approach this problem:
repartitionAndSortWithinPartitions with a custom partitioner and ordering (a minimal sketch of this option follows the list):
keyBy (name, timestamp) pairs
create custom partitioner which considers only the name
repartitionAndSortWithinPartitions using custom partitioner
use mapPartitions to iterate over data and yield matching sequences
sortByKey - this is similar to the first solution but provides higher granularity at the cost of additional post-processing.
keyBy (name, timestamp) pairs
sortByKey
process individual partitions using mapPartitionsWithIndex keeping track of leading / trailing patterns for each partition
adjust final results to include patterns which span more than one partition
create fixed-size windows over sorted data using sliding from mllib.rdd.RDDFunctions:
sortBy (name, timestamp)
create sliding RDD and filter windows which cover multiple names
check if any window contains desired pattern.
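Here is a minimal Java sketch of option 1, assuming the records are keyed by a (name, timestamp) Tuple2 with the measured value as the RDD value; the class names and the partition count are made up for the example:

import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import scala.Tuple2;

// Partition by name only, so every observation of one name lands in the same partition.
class NamePartitioner extends Partitioner {
    private final int numPartitions;
    NamePartitioner(int numPartitions) { this.numPartitions = numPartitions; }
    @Override public int numPartitions() { return numPartitions; }
    @Override public int getPartition(Object key) {
        String name = ((Tuple2<String, Long>) key)._1();              // ignore the timestamp
        return (name.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Secondary sort inside each partition: by name, then by timestamp (must be Serializable for Spark).
class NameTimestampComparator implements Comparator<Tuple2<String, Long>>, Serializable {
    @Override public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
        int byName = a._1().compareTo(b._1());
        return byName != 0 ? byName : Long.compare(a._2(), b._2());
    }
}

Usage would then be along the lines of keyed.repartitionAndSortWithinPartitions(new NamePartitioner(32), new NameTimestampComparator()), followed by a mapPartitions that walks each already ordered partition and emits runs of at least N consecutive increases per name, without ever materializing a whole group in memory.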
I am working with a Map/Reduce algorithm where I am trying to merge two or more trees in a single reducer (I will try to fine-tune the number of trees merged in one reducer later). I am trying to implement this algorithm using N reducer rounds.
I have tried to solve this problem using ChainReducer, but it allows only one reducer to be defined (I would probably be able to create that chain using a loop). Moreover, I would like to define custom logic to specify when to emit the result.
Here's a diagram of my algorithm architecture:
You can make use of job control, wherein you can execute a number of MapReduce jobs in a sequence. In your case there are three phases in the reducers and only one in the mappers. You can have three MapReduce jobs, and for the jobs where you need only the reducer action you can make use of identity mappers.
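A hedged sketch of that chaining with the JobControl API; the three Job objects (each configured with your tree-merging reducer and, where only the reducer matters, Mapper.class as the identity mapper) are assumed to be built elsewhere, with each round reading the previous round's output directory:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TreeMergeDriver {

    // round1..round3 are assumed to be fully configured Jobs; round N+1 reads round N's output.
    public static void runRounds(Job round1, Job round2, Job round3) throws Exception {
        ControlledJob c1 = new ControlledJob(round1, null);
        ControlledJob c2 = new ControlledJob(round2, null);
        ControlledJob c3 = new ControlledJob(round3, null);

        c2.addDependingJob(c1);   // round 2 starts only after round 1 succeeds
        c3.addDependingJob(c2);   // round 3 starts only after round 2 succeeds

        JobControl control = new JobControl("tree-merge");
        control.addJob(c1);
        control.addJob(c2);
        control.addJob(c3);

        new Thread(control).start();          // JobControl implements Runnable
        while (!control.allFinished()) {
            Thread.sleep(1000);               // poll until every round is done
        }
        control.stop();
    }
}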
I need to compute an aggregate over an HBase table.
Say I have this HBase table: 'metadata', column family: M, column: n.
Here the metadata object has a list of strings:
class metadata {
    List<String> tags;
}
I need to compute the count of tags, and for that I was thinking of either using MapReduce or scanning over HBase directly.
The result has to be returned on the fly. So which one should I use in this scenario: scan over HBase and compute the aggregate, or MapReduce?
MapReduce is ultimately going to scan HBase and compute the count anyway.
What are the pros and cons of using either of these?
I suspect you're not aware of the pros and cons of HBase: it's not suited for computing real-time aggregations of large datasets.
Let's start by saying that MapReduce is a scheduled job in itself; you won't be able to return the response on the fly. Expect no less than 15 seconds for the TaskTracker to initialize the job.
In the end, the MapReduce job will do exactly the same thing: an HBase scan. The difference between performing the scan right away and doing it through MapReduce is just the parallelization and data locality, which excel when you have millions/billions of rows. If your query only needs to read a few thousand consecutive rows to aggregate them, sure, you could just do a scan and it will probably have an acceptable response time, but for larger datasets it's just going to be impossible to do that at query time.
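For the small-range case, a client-side scan that counts tags at query time might look like this sketch; the table, family, and qualifier come from the question, the row-range bounds are placeholders, and it assumes (as an assumption, not something stated in the question) that the tags are serialized as a comma-separated string:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TagScanCount {
    // Counts tags over a bounded row range at query time; fine for a few thousand rows,
    // too slow for millions.
    public static long countTags(byte[] startRow, byte[] stopRow) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long count = 0;
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metadata"))) {
            Scan scan = new Scan().withStartRow(startRow).withStopRow(stopRow)
                                  .addColumn(Bytes.toBytes("M"), Bytes.toBytes("n"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("M"), Bytes.toBytes("n"));
                    // Assumption: tags stored as a comma-separated string; adjust to your format.
                    count += Bytes.toString(value).split(",").length;
                }
            }
        }
        return count;
    }
}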
HBase is best suited for handling tons of atomic reads and writes; that way, you can maintain those aggregations in real time, no matter how many pre-aggregated counters you need or how many requests you're going to receive: with a proper row key design and split policy you can scale to satisfy the demand.
Think of it as a word count: you could store all the words in a list and count them at query time when requested, or you could process that list at insert time and store the number of times each word is used in the document as a global counter, and in daily, monthly, yearly, per-country and per-author tables (or even column families).
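A sketch of that insert-time counter idea using HBase's atomic increments; the 'tag_counters' table and its row/family/qualifier names are entirely hypothetical here:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TagCounters {
    // Called whenever a new metadata row is written: bump a global counter and a
    // per-day counter by the number of tags, instead of counting at query time.
    public static void onInsert(Connection connection, String day, int numTags) throws Exception {
        try (Table counters = connection.getTable(TableName.valueOf("tag_counters"))) {
            counters.incrementColumnValue(Bytes.toBytes("global"),
                    Bytes.toBytes("c"), Bytes.toBytes("total"), numTags);
            counters.incrementColumnValue(Bytes.toBytes("day:" + day),
                    Bytes.toBytes("c"), Bytes.toBytes("total"), numTags);
        }
    }
}

Reading an aggregate then becomes a single Get on the relevant counter row, which keeps the query-time work constant no matter how large the dataset grows.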
I understand that a reducer pulls map output over HTTP. But since each map task merges all its spills into one file, how can a reduce task pull its intermediate data from a map task? Just a piece of that file?
The output of each map task is sorted by partition number, and each partition number corresponds to one reducer. When a reducer pulls the output, the file pointer is offset to the starting position of that reducer's partition and reading starts from there. Of course, a partition-number-to-file-offset table is maintained on the mapper side to achieve this.
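Conceptually, that table is just one small entry per reduce partition; a hedged illustration of what each entry carries (this is not Hadoop's actual internal class, just the idea):

// One entry per reduce partition in a map task's output index.
// Reducer r seeks to entries[r].startOffset in the merged map output file
// and reads entries[r].byteLength bytes -- exactly its slice of the file.
public class PartitionIndexEntry {
    public final long startOffset;  // where partition r begins in the merged output
    public final long byteLength;   // how many bytes belong to partition r

    public PartitionIndexEntry(long startOffset, long byteLength) {
        this.startOffset = startOffset;
        this.byteLength = byteLength;
    }
}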