Struggling between Job and JobConf while using TableMapReduceUtil and MultipleOutputFormat - java

I am trying to read an HBase table with TableMapReduceUtil and dump the data into HDFS (don't ask me why; it is weird, but I don't have any other option). To achieve that, I want to manipulate the final file names (emitted by the reducer) with respect to the reducer key.
On the mapper side I was able to dump HBase rows to HDFS in the default order. But to override the reducer's output file format (naming files per key), I figured I need the MultipleOutputFormat class for the reducer, which is missing from the new 0.20 API (due to some interface mess-up, as I read somewhere), and the old one takes only JobConf. But if I write the code against the old JobConf API, I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
I don't have much hands-on experience with Hadoop/HBase; I have only spent some time modifying existing MR jobs.
It seems I am stuck with my approach.
Versions: Hadoop-Core 0.20; HBase 0.90.1
Thanks
Pankaj

I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
There are org.apache.hadoop.hbase.mapred.TableMapReduceUtil and org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil classes. The first will take JobConf (old MR API) and the second will take Job (new MR API). Use the appropriate TableMapReduceUtil class.
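To stay on the old API end to end (so MultipleOutputFormat remains available), the driver can be wired up roughly as below. This is only a sketch, not tested against 0.90: the table name, column, output path, and the MyTableMap, MyReducer, and MyMultipleTextOutputFormat classes are placeholders for your own code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;   // note: mapred, not mapreduce
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HBaseToHdfsDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(HBaseConfiguration.create(), HBaseToHdfsDriver.class);
        jobConf.setJobName("hbase-to-hdfs");

        // Old-API helper: sets up the scan and TableInputFormat for you.
        // MyTableMap implements org.apache.hadoop.hbase.mapred.TableMap and emits Text/Text.
        TableMapReduceUtil.initTableMapJob(
                "my_table",        // table to read
                "cf:qual",         // space-separated list of columns to scan
                MyTableMap.class,
                Text.class,        // mapper output key class
                Text.class,        // mapper output value class
                jobConf);

        jobConf.setReducerClass(MyReducer.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        // A MultipleTextOutputFormat subclass that derives file names from the reducer key.
        jobConf.setOutputFormat(MyMultipleTextOutputFormat.class);
        FileOutputFormat.setOutputPath(jobConf, new Path("/output/hbase-dump"));

        JobClient.runJob(jobConf);
    }
}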

Related

How to use the same data with all mappers?

As we know, the number of mappers is determined by the input splits. So here is the problem: if I want to implement a random forest algorithm with MapReduce, each mapper requires all the data. What should I do in that case? Can we "reuse" the same data for different mappers?
Would setNumMapTasks work? I am quite confused about that method, and I could hardly find any information on how it interacts with the natural number of mappers determined by the number of input splits.
Thank you so much.
Side data is data shared by all mappers. You will want to broadcast the data to the mappers as part of the Job setup.
This is accomplished via the DistributedCache: https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/filecache/DistributedCache.html
Here are some code starting points. First, place the files you want to share into the DistributedCache via the Job class:
job.addCacheFile(new URI("<your file location>"));
In the mapper/reducer you can then access the localized file via the normal Java file API:
File file = new File("<my file name>");
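Putting those two pieces together, here is a minimal mapper sketch (assuming Hadoop 2.x). If the driver registers the file with a fragment, e.g. job.addCacheFile(new URI("hdfs:///shared/side_data.txt#sidedata")), a symlink named sidedata shows up in each task's working directory and can be read once in setup(); the path and names here are placeholders.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> sideData = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the localized copy of the cached file once per mapper.
        try (BufferedReader reader = new BufferedReader(new FileReader("sidedata"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sideData.add(line);
            }
        }
    }

    // map() can now use sideData for every record it processes.
}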

Best way for a job to update output from another job

Here is my scenario. I have a job that processes a large amount of CSV data and writes it out using Avro into files divided up by date. I have been given a small file that I want to use to add entries to a few of these files, via a second job that I can run whenever this needs to happen, instead of reprocessing the whole data set again.
Here is sort of what the idea looks like:
Job1: Process lots of CSV data and write it out as compressed Avro files, split into files by entry date. The source data is not divided by date, so this job will do that.
Job2 (run as needed between Job1 runs): Process the small update file and use it to add the entries to the appropriate Avro file. If that file doesn't exist, create a new one.
Job3 (always runs): Produce some metrics for reporting from the output of Job1 (and possibly Job2).
So, I have to do it this way, writing Java jobs. My first job seems to work fine, and so does Job3. I'm not sure how to approach Job2.
Here is what I was thinking:
Pass the update file in using the distributed cache. Parse this file in the Job class to produce a list of dates, and use that list to filter the files from Job1, which will be the input of this job.
In the mapper, access the distributed update file and add its entries to the collection of Avro objects I've read in. What if the file doesn't exist yet at this point? Does this work?
Use the reducer to write out the new object collection.
Is this how one would implement this? If not, what is a better way? Does a combiner make sense here? I feel like the answer is no.
Thanks in advance.
You can follow the approach below:
1) Run Job1 on all your CSV files.
2) Run Job2 on the small file and create its new output.
3) For the update, you need to run one more job: in this job, load the output of Job2 in the setup() method and take the output of Job1 as the map() input. Then write the update logic and generate the final output (a rough sketch follows below).
4) Then run your Job3 for processing.
In my opinion, this will work.
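A rough sketch of step 3 follows. It uses plain tab-separated text as a stand-in for the Avro records, and the Job2 output path, the key/value layout, the class name, and the merge rule (replace by id) are all illustrative assumptions; the point is only to show loading the small Job2 output once in setup() and merging during map().

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpdateMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> updates = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load Job2's small output from HDFS; the path is a placeholder.
        Path updatePath = new Path("/output/job2/part-r-00000");
        FileSystem fs = updatePath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(updatePath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumed layout: id <TAB> record
                updates.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);   // Job1 output, same assumed layout
        // Emit the updated record if Job2 has one for this id, otherwise pass the original through.
        String record = updates.containsKey(parts[0]) ? updates.get(parts[0]) : parts[1];
        context.write(new Text(parts[0]), new Text(record));
    }
}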
Just one crazy idea: why do you actually need to update Job1's output?
JOB1 does its job, producing one file per date. Why not name it with a unique postfix such as a random UUID?
JOB2 processes the 'update' information, maybe several times. The logic of output file naming is the same: a date-based name plus a unique postfix.
JOB3 collects the JOB1 and JOB2 outputs, grouping them into splits by date prefix (with all postfixes) and taking those as its input.
If date-based grouping is the target, you get a lot of advantages; to me the obvious ones are:
You don't care about 'whether you have output from JOB1 for this date'.
You don't even care if you need to update one JOB1 output with several JOB2 results.
You don't break the HDFS 'no file update' limitation, and you keep the full power of straightforward 'write once' processing.
You only need a specific InputFormat for your JOB3, which doesn't look too complex.
If you need to combine data from different sources, no problem.
JOB3 itself can ignore the fact that it receives data from several sources; the InputFormat should take care of that.
Several JOB1 outputs can be combined the same way.
Limitations:
This could produce more small files than you can afford for large datasets and several passes.
You need custom InputFormat.
To me this is a good option, if I understand your case properly and you can (or need to) group files by date as the input for JOB3.
Hope this will help you.
For Job2, you can read the update file in the driver code to filter the input data partitions and set them as the job's input paths (sketched below). You can follow your current approach of reading the update file as a distributed cache file. In case you want to fail the job when you are unable to read the update file, throw an exception in the setup() method itself.
If your update logic does not require aggregation on the reduce side, set Job2 as a map-only job. You might need to build logic in Job3 to identify the updated input partitions, since it will receive both the Job1 output and the Job2 output.
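A rough driver-side sketch of that filtering (new-API FileInputFormat; the paths, the date layout of Job1's output directories, and extractDate() are placeholder assumptions):

// Parse the dates mentioned in the small update file, then add only the matching
// Job1 output directories as this job's input paths.
FileSystem fs = FileSystem.get(job.getConfiguration());
Set<String> dates = new HashSet<String>();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/input/updates.csv"))))) {
    String line;
    while ((line = reader.readLine()) != null) {
        dates.add(extractDate(line));   // extractDate(...) is your own parsing logic
    }
}
for (String date : dates) {
    FileInputFormat.addInputPath(job, new Path("/output/job1/" + date));
}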

How to globally read in an auxiliary data file for a MapReduce application?

I've written a MapReduce application that checks whether a very large set of test points (~3000 sets of x,y,x coordinates) fall within a set of polygons. The input files are formatted as follows:
{Polygon_1 Coords} {TestPointSet_1 Coords}
{Polygon_2 Coords} {TestPointSet_1 Coords}
...
{Polygon_1 Coords} {TestPointSet_2 Coords}
{Polygon_2 Coords} {TestPointSet_2 Coords}
...
There is only 1 input file per MR job, and each file ends up being about 500 MB in size. My code works great and the jobs run within seconds. However, there is a major bottleneck - the time it takes to transfer hundreds of these input files to my Hadoop cluster. I could cut down on the file size significantly if I could figure out a way to read in an auxiliary data file that contains one copy of each TestPointSet and then designate which set to use in my input files.
Is there a way to read in this extra data file and store it globally so that it can be accessed across multiple mapper calls?
This is my first time writing code in MR or Java, so I'm probably unaware of a very simple solution. Thanks in advance!
This can be achieved using Hadoop's DistributedCache feature. DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars, etc.) needed by applications. Google it and you will find code examples.
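For this particular case, the restructuring could look something like the sketch below: keep one copy of each TestPointSet in a cached auxiliary file, reference sets by an ID in the (now much smaller) input files, and build the lookup once per mapper in setup(). The file layout, the #testpoints link name, and the tab-separated format are all assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PolygonMapper extends Mapper<LongWritable, Text, Text, Text> {
    // setId -> coordinates of that TestPointSet
    private final Map<String, String> testPointSets = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // "testpoints" is the link name from job.addCacheFile(new URI(".../testpoints.txt#testpoints"))
        try (BufferedReader reader = new BufferedReader(new FileReader("testpoints"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumed layout: setId <TAB> coordinates
                testPointSets.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);   // assumed layout: polygonCoords <TAB> setId
        String pointSet = testPointSets.get(parts[1]);
        // ... run the existing point-in-polygon check of polygonCoords against pointSet here ...
    }
}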

MapReduce: How can a streaming mapper know which file data comes from?

I am learning MapReduce. I'm trying as a test to set up a 'join' algorithm that takes in data from two files (which contain the two data sets to join).
For this to work, the mapper needs to know which file each line is from; this way, it can tag it appropriately, so that the reducer doesn't (for instance) join elements from one data set to other elements from the same set.
To complicate the matter, I am using Hadoop Streaming, and the mapper and reducer are written in Python. I understand Java, but the documentation for the Hadoop InputFormat and RecordReader classes is gloriously vague, and I don't understand how I'd make a Streaming-compatible split so that some sort of file identifier could be bundled in along with the data.
Can anyone explain how to set up this input processing in a way that my Python programs can understand?
I found the answer, by the way. In Python, it's:
import os
context = os.environ["map_input_file"]
And 'context' then has the input file name. (Hadoop Streaming exposes job configuration properties to the script as environment variables, with dots replaced by underscores, which is where map_input_file comes from.)

Hadoop MapReduce - one output file for each input

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise, I should implement something similar to the WordCount example. The task is to read in several files, do the word count, and write an output file for each input file.
Hadoop shuffles the output of the map phase (optionally combining it) as input for the reducer, then writes one output file (I guess one per reducer instance that is running). I was wondering if it is possible to write one output file for each input file instead (so keep the words of inputfile1 together and write the result to outputfile1, and so on). Is it possible to override the combiner class, or is there another solution for this? (I'm not sure this should even be solved within a Hadoop task, but that is the exercise.)
Thanks...
The map.input.file configuration parameter holds the name of the file the mapper is currently processing. Get this value in the mapper and use it as the mapper's output key, so that all the key/value pairs from a single file go to one reducer.
The code in the mapper (BTW, I am using the old MR API):
private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

// Assuming TextInputFormat, so the input types are LongWritable/Text.
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}
And use MultipleOutputFormat; this allows writing multiple output files for the job. The file names can be derived from the output keys and values.
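For example, a subclass could derive the output file name from the key emitted above (a sketch; the class name is made up):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each key/value pair to an output file named after the original input file.
public class FileNameOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String inputFile = key.toString();
        // Keep only the last path component, so "hdfs://nn/data/input1.txt" becomes "input1.txt".
        return inputFile.substring(inputFile.lastIndexOf('/') + 1);
    }
}

In the driver, set it with conf.setOutputFormat(FileNameOutputFormat.class).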
Hadoop 'chunks' data into blocks of a configured size; the default is 64MB blocks. You may see where this causes issues for your approach: each mapper may get only a piece of a file. On the other hand, if a file is less than 64MB (or whatever value is configured), each mapper will get exactly one whole file.
I've had a very similar constraint; I needed a set of files (output from a previous reducer in a chain) to be entirely processed by a single mapper. I use the <64MB fact in my solution.
The main thrust of my solution is that I set it up to provide the mapper with the name of the file it needed to process, and internal to the mapper I had it load/read that file. This allows a single mapper to process an entire file. It's not distributed processing of the file, but with the constraint of "I don't want individual files distributed", it works. :)
I had the process that launched my MR job write out the names of the files to process into individual small files; those small files were placed in the job's input directory. As each of those files is <64MB, a single mapper is generated for each one, and map() is called exactly once (as there is only one entry in the file).
I then take the value passed to the mapper and can open the file and do whatever mapping I need to do.
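A minimal sketch of that mapper (new-API style; the per-line processing and key choice are placeholders): each input record's value is a file path, and the mapper opens that file itself.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WholeFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The tiny input file contains a single line: the path of the real file to process.
        Path fileToProcess = new Path(value.toString().trim());
        FileSystem fs = fileToProcess.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(fileToProcess)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Do the real per-line work here; keying by the file name keeps
                // everything from this file together at one reducer.
                context.write(value, new Text(line));
            }
        }
    }
}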
Since Hadoop tries to be smart about how it runs map/reduce processes, you may need to specify the number of reducers so that each mapper's output goes to its own reducer. This can be set via the mapred.reduce.tasks configuration property; in code I do it via job.setNumReduceTasks([NUMBER OF FILES HERE]);
My process had some additional requirements/constraints that may have made this specific solution appealing, but as an example of 1 input file to 1 output file: I've done it, and the basics are laid out above.
HTH
