Hadoop MapReduce - one output file for each input - java

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise I have to implement something similar to the WordCount example. The task is to read in several files, do the word count, and write an output file for each input file.
Hadoop uses a combiner and shuffles the output of the map part as the input for the reducer, then writes one output file (I guess one for each instance that is running). I was wondering if it is possible to write one output file per input file (so keep the words of inputfile1 and write the result to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution for this (I'm not sure whether this should even be solved in a Hadoop task, but that is the exercise)?
Thanks...

The map.input.file configuration parameter holds the name of the file the mapper is currently processing. Get this value in the mapper and use it as the mapper's output key, so that all the k/v pairs from a single file go to one reducer.
The code in the mapper (BTW, I am using the old MR API):
@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {
    // map.input.file holds the path of the input split being processed
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}
Then use MultipleOutputFormat, which allows the job to write multiple output files; the file names can be derived from the output keys and values.
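A minimal sketch of what that could look like with the old API's MultipleTextOutputFormat (the class name PerInputFileOutputFormat is illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class PerInputFileOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // The key carries the input file path emitted by the mapper;
        // use its base name so each input file gets its own output file.
        return new Path(key.toString()).getName();
    }
}
Register it in the driver with conf.setOutputFormat(PerInputFileOutputFormat.class); the reducer output for each input file then lands in its own file.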

Hadoop 'chunks' data into blocks of a configured size; the default is 64MB blocks. You may see where this causes issues for your approach: each mapper may get only a piece of a file. If the file is less than 64MB (or whatever value is configured), then each mapper will get only one file.
I've had a very similar constraint: I needed a set of files (output from a previous reducer in the chain) to be entirely processed by a single mapper. I use the <64MB fact in my solution.
The main thrust of my solution is that I set it up to provide the mapper with the name of the file it needed to process, and inside the mapper had it load/read the file. This allows a single mapper to process an entire file. It's not distributed processing of the file, but with the constraint of "I don't want individual files distributed" it works. :)
I had the process that launched my MR job write out the names of the files to process into individual files. Those files were written to the input directory. As each of those files is <64MB, a single mapper will be generated for each one, and the map method will be called exactly once (as there is only one entry in the file).
I then take the value passed to the mapper, open the file it names, and do whatever mapping I need to do, as in the sketch below.
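A minimal sketch of that kind of mapper (new MR API; the class name ProcessWholeFileMapper and the record layout of one HDFS path per input record are assumptions):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProcessWholeFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The record's value is the path of the file this mapper should handle.
        Path fileToProcess = new Path(value.toString());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(fileToProcess)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Do the real per-line work here; keying by the file name keeps
                // all output for this file together.
                context.write(value, new Text(line));
            }
        }
    }
}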
Since Hadoop tries to be smart about how it runs Map/Reduce processes, it may be necessary to specify the number of reducers to use so that each mapper's output goes to its own reducer. This can be set via the mapred.reduce.tasks configuration; I do this in the driver via job.setNumReduceTasks([NUMBER OF FILES HERE]);
My process had some additional requirements/constraints that may have made this specific solution appealing, but for an example of 1 input file in to 1 output file out: I've done it, and the basics are laid out above.
HTH

Related

Process Multiple Input Files In MapReduce separately

I am working on a MapReduce project (like the word count example) with some changes. In my case I have many files to process when I run the program.
I want each map to take one of the files and process it separately from the others; I want the output for a file to be independent of the other files' output.
I try to use the:
Path filesPath = new Path("file1.txt,file2.txt,file3.txt");
MultipleInputs.addInputPath(job, filesPath, TextInputFormat.class, Map.class);
but the output I got mixes the output of all the files together, and if a word appears in more than one file it is counted only once across them, which is what I don't want.
I want the word count for each file to be separate.
So how can I do this?
If I put the files in a directory, will they be processed independently?
This is the way Hadoop's map-reduce works: the map outputs from all files are merged together, sorted by key, and all records with the same key are fed to the same reducer.
If you want one mapper to see only one file, you have to run one job per file, and you also have to force the configuration to have only one mapper per job, as sketched below.
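A minimal sketch of the one-job-per-file approach, assuming each input file is smaller than one block so every job gets a single mapper; WordCountMapper is a hypothetical standard word-count mapper:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class PerFileWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path inputDir = new Path(args[0]);
        Path outputBase = new Path(args[1]);

        FileSystem fs = inputDir.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(inputDir)) {
            String name = status.getPath().getName();
            Job job = Job.getInstance(conf, "wordcount-" + name);
            job.setJarByClass(PerFileWordCount.class);
            job.setMapperClass(WordCountMapper.class);   // your ordinary word-count mapper
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, status.getPath());
            // Each input file gets its own output directory.
            FileOutputFormat.setOutputPath(job, new Path(outputBase, name));
            job.waitForCompletion(true);
        }
    }
}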
Within the Map task you will be able to get the file name for the record which is being processed.
Get File Name in Mapper
Once you have the file name you can add it to the map output key to form a composite key, and implement a grouping comparator to group keys from the same file into one reducer.
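A minimal sketch of the mapper side (new MR API; the class name PerFileWordCountMapper and the tab-separated composite key are assumptions):
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerFileWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Name of the file this split belongs to.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                // Composite key "fileName\tword" keeps counts separate per file.
                context.write(new Text(fileName + "\t" + word), ONE);
            }
        }
    }
}
A standard sum reducer over these composite keys then produces per-file counts; the grouping comparator is only needed if you also want all keys from one file routed to the same reducer.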

Accessing resource file when running as a DataflowPipelineRunner in Google Dataflow

In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources next to the src folder.
The src-folder contains the main-class and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists());
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class I made called 'SensorValue':
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to the SensorValue class I made, using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice would be for using static resource files like this to enrich your pipeline.
You're right that the problem is that execution of the ParDo is distributed to multiple workers. They don't have the local file, and they don't have the contents of the map.
There are a few options here:
Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side-input to your later processing.
Include the file in the resources for the pipeline and load that in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
You could serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of this file increases (since it can support splitting it up into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service; a sketch of that option follows.
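A minimal sketch of option 3, assuming the pre-Beam Dataflow Java SDK (1.x); the class name ParseSensorValueFn and the SensorValue.parse helper are illustrative:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import com.google.cloud.dataflow.sdk.transforms.DoFn;

class ParseSensorValueFn extends DoFn<String, SensorValue> {
    // Non-static field: it is serialized with the DoFn and shipped to every
    // worker, so each worker sees the same map (unlike a static set on the launcher).
    private final HashMap<String, ArrayList<Double>> sensorToCoordinates;

    ParseSensorValueFn(Map<String, ArrayList<Double>> sensorToCoordinates) {
        this.sensorToCoordinates = new HashMap<>(sensorToCoordinates);
    }

    @Override
    public void processElement(ProcessContext c) {
        // Hypothetical helper that takes the coordinate map explicitly
        // instead of relying on static state.
        c.output(SensorValue.parse(c.element(), sensorToCoordinates));
    }
}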

Write into Hadoop File System in parallel

I'm quite new to Hadoop, and I have an issue...
I have an output file (the result of a task) and I would like to modify it. As it can be a very big file, I want to do this operation in parallel.
Note: I don't want to simply append data, I want to modify it structurally (even its size), so I have to read it completely and write it back.
Reading the file isn't a problem: I give each worker a part of the file, and they simply have to read it and make the changes they want.
But writing the new file back to HDFS seems more tricky.
My question is: how can I create a big file in HDFS and have my workers write into it simultaneously (I know the size of each part, so two workers will never attempt to write at the same position)?
Thanks in advance :)
Since the job is to read the input file and write selected content from it to an output location in parallel, this is a map-only job.
Create a Mapper class that reads the file and performs your operations on it.
Set the number of mappers in your driver class:
job.setNumMapTasks(n); // n = number of mappers (a hint; the input splits ultimately decide)
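A minimal sketch of such a map-only driver (old MR API; TransformMapper and the argument layout are assumptions):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TransformDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TransformDriver.class);
        conf.setJobName("parallel-rewrite");

        conf.setMapperClass(TransformMapper.class);
        conf.setNumReduceTasks(0);   // map-only: each mapper writes its own part file
        conf.setNumMapTasks(4);      // the n from above

        conf.setOutputKeyClass(NullWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
With zero reducers each mapper writes straight to its own part file in the output directory, which gives you the parallel write without two workers ever touching the same file.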

MapReduce: How can a streaming mapper know which file data comes from?

I am learning MapReduce. I'm trying as a test to set up a 'join' algorithm that takes in data from two files (which contain the two data sets to join).
For this to work, the mapper needs to know which file each line is from; this way, it can tag it appropriately, so that the reducer doesn't (for instance) join elements from one data set to other elements from the same set.
To complicate the matter, I am using Hadoop Streaming, and the mapper and reducer are written in Python; I understand Java, but the documentation for the Hadoop InputFormat and RecordReader classes is gloriously vague, and I don't understand how I'd make a Streaming-compatible split so that some sort of file identifier could be bundled in along with the data.
Anyone who can explain how to set up this input processing in a way that my Python programs can understand?
I found out the answer, by the way. In Python, it's:
import os
context = os.environ["map_input_file"]
And 'context' then has the input file name. (On newer Hadoop versions the variable is named mapreduce_map_input_file.)

How to Override InputFormat and OutputFormat In hadoop Application

I have an application which needs to read a file that is the serialized form of an ArrayList (ArrayList<String>, 50,000 records in the list, about 20MB).
I don't know exactly how to read that data into the Hadoop platform. I only have some sense that I need to override InputFormat and OutputFormat.
I'm a beginner on the Hadoop platform. Could you give me some advice?
Thanks,
Zheng.
To start with you'll need to extend FileInputFormat, notably implementing the abstract FileInputFormat.createRecordReader method.
You can look through the source of something like the LineRecordReader (which is what TextInputFormat uses to process text files).
From there you're pretty much on your own (i.e. it depends on how your ArrayList has been serialized). Look through the source for the LineRecordReader and try and relate that to how your ArrayList has been serialized.
Some other points of note: is your file format splittable? I.e. can you seek to an offset in the file and recover the stream from there (text files can, as they just scan forward to the end of the current line and then start from there)? If your file format uses compression, you also need to take this into account (you cannot, for example, randomly seek to a position in a gzip file). By default FileInputFormat.isSplitable will return true, which you may want to initially override to be false. If you do stick with 'unsplittable', then note that your file will be processed by a single mapper (no matter its size), as in the sketch below.
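A minimal sketch of such an unsplittable input format (new MR API; SerializedListInputFormat and SerializedListRecordReader are illustrative names, and the record reader itself depends entirely on your serialization format):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SerializedListInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The serialized ArrayList cannot be split at arbitrary byte offsets,
        // so let a single mapper read the whole file.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Would deserialize the ArrayList and emit one record per String.
        return new SerializedListRecordReader();
    }
}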
Before processing data on Hadoop you should upload it to HDFS or another supported file system, if of course it wasn't already uploaded there by something else. If you are controlling the uploading process you can convert the data at upload time into something you can easily process, like:
a simple text file (one line per array item)
a SequenceFile, if the array items can contain '\n'
This is the simplest solution since you don't have to interfere with Hadoop's internals. A sketch of the text-file conversion is below.
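A minimal sketch of that conversion, assuming the ArrayList<String> was written with plain Java serialization and taking the local file and target HDFS path as arguments (class name and argument layout are assumptions):
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.io.OutputStreamWriter;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadAsTextFile {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        // args[0]: local serialized ArrayList, args[1]: target HDFS path
        ArrayList<String> items;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(args[0]))) {
            items = (ArrayList<String>) in.readObject();
        }

        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(fs.create(new Path(args[1]))))) {
            for (String item : items) {
                out.write(item);   // assumes items contain no '\n'; otherwise use a SequenceFile
                out.newLine();
            }
        }
    }
}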
