I am working on a MapReduce project (like the WordCount example) with some changes. In my case I have many files to process when I run the program,
and I want each map to take one of the files and process it separately from the others; I want the output for each file to be independent from the other files' output.
I tried to use:
Path filesPath = new Path("file1.txt,file2.txt,file3.txt");
MultipleInputs.addInputPath(job, filesPath, TextInputFormat.class, Map.class);
but the output I got mixes all the files' output together, and if a word appears in more than one file it is processed only once, which is not what I want.
I want the word count for each file kept separate.
So how can I do this?
If I put the files in a directory, will they be processed independently?
This is the way Hadoop's MapReduce works by default: all input files are processed together, the map output is sorted by key, and all records with the same key are fed to the same reducer, regardless of which file they came from.
If you want one mapper to see only one file, you have to run one job per file, and you also have to force the configuration to use only one mapper per job.
Within the map task you can get the name of the file for the record that is being processed:
Get File Name in Mapper
Once you have the file name you can add it to the map output key to form a composite key, and implement a grouping comparator to group keys from the same file into one reducer.
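As a minimal sketch of that idea (new mapreduce API, assuming plain TextInputFormat input; the class and field names are illustrative), the mapper below pulls the file name from the input split and prefixes every word with it, so an ordinary sum reducer produces per-file counts:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: prefix each word with the name of the file it came from,
// so counts are kept separate per file.
public class PerFileWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The input split tells us which file this record belongs to.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            // Composite key: filename + TAB + word, so the same word in
            // different files is counted independently.
            compositeKey.set(fileName + "\t" + word);
            context.write(compositeKey, ONE);
        }
    }
}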
Related
I've looked at FileInputFormat where filename is KEY and text contents are VALUE, How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?, and Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job, but I'm having trouble getting off the ground. Not having done anything with Hadoop before, I'm wary of starting down the wrong path if someone else can see that I'm making a mistake.
I have a directory containing something like 100K small files containing HTML, and I want to create an inverted index using Amazon Elastic MapReduce, implemented in Java. Once I have the file contents, I know what I want my map and reduce functions to do.
After looking here, my understanding is I need to subclass FileInputFormat and override isSplitable. However, my filenames are related to the URLs from which the HTML came, so I want to keep them. Is replacing NullWritable with Text all I need to do? Any other advice?
You should use a WholeFileInputFormat to pass the whole file to your mapper:
conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
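Hadoop does not ship a WholeFileInputFormat, so you write a small one yourself. A sketch (old mapred API, to match the configuration above; it keys each record by the file name, which suits the URL-derived names mentioned in the question):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a whole-file input format that emits (filename, file contents)
// pairs and never splits a file.
public class WholeFileInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;  // one map() call sees the entire file
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
                                                    Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<Text, Text> {
        private final FileSplit split;
        private final JobConf conf;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf conf) {
            this.split = split;
            this.conf = conf;
        }

        @Override
        public boolean next(Text key, Text value) throws IOException {
            if (processed) return false;
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(conf);
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.getName());                 // filename as the key
            value.set(contents, 0, contents.length); // whole file as the value
            processed = true;
            return true;
        }

        @Override public Text createKey() { return new Text(); }
        @Override public Text createValue() { return new Text(); }
        @Override public long getPos() { return processed ? split.getLength() : 0; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() throws IOException { }
    }
}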
Here is my scenario. I have a job that processes a large amount of csv data and writes it out using Avro into files divided up by date. I have been given a small file that I want to use to update a few of these files with additional entries, using a second job I can run whenever this needs to happen, instead of reprocessing the whole data set again.
Here is sort of what the idea looks like:
Job1: Process lots of csv data and write it out in compressed Avro files split into files by entry date. The source data is not divided by date, so this job will do that.
Job2 (run as needed between Job1 runs): Process the small update file and use it to add the entries to the appropriate Avro file. If that file doesn't exist, create a new one.
Job3 (always runs): Produce some metrics for reporting from the output of Job1 (and possibly Job 2).
So, I have to do it this way, writing Java jobs. My first job seems to work fine, and so does the third. I'm not sure how to approach job 2.
Here is what I was thinking:
Pass the update file in using the distributed cache. Parse this file to produce a list of dates in the Job class and use that list to filter the files from Job1, which will be the input of this job.
In the mapper, access the distributed update file and add its entries to the collection of Avro objects I've read in. What if the file doesn't exist here yet? Does this work?
Use the reducer to write the new object collection.
Is this how one would implement this? If not, what is a better way? Does a combiner make sense here? I feel like the answer is no.
Thanks in advance.
You can follow the approach below:
1) Run Job1 on all your csv files.
2) Run Job2 on the small file and create new output.
3) For the update, you need to run one more job. In this job, load the output of Job2 in the setup() method and take the output of Job1 as the map() input. Then write the update logic and generate the final output (see the sketch after this list).
4) Then run your Job3 for processing.
I think this will work.
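A hedged sketch of step 3, assuming purely for illustration that both outputs can be read as tab-separated text with an id in the first column; a real implementation would use the Avro mapred classes and the actual schema, and the "update.file" property name is also an assumption:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: load the (small) Job2 output once in setup(), then apply it to the
// Job1 records that arrive through map().
public class MergeUpdatesMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> updates = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Job2 output is assumed small enough to fit in memory.
        Path updatePath = new Path(context.getConfiguration().get("update.file"));
        FileSystem fs = updatePath.getFileSystem(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(updatePath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // placeholder format: id TAB payload
                updates.put(parts[0], parts[1]);
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable offset, Text job1Record, Context context)
            throws IOException, InterruptedException {
        String[] parts = job1Record.toString().split("\t", 2);
        // Placeholder update logic: replace the record if Job2 produced a newer version.
        String payload = updates.containsKey(parts[0]) ? updates.get(parts[0]) : parts[1];
        context.write(new Text(parts[0] + "\t" + payload), NullWritable.get());
    }
}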
Just one crazy idea: why do you actually need to update the Job1 output?
JOB1 does its job, producing one file per date. Why not add a unique postfix to the name, like a random UUID?
JOB2 processes the 'update' information, maybe several times. The logic of output file naming is the same: a date-based name plus a unique postfix.
JOB3 collects the JOB1 and JOB2 output, grouping the files into splits by date prefix across all postfixes, and takes them as its input.
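For example, if both jobs write files named "<date>-<uuid>" under a common directory (that naming scheme and the paths are assumptions), the JOB3 driver can pick up every file for a date with a glob:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: add every JOB1/JOB2 output file for the given dates as JOB3 input.
// FileInputFormat expands glob patterns in input paths.
public class Job3Inputs {
    public static void addDateInputs(Job job, String baseDir, String... dates)
            throws IOException {
        for (String date : dates) {
            FileInputFormat.addInputPath(job, new Path(baseDir + "/" + date + "-*"));
        }
    }
}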
If date-based grouping is the target, this gives you a lot of advantages; the obvious ones, as I see it:
You don't care about 'whether you have output from JOB1 for this date'.
You don't even care if you need to update one JOB1 output with several JOB2 results.
You don't fight HDFS's 'no file update' limitation, and you keep the full power of straightforward 'write once' processing.
You only need a specific InputFormat for your JOB3, which doesn't look too complex.
If you need to combine data from different sources, no problem.
JOB3 itself can ignore the fact that it receives data from several sources; the InputFormat should take care of that.
Several JOB1 outputs can be combined the same way.
Limitations:
This could produce more small files than you can afford for large datasets and several passes.
You need a custom InputFormat.
To me this looks like a good option, if I understand your case correctly and you can (or need to) group files by date as input for JOB3.
Hope this helps.
For Job2, you can read the update file in the driver code, filter the input data partitions with it, and set them as the input paths. You can follow your current approach of reading the update file as a distributed cache file. If you want the job to fail when the update file cannot be read, throw an exception in the setup() method itself.
If your update logic does not require aggregation on the reduce side, set Job2 up as a map-only job. You might need to build logic in Job3 to identify the updated input partitions, since it will receive both the Job1 output and the Job2 output.
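A hedged sketch of that driver-side filtering, assuming the Job1 output is laid out as "<job1Output>/<date>" directories and the update file carries the date in its first tab-separated column (both assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: read the update file in the driver, work out which date partitions
// it touches, and add only those Job1 output directories as input.
public class Job2Driver {
    public static void addTouchedPartitions(Job job, String updateFile, String job1Output)
            throws IOException {
        Path updatePath = new Path(updateFile);
        FileSystem fs = updatePath.getFileSystem(job.getConfiguration());
        Set<String> dates = new HashSet<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(updatePath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                dates.add(line.split("\t")[0]);   // placeholder: date in the first column
            }
        } finally {
            reader.close();
        }
        for (String date : dates) {
            FileInputFormat.addInputPath(job, new Path(job1Output + "/" + date));
        }
        job.setNumReduceTasks(0);  // map-only job: no reduce-side aggregation needed
    }
}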
Could I set several mapper classes in one job?
For example, I have a csv input file on HDFS and two tasks to do. The first is to count two fields from the csv input file and write the result to one output file; the second is to count another two fields from the same csv input file and write the result to another output file. The reducer is the same.
How can I achieve this with just one job and have them run at the same time? (I don't want to do the first one and then the second after the first finishes; I want them to run in parallel.)
I tried the following code:
job1.setMapperClass(Mapper1.class);
job1.setReducerClass(LogReducer.class);
job1.setMapperClass(Mapper2.class);
job1.setReducerClass(LogReducer.class);
I tried it, but it doesn't work; it only shows me the second result, and the first one is gone.
It clearly needs two jobs to run in parallel. What is the problem with running two jobs in parallel, given that the mapper classes and output paths are different? A job can't handle multiple mappers unless they are chained.
So the question is whether you want one output or two outputs from the reducers. You could map the two inputs, one with Mapper1 and the other with Mapper2, and then pass the merged intermediate results into a reducer to get one output. That means using the MultipleInputs class in a single job, configured in the driver class.
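A minimal driver sketch of that single-job option (Hadoop 2.x API; the input/output paths, the output types and the LogDriver class name are placeholders, while Mapper1, Mapper2 and LogReducer are the classes from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two mappers, one reducer");
        job.setJarByClass(LogDriver.class);
        // MultipleInputs ties each input path to its own mapper;
        // both mappers feed the same reducer.
        MultipleInputs.addInputPath(job, new Path("input1"), TextInputFormat.class, Mapper1.class);
        MultipleInputs.addInputPath(job, new Path("input2"), TextInputFormat.class, Mapper2.class);
        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}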
If you want the reduced results of Mapper1 to be separate from the reduced results of Mapper2, then you need to configure two jobs. The two jobs would have different mappers but would configure the same reducer class.
Have a look at the MultipleOutputs class in Hadoop to write to multiple files from a reducer. Write the output to one file or the other based on conditions in your reduce method.
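A hedged sketch of that (new mapreduce API): the named outputs "fieldsA" and "fieldsB", the key-prefix convention assumed to be set by Mapper1/Mapper2, and the Text/IntWritable types are all assumptions, and the named outputs must also be registered in the driver with MultipleOutputs.addNamedOutput(...).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch: sum the counts and route them to differently named output files
// depending on which mapper (by key prefix) produced them.
public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        // Assumed convention: Mapper1 prefixes its keys with "A|", Mapper2 with "B|".
        if (key.toString().startsWith("A|")) {
            mos.write("fieldsA", key, new IntWritable(sum));
        } else {
            mos.write("fieldsB", key, new IntWritable(sum));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}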
I want Hadoop (0.22.0) to write out the content into different files like:
part-r-00000
part-r-00001
part-r-00002
part-r-00003
Each reducer writing to a different file.
I know I can use the MultipleOutputs class, but that only lets me change the 'part' prefix, which is not what I want. I want to be able to say which reducer uses which output file and what number it gets at the end.
Of course you have that control. When the job has finished (e.g. after job.waitForCompletion(true)), you know the output path and the number of reducers that were used, so you can simply rename the files; that's all. To run more reducers, set the number of reduce tasks, and to control which keys go to which reducer, write a Partitioner class.
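A hedged sketch of the rename step (the output directory and the new naming scheme are placeholders):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class RenameOutputs {
    // Call this after job.waitForCompletion(true) has returned successfully.
    public static void rename(Job job, Path outDir) throws IOException {
        FileSystem fs = FileSystem.get(job.getConfiguration());
        for (int i = 0; i < job.getNumReduceTasks(); i++) {
            Path from = new Path(outDir, String.format("part-r-%05d", i));
            Path to = new Path(outDir, "reducer-" + i + ".out");  // whatever name you want
            if (fs.exists(from)) {
                fs.rename(from, to);
            }
        }
    }
}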
I'm new to Hadoop and I'm trying to figure out how it works. As an exercise I should implement something similar to the WordCount example. The task is to read in several files, do the word count, and write an output file for each input file.
Hadoop uses a combiner and shuffles the output of the map phase as input for the reducer, then writes one output file (I guess one for each reducer instance that is running). I was wondering if it is possible to write one output file for each input file (so keep the words of inputfile1 and write the result to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution for this? (I'm not sure if this should even be solved in a Hadoop task, but that is the exercise.)
Thanks...
The map.input.file configuration property holds the name of the file the mapper is currently processing. Get this value in the mapper and use it as the mapper's output key; then all the key/value pairs from a single file will go to one reducer.
Here is the code in the mapper. BTW, I am using the old MR API:
private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {
    // Emit the source file name as the key so that all records from
    // one file end up at the same reducer.
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}
Then use MultipleOutputFormat; it allows writing multiple output files for the job, with the file names derived from the output keys and values.
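For instance, a small subclass of MultipleTextOutputFormat (old mapred API, to match the mapper above; the Text value type and the assumption that the reducer keeps the file name as its output key are mine) set with conf.setOutputFormat(PerFileOutputFormat.class) would name each output file after the input file carried in the key:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: name each output file after the input file carried in the key,
// and drop that key from the written records.
public class PerFileOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // key is the original input file name set by the mapper/reducer
        return key.toString();
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;   // omit the filename key from the output records
    }
}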
Hadoop 'chunks' data into blocks of a configured size; the default is 64MB blocks. You may see where this causes issues for your approach: each mapper may get only a piece of a file. On the other hand, if a file is less than 64MB (or whatever value is configured), then each mapper will get exactly one whole file.
I've had a very similar constraint; I needed a set of files (output from a previous reducer in a chain) to be entirely processed by a single mapper. I use the <64MB fact in my solution.
The main thrust of my solution is that I set it up to provide the mapper with the name of the file it needs to process, and the mapper itself loads/reads that file. This allows a single mapper to process an entire file. It's not distributed processing of the file, but with the constraint of "I don't want individual files distributed", it works. :)
I had the process that launches my MR job write the names of the files to process into individual files, and those files were written to the input directory. As each of those files is <64MB, a single mapper will be generated for each one, and the map method will be called exactly once (as there is only one entry in the file).
I then take the value passed to the mapper, open that file, and do whatever mapping I need to do.
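A hedged sketch of that mapper, assuming each tiny input file holds a single line naming the real file to process; what to do with each line of the opened file is a placeholder:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: the record passed to map() is the path of the file to process,
// and the mapper opens and reads that file itself.
public class FileNameDrivenMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text fileNameRecord, Context context)
            throws IOException, InterruptedException {
        Path target = new Path(fileNameRecord.toString().trim());
        FileSystem fs = target.getFileSystem(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(target)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Placeholder for the real per-file processing.
                context.write(new Text(target.getName() + "\t" + line), NullWritable.get());
            }
        } finally {
            reader.close();
        }
    }
}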
Since Hadoop tries to be smart about how it runs map/reduce processes, it may be necessary to specify the number of reducers to use so that each mapper's output goes to a single reducer. This corresponds to the mapred.reduce.tasks setting; I do this via job.setNumReduceTasks([NUMBER OF FILES HERE]);
My process had some additional requirements/constraints that may have made this specific solution appealing, but for an example of one file in, one file out: I've done it, and the basics are laid out above.
HTH