Hadoop several mappers - java

Can I set several mapper classes in one job?
For example, I have a CSV input file on HDFS and two tasks to do. The first is to count two fields from the CSV input file and write the result to an output file. The second is to count another two fields from the same CSV input file and write the result to another output file. The reducer is the same.
How can I achieve this using just one job and have both tasks run at the same time? (I don't want to run the first one and then the second after the first finishes; I want them to run in parallel.)
I tried the following code:
job1.setMapperClass(Mapper1.class);
job1.setReducerClass(LogReducer.class);
job1.setMapperClass(Mapper2.class);
job1.setReducerClass(LogReducer.class);
I tried it but it doesn't work; it only shows me the second result, and the first one is gone.

This clearly needs two jobs if they are to run in parallel. What is the problem with running two jobs in parallel, since the mapping tasks and output paths are different? A job can't handle multiple mappers unless they are chained.

So the question is whether you want one output or two outputs from the reducer. You could map the two inputs, one with Mapper1 and the other with Mapper2, and then pass the merged intermediate results into a reducer to get one output. That uses the MultipleInputs class in a single job and is configured in the driver class.
If you want the reduced results of Mapper1 to be separate from the reduced results of Mapper2, then you need to configure two jobs. The two jobs would have different mappers but would configure the same reducer class.
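For the single-job option, here is a minimal driver sketch, assuming the Mapper1, Mapper2 and LogReducer classes from the question. MultipleInputs associates a mapper with each input path, so the two mappers here read from two paths (which could be two copies of, or two directories holding, the same CSV); the key/value types are placeholders to adjust.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoMapperDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two mappers, one reducer");
        job.setJarByClass(TwoMapperDriver.class);

        // Each path gets its own mapper; the intermediate output of both
        // mappers is merged and fed to the single reducer.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, Mapper1.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, Mapper2.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);        // placeholder key/value types
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}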

Have a look at the MultipleOutputs class in Hadoop to write to multiple files from a single reducer. Write the output to the second file based on conditions in your reduce method.
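A hedged sketch of that reducer, assuming two named outputs "count1" and "count2" and that the mappers prefix their keys so the reducer can tell which task a record belongs to (both the output names and the prefix convention are illustrative, not from the question):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Route by the prefix the mappers put on their keys (an assumed convention).
        String out = key.toString().startsWith("m1:") ? "count1" : "count2";
        mos.write(out, key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

Each named output also has to be registered in the driver, e.g. MultipleOutputs.addNamedOutput(job, "count1", TextOutputFormat.class, Text.class, IntWritable.class), and likewise for "count2".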

Related

Process Multiple Input Files In MapReduce separately

I am working on a MapReduce project (like the word count example) with some changes. In my case I have many files to process when I run the program,
and I want each map to take one of the files and process it separately from the others (I want the output for each file to be independent of the other files' output).
I tried to use:
Path filesPath = new Path("file1.txt,file2.txt,file3.txt");
MultipleInputs.addInputPath(job, filesPath, TextInputFormat.class, Map.class);
but the output I got mixes all the files' output together, and if a word appears in more than one file it is counted only once overall, which is not what I want.
I want a separate word count for each file.
So how can I do this?
If I put the files in a directory, will they be processed independently?
This is the way Hadoop MapReduce works: all map outputs are merged together, sorted by key, and all records with the same key are fed to the same reducer.
If you want one mapper to see only one file, you have to run one job per file, and you also have to force the configuration to have only one mapper per job.
Within the Map task you will be able to get the file name for the record which is being processed.
Get File Name in Mapper
Once you have the file name you can add it to the map output key to form a composite key, and implement a grouping comparator to group keys from the same file into one reducer.
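A minimal sketch of such a mapper for the per-file word count, reading the file name from the input split and prefixing it to the word so counts from different files never collapse into one key (the class name and tab separator are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PerFileWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Name of the file whose split is currently being processed.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                compositeKey.set(fileName + "\t" + word);   // composite key: file + word
                context.write(compositeKey, ONE);
            }
        }
    }
}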

spring batch structure for parallel processing

I am seeking some guidance please on how to structure a spring batch application to ingest a bunch of potentially large delimited files, each with a different format.
The requirements are clear:
select the files to ingest from an external source: there can be multiple releases of some files each day so the latest release must be picked
turn each line of each file into json by combining the delimited fields with the column names of the first line (which is skipped)
send each line of json to a RESTFul Api
We have one step which uses a MultiResourceItemReader and processes the files in sequence. The files are input streams which time out.
Ideally I think we want to have
a step which identifies the files to ingest
a step which processes files in parallel
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.
A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
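A hedged sketch of that idea; the class name HeaderAwareTokenizer, the buildReader helper and the file path are made up for illustration. The skipped header line is routed back into the tokenizer so it can set the token names, and a pass-through FieldSetMapper hands the FieldSet on to the processor.

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.core.io.FileSystemResource;

public class HeaderAwareTokenizer extends DelimitedLineTokenizer
        implements LineCallbackHandler {

    @Override
    public void handleLine(String headerLine) {
        // The skipped first line carries the column names: tokenize it with the
        // configured delimiter and use the result as the token names.
        setNames(doTokenize(headerLine).toArray(new String[0]));
    }

    // Wiring sketch: skip the header line but still feed it to the tokenizer.
    public static FlatFileItemReader<FieldSet> buildReader(String path) {
        HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer();

        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource(path));
        reader.setLinesToSkip(1);                   // skip the column-name line...
        reader.setSkippedLinesCallback(tokenizer);  // ...but hand it to the tokenizer
        reader.setLineMapper(lineMapper);
        return reader;
    }
}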
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API), or the writer can be the piece that sends the JSON to the REST service, if writing happens after receiving the response from the service.
Anyway, you don't need a separate step to just know the file name. Make it part of application initialization code.
Strategies to parallelize your application are listed here.
You just said a bunch of files. If those files have a similar number of lines, I would go with the partitioning approach (i.e. by implementing the Partitioner interface, as sketched below, I would hand each file to a separate thread, and that thread would execute a step: reader -> processor -> writer). You wouldn't need MultiResourceItemReader in this case but a simple single-file reader, as each file will have its own reader. Partitioning
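A minimal sketch of such a Partitioner, assuming one partition per input file with the file path stored under an assumed key "filePath"; a step-scoped reader would then bind to #{stepExecutionContext['filePath']}. Spring Batch's built-in MultiResourcePartitioner does essentially the same thing with a "fileName" key.

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class PerFilePartitioner implements Partitioner {

    private final String[] filePaths;

    public PerFilePartitioner(String... filePaths) {
        this.filePaths = filePaths;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String path : filePaths) {
            ExecutionContext ctx = new ExecutionContext();
            // Consumed by a @StepScope reader via #{stepExecutionContext['filePath']}.
            ctx.putString("filePath", path);
            partitions.put("partition" + i++, ctx);
        }
        return partitions;
    }
}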
If the line counts of those files vary a lot, i.e. if one file is going to take hours and another finishes in a few minutes, you can continue using MultiResourceItemReader but use the Multi-threaded Step approach (see the sketch below) to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread-safe.
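A hedged sketch of a multi-threaded step, assuming Spring Batch Java configuration; the bean names, chunk size and throttle limit are arbitrary, and the reader (MultiResourceItemReader plus its delegate) has to be made thread-safe or synchronized for this to be correct.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class MultiThreadedStepConfig {

    @Bean
    public Step ingestStep(StepBuilderFactory steps,
                           ItemReader<FieldSet> reader,
                           ItemProcessor<FieldSet, String> jsonProcessor,
                           ItemWriter<String> restWriter) {
        return steps.get("ingestStep")
                .<FieldSet, String>chunk(100)
                .reader(reader)
                .processor(jsonProcessor)
                .writer(restWriter)
                .taskExecutor(new SimpleAsyncTaskExecutor("ingest-"))
                .throttleLimit(4)   // limit on concurrent chunk-processing threads
                .build();
    }
}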
The Parallel Steps approach doesn't look suitable for your case since your steps are not independent.
Hope it helps !!

Example snippet code for giving output of one reducer to another reducer using Java

I am new to Hadoop and MapReduce programs. It would be helpful if someone could answer my question.
I want to write a MapReduce program with two reducers, where the output of one reducer is given to the other reducer. To have two reducers there should be two job drivers. Could someone please provide an example snippet of a MapReduce program in which the output of one reducer is given to another reducer, along with the job definition code?
We can have two MapReduce jobs. The output of the first job can be given as input to the second job. The mapper of the second job can be just an identity mapper that passes its input through as output, and we can put our logic in the second job's reducer. These two jobs can be made to run one after the other using JobControl.
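A minimal JobControl sketch, assuming job1 and job2 are already configured Job instances whose paths are wired so that job2 reads job1's output (job2 can keep the default identity Mapper and set your second reducer); the group name and polling interval are arbitrary.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class TwoReducerChain {

    public static void runChained(Job job1, Job job2) throws Exception {
        ControlledJob first = new ControlledJob(job1, null);
        ControlledJob second = new ControlledJob(job2, null);
        second.addDependingJob(first);          // job2 starts only after job1 succeeds

        JobControl control = new JobControl("reducer-chain");
        control.addJob(first);
        control.addJob(second);

        Thread runner = new Thread(control);    // JobControl is a Runnable
        runner.setDaemon(true);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);                 // poll until both jobs complete
        }
        control.stop();
    }
}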

Best way for a job to update output from another job

Here is my scenario. I have a job that processes a large amount of CSV data and writes it out using Avro into files divided up by date. I have been given a small file that I want to use to update a few of these files with additional entries, using a second job I can run whenever this needs to happen, instead of reprocessing the whole data set again.
Here is sort of what the idea looks like:
Job1: Processes lots of CSV data and writes it out in compressed Avro files split into files by entry date. The source data is not divided by date, so this job will do that.
Job2 (run as needed between Job1 runs): Processes the small update file and uses it to add the entries to the appropriate Avro file. If one doesn't exist, it creates a new file.
Job3 (always runs): Produces some metrics for reporting from the output of Job1 (and possibly Job2).
So, I have to do it this way, writing a Java job. My first job seems to work fine, and so does Job3. I'm not sure how to approach Job2.
Here is what I was thinking:
Pass the update file in using the distributed cache. Parse this file to produce a list of dates in the job class and use this to filter the files from Job1, which will be the input of this job.
In the mapper, access the distributed update file and add its entries to the collection of Avro objects I've read in. What if the file doesn't exist yet here? Does this work?
Use the reducer to write the new object collection.
Is this how one would implement this? If not, what is a better way? Does a combiner make sense here? I feel like the answer is no.
Thanks in advance.
You can follow the approach below:
1) Run Job1 on all your CSV files.
2) Run Job2 on the small file and create new output.
3) For the update, you need to run one more job; in this job, load the output of Job2 in the setup() method and take the output of Job1 as the map() input. Then write the update logic and generate the final output.
4) Then run your Job3 for processing.
In my view, this will work.
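A hedged sketch of step 3, assuming plain tab-separated "id <tab> record" lines rather than the Avro records of the original question; the configuration key "update.path" pointing at the Job2 output directory is also an assumption.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpdateMergeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Map<String, String> updates = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the small Job2 output into memory once per task.
        Path updateDir = new Path(context.getConfiguration().get("update.path"));
        FileSystem fs = updateDir.getFileSystem(context.getConfiguration());
        for (FileStatus status : fs.listStatus(updateDir)) {
            if (!status.isFile()) {
                continue;
            }
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);   // id \t updated record
                    if (parts.length == 2) {
                        updates.put(parts[0], parts[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] parts = record.toString().split("\t", 2);  // id \t original Job1 record
        if (parts.length < 2) {
            return;
        }
        // Prefer the updated version of a record when one exists.
        String merged = updates.getOrDefault(parts[0], parts[1]);
        context.write(NullWritable.get(), new Text(parts[0] + "\t" + merged));
    }
}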
Just one crazy idea: why do you actually need to update Job1's output?
JOB1 does its job, producing one file per date. Why not add a unique postfix, like a random UUID, to each file name?
JOB2 processes the 'update' information, maybe several times. The logic of output file naming is the same: a date-based name and a unique postfix.
JOB3 collects the JOB1 and JOB2 output, grouping the files into splits by date prefix (with all postfixes) and taking them as input.
If date-based grouping is the target, you get a lot of advantages here; to me the obvious ones are:
You don't care about 'whether you have output from JOB1 for this date'.
You don't even care if you need to update one JOB1 output with several JOB2 results.
You don't fight HDFS's 'no file update' limitation, and you keep the full power of straightforward 'write once' processing.
You only need a specific InputFormat for your JOB3, which doesn't look too complex.
If you need to combine data from different sources, no problem.
JOB3 itself can ignore the fact that it receives data from several sources; the InputFormat should take care of that.
Several JOB1 outputs can be combined the same way.
Limitations:
This could produce more small files than you can afford, for large datasets and several passes.
You need a custom InputFormat.
To me this is a good option, if I have understood your case properly and you can/need to group files by date as input for JOB3.
Hope this will help you.
For Job2, you can read the update file in the driver code to filter the input data partitions and set them as the input paths. You can follow your current approach of reading the update file as a distributed cache file. If you want the job to fail when the update file can't be read, throw an exception in the setup() method itself.
If your update logic does not require aggregation on the reduce side, make Job2 a map-only job. You might need to build logic to identify the updated input partitions in Job3, as it will receive both the Job1 output and the Job2 output.
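A hedged sketch of that driver-side filtering, assuming the Job1 output is laid out as <job1OutputRoot>/<date>/ and that the date is the first comma-separated field of each update line; the helper name and the use of addCacheFile are illustrative only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class UpdateDriverHelper {

    public static void addFilteredInputs(Job job, String updateFile, String job1OutputRoot)
            throws Exception {
        Path update = new Path(updateFile);
        FileSystem fs = update.getFileSystem(job.getConfiguration());

        // Collect the dates mentioned in the small update file.
        Set<String> dates = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(update)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                dates.add(line.split(",")[0]);   // assume the date is the first CSV field
            }
        }

        // Only the partitions that the update file actually touches become inputs.
        for (String date : dates) {
            FileInputFormat.addInputPath(job, new Path(job1OutputRoot + "/" + date));
        }

        // Ship the update file itself to the tasks as a cache file.
        job.addCacheFile(update.toUri());
    }
}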

How do I run two different mappers on the same input and have their output sent to a single reducer?

I have some flight data (each line containing origin, destination, flight number, etc.) and I need to process it to output flight details between all origins and destinations with one stopover. My idea is to have two mappers (one outputs destination as the key and the other outputs origin as the key, so the reducer gets the stopover location as the key and all origins and destinations as an array of values). Then I can output flight details with one stopover for all locations in the reducer.
So my question is: how do I run two different mappers on the same input file and have their output sent to one reducer?
I read about MultipleInputs.addInputPath, but I guess it needs the inputs to be different (or at least two copies of the same input).
I am thinking of running the two mapper jobs independently using a workflow and then a third job with an identity mapper and a reducer where I will do the flight calculation.
Is there a better solution than this? (Please do not ask me to use Hive, I am not comfortable with it yet.) Any guidance on implementing this using MapReduce would really help. Thanks.
I think you can do it with just one Mapper.
The Mapper emits each (src,dst,fno,...) input record twice, once as (src,(src,dst,fno,...)) and once as (dst,(src,dst,fno,...)).
In the Reducer you need to figure out for each record whether its key is a source or destination and do the stop-over join. Using a flag to indicate the role of the key and a secondary sort can make this a bit more efficient.
That way only a single MR job with one Mapper and one Reducer is necessary for the task.
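A minimal sketch of such a Mapper, assuming comma-separated records of the form origin,destination,flightNo,...; the "IN"/"OUT" flag is the role marker mentioned above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlightMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        String origin = fields[0];
        String destination = fields[1];

        // Keyed by destination: this record can be the first leg of a stopover.
        context.write(new Text(destination), new Text("IN\t" + line));
        // Keyed by origin: this record can be the second leg of a stopover.
        context.write(new Text(origin), new Text("OUT\t" + line));
    }
}

In the reducer, each "IN" record (a flight arriving at the key airport) is then joined with each "OUT" record (a flight departing from it) to form the one-stopover itineraries.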
Your question did not specify whether you wish to mix and match (stopover/no stopover) results together.
So I will go with the question as stated: that is, only consider exactly one (not zero) stopover.
In that case simply have two Map/Reduce stages. First stage Mapper outputs
(dest1, source1).
First stage reducer receives
(dest1, Array(source1, source2, ...))
The first stage reducer then writes its tuples to hdfs output directory.
Now do the second stage: the mapper input uses the Stage1 reducer output as its source directory.
Second stage mapper reads:
(dest1, Array(source1, source2, ...))
Second stage mapper outputs:
(dest2, (source1,dest1))
Then your final (stage2) reducer receives:
(dest2, Array((source11, dest11), (source12, dest12), (source13, dest13), ...))
and it writes that data to the hdfs output. You can then use any external tools you like to read those results from hdfs.
