Here is my scenario. I have a job that processes a large amount of CSV data and writes it out as Avro files divided up by date. I have been given a small file that I want to use to add entries to a few of these files, via a second job I can run whenever this needs to happen instead of reprocessing the whole data set again.
Here is sort of what the idea looks like:
Job1: Processes lots of CSV data and writes it out as compressed Avro files split by entry date. The source data is not divided by date, so this job does that partitioning.
Job2 (run as needed between Job1 runs): Processes the small update file and adds its entries to the appropriate Avro files. If a file doesn't exist yet, it creates a new one.
Job3 (always runs): Produce some metrics for reporting from the output of Job1 (and possibly Job 2).
So, I have to do it this way, writing Java jobs. My first job seems to work fine, and so does Job 3. I'm not sure how to approach Job 2.
Here is what I was thinking:
Pass the update file in using the distributed cache. Parse this file in the Job (driver) class to produce a list of dates, and use that list to filter the Job1 files that will be the input of this job.
In the mapper, access the update file from the distributed cache and add its entries to the collection of Avro objects I've read in. What if the target file doesn't exist yet at this point? Does this work?
Use the reducer to write out the new object collection.
Is this how one would implement this? If not, what is the better way? Does a combiner make sense here? I feel like the answer is no.
Thanks in advance.
You can follow the approach below:
1) Run Job1 on all your CSV files.
2) Run Job2 on the small file and create a new output.
3) For the update, you need to run one more job. In this job, load the output of Job2 in the setup() method and take the output of Job1 as the map() input. Then apply the update logic and generate the final output (a sketch follows below this list).
4) Then run your Job3 for processing.
In my opinion, this will work.
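A rough sketch of step 3, where a trivial mapper emits each Job1 record keyed by its date and the reducer loads the small Job2 output in setup() and appends its entries per date. The class name, the "date,record" text layout, and the update.output.path property are my assumptions, not code from the question; the real records are Avro, which you would read through something like AvroKeyInputFormat and the generated record classes instead.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UpdateJoinReducer extends Reducer<Text, Text, Text, Text> {

    // date -> update entries for that date, loaded from the small Job2 output
    private final Map<String, List<String>> updatesByDate = new HashMap<String, List<String>>();
    private final Set<String> datesSeen = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        String updateDir = context.getConfiguration().get("update.output.path");
        if (updateDir == null) {
            return; // no update this run; Job1 records pass through unchanged
        }
        FileSystem fs = FileSystem.get(context.getConfiguration());
        for (FileStatus status : fs.listStatus(new Path(updateDir))) {
            if (status.getPath().getName().startsWith("_")) {
                continue; // skip _SUCCESS and similar marker files
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(",", 2); // assumed "date,record" layout
                    List<String> entries = updatesByDate.get(parts[0]);
                    if (entries == null) {
                        entries = new ArrayList<String>();
                        updatesByDate.put(parts[0], entries);
                    }
                    entries.add(parts[1]);
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    protected void reduce(Text date, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        datesSeen.add(date.toString());
        for (Text record : records) {
            context.write(date, record); // keep everything Job1 produced for this date
        }
        List<String> additions = updatesByDate.get(date.toString());
        if (additions != null) {
            for (String added : additions) {
                context.write(date, new Text(added)); // append the update entries
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Dates with no Job1 output at all still get their new entries written.
        // (Assumes a single reducer; with several, filter by partition here.)
        for (Map.Entry<String, List<String>> entry : updatesByDate.entrySet()) {
            if (datesSeen.contains(entry.getKey())) {
                continue;
            }
            for (String added : entry.getValue()) {
                context.write(new Text(entry.getKey()), new Text(added));
            }
        }
    }
}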
Just one crazy idea: why do you actually need to update the Job1 output?
JOB1 does its job, producing one file per date. Why not add a unique suffix to the name, like a random UUID?
JOB2 processes the 'update' information, maybe several times. The output file naming logic is the same: a date-based name plus a unique suffix.
JOB3 collects the JOB1 and JOB2 output as its input, grouping the files into splits by date prefix across all suffixes.
If date-based grouping is the goal, this gives you a lot of advantages; the obvious ones, in my view:
You don't care about whether you have output from JOB1 for a given date.
You don't even care if you need to update one JOB1 output with several JOB2 results.
You don't fight HDFS's 'no file update' limitation, and you keep the full power of straightforward write-once processing.
You only need a specific InputFormat for your JOB3, which doesn't look too complex.
If you need to combine data from different sources, no problem.
JOB3 itself can ignore the fact that it receives data from several sources; the InputFormat should take care of that.
Several JOB1 outputs can be combined the same way.
Limitations:
This could produce more small files than you can afford, for large datasets and several passes.
You need a custom InputFormat.
In my view it's a good option, if I understand your case correctly and you can (or need to) group files by date as input for JOB3.
Hope this helps.
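If the grouping can be done at the driver level rather than inside a custom InputFormat, a rough sketch of collecting the JOB1 and JOB2 files by date prefix could look like this (the directory layout, the yyyy-MM-dd prefix, and the class name are assumptions):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DateGroupedInputs {

    // Groups every file from the JOB1 and JOB2 output directories by its date prefix,
    // e.g. "2014-05-01_<uuid>.avro" -> key "2014-05-01".
    public static Map<String, List<Path>> groupByDate(Configuration conf, Path... outputDirs)
            throws IOException {
        Map<String, List<Path>> byDate = new HashMap<String, List<Path>>();
        FileSystem fs = FileSystem.get(conf);
        for (Path dir : outputDirs) {
            for (FileStatus status : fs.listStatus(dir)) {
                String name = status.getPath().getName();
                if (name.startsWith("_")) {
                    continue; // skip _SUCCESS and friends
                }
                String date = name.substring(0, "yyyy-MM-dd".length()); // assumed prefix length
                List<Path> files = byDate.get(date);
                if (files == null) {
                    files = new ArrayList<Path>();
                    byDate.put(date, files);
                }
                files.add(status.getPath());
            }
        }
        return byDate; // feed each date's list to JOB3, or to a custom InputFormat's getSplits()
    }
}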
For Job2, you can read the update file in the driver code to filter the input data partitions and set them as input paths. You can follow your current approach of reading the update file as a distributed cache file. If you want the job to fail when the update file cannot be read, throw an exception in the setup() method itself.
If your update logic does not require aggregation on the reduce side, make Job2 a map-only job. You might need logic in Job3 to identify the updated input partitions, since it will receive both the Job1 and Job2 output.
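A driver-side sketch of that filtering with the Hadoop 2 mapreduce API. The date-named partition directories under the Job1 output, the "date in the first CSV column" assumption, and all paths and class names are hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Job2Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path updateFile = new Path(args[0]);  // small update file
        Path job1Output = new Path(args[1]);  // date-partitioned Job1 output, e.g. .../2014-05-01/
        Path output = new Path(args[2]);

        // Collect the dates mentioned in the update file (assumed first CSV column).
        Set<String> dates = new HashSet<String>();
        FileSystem fs = FileSystem.get(conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(updateFile)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                dates.add(line.split(",")[0]);
            }
        } finally {
            reader.close();
        }

        Job job = Job.getInstance(conf, "job2-update");
        job.setJarByClass(Job2Driver.class);
        // job.setMapperClass(...): a mapper that re-reads the cached update file in setup()

        // Only feed in the Job1 partitions that the update actually touches.
        for (String date : dates) {
            Path partition = new Path(job1Output, date);
            if (fs.exists(partition)) {
                FileInputFormat.addInputPath(job, partition);
            }
        }

        job.addCacheFile(updateFile.toUri()); // ship the update file to the mappers too
        job.setNumReduceTasks(0);             // map-only, as suggested above

        FileOutputFormat.setOutputPath(job, output);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}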
Related
I am seeking some guidance, please, on how to structure a Spring Batch application to ingest a bunch of potentially large delimited files, each with a different format.
The requirements are clear:
select the files to ingest from an external source: there can be multiple releases of some files each day, so the latest release must be picked
turn each line of each file into JSON by combining the delimited fields with the column names from the first line (which is skipped)
send each line of JSON to a RESTful API
We currently have one step which uses a MultiResourceItemReader and processes the files in sequence. The files are input streams, which time out.
Ideally I think we want to have
a step which identifies the files to ingest
a step which processes files in parallel
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.
A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
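A rough sketch of that combination, assuming a comma delimiter; the class and method names are mine, and the FieldSetMapper here just forwards the raw FieldSet to the processor:

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.core.io.Resource;

// Learns its token names from the skipped header line, then tokenizes the data lines.
public class HeaderAwareTokenizer extends DelimitedLineTokenizer implements LineCallbackHandler {

    @Override
    public void handleLine(String headerLine) {
        // The header is assumed to use the same delimiter as the data (comma here).
        setNames(headerLine.split(","));
    }

    // Example wiring: the reader skips the header as data but hands it to the tokenizer.
    public static FlatFileItemReader<FieldSet> buildReader(Resource file) {
        HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer();

        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSet -> fieldSet); // pass the raw FieldSet to the processor

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setResource(file);
        reader.setLinesToSkip(1);
        reader.setSkippedLinesCallback(tokenizer); // header line arrives in handleLine(...)
        reader.setLineMapper(lineMapper);
        return reader;
    }
}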
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API); alternatively, the call that sends the JSON to the REST service can act as your writer, if that call happens after receiving a response from the service.
Anyway, you don't need a separate step just to identify the file names. Make that part of the application initialization code.
Strategies to parallelize your application are listed here.
You just said a bunch of files. If the files have a similar number of lines, I would go with the partitioning approach (i.e. by implementing the Partitioner interface, I would hand each file to a separate thread, and that thread would execute a step: reader -> processor -> writer). You wouldn't need MultiResourceItemReader in this case, just a simple single-file reader, since each file gets its own reader. See Partitioning.
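A bare-bones sketch of such a Partitioner; the location pattern and the "inputFile" context key are my own names:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

// One partition (and therefore one step execution) per input file.
public class FilePartitioner implements Partitioner {

    private final String locationPattern; // e.g. "file:/data/in/*.csv" (hypothetical)

    public FilePartitioner(String locationPattern) {
        this.locationPattern = locationPattern;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        try {
            Resource[] files = new PathMatchingResourcePatternResolver().getResources(locationPattern);
            for (int i = 0; i < files.length; i++) {
                ExecutionContext context = new ExecutionContext();
                // A step-scoped reader picks this up, e.g. @Value("#{stepExecutionContext['inputFile']}")
                context.putString("inputFile", files[i].getURL().toExternalForm());
                partitions.put("partition" + i, context);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not resolve input files", e);
        }
        return partitions;
    }
}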
If the line counts in those files vary a lot, i.e. if one file is going to take hours and another finishes in a few minutes, you can continue using MultiResourceItemReader but use the multi-threaded step approach to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread-safe.
The parallel-steps approach doesn't look suitable for your case, since your steps are not independent.
Hope it helps!
I'm quite new to Hadoop, and I have an issue...
I have an output file (the result of a task) and I would like to modify it. As it can be a very big file, I want to do this operation in parallel.
Note: I don't want to simply append data; I want to modify it structurally (even its size), so I have to read it completely and write it back.
Reading the file isn't a problem: I give each worker a part of the file, and they simply have to read it and make the changes they want.
But writing the new file back to HDFS seems trickier.
My question is: how can I create a big file in HDFS and have my workers write into it simultaneously (I know the size of each part, so two workers will never attempt to write at the same position)?
Thanks in advance :)
Since the job is to read the input file and write selected content from it to an output location in parallel, this is a mapper-only job.
Create a Mapper class that reads the file and performs your operations on it.
Set the number of mappers in your driver class:
job.setNumMapTasks(n); // n = number of mappers
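Note that setNumMapTasks comes from the old JobConf API and is only a hint to the framework; the number of mappers ultimately follows from the input splits. What makes the rewrite run in parallel per split is the map-only configuration, roughly like this sketch with the Hadoop 2 mapreduce API (class names are mine):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RewriteDriver {

    // Placeholder mapper: put the structural modification logic here.
    public static class RewriteMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), line); // replace with the real transformation
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "parallel-rewrite");
        job.setJarByClass(RewriteDriver.class);

        job.setMapperClass(RewriteMapper.class);
        job.setNumReduceTasks(0); // map-only: each mapper writes its own part file in parallel

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}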
I have some flight data (each line containing origin, destination, flight number, etc.) and I need to process it to output flight details between all origins and destinations with one stopover. My idea is to have two mappers (one outputs destination as key and the other outputs origin as key, so the reducer gets the stopover location as key and all origins and destinations as an array of values). Then I can output flight details with one stopover for all locations in the reducer.
So my question is how do I run two different mappers on the same input file and have their output sent to one reducer.
I read about MultipleInputs.addInputPath, but I guess it needs the inputs to be different (or at least two copies of the same input).
I am thinking of running the two mapper jobs independently using a workflow and then a third Identity mapper and reducer where I will do the flight calculation.
Is there a better solution that this? (Please do not ask me to use Hive, am not comfortable with it yet) Any guidance on implementing using mapreduce would really help. Thanks.
I think you can do it with just one Mapper.
The Mapper emits each (src,dst,fno,...) input record twice, once as (src,(src,dst,fno,...)) and once as (dst,(src,dst,fno,...)).
In the Reducer you need to figure out for each record whether its key is a source or destination and do the stop-over join. Using a flag to indicate the role of the key and a secondary sort can make this a bit more efficient.
That way only a single MR job with one Mapper and one Reducer is necessary for the task.
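A rough sketch of that single-mapper variant, assuming a CSV layout of origin,destination,flightNo,... and simple "S"/"D" tags instead of the secondary-sort optimisation:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StopoverJoin {

    // Emit each flight record under both its origin and its destination.
    public static class FlightMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");   // assumed: origin,destination,flightNo,...
            context.write(new Text(fields[0]), new Text("S," + line)); // keyed by source
            context.write(new Text(fields[1]), new Text("D," + line)); // keyed by destination
        }
    }

    // The key is a candidate stopover: join flights arriving there with flights leaving it.
    public static class StopoverReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text stopover, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> arriving = new ArrayList<String>();  // destination == stopover
            List<String> departing = new ArrayList<String>(); // origin == stopover
            for (Text value : values) {
                String tagged = value.toString();
                if (tagged.startsWith("D,")) {
                    arriving.add(tagged.substring(2));
                } else {
                    departing.add(tagged.substring(2));
                }
            }
            for (String in : arriving) {
                for (String out : departing) {
                    context.write(stopover, new Text(in + " -> " + out));
                }
            }
        }
    }
}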
Your question did not specify whether you wish to mix/match (stopover/no stopover) results together.
So I will go with the question as stated: that is, only consider exactly one (not zero) stopover.
In that case simply have two Map/Reduce stages. First stage Mapper outputs
(dest1, source1).
First stage reducer receives
(dest1, Array(source1, source2, ...))
The first stage reducer then writes its tuples to hdfs output directory.
Now do the second stage: the mapper input uses the Stage1 reducer output as its source directory.
Second stage mapper reads:
(dest1, Array(source1, source2, ...))
Second stage mapper outputs:
(dest2, (source1,dest1))
Then your final (stage2) reducer receives:
(dest2, Array((source11, dest11), (source12, dest12), (source13, dest13), ...))
and it writes that data to the hdfs output. You can then use any external tools you like to read those results from hdfs.
Can I set several mapper classes in one job?
For example, I have a CSV input file in HDFS and two tasks to do. The first is to count two fields from the CSV input file and write the result to one output file. The second is to count another two fields from the same CSV input file and write the result to another output file. The reducer is the same.
How can I achieve this with just one job and have both run at the same time? (I don't want to do the first one and then the second after the first finishes; I want them to run in parallel.)
I tried the following code:
job1.setMapperClass(Mapper1.class);
job1.setReducerClass(LogReducer.class);
job1.setMapperClass(Mapper2.class);
job1.setReducerClass(LogReducer.class);
I tried it but it doesn't work; it only shows me the second result, the first one is gone.
This clearly needs two jobs running in parallel. What is the problem with running two jobs in parallel, given that the mapping tasks and output paths are different? A job can't handle multiple mappers unless they are chained.
So the question is whether you want one output or two outputs from the reducer. You could map two inputs, one handled by Mapper1 and the other by Mapper2, and then pass the merged intermediate results into a reducer to get one output. That's using the MultipleInputs class in a single job, and it can be configured in the driver class (a driver sketch follows below).
If you want the reduced results of Mapper1 to be separate from the reduced results of Mapper2, then you need to configure two jobs. The two jobs would have different mappers but would configure the same reducer class.
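A minimal driver sketch of the single-job MultipleInputs route (Mapper1, Mapper2 and LogReducer are the classes from the question; paths and key/value types are assumptions). Note that MultipleInputs ties a mapper class to an input path, so each mapper is given its own path to the CSV data here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoMapperDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two-mappers-one-reducer");
        job.setJarByClass(TwoMapperDriver.class);

        // Each mapper class is bound to its own input path within the same job.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, Mapper1.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, Mapper2.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}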
Have a look at the MultipleOutputs class in Hadoop to write to multiple files from a single reducer. Write the output to the second file based on conditions in your reduce method.
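A sketch of what that can look like in the reducer; the named outputs "first" and "second" and the key condition are illustrative, and the named outputs must also be registered in the driver with MultipleOutputs.addNamedOutput(...):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Route the result to one of two output files depending on the key.
        String namedOutput = key.toString().startsWith("fieldA") ? "first" : "second";
        multipleOutputs.write(namedOutput, key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}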
I am trying to read an HBase table using TableMapReduceUtil and dump the data into HDFS (don't ask me why; it is weird, but I don't have any other option). To achieve that, I want to manipulate the final file names (emitted by the reducer) with respect to the reducer key.
On the mapper side I was able to dump HBase rows to HDFS in the default order. But to override the reducer output file format (naming files per key), I found the MultipleOutputFormat class for the reducer (which is absent in 0.20 due to some interface mess-up, as I read somewhere), and the old one takes only JobConf. But if I try to write the code with the old JobConf, I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
I don't have much hands-on experience with Hadoop/HBase; I have spent some time modifying existing MR jobs.
It seems I am stuck with my approach.
Versions: Hadoop-Core-0.20.; HBase 0.90.1
Thanks
Pankaj
I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
There are org.apache.hadoop.hbase.mapred.TableMapReduceUtil and org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil classes. The first takes a JobConf (old MR API) and the second takes a Job (new MR API). Use the appropriate TableMapReduceUtil class for the API you are on.
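For example, with the new-API class a driver can look roughly like this; the table name, MyTableMapper and MyReducer are placeholders, not code from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseDumpDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-to-hdfs");
        job.setJarByClass(HBaseDumpDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching for MR scans
        scan.setCacheBlocks(false);  // don't pollute the block cache

        // New-API helper: note the "mapreduce" package, which takes a Job.
        TableMapReduceUtil.initTableMapperJob(
                "my_table",            // hypothetical table name
                scan,
                MyTableMapper.class,   // your mapper extending TableMapper<Text, Text>
                Text.class,            // mapper output key
                Text.class,            // mapper output value
                job);

        job.setReducerClass(MyReducer.class); // reducer that writes the per-key output
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}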