Limit number of mappers with MultipleOutputs and no reducer in Hadoop - java

Hi, I have an application that reads records from HBase and writes them into text files. The HBase table has 200 regions.
I am using MultipleOutputs in the mapper class to write into multiple files, and I build the file name from the incoming records.
I produce 40 unique file names.
I get the records properly, but my problem is that when the MapReduce job finishes it creates the 40 files plus about 2,000 extra files with the proper name, appended
with m-000 and so on.
This is because I have 200 regions and MultipleOutputs creates a set of files per mapper, so 200 mappers with 40 unique files each is why it creates 40*200 files.
I don't know how to avoid this situation without a custom partitioner.
Is there any way to force records to be written only into the files they belong to, instead of being split across multiple files?
I have used a custom partitioner class and it works fine, but I don't want to use it, since I am only reading from HBase and not doing any reducer work. Also, whenever I need an extra file name, I would have to change my code as well.
Here is my mapper code
public class DefaultMapper extends TableMapper<NullWritable, Text> {
    private Text text = new Text();
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";

    @Override
    public void setup(Context context) throws java.io.IOException, java.lang.InterruptedException {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) throws java.io.IOException, java.lang.InterruptedException {
        String FILE_NAME = new String(value.getValue(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_NAME)));
        multipleOutputs.write(NullWritable.get(), new Text(text.toString()), FILE_NAME);
        //context.write(NullWritable.get(), text);
    }
}
No reducer class
This is how my output looks; ideally only one Japan.BUS.gz file should be created. The other files are also very small:
Japan.BUS-m-00193.gz
Japan.BUS-m-00194.gz
Japan.BUS-m-00195.gz
Japan.BUS-m-00196.gz

I encountered the same situation and worked out a solution for it.
MultipleOutputs<KEYOUT, VALUEOUT> multipleOutputs = null;
String keyToFind = null;

public void setup(Context context) throws IOException, InterruptedException
{
    this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
}

public void map(NullWritable key, Text values, Context context) throws IOException, InterruptedException
{
    String valToFindInCol[] = values.toString().split(","); /** let's say comma separated **/
    if (keyToFind == null || keyToFind.equals(valToFindInCol[2])) /** say you need to match the element at position 2 **/
    {
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    else
    {
        this.multipleOutputs.close();
        this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    keyToFind = valToFindInCol[2];
}
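As an aside (this is not part of the answer above, and assumes a standard Job driver): part of the file explosion comes from the empty default part-m-xxxxx files that a map-only job with MultipleOutputs otherwise leaves behind. Hadoop's LazyOutputFormat only creates the default output when something is actually written to it, so a driver-configuration sketch might look like:

```
// Sketch of driver-side configuration; the job name is hypothetical.
// LazyOutputFormat defers creation of the default output file until the
// first record is written to it, so mappers that write only through
// MultipleOutputs do not leave empty part-m-xxxxx files behind.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Job job = Job.getInstance(conf, "hbase-export");
job.setNumReduceTasks(0); // map-only job, as in the question
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
```

Note that this only suppresses the empty extras; it does not change the 40-files-per-mapper multiplication, which is determined by the number of input splits (here, HBase regions).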

Related

User Defined Function in Pig Latin

I am using Java to create a User Defined Function (UDF) for Pig Latin in a Hadoop environment. I want to create multiple output files. I have tried to write a Java program to output these CSV files as below:
public String exec(Tuple input)
throws IOException {
if(input.equals("age")){
outputFile = new FileWriter("C:\\UDF\\output_age.csv");
}else{
outputFile = new FileWriter("C:\\UDF\\output_general.csv");
}
}
But this doesn't work. Is there any alternative method to do that, whether by Java or by Pig Latin itself?
While writing UDFs, you need to take care of the data types. Here the exec method takes a Tuple as input. To read tuple values, you need to use the tuple.get(0) notation, i.e.:
public String exec(Tuple input) throws IOException {
    String inputAge = input.get(0).toString();
    FileWriter outputFile;
    if (inputAge.equals("age")) {
        // file creation logic
        outputFile = new FileWriter("C:\\UDF\\output_age.csv");
    } else {
        // file creation logic
        outputFile = new FileWriter("C:\\UDF\\output_general.csv");
    }
}
You can refer to Writing Java UDF in Pig for reference.
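Stripped of the Pig types, the branching in the answer reduces to a dispatch on the first field's string value. A minimal, self-contained sketch (the class and method names are mine; the file names are taken from the question):

```java
public class OutputFileDispatch {
    // Choose an output file name based on the value of the first tuple field.
    static String outputFileFor(String firstField) {
        if ("age".equals(firstField)) {
            return "C:\\UDF\\output_age.csv";
        }
        return "C:\\UDF\\output_general.csv";
    }

    public static void main(String[] args) {
        System.out.println(outputFileFor("age"));     // C:\UDF\output_age.csv
        System.out.println(outputFileFor("salary"));  // C:\UDF\output_general.csv
    }
}
```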

Skipping the header from Java MapReduce code

I am trying to get the summary of a CSV file, and the first line of the file is the header. Is there a way to emit each column value with its header name as a key-value pair from the Java code?
Eg: Input file is like
A,B,C,D
1,2,3,4
5,6,7,8
I want the output from the mapper as (A,1),(B,2),(C,3),(D,4),(A,5),....
Note: I tried overriding the run function in the Mapper class to skip the first line. But as far as I know, the run function gets called for each input split and thus does not suit my need. Any help on this will really be appreciated.
This is what my mapper looks like:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] splits = line.split(",", -1);
    int length = splits.length;
    for (int i = 0; i < length; i++) {
        columnName.set(header[i]);
        context.write(columnName, new Text(splits[i]));
    }
}
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        if (context.nextKeyValue()) {
            Text columnHeader = context.getCurrentValue();
            header = columnHeader.toString().split(",");
        }
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}
I assume that the column headers are letters and the column values are numbers.
One of the ways to achieve this, is to use DistributedCache.
Following are the steps:
Create a file containing the column headers.
In the Driver code, add this file to the distributed cache, by calling Job::addCacheFile()
In the setup() method of the mapper, access this file from the distributed cache. Parse and store the contents of the file in a columnHeader list.
In the map() method, check if the values in each record match the headers (stored in the columnHeader list). If yes, then ignore that record (because the record just contains the headers). If no, then emit the values along with the column headers.
This is what the Mapper and Driver code looks like:
Driver:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "HeaderParser");
job.setJarByClass(WordCount.class);
job.setMapperClass(HeaderParserMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.addCacheFile(new URI("/in/header.txt#header.txt"));
FileInputFormat.addInputPath(job, new Path("/in/in7.txt"));
FileOutputFormat.setOutputPath(job, new Path("/out/"));
System.exit(job.waitForCompletion(true) ? 0:1);
}
Driver Logic:
Copy "header.txt" (which contains just one line: A,B,C,D) to HDFS
In the Driver, add "header.txt" to distributed cache, by executing following statement:
job.addCacheFile(new URI("/in/header.txt#header.txt"));
Mapper:
public static class HeaderParserMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    String[] headerList;
    String header;

    @Override
    protected void setup(Mapper.Context context) throws IOException, InterruptedException {
        BufferedReader bufferedReader = new BufferedReader(new FileReader("header.txt"));
        header = bufferedReader.readLine();
        headerList = header.split(",");
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] values = line.split(",");
        if (headerList.length == values.length && !header.equals(line)) {
            for (int i = 0; i < values.length; i++) {
                context.write(new Text(headerList[i] + "," + values[i]), NullWritable.get());
            }
        }
    }
}
Mapper Logic:
Override setup() method.
Read "header.txt" (which was put in distributed cache in the Driver) in the setup() method.
In the map() method, check if the line matches the header. If yes, then ignore that line. Else, output header and values as (h1,v1), (h2,v2), (h3,v3) and (h4,v4).
I ran this program on the following input:
A,B,C,D
1,2,3,4
5,6,7,8
I got the following output (where values are matched with respective header):
A,1
A,5
B,2
B,6
C,3
C,7
D,4
D,8
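The header/value pairing at the core of this mapper can be checked outside Hadoop. A self-contained sketch (class and method names are mine, not from the answer):

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderPairing {
    // Pair each CSV value with its column header; a line equal to the header
    // itself (or with a mismatched column count) produces no output.
    static List<String> pair(String header, String line) {
        List<String> out = new ArrayList<>();
        if (header.equals(line)) {
            return out; // this is the header line: skip it
        }
        String[] heads = header.split(",");
        String[] vals = line.split(",");
        if (heads.length != vals.length) {
            return out; // malformed record: skip it
        }
        for (int i = 0; i < heads.length; i++) {
            out.add(heads[i] + "," + vals[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pair("A,B,C,D", "1,2,3,4")); // [A,1, B,2, C,3, D,4]
        System.out.println(pair("A,B,C,D", "A,B,C,D")); // []
    }
}
```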
The accepted answer by @Manjunath Ballur works as a good hack. But MapReduce should be kept simple, and checking the header for each line is not the recommended way to do this.
One way to go is to write a custom InputFormat that does this work for you.

Spring Batch Write Header

I have a Spring Batch job that extracts data from a database and writes it to a .csv file.
I would like to add the names of the extracted columns as the header of the file, without hard-coding them in the file.
Is it possible to write the header when I get the results, or is there another solution?
Thanks
fileItemWriter.setHeaderCallback(new FlatFileHeaderCallback() {
public void writeHeader(Writer writer) throws IOException {
writer.write(Arrays.toString(names));
}
});
names can be fetched using reflection from the domain class you created (for the column names used by the rowMapper), something like below:
private String[] reflectFields() throws ClassNotFoundException {
    Class<?> job = Class.forName("DomainClassName");
    Field[] fields = FieldUtils.getAllFields(job);
    String[] names = new String[fields.length];
    for (int i = 0; i < fields.length; i++) {
        names[i] = fields[i].getName();
    }
    return names;
}
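The same idea also works with plain java.lang.reflect and no commons-lang dependency. A minimal sketch with a hypothetical domain class (note the Java specification does not guarantee that getDeclaredFields() returns fields in declaration order, though in practice it usually does):

```java
import java.lang.reflect.Field;

public class HeaderFromFields {
    // Hypothetical domain class standing in for the rowMapper's target type.
    static class Person {
        private String name;
        private int age;
    }

    // Derive header column names from the class's declared fields.
    static String[] fieldNames(Class<?> cls) {
        Field[] fields = cls.getDeclaredFields();
        String[] names = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            names[i] = fields[i].getName();
        }
        return names;
    }

    public static void main(String[] args) {
        // Typically prints "name,age", but field order is not guaranteed by the spec.
        System.out.println(String.join(",", fieldNames(Person.class)));
    }
}
```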

Read PDVInputStream dicomObject information on onCStoreRQ association request

I am trying to read (and then store to a 3rd-party local DB) certain DICOM object tags "during" an incoming association request.
For accepting association requests and storing my DICOM files locally, I have used a modified version of the dcmrcv() tool. More specifically, I have overridden the onCStoreRQ method like:
@Override
protected void onCStoreRQ(Association association, int pcid, DicomObject dcmReqObj,
PDVInputStream dataStream, String transferSyntaxUID,
DicomObject dcmRspObj)
throws DicomServiceException, IOException {
final String classUID = dcmReqObj.getString(Tag.AffectedSOPClassUID);
final String instanceUID = dcmReqObj.getString(Tag.AffectedSOPInstanceUID);
config = new GlobalConfig();
final File associationDir = config.getAssocDirFile();
final String prefixedFileName = instanceUID;
final String dicomFileBaseName = prefixedFileName + DICOM_FILE_EXTENSION;
File dicomFile = new File(associationDir, dicomFileBaseName);
assert !dicomFile.exists();
final BasicDicomObject fileMetaDcmObj = new BasicDicomObject();
fileMetaDcmObj.initFileMetaInformation(classUID, instanceUID, transferSyntaxUID);
final DicomOutputStream outStream = new DicomOutputStream(new BufferedOutputStream(new FileOutputStream(dicomFile), 600000));
//I would like to extract some tags from the incoming DICOM object somewhere here.
//When I try to do it using dataStream, my DICOM files get corrupted!
//System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
try {
outStream.writeFileMetaInformation(fileMetaDcmObj);
dataStream.copyTo(outStream);
} finally {
outStream.close();
}
dicomFile.renameTo(new File(associationDir, dicomFileBaseName));
System.out.println("DICOM file name: " + dicomFile.getName());
}
@Override
public void associationAccepted(final AssociationAcceptEvent associationAcceptEvent) {
....
@Override
public void associationClosed(final AssociationCloseEvent associationCloseEvent) {
...
}
I would like, somewhere in this code, to intercept a method which will read dataStream, parse specific tags, and store them to a local database.
However, wherever I try to put a piece of code that manipulates (even just reads) dataStream, my DICOM files get corrupted!
PDVInputStream implements java.io.InputStream ....
Even if I just put a:
System.out.println("StudyInstanceUID: " + dataStream.readDataset().getString(Tag.StudyInstanceUID));
before copying the datastream to outStream, my DICOM files get corrupted (1 KB in size) ...
How am I supposed to use the datastream in a C-STORE RQ association request to extract some information?
I hope my question is clear ...
The PDVInputStream is probably a PDUDecoder class. You'll have to reset the position when using the input stream multiple times.
Maybe a better solution would be to store the DICOM object in memory and use that for both purposes. Something akin to:
DicomObject dcmobj = dataStream.readDataset();
String whatYouWant = dcmobj.get( Tag.whatever );
dcmobj.initFileMetaInformation( transferSyntaxUID );
outStream.writeDicomFile( dcmobj );

How to get the input file name in Hadoop Cascading

In map-reduce I would extract the input file name as following
public void map(WritableComparable<Text> key, Text value, OutputCollector<Text,Text> output, Reporter reporter)
throws IOException {
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String filename = fileSplit.getPath().getName();
System.out.println("File name "+filename);
System.out.println("Directory and File name"+fileSplit.getPath().toString());
process(key,value);
}
How can I do something similar with Cascading?
Pipe assembly = new Pipe(SomeFlowFactory.class.getSimpleName());
Function<Object> parseFunc = new SomeParseFunction();
assembly = new Each(assembly, new Fields(LINE), parseFunc);
...
public class SomeParseFunction extends BaseOperation<Object> implements Function<Object> {
...
@Override
public void operate(FlowProcess flowProcess, FunctionCall<Object> functionCall) {
    // how can I get the input file name here???
}
Thanks,
I don't use Cascading, but I think it should be sufficient to access the context instance using functionCall.getContext(); to obtain the filename you can use:
String filename= ((FileSplit)context.getInputSplit()).getPath().getName();
However, it seems that Cascading uses the old API; if the above doesn't work, you must try:
Object name = flowProcess.getProperty( "map.input.file" );
Thanks to Engineiro for sharing the answer. However, when invoking the hfp.getReporter().getInputSplit() method, I got a MultiInputSplit type, which can't be cast to FileSplit directly in Cascading 2.5.3. After diving into the related Cascading APIs, I found a way and retrieved the input file names successfully. Therefore, I would like to share this to supplement Engineiro's answer. Please see the following code.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
MultiInputSplit mis = (MultiInputSplit) hfp.getReporter().getInputSplit();
FileSplit fs = (FileSplit) mis.getWrappedInputSplit();
String fileName = fs.getPath().getName();
You would do this by getting the reporter within the buffer class, from the flowProcess argument provided in the buffer's operate call.
HadoopFlowProcess hfp = (HadoopFlowProcess) flowProcess;
FileSplit fileSplit = (FileSplit)hfp.getReporter().getInputSplit();
// the rest of your code
