I wrote some Hadoop code to read the input file, split it into chunks, and write them out to many files, as follows:
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output,
                Reporter reporter) throws IOException {
    String line = value.toString();
    int totalLines = 2000;
    int lines = 0;
    int fileNum = 1;
    String[] linesinfile = line.split("\n");
    while (lines < linesinfile.length) {
        // I do something like, if lines == totalLines, {
        output.collect(new IntWritable(fileNum), new Text(linesinfile[lines].toString()));
        fileNum++;
        lines = 0;
        // }
        lines++;
    }
}
In reduce, I do:
public void reduce(IntWritable key, Iterator<Text> values,
                   OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}
My MultiFile class is as follows:
public class MultiFileOutput extends MultipleTextOutputFormat<IntWritable, Text> {
    protected String generateFileNameForKeyValue(IntWritable key, Text content, String fileName) {
        return key.toString() + "-" + fileName;
    }
}
In main, I say:
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(MultiFileOutput.class);
apart from setting the output key/value classes, etc.
What am I doing wrong? My output directory is always empty.
Thanks
The program looks a bit complex. If the purpose is to split the file into multiple files, then it can be done in a couple of ways. There is no need for both a map and a reduce phase; a map-only job would be enough.
Use o.a.h.mapred.lib.NLineInputFormat to read N lines at a time to the mapper from the input and then write those N lines to a file.
Set the dfs.blocksize to the required file size while uploading the file, then each mapper will process one InputSplit which can be written to a file.
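For the first suggestion, a minimal driver sketch (old mapred API, matching the question's code) might look like the following; the class name SplitDriver is a placeholder and 2000 is just the line count used in the question:

// Sketch only: a map-only job that hands N lines to each mapper via NLineInputFormat.
JobConf conf = new JobConf(SplitDriver.class);   // SplitDriver is a hypothetical driver class
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 2000); // N lines per map task
conf.setNumReduceTasks(0);                       // map-only: each mapper writes its own output file
JobClient.runJob(conf);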
Hi, I have an application that reads records from HBase and writes them into text files. The HBase table has 200 regions.
I am using MultipleOutputs in the mapper class to write into multiple files, and I am building the file name from the incoming records.
I am producing 40 unique file names.
I am able to get the records properly, but my problem is that when the MapReduce job finishes it creates the 40 files plus about 2k extra files with the proper names but appended
with m-000 and so on.
This is because I have 200 regions and MultipleOutputs creates files for each mapper, so 200 mappers and 40 unique files per mapper, which is why it creates 40*200 files.
I don't know how to avoid this situation without a custom partitioner.
Is there any way to force records to be written only into the files they belong to, rather than being split across multiple files?
I have used a custom partitioner class and it works fine, but I don't want to use one, since I am just reading from HBase and not doing any reduce operation. Also, if I have to add any extra file name, I have to change my code as well.
Here is my mapper code
public class DefaultMapper extends TableMapper<NullWritable, Text> {
    private Text text = new Text();
    MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";

    @Override
    public void setup(Context context) throws java.io.IOException, java.lang.InterruptedException {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        String FILE_NAME = new String(value.getValue(Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_NAME)));
        multipleOutputs.write(NullWritable.get(), new Text(text.toString()), FILE_NAME);
        //context.write(NullWritable.get(), text);
    }
}
No reducer class
This is how my output looks; ideally only one Japan.BUS.gz file should be created. The other files are also very small:
Japan.BUS-m-00193.gz
Japan.BUS-m-00194.gz
Japan.BUS-m-00195.gz
Japan.BUS-m-00196.gz
I encountered the same situation and worked out a solution for it.
MultipleOutputs<KEYOUT, VALUEOUT> multipleOutputs = null;
String keyToFind = new String();

public void setup(Context context) throws IOException, InterruptedException
{
    this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
}

public void map(NullWritable key, Text values, Context context) throws IOException, InterruptedException
{
    String valToFindInCol[] = values.toString().split(","); /** Let's say comma separated **/
    if (keyToFind == null || keyToFind.equals(valToFindInCol[2])) /** Say you need to match the element at position 2 **/
    {
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    else
    {
        this.multipleOutputs.close();
        this.multipleOutputs = null;
        this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    keyToFind = valToFindInCol[2];
}
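One note not in the original answer: however the MultipleOutputs instance is managed, it should also be closed when the task finishes, typically in cleanup(). A minimal sketch:

@Override
protected void cleanup(Context context) throws IOException, InterruptedException
{
    // Flush and close any output files still held open by MultipleOutputs.
    if (this.multipleOutputs != null)
    {
        this.multipleOutputs.close();
    }
}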
I am trying to get a summary of a CSV file, and the first line of the file is the header. Is there a way to emit the values of each column with the column's header name as a key-value pair from the Java code?
Eg: Input file is like
A,B,C,D
1,2,3,4
5,6,7,8
I want the output from the mapper to be (A,1),(B,2),(C,3),(D,4),(A,5),....
Note: I tried overriding the run function in the Mapper class to skip the first line. But as far as I know, the run function gets called for each input split, so it does not suit my need. Any help on this will really be appreciated.
This is the way my mapper looks:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] splits = line.split(",",-1);
int length = splits.length;
// count = 0;
for (int i = 0; i < length; i++) {
columnName.set(header[i]);
context.write(columnName, new Text(splits[i]+""));
}
}
public void run(Context context) throws IOException, InterruptedException
{
setup(context);
try
{
if (context.nextKeyValue())
{
Text columnHeader = context.getCurrentValue();
header = columnHeader.toString().split(",");
}
while (context.nextKeyValue())
{
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
}
finally
{
cleanup(context);
}
}
I assume that the column headers are alphabets and column values are numbers.
One of the ways to achieve this is to use the DistributedCache.
Following are the steps:
Create a file containing the column headers.
In the Driver code, add this file to the distributed cache, by calling Job::addCacheFile()
In the setup() method of the mapper, access this file from the distributed cache. Parse and store the contents of the file in a columnHeader list.
In the map() method, check if the values in each record match the headers (stored in the columnHeader list). If yes, then ignore that record (because the record just contains the headers). If not, emit the values along with the column headers.
This is how the Mapper and Driver code looks:
Driver:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "HeaderParser");
job.setJarByClass(WordCount.class);
job.setMapperClass(HeaderParserMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.addCacheFile(new URI("/in/header.txt#header.txt"));
FileInputFormat.addInputPath(job, new Path("/in/in7.txt"));
FileOutputFormat.setOutputPath(job, new Path("/out/"));
System.exit(job.waitForCompletion(true) ? 0:1);
}
Driver Logic:
Copy "header.txt" (which contains just one line: A,B,C,D) to HDFS
In the Driver, add "header.txt" to distributed cache, by executing following statement:
job.addCacheFile(new URI("/in/header.txt#header.txt"));
Mapper:
public static class HeaderParserMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    String[] headerList;
    String header;

    @Override
    protected void setup(Mapper.Context context) throws IOException, InterruptedException {
        BufferedReader bufferedReader = new BufferedReader(new FileReader("header.txt"));
        header = bufferedReader.readLine();
        headerList = header.split(",");
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] values = line.split(",");
        if (headerList.length == values.length && !header.equals(line)) {
            for (int i = 0; i < values.length; i++) {
                context.write(new Text(headerList[i] + "," + values[i]), NullWritable.get());
            }
        }
    }
}
Mapper Logic:
Override setup() method.
Read "header.txt" (which was put in distributed cache in the Driver) in the setup() method.
In the map() method, check if the line matches the header. If yes, then ignore that line. Else, output header and values as (h1,v1), (h2,v2), (h3,v3) and (h4,v4).
I ran this program on the following input:
A,B,C,D
1,2,3,4
5,6,7,8
I got the following output (where values are matched with respective header):
A,1
A,5
B,2
B,6
C,3
C,7
D,4
D,8
The accepted answer by @Manjunath Ballur works as a good hack. But MapReduce code should be kept simple, and checking the header for every line is not the recommended way to do this.
One way to go is to write a custom InputFormat that does this work for you, as sketched below.
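As a rough illustration of that idea (not tested; the names HeaderSkippingInputFormat and HeaderSkippingRecordReader are made up here), a record reader can wrap LineRecordReader and silently drop the first line of the split that starts at byte 0, so the mapper never sees the header:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HeaderSkippingInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new HeaderSkippingRecordReader();
    }

    static class HeaderSkippingRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader reader = new LineRecordReader();
        private boolean atFileStart;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            reader.initialize(split, context);
            // Only the split that begins at byte 0 contains the header line.
            atFileStart = ((FileSplit) split).getStart() == 0;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (atFileStart) {
                // Consume and discard the header line exactly once.
                atFileStart = false;
                if (!reader.nextKeyValue()) {
                    return false;
                }
            }
            return reader.nextKeyValue();
        }

        @Override public LongWritable getCurrentKey() { return reader.getCurrentKey(); }
        @Override public Text getCurrentValue() { return reader.getCurrentValue(); }
        @Override public float getProgress() throws IOException { return reader.getProgress(); }
        @Override public void close() throws IOException { reader.close(); }
    }
}

The header values themselves would still need to reach the mappers by some other route, for example the distributed cache approach shown above.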
I am new to Hadoop MapReduce. My input is many text files, and I want to write a MapReduce program that writes all the file names and the sentences associated with each file name into one output file.
I want to emit just the file name (key) and the associated sentences (value) from the mapper, and the reducer will collect the key and all the values and write the file name and its associated sentences to the output.
Mapper and reducer:
public void map(Text key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString(), ",");
    String filename = new String();
    FileSplit filesplit = (FileSplit) reporter.getInputSplit();
    filename = filesplit.getPath().getName();
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(new Text(filename), word);
    }
}
public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output,
                   Reporter reporter) throws IOException {
    // int sum = 0;
    String translation = "";
    while (values.hasNext()) {
        translation += "|" + values.toString() + "|";
    }
    results.set(translation);
    output.collect(key, results);
}
When I run the above mapper and reducer with the same input format configuration (KeyValueTextInputFormat.class), it does not write anything to the output.
What should I change to achieve my goal?
In your reduce method you declare values to be an Iterator. It should be declared as an Iterable instead.
public void reduce(Text key, Iterable<Text> values, ....
instead of
public void reduce(Text key, Iterator<Text> values, ....
Once you've done that, you can do:
Iterator<Text> iter = values.iterator();
while(iter.hasNext())
{
translation += "|" + iter.next().toString() + "|";
}
Because you used the wrong type, the method isn't overriding the default reduce method, which doesn't do anything. That's why you get no output.
I also don't see where you declare the variable results.
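Putting it together, a corrected reducer (new mapreduce API) could look like the following sketch; the class name FileNameReducer and the Text field results are placeholders for whatever the question's class actually uses:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class FileNameReducer extends Reducer<Text, Text, Text, Text> {
    private final Text results = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate every sentence for this file name, delimited by pipes.
        String translation = "";
        for (Text value : values) {
            translation += "|" + value.toString() + "|";
        }
        results.set(translation);
        context.write(key, results);
    }
}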
I am trying to count the occurrences of a particular word in a file using Hadoop MapReduce programming in Java. Both the file and the word should be user input. So I am trying to pass the particular word as a third argument along with the i/p and o/p paths (In, Out, Word). But I am not able to find a way to pass the word to the map function.
I have tried the following way but it did not work:
- I created a static String variable in the mapper class and assigned the value of my 3rd argument (i.e., the word to be searched) to it, and then tried to use this static variable inside the map function. But inside the map function the static variable's value came out as null.
I am unable to get the third argument's value inside the map function.
Is there any way to set the value via the JobConf object? Please help. I have pasted my code below.
public class MyWordCount {
public static class MyWordCountMap extends Mapper<Text, Text, Text, LongWritable> {
static String wordToSearch;
private final static LongWritable ONE = new LongWritable(1L);
private Text word = new Text();
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
System.out.println(wordToSearch); // Here the value is coming as Null
if (value.toString().compareTo(wordToSearch) == 0) {
context.write(word, ONE);
}
}
}
public static class SumReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
public void reduce(Text key, Iterator<LongWritable> values,
                   Context context) throws IOException, InterruptedException {
long sum = 0L;
while (values.hasNext()) {
sum += values.next().get();
}
context.write(key, new LongWritable(sum));
}
}
public static void main(String[] rawArgs) throws Exception {
GenericOptionsParser parser = new GenericOptionsParser(rawArgs);
Configuration conf = parser.getConfiguration();
String[] args = parser.getRemainingArgs();
Job job = new Job(conf, "wordcount");
job.setJarByClass(MyWordCountMap.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setMapperClass(MyWordCountMap.class);
job.setReducerClass(SumReduce.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
String MyWord = args[2];
MyWordCountMap.wordToSearch = MyWord;
job.waitForCompletion(true);
}
}
There is a way to do this with Configuration (see the API here). As an example, the following code can be used, which sets "Tree" as the word to be searched:
//Create a new configuration
Configuration conf = new Configuration();
//Set the word to be searched
conf.set("wordToSearch", "Tree");
//create the job
Job job = new Job(conf);
Then, in your mapper/reducer class you can get wordToSearch (i.e., "Tree" in this example) using the following:
//Create a new configuration
Configuration conf = context.getConfiguration();
//retrieve the wordToSearch variable
String wordToSearch = conf.get("wordToSearch");
See here for more details.
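For completeness, a sketch of how the mapper from the question could read the word once in setup() instead of relying on a static field (class and field names follow the question's code; the write of new Text(wordToSearch) is just one reasonable choice of output key):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class MyWordCountMap extends Mapper<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1L);
    private String wordToSearch;

    @Override
    protected void setup(Context context) {
        // Read the word set in the driver via conf.set("wordToSearch", ...).
        wordToSearch = context.getConfiguration().get("wordToSearch");
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().equals(wordToSearch)) {
            context.write(new Text(wordToSearch), ONE);
        }
    }
}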
private static String[] testFiles = new String[] {"img01.JPG","img02.JPG","img03.JPG","img04.JPG","img06.JPG","img07.JPG","img05.JPG"};
// private static String testFilespath = "/home/student/Desktop/images";
private static String testFilespath ="hdfs://localhost:54310/user/root/images";
//private static String indexpath = "/home/student/Desktop/indexDemo";
private static String testExtensive="/home/student/Desktop/images";
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
private Text input_image = new Text();
private Text input_vector = new Text();
@Override
public void map(Text key, Text value,OutputCollector<Text, Text> output,Reporter reporter) throws IOException {
System.out.println("CorrelogramIndex Method:");
String featureString;
int MAXIMUM_DISTANCE = 16;
AutoColorCorrelogram.Mode mode = AutoColorCorrelogram.Mode.FullNeighbourhood;
for (String identifier : testFiles) {
try (FileInputStream fis = new FileInputStream(testFilespath + "/" + identifier)) {
//Document doc = builder.createDocument(fis, identifier);
//FileInputStream imageStream = new FileInputStream(testFilespath + "/" + identifier);
BufferedImage bimg = ImageIO.read(fis);
AutoColorCorrelogram vd = new AutoColorCorrelogram(MAXIMUM_DISTANCE, mode);
vd.extract(bimg);
featureString = vd.getStringRepresentation();
double[] bytearray=vd.getDoubleHistogram();
System.out.println("image: "+ identifier + " " + featureString );
}
System.out.println(" ------------- ");
input_image.set(identifier);
input_vector.set(featureString);
output.collect(input_image, input_vector);
}
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String out_vector = "";
while (values.hasNext()) {
    out_vector += values.next().toString();
}
output.collect(key, new Text(out_vector));
}
}
static int printUsage() {
System.out.println("image_mapreduce [-m <maps>] [-r <reduces>] <input> <output>");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
@Override
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(getConf(), image_mapreduce.class);
conf.setJobName("image_mapreduce");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(Text.class);
conf.setMapperClass(MapClass.class);
// conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
List<String> other_args = new ArrayList<String>();
for(int i=0; i < args.length; ++i) {
try {
if ("-m".equals(args[i])) {
conf.setNumMapTasks(Integer.parseInt(args[++i]));
} else if ("-r".equals(args[i])) {
conf.setNumReduceTasks(Integer.parseInt(args[++i]));
} else {
other_args.add(args[i]);
}
} catch (NumberFormatException except) {
System.out.println("ERROR: Integer expected instead of " + args[i]);
return printUsage();
} catch (ArrayIndexOutOfBoundsException except) {
System.out.println("ERROR: Required parameter missing from " +
args[i-1]);
return printUsage();
}
}
FileInputFormat.setInputPaths(conf, other_args.get(0));
//FileInputFormat.setInputPaths(conf,new Path("hdfs://localhost:54310/user/root/images"));
FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new image_mapreduce(), args);
System.exit(res);
}
}
I am writing a program which takes multiple image files, stored in HDFS, as input and extracts their features in the map function. How can I specify the path for reading the image in FileInputStream(some parameters)? Or is there any way to read multiple image files?
What I want to do is:
-- take multiple image files in HDFS as input
-- extract features in the map function
-- reduce iteratively
Please help me with the code or with better ways to do it.
Look into using the HIPI library - it stores a collection of images in an ImageBundle (which is more efficient than storing the individual image files in HDFS). They have a couple of examples too.
As for your code, you need to specify what input and output formats you plan to use. There is no current input format that hands the entire file over, but you can just extend FileInputFormat and create a RecordReader that emits <Text, BytesWritable> pairs, where the key is the filename, and the value is the bytes of the image file.
In fact, Hadoop - The Definitive Guide has an example of this exact input format.
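Along those lines, here is a hedged sketch of such an input format (written against the newer mapreduce API with illustrative class names; it is not the book's exact code), handing each image file over as a single <filename, bytes> record:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeImageInputFormat extends FileInputFormat<Text, BytesWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // one image file == one record, never split
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WholeImageRecordReader();
    }

    static class WholeImageRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private boolean processed = false;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = split.getPath();
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = path.getFileSystem(conf);
            FSDataInputStream in = fs.open(path);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.getName());                  // key = file name
            value.set(contents, 0, contents.length);  // value = raw image bytes
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}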
If you want to send all the images as input to the MR task, just point FileInputFormat.setInputPaths() at the input directory.
If you want to send only selected images in a particular folder, you can add multiple paths when calling FileInputFormat.setInputPaths().
One way is to create a Path[] with one entry per image, or just pass a comma separated string with all the paths.
Go through the following documentation
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html
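For example, with the old mapred API used in the question's driver, either form could be used (the paths here are the ones from the question):

// whole directory:
FileInputFormat.setInputPaths(conf, new Path("hdfs://localhost:54310/user/root/images"));
// or only selected images, as a comma separated string:
FileInputFormat.setInputPaths(conf,
    "hdfs://localhost:54310/user/root/images/img01.JPG,hdfs://localhost:54310/user/root/images/img02.JPG");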
And one more thing: you have to set the map input types to Text and a byte array, and get the image features from that byte-array input instead of creating a new FileInputStream.
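A hedged sketch of what the map method could then look like, assuming an input format that delivers <Text, BytesWritable> per image (like the one sketched in the previous answer), and reusing the AutoColorCorrelogram calls from the question (java.io.ByteArrayInputStream, javax.imageio.ImageIO and java.awt.image.BufferedImage are assumed to be imported):

public void map(Text key, BytesWritable value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // Rebuild the image from the raw bytes instead of opening a FileInputStream.
    BufferedImage bimg = ImageIO.read(new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));
    AutoColorCorrelogram vd = new AutoColorCorrelogram(16, AutoColorCorrelogram.Mode.FullNeighbourhood);
    vd.extract(bimg);
    // key is already the file name supplied by the input format.
    output.collect(key, new Text(vd.getStringRepresentation()));
}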