I am new to Hadoop MapReduce. My input is many text files, and I want to write a MapReduce program that writes every file name and its associated sentences to a single output file.
I want the mapper to emit the file name (key) and the associated sentences (value), and the reducer to collect each key with all of its values and write the file name and its associated sentences to the output.
Mapper and reducer:
public void map(Text key, Text value,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
StringTokenizer itr = new StringTokenizer(value.toString(), ",");
String filename = new String();
FileSplit filesplit = (FileSplit) reporter.getInputSplit();
filename = filesplit.getPath().getName();
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(new Text(filename), word);
}
}
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
// int sum = 0;
String translation = "";
while (values.hasNext()) {
translation += "|" + values.toString() + "|";
}
results.set(translation);
output.collect(key, results);
}
When I run the above mapper and reducer with KeyValueTextInputFormat.class as the input format, nothing is written to the output.
What should I change to achieve my goal?
In your reduce method you declare values to be an Iterator. It should be declared as an Iterable instead.
public void reduce(Text key, Iterable<Text> values, ....
instead of
public void reduce(Text key, Iterator<Text> values, ....
Once you've done that, you can do:
Iterator<Text> iter = values.iterator();
while(iter.hasNext())
{
translation += "|" + iter.next().toString() + "|";
}
Because you used the wrong type, your method isn't overriding the default reduce method, which does nothing; that's why you get no output.
I also don't see where you declare the variable results.
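For reference, here is a minimal sketch of a corrected reducer against the newer org.apache.hadoop.mapreduce API, with the results variable declared as a field; the class and field names are illustrative, not taken from your job:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only: collects all sentences emitted for one file name.
public class FileNameReducer extends Reducer<Text, Text, Text, Text> {

    private final Text results = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate every sentence for this file name, separated by pipes.
        StringBuilder translation = new StringBuilder();
        for (Text value : values) {
            translation.append("|").append(value.toString()).append("|");
        }
        results.set(translation.toString());
        context.write(key, results);
    }
}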
I am trying to summarize a CSV file whose first line is the header. Is there a way, from the Java code, to emit each column's value with its header name as a key-value pair?
E.g., the input file is like:
A,B,C,D
1,2,3,4
5,6,7,8
I want the output from the mapper to be (A,1),(B,2),(C,3),(D,4),(A,5),....
Note: I tried overriding the run function in the Mapper class to skip the first line, but as far as I know the run function gets called for each input split and thus does not suit my need. Any help on this will really be appreciated.
This is the way my mapper looks:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] splits = line.split(",",-1);
int length = splits.length;
// count = 0;
for (int i = 0; i < length; i++) {
columnName.set(header[i]);
context.write(columnName, new Text(splits[i]+""));
}
}
public void run(Context context) throws IOException, InterruptedException
{
setup(context);
try
{
if (context.nextKeyValue())
{
Text columnHeader = context.getCurrentValue();
header = columnHeader.toString().split(",");
}
while (context.nextKeyValue())
{
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
}
finally
{
cleanup(context);
}
}
I assume that the column headers are alphabetic and the column values are numeric.
One of the ways to achieve this is to use the DistributedCache.
Following are the steps:
Create a file containing the column headers.
In the Driver code, add this file to the distributed cache, by calling Job::addCacheFile()
In the setup() method of the mapper, access this file from the distributed cache. Parse and store the contents of the file in a columnHeader list.
In the map() method, check if the values in each record match the headers (stored in the columnHeader list). If yes, then ignore that record (because the record just contains the headers). If no, then emit the values along with the column headers.
This is what the Mapper and Driver code looks like:
Driver:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "HeaderParser");
job.setJarByClass(WordCount.class);
job.setMapperClass(HeaderParserMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.addCacheFile(new URI("/in/header.txt#header.txt"));
FileInputFormat.addInputPath(job, new Path("/in/in7.txt"));
FileOutputFormat.setOutputPath(job, new Path("/out/"));
System.exit(job.waitForCompletion(true) ? 0:1);
}
Driver Logic:
Copy "header.txt" (which contains just one line: A,B,C,D) to HDFS
In the Driver, add "header.txt" to distributed cache, by executing following statement:
job.addCacheFile(new URI("/in/header.txt#header.txt"));
Mapper:
public static class HeaderParserMapper
extends Mapper<LongWritable, Text , Text, NullWritable>{
String[] headerList;
String header;
@Override
protected void setup(Mapper.Context context) throws IOException, InterruptedException {
BufferedReader bufferedReader = new BufferedReader(new FileReader("header.txt"));
header = bufferedReader.readLine();
headerList = header.split(",");
}
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] values = line.split(",");
if(headerList.length == values.length && !header.equals(line)) {
for(int i = 0; i < values.length; i++)
context.write(new Text(headerList[i] + "," + values[i]), NullWritable.get());
}
}
}
Mapper Logic:
Override setup() method.
Read "header.txt" (which was put in distributed cache in the Driver) in the setup() method.
In the map() method, check if the line matches the header. If yes, then ignore that line. Else, output header and values as (h1,v1), (h2,v2), (h3,v3) and (h4,v4).
I ran this program on the following input:
A,B,C,D
1,2,3,4
5,6,7,8
I got the following output (where values are matched with respective header):
A,1
A,5
B,2
B,6
C,3
C,7
D,4
D,8
The accepted answer by @Manjunath Ballur works as a good hack, but MapReduce code should be kept simple, and checking the header for every line is not the recommended way to do this.
One way to go is to write a custom InputFormat that does this work for you.
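As a rough, untested sketch of that idea (new org.apache.hadoop.mapreduce API, placeholder class names): wrap the standard LineRecordReader and drop the first line of any split that starts at byte offset 0, i.e. the header line of each file. The mapper would still need the header names themselves from somewhere else, for example the distributed cache shown above.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Untested sketch: a TextInputFormat variant that skips the header line of each file.
public class SkipHeaderTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new SkipHeaderRecordReader();
    }

    public static class SkipHeaderRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private boolean atFileStart;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            // Only the split that starts at offset 0 contains the header line.
            atFileStart = ((FileSplit) split).getStart() == 0;
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (atFileStart) {
                atFileStart = false;
                if (!delegate.nextKeyValue()) {   // consume and discard the header line
                    return false;
                }
            }
            return delegate.nextKeyValue();
        }

        @Override public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }
        @Override public Text getCurrentValue() { return delegate.getCurrentValue(); }
        @Override public float getProgress() throws IOException { return delegate.getProgress(); }
        @Override public void close() throws IOException { delegate.close(); }
    }
}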
I am trying to count the occurrences of a particular word in a file using Hadoop MapReduce in Java. Both the file and the word should be user input, so I am trying to pass the word as a third argument along with the input and output paths (In, Out, Word). But I am not able to find a way to pass the word to the map function.
I have tried the following way but it did not work:
- I created a static String variable in the mapper class, assigned the value of my third argument (i.e., the word to be searched) to it, and then tried to use this static variable inside the map function. But inside the map function the static variable's value came out as null.
I am unable to get the third argument's value inside the map function.
Is there any way to set the value via the JobConf object? Please help. I have pasted my code below.
public class MyWordCount {
public static class MyWordCountMap extends Mapper<Text, Text, Text, LongWritable> {
static String wordToSearch;
private final static LongWritable ONE = new LongWritable(1L);
private Text word = new Text();
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
System.out.println(wordToSearch); // Here the value is coming as Null
if (value.toString().compareTo(wordToSearch) == 0) {
context.write(word, ONE);
}
}
}
public static class SumReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
public void reduce(Text key, Iterator<LongWritable> values,
Context context) throws IOException, InterruptedException {
long sum = 0L;
while (values.hasNext()) {
sum += values.next().get();
}
context.write(key, new LongWritable(sum));
}
}
public static void main(String[] rawArgs) throws Exception {
GenericOptionsParser parser = new GenericOptionsParser(rawArgs);
Configuration conf = parser.getConfiguration();
String[] args = parser.getRemainingArgs();
Job job = new Job(conf, "wordcount");
job.setJarByClass(MyWordCountMap.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setMapperClass(MyWordCountMap.class);
job.setReducerClass(SumReduce.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
String MyWord = args[2];
MyWordCountMap.wordToSearch = MyWord;
job.waitForCompletion(true);
}
}
There is a way to do this with Configuration (see the API here). As an example, the following code can be used, which sets "Tree" as the word to be searched:
//Create a new configuration
Configuration conf = new Configuration();
//Set the word to be searched
conf.set("wordToSearch", "Tree");
//create the job
Job job = new Job(conf);
Then, in your mapper/reducer class you can get wordToSearch (i.e., "Tree" in this example) using the following:
//Create a new configuration
Configuration conf = context.getConfiguration();
//retrieve the wordToSearch variable
String wordToSearch = conf.get("wordToSearch");
See here for more details.
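Putting it together, a minimal sketch of a mapper that reads the word from the Configuration in setup() might look like this (the class name and the "wordToSearch" key are only the ones used in this example):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: counts occurrences of the configured word.
public class SearchWordMapper extends Mapper<Text, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1L);
    private String wordToSearch;

    @Override
    protected void setup(Context context) {
        // Retrieve the value set in the driver via conf.set("wordToSearch", "Tree").
        wordToSearch = context.getConfiguration().get("wordToSearch");
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().equals(wordToSearch)) {
            context.write(value, ONE);
        }
    }
}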
I have a class which extends TreeMap and adds one external method.
The external method "open" is supposed to read lines from a given file in the format "word:meaning" and add them to the TreeMap via put("word", "meaning").
So I read the file with a RandomAccessFile and put the keys and values into the TreeMap, and when I print the TreeMap I can see the proper keys and values, for example:
{AAAA=BBBB, CAB=yahoo!}
But for some reason, when I call get("AAAA"), I get null.
Any idea why this is happening and how to solve it?
Here is the code:
public class InMemoryDictionary extends TreeMap<String, String> implements
PersistentDictionary {
private static final long serialVersionUID = 1L; // (because we're extending
// a serializable class)
private File dictFile;
public InMemoryDictionary(File dictFile) {
super();
this.dictFile = dictFile;
}
@Override
public void open() throws IOException {
clear();
RandomAccessFile file = new RandomAccessFile(dictFile, "rw");
file.seek(0);
String line;
while (null != (line = file.readLine())) {
int firstColon = line.indexOf(":");
put(line.substring(0, firstColon - 1),
line.substring(firstColon + 1, line.length() - 1));
}
file.close();
}
@Override
public void close() throws IOException {
dictFile.delete();
RandomAccessFile file = new RandomAccessFile(dictFile, "rw");
file.seek(0);
for (Map.Entry<String, String> entry : entrySet()) {
file.writeChars(entry.getKey() + ":" + entry.getValue() + "\n");
}
file.close();
}
}
the "question marks" from a previous version of your question are important. they indicate that the strings you thought you were seeing are not in fact the strings you are using. RandomAccessFile is a poor choice to read a text file. You are presumably reading a text file with a text encoding which is not single byte (utf-16 perhaps)? the resulting strings are mis-encoded since RandomAccessFile does an "ascii" character conversion. this is causing your get() call to fail.
first, figure out the character encoding of your file and open it with the appropriately configured InputStreamReader.
second, extending TreeMap is a very poor design. Use aggregation here, not extension.
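To illustrate both points, here is a minimal sketch, assuming the file is UTF-8 encoded (adjust the charset to whatever your file actually uses), that holds the TreeMap by composition instead of extending it; names are illustrative and the PersistentDictionary interface from your code is omitted:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Illustrative sketch: reads "word:meaning" lines into a TreeMap held by composition.
public class InMemoryDictionary {

    private final File dictFile;
    private final TreeMap<String, String> entries = new TreeMap<>();

    public InMemoryDictionary(File dictFile) {
        this.dictFile = dictFile;
    }

    public void open() throws IOException {
        entries.clear();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(dictFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int firstColon = line.indexOf(':');
                if (firstColon > 0) {
                    // Key is everything before the colon, value everything after it.
                    entries.put(line.substring(0, firstColon), line.substring(firstColon + 1));
                }
            }
        }
    }

    public String lookup(String word) {
        return entries.get(word);
    }
}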
private static String[] testFiles = new String[] {"img01.JPG","img02.JPG","img03.JPG","img04.JPG","img06.JPG","img07.JPG","img05.JPG"};
// private static String testFilespath = "/home/student/Desktop/images";
private static String testFilespath ="hdfs://localhost:54310/user/root/images";
//private static String indexpath = "/home/student/Desktop/indexDemo";
private static String testExtensive="/home/student/Desktop/images";
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
private Text input_image = new Text();
private Text input_vector = new Text();
@Override
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
System.out.println("CorrelogramIndex Method:");
String featureString;
int MAXIMUM_DISTANCE = 16;
AutoColorCorrelogram.Mode mode = AutoColorCorrelogram.Mode.FullNeighbourhood;
for (String identifier : testFiles) {
try (FileInputStream fis = new FileInputStream(testFilespath + "/" + identifier)) {
//Document doc = builder.createDocument(fis, identifier);
//FileInputStream imageStream = new FileInputStream(testFilespath + "/" + identifier);
BufferedImage bimg = ImageIO.read(fis);
AutoColorCorrelogram vd = new AutoColorCorrelogram(MAXIMUM_DISTANCE, mode);
vd.extract(bimg);
featureString = vd.getStringRepresentation();
double[] bytearray=vd.getDoubleHistogram();
System.out.println("image: "+ identifier + " " + featureString );
}
System.out.println(" ------------- ");
input_image.set(identifier);
input_vector.set(featureString);
output.collect(input_image, input_vector);
}
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String out_vector="";
while (values.hasNext()) {
out_vector.concat(values.next().toString());
}
output.collect(key, new Text(out_vector));
}
}
static int printUsage() {
System.out.println("image_mapreduce [-m <maps>] [-r <reduces>] <input> <output>");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
@Override
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(getConf(), image_mapreduce.class);
conf.setJobName("image_mapreduce");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(Text.class);
conf.setMapperClass(MapClass.class);
// conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
List<String> other_args = new ArrayList<String>();
for(int i=0; i < args.length; ++i) {
try {
if ("-m".equals(args[i])) {
conf.setNumMapTasks(Integer.parseInt(args[++i]));
} else if ("-r".equals(args[i])) {
conf.setNumReduceTasks(Integer.parseInt(args[++i]));
} else {
other_args.add(args[i]);
}
} catch (NumberFormatException except) {
System.out.println("ERROR: Integer expected instead of " + args[i]);
return printUsage();
} catch (ArrayIndexOutOfBoundsException except) {
System.out.println("ERROR: Required parameter missing from " +
args[i-1]);
return printUsage();
}
}
FileInputFormat.setInputPaths(conf, other_args.get(0));
//FileInputFormat.setInputPaths(conf,new Path("hdfs://localhost:54310/user/root/images"));
FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new image_mapreduce(), args);
System.exit(res);
}
}
I am writing a program which takes multiple image files, stored in HDFS, as input and extracts their features in the map function. How can I specify the path of the image to read in FileInputStream(some parameters)? Or is there any way to read multiple image files?
What I want to do is:
-- Take multiple image files in HDFS as input
-- Extract features in the map function
-- Reduce iteratively
Please help me with the code or suggest better ways to do it.
Look into using the HIPI library - it stores a collection of images in an ImageBundle (which is more efficient than storing the individual image files in HDFS). They have a couple of examples too.
As for your code, you need to specify what input and output formats you plan to use. There is no current input format that hands the entire file over, but you can just extend FileInputFormat and create a RecordReader that emits <Text, BytesWritable> pairs, where the key is the filename, and the value is the bytes of the image file.
In fact, Hadoop: The Definitive Guide has an example of this exact input format.
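A rough sketch of that idea (an illustrative reconstruction using the newer org.apache.hadoop.mapreduce API, not the book's code; class names are placeholders) could look like this:
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Untested sketch: emits one <file name, file bytes> record per image file.
public class WholeImageInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // read each image as a single, unsplit record
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split,
                                                                TaskAttemptContext context) {
        return new WholeImageRecordReader();
    }

    public static class WholeImageRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = fileSplit.getPath();
            byte[] contents = new byte[(int) fileSplit.getLength()];
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.getName());                    // key   = image file name
            value.set(contents, 0, contents.length);    // value = raw image bytes
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}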
If you want to send all the images as input to the MR task, you can just point FileInputFormat.setInputPaths() at the input directory.
If you want to send only selected images from a particular folder, you can add multiple paths when calling FileInputFormat.setInputPaths().
One way is to create a Path[] with one entry per image, or just pass a single comma-separated string with all the paths (see the sketch after the documentation link below).
Go through the following documentation
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html
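For example (old mapred API, matching the JobConf in the question; the paths are purely illustrative):
// Option 1: one Path per image file.
FileInputFormat.setInputPaths(conf,
        new Path("hdfs://localhost:54310/user/root/images/img01.JPG"),
        new Path("hdfs://localhost:54310/user/root/images/img02.JPG"));

// Option 2: a single comma-separated string with all the paths.
FileInputFormat.setInputPaths(conf,
        "hdfs://localhost:54310/user/root/images/img01.JPG,"
        + "hdfs://localhost:54310/user/root/images/img02.JPG");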
And one more thing: you have to set the map input types to Text and BytesWritable, and extract the image features from that BytesWritable input instead of creating a new FileInputStream.
I wrote some Hadoop code to read the input file, split it into chunks, and write them to many files, as follows:
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
int totalLines = 2000;
int lines = 0;
int fileNum = 1;
String[] linesinfile = line.split("\n");
while(lines<linesinfile.length) {
// I do something like, if lines = totalLines, {
output.collect(new IntWritable(fileNum), new Text(linesinfile[lines].toString()));
fileNum++;
lines = 0;
}
lines++;
}
}
In reduce, I do:
public void reduce(IntWritable key, Iterator<Text> values,
OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
while(values.hasNext()){
output.collect(key, values.next());
}
}
My MultiFile class is as follows:
public class MultiFileOutput extends MultipleTextOutputFormat<IntWritable, Text> {
protected String generateFileNameForKeyValue(IntWritable key, Text content, String fileName) {
return key.toString() + "-" + fileName;
}
}
In main, I say:
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(MultiFileOutput.class);
apart from setting the output key/value classes, etc.
What am I doing wrong? My output directory is always empty.
Thanks
The program looks a bit complex. If the purpose is to split the file into multiple files, it can be done in a couple of ways; there is no need for a map and reduce job, a map-only job would be enough.
Use o.a.h.mapred.lib.NLineInputFormat to feed N lines of the input at a time to each mapper and then write those N lines to a file (see the sketch after these options).
Or set dfs.blocksize to the required file size when uploading the file; then each mapper will process one InputSplit, which can be written out as a file.
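As a sketch of the NLineInputFormat option (old mapred API as in the question; the driver class name is a placeholder and 2000 matches the totalLines value in your mapper):
// Map-only job: NLineInputFormat hands each mapper a split of 2000 input lines.
JobConf conf = new JobConf(SplitFileJob.class);   // SplitFileJob is a placeholder name
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 2000);
conf.setNumReduceTasks(0);   // no reducers: each mapper's output becomes its own part file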