I am trying to follow a Hadoop tutorial from a website and implement it in Java. The file that is provided contains data about a forum. I want to parse that file and use the data.
The code to set my configurations is as follows:
public class ForumAnalyser extends Configured implements Tool{
public static void main(String[] args) {
int exitCode = 0;
try {
exitCode = ToolRunner.run(new ForumAnalyser(), args);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
finally {
System.exit(exitCode);
}
}
@Override
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(ForumAnalyser.class);
setStudentHourPostJob(conf);
JobClient.runJob(conf);
return 0;
}
public static void setStudentHourPostJob(JobConf conf) {
FileInputFormat.setInputPaths(conf, new Path("input2"));
FileOutputFormat.setOutputPath(conf, new Path("output_forum_post"));
conf.setJarByClass(ForumAnalyser.class);
conf.setMapperClass(StudentHourPostMapper.class);
conf.setOutputKeyClass(LongWritable.class);
conf.setMapOutputKeyClass(LongWritable.class);
conf.setReducerClass(StudentHourPostReducer.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapOutputValueClass(IntWritable.class);
}
}
Each record in the file is separated by a "\n", so in the mapper class each record is mostly returned correctly. The columns within each record are separated by tabs. The problem occurs for a specific column, "posts". This column contains the "posts" written by people and hence also contains "\n", so the mapper incorrectly reads a line inside the "posts" column as a new record. Also, the "posts" column is wrapped in double quotes in the file. The question that I have is:
1. How can I tell the mapper to differentiate each record correctly? Can I somehow tell it to read each column by tab? (I know how many columns each record has.)
Thanks in advance for the help.
By default, MapReduce uses TextInputFormat, in which each record is a line of input (it assumes records are delimited by a newline ("\n")).
To achieve your requirements, you need to write your own InputFormat and RecordReader classes. For example, Mahout has an XmlInputFormat for reading XML records delimited by a begin/end tag. Check the code here: https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
I took the code for XmlInputFormat and modified it to achieve your requirements. Here is the code (I call the classes MultiLineInputFormat and MultiLineRecordReader):
package com.myorg.hadooptests;
import com.google.common.io.Closeables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
/**
* Reads records that are delimited by a specific begin/end tag.
*/
public class MultiLineInputFormat extends TextInputFormat {
private static final Logger log = LoggerFactory.getLogger(MultiLineInputFormat.class);
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
try {
return new MultiLineRecordReader((FileSplit) split, context.getConfiguration());
} catch (IOException ioe) {
log.warn("Error while creating MultiLineRecordReader", ioe);
return null;
}
}
/**
* MultiLineRecordReader class to read through a given text document to output records containing multiple
* lines as a single line
*
*/
public static class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
private final long start;
private final long end;
private final FSDataInputStream fsin;
private final DataOutputBuffer buffer = new DataOutputBuffer();
private LongWritable currentKey;
private Text currentValue;
private static final Logger log = LoggerFactory.getLogger(MultiLineRecordReader.class);
public MultiLineRecordReader(FileSplit split, Configuration conf) throws IOException {
// open the file and seek to the start of the split
start = split.getStart();
end = start + split.getLength();
Path file = split.getPath();
FileSystem fs = file.getFileSystem(conf);
fsin = fs.open(split.getPath());
fsin.seek(start);
log.info("start: " + Long.toString(start) + " end: " + Long.toString(end));
}
private boolean next(LongWritable key, Text value) throws IOException {
if (fsin.getPos() < end) {
try {
log.info("Started reading");
if(readUntilEnd()) {
key.set(fsin.getPos());
value.set(buffer.getData(), 0, buffer.getLength());
return true;
}
} finally {
buffer.reset();
}
}
return false;
}
@Override
public void close() throws IOException {
Closeables.closeQuietly(fsin);
}
@Override
public float getProgress() throws IOException {
return (fsin.getPos() - start) / (float) (end - start);
}
private boolean readUntilEnd() throws IOException {
boolean insideColumn = false;
byte[] delimiterBytes = "\"".getBytes("UTF-8");
byte[] newLineBytes = "\n".getBytes("UTF-8");
while (true) {
int b = fsin.read();
// end of file:
if (b == -1) return false;
log.info("Read: " + b);
// We encountered a double quote: toggle whether we are inside a quoted column
if(b == delimiterBytes[0]) {
insideColumn = !insideColumn;
}
// If we encounter a new line and we are not inside a column, it means the end of the record.
if(b == newLineBytes[0] && !insideColumn) return true;
// save to buffer:
buffer.write(b);
// see if we've passed the stop point:
if (fsin.getPos() >= end) {
if(buffer.getLength() > 0) // If buffer has some data, then return true
return true;
else
return false;
}
}
}
@Override
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return currentKey;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return currentValue;
}
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
currentKey = new LongWritable();
currentValue = new Text();
return next(currentKey, currentValue);
}
}
}
Logic:
I have assumed that the fields containing new lines ("\n") are delimited by double quotes (").
The record-reading logic is in the readUntilEnd() method.
In this method, if a new line appears while we are in the middle of reading a field (which is delimited by double quotes), we do not treat it as the end of a record.
To test this, I wrote an Identity Mapper (which writes the input as-is to the output). In the driver, you explicitly specify the input format as your custom input format.
For example, I have specified the input format as:
job.setInputFormatClass(MultiLineInputFormat.class); // This is my custom class for InputFormat and RecordReader
Following is the code:
package com.myorg.hadooptests;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MultiLineDemo {
public static class MultiLineMapper
extends Mapper<LongWritable, Text , Text, NullWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(value, NullWritable.get());
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "MultiLineMapper");
job.setInputFormatClass(MultiLineInputFormat.class);
job.setJarByClass(MultiLineDemo.class);
job.setMapperClass(MultiLineMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path("/in/in8.txt"));
FileOutputFormat.setOutputPath(job, new Path("/out/"));
job.waitForCompletion(true);
}
}
I ran this on the following input, and the input records match the output records exactly. You can see that the 2nd field in each record contains new lines ("\n"), but the entire record is still returned in the output.
E:\HadoopTests\target>hadoop fs -cat /in/in8.txt
1 "post1 \n" 3
1 "post2 \n post2 \n" 3
4 "post3 \n post3 \n post3 \n" 6
1 "post4 \n post4 \n post4 \n post4 \n" 6
E:\HadoopTests\target>hadoop fs -cat /out/*
1 "post1 \n" 3
1 "post2 \n post2 \n" 3
1 "post4 \n post4 \n post4 \n post4 \n" 6
4 "post3 \n post3 \n post3 \n" 6
Note: I wrote this code for demo purposes. You need to handle corner cases (if any) and optimize the code (if there is scope for optimization).
I am currently writing code to process a single image using Hadoop, so my input is only one file (.png). I have working code that will run a job, but instead of running sequential mappers, it runs only one mapper and never spawns other mappers.
I have created my own extensions of the FileInputFormat and RecordReader classes in order to create (what I thought were) "n" custom splits -> "n" map tasks.
I've been searching the web like crazy for examples of this nature to learn from, but all I've been able to find are examples which deal with using entire files as a split (meaning exactly one mapper) or using a fixed number of lines from a text file (e.g., 3) per map task.
What I'm trying to do is send a pair of coordinates ((x1, y1), (x2, y2)) to each mapper where the coordinates correspond to the top-left/bottom-right pixels of some rectangle in the image.
Any suggestions/guidance/examples/links to examples would greatly be appreciated.
Custom FileInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
public class FileInputFormat1 extends FileInputFormat
{
@Override
public RecordReader createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
return new RecordReader1();
}
@Override
protected boolean isSplitable(JobContext context, Path filename) {
return true;
}
}
Custom RecordReader
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
public class RecordReader1 extends RecordReader<KeyChunk1, NullWritable> {
private KeyChunk1 key;
private NullWritable value;
private ImagePreprocessor IMAGE;
public RecordReader1()
{
}
@Override
public void close() throws IOException {
}
@Override
public float getProgress() throws IOException, InterruptedException {
return IMAGE.getProgress();
}
@Override
public KeyChunk1 getCurrentKey() throws IOException, InterruptedException {
return key;
}
@Override
public NullWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
boolean gotNextValue = IMAGE.hasAnotherChunk();
if (gotNextValue)
{
if (key == null)
{
key = new KeyChunk1();
}
if (value == null)
{
value = NullWritable.get();
}
int[] data = IMAGE.getChunkIndicesAndIndex();
key.setChunkIndex(data[2]);
key.setStartRow(data[0]);
key.setStartCol(data[1]);
key.setChunkWidth(data[3]);
key.setChunkHeight(data[4]);
}
else
{
key = null;
value = null;
}
return gotNextValue;
}
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
Configuration config = taskAttemptContext.getConfiguration();
IMAGE = new ImagePreprocessor(
config.get("imageName"),
config.getInt("v_slices", 1),
config.getInt("h_slices", 1),
config.getInt("kernel_rad", 2),
config.getInt("grad_rad", 1),
config.get("hdfs_address"),
config.get("local_directory")
);
}
}
ImagePreprocessor Class (Used in custom RecordReader - only showing necessary information)
import java.awt.image.BufferedImage;
import java.io.IOException;
public class ImagePreprocessor {
private String filename;
private int num_v_slices;
private int num_h_slices;
private int minSize;
private int width, height;
private int chunkWidth, chunkHeight;
private int indexI, indexJ;
String hdfs_address, local_directory;
public ImagePreprocessor(String filename, int num_v_slices, int num_h_slices, int kernel_radius, int gradient_radius,
String hdfs_address, String local_directory) throws IOException{
this.hdfs_address = hdfs_address;
this.local_directory = local_directory;
// all "validate" methods throw errors if input data is invalid
checkValidFilename(filename);
checkValidNumber(num_v_slices, "vertical strips");
this.num_v_slices = num_v_slices;
checkValidNumber(num_h_slices, "horizontal strips");
this.num_h_slices = num_h_slices;
checkValidNumber(kernel_radius, "kernel radius");
checkValidNumber(gradient_radius, "gradient radius");
this.minSize = 1 + 2 * (kernel_radius + gradient_radius);
getImageData(); // loads image and saves width/height to class variables
validateImageSize();
chunkWidth = validateWidth((int)Math.ceil(((double)width) / num_v_slices));
chunkHeight = validateHeight((int)Math.ceil(((double)height) / num_h_slices));
indexI = 0;
indexJ = 0;
}
public boolean hasAnotherChunk()
{
return indexI < num_h_slices;
}
public int[] getChunkIndicesAndIndex()
{
int[] ret = new int[5];
ret[0] = indexI;
ret[1] = indexJ;
ret[2] = indexI*num_v_slices + indexJ;
ret[3] = chunkWidth;
ret[4] = chunkHeight;
indexJ += 1;
if (indexJ >= num_v_slices)
{
indexJ = 0;
indexI += 1;
}
return ret;
}
}
Thank you for your time!
You should override the getSplits() method in your FileInputFormat1 class (in the old mapred API its signature is public InputSplit[] getSplits(JobConf job, int numSplits); in the new mapreduce API that your code uses it is public List<InputSplit> getSplits(JobContext context)). Create your own class based on InputSplit that carries the rectangle coordinates, so the RecordReader can read them from its split and return the correct key/value pairs to each mapper.
The existing implementation of getSplits() in FileInputFormat could also help you; see here.
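Below is a rough sketch of what that could look like with the new (mapreduce) API that your classes already use. The class name, the configuration keys for the image size, and the way the dimensions reach getSplits() are assumptions for illustration, not code from your project:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
// One split = one rectangle = one mapper. The split must be Writable so the
// framework can serialize it and ship it to the task.
public class RectangleSplit extends InputSplit implements Writable {
    private int x1, y1, x2, y2; // top-left / bottom-right pixel coordinates
    public RectangleSplit() { } // no-arg constructor required for deserialization
    public RectangleSplit(int x1, int y1, int x2, int y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }
    @Override
    public long getLength() throws IOException, InterruptedException {
        return (long) (x2 - x1) * (y2 - y1); // "size" of the split, used by the scheduler
    }
    @Override
    public String[] getLocations() throws IOException, InterruptedException {
        return new String[0]; // no data-locality hints
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x1); out.writeInt(y1); out.writeInt(x2); out.writeInt(y2);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        x1 = in.readInt(); y1 = in.readInt(); x2 = in.readInt(); y2 = in.readInt();
    }
}
// Then, in FileInputFormat1, override getSplits() to emit one RectangleSplit per chunk.
// Here the image dimensions are read from the Configuration under hypothetical keys:
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    Configuration conf = context.getConfiguration();
    int vSlices = conf.getInt("v_slices", 1);
    int hSlices = conf.getInt("h_slices", 1);
    int width = conf.getInt("image_width", 0);
    int height = conf.getInt("image_height", 0);
    int chunkW = (int) Math.ceil((double) width / vSlices);
    int chunkH = (int) Math.ceil((double) height / hSlices);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < hSlices; i++) {
        for (int j = 0; j < vSlices; j++) {
            int x1 = j * chunkW, y1 = i * chunkH;
            splits.add(new RectangleSplit(x1, y1,
                    Math.min(x1 + chunkW, width), Math.min(y1 + chunkH, height)));
        }
    }
    return splits; // n splits => n map tasks
}
RecordReader1.initialize() can then cast its InputSplit to RectangleSplit and hand the coordinates to ImagePreprocessor, instead of each task recomputing its own chunks from the whole image.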
I am developing a Hadoop project. I want to find the customers for a certain day and then write those with the max consumption on that day. In my reducer class, for some reason, the global variable max doesn't change its value after the for loop.
EDIT I want to find the customers with the max consumption on a certain day. I have managed to find the customers for the date I want, but I am facing a problem in my Reducer class. Here is the code:
EDIT #2 I already know that the values (consumption) are natural numbers. So in my output file I want only the customers, for a certain day, with the max consumption.
EDIT #3 My input file consists of a lot of data. It has three columns: the customer's id, the timestamp (yyyy-MM-dd HH:mm:ss) and the consumption.
Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class alicanteDriver {
public static void main(String[] args) throws Exception {
long t_start = System.currentTimeMillis();
long t_end;
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Alicante");
job.setJarByClass(alicanteDriver.class);
job.setMapperClass(alicanteMapperC.class);
//job.setCombinerClass(alicanteCombiner.class);
job.setPartitionerClass(alicantePartitioner.class);
job.setNumReduceTasks(2);
job.setReducerClass(alicanteReducerC.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/alicante_1y.txt"));
FileOutputFormat.setOutputPath(job, new Path("/alicante_output"));
job.waitForCompletion(true);
t_end = System.currentTimeMillis();
System.out.println((t_end-t_start)/1000);
}
}
Mapper class
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class alicanteMapperC extends
Mapper<LongWritable, Text, Text, IntWritable> {
String Customer = new String();
SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date t = new Date();
IntWritable Consumption = new IntWritable();
int counter = 0;
// new vars
int max = 0;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Date d2 = null;
try {
d2 = ft.parse("2013-07-01 01:00:00");
} catch (ParseException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
if (counter > 0) {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line, ",");
while (itr.hasMoreTokens()) {
Customer = itr.nextToken();
try {
t = ft.parse(itr.nextToken());
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Consumption.set(Integer.parseInt(itr.nextToken()));
//sort out as many values as possible
if(Consumption.get() > max) {
max = Consumption.get();
}
//find customers in a certain date
if (t.compareTo(d2) == 0 && Consumption.get() == max) {
context.write(new Text(Customer), Consumption);
}
}
}
counter++;
}
}
Reducer class
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.google.common.collect.Iterables;
public class alicanteReducerC extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int max = 0; //this var
// declaration of Lists
List<Text> l1 = new ArrayList<Text>();
List<IntWritable> l2 = new ArrayList<IntWritable>();
for (IntWritable val : values) {
if (val.get() > max) {
max = val.get();
}
l1.add(key);
l2.add(val);
}
for (int i = 0; i < l1.size(); i++) {
if (l2.get(i).get() == max) {
context.write(key, new IntWritable(max));
}
}
}
}
Some values of the Input file
C11FA586148,2013-07-01 01:00:00,3
C11FA586152,2015-09-01 15:22:22,3
C11FA586168,2015-02-01 15:22:22,1
C11FA586258,2013-07-01 01:00:00,5
C11FA586413,2013-07-01 01:00:00,5
C11UA487446,2013-09-01 15:22:22,3
C11UA487446,2013-07-01 01:00:00,3
C11FA586148,2013-07-01 01:00:00,4
Output should be
C11FA586258 5
C11FA586413 5
I have searched the forums for a couple of hours, and still can't find the issue. Any ideas?
Here is the refactored code: you can pass/change a specific value for the consumption date. In this case you don't need the reducer from your original code; my first answer was to query the max consumption from the input, and this answer queries the input for a user-provided date. The setup() method gets the user-provided value for mapper.maxConsumption.date and passes it to the map() method. The cleanup() method in the reducer scans all max-consumption customers and writes the final max from the input (i.e., 5 in this case) - see the screenshot for the detailed execution log:
run as:
hadoop jar maxConsumption.jar -Dmapper.maxConsumption.date="2013-07-01 01:00:00" Data/input.txt output/maxConsupmtion5
#input:
C11FA586148,2013-07-01 01:00:00,3
C11FA586152,2015-09-01 15:22:22,3
C11FA586168,2015-02-01 15:22:22,1
C11FA586258,2013-07-01 01:00:00,5
C11FA586413,2013-07-01 01:00:00,5
C11UA487446,2013-09-01 15:22:22,3
C11UA487446,2013-07-01 01:00:00,3
C11FA586148,2013-07-01 01:00:00,4
#output:
C11FA586258 5
C11FA586413 5
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class maxConsumption extends Configured implements Tool{
public static class DataMapper extends Mapper<Object, Text, Text, IntWritable> {
SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date dateInFile, filterDate;
int lineno=0;
private final static Text customer = new Text();
private final static IntWritable consumption = new IntWritable();
private final static Text maxConsumptionDate = new Text();
public void setup(Context context) {
Configuration config = context.getConfiguration();
maxConsumptionDate.set(config.get("mapper.maxConsumption.date"));
}
public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
try{
lineno++;
filterDate = ft.parse(maxConsumptionDate.toString());
//map data from line/file
String[] fields = value.toString().split(",");
customer.set(fields[0].trim());
dateInFile = ft.parse(fields[1].trim());
consumption.set(Integer.parseInt(fields[2].trim()));
if(dateInFile.equals(filterDate)) //only send to reducer if date filter matches....
context.write(new Text(customer), consumption);
}catch(Exception e){
System.err.println("Invaid Data at line: " + lineno + " Error: " + e.getMessage());
}
}
}
public static class DataReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
LinkedHashMap<String, Integer> maxConsumption = new LinkedHashMap<String,Integer>();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int max=0;
System.out.print("reducer received: " + key + " [ ");
for(IntWritable value: values){
System.out.print( value.get() + " ");
if(value.get() > max)
max=value.get();
}
System.out.println( " ]");
System.out.println(key.toString() + " max is " + max);
maxConsumption.put(key.toString(), max);
}
@Override
protected void cleanup(Context context)
throws IOException, InterruptedException {
int max=0;
//first find the max from reducer
for (String key : maxConsumption.keySet()){
System.out.println("cleaup customer : " + key.toString() + " consumption : " + maxConsumption.get(key)
+ " max: " + max);
if(maxConsumption.get(key) > max)
max=maxConsumption.get(key);
}
System.out.println("final max is: " + max);
//write only the max value from map
for (String key : maxConsumption.keySet()){
if(maxConsumption.get(key) == max)
context.write(new Text(key), new IntWritable(maxConsumption.get(key)));
}
}
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new maxConsumption(), args);
System.exit(res);
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: -Dmapper.maxConsumption.date=\"2013-07-01 01:00:00\" <in> <out>");
System.exit(2);
}
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "get-max-consumption");
job.setJarByClass(maxConsumption.class);
job.setMapperClass(DataMapper.class);
job.setReducerClass(DataReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileSystem fs = null;
Path dstFilePath = new Path(args[1]);
try {
fs = dstFilePath.getFileSystem(conf);
if (fs.exists(dstFilePath))
fs.delete(dstFilePath, true);
} catch (IOException e1) {
e1.printStackTrace();
}
return job.waitForCompletion(true) ? 0 : 1;
}
}
Probably all the values that go to your reducer are under 0. Try the minimum integer value to check whether your variable changes:
max = Integer.MIN_VALUE;
Based on what you say, the output should be either all 0's (in that case the max value in the reducers is 0) or no output (all values less than 0). Also, look at this:
context.write(key, new IntWritable());
it should be
context.write(key, new IntWritable(max));
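Putting those two suggestions together, a minimal sketch of the reducer could look like this (illustrative only, using the same imports as your reducer class; it just emits the per-customer max, while picking the single global maximum across all customers is what the cleanup() approach in the other answer handles):
public static class alicanteReducerC extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE; // start from MIN_VALUE so non-positive values still update max
        for (IntWritable val : values) {
            if (val.get() > max) {
                max = val.get();
            }
        }
        context.write(key, new IntWritable(max)); // one line per customer, with its max
    }
}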
EDIT: I just saw your Mapper class; it has a lot of problems. The following code skips the first element in every mapper. Why?
if (counter > 0) {
I guess you are getting something like this, right? "customer, 2013-07-01 01:00:00, 2,..." If that is the case and you are already filtering values, you should declare your max variable as local, not in the mapper scope, because there it would affect multiple customers.
There are a lot of questions around this... you could explain your input for each mapper and what you want to do.
EDIT2: Based on your answer I would try this
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class AlicanteMapperC extends Mapper<LongWritable, Text, Text, IntWritable> {
private final int max = 5;
private SimpleDateFormat ft = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Date t = null;
Date filterDate = null;
String[] line = value.toString().split(",");
String customer = line[0];
try {
t = ft.parse(line[1]);
filterDate = ft.parse("2013-07-01 01:00:00");
} catch (ParseException e) {
throw new RuntimeException("something wrong with the date! " + line[1]);
}
Integer consumption = Integer.parseInt(line[2]);
//find customers in a certain date
if (t.compareTo(filterDate) == 0 && consumption == max) {
context.write(new Text(customer), new IntWritable(consumption));
}
}
}
and the reducer is pretty simple, emitting 1 record per customer:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.google.common.collect.Iterables;
public class AlicanteReducerC extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
//We already know that it is 5
context.write(key, new IntWritable(5));
//If you want something different, for example reporting customers with different values, you could iterate over the values like this:
//for (IntWritable val : values) {
//  context.write(key, val);
//}
}
}
Using Hadoop MapReduce, I am writing code to get substrings of different lengths. For example, given the string "ZYXCBA" and length 3 (using a text file I give the input as "3 ZYXCBA"), my code has to return all possible substrings of length 3 ("ZYX", "YXC", "XCB", "CBA"), length 4 ("ZYXC", "YXCB", "XCBA"), and finally length 5 ("ZYXCB", "YXCBA").
In map phase I did the following:
key = length of substrings I want
value = "ZYXCBA".
So mapper output is
3,"ZYXCBA"
4,"ZYXCBA"
5,"ZYXCBA"
In reduce I take the string ("ZYXCBA") and key 3 to get all substrings of length 3. The same occurs for 4 and 5. Results are concatenated into a string, so the output of reduce should be:
3 "ZYX YXC XCB CBA"
4 "ZYXC YXCB XCBA"
5 "ZYXCB YXCBA"
I am running my code using the following command:
hduser#Ganesh:~/Documents$ hadoop jar Saishingles.jar hadoopshingles.Saishingles Behara/Shingles/input Behara/Shingles/output
My code is as shown below:
package hadoopshingles;
import java.io.IOException;
//import java.util.ArrayList;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Saishingles{
public static class shinglesmapper extends Mapper<Object, Text, IntWritable, Text>{
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String str = new String(value.toString());
String[] list = str.split(" ");
int x = Integer.parseInt(list[0]);
String val = list[1];
int M = val.length();
int X = M-1;
for(int z = x; z <= X; z++)
{
context.write(new IntWritable(z), new Text(val));
}
}
}
public static class shinglesreducer extends Reducer<IntWritable,Text,IntWritable,Text> {
public void reduce(IntWritable key, Text value, Context context
) throws IOException, InterruptedException {
int z = key.get();
String str = new String(value.toString());
int M = str.length();
int Tz = M - z;
String newvalue = "";
for(int position = 0; position <= Tz; position++)
{
newvalue = newvalue + " " + str.substring(position,position + z);
}
context.write(new IntWritable(z),new Text(newvalue));
}
}
public static void main(String[] args) throws Exception {
GenericOptionsParser parser = new GenericOptionsParser(args);
Configuration conf = parser.getConfiguration();
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length != 2)
{
System.err.println("Usage: Saishingles <inputFile> <outputDir>");
System.exit(2);
}
Job job = Job.getInstance(conf, "Saishingles");
job.setJarByClass(hadoopshingles.Saishingles.class);
job.setMapperClass(shinglesmapper.class);
//job.setCombinerClass(shinglesreducer.class);
job.setReducerClass(shinglesreducer.class);
//job.setMapOutputKeyClass(IntWritable.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output of reduce instead of returning
3 "ZYX YXC XCB CBA"
4 "ZYXC YXCB XCBA"
5 "ZYXCB YXCBA"
it's returning
3 "ZYXCBA"
4 "ZYXCBA"
5 "ZYXCBA"
i.e., it's giving the same output as the mapper. I don't know why this is happening. Please help me resolve this; thanks in advance for the help ;) :) :)
You can achieve this without even running a reducer; your map/reduce logic is wrong... the transformation should be done in the Mapper.
Reduce - In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
In your reduce method, the signature public void reduce(IntWritable key, Text value, Context context)
should be public void reduce(IntWritable key, Iterable<Text> values, Context context).
Also, change the last line of the reduce method from context.write(new IntWritable(z),new Text(newvalue)); to context.write(key,new Text(newvalue)); - you already have the IntWritable key from the mapper, so I wouldn't create a new one.
with given input:
3 "ZYXCBA"
4 "ZYXCBA"
5 "ZYXCBA"
Mapper job will output:
3 "XCB YXC ZYX"
4 "XCBA YXCB ZYXC"
5 "YXCBA ZYXCB"
MapReduceJob:
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SubStrings{
public static class SubStringsMapper extends Mapper<Object, Text, IntWritable, Text> {
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String [] values = value.toString().split(" ");
int len = Integer.parseInt(values[0].trim());
String str = values[1].replaceAll("\"", "").trim();
int endindex=len;
for(int i = 0; i < len; i++)
{
endindex=i+len;
if(endindex <= str.length())
context.write(new IntWritable(len), new Text(str.substring(i, endindex)));
}
}
}
public static class SubStringsReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
public void reduce(IntWritable key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String str="\""; //adding starting quotes
for(Text value: values)
str += " " + value;
str=str.replace("\" ", "\"") + "\""; //adding ending quotes
context.write(key, new Text(str));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "get-possible-strings-by-length");
job.setJarByClass(SubStrings.class);
job.setMapperClass(SubStringsMapper.class);
job.setReducerClass(SubStringsReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileSystem fs = null;
Path dstFilePath = new Path(args[1]);
try {
fs = dstFilePath.getFileSystem(conf);
if (fs.exists(dstFilePath))
fs.delete(dstFilePath, true);
} catch (IOException e1) {
e1.printStackTrace();
}
job.waitForCompletion(true);
}
}
In this program I am reading a .xlsx file and adding the cell data to a vector. If the vector size is less than 12, there is no need to read the remaining data, and I need to go back to the main method.
How can I do that in my program?
This is my code:
package com.read;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Vector;
import org.apache.poi.openxml4j.opc.OPCPackage;
import java.io.InputStream;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class SendDataToDb {
public static void main(String[] args) {
SendDataToDb sd = new SendDataToDb();
try {
sd.processOneSheet("C:/Users/User/Desktop/New folder/Untitled 2.xlsx");
System.out.println("in Main method");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void processOneSheet(String filename) throws Exception {
System.out.println("executing Process Method");
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
System.out.println("count "+sst.getCount());
XMLReader parser = fetchSheetParser(sst);
// To look up the Sheet Name / Sheet Order / rID,
// you need to process the core Workbook stream.
// Normally it's of the form rId# or rSheet#
InputStream sheet2 = r.getSheet("rId2");
System.out.println("Sheet2");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
//System.out.println("EXECUTING fetchSheetParser METHOD");
XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
System.out.println("Method :fetchSheetParser");
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private int columnNum;
Vector values = new Vector(20);
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
//long l = Long.valueOf(attributes.getValue("r"));
if(name.equals("c")){
columnNum++;
}
if(name.equals("c")) {
// Print the cell reference
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
//System.out.println("Method :222222222");
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
values.add(lastContents);
}
if(name.equals("row")) {
System.out.println(values);
//values.setSize(50);
System.out.println(values.size()+" "+values.capacity());
//********************************************************
//I AM CHECKING CONDITION HERE, IF CONDITION IS TRUE I NEED STOP THE REMAINING PROCESS AND GO TO MAIN METHOD.
if(values.size() < 12)
values.removeAllElements();
//WHAT CODE I NEED TO WRITE HERE TO STOP THE EXECUTION OF REMAINING PROCESS AND GO TO MAIN
//***************************************************************
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
//System.out.println("method : 333333333333");
lastContents += new String(ch, start, length);
}
}
}
Check the code between the lines marked //****************************** and //******************************************.
You can throw a SAXException wherever you want the parsing to stop:
throw new SAXException("<Your message>")
and handle it in the main method.
After your check, you should throw the exception to get out of there and return to the main method.
throw new Exception("vector size has to be less than 12");
I am trying to write a Java program that receives the source code of a MapReduce job, compiles it dynamically, and runs the job on a Hadoop cluster. To achieve this, I have written 3 methods called compile(), makeJAR() and run_Hadoop_Job(). Everything works fine with the compilation and the creation of the JAR file. However, when the job is submitted to Hadoop, as soon as the job starts, it fails to find the required Mapper/Reducer classes and throws a ClassNotFoundException for both the Mapper_Class and the Reducer_Class (java.lang.ClassNotFoundException: reza.rCloud.Mapper_Reducer_Classes$Mapper_Class.class). I know there must be something wrong with how I have referenced the required Mapper/Reducer classes, but I was not able to figure it out after several attempts. Any help/suggestion on how to solve the issue is highly appreciated.
Regarding the details of the project: I have a file called "rCloud_test/src/reza/Mapper_Reducer_Classes.java" that contains the source code for Mapper_Class and Reducer_Class. This file is ultimately received at runtime, but for now I copied the Hadoop WordCount example into it and stored it locally in the same folder as my main class file: rCloud_test/src/reza/Platform2.java.
Below you can see the main() method of Platform2.java, which is the main class for this project:
public static void main(String[] args){
System.out.println("Code Execution Started");
String className = "Mapper_Reducer_Classes";
Platform2 myPlatform = new Platform2();
//step 1: compile the received class file dynamically:
boolean compResult = myPlatform.compile(className);
System.out.println(className + ".java compilation result: "+compResult);
//step 2: make a JAR file out of the compiled file:
if (compResult) {
compResult = myPlatform.makeJAR("jar_file", myPlatform.compilation_Output_Folder);
System.out.println("JAR creation result: "+compResult);
}
//step 3: Now let's run the Hadoop job:
if (compResult) {
compResult = myPlatform.run_Hadoop_Job(className);
System.out.println("Running on Hadoop result: "+compResult);
}
}
The method that is causing all the problems is run_Hadoop_Job(), which is shown below:
private boolean run_Hadoop_Job(String className){
try{
System.out.println("*Starting to run the code on Hadoop...");
String[] argsTemp = { "project_test/input", "project_test/output" };
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:54310");
conf.set("mapred.job.tracker", "localhost:54311");
conf.set("mapred.jar", jar_Output_Folder + "/jar_file"+".jar");
conf.set("libjars", required_Execution_Classes);
//THIS IS WHERE IT CAN'T FIND THE MENTIONED CLASSES, ALTHOUGH THEY EXIST BOTH ON DISK
// AND IN THE CREATED JAR FILE:??????
System.out.println("Getting Mapper/Reducer package name: " +
Mapper_Reducer_Classes.class.getName());
conf.set("mapreduce.map.class", "reza.rCloud.Mapper_Reducer_Classes$Mapper_Class");
conf.set("mapreduce.reduce.class", "reza.rCloud.Mapper_Reducer_Classes$Reducer_Class");
Job job = new Job(conf, "Hadoop Example for dynamically and programmatically compiling-running a job");
job.setJarByClass(Platform2.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(argsTemp[0]));
FileSystem fs = FileSystem.get(conf);
Path out = new Path(argsTemp[1]);
fs.delete(out, true);
FileOutputFormat.setOutputPath(job, new Path(argsTemp[1]));
//job.submit();
System.out.println("*and now submitting the job to Hadoop...");
System.exit(job.waitForCompletion(true) ? 0 : 1);
System.out.println("Job Finished!");
} catch (Exception e) {
System.out.println("****************Exception!" );
e.printStackTrace();
return false;
}
return true;
}
if needed, here's the source code for the compile() method:
private boolean compile(String className) {
String fileToCompile = JOB_FOLDER + "/" +className+".java";
JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
FileOutputStream errorStream = null;
try{
errorStream = new FileOutputStream(JOB_FOLDER + "/logs/Errors.txt");
} catch(FileNotFoundException e){
//if problem creating the file, default wil be console
}
int compilationResult =
compiler.run( null, null, errorStream,
"-classpath", required_Compilation_Classes,
"-d", compilation_Output_Folder,
fileToCompile);
if (compilationResult == 0) {
//Compilation is successful:
return true;
} else {
//Compilation Failed:
return false;
}
}
and the source code for makeJAR() method:
private boolean makeJAR(String outputFileName, String inputDirectory) {
Manifest manifest = new Manifest();
manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION,
"1.0");
JarOutputStream target = null;
try {
target = new JarOutputStream(new FileOutputStream(
jar_Output_Folder+ "/"
+ outputFileName+".jar" ), manifest);
add(new File(inputDirectory), target);
} catch (Exception e) { return false; }
finally {
if (target != null)
try{
target.close();
} catch (Exception e) { return false; }
}
return true;
}
private void add(File source, JarOutputStream target) throws IOException
{
BufferedInputStream in = null;
try
{
if (source.isDirectory())
{
String name = source.getPath().replace("\\", "/");
if (!name.isEmpty())
{
if (!name.endsWith("/"))
name += "/";
JarEntry entry = new JarEntry(name);
entry.setTime(source.lastModified());
target.putNextEntry(entry);
target.closeEntry();
}
for (File nestedFile: source.listFiles())
add(nestedFile, target);
return;
}
JarEntry entry = new JarEntry(source.getPath().replace("\\", "/"));
entry.setTime(source.lastModified());
target.putNextEntry(entry);
in = new BufferedInputStream(new FileInputStream(source));
byte[] buffer = new byte[1024];
while (true)
{
int count = in.read(buffer);
if (count == -1)
break;
target.write(buffer, 0, count);
}
target.closeEntry();
}
finally
{
if (in != null)
in.close();
}
}
and finally the fixed parameters used for accessing the files:
private String JOB_FOLDER = "/Users/reza/My_Software/rCloud_test/src/reza/rCloud";
private String HADOOP_SOURCE_FOLDER = "/Users/reza/My_Software/hadoop-0.20.2";
private String required_Compilation_Classes = HADOOP_SOURCE_FOLDER + "/hadoop-0.20.2-core.jar";
private String required_Execution_Classes = required_Compilation_Classes + "," +
"/Users/reza/My_Software/ActorFoundry_dist_ver/lib/commons-cli-1.1.jar," +
"/Users/reza/My_Software/ActorFoundry_dist_ver/lib/commons-logging-1.1.1.jar";
public String compilation_Output_Folder = "/Users/reza/My_Software/rCloud_test/dyn_classes";
private String jar_Output_Folder = "/Users/reza/My_Software/rCloud_test/dyn_jar";
As a result of running Platform2, the structure of the project on disk looks like this:
rCloud_test/classes/reza/rCloud/Platform2.class: contains the Platform2 class
rCloud_test/dyn_classes/reza/rCloud/ contains the classes for Mapper_Reducer_Classes.class, Mapper_Reducer_Classes$Mapper_Class.class, and Mapper_Reducer_Classes$Reducer_Class.class
rCloud_test/dyn_jar/jar_file.jar contains the created jar file
REVISED: here is the source code for rCloud_test/src/reza/rCloud/Mapper_Reducer_Classes.java:
package reza.rCloud;
import java.io.IOException;
import java.lang.InterruptedException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Mapper_Reducer_Classes {
/**
* The map class of WordCount.
*/
public static class Mapper_Class
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
/**
* The reducer class of WordCount
*/
public static class Reducer_Class
extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
}
Try setting them by using the setClass() method:
conf.setClass("mapreduce.map.class",
Class.forName("reza.rCloud.Mapper_Reducer_Classes$Mapper_Class"),
Mapper.class);
conf.setClass("mapreduce.reduce.class",
Class.forName("reza.rCloud.Mapper_Reducer_Classes$Reducer_Class"),
Reducer.class);
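A note on why setClass() helps here (my observation, not part of the original answer): it checks that the class you pass is actually assignable to Mapper.class / Reducer.class and stores its fully qualified name in the configuration, so a wrong class name fails immediately in the driver (Class.forName() throws) instead of only surfacing at task launch. The classes still have to be shipped to the cluster in the job jar (the mapred.jar you already set) so the tasks can load them.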