Performing multiple computations with Hadoop MapReduce - Java

I have a MapReduce program for finding the min/max of 2 separate properties for each year. It works, for the most part, on a single-node Hadoop cluster. Here is my current setup:
public class MaxTemperatureReducer extends
Reducer<Text, Stats, Text, Stats> {
private Stats result = new Stats();
@Override
public void reduce(Text key, Iterable<Stats> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
int minValue = Integer.MAX_VALUE;
int sum = 0;
for (Stats value : values) {
result.setMaxTemp(Math.max(maxValue, value.getMaxTemp()));
result.setMinTemp(Math.min(minValue, value.getMinTemp()));
result.setMaxWind(Math.max(maxValue, value.getMaxWind()));
result.setMinWind(Math.min(minValue, value.getMinWind()));
sum += value.getCount();
}
result.setCount(sum);
context.write(key, result);
}
}
public class MaxTemperatureMapper extends
Mapper<Object, Text, Text, Stats> {
private static final int MISSING = 9999;
private Stats outStat = new Stats();
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] split = value.toString().split("\\s+");
String year = split[2].substring(0, 4);
int airTemperature;
airTemperature = (int) Float.parseFloat(split[3]);
outStat.setMinTemp((float)airTemperature);
outStat.setMaxTemp((float)airTemperature);
outStat.setMinWind(Float.parseFloat(split[12]));
outStat.setMaxWind(Float.parseFloat(split[14]));
outStat.setCount(1);
context.write(new Text(year), outStat);
}
}
public class MaxTemperatureDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err
.println("Usage: MaxTemperatureDriver <input path> <outputpath>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperatureDriver.class);
job.setJobName("Max Temperature");
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Stats.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
MaxTemperatureDriver driver = new MaxTemperatureDriver();
int exitCode = ToolRunner.run(driver, args);
System.exit(exitCode);
}
}
Currently it only prints the min/max for the temperature and wind speed for each year. I am sure it is a simple change, but I cannot find an answer anywhere. I want to find the top 5 min/max values for each year. Any suggestions?

Let me assume the following signature for your Stats class.
/* the Stats class needs to be a Writable; the below is just a demo */
public class Stats {
public float getTemp() {
return temp;
}
public void setTemp(float temp) {
this.temp = temp;
}
public float getWind() {
return wind;
}
public void setWind(float wind) {
this.wind = wind;
}
private float temp;
private float wind;
}
With this, let us change the reducer as below.
SortedSet<Float> tempSetMax = new TreeSet<Float>();
SortedSet<Float> tempSetMin = new TreeSet<Float>();
SortedSet<Float> windSetMin = new TreeSet<Float>();
SortedSet<Float> windSetMax = new TreeSet<Float>();
// values is the Iterable<Stats> passed into reduce(); no extra list is needed
for (Stats value : values) {
float temp = value.getTemp();
float wind = value.getWind();
if (tempSetMax.size() < 5) {
tempSetMax.add(temp);
} else {
float currentMinValue = tempSetMax.first();
if (temp > currentMinValue) {
tempSetMax.remove(currentMinValue);
tempSetMax.add(temp);
}
}
if (tempSetMin.size() < 5) {
tempSetMin.add(temp);
} else {
float currentMaxValue = tempSetMin.last();
if (temp < currentMaxValue) {
tempSetMin.remove(currentMaxValue);
tempSetMin.add(temp);
}
}
if (windSetMin.size() < 5) {
windSetMin.add(wind);
} else {
float currentMaxValue = windSetMin.last();
if (wind < currentMaxValue) {
windSetMin.remove(currentMaxValue);
windSetMin.add(wind);
}
}
if (windSetMax.size() < 5) {
windSetMax.add(wind);
} else {
float currentMinValue = windSetMax.first();
if (wind > currentMinValue) {
windSetMax.remove(currentMinValue);
windSetMax.add(wind);
}
}
}
Now you can write the toString() of each set to the context, or you can create a custom Writable. In my code, please adapt Stats to your requirements; it needs to be a Writable. The above is just to demonstrate the flow.
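For example, a minimal sketch of emitting the four sets as plain text at the end of reduce() (note that the job would then need setOutputValueClass(Text.class) instead of Stats.class, and that a TreeSet silently drops duplicate values, so equal readings collapse into one entry):
// Sketch: emit the collected top-5 sets as one Text value per year.
// Assumes the four TreeSets built in the loop above.
StringBuilder out = new StringBuilder();
out.append("maxTemps=").append(tempSetMax)
   .append(" minTemps=").append(tempSetMin)
   .append(" maxWinds=").append(windSetMax)
   .append(" minWinds=").append(windSetMin);
context.write(key, new Text(out.toString()));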

Here is the code from the MapReduce Design Patterns book to get the top 10. There is also code for the other MR design patterns in the same GitHub location.

Related

Why is the output empty in Hadoop MapReduce? How do I debug it?

I want to aggregate the total_amount (float) for each date ("yyyy-MM-dd HH:mm:ss.S").
The data is like here.
The output is empty like here.
The basic idea of the mapper is to skip the first line (the header), then get the date and total_amount of each line and change the date format to "yyyy-MM-dd". I guess the errors happen in the try {} catch () {} block.
By the way, I am wondering whether there are some bad lines in the dataset, for instance where total_amount is not a float but contains '$', like "$ 9.5", or where the date is not in the "yyyy-MM-dd HH:mm:ss.S" format. How can I detect those lines and discard them? Do you have any ideas? (A sketch of one way to do this follows the code below.)
Here is the code:
public class RevenueSum extends Configured implements Tool {
static int printUsage() {
System.out.println("revenuesum [-m <maps>] [-r <reduces>] <input> <output>");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
public static class RevenueSumMapper
extends Mapper<LongWritable, Text, Text, FloatWritable> {
@Override
public void map(LongWritable key, Text value, Context context
) throws IOException, InterruptedException {
try {
/*Some condition satisfying it is header*/
if (key.get() == 0 && value.toString().contains("header") )
return;
else {
// For rest of data it goes here
String line = value.toString();
String[] items = line.split(",");
String old_date = items[3];
LocalDateTime datetime = LocalDateTime.parse(old_date, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S"));
String date = datetime.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
Float revenue = Float.parseFloat(items[10]);
context.write(new Text(date), new FloatWritable(revenue));
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
public static class RevenueSumReducer
extends Reducer<Text,FloatWritable,Text,FloatWritable> {
private FloatWritable result = new FloatWritable();
public void reduce(Text key, Iterable<FloatWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (FloatWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "revenue sum");
job.setJarByClass(RevenueSum.class);
job.setMapperClass(RevenueSumMapper.class);
job.setCombinerClass(RevenueSumReducer.class);
job.setReducerClass(RevenueSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
List<String> other_args = new ArrayList<String>();
for(int i=0; i < args.length; ++i) {
try {
if ("-r".equals(args[i])) {
job.setNumReduceTasks(Integer.parseInt(args[++i]));
} else {
other_args.add(args[i]);
}
} catch (NumberFormatException except) {
System.out.println("ERROR: Integer expected instead of " + args[i]);
return printUsage();
} catch (ArrayIndexOutOfBoundsException except) {
System.out.println("ERROR: Required parameter missing from " +
args[i-1]);
return printUsage();
}
}
// Make sure there are exactly 2 parameters left.
if (other_args.size() != 2) {
System.out.println("ERROR: Wrong number of parameters: " +
other_args.size() + " instead of 2.");
return printUsage();
}
FileInputFormat.setInputPaths(job, other_args.get(0));
FileOutputFormat.setOutputPath(job, new Path(other_args.get(1)));
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new RevenueSum(), args);
System.exit(res);
}
}
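On the bad-line question: one common pattern is to validate each row, catch the specific parse exceptions, and bump a Hadoop counter so you can see how many rows were discarded and why. Below is a minimal sketch of such a mapper; the class name, counter names, header check, and column indices are my own assumptions based on the code above.
import java.io.IOException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Sketch of a mapper that discards malformed rows and counts them.
public class RevenueSumSafeMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
    private static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S");
    private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (key.get() == 0 && line.toLowerCase().contains("total_amount")) {
            return; // assumed header detection; adapt to the real header text
        }
        String[] items = line.split(",");
        if (items.length < 11) {
            context.getCounter("RevenueSum", "ShortLine").increment(1);
            return;
        }
        try {
            LocalDateTime datetime = LocalDateTime.parse(items[3], IN);
            float revenue = Float.parseFloat(items[10].trim());
            context.write(new Text(datetime.format(OUT)), new FloatWritable(revenue));
        } catch (DateTimeParseException e) {
            context.getCounter("RevenueSum", "BadDate").increment(1);
        } catch (NumberFormatException e) {
            context.getCounter("RevenueSum", "BadAmount").increment(1); // e.g. "$ 9.5"
        }
    }
}
The counter totals show up in the job's counter summary when it finishes, so you can quickly see how many rows were rejected for a bad date versus a bad amount.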

Apache HBase MapReduce job takes too much time while reading the datastore

I have set up an Apache HBase, Nutch and Hadoop cluster. I have crawled a few documents, about 30 million. There are 3 workers in the cluster and 1 master. I have written my own HBase MapReduce job to read the crawled data and adjust some scores a little based on some logic.
For this purpose, I combine the documents of the same domain, compute their effective bytes, and derive a score. Later, in the reducer, I assign that score to each URL of that domain (via a cache). This portion of the job takes too much time, i.e., 16 hours. Here is the code snippet:
for ( int index = 0; index < Cache.size(); index++) {
String Orig_key = Cache.get(index);
float doc_score = log10;
WebPage page = datastore.get(Orig_key);
if ( page == null ) {
continue;
}
page.setScore(doc_score);
if (mark) {
page.getMarkers().put( Queue, Q1);
}
context.write(Orig_key, page);
}
If I remove that document-read statement from the datastore, the job finishes in only 2 to 3 hours. That's why I think the statement WebPage page = datastore.get(Orig_key); is causing this problem. Isn't it?
If that is the case, then what is the best approach? The Cache object is simply a list that contains the URLs of the same domain.
DomainAnalysisJob.java
...
...
public class DomainAnalysisJob implements Tool {
public static final Logger LOG = LoggerFactory
.getLogger(DomainAnalysisJob.class);
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
private Configuration conf;
protected static final Utf8 URL_ORIG_KEY = new Utf8("doc_orig_id");
protected static final Utf8 DOC_DUMMY_MARKER = new Utf8("doc_marker");
protected static final Utf8 DUMMY_KEY = new Utf8("doc_id");
protected static final Utf8 DOMAIN_DUMMY_MARKER = new Utf8("domain_marker");
protected static final Utf8 LINK_MARKER = new Utf8("link");
protected static final Utf8 Queue = new Utf8("q");
private static URLNormalizers urlNormalizers;
private static URLFilters filters;
private static int maxURL_Length;
static {
FIELDS.add(WebPage.Field.STATUS);
FIELDS.add(WebPage.Field.LANG_INFO);
FIELDS.add(WebPage.Field.URDU_SCORE);
FIELDS.add(WebPage.Field.MARKERS);
FIELDS.add(WebPage.Field.INLINKS);
}
/**
* Maps each WebPage to a host key.
*/
public static class Mapper extends GoraMapper<String, WebPage, Text, WebPage> {
@Override
protected void setup(Context context) throws IOException ,InterruptedException {
Configuration conf = context.getConfiguration();
urlNormalizers = new URLNormalizers(context.getConfiguration(), URLNormalizers.SCOPE_DEFAULT);
filters = new URLFilters(context.getConfiguration());
maxURL_Length = conf.getInt("url.characters.max.length", 2000);
}
@Override
protected void map(String key, WebPage page, Context context)
throws IOException, InterruptedException {
String reversedHost = null;
if (page == null) {
return;
}
if ( key.length() > maxURL_Length ) {
return;
}
String url = null;
try {
url = TableUtil.unreverseUrl(key);
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
url = filters.filter(url); // filter the url
} catch (Exception e) {
LOG.warn("Skipping " + key + ":" + e);
return;
}
if ( url == null) {
context.getCounter("DomainAnalysis", "FilteredURL").increment(1);
return;
}
try {
reversedHost = TableUtil.getReversedHost(key.toString());
}
catch (Exception e) {
return;
}
page.getMarkers().put( URL_ORIG_KEY, new Utf8(key) );
context.write( new Text(reversedHost), page );
}
}
public DomainAnalysisJob() {
}
public DomainAnalysisJob(Configuration conf) {
setConf(conf);
}
@Override
public Configuration getConf() {
return conf;
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
}
public void updateDomains(boolean buildLinkDb, int numTasks) throws Exception {
NutchJob job = NutchJob.getInstance(getConf(), "rankDomain-update");
job.getConfiguration().setInt("mapreduce.task.timeout", 1800000);
if ( numTasks < 1) {
job.setNumReduceTasks(job.getConfiguration().getInt(
"mapred.map.tasks", job.getNumReduceTasks()));
} else {
job.setNumReduceTasks(numTasks);
}
ScoringFilters scoringFilters = new ScoringFilters(getConf());
HashSet<WebPage.Field> fields = new HashSet<WebPage.Field>(FIELDS);
fields.addAll(scoringFilters.getFields());
StorageUtils.initMapperJob(job, fields, Text.class, WebPage.class,
Mapper.class);
StorageUtils.initReducerJob(job, DomainAnalysisReducer.class);
job.waitForCompletion(true);
}
@Override
public int run(String[] args) throws Exception {
boolean linkDb = false;
int numTasks = -1;
for (int i = 0; i < args.length; i++) {
if ("-rankDomain".equals(args[i])) {
linkDb = true;
} else if ("-crawlId".equals(args[i])) {
getConf().set(Nutch.CRAWL_ID_KEY, args[++i]);
} else if ("-numTasks".equals(args[i]) ) {
numTasks = Integer.parseInt(args[++i]);
}
else {
throw new IllegalArgumentException("unrecognized arg " + args[i]
+ " usage: updatedomain -crawlId <crawlId> [-numTasks N]" );
}
}
LOG.info("Updating DomainRank:");
updateDomains(linkDb, numTasks);
return 0;
}
public static void main(String[] args) throws Exception {
final int res = ToolRunner.run(NutchConfiguration.create(),
new DomainAnalysisJob(), args);
System.exit(res);
}
}
DomainAnalysisReducer.java
...
...
public class DomainAnalysisReducer extends
GoraReducer<Text, WebPage, String, WebPage> {
public static final Logger LOG = DomainAnalysisJob.LOG;
public DataStore<String, WebPage> datastore;
protected static float q1_ur_threshold = 500.0f;
protected static float q1_ur_docCount = 50;
public static final Utf8 Queue = new Utf8("q"); // Markers for Q1 and Q2
public static final Utf8 Q1 = new Utf8("q1");
public static final Utf8 Q2 = new Utf8("q2");
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration conf = context.getConfiguration();
try {
datastore = StorageUtils.createWebStore(conf, String.class, WebPage.class);
}
catch (ClassNotFoundException e) {
throw new IOException(e);
}
q1_ur_threshold = conf.getFloat("domain.queue.threshold.bytes", 500.0f);
q1_ur_docCount = conf.getInt("domain.queue.doc.count", 50);
LOG.info("Conf updated: Queue-bytes-threshold = " + q1_ur_threshold + " Queue-doc-threshold: " + q1_ur_docCount);
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
datastore.close();
}
@Override
protected void reduce(Text key, Iterable<WebPage> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> Cache = new ArrayList<String>();
int doc_counter = 0;
int total_ur_bytes = 0;
for ( WebPage page : values ) {
// cache
String orig_key = page.getMarkers().get( DomainAnalysisJob.URL_ORIG_KEY ).toString();
Cache.add(orig_key);
// do not consider those doc's that are not fetched or link URLs
if ( page.getStatus() == CrawlStatus.STATUS_UNFETCHED ) {
continue;
}
doc_counter++;
int ur_score_int = 0;
int doc_ur_bytes = 0;
int doc_total_bytes = 0;
String ur_score_str = "0";
String langInfo_str = null;
// read page and find its Urdu score
langInfo_str = TableUtil.toString(page.getLangInfo());
if (langInfo_str == null) {
continue;
}
ur_score_str = TableUtil.toString(page.getUrduScore());
ur_score_int = Integer.parseInt(ur_score_str);
doc_total_bytes = Integer.parseInt( langInfo_str.split("&")[0] );
doc_ur_bytes = ( doc_total_bytes * ur_score_int) / 100; //Formula to find ur percentage
total_ur_bytes += doc_ur_bytes;
}
float avg_bytes = 0;
float log10 = 0;
if ( doc_counter > 0 && total_ur_bytes > 0) {
avg_bytes = (float) total_ur_bytes/doc_counter;
log10 = (float) Math.log10(avg_bytes);
log10 = (Math.round(log10 * 100000f)/100000f);
}
context.getCounter("DomainAnalysis", "DomainCount").increment(1);
// if average bytes and doc count, are more than threshold then mark as q1
boolean mark = false;
if ( avg_bytes >= q1_ur_threshold && doc_counter >= q1_ur_docCount ) {
mark = true;
for ( int index = 0; index < Cache.size(); index++) {
String Orig_key = Cache.get(index);
float doc_score = log10;
WebPage page = datastore.get(Orig_key);
if ( page == null ) {
continue;
}
page.setScore(doc_score);
if (mark) {
page.getMarkers().put( Queue, Q1);
}
context.write(Orig_key, page);
}
}
}
}
In my testing and debugging, I have found that the statement WebPage page = datastore.get(Orig_key); is the major cause of the long runtime. It took about 16 hours to complete the job, but when I replaced this statement with WebPage page = WebPage.newBuilder().build(); the time was reduced to 6 hours. Is this due to I/O?

Java, Refactoring case

I was given an exercise in which I need to refactor several Java projects.
Only these two are left, and I truly have no idea how to refactor them.
csv.writer
public class CsvWriter {
public CsvWriter() {
}
public void write(String[][] lines) {
for (int i = 0; i < lines.length; i++)
writeLine(lines[i]);
}
private void writeLine(String[] fields) {
if (fields.length == 0)
System.out.println();
else {
writeField(fields[0]);
for (int i = 1; i < fields.length; i++) {
System.out.print(",");
writeField(fields[i]);
}
System.out.println();
}
}
private void writeField(String field) {
if (field.indexOf(',') != -1 || field.indexOf('\"') != -1)
writeQuoted(field);
else
System.out.print(field);
}
private void writeQuoted(String field) {
System.out.print('\"');
for (int i = 0; i < field.length(); i++) {
char c = field.charAt(i);
if (c == '\"')
System.out.print("\"\"");
else
System.out.print(c);
}
System.out.print('\"');
}
}
csv.writertest
public class CsvWriterTest {
@Test
public void testWriter() {
CsvWriter writer = new CsvWriter();
String[][] lines = new String[][] {
new String[] {},
new String[] { "only one field" },
new String[] { "two", "fields" },
new String[] { "", "contents", "several words included" },
new String[] { ",", "embedded , commas, included",
"trailing comma," },
new String[] { "\"", "embedded \" quotes",
"multiple \"\"\" quotes\"\"" },
new String[] { "mixed commas, and \"quotes\"", "simple field" } };
// Expected:
// -- (empty line)
// only one field
// two,fields
// ,contents,several words included
// ",","embedded , commas, included","trailing comma,"
// """","embedded "" quotes","multiple """""" quotes"""""
// "mixed commas, and ""quotes""",simple field
writer.write(lines);
}
}
test
public class Configuration {
public int interval;
public int duration;
public int departure;
public void load(Properties props) throws ConfigurationException {
String valueString;
int value;
valueString = props.getProperty("interval");
if (valueString == null) {
throw new ConfigurationException("monitor interval");
}
value = Integer.parseInt(valueString);
if (value <= 0) {
throw new ConfigurationException("monitor interval > 0");
}
interval = value;
valueString = props.getProperty("duration");
if (valueString == null) {
throw new ConfigurationException("duration");
}
value = Integer.parseInt(valueString);
if (value <= 0) {
throw new ConfigurationException("duration > 0");
}
if ((value % interval) != 0) {
throw new ConfigurationException("duration % interval");
}
duration = value;
valueString = props.getProperty("departure");
if (valueString == null) {
throw new ConfigurationException("departure offset");
}
value = Integer.parseInt(valueString);
if (value <= 0) {
throw new ConfigurationException("departure > 0");
}
if ((value % interval) != 0) {
throw new ConfigurationException("departure % interval");
}
departure = value;
}
}
public class ConfigurationException extends Exception {
private static final long serialVersionUID = 1L;
public ConfigurationException() {
super();
}
public ConfigurationException(String arg0) {
super(arg0);
}
public ConfigurationException(String arg0, Throwable arg1) {
super(arg0, arg1);
}
public ConfigurationException(Throwable arg0) {
super(arg0);
}
}
configuration.test
public class ConfigurationTest{
@Test
public void testGoodInput() throws IOException {
String data = "interval = 10\nduration = 100\ndeparture = 200\n";
Properties input = loadInput(data);
Configuration props = new Configuration();
try {
props.load(input);
} catch (ConfigurationException e) {
assertTrue(false);
return;
}
assertEquals(props.interval, 10);
assertEquals(props.duration, 100);
assertEquals(props.departure, 200);
}
@Test
public void testNegativeValues() throws IOException {
processBadInput("interval = -10\nduration = 100\ndeparture = 200\n");
processBadInput("interval = 10\nduration = -100\ndeparture = 200\n");
processBadInput("interval = 10\nduration = 100\ndeparture = -200\n");
}
@Test
public void testInvalidDuration() throws IOException {
processBadInput("interval = 10\nduration = 99\ndeparture = 200\n");
}
@Test
public void testInvalidDeparture() throws IOException {
processBadInput("interval = 10\nduration = 100\ndeparture = 199\n");
}
private void processBadInput(String data) throws IOException {
Properties input = loadInput(data);
boolean failed = false;
Configuration props = new Configuration();
try {
props.load(input);
} catch (ConfigurationException e) {
failed = true;
}
assertTrue(failed);
}
private Properties loadInput(String data) throws IOException {
InputStream is = new StringBufferInputStream(data);
Properties input = new Properties();
input.load(is);
is.close();
return input;
}
}
OK, here is some advice regarding the code.
CsvWriter
The bad thing is that you print everything to System.out, which makes the class hard to test without mocks. Instead, I suggest adding a PrintStream field that defines where all the data goes.
import java.io.PrintStream;
public class CsvWriter {
private final PrintStream printStream;
public CsvWriter() {
this.printStream = System.out;
}
public CsvWriter(PrintStream printStream) {
this.printStream = printStream;
}
...
You then write everything to this stream. This refactoring is easy if you use the replace function (Ctrl+R in IntelliJ IDEA). Here is an example of how to do it.
private void writeField(String field) {
if (field.indexOf(',') != -1 || field.indexOf('\"') != -1)
writeQuoted(field);
else
printStream.print(field);
}
The other stuff in this class seems OK.
CsvWriterTest
First things first: don't check all the logic in a single test method. Write small methods, each covering a different kind of test. It's OK to keep your current test, though; sometimes it's useful to check most of the logic in one complex scenario.
Also pay attention to the names of the methods. Check this.
Obviously, your test doesn't check the results. That's why we need the PrintStream functionality: we can build a PrintStream on top of a ByteArrayOutputStream, then construct a string and check whether the content is valid. Here is how you can easily check what was written:
public class CsvWriterTest {
private ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
private PrintStream printStream = new PrintStream(byteArrayOutputStream);
@Test
public void testWriter() {
CsvWriter writer = new CsvWriter(printStream);
... old logic here ...
writer.write(lines);
String result = new String(byteArrayOutputStream.toByteArray());
Assert.assertTrue(result.contains("two,fields"));
}
}
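Following the "small methods" advice, you could then split the big scenario into focused tests, each asserting one behaviour. A sketch, assuming the byteArrayOutputStream/printStream fields above and the new CsvWriter(PrintStream) constructor:
// Focused tests to add inside CsvWriterTest.
@Test
public void writesEmptyLineForEmptyRecord() {
    new CsvWriter(printStream).write(new String[][] { new String[] {} });
    Assert.assertEquals(System.lineSeparator(), written());
}
@Test
public void quotesFieldsContainingCommas() {
    new CsvWriter(printStream).write(new String[][] { new String[] { "a,b", "c" } });
    Assert.assertEquals("\"a,b\",c" + System.lineSeparator(), written());
}
private String written() {
    printStream.flush(); // PrintStream buffers, so flush before reading the bytes
    return new String(byteArrayOutputStream.toByteArray());
}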
Configuration
Make fields private
Make messages more concise
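As a sketch of both points, the fields become private with getters, and the repeated validation moves into small helpers with clearer messages (the helper names are my own). The existing tests would then read the values through the getters.
import java.util.Properties;
public class Configuration {
    private int interval;
    private int duration;
    private int departure;
    public int getInterval() { return interval; }
    public int getDuration() { return duration; }
    public int getDeparture() { return departure; }
    public void load(Properties props) throws ConfigurationException {
        interval = requirePositive(props, "interval");
        duration = requireMultipleOfInterval(props, "duration");
        departure = requireMultipleOfInterval(props, "departure");
    }
    // Reads a property and checks that it is present and strictly positive.
    private int requirePositive(Properties props, String name) throws ConfigurationException {
        String raw = props.getProperty(name);
        if (raw == null) {
            throw new ConfigurationException(name + " is missing");
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new ConfigurationException(name + " must be > 0");
        }
        return value;
    }
    // Additionally checks that the value is a multiple of the already loaded interval.
    private int requireMultipleOfInterval(Properties props, String name) throws ConfigurationException {
        int value = requirePositive(props, name);
        if (value % interval != 0) {
            throw new ConfigurationException(name + " must be a multiple of interval");
        }
        return value;
    }
}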
ConfigurationException
The serialVersionUID looks fine; it is needed for serialization/deserialization.
ConfigurationTest
Do not use assertTrue(false) or assertTrue(failed); use Assert.fail(String) with a message that is understandable.
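For instance, a small sketch using JUnit 4's Assert.fail and the props/input/data variables from the tests above:
// In testGoodInput: fail with a message instead of assertTrue(false).
try {
    props.load(input);
} catch (ConfigurationException e) {
    Assert.fail("Valid input should not throw: " + e.getMessage());
}
// In processBadInput: fail if no exception was thrown, instead of tracking a boolean flag.
try {
    props.load(input);
    Assert.fail("Expected ConfigurationException for: " + data);
} catch (ConfigurationException expected) {
    // expected - nothing to assert
}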
Tip: if you don't have much experience and need to refactor code like this, you may want to read some chapters of Effective Java, 2nd edition, by Joshua Bloch. The book is not long, so you can read it in a week, and it lays out rules for writing clean, understandable code.

NullPointerException in toString() method of CustomArrayWritable class, MapReduce

I'm trying to use Time_Ant10s (a custom ArrayWritable class) as the Reducer's output.
I referred to this nice question: MapReduce Output ArrayWritable, but I get a NullPointerException from context.write() in the last line of the Reducer.
I suppose get() in Time_Ant10s.toString() may return null, but I have no idea why this happens. Could you help me?
Main Method
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "something");
// general
job.setJarByClass(CommutingTime1.class);
job.setMapperClass(Mapper1.class);
job.setReducerClass(Reducer1.class);
job.setNumReduceTasks(1);
job.setInputFormatClass (TextInputFormat.class);
// mapper output
job.setMapOutputKeyClass(Date_Uid.class);
job.setMapOutputValueClass(Time_Ant10.class);
// reducer output
job.setOutputFormatClass(CommaTextOutputFormat.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Time_Ant10s.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Mapper
public static class Mapper1 extends Mapper<LongWritable, Text, Date_Uid, Time_Ant10> {
/* map as <date_uid, time_ant10> */
// omitted
}
}
Reducer
public static class Reducer1 extends Reducer<Date_Uid, Time_Ant10, IntWritable, Time_Ant10s> {
/* <date_uid, time_ant10> -> <date, time_ant10s> */
private IntWritable date = new IntWritable();
@Override
protected void reduce(Date_Uid date_uid, Iterable<Time_Ant10> time_ant10s, Context context) throws IOException, InterruptedException {
date.set(date_uid.getDate());
// count ants
int num = 0;
for(Time_Ant10 time_ant10 : time_ant10s){
num++;
}
if(num>=1){
Time_Ant10[] temp = new Time_Ant10[num];
int i=0;
for(Time_Ant10 time_ant10 : time_ant10s){
String time = time_ant10.getTimeStr();
int ant10 = time_ant10.getAnt10();
temp[i] = new Time_Ant10(time, ant10);
i++;
}
context.write(date, new Time_Ant10s(temp));
}
}
}
Writer
public static class CommaTextOutputFormat extends TextOutputFormat<IntWritable, Time_Ant10s> {
@Override
public RecordWriter<IntWritable, Time_Ant10s> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
String extension = ".txt";
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<IntWritable, Time_Ant10s>(fileOut, ",");
}
}
Custom Writables
// Time
public static class Time implements Writable {
private int h, m, s;
public Time() {}
public Time(int h, int m, int s) {
this.h = h;
this.m = m;
this.s = s;
}
public Time(String time) {
String[] hms = time.split(":", 0);
this.h = Integer.parseInt(hms[0]);
this.m = Integer.parseInt(hms[1]);
this.s = Integer.parseInt(hms[2]);
}
public void set(int h, int m, int s) {
this.h = h;
this.m = m;
this.s = s;
}
public void set(String time) {
String[] hms = time.split(":", 0);
this.h = Integer.parseInt(hms[0]);
this.m = Integer.parseInt(hms[1]);
this.s = Integer.parseInt(hms[2]);
}
public int[] getTime() {
int[] time = new int[3];
time[0] = this.h;
time[1] = this.m;
time[2] = this.s;
return time;
}
public String getTimeStr() {
return String.format("%1$02d:%2$02d:%3$02d", this.h, this.m, this.s);
}
public int getTimeInt() {
return this.h * 10000 + this.m * 100 + this.s;
}
@Override
public void readFields(DataInput in) throws IOException {
h = in.readInt();
m = in.readInt();
s = in.readInt();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(h);
out.writeInt(m);
out.writeInt(s);
}
}
// Time_Ant10
public static class Time_Ant10 implements Writable {
private Time time;
private int ant10;
public Time_Ant10() {
this.time = new Time();
}
public Time_Ant10(Time time, int ant10) {
this.time = time;
this.ant10 = ant10;
}
public Time_Ant10(String time, int ant10) {
this.time = new Time(time);
this.ant10 = ant10;
}
public void set(Time time, int ant10) {
this.time = time;
this.ant10 = ant10;
}
public void set(String time, int ant10) {
this.time = new Time(time);
this.ant10 = ant10;
}
public int[] getTime() {
return this.time.getTime();
}
public String getTimeStr() {
return this.time.getTimeStr();
}
public int getTimeInt() {
return this.time.getTimeInt();
}
public int getAnt10() {
return this.ant10;
}
@Override
public void readFields(DataInput in) throws IOException {
time.readFields(in);
ant10 = in.readInt();
}
@Override
public void write(DataOutput out) throws IOException {
time.write(out);
out.writeInt(ant10);
}
}
// Time_Ant10s
public static class Time_Ant10s extends ArrayWritable {
public Time_Ant10s(){
super(Time_Ant10.class);
}
public Time_Ant10s(Time_Ant10[] time_ant10s){
super(Time_Ant10.class, time_ant10s);
}
@Override
public Time_Ant10[] get() {
return (Time_Ant10[]) super.get();
}
@Override
public String toString() {
int time, ant10;
Time_Ant10[] time_ant10s = get();
String output = "";
for(Time_Ant10 time_ant10: time_ant10s){
time = time_ant10.getTimeInt();
ant10 = time_ant10.getAnt10();
output += time + "," + ant10 + ",";
}
return output;
}
}
// Data_Uid
public static class Date_Uid implements WritableComparable<Date_Uid> {
// omitted
}
Error Message
java.lang.Exception: java.lang.NullPointerException
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.NullPointerException
at CommutingTime1$Time_Ant10s.toString(CommutingTime1.java:179)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:85)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:104)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at CommutingTime1$Reducer1.reduce(CommutingTime1.java:323)
at CommutingTime1$Reducer1.reduce(CommutingTime1.java:291)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I found that the problem is that the Iterable in reduce() can't be iterated twice. So I referred to this page and changed the reducer and Time_Ant10s as below. Now everything works fine.
@redflar3: Thank you very much for the hints. I totally misunderstood where the bug in my code was.
Reducer
public static class Reducer1 extends Reducer<Date_Uid, Time_Ant10, IntWritable, Time_Ant10s> {
private IntWritable date = new IntWritable();
@Override
protected void reduce(Date_Uid date_uid, Iterable<Time_Ant10> time_ant10s, Context context) throws IOException, InterruptedException {
String time = "";
int ant10;
date.set(date_uid.getDate());
ArrayList<Time_Ant10> temp_list = new ArrayList<Time_Ant10>();
for (Time_Ant10 time_ant10 : time_ant10s){
time = time_ant10.getTimeStr();
ant10 = time_ant10.getAnt10();
temp_list.add(new Time_Ant10(time, ant10));
}
if(temp_list.size() >= 1){
Time_Ant10[] temp_array = temp_list.toArray(new Time_Ant10[temp_list.size()]);
context.write(date, new Time_Ant10s(temp_array));
}
}
}
Time_Ant10s
public static class Time_Ant10s extends ArrayWritable {
public Time_Ant10s(){
super(Time_Ant10.class);
}
public Time_Ant10s(Time_Ant10[] time_ant10s){
super(Time_Ant10.class, time_ant10s);
}
@Override
public Time_Ant10[] get() {
return (Time_Ant10[]) super.get();
}
@Override
public String toString() {
int time, ant10;
Time_Ant10[] time_ant10s = get();
String output = "";
for(Time_Ant10 time_ant10: time_ant10s){
time = time_ant10.getTimeInt();
ant10 = time_ant10.getAnt10();
output += time + "," + ant10 + ",";
}
return output;
}
}

Join with Hadoop in Java [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I've been working with Hadoop for a short time and am trying to implement a join in Java. It doesn't matter whether it is map-side or reduce-side. I chose a reduce-side join since it was supposed to be easier to implement. I know that Java is not the best choice for joins, aggregations, etc., and that Hive or Pig would be a better pick, which I have already used. However, I'm working on a research project and I have to use all three of those languages in order to deliver a comparison.
Anyway, I have two input files with different structures. One is key|value and the other is key|value1;value2;value3;value4. One record from each input file looks like the following:
Input1: 1;2010-01-10T00:00:01
Input2: 1;23;Blue;2010-01-11T00:00:01;9999-12-31T23:59:59
I followed the example in the Hadoop: The Definitive Guide book, but it didn't work for me. I'm posting my Java files here so you can see what's wrong.
public class LookupReducer extends Reducer<TextPair,Text,Text,Text> {
private String result = "";
private String msisdn;
private String attribute, product;
private long trans_dt_long, start_dt_long, end_dt_long;
private String trans_dt, start_dt, end_dt;
@Override
public void reduce(TextPair key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
context.progress();
//value without key to remember
Iterator<Text> iter = values.iterator();
for (Text val : values) {
Text recordNoKey = val; //new Text(iter.next());
String valSplitted[] = recordNoKey.toString().split(";");
//if the input is coming from CDR set corresponding values
if(key.getSecond().toString().equals(CDR.CDR_TAG))
{
trans_dt = recordNoKey.toString();
trans_dt_long = dateToLong(recordNoKey.toString());
}
//if the input is coming from Attributes set corresponding values
else if(key.getSecond().toString().equals(Attribute.ATT_TAG))
{
attribute = valSplitted[0];
product = valSplitted[1];
start_dt = valSplitted[2];
start_dt_long = dateToLong(valSplitted[2]);
end_dt = valSplitted[3];
end_dt_long = dateToLong(valSplitted[3]);
}
Text record = val; //iter.next();
//System.out.println("RECORD: " + record);
Text outValue = new Text(recordNoKey.toString() + ";" + record.toString());
if(start_dt_long < trans_dt_long && trans_dt_long < end_dt_long)
{
//concat output columns
result = attribute + ";" + product + ";" + trans_dt;
context.write(key.getFirst(), new Text(result));
System.out.println("KEY: " + key);
}
}
}
private static long dateToLong(String date){
DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date parsedDate = null;
try {
parsedDate = formatter.parse(date);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long dateInLong = parsedDate.getTime();
return dateInLong;
}
public static class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair(){
set(new Text(), new Text());
}
public TextPair(String first, String second){
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second){
set(first, second);
}
public void set(Text first, Text second){
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public void setFirst(Text first) {
this.first = first;
}
public Text getSecond() {
return second;
}
public void setSecond(Text second) {
this.second = second;
}
@Override
public void readFields(DataInput in) throws IOException {
// TODO Auto-generated method stub
first.readFields(in);
second.readFields(in);
}
@Override
public void write(DataOutput out) throws IOException {
// TODO Auto-generated method stub
first.write(out);
second.write(out);
}
@Override
public int hashCode(){
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o){
if(o instanceof TextPair)
{
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString(){
return first + ";" + second;
}
@Override
public int compareTo(TextPair tp) {
// TODO Auto-generated method stub
int cmp = first.compareTo(tp.first);
if(cmp != 0)
return cmp;
return second.compareTo(tp.second);
}
public static class FirstComparator extends WritableComparator {
protected FirstComparator(){
super(TextPair.class, true);
}
@Override
public int compare(WritableComparable comp1, WritableComparable comp2){
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
int cmp = pair1.getFirst().compareTo(pair2.getFirst());
if(cmp != 0)
return cmp;
return -pair1.getSecond().compareTo(pair2.getSecond());
}
}
public static class GroupComparator extends WritableComparator {
protected GroupComparator()
{
super(TextPair.class, true);
}
@Override
public int compare(WritableComparable comp1, WritableComparable comp2)
{
TextPair pair1 = (TextPair) comp1;
TextPair pair2 = (TextPair) comp2;
return pair1.compareTo(pair2);
}
}
}
}
public class Joiner extends Configured implements Tool {
public static final String DATA_SEPERATOR =";"; //Define the symbol for separating the output data
public static final String NUMBER_OF_REDUCER = "1"; //Define the number of the used reducer jobs
public static final String COMPRESS_MAP_OUTPUT = "true"; //if the output from the mapping process should be compressed, set COMPRESS_MAP_OUTPUT = "true" (if not set it to "false")
public static final String
USED_COMPRESSION_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"; //set the used codec for the data compression
public static final boolean JOB_RUNNING_LOCAL = true; //if you run the Hadoop job on your local machine, you have to set JOB_RUNNING_LOCAL = true
//if you run the Hadoop job on the Telefonica Cloud, you have to set JOB_RUNNING_LOCAL = false
public static final String OUTPUT_PATH = "/home/hduser"; //set the folder, where the output is saved. Only needed, if JOB_RUNNING_LOCAL = false
public static class KeyPartitioner extends Partitioner<TextPair, Text> {
@Override
public int getPartition(/*[*/TextPair key/*]*/, Text value, int numPartitions) {
System.out.println("numPartitions: " + numPartitions);
return (/*[*/key.getFirst().hashCode()/*]*/ & Integer.MAX_VALUE) % numPartitions;
}
}
private static Configuration hadoopconfig() {
Configuration conf = new Configuration();
conf.set("mapred.textoutputformat.separator", DATA_SEPERATOR);
conf.set("mapred.compress.map.output", COMPRESS_MAP_OUTPUT);
//conf.set("mapred.map.output.compression.codec", USED_COMPRESSION_CODEC);
conf.set("mapred.reduce.tasks", NUMBER_OF_REDUCER);
return conf;
}
@Override
public int run(String[] args) throws Exception {
// TODO Auto-generated method stub
if ((args.length != 3) && (JOB_RUNNING_LOCAL)) {
System.err.println("Usage: Lookup <CDR-inputPath> <Attribute-inputPath> <outputPath>");
System.exit(2);
}
//starting the Hadoop job
Configuration conf = hadoopconfig();
Job job = new Job(conf, "Join cdrs and attributes");
job.setJarByClass(Joiner.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CDRMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, AttributeMapper.class);
//FileInputFormat.addInputPath(job, new Path(otherArgs[0])); //expecting a folder instead of a file
if(JOB_RUNNING_LOCAL)
FileOutputFormat.setOutputPath(job, new Path(args[2]));
else
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.setPartitionerClass(KeyPartitioner.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
job.setReducerClass(LookupReducer.class);
job.setMapOutputKeyClass(TextPair.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Joiner(), args);
System.exit(exitCode);
}
}
public class Attribute {
public static final String ATT_TAG = "1";
public static class AttributeMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
//private Object output = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] attributes = value.toString().split(";");
String valuesInString = "";
if(attributes.length != 5)
System.err.println("Input column number not correct. Expected 5, provided " + attributes.length
+ "\n" + "Check the input file");
if(attributes.length == 5)
{
//setting the values with the input values read above
valuesInString = attributes[1] + ";" + attributes[2] + ";" + attributes[3] + ";" + attributes[4];
values.set(valuesInString);
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(attributes[0])), new Text(ATT_TAG)), values);
}
}
}
}
public class CDR {
public static final String CDR_TAG = "0";
public static class CDRMapper
extends Mapper<LongWritable, Text, TextPair, Text>{
private static Text values = new Text();
private Object output = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//partition the input line by the separator semicolon
String[] cdr = value.toString().split(";");
//setting the values with the input values read above
values.set(cdr[1]);
//output = CDR_TAG + cdr[1];
//writing out the key and value pair
context.write( new TextPair(new Text(String.valueOf(cdr[0])), new Text(CDR_TAG)), values);
}
}
}
I would be glad if anyone could at least post a link to a tutorial or a simple example where such join functionality is implemented. I have searched a lot, but either the code was not complete or there was not enough explanation.
To be honest, I have no idea what your code is trying to do, but that's probably because I'd do it in a different way and I'm not familiar with the APIs you're using.
I would start from scratch as follows:
Create a mapper to read one of the files. For each line read, write a key value pair to the context. The key is a Text created from the key and the value is another Text created by concatenating a "1" with the entire input record.
Create another mapper for the other file. This mapper acts just like the first mapper, but the value is a Text created by concatenating a "2" with the entire input record.
Write a reducer to do the join. The reduce() method will get all records written for a specific key. You can tell which input file (and therefore the data format for the record) by looking to see whether the value starts with a "1" or a "2". Once you know whether or not you have one, the other or both record types, you can write whatever logic you need to merge the data from the one or two records.
By the way, you use the MultipleInputs class to configure more than one mapper in your job/driver class.
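A minimal sketch of that approach, assuming the semicolon-separated inputs shown above (the class and field names are mine):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
// Reduce-side join sketch: tag each record with its source, then merge per key in the reducer.
public class TaggedJoin {
    // Reads the first file ("key;timestamp") and tags the value with "1".
    public static class CdrTagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(";", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text("1" + parts[1]));
            }
        }
    }
    // Reads the second file ("key;v1;v2;v3;v4") and tags the value with "2".
    public static class AttributeTagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(";", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text("2" + parts[1]));
            }
        }
    }
    // For each key, separate the two record types by their tag and emit the joined rows.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> cdrs = new ArrayList<String>();
            List<String> attributes = new ArrayList<String>();
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("1")) {
                    cdrs.add(v.substring(1));
                } else if (v.startsWith("2")) {
                    attributes.add(v.substring(1));
                }
            }
            // Inner join: one output row per (cdr, attribute) pair for this key.
            for (String cdr : cdrs) {
                for (String attribute : attributes) {
                    context.write(key, new Text(cdr + ";" + attribute));
                }
            }
        }
    }
}
In the driver, MultipleInputs.addInputPath(job, cdrPath, TextInputFormat.class, CdrTagMapper.class) plus a second call for AttributeTagMapper wires both mappers into the same job, exactly as described above; the date-range check from your original reducer would then go inside the inner loop.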
