How to make WordCount use new Java libraries in Cloudera?

I am new to Hadoop and am trying the WordCount v1.0 example here:
https://www.cloudera.com/documentation/other/tutorial/CDH5/topics/ht_usage.html
However, when I compile WordCount.java using this command:
javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* WordCount.java -d build -Xlint
It seems the compilation picks up an old version of the .jar files and gives me the following warnings (shown in the picture). However, when I check inside the classpath I declared, there are .jar files that appear to be newer versions of the required ones.
So my question is: how can I make WordCount.java use the newer files instead? I tried looking through the WordCount.java code to see which lines use those required .jar files, but could not find them.
Thanks in advance for any help.
The code of WordCount.java:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Related

Running Implementing Custom Output Format in Hadoop Tutorial

I want to run the code described in this tutorial in order to customize the output format in Hadoop. More precisely, the tutorial shows two Java files:
WordCount: the word count Java application (similar to the WordCount v1.0 of the MapReduce Tutorial in this link)
XMLOutputFormat: a Java class that extends FileOutputFormat and implements the method to customize the output.
Well, what I did was take the WordCount v1.0 of the MapReduce Tutorial (instead of using the WordCount shown in the tutorial), add job.setOutputFormatClass(XMLOutputFormat.class); to the driver, and execute the Hadoop app this way:
/usr/local/hadoop/bin/hadoop com.sun.tools.javac.Main WordCount.java && jar cf wc.jar WordCount*.class && /usr/local/hadoop/bin/hadoop jar wc.jar WordCount /home/luis/Desktop/mytest/input/ ./output_folder
note: /home/luis/Desktop/mytest/input/ and ./output_folder are the input and output folders, respectively.
Unfortunately, the terminal shows me the following error:
WordCount.java:57: error: cannot find symbol
job.setOutputFormatClass(XMLOutputFormat.class);
^
symbol: class XMLOutputFormat
location: class WordCount
1 error
Why? WordCount.java and XMLOutputFormat.java are stored in the same folder.
The following is my code.
WordCount code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setOutputFormatClass(XMLOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
XMLOutputFormat code:
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XMLOutputFormat extends FileOutputFormat<Text, IntWritable> {

  protected static class XMLRecordWriter extends RecordWriter<Text, IntWritable> {

    private DataOutputStream out;

    public XMLRecordWriter(DataOutputStream out) throws IOException {
      this.out = out;
      out.writeBytes("<Output>\n");
    }

    private void writeStyle(String xml_tag, String tag_value) throws IOException {
      out.writeBytes("<" + xml_tag + ">" + tag_value + "</" + xml_tag + ">\n");
    }

    public synchronized void write(Text key, IntWritable value) throws IOException {
      out.writeBytes("<record>\n");
      this.writeStyle("key", key.toString());
      this.writeStyle("value", value.toString());
      out.writeBytes("</record>\n");
    }

    public synchronized void close(TaskAttemptContext job) throws IOException {
      try {
        out.writeBytes("</Output>\n");
      } finally {
        out.close();
      }
    }
  }

  public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext job) throws IOException {
    String file_extension = ".xml";
    Path file = getDefaultWorkFile(job, file_extension);
    FileSystem fs = file.getFileSystem(job.getConfiguration());
    FSDataOutputStream fileOut = fs.create(file, false);
    return new XMLRecordWriter(fileOut);
  }
}
You need to either add package testpackage; at the beginning of your WordCount class
or
import testpackage.XMLOutputFormat; in your WordCount class.
The fact that they are in the same directory doesn't imply they are in the same package.
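For illustration, assuming the hypothetical package name testpackage from the answer above (any name works as long as it matches the directory the class lives in), the relevant lines would be:

// testpackage/XMLOutputFormat.java -- first statement in the file:
package testpackage;

// WordCount.java -- added alongside the other imports:
import testpackage.XMLOutputFormat;

Both files still have to be reachable from the compiler's sourcepath/classpath so they compile together.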
We need to add the XMLOutputFormat.jar file to HADOOP_CLASSPATH first so the driver code can find it, and pass it with the -libjars option so it is added to the classpath of the map and reduce JVMs.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/abc/xyz/XMLOutputFormat.jar
yarn jar wordcount.jar com.sample.test.Wordcount
-libjars /path/to/XMLOutputFormat.jar
/lab/mr/input /lab/output/output
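One caveat worth checking: -libjars (like -D and -files) is a generic option that is only parsed when the driver goes through GenericOptionsParser, typically by running it via ToolRunner. A minimal sketch of such a driver (class and package names here mirror the command above and are only illustrative):

package com.sample.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Wordcount extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already reflects whatever GenericOptionsParser pulled from
    // the command line (-libjars, -D key=value, ...).
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(Wordcount.class);
    // set mapper, combiner, reducer and
    // job.setOutputFormatClass(XMLOutputFormat.class) as in the driver above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before handing the remaining
    // arguments (input and output paths) to run().
    System.exit(ToolRunner.run(new Configuration(), new Wordcount(), args));
  }
}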

class not found exception in Hadoop

I'm trying to run a single Hadoop word count program. I'm doing this on Windows 10 64-bit with Cygwin, and this is the program I'm using:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
  {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens())
      {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
  {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
      int sum = 0;
      for (IntWritable val : values)
      {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception
  {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
I run the program as follows:
~/hadoop-1.2.1/bin/hadoop jar WordCount.jar hadoop.ProcessUnits input_dir output_dir
and I get the following error messages when I try to run the program:
Exception in thread "main" java.lang.ClassNotFoundException: hadoop.ProcessUnits
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
The problem is that hadoop.ProcessUnits should point to the main class you want to run. Your class is called WordCount, so the command should look something like:
$ hadoop jar WordCount.jar <PACKAGE>.WordCount input_dir output_dir
Your code doesn't include a package name, so I've substituted <PACKAGE>.

How to implement string matching algorithm with Hadoop?

I want to implement a string matching (Boyer-Moore) algorithm using Hadoop. I just started using Hadoop, so I have no idea how to write a Hadoop program in Java.
All the sample programs that I have seen so far are word counting examples, and I couldn't find any sample programs for string matching.
I tried searching for tutorials that teach how to write Hadoop applications in Java but couldn't find any. Can you suggest some tutorials where I can learn how to write Hadoop applications in Java?
Thanks in advance.
I haven't tested the code below, but it should get you started.
I have used the BoyerMoore implementation available here.
What the code below does:
The goal is to search for a pattern in an input document. The BoyerMoore class is initialized in the setup method using the pattern set in the configuration.
The mapper receives one line at a time and uses the BoyerMoore instance to search for the pattern. If a match is found, we write it out using the context.
There is no need for a reducer here. If the pattern is found in multiple mappers, the output will contain multiple offsets (one per mapper).
package hadoop.boyermoore;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BoyerMooreImpl {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private BoyerMoore boyerMoore;
    private static IntWritable offset;
    private Text offsetFound = new Text("offset");

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String line = itr.nextToken();
        int offset1 = boyerMoore.search(line);
        if (line.length() != offset1) {
          offset = new IntWritable(offset1);
          context.write(offsetFound, offset);
        }
      }
    }

    @Override
    public final void setup(Context context) {
      if (boyerMoore == null)
        boyerMoore = new BoyerMoore(context.getConfiguration().get("pattern"));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("pattern", "your_pattern_here");
    Job job = Job.getInstance(conf, "BoyerMoore");
    job.setJarByClass(BoyerMooreImpl.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
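The BoyerMoore class itself comes from the implementation linked above and is not shown here; the mapper only assumes a constructor that takes the pattern and a search method that returns the offset of the first match, or the length of the searched text when there is no match. A rough sketch with that contract (bad-character rule only, extended-ASCII input assumed; not the full linked implementation) could look like:

package hadoop.boyermoore;

// Minimal Boyer-Moore sketch matching the contract used by the mapper above:
// search(text) returns the offset of the first occurrence of the pattern,
// or text.length() if the pattern does not occur.
public class BoyerMoore {

  private static final int R = 256;   // alphabet size (extended ASCII assumed)
  private final int[] right;          // rightmost position of each character in the pattern
  private final String pattern;

  public BoyerMoore(String pattern) {
    this.pattern = pattern;
    this.right = new int[R];
    for (int c = 0; c < R; c++) {
      right[c] = -1;
    }
    for (int j = 0; j < pattern.length(); j++) {
      right[pattern.charAt(j)] = j;
    }
  }

  public int search(String text) {
    int m = pattern.length();
    int n = text.length();
    int skip;
    for (int i = 0; i <= n - m; i += skip) {
      skip = 0;
      for (int j = m - 1; j >= 0; j--) {
        if (pattern.charAt(j) != text.charAt(i + j)) {
          // bad-character heuristic: shift the pattern past the mismatch
          skip = Math.max(1, j - right[text.charAt(i + j)]);
          break;
        }
      }
      if (skip == 0) {
        return i;   // match found at offset i
      }
    }
    return n;       // no match
  }
}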
I don't know if this is the correct way to implement the algorithm in parallel, but this is what I figured out:
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class StringMatching extends Configured implements Tool {

  public static void main(String args[]) throws Exception {
    long start = System.currentTimeMillis();
    int res = ToolRunner.run(new StringMatching(), args);
    long end = System.currentTimeMillis();
    System.exit((int) (end - start));
  }

  public int run(String[] args) throws Exception {
    Path inputPath = new Path(args[0]);
    Path outputPath = new Path(args[1]);

    Configuration conf = getConf();
    Job job = new Job(conf, this.getClass().toString());

    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);

    job.setJobName("StringMatching");
    job.setJarByClass(StringMatching.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    Mapper.Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      BoyerMoore bm = new BoyerMoore();
      boolean flag = bm.findPattern(key.toString().trim().toLowerCase(), "abc");
      if (flag) {
        context.write(key, new IntWritable(1));
      } else {
        context.write(key, new IntWritable(0));
      }
    }
  }
}
I'm using AWS (Amazon Web Services), so I can select from the console the number of nodes I want my program to run on simultaneously. So I'm assuming that the map and reduce methods I have used should be enough to run the Boyer-Moore string matching algorithm in parallel.

Mapreduce2 program error: Not a valid JAR

I have installed Hadoop version 2.2.0 on Ubuntu. When I run:
yarn jar
hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
wordcount /user/ubuntu/wordcount/input/file01.txt /output
It runs fine.
When I write a sample program, make the JAR using the Eclipse export utility, and run it using:
yarn jar /user/ubuntu/WordCountNew.jar com.sample.WordCountNew
/user/ubuntu/wordcount/input/file01.txt /output9
It shows: Not a valid JAR: /user/ubuntu/WordCountNew.jar
When I run the code in Eclipse, it also shows this error:
2015-02-02 16:08:53,077 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at com.kumar.WordCountNew.main(WordCountNew.java:62)
Code:
package com.sample;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountNew {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum = sum + val.get();
      }
      // write the total once per key, after all values have been summed
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCountNew.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

getCredentials method error when trying to run "Word Count" program in eclipse by importing all JAR files

Error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.UserGroupInformation.getCredentials()Lorg/apache/hadoop/security/Credentials;
at org.apache.hadoop.mapreduce.Job.<init>(Job.java:135)
at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:176)
at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:195)
at WordCount.main(WordCount.java:20)
Hadoop version 2.2.0
WordCount.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    Job job = Job.getInstance(new Configuration(), "word count");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setJarByClass(WordCount.class);
    job.setJobName("WordCount");

    job.submit();
  }
}
WordMapper.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  @Override
  public void map(Object key, Text value,
                  Context context) throws IOException, InterruptedException {
    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());
    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      context.write(word, one);
    }
  }
}
SumReducer.java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    Iterator<IntWritable> it = values.iterator();
    while (it.hasNext()) {
      wordCount += it.next().get();
    }
    totalWordCount.set(wordCount);
    context.write(key, totalWordCount);
  }
}
Please let me know what can be done. The latest MapReduce API is used for the program. All the jars that came with Hadoop 2.2.0 are also imported into Eclipse.
Thanks :)
Are you using an Eclipse plugin for Hadoop? If not, that is the problem. Without the plugin, Eclipse just runs the WordCount class and Hadoop can't find the necessary jars. Bundle all the jars, including WordCount, and run it on the cluster.
If you want to run it from Eclipse, you need the Eclipse plugin. If you don't have one, you can build the plugin by following these instructions.
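For example (class names as in the code above; the HDFS paths are placeholders to adjust to your environment), compiling, bundling, and running on the cluster could look like:
javac -cp $(hadoop classpath) WordCount.java WordMapper.java SumReducer.java
jar cf wordcount.jar WordCount*.class WordMapper*.class SumReducer*.class
hadoop jar wordcount.jar WordCount <hdfs_input_dir> <hdfs_output_dir>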
