Executing a jar built through ant clean dist - why does it appear under a different name? - java

I'm working with Hadoop and need to create a .jar file that combines all of my classes in the /src folder. Every time I build it, it appears as WordCount.jar instead of Twitter.jar, which is the name I have stated in my code below:
import java.util.Arrays;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Twitter {

    public static void runJob(String[] input, String output) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(Twitter.class);
        job.setReducerClass(TwitterReducer.class);
        job.setMapperClass(TwitterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        Path outputPath = new Path(output);
        FileInputFormat.setInputPaths(job, StringUtils.join(input, ","));
        FileOutputFormat.setOutputPath(job, outputPath);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        job.waitForCompletion(true);
    }

    public static void main(String[] args) throws Exception {
        runJob(Arrays.copyOfRange(args, 0, args.length - 1), args[args.length - 1]);
    }
}
So I am unsure what is wrong. The files inside the .jar are exactly the same as those in the /src folder.

The name of the jar file has nothing to do with the name of any class in it. You need to check the Ant buildfile, specifically the target that creates the jar. The Ant task that creates the jar file is usually the jar task, and the file name can be specified via its destfile attribute.
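For example, a dist target along these lines controls the name of the produced jar (the target name and paths here are only placeholders; your buildfile will differ):

<target name="dist" depends="compile">
    <jar destfile="dist/Twitter.jar" basedir="build/classes">
        <manifest>
            <attribute name="Main-Class" value="Twitter"/>
        </manifest>
    </jar>
</target>

Renaming the jar therefore means changing destfile (or the property it is built from), not the Java class.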

Related

Copy JSON file from Local to HDFS

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWriter extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        //String localInputPath = args[0];
        Path outputPath = new Path(args[0]); // argument for the output location
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        OutputStream os = fs.create(outputPath);
        // the data set is copied into the input stream through a buffer mechanism
        InputStream is = new BufferedInputStream(new FileInputStream("/home/acadgild/acadgild.txt"));
        // copy the data set from the input stream to the output stream
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsWriter(), args);
        System.exit(returnCode);
    }
}
I need to move the data from the local file system to HDFS. I got the above code from another blog and it's not working; can anyone help me with this?
I also need to parse the JSON using MapReduce, group it by DateTime, and move it to HDFS.
MapReduce is a distributed job processing framework: for each mapper, "local" means the local filesystem of the node on which that mapper is running.
What you want is to read from the local filesystem of a given node, put the data onto HDFS, and then process it via MapReduce.
There are multiple tools available for copying from the local filesystem of one node to HDFS:
hdfs dfs -put localPath hdfsPath (shell command)
Flume
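For example, to copy the local file from the code above into an HDFS directory (the HDFS path here is just an example):

hdfs dfs -mkdir -p /user/acadgild/input
hdfs dfs -put /home/acadgild/acadgild.txt /user/acadgild/input/

Once the data is on HDFS, the MapReduce job that parses the JSON and groups by DateTime can read it from there.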

MapReduce set input and output

I have a file
import java.io.IOException;
import java.nio.file.Paths;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class ViewCount extends Configured implements Tool {

    public static void main(String args[]) throws Exception {
        int res = ToolRunner.run(new ViewCount(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        //Path inputPath = new Path(args[0]);
        Path inputPath = Paths.get("C:/WorkSpace/input.txt");
        Path outputPath = Paths.get("C:/WorkSpace/output.txt");
        Configuration conf = getConf();
        Job job = new Job(conf, this.getClass().toString());
I am trying to run the app on Windows. How can I set inputPath and outputPath? The approach I use now doesn't work. Before, I had
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
and I had to go to the command line. Now I want to run the app from the IDE.
I'm getting
Required:
org.apache.hadoop.fs.Path
Found:
java.nio.file.Path
For Eclipse, you can set the arguments via
Run -> Run Configurations -> Arguments.
It should be the same in IntelliJ.
The error tells you that it is expecting an org.apache.hadoop.fs.Path, but instead it receives a java.nio.file.Path.
This means that you should change the second import in your code to
org.apache.hadoop.fs.Path. IDE import suggestions can be wrong sometimes ;)
Change the import and then go back to the approach you already had for adding the input and output paths. Those arguments are set in Eclipse by right-clicking the project -> Run As -> Run Configurations -> Arguments. The two paths should be whitespace-separated. Apply and run!
For the next executions, just run the project.
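Put together, with the import changed back to org.apache.hadoop.fs.Path, the start of run() would look roughly like this (a minimal sketch; the example paths are only placeholders supplied through the run configuration):

public int run(String[] args) throws Exception {
    // args[0] and args[1] come from the IDE run configuration,
    // e.g. C:/WorkSpace/input.txt C:/WorkSpace/output
    Path inputPath = new Path(args[0]);   // org.apache.hadoop.fs.Path
    Path outputPath = new Path(args[1]);
    Configuration conf = getConf();
    Job job = new Job(conf, this.getClass().toString());
    // ... set mapper, reducer, input/output formats and paths as before ...
    return job.waitForCompletion(true) ? 0 : 1;
}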

Running MR program with separate mapper, reducer and driver classes

maxtempmapper.java class:
package com.hadoop.gskCodeBase.maxTemp;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
maxtempreducer.java class:
package com.hadoop.gskCodeBase.maxTemp;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
maxtempdriver.java class:
package com.hadoop.gskCodeBase.maxTemp;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTempDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TemperatureDriver <input path> <output path>");
            System.exit(-1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(MaxTempDriver.class);
        job.setJobName("Max Temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // wait for the job once and return its status; ToolRunner's caller handles System.exit
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        MaxTempDriver driver = new MaxTempDriver();
        int exitCode = ToolRunner.run(driver, args);
        System.exit(exitCode);
    }
}
I have to execute the above three classes on a single-node Hadoop cluster on Windows using the command prompt.
Can someone please help me with how to run these three classes from the command prompt on Windows?
Package all the compiled classes into a single .jar file. Then just run it as you normally would. On Windows, it's easier to run Hadoop via a Cygwin terminal. You can execute the job with the following command:
hadoop jar <path to .jar> <path to input folder in hdfs> <path to output folder in hdfs>
Eg:
hadoop jar wordcount.jar /input /output
-UPDATE-
You should pass your driver class to job.setJarByClass(). In this case, it would be your MaxTempDriver.class.
In Eclipse, you can create a jar file by right-clicking your source folder > Export > JAR file and following the steps from there. You can set your main class during the process as well.
Hope this answers your question.
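For this particular job the command would look something like the following (the jar name and HDFS paths are just examples; if the main class is not set in the jar's manifest, pass the fully qualified driver class name after the jar):

hadoop jar maxtemp.jar com.hadoop.gskCodeBase.maxTemp.MaxTempDriver /user/hadoop/ncdc/input /user/hadoop/ncdc/output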

unable to set mapreduce.job.reduces through generic option parser

hadoop jar MapReduceTryouts-1.jar invertedindex.simple.MyDriver -D mapreduce.job.reduces=10 /user/notprabhu2/Input/potter/ /user/notprabhu2/output
I have been trying in vain to set the number of reducers through the -D option provided by GenericOptionsParser, but it does not seem to work and I have no idea why.
I tried -D mapreduce.job.reduces=10 (with a space after -D) and also
-Dmapreduce.job.reduces=10 (without a space after -D), but nothing seems to change.
In my driver class I have implemented Tool.
package invertedindex.simple;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(MyDriver.class);
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(getConf()).delete(outputPath, true);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, outputPath);
        job.setNumReduceTasks(3);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(exitCode);
    }
}
Since I have explicitly set the number of reducers to 3 in my driver code, I always end up with 3 reducers.
I am using CDH 5.4.7 which has Hadoop 2.6.0 on a 2 node cluster on Google Compute Engine.
Figured it out. It turned out to be something silly, but I'm still posting the answer in case someone else makes the same mistake.
It seems the job.setNumReduceTasks(3); line in my driver class takes precedence over -D mapreduce.job.reduces=10 on the command line.
When I removed the job.setNumReduceTasks(3); line from my code, everything worked fine.
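For reference, the precedence is easy to see by printing the value inside run(); this is only an illustrative sketch based on the driver above:

Configuration conf = getConf();
// the value supplied with -D mapreduce.job.reduces=10 is already present in conf here,
// because ToolRunner/GenericOptionsParser parses the -D options before run() is called
System.out.println("reduces from conf: " + conf.get("mapreduce.job.reduces"));
Job job = Job.getInstance(conf);
// any later explicit call such as job.setNumReduceTasks(3) overwrites that value for the job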
You can also set the property for the number of reducers, mapreduce.job.reduces, in an XML tag.
Set the property in mapred-site.xml, which is picked up by the code through the configuration:

<property>
    <name>mapreduce.job.reduces</name>
    <value>5</value>
</property>

Then relaunch the Hadoop processes.

Can a Jar File be updated programmatically without rewriting the whole file?

It is possible to update individual files in a JAR file using the jar command as follows:
jar uf TicTacToe.jar images/new.gif
Is there a way to do this programmatically?
I have to rewrite the entire jar file if I use JarOutputStream, so I was wondering if there was a similar "random access" way to do this. Given that it can be done using the jar tool, I had expected there to be a similar way to do it programmatically.
It is possible to update just parts of the JAR file using the Zip File System Provider available since Java 7:
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

public class ZipFSPUser {

    public static void main(String[] args) throws Throwable {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        // locate the file system by using the syntax
        // defined in java.net.JarURLConnection
        URI uri = URI.create("jar:file:/codeSamples/zipfs/zipfstest.zip");
        try (FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {
            Path externalTxtFile = Paths.get("/codeSamples/zipfs/SomeTextFile.txt");
            Path pathInZipfile = zipfs.getPath("/SomeTextFile.txt");
            // copy a file into the zip file
            Files.copy(externalTxtFile, pathInZipfile,
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
Yes, if you use this open-source library you can modify it in this way as well:
https://truevfs.java.net
import java.io.File;
import java.io.IOException;
import java.io.Writer;
// TFile and TFileWriter come from the TrueVFS access module
import net.java.truevfs.access.TFile;
import net.java.truevfs.access.TFileWriter;

public static void main(String[] args) throws IOException {
    File entry = new TFile("c:/tru6413/server/lib/nxps.jar/dir/second.txt");
    Writer writer = new TFileWriter(entry);
    try {
        writer.write(" this is writing into a file inside an archive");
    } finally {
        writer.close();
    }
}
