I am using MultipleTextOutputFormat to split a single input file into multiple files, i.e. each line goes to its own output file.
This is my code:
public class MOFExample extends Configured implements Tool {
private static double count = 0;
static class KeyBasedMultipleTextOutputFormat extends
MultipleTextOutputFormat<Text, Text> {
@Override
protected String generateFileNameForKeyValue(Text key, Text value,
String name) {
return count++ + "_";// + name;
}
}
/**
* The main job driver.
*/
public int run(final String[] args) throws Exception {
Path csvInputs = new Path(args[0]);
Path outputDir = new Path(args[1]);
JobConf jobConf = new JobConf(super.getConf());
jobConf.setJarByClass(MOFExample.class);
jobConf.setMapperClass(IdentityMapper.class);
jobConf.setInputFormat(KeyValueTextInputFormat.class);
jobConf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputKeyClass(Text.class);
FileInputFormat.setInputPaths(jobConf, csvInputs);
FileOutputFormat.setOutputPath(jobConf, outputDir);
//jobConf.setNumMapTasks(4);
jobConf.setNumReduceTasks(4);
return JobClient.runJob(jobConf).isSuccessful() ? 0 : 1;
}
public static void main(final String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MOFExample(), args);
System.exit(res);
}
}
This code runs fine on a small text file, but when the input file has more than 1900 lines (which is still not a large file), it throws an exception:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at MOFExample.run(MOFExample.java:57)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at MOFExample.main(MOFExample.java:61)
I also tried this tutorial. It also works fine with a small input file, but with a large input file it leaves the output directory empty without throwing any exception.
Note: I am using a single-node cluster.
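One detail of generateFileNameForKeyValue can be checked in isolation, without Hadoop. The sketch below only reproduces the string concatenation from the code above; the class and method names are made up for illustration. Because count is a double, the generated names carry a ".0" suffix, and every record gets its own name, so the number of output files grows with the number of input lines. Whether that is what triggers the failure is not established here.

```java
public class FileNameCounterDemo {
    // Same expression as in the question: a double concatenated with "_".
    static String nameFromDouble(double count) {
        return count + "_";
    }

    // Hypothetical alternative: a long counter gives names without ".0".
    static String nameFromLong(long count) {
        return count + "_";
    }

    public static void main(String[] args) {
        System.out.println(nameFromDouble(0));    // first record
        System.out.println(nameFromDouble(1900)); // record number 1900
        System.out.println(nameFromLong(1900));
    }
}
```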
I am trying to analyze retail store data, where I want to work out the breakdown of sales by city. Here is my data:
Date Time City Product-Cat Sale-Value Payment-Mode
2012-01-01 09:20 Fort Worth Women's Clothing 153.57 Visa
2012-01-01 09:00 San Jose Mens Clothing 214.05 Rupee
2012-01-01 09:00 San Diego Music 76.43 Amex
2012-01-01 09:00 New York Cameras 45.76 Visa
Now I want to calculate the sales breakdown by product category across all the stores.
Here are the mapper, the reducer and the main class:
public class RetailDataAnalysis {
public static class RetailDataAnalysisMapper extends Mapper<Text,Text,Text,Text>{
// when trying with LongWritable Key
public void map(LongWritable key,Text Value,Context context) throws IOException, InterruptedException{
String analyser [] = Value.toString().split(",");
Text productCategory = new Text(analyser[3]);
Text salesPrice = new Text(analyser[4]);
context.write(productCategory, salesPrice);
}
// When trying with Text key
public void map(Text key,Text Value,Context context) throws IOException, InterruptedException{
String analyser [] = Value.toString().split(",");
Text productCategory = new Text(analyser[3]);
Text salesPrice = new Text(analyser[4]);
context.write(productCategory, salesPrice);
}
}
public static class RetailDataAnalysisReducer extends Reducer<Text,Text,Text,Text>{
protected void reduce(Text key,Iterable<Text> values,Context context)throws IOException, InterruptedException{
String csv ="";
for(Text value:values){
if(csv.length()>0){
csv+= ",";
}
csv+=value.toString();
}
context.write(key, new Text(csv));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String [] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
if(otherArgs.length<2){
System.out.println("Usage Retail Data ");
System.exit(2);
}
Job job= new Job(conf,"Retail Data Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setCombinerClass(RetailDataAnalysisReducer.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
for(int i=0;i<otherArgs.length-1;++i){
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
The exception I get when using a LongWritable key is:
18/04/11 09:15:40 INFO mapreduce.Job: Task Id : attempt_1523355254827_0008_m_000000_2, Status : FAILED
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069)
The exception I get when using a Text key is:
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:712)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
Please help me solve this; I am very new to Hadoop.
You may need a different input format class. The default is TextInputFormat, which splits the file line by line and passes each line's byte offset in the file as a LongWritable key and the line itself as a Text value.
You can specify the input format class this way:
job.setInputFormatClass(TextInputFormat.class);
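To make the key/value shape concrete, here is a rough plain-Java simulation of what a line-oriented reader hands to the mapper. It is only a sketch, not Hadoop code: OffsetLineDemo, OffsetLine and readRecords are hypothetical names, and it assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetLineDemo {
    /** One record: the byte offset of the line start plus the line text. */
    static final class OffsetLine {
        final long offset;
        final String line;

        OffsetLine(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits the content on '\n' and tracks the byte offset of each line. */
    static List<OffsetLine> readRecords(String content) {
        List<OffsetLine> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new OffsetLine(offset, line));
            // advance past the line's bytes and its '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (OffsetLine r : readRecords("a,b\ncc,dd\ne\n")) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

The key is a position in the file, not anything taken from the line itself, which is why a mapper that only cares about the line content can simply ignore it.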
In your case, since you only need the values and not the key, you can use LongWritable as the key type:
public static class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text Value, Context context) throws IOException, InterruptedException {
//...
}
}
Edit:
Here is the whole code after modifying it to use LongWritable as the key:
public class RetailDataAnalysis {
public static class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text Value, Context context) throws IOException, InterruptedException {
String analyser[] = Value.toString().split(",");
Text productCategory = new Text(analyser[3]);
Text salesPrice = new Text(analyser[4]);
context.write(productCategory, salesPrice);
}
}
public static class RetailDataAnalysisReducer extends Reducer<Text, Text, Text, Text> {
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String csv = "";
for (Text value : values) {
if (csv.length() > 0) {
csv += ",";
}
csv += value.toString();
}
context.write(key, new Text(csv));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.out.println("Usage Retail Data ");
System.exit(2);
}
Job job = new Job(conf, "Retail Data Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setCombinerClass(RetailDataAnalysisReducer.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Also, if you are splitting the data on commas, your data should be a CSV file with date and time as separate fields, so that analyser[3] is the product category and analyser[4] is the sale value, like this:
2012-01-01,09:20,Fort Worth,Women's Clothing,153.57,Visa
2012-01-01,09:00,San Jose,Mens Clothing,214.05,Rupee
2012-01-01,09:00,San Diego,Music,76.43,Amex
2012-01-01,09:00,New York,Cameras,45.76,Visa
and not space-separated as in your question.
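The field extraction can be tried outside Hadoop. The sketch below assumes six comma-separated fields in the order of the question's header (Date, Time, City, Product-Cat, Sale-Value, Payment-Mode), so that indices 3 and 4 match the mapper; SaleLineParser and categoryAndSale are hypothetical names.

```java
public class SaleLineParser {
    /**
     * Returns {productCategory, saleValue} from one CSV line.
     * Assumes six comma-separated fields: date, time, city,
     * product category, sale value, payment mode.
     */
    static String[] categoryAndSale(String csvLine) {
        String[] fields = csvLine.split(",");
        if (fields.length < 6) {
            throw new IllegalArgumentException(
                    "expected 6 comma-separated fields: " + csvLine);
        }
        return new String[] { fields[3].trim(), fields[4].trim() };
    }

    public static void main(String[] args) {
        String[] r = categoryAndSale(
                "2012-01-01,09:20,Fort Worth,Women's Clothing,153.57,Visa");
        System.out.println(r[0] + " -> " + r[1]);
    }
}
```

A length check like this inside map() would turn a malformed or space-separated line into a clear error instead of an ArrayIndexOutOfBoundsException.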
When you read a file using MapReduce, the default file input format reads an entire line and sends it to the mapper as a <LongWritable, Text> pair, so the mapper declaration becomes:
public static class RetailDataAnalysisMapper extends Mapper<LongWritable,Text,Text,Text>
If you need to read the input as
public static class RetailDataAnalysisMapper extends Mapper<Text,Text,Text,Text>
you would need to change the file input format and use your own custom input format together with a custom record reader.
Then you need to add the following line in the driver code:
job.setInputFormatClass(YourCustomInputFormat.class); // placeholder name for your own input format class
Hadoop understands everything in the form of <key, value> pairs,
so when you read a file, the offset becomes the LongWritable key and the line read becomes the value.
So you need to use the default signature Mapper<LongWritable, Text, <anything>, <anything>>.
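As a side note, Hadoop also ships KeyValueTextInputFormat (it appears in other snippets on this page), which delivers a Text key and a Text value by splitting each line at the first separator, a tab by default. The plain-Java sketch below only imitates that splitting rule; KeyValueLineDemo and splitKeyValue are hypothetical names, not Hadoop API.

```java
public class KeyValueLineDemo {
    /**
     * Splits a line at the first occurrence of the separator.
     * If the separator is absent, the whole line is the key and
     * the value is empty.
     */
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("Cameras\t45.76\tVisa", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

With such an input format the mapper would be declared as Mapper<Text, Text, ...>; with the default format it has to be Mapper<LongWritable, Text, ...>.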
I am attempting to test a method that returns a File object using JUnit and JMockit. I am a beginner with both of these.
The problem I am having is that I can't figure out how to properly/successfully mock the implementation method returning a file, since in reality, the user has to actually select a file for the method to return. The error I keep running into is:
java.lang.IllegalStateException: Missing invocation to mocked type at this point; please make sure such invocations appear only after the declaration of a suitable mock field or parameter
Any suggestions?
Here is a recreation of my implementation:
public final class MyClass {
public static File OpenFile(Stage stage, String title, String fileTypeText, ArrayList<String> fileType) throws Exception {
File file = null;
try {
FileChooser fileChooser = new FileChooser();
fileChooser.setTitle(title);
FileChooser.ExtensionFilter extFilter = new FileChooser.ExtensionFilter(fileTypeText + fileType, fileType);
fileChooser.getExtensionFilters().add(extFilter);
file = fileChooser.showOpenDialog(stage);
return file;
}
catch (Exception e) {
if(fileType==null) {
...
}
return file;
}
}
}
Here is a recreation of my attempted JUnit test:
@Test
public void TestOpenFile(@Mocked Stage stage) throws Exception {
final ArrayList<String> extensions = new ArrayList<String>();
extensions.add(".txt");
final File file = null;
new Expectations() {{
MyClass.OpenFile(stage, anyString, anyString, extensions); returns(file);
}};
assertEquals(file, MyClass.OpenFile(stage, "some title", "some type", extensions));
}
Your solution is correct, but I would use expectations instead:
public void TestOpenFile(@Mocked FileChooser chooser) throws Exception {
new Expectations() {{
chooser.showOpenDialog(stage); result = expectedFile;
}};
final File actualFile = MyClass.OpenFile(...);
assertEquals(expectedFile, actualFile);
}
I find this easier to understand and write (my personal preference).
I realized that I was approaching the problem incorrectly at first. What I did to resolve this was:
Mock the FileChooser.showOpenDialog method to return a file instead of trying to mock my own method to return a file, which would have defeated the purpose of testing.
final File expectedFile = new File("abc");
new MockUp<FileChooser>() {
@Mock
File showOpenDialog(final Window overWindow) {
return expectedFile;
}
};
final File actualFile = MyClass.OpenFile(...);
assertEquals(expectedFile, actualFile);
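The underlying idea (fake the collaborator that talks to the user, not the method under test) can also be shown without any mocking library. The sketch below is not JMockit; it is a plain-Java analogue with a hand-written stub, and FileOpener and its Supplier<File> seam are hypothetical names standing in for the FileChooser call.

```java
import java.io.File;
import java.util.function.Supplier;

public class FileOpener {
    // Hypothetical seam: in production this would wrap
    // fileChooser.showOpenDialog(stage); in a test it is a stub.
    private final Supplier<File> chooser;

    FileOpener(Supplier<File> chooser) {
        this.chooser = chooser;
    }

    /** The method under test: it runs for real, only the chooser is faked. */
    File openFile() {
        return chooser.get();
    }

    public static void main(String[] args) {
        File expected = new File("abc");
        FileOpener opener = new FileOpener(() -> expected);
        System.out.println(opener.openFile() == expected);
    }
}
```

Stubbing MyClass.OpenFile itself, as in the first attempt, would leave nothing of the real method to verify.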
This is code from GitHub that I am using for learning purposes. I tried to run it in Eclipse and got the following errors:
Exception in thread "main" org.matsim.core.utils.io.UncheckedIOException: java.io.FileNotFoundException: args[0]
at org.matsim.core.utils.io.IOUtils.getBufferedReader(IOUtils.java:125)
at org.matsim.core.utils.io.IOUtils.getBufferedReader(IOUtils.java:72)
at org.matsim.core.utils.io.MatsimXmlParser.parse(MatsimXmlParser.java:147)
at org.matsim.core.config.ConfigUtils.loadConfig(ConfigUtils.java:59)
at test1.RunCarsharing.main(RunCarsharing.java:23)
Caused by: java.io.FileNotFoundException: args[0]
... 5 more
The main program is as follows:
public class RunCarsharing {
public static void main(String[] args) {
Logger.getLogger( "org.matsim.core.controler.Injector" ).setLevel(Level.OFF);
final Config config = ConfigUtils.loadConfig(args[0]);
CarsharingUtils.addConfigModules(config);
final Scenario sc = ScenarioUtils.loadScenario(config);
final Controler controler = new Controler( sc );
installCarSharing(controler);
controler.run();
}
public static void installCarSharing(final Controler controler) {
Scenario sc = controler.getScenario() ;
controler.addOverridingModule( new AbstractModule() {
@Override
public void install() {
this.addPlanStrategyBinding("RandomTripToCarsharingStrategy").to( RandomTripToCarsharingStrategy.class ) ;
this.addPlanStrategyBinding("CarsharingSubtourModeChoiceStrategy").to( CarsharingSubtourModeChoiceStrategy.class ) ;
}
});
controler.addOverridingModule(new AbstractModule() {
@Override
public void install() {
bindMobsim().toProvider( CarsharingQsimFactory.class );
}
});
controler.setTripRouterFactory(CarsharingUtils.createTripRouterFactory(sc));
//setting up the scoring function factory, inside different scoring functions are set-up
controler.setScoringFunctionFactory(new CarsharingScoringFunctionFactory( sc.getConfig(), sc.getNetwork()));
final CarsharingConfigGroup csConfig = (CarsharingConfigGroup) controler.getConfig().getModule(CarsharingConfigGroup.GROUP_NAME);
controler.addControlerListener(new CarsharingListener(controler,
csConfig.getStatsWriterFrequency() ) ) ;
}
}
If the code and stacktrace are accurate, then the only way you can get that exception message is if something is trying to open a file whose filename is "args[0]".
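A small experiment shows the difference between the string literal "args[0]" and the array element args[0]. This is only a sketch with made-up names (ArgsLiteralDemo, tryOpen), and it assumes no file called args[0] exists in the working directory.

```java
import java.io.FileInputStream;
import java.io.IOException;

public class ArgsLiteralDemo {
    /** Tries to open the file and reports either success or the error text. */
    static String tryOpen(String fileName) {
        try (FileInputStream in = new FileInputStream(fileName)) {
            return "opened " + fileName;
        } catch (IOException e) {
            // the exception message starts with the path that was requested
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The literal text "args[0]" is treated as a file name.
        System.out.println(tryOpen("args[0]"));
        // The first program argument, if one was actually passed.
        if (args.length > 0) {
            System.out.println(tryOpen(args[0]));
        } else {
            System.out.println("no program argument given");
        }
    }
}
```

So it is worth checking whether the text args[0] itself ended up somewhere as the config path, for example in the program arguments of the run configuration.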
In Hadoop 2.4.0, I get the following error when executing the code sample below. I think there is a Hadoop version mismatch. Could you review the code? How can I fix it?
I am trying to write a MapReduce job that copies an HCatalog table.
Thank you.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.getJobInfo(HCatBaseOutputFormat.java:94)
at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.getOutputFormat(HCatBaseOutputFormat.java:82)
at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:72)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.deneme.hadoop.UseHCat.run(UseHCat.java:102)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.deneme.hadoop.UseHCat.main(UseHCat.java:107)
Code Sample
public class UseHCat extends Configured implements Tool{
public static class Map extends Mapper<WritableComparable, HCatRecord,Text,IntWritable> {
String groupname;
@Override
protected void map( WritableComparable key,
HCatRecord value,
org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// The group table from /etc/group has name, 'x', id
groupname = (String) value.get(0);
int id = (Integer) value.get(2);
// Just select and emit the name and ID
context.write(new Text(groupname), new IntWritable(id));
}
}
public static class Reduce extends Reducer<Text, IntWritable,
WritableComparable, HCatRecord> {
protected void reduce( Text key,
java.lang.Iterable<IntWritable> values,
org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
WritableComparable, HCatRecord>.Context context)
throws IOException, InterruptedException {
// Only expecting one ID per group name
Iterator<IntWritable> iter = values.iterator();
IntWritable iw = iter.next();
int id = iw.get();
// Emit the group name and ID as a record
HCatRecord record = new DefaultHCatRecord(2);
record.set(0, key.toString());
record.set(1, id);
context.write(null, record);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf(); //hdfs://sandbox.hortonworks.com:8020
//conf.set("fs.defaultFS", "hdfs://192.168.1.198:8020");
//conf.set("mapreduce.job.tracker", "192.168.1.115:50001");
//Configuration conf = new Configuration();
//conf.set("fs.defaultFS", "hdfs://192.168.1.198:8020/data");
args = new GenericOptionsParser(conf, args).getRemainingArgs();
// Get the input and output table names as arguments
String inputTableName = args[0];
String outputTableName = args[1];
// Assume the default database
String dbName = null;
String jobName = "UseHCat";
String userChosenName = getConf().get(JobContext.JOB_NAME);
if (userChosenName != null)
jobName += ": " + userChosenName;
Job job = Job.getInstance(getConf());
job.setJobName(jobName);
// Job job = new Job(conf, "UseHCat");
// HCatInputFormat.setInput(job, InputJobInfo.create(dbName,inputTableName, null));
HCatInputFormat.setInput(job, dbName, inputTableName);
job.setJarByClass(UseHCat.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits a string as key and an integer as value
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Ignore the key for the reducer output; emitting an HCatalog record as value
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job.getConfiguration());
System.err.println("INFO: output schema explicitly set for writing:" + s);
HCatOutputFormat.setSchema(job, s);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
// System.setProperty("hadoop.home.dir", "C:"+File.separator+"hadoop-2.4.0");
int exitCode = ToolRunner.run(new UseHCat(), args);
System.exit(exitCode);
}
}
In Hadoop 1.x.x, JobContext is a class, whereas in Hadoop 2.x.x it is an interface, and the HCatalog-core APIs are not compatible with Hadoop 2.x.x.
The HCatBaseOutputFormat class needs the following code change to fix the issue:
//JobContext ctx = new JobContext(conf,jobContext.getJobID());
JobContext ctx = new Job(conf);
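To see which flavour of JobContext is actually on the classpath at runtime, a reflection check along these lines may help. It is a sketch using only the JDK; ClassKindCheck and kindOf are hypothetical names, and the Hadoop class name is merely the default argument.

```java
public class ClassKindCheck {
    /** Reports whether the named type is a class, an interface, or missing. */
    static String kindOf(String className) {
        try {
            // load without initializing, we only want the type's shape
            Class<?> c = Class.forName(
                    className, false, ClassKindCheck.class.getClassLoader());
            return c.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.hadoop.mapreduce.JobContext";
        System.out.println(name + " -> " + kindOf(name));
    }
}
```

If this prints "interface" while a library on the classpath was compiled against the class version, that mismatch is what IncompatibleClassChangeError reports.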
In the map phase I read an HDFS file and write the updates to HBase.
Versions: Hadoop 2.5.1, HBase 1.0.0
The exception is as follows:
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
Maybe there is something wrong with
job.setOutputFormatClass(TableOutputFormat.class);
For this line the IDE reports:
The method setOutputFormatClass(Class<? extends OutputFormat>) in the type Job is not applicable for the arguments (Class<TableOutputFormat>)
The code is as follows:
public class HdfsAppend2HbaseUtil extends Configured implements Tool{
public static class HdfsAdd2HbaseMapper extends Mapper<Text, Text, ImmutableBytesWritable, Put>{
public void map(Text ikey, Text ivalue, Context context)
throws IOException, InterruptedException {
String oldIdList = HBaseHelper.getValueByKey(ikey.toString());
StringBuffer sb = new StringBuffer(oldIdList);
String newIdList = ivalue.toString();
sb.append("\t" + newIdList);
Put p = new Put(ikey.toString().getBytes());
p.addColumn("idFam".getBytes(), "idsList".getBytes(), sb.toString().getBytes());
context.write(new ImmutableBytesWritable(), p);
}
}
public int run(String[] paths) throws Exception {
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "master,salve1");
conf.set("hbase.zookeeper.property.clientPort", "2181");
Job job = Job.getInstance(conf,"AppendToHbase");
job.setJarByClass(cn.edu.hadoop.util.HdfsAppend2HbaseUtil.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapperClass(HdfsAdd2HbaseMapper.class);
job.setNumReduceTasks(0);
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "reachableTable");
FileInputFormat.setInputPaths(job, new Path(paths[0]));
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.out.println("Append Start: ");
long time1 = System.currentTimeMillis();
long time2;
String[] pathsStr = {Const.TwoDegreeReachableOutputPathDetail};
int exitCode = ToolRunner.run(new HdfsAppend2HbaseUtil(), pathsStr);
time2 = System.currentTimeMillis();
System.out.println("Append Cost " + "\t" + (time2-time1)/1000 +" s");
System.exit(exitCode);
}
}
You didn't set an output directory for the job the way you set the input path.
Set it like this:
FileOutputFormat.setOutputPath(job, new Path(<output path>));
In the end I found out why. Just as I supposed, there was something wrong with:
job.setOutputFormatClass(TableOutputFormat.class);
For this line the IDE reports:
The method setOutputFormatClass(Class<? extends OutputFormat>) in the type Job is not applicable for the arguments (Class<TableOutputFormat>)
In fact, here we need to import
org.apache.hadoop.hbase.mapreduce.TableOutputFormat
and not
org.apache.hadoop.hbase.mapred.TableOutputFormat
The mapred one extends org.apache.hadoop.mapred.FileOutputFormat,
see:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableOutputFormat.html
and the mapreduce one extends
org.apache.hadoop.mapreduce.OutputFormat,
see:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
Thank you all very much!
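The compile error comes from the bounded generic parameter: two classes with the same simple name in different packages are unrelated types. The sketch below reproduces the situation with made-up stand-in types (none of them are Hadoop or HBase classes) so it can be compiled on its own.

```java
public class OutputFormatBoundDemo {
    // Hypothetical stand-ins for the "new" and "old" API base types.
    static abstract class NewOutputFormat {}
    interface OldOutputFormat {}

    // Two table output formats, one per API.
    static class NewTableOutputFormat extends NewOutputFormat {}
    static class OldTableOutputFormat implements OldOutputFormat {}

    /** Mirrors the shape of a setter that only accepts new-API formats. */
    static String setOutputFormatClass(Class<? extends NewOutputFormat> cls) {
        return cls.getSimpleName();
    }

    /** Runtime equivalent of the compile-time bound check. */
    static boolean accepted(Class<?> candidate) {
        return NewOutputFormat.class.isAssignableFrom(candidate);
    }

    public static void main(String[] args) {
        System.out.println(setOutputFormatClass(NewTableOutputFormat.class));
        // setOutputFormatClass(OldTableOutputFormat.class); // does not compile
        System.out.println(accepted(OldTableOutputFormat.class));
    }
}
```

When the IDE reports "not applicable for the arguments", checking the import line of the class being passed is usually the quickest test.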