Hadoop 2.4.0 + HCatalog + Mapreduce - java

In Hadoop 2.4.0, I get the following error while executing the code sample below. I think there is a Hadoop version mismatch. Could you review the code, and how can I fix it?
I am trying to write a MapReduce job that copies an HCatalog table.
Thank you.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.getJobInfo(HCatBaseOutputFormat.java:94)
at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.getOutputFormat(HCatBaseOutputFormat.java:82)
at org.apache.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:72)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:458)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.deneme.hadoop.UseHCat.run(UseHCat.java:102)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.deneme.hadoop.UseHCat.main(UseHCat.java:107)
Code Sample
public class UseHCat extends Configured implements Tool{
public static class Map extends Mapper<WritableComparable, HCatRecord,Text,IntWritable> {
String groupname;
@Override
protected void map( WritableComparable key,
HCatRecord value,
org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// The group table from /etc/group has name, 'x', id
groupname = (String) value.get(0);
int id = (Integer) value.get(2);
// Just select and emit the name and ID
context.write(new Text(groupname), new IntWritable(id));
}
}
public static class Reduce extends Reducer<Text, IntWritable,
WritableComparable, HCatRecord> {
protected void reduce( Text key,
java.lang.Iterable<IntWritable> values,
org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
WritableComparable, HCatRecord>.Context context)
throws IOException, InterruptedException {
// Only expecting one ID per group name
Iterator<IntWritable> iter = values.iterator();
IntWritable iw = iter.next();
int id = iw.get();
// Emit the group name and ID as a record
HCatRecord record = new DefaultHCatRecord(2);
record.set(0, key.toString());
record.set(1, id);
context.write(null, record);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf(); //hdfs://sandbox.hortonworks.com:8020
//conf.set("fs.defaultFS", "hdfs://192.168.1.198:8020");
//conf.set("mapreduce.job.tracker", "192.168.1.115:50001");
//Configuration conf = new Configuration();
//conf.set("fs.defaultFS", "hdfs://192.168.1.198:8020/data");
args = new GenericOptionsParser(conf, args).getRemainingArgs();
// Get the input and output table names as arguments
String inputTableName = args[0];
String outputTableName = args[1];
// Assume the default database
String dbName = null;
String jobName = "UseHCat";
String userChosenName = getConf().get(JobContext.JOB_NAME);
if (userChosenName != null)
jobName += ": " + userChosenName;
Job job = Job.getInstance(getConf());
job.setJobName(jobName);
// Job job = new Job(conf, "UseHCat");
// HCatInputFormat.setInput(job, InputJobInfo.create(dbName,inputTableName, null));
HCatInputFormat.setInput(job, dbName, inputTableName);
job.setJarByClass(UseHCat.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits a string as key and an integer as value
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Ignore the key for the reducer output; emitting an HCatalog record as value
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job.getConfiguration());
System.err.println("INFO: output schema explicitly set for writing:" + s);
HCatOutputFormat.setSchema(job, s);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
// System.setProperty("hadoop.home.dir", "C:"+File.separator+"hadoop-2.4.0");
int exitCode = ToolRunner.run(new UseHCat(), args);
System.exit(exitCode);
}
}

In Hadoop 1.x, JobContext is a class, whereas in Hadoop 2.x it is an interface, and the HCatalog-core (org.apache.hcatalog) APIs were built against Hadoop 1.x, so they are not compatible with Hadoop 2.x.
The HCatBaseOutputFormat class needs the following code change to fix the issue:
//JobContext ctx = new JobContext(conf,jobContext.getJobID());
JobContext ctx = new Job(conf);
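For reference, here is a minimal sketch of how a JobContext can be obtained against the Hadoop 2.x API, which is what a patch like the one above relies on. JobContextCompat is a hypothetical helper for illustration only, not part of HCatalog; JobContextImpl and Job.getInstance are the actual Hadoop 2.x entry points.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.task.JobContextImpl;

public class JobContextCompat {
    // In Hadoop 2.x, JobContext is an interface, so it cannot be instantiated
    // directly; use the concrete implementation class instead.
    public static JobContext fromConfAndId(Configuration conf, JobID jobId) {
        return new JobContextImpl(conf, jobId);
    }

    // Job implements JobContext in Hadoop 2.x, which is why the
    // "new Job(conf)" replacement above compiles.
    public static JobContext fromConf(Configuration conf) throws IOException {
        return Job.getInstance(conf);
    }
}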

Related

How to solve org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

I am trying to analyze retail store data, and I want to compute the breakdown of sales by city. Here is my data:
Date Time City Product-Cat Sale-Value Payment-Mode
2012-01-01 09:20 Fort Worth Women's Clothing 153.57 Visa
2012-01-01 09:00 San Jose Mens Clothing 214.05 Rupee
2012-01-01 09:00 San Diego Music 76.43 Amex
2012-01-01 09:00 New York Cameras 45.76 Visa
Now I want to calculate the sales breakdown by product category across all the stores.
Here are the Mapper, the Reducer, and the main class:
public class RetailDataAnalysis {
public static class RetailDataAnalysisMapper extends Mapper<Text,Text,Text,Text>{
// when trying with LongWritable Key
public void map(LongWritable key,Text Value,Context context) throws IOException, InterruptedException{
String analyser [] = Value.toString().split(",");
Text productCategory = new Text(analyser[3]);
Text salesPrice = new Text(analyser[4]);
context.write(productCategory, salesPrice);
}
// When trying with Text key
public void map(Text key,Text Value,Context context) throws IOException, InterruptedException{
String analyser [] = Value.toString().split(",");
Text productCategory = new Text(analyser[3]);
Text salesPrice = new Text(analyser[4]);
context.write(productCategory, salesPrice);
}
}
public static class RetailDataAnalysisReducer extends Reducer<Text,Text,Text,Text>{
protected void reduce(Text key,Iterable<Text> values,Context context)throws IOException, InterruptedException{
String csv ="";
for(Text value:values){
if(csv.length()>0){
csv+= ",";
}
csv+=value.toString();
}
context.write(key, new Text(csv));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String [] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
if(otherArgs.length<2){
System.out.println("Usage Retail Data ");
System.exit(2);
}
Job job= new Job(conf,"Retail Data Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setCombinerClass(RetailDataAnalysisReducer.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
for(int i=0;i<otherArgs.length-1;++i){
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length-1]));
System.exit(job.waitForCompletion(true)?0:1);
}
}
And this is the exception I am getting when using a LongWritable key:
18/04/11 09:15:40 INFO mapreduce.Job: Task Id : attempt_1523355254827_0008_m_000000_2, Status : FAILED
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069)
The exception I get when trying to use a Text key:
Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:712)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
Please help me solve this; I am very new to Hadoop.
You may need a different input format class. The default is TextInputFormat, which splits the file line by line and gives the byte offset of each line as a LongWritable key and the line itself as a Text value.
You can specify the input format class this way:
job.setInputFormatClass(TextInputFormat.class);
In your case, if you do not need the key, just the values, you can keep LongWritable as the key type:
public static class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text Value, Context context) throws IOException, InterruptedException {
//...
}
}
Edit:
Here is the whole code after modifying it to use LongWritable as the key:
public class RetailDataAnalysis {
public static class RetailDataAnalysisMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text Value, Context context) throws IOException, InterruptedException {
String analyser[] = Value.toString().split(",");
Text productCategory = new Text(analyser[3]);
Text salesPrice = new Text(analyser[4]);
context.write(productCategory, salesPrice);
}
}
public static class RetailDataAnalysisReducer extends Reducer<Text, Text, Text, Text> {
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String csv = "";
for (Text value : values) {
if (csv.length() > 0) {
csv += ",";
}
csv += value.toString();
}
context.write(key, new Text(csv));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.out.println("Usage Retail Data ");
System.exit(2);
}
Job job = new Job(conf, "Retail Data Analysis");
job.setJarByClass(RetailDataAnalysis.class);
job.setMapperClass(RetailDataAnalysisMapper.class);
job.setCombinerClass(RetailDataAnalysisReducer.class);
job.setReducerClass(RetailDataAnalysisReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Also, since you are splitting the data on ',', your data should be a CSV, like this:
2012-01-01 09:20,Fort Worth,Women's Clothing,153.57,Visa
2012-01-01 09:00,San Jose,Mens Clothing,214.05,Rupee
2012-01-01 09:00,San Diego,Music,76.43,Amex
2012-01-01 09:00,New York,Cameras,45.76,Visa
Not space-separated as you showed it in your question.
When you read a file using MapReduce, the default file input format reads an entire line and sends it to the mapper as a (byte offset, line) pair, so the input to the mapper becomes:
public static class RetailDataAnalysisMapper extends Mapper<LongWritable,Text,Text,Text>
In case you need to read the input as (Text, Text), i.e. with a mapper like
public static class RetailDataAnalysisMapper extends Mapper<Text,Text,Text,Text>
you would need to change the file input format and use a custom file input format along with a custom record reader.
Then you need to add the following line in the driver code (YourCustomInputFormat is a placeholder for your own class):
job.setInputFormatClass(YourCustomInputFormat.class);
Hadoop understands everything in the form of (key, value) pairs, so when you read a file, the offset becomes the LongWritable key and the line read becomes the Text value.
So you need to use the default mapper signature Mapper<LongWritable, Text, <anything>, <anything>>.
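As an aside, if what you really want is a (Text, Text) mapper, the built-in KeyValueTextInputFormat may be enough instead of writing a custom input format. A minimal sketch of the driver settings, assuming each line should be split on its first comma:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputSetup {
    // Everything before the first separator in a line becomes the Text key,
    // the rest of the line becomes the Text value.
    public static void configure(Job job, String separator) {
        job.getConfiguration().set(
            "mapreduce.input.keyvaluelinerecordreader.key.value.separator", separator);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}
With this in place the mapper signature would be Mapper<Text, Text, Text, Text>.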

ClassCastException:java.lang.Exception: java.lang.ClassCastException in mapred

I am writing a MapReduce application which takes input in (key, value) format and just emits the same data as output from the reducer.
This is the sample input:
1500s 1
1960s 1
Aldus 1
In the code below, I am specifying the input format as KeyValueTextInputFormat and setting the delimiter to tab in main(). When I run the code, I run into this error message:
java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.LongWritable
at cscie63.examples.WordDesc$KVMapper.map(WordDesc.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
I tried different things to debug this, but nothing helped.
public class WordDesc {
public static class KVMapper
extends Mapper<Text, LongWritable, Text, LongWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Text key, LongWritable value , Context context
) throws IOException, InterruptedException {
context.write(key,value);
}
}
public static class KVReducer
extends Reducer<Text,LongWritable,Text,LongWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, LongWritable value,
Context context
) throws IOException, InterruptedException {
context.write(key, value);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.err.println("Usage: wordcount <in> [<in>...] <out>");
System.exit(2);
}
Job job = new Job(conf, "word desc");
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setJarByClass(WordDesc.class);
job.setMapperClass(KVMapper.class);
job.setCombinerClass(KVReducer.class);
job.setReducerClass(KVReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job,
new Path(otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I guess that the line job.setInputFormatClass(KeyValueTextInputFormat.class); tells your program to treat your input as key-value pairs of Text. Therefore, when you require your input value to be a LongWritable, you get this exception.
A quick fix would be to read your input as Text and then, if you want to use a LongWritable, parse it using:
public static class KVMapper
extends Mapper<Text, Text, Text, LongWritable>{
private final static LongWritable val = new LongWritable();
public void map(Text key, Text value, Context context) {
val.set(Long.parseLong(value.toString()));
context.write(key,val);
}
}
What it does is the following: value is a Text; value.toString() gives the String representation of that Text; Long.parseLong() parses the String as a long; and finally val.set() wraps it in a LongWritable.
By the way, I don't think that you need a Reducer for that... You could make it faster by setting the number of reduce tasks to 0.
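For reference, turning it into a map-only job is a one-line change in the driver (a sketch against the Job object from the code above; the combiner and reducer settings would then be removed as well):
// Map-only job: mapper output is written directly to the output files,
// skipping the shuffle and reduce phases entirely.
job.setNumReduceTasks(0);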

Spark: Two SparkContexts in a single Application Best Practice

I think I have an interesting question for all of you today. In the code below you will notice I have two SparkContexts: one for Spark Streaming and one normal SparkContext. According to best practices you should only have one SparkContext in a Spark application, even though it's possible to circumvent this via allowMultipleContexts in the configuration.
Problem is, I need to retrieve data from Hive and from a Kafka topic to do some logic, and whenever I submit my application it obviously fails with "Cannot have 2 Spark Contexts Running on JVM".
My question is: is there a better way to do this than what I am doing right now?
public class MainApp {
private final String logFile= Properties.getString("SparkLogFileDir");
private static final String KAFKA_GROUPID = Properties.getString("KafkaGroupId");
private static final String ZOOKEEPER_URL = Properties.getString("ZookeeperURL");
private static final String KAFKA_BROKER = Properties.getString("KafkaBroker");
private static final String KAFKA_TOPIC = Properties.getString("KafkaTopic");
private static final String Database = Properties.getString("HiveDatabase");
private static final Integer KAFKA_PARA = Properties.getInt("KafkaParrallel");
public static void main(String[] args){
//set settings
String sql="";
//START APP
System.out.println("Starting NPI_TWITTERAPP...." + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Configuring Settings...."+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
SparkConf conf = new SparkConf()
.setAppName(Properties.getString("SparkAppName"))
.setMaster(Properties.getString("SparkMasterUrl"));
//Set Spark/hive/sql Context
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
JavaHiveContext HiveSqlContext = new JavaHiveContext(sc);
//Check if Twitter Hive Table Exists
try {
HiveSqlContext.sql("DROP TABLE IF EXISTS "+Database+"TWITTERSTORE");
HiveSqlContext.sql("CREATE TABLE IF NOT EXISTS "+Database+".TWITTERSTORE "
+" (created_at String, id String, id_str String, text String, source String, truncated String, in_reply_to_user_id String, processed_at String, lon String, lat String)"
+" STORED AS TEXTFILE");
}catch(Exception e){
System.out.println(e);
}
//Check if Ivapp Table Exists
sql ="CREATE TABLE IF NOT EXISTS "+Database+".IVAPPGEO AS SELECT DISTINCT a.LATITUDE, a.LONGITUDE, b.ODNCIRCUIT_OLT_CLLI, b.ODNCIRCUIT_OLT_TID, a.CITY, a.STATE, a.ZIP FROM "
+Database+".T_PONNMS_SERVICE B, "
+Database+".CLLI_LATLON_MSTR A WHERE a.BID_CLLI = substr(b.ODNCIRCUIT_OLT_CLLI,0,8)";
try {
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
HiveSqlContext.sql(sql);
sql = "SELECT LATITUDE, LONGITUDE, ODNCIRCUIT_OLT_CLLI, ODNCIRCUIT_OLT_TID, CITY, STATE, ZIP FROM "+Database+".IVAPPGEO";
JavaSchemaRDD RDD_IVAPPGEO = HiveSqlContext.sql(sql).cache();
}catch(Exception e){
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
}
//JavaHiveContext hc = new JavaHiveContext();
System.out.println("Retrieve Data from Kafka Topic: "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put(KAFKA_TOPIC,KAFKA_PARA);
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
jssc, KAFKA_GROUPID, ZOOKEEPER_URL, topicMap);
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
private static final long serialVersionUID = 42l;
@Override
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
System.out.println("Completed Kafka Messages... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaPairDStream<Long, String> tweets = json.mapToPair(
new TwitterFilterFunction());
JavaPairDStream<Long, String> filtered = tweets.filter(
new Function<Tuple2<Long, String>, Boolean>() {
private static final long serialVersionUID = 42l;
@Override
public Boolean call(Tuple2<Long, String> tweet) {
return tweet != null;
}
}
);
JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
new TextFilterFunction());
tweetsFiltered = tweetsFiltered.map(
new StemmingFunction());
System.out.println("Finished Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
//calculate postive tweets
JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
tweetsFiltered.mapToPair(new PositiveScoreFunction());
//calculate negative tweets
JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
tweetsFiltered.mapToPair(new NegativeScoreFunction());
JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
positiveTweets.join(negativeTweets);
//Score tweets
JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
joined.map(new Function<Tuple2<Tuple2<Long, String>,
Tuple2<Float, Float>>,
Tuple4<Long, String, Float, Float>>() {
private static final long serialVersionUID = 42l;
@Override
public Tuple4<Long, String, Float, Float> call(
Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
{
return new Tuple4<Long, String, Float, Float>(
tweet._1()._1(),
tweet._1()._2(),
tweet._2()._1(),
tweet._2()._2());
}
});
System.out.println("Finished Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Outputting Tweets Data to flat file "+Properties.getString("HdfsOutput")+" ... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
scoredTweets.map(new ScoreTweetsFunction());
result.foreachRDD(new FileWriter());
System.out.println("Outputting Sentiment Data to Hive... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
jssc.start();
jssc.awaitTermination();
}
}
Creating SparkContext
You can create a SparkContext instance with or without creating a SparkConf object first.
Getting Existing or Creating New SparkContext (getOrCreate methods)
getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext
SparkContext.getOrCreate methods allow you to get the existing SparkContext or create a new one.
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)
Refer Here - https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html
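For a Java application like the one in the question, roughly the same pattern is available; a sketch, assuming a Spark version where SparkContext.getOrCreate exists (1.4+):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

public class GetOrCreateExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkMe App");
        // Returns the SparkContext already running in this JVM if there is one,
        // otherwise creates a new one from the given conf.
        SparkContext sc = SparkContext.getOrCreate(conf);
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        // ... use jsc ...
        sc.stop();
    }
}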
Apparently, if I use sc.close() to close the original SparkContext before creating the JavaStreamingContext, it works perfectly with no errors or issues.
You can use a singleton ContextManager object which handles which context to provide:
public class ContextManager {
private static JavaSparkContext context;
private static String currentType;
private ContextManager() {}
public static JavaSparkContext getContext(String type) {
if (type.equals(currentType) && context != null) {
return context;
} else if (type.equals("streaming")) {
// clean up the current context
// initialize the context for streaming
currentType = type;
} else {
// clean up the current context
// initialize the context for normal (batch) use
currentType = type;
}
return context;
}
}
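For illustration, using the singleton above might look like this (a sketch; the cleanup/initialization placeholders in the manager would need real implementations first):
// Callers always go through the manager, so only one underlying
// JavaSparkContext exists in the JVM at a time.
JavaSparkContext batchSc = ContextManager.getContext("normal");
// ... Hive / batch work ...
JavaSparkContext streamingSc = ContextManager.getContext("streaming");
JavaStreamingContext jssc = new JavaStreamingContext(streamingSc, new Duration(5000));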
There are some issues with this approach: in projects where you switch contexts quite rapidly, the overhead of tearing down and recreating them can be large.
You can access the SparkContext from your JavaStreamingContext, and use that reference when creating additional contexts.
SparkConf sparkConfig = new SparkConf().setAppName("foo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConfig, Durations.seconds(30));
SQLContext sqlContext = new SQLContext(jssc.sparkContext());
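Applied to the code in the question, which uses the older JavaHiveContext API, the same idea would look roughly like this (a sketch; constructors assumed from the question's Spark version):
SparkConf conf = new SparkConf()
    .setAppName(Properties.getString("SparkAppName"))
    .setMaster(Properties.getString("SparkMasterUrl"));
// Create the streaming context first, then reuse its underlying
// JavaSparkContext for the Hive context instead of creating a second one.
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
JavaHiveContext hiveSqlContext = new JavaHiveContext(jssc.sparkContext());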

Job failed Exception hadoop

I am using MultipleTextOutputFormat to create multiple files from a single file, i.e. each line in a new file.
This is my code:
public class MOFExample extends Configured implements Tool {
private static double count = 0;
static class KeyBasedMultipleTextOutputFormat extends
MultipleTextOutputFormat<Text, Text> {
@Override
protected String generateFileNameForKeyValue(Text key, Text value,
String name) {
return count++ + "_";// + name;
}
}
/**
* The main job driver.
*/
public int run(final String[] args) throws Exception {
Path csvInputs = new Path(args[0]);
Path outputDir = new Path(args[1]);
JobConf jobConf = new JobConf(super.getConf());
jobConf.setJarByClass(MOFExample.class);
jobConf.setMapperClass(IdentityMapper.class);
jobConf.setInputFormat(KeyValueTextInputFormat.class);
jobConf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputKeyClass(Text.class);
FileInputFormat.setInputPaths(jobConf, csvInputs);
FileOutputFormat.setOutputPath(jobConf, outputDir);
//jobConf.setNumMapTasks(4);
jobConf.setNumReduceTasks(4);
return JobClient.runJob(jobConf).isSuccessful() ? 0 : 1;
}
public static void main(final String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MOFExample(), args);
System.exit(res);
}
}
This code runs fine on a small text file, but when the number of lines in the input file is greater than 1900 (which is still not a large file), it throws an exception:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at MOFExample.run(MOFExample.java:57)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at MOFExample.main(MOFExample.java:61)
I also tried this tutorial, but with a large input file it produces an empty output directory without any exception; it also worked fine with a small input file.
Note: I am using a single-node cluster.

HDFS plain text write to HBase, Output directory not set

In the map phase I read an HDFS file and update HBase.
Versions: Hadoop 2.5.1, HBase 1.0.0.
The exception is as follows:
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
Maybe there is something wrong with:
job.setOutputFormatClass(TableOutputFormat.class);
This line shows the compile error:
The method setOutputFormatClass(Class<? extends OutputFormat>) in the type Job is not applicable for the arguments (Class<TableOutputFormat>)
The code is as follows:
public class HdfsAppend2HbaseUtil extends Configured implements Tool{
public static class HdfsAdd2HbaseMapper extends Mapper<Text, Text, ImmutableBytesWritable, Put>{
public void map(Text ikey, Text ivalue, Context context)
throws IOException, InterruptedException {
String oldIdList = HBaseHelper.getValueByKey(ikey.toString());
StringBuffer sb = new StringBuffer(oldIdList);
String newIdList = ivalue.toString();
sb.append("\t" + newIdList);
Put p = new Put(ikey.toString().getBytes());
p.addColumn("idFam".getBytes(), "idsList".getBytes(), sb.toString().getBytes());
context.write(new ImmutableBytesWritable(), p);
}
}
public int run(String[] paths) throws Exception {
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "master,salve1");
conf.set("hbase.zookeeper.property.clientPort", "2181");
Job job = Job.getInstance(conf,"AppendToHbase");
job.setJarByClass(cn.edu.hadoop.util.HdfsAppend2HbaseUtil.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
job.setMapperClass(HdfsAdd2HbaseMapper.class);
job.setNumReduceTasks(0);
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "reachableTable");
FileInputFormat.setInputPaths(job, new Path(paths[0]));
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.out.println("Append Start: ");
long time1 = System.currentTimeMillis();
long time2;
String[] pathsStr = {Const.TwoDegreeReachableOutputPathDetail};
int exitCode = ToolRunner.run(new HdfsAppend2HbaseUtil(), pathsStr);
time2 = System.currentTimeMillis();
System.out.println("Append Cost " + "\t" + (time2-time1)/1000 +" s");
System.exit(exitCode);
}
}
You didn't set the output directory for the job the way you set the input path. Set it like this:
FileOutputFormat.setOutputPath(job, new Path(<output path>));
At last, I know why. Just as I suspected, there is something wrong with:
job.setOutputFormatClass(TableOutputFormat.class);
The line shows this error:
The method setOutputFormatClass(Class<? extends OutputFormat>) in the type Job is not applicable for the arguments (Class<TableOutputFormat>)
In fact, we need to import
org.apache.hadoop.hbase.mapreduce.TableOutputFormat
and not
org.apache.hadoop.hbase.mapred.TableOutputFormat
The former extends org.apache.hadoop.mapreduce.OutputFormat, which is what Job.setOutputFormatClass expects:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
while the latter extends org.apache.hadoop.mapred.FileOutputFormat from the old API:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableOutputFormat.html
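For clarity, once the correct import is in place the relevant driver lines from the question compile as written (a sketch restating them; only the import changes):
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat; // not org.apache.hadoop.hbase.mapred.TableOutputFormat

job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "reachableTable");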
Thank you all very much!
