I am running a program which uses Apache Spark to get data from an Apache Kafka cluster and put the data into a Hadoop file. My program is below:
public final class SparkKafkaConsumer {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        String[] topics = "Topic1, Topic2, Topic3".split(",");
        for (String topic : topics) {
            topicMap.put(topic, 3);
        }

        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jssc, "kafka.test.com:2181", "NameConsumer", topicMap);

        JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });

        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Lists.newArrayList(",".split(x));
            }
        });

        JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
                new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String s) {
                        return new Tuple2<String, Integer>(s, 1);
                    }
                }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i1, Integer i2) {
                        return i1 + i2;
                    }
                });

        wordCounts.print();
        wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "txt");

        jssc.start();
        jssc.awaitTermination();
    }
}
I am using this command to submit the application: C:\spark-1.6.2-bin-hadoop2.6\bin\spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 --class "SparkKafkaConsumer" --master local[4] target\simple-project-1.0.jar
I am getting this error:
java.lang.RuntimeException: class scala.runtime.Nothing$ not org.apache.hadoop.mapred.OutputFormat
    at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:2148)
What is causing this error and how do I solve it?
I agree that the error message is not very descriptive, but it is usually better to specify the format of the data you want to output in any of the saveAsHadoopFiles variants, to protect yourself from this type of exception.
Here is the signature of the method you are calling, as given in the documentation:
saveAsHadoopFiles(java.lang.String prefix, java.lang.String suffix, java.lang.Class<?> keyClass, java.lang.Class<?> valueClass, java.lang.Class<F> outputFormatClass)
In your example, that would correspond to:
wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "txt", Text.class, IntWritable.class, TextOutputFormat.class)
Based on the types in your wordCounts JavaPairDStream, I chose Text because the key is of type String, and IntWritable because the value associated with the key is of type Integer.
Use TextOutputFormat if you just want basic plain text files, but you can look into the subclasses of FileOutputFormat for more output options.
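For instance, if you wanted Hadoop SequenceFiles rather than plain text, the same call would become something like the sketch below (assuming an extra import of org.apache.hadoop.mapred.SequenceFileOutputFormat and keeping the same HDFS path):

wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "seq",
        Text.class, IntWritable.class, SequenceFileOutputFormat.class);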
Since this was also asked: the Text class comes from the org.apache.hadoop.io package, and TextOutputFormat comes from the org.apache.hadoop.mapred package.
Just for completeness (@Jonathan gave the right answer):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
...
wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "txt", Text.class, IntWritable.class, TextOutputFormat.class)
Related
I am new to Spark Streaming programming, so could someone please explain to me what the problem is?
I think I am iterating over a null structure, but I have a producer class which works normally.
My source code:
public class Main3 implements java.io.Serializable {
    public static JavaDStream<Double> pr;

    public void consumer() throws Exception {
        // Configure Spark to connect to Kafka running on the local machine
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");
        kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);

        Collection<String> topics = Arrays.asList("testing");

        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka10WordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        final JavaInputDStream<ConsumerRecord<String, String>> receiver =
                KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        JavaDStream<String> stream = receiver.map(new Function<ConsumerRecord<String, String>, String>() {
            @Override
            public String call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
                return kafkaRecord.value();
            }
        });

        stream.foreachRDD(x -> x.saveAsTextFile("/home/khouloud/Desktop/exemple/b")); // this does not do anything

        stream.foreachRDD(x -> {
            x.collect().stream().forEach(n -> System.out.println("item of list: " + n));
        }); // I don't see anything in the console here either

        stream.foreachRDD(rdd -> {
            if (rdd.isEmpty()) System.out.println("its empty");
        }); // nothing

        JavaPairDStream<Integer, List<Double>> points = stream.mapToPair(new PairFunction<String, Integer, List<Double>>() {
            @Override
            public Tuple2<Integer, List<Double>> call(String x) throws Exception {
                String[] item = x.split(" ");
                List<Double> l = new ArrayList<Double>();
                for (int i = 1; i < item.length; i++) {
                    l.add(new Double(item[i]));
                }
                return new Tuple2<>(new Integer(item[0]), l);
            }
        });
Error:
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
    at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
    at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:546)
    at org.apache.spark.streaming.api.java.JavaDStreamLike$class.mapToPair(JavaDStreamLike.scala:163)
    at org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapToPair(JavaDStreamLike.scala:42)
    at Min.calculDegSim(Min.java:43)
    at SkyRule.execute(SkyRule.java:34)
    at Main3.consumer(Main3.java:159)
    at Executer$2.run(Executer.java:27)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.NotSerializableException: Graph is unexpectedly null when DStream is being serialized. Serialization stack:
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)
I want to do something very simple: check the content of each partition in the first RDD of my DStream.
This is what I'm doing now:
SparkConf sparkConfiguration = new SparkConf().setAppName("DataAnalysis").setMaster("local[*]");
JavaStreamingContext sparkStrContext = new JavaStreamingContext(sparkConfiguration, Durations.seconds(1));

JavaReceiverInputDStream<String> receiveParkingData = sparkStrContext.socketTextStream("localhost", 5554);

Time time = new Time(1000);
JavaRDD<String> dataRDD = receiveParkingData.compute(time);

// I get an error in this RDD
JavaRDD<String> indexDataRDD = dataRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
    @Override
    public Iterator<String> call(Integer integer, Iterator<String> stringIterator) throws Exception {
        return null;
    }
});
indexDataRDD.collect();
So I want to print the content of each partition and its ID. However, on the indexDataRDD I get this message in my IntelliJ IDE: mapPartitionsWithIndex (Function2<Integer, Iterator<String>, Iterator<String>>, boolean) in AbstractJavaRDDLike cannot be applied to (Function2<Integer, Iterator<String>, Iterator<String>>)
Can someone help me with this issue? Is there another, easier way to get the content in each partition? I really want to know the specific content of each partition.
Thank you so much.
Here is a sample program using mapPartitionsWithIndex for your reference.
public class SparkDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> data = Arrays.asList("one", "two", "three", "four", "five");
        JavaRDD<String> javaRDD = sc.parallelize(data, 2);

        // Note the second argument (preservesPartitioning): the Java API requires it,
        // which is why the call in the question does not compile.
        JavaRDD<String> mapPartitionsWithIndexRDD = javaRDD
                .mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
                    @Override
                    public Iterator<String> call(Integer index, Iterator<String> iterator) throws Exception {
                        // Prefix every element with the index of the partition it lives in
                        LinkedList<String> linkedList = new LinkedList<String>();
                        while (iterator.hasNext()) {
                            linkedList.add(Integer.toString(index) + "-" + iterator.next());
                        }
                        return linkedList.iterator();
                    }
                }, false);

        System.out.println("mapPartitionsWithIndexRDD " + mapPartitionsWithIndexRDD.collect());

        sc.stop();
        sc.close();
    }
}
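For reference, with these five elements split over two partitions, the collect() prints something like the line below (the exact assignment of elements to partitions may vary):

mapPartitionsWithIndexRDD [0-one, 0-two, 1-three, 1-four, 1-five]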
I have a use case where I need to read messages from Kafka and, for each message, extract data and query an Elasticsearch index. The response will then be used for further processing.
I am getting the error below when invoking JavaEsSpark.esJsonRDD:
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below:
public static void main(String[] args) {
    if (args.length < 4) {
        System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
        System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]")
            .set("spark.driver.allowMultipleContexts", "true");
    // Settings used by JavaEsSpark.esJsonRDD
    sparkConf.set("es.nodes", <NODE URL>);
    sparkConf.set("es.nodes.wan.only", "true");

    context = new JavaSparkContext(sparkConf);

    // Create the context with a 2-second batch size
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

    int numThreads = Integer.parseInt(args[3]);
    Map<String, Integer> topicMap = new HashMap<>();
    String[] topics = args[2].split(",");
    for (String topic : topics) {
        topicMap.put(topic, numThreads);
    }

    // Receive messages from Kafka
    JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

    JavaDStream<String> jsons = messages
            .map(new Function<Tuple2<String, String>, String>() {
                private static final long serialVersionUID = 1L;

                @Override
                public String call(Tuple2<String, String> tuple2) {
                    JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>, <search string>).values();
                    return null;
                }
            });

    jsons.print();
    jssc.start();
    jssc.awaitTermination();
}
I am getting the error when invoking JavaEsSpark.esJsonRDD. Is this the correct way to do it? How do I successfully invoke Elasticsearch from Spark?
I am running Kafka and Spark on Windows and invoking an external Elasticsearch index.
I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.
When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.
Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the number of words.
What am I missing?
Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:
StarterAppTest.java
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.junit.*;
import java.io.*;
public class StarterAppTest {
    JavaStreamingContext ssc;
    File tempDir;

    @Before
    public void setUp() {
        ssc = new JavaStreamingContext("local", "test", new Duration(3000));
        tempDir = Files.createTempDir();
        tempDir.deleteOnExit();
    }

    @After
    public void tearDown() {
        ssc.stop();
        ssc = null;
    }

    @Test
    public void testInitialization() {
        Assert.assertNotNull(ssc.sc());
    }

    @Test
    public void testCountWords() {
        StarterApp starterApp = new StarterApp();

        try {
            JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
            JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);

            ssc.start();

            File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
            PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
            writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
            writer.close();

            System.err.println("===== Word Counts =======");
            wordCounts.print();
            System.err.println("===== Word Counts =======");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        Assert.assertTrue(true);
    }
}
This test compiles and starts to run. Spark Streaming prints a lot of diagnostic messages on the console, but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself it does.
I've also tried adding ssc.awaitTermination(); after ssc.start(), but nothing changed in that respect. After that I also tried to create a new file manually in the directory this Spark Streaming application was watching, but this time it gave an error.
For completeness, below is the countWords method:
public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
    });

    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
            }).reduceByKey((i1, i2) -> i1 + i2);

    return wordCounts;
}
A few pointers:
Give at least 2 cores to the Spark Streaming context: 1 for the receiver and 1 for the Spark processing, i.e. "local" -> "local[2]".
Your streaming interval is 3000 ms, so somewhere in your program you need to wait at least that long before expecting any output.
Spark Streaming needs some time to set up its listeners. The file is being created immediately after ssc.start is issued, and there is no guarantee that the filesystem listener is already in place, so I'd sleep for a bit after ssc.start.
In streaming, it's all about the right timing; see the sketch below.
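For example, a minimal reworking of the test along these lines could look like the sketch below (the sleep durations are arbitrary choices, the sleeps need InterruptedException handled or declared, and I also moved wordCounts.print() before ssc.start(), since output operations must be registered before the context is started):

// setUp(): two local cores, one for the receiver/listener and one for the processing
ssc = new JavaStreamingContext("local[2]", "test", new Duration(3000));

// testCountWords(): register the output operation, start, then give Spark time
JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
wordCounts.print();

ssc.start();
Thread.sleep(2000);                  // give the filesystem listener time to register

File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
writer.close();

Thread.sleep(2 * 3000);              // wait at least one 3000 ms batch interval for the output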
I am trying to save the Spark Streaming output to a file on HDFS. Right now, it is not saving any files.
Here is my code:
StreamingExamples.setStreamingLogLevels();

SparkConf sparkConf = new SparkConf().setAppName("MyTestCOunt");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000));

JavaReceiverInputDStream<String> lines = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
        StorageLevels.MEMORY_AND_DISK_SER);

JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String x) {
        return Lists.newArrayList(SPACE.split(x));
    }
});

JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

wordCounts.print();
wordCounts.saveAsHadoopFiles("hdfs://mynamenode:8020/user/spark/mystream/", "abc");

ssc.start();
ssc.awaitTermination();
wordCounts.print() works, but wordCounts.saveAsHadoopFiles does not. Any ideas why?
I am running the commands below:
1) nc -lk 9999
2) ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999
Thanks in advance!
I fixed the same problem by specifying the master as local[x] with x > 1. If you run the master as plain local, Spark cannot assign a slot to execute the task.
For example:
SparkConf conf = new SparkConf().setAppName("conveyor").setMaster("local[4]");
Try:
wordCounts.dstream().saveAsTextFiles("hdfs://mynamenode:8020/user/spark/mystream/", "abc");
instead of:
wordCounts.saveAsHadoopFiles("hdfs://mynamenode:8020/user/spark/mystream/", "abc");
saveAsTextFiles writes each element's string representation directly, so it does not need the key, value, and OutputFormat classes that this saveAsHadoopFiles overload expects.
JavaDStream<String> lines;
Initialize lines with your data, then write each micro-batch out to a per-date directory:

lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    public void call(JavaRDD<String> rdd) throws Exception {
        // Build a dd-MM-yyyy folder name from today's date and save this batch under it
        Date today = new Date();
        String date = (new SimpleDateFormat("dd-MM-yyyy").format(today));
        rdd.saveAsTextFile(OUTPUT_LOCATION + "/" + date + "/");
    }
});
I fixed this by changing the Sandbox/server timezone to my local timezone, since my Twitter account uses GMT while my Sandbox was on UTC. I used the following commands to change my Sandbox timezone:
ntpdate pool.ntp.org
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start
date
I did not restart my Hadoop services after the timezone change.