Spark streaming output not saved to HDFS file - java

I am trying to save the Spark streaming output to a file on HDFS. Right now, it is not saving any file.
Here is my code :
StreamingExamples.setStreamingLogLevels();
SparkConf sparkConf = new SparkConf().setAppName("MyTestCOunt");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000));
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
#Override
public Iterable<String> call(String x) {
return Lists.newArrayList(SPACE.split(x));
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
#Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
#Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
wordCounts.print();
wordCounts.saveAsHadoopFiles("hdfs://mynamenode:8020/user/spark/mystream/","abc");
ssc.start();
ssc.awaitTermination();
wordCounts.print() works, but not wordCounts.saveAsHadoopFiles, any ideas why ?
I am running below commands :
1) nc -lk 9999
2) ./bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999
Thanks in advance..!!!

I fixed the same problem by specifying master as local[x] x > 1. If you run master as local, Spark could not assign slot to execute task.
Like
SparkConf conf = new SparkConf().setAppName("conveyor").setMaster("local[4]");

Try:
wordCounts.dstream().saveAsTextFiles("hdfs://mynamenode:8020/user/spark/mystream/", "abc");
instead:
wordCounts.saveAsHadoopFiles("hdfs://mynamenode:8020/user/spark/mystream/","abc");

JavaDStream<String> lines;
Initialize lines with our data.
`
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
public void call(JavaRDD<String > rdd) throws Exception {
Date today = new Date();
String date = (new SimpleDateFormat("dd-MM-yyyy").format(today));
rdd.saveAsTextFile(OUTPUT_LOCATION+"/"+date+"/");
}});
`

I fixed this by changing the Sandbox / Server timezone to my local timezone, as my Twitter account has GMT and my Sandbox has UTC. I have used the following commands to change my Sandbox timezone:
ntpdate pool.ntp.org
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start
date
I haven't restarted my Hadoop services after the timezone change.

Related

Use mapPartitionsWithIndex for DStream - Spark Streaming

I want to do something very simple: to check what is the content of each partition in the first RDD of my DStream.
This is what I'm doing now:
SparkConf sparkConfiguration= new SparkConf().setAppName("DataAnalysis").setMaster("local[*]");
JavaStreamingContext sparkStrContext=new JavaStreamingContext(sparkConfiguration, Durations.seconds(1));
JavaReceiverInputDStream<String> receiveParkingData=sparkStrContext.socketTextStream("localhost",5554);
Time time=new Time(1000);
JavaRDD<String>dataRDD= receiveParkingData.compute(time);
//I get an error in this RDD
JavaRDD<String>indexDataRDD=dataRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
#Override
public Iterator<String> call(Integer integer, Iterator<String> stringIterator) throws Exception {
return null;
}
});
indexDataRDD.collect();
So I want to print the content of each partition and its ID. However, on the indexDataRDD I get this message in my IntelliJ IDE: mapPartitionsWithIndex (Function2<Integer, Iterator<String>, Iterator<String>>, boolean) in AbstractJavaRDDLike cannot be applied to (Function2<Integer, Iterator<String>, Iterator<String>>)
Can someone help me with this issue? Is there another, easier way to get the content in each partition? I really want to know the specific content of each partition.
Thank you so much.
Here is sample program for mapPartitionsWithIndex for your reference.
public class SparkDemo {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("SparkDemo").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> data = Arrays.asList("one","two","three","four","five");
JavaRDD<String> javaRDD = sc.parallelize(data, 2);
JavaRDD<String> mapPartitionsWithIndexRDD = javaRDD
.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
#Override
public Iterator<String> call(Integer index, Iterator<String> iterator) throws Exception {
LinkedList<String> linkedList = new LinkedList<String>();
while (iterator.hasNext()){
linkedList.add(Integer.toString(index) + "-" + iterator.next());
}
return linkedList.iterator();
}
}, false);
System.out.println("mapPartitionsWithIndexRDD " + mapPartitionsWithIndexRDD.collect());
sc.stop();
sc.close();
}
}

Getting Hadoop OutputFormat RunTimeException while running Apache Spark Kafka Stream

I am running a program which uses Apache Spark to get get data from Apache Kafka cluster and puts the data in a Hadoop file. My program is below:
public final class SparkKafkaConsumer {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
Map<String, Integer> topicMap = new HashMap<String, Integer>();
String[] topics = "Topic1, Topic2, Topic3".split(",");
for (String topic: topics) {
topicMap.put(topic, 3);
}
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, "kafka.test.com:2181", "NameConsumer", topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Lists.newArrayList(",".split(x));
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
wordCounts.print();
wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "txt");
jssc.start();
jssc.awaitTermination();
}
}
I am using the this command to submit the application: C:\spark-1.6.2-bin-hadoop2.6\bin\spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 --class "SparkKafkaConsumer" --master local[4] target\simple-project-1.0.jar
I am getting this error: java.lang.RuntimeException: class scala.runtime.Nothing$ not org.apache.hadoop.mapred.OutputFormat at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:2148)
What is causing this error and how do I solve it?
I agree that the error is not really evocative, but it is usually better to specify the format of the data you want to output in any of the saveAsHadoopFile methods to protect yourself from this type of exception.
Here's the prototype of your particular method in the documentation :
saveAsHadoopFiles(java.lang.String prefix, java.lang.String suffix, java.lang.Class<?> keyClass, java.lang.Class<?> valueClass, java.lang.Class<F> outputFormatClass)
In your example, that would correspond to :
wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "txt", Text.class, IntWritable.class, TextOutputFormat.class)
Based on the format of your wordCounts PairDStream, I chose Text as the key is of type String, and IntWritable as the value associated to the key is of type Integer.
Use TextOutputFormat if you just want basic plain text files, but you can look into the subclasses of FileOutputFormat for more output options.
As this was also asked, the Text class comes from the org.apache.hadoop.io package and the TextOutputFormat comes from the org.apache.hadoop.mapred package.
Just for completeness (#Jonathan gave the right answer )
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
...
wordCounts.saveAsHadoopFiles("hdfs://localhost:8020/user/spark/stream/", "txt", Text.class, IntWritable.class, TextOutputFormat.class)

Getting error when invoking elasticSearch from spark

I have a use case, where I need to read messages from kafka and for each message, extract data and invoke elasticsearch Index. The response will be further used to do further processing.
I am getting below error when invoking JavaEsSpark.esJsonRDD
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below
public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
//Setting when using JavaEsSpark.esJsonRDD
sparkConf.set("es.nodes",<NODE URL>);
sparkConf.set("es.nodes.wan.only","true");
context = new JavaSparkContext(sparkConf);
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
//Receive Message From kafka
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc,args[0], args[1], topicMap);
JavaDStream<String> jsons = messages
.map(new Function<Tuple2<String, String>, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public String call(Tuple2<String, String> tuple2){
JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>,<search string> ).values() ;
return null;
}
});
jsons.print();
jssc.start();
jssc.awaitTermination();
}
I am getting error when invoking JavaEsSpark.esJsonRDD. Is it correct way to do it? How do I successfully invoke ES from spark?
I am running kafka and spark on windows and invoking external elastic search index.

Processing several files in SPARK separately

I need help with implementation one workflow with Apache Spark. My task is in next:
I have several CSV files as source data. Note: these files could has different layout
I have metadata with info how I need parse each file (this is not problem)
Main goal: result is source file with several additional columns. I have to update each source file without joining to one output range. For example: source 10 files -> 10 result files and each result file have data only from corresponding source file.
As I know Spark can open many files by mask:
var source = sc.textFile("/source/data*.gz");
But in this case I can't recognize which line of a file. If I get list of source files and try to process by following scenario:
JavaSparkContext sc = new JavaSparkContext(...);
List<String> files = new ArrayList() //list of source files full name's
for(String f : files)
{
JavaRDD<String> data = sc.textFile(f);
//process this file with Spark
outRdd.coalesce(1, true).saveAsTextFile(f + "_out");
}
But in this case I will processed all files in sequential mode.
My question is next: how I can processed many files in parallel mode?. For example: one file - one executor?
I tried to implement this by simple code with source data:
//JSON file with paths to 4 source files, saved in inData variable
{
"files": [
{
"name": "/mnt/files/DigilantDaily_1.gz",
"layout": "layout_1"
},
{
"name": "/mnt/files/DigilantDaily_2.gz",
"layout": "layout_2"
},
{
"name": "/mnt/files/DigilantDaily_3.gz",
"layout": "layout_3"
},
{
"name": "/mnt/files/DigilantDaily_4.gz",
"layout": "layout_4"
}
]
}
sourceFiles= new ArrayList<>();
JSONObject jsFiles = (JSONObject) new JSONParser().parse(new FileReader(new File(inData)));
Iterator<JSONObject> iterator = ((JSONArray)jsFiles.get("files")).iterator();
while (iterator.hasNext()){
SourceFile sf = new SourceFile();
JSONObject js = iterator.next();
sf.FilePath = (String) js.get("name");
sf.MetaPath = (String) js.get("layout");
sourceFiles.add(sf);
}
SparkConf sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("spark-app");
final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
try {
final Validator validator = new Validator();
ExecutorService pool = Executors.newFixedThreadPool(4);
for(final SourceFile f : sourceFiles)
{
pool.execute(new Runnable() {
#Override
public void run() {
final Path inFile = Paths.get(f.FilePath);
JavaRDD<String> d1 = sparkContext
.textFile(f.FilePath)
.filter(new Function<String, Boolean>() {
#Override
public Boolean call(String s) throws Exception {
return validator.parseRow(s);
}
});
JavaPairRDD<String, Integer> d2 = d1.mapToPair(new PairFunction<String, String, Integer>() {
#Override
public Tuple2<String, Integer> call(String s) throws Exception {
String userAgent = validator.getUserAgent(s);
return new Tuple2<>(DeviceType.deviceType(userAgent), 1);
}
});
JavaPairRDD<String, Integer> d3 = d2.reduceByKey(new Function2<Integer, Integer, Integer>() {
#Override
public Integer call(Integer val1, Integer val2) throws Exception {
return val1 + val2;
}
});
d3.coalesce(1, true)
.saveAsTextFile(outFolder + "/" + inFile.getFileName().toString());//, org.apache.hadoop.io.compress.GzipCodec.class);
}
});
}
pool.shutdown();
pool.awaitTermination(60, TimeUnit.MINUTES);
} catch (Exception e) {
throw e;
} finally {
if (sparkContext != null) {
sparkContext.stop();
}
}
But this code failed with exception:
Exception in thread "pool-13-thread-2" Exception in thread "pool-13-thread-3" Exception in thread "pool-13-thread-1" Exception in thread "pool-13-thread-4" java.lang.Error: org.apache.spark.SparkException: Task not serializable
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.filter(RDD.scala:334)
at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
at append.dev.App$1.run(App.java:87)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
I would like to know where I have a mistake?
Thanks for help me!
I have used a similar multithreaded approach with good results. I beleive the problem is located in the inner class you define.
Create your runnable/callable on a separate class and make sure it gets across to Spark with you submitted jars. Also, implement serializable as you are implicitly passing state to your function (f.FilePath).
You could use sc.wholeTextFiles(dirname) to get an RDD of (filename, content) pairs and map over that.

How can I make Spark Streaming count the words in a file in a unit test?

I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.
When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.
Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is the number of words.
What am I missing?
Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:
StarterAppTest.java
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.junit.*;
import java.io.*;
public class StarterAppTest {
JavaStreamingContext ssc;
File tempDir;
#Before
public void setUp() {
ssc = new JavaStreamingContext("local", "test", new Duration(3000));
tempDir = Files.createTempDir();
tempDir.deleteOnExit();
}
#After
public void tearDown() {
ssc.stop();
ssc = null;
}
#Test
public void testInitialization() {
Assert.assertNotNull(ssc.sc());
}
#Test
public void testCountWords() {
StarterApp starterApp = new StarterApp();
try {
JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
ssc.start();
File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
writer.close();
System.err.println("===== Word Counts =======");
wordCounts.print();
System.err.println("===== Word Counts =======");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Assert.assertTrue(true);
}
}
This test compiles and starts to run, Spark Streaming prints a lot of diagnostic messages on the console but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself, they do.
I've also tried adding ssc.awaitTermination(); after ssc.start() but nothing changed in that respect. After that I've also tried to create a new file manually in the directory that this Spark Streaming application was checking but this time it gave an error.
For completeness, below is the wordCounts method:
public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
#Override
public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
#Override
public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
}).reduceByKey((i1, i2) -> i1 + i2);
return wordCounts;
}
Few pointers:
Give at least 2 cores to SparkStreaming context. 1 for the Streaming and 1 for the Spark processing. "local" -> "local[2]"
Your streaming interval is of 3000ms, so somewhere in your program you need to wait -at least- that time to expect an output.
Spark Streaming needs some time for the setup of listeners. The file is being created immediately after ssc.start is issued. There's no warranty that the filesystem listener is already in place. I'd do some sleep(xx) after ssc.start
In Streaming, it's all about the right timing.

Categories