Processing several files in Spark separately - Java

I need help implementing a workflow with Apache Spark. My task is as follows:
I have several CSV files as source data. Note: these files may have different layouts.
I have metadata with info on how to parse each file (this is not a problem).
Main goal: the result is the source file with several additional columns. I have to update each source file without joining them into one output range. For example: 10 source files -> 10 result files, and each result file contains data only from the corresponding source file.
As I know, Spark can open many files by mask:
JavaRDD<String> source = sc.textFile("/source/data*.gz");
But in this case I can't tell which file each line came from. If I get the list of source files and try to process them with the following scenario:
JavaSparkContext sc = new JavaSparkContext(...);
List<String> files = new ArrayList<>(); // list of the source files' full names
for(String f : files)
{
JavaRDD<String> data = sc.textFile(f);
//process this file with Spark
outRdd.coalesce(1, true).saveAsTextFile(f + "_out");
}
But in this case all files will be processed sequentially.
My question is: how can I process many files in parallel? For example, one file per executor?
I tried to implement this with the following simple code and source data:
//JSON file with paths to 4 source files, saved in inData variable
{
"files": [
{
"name": "/mnt/files/DigilantDaily_1.gz",
"layout": "layout_1"
},
{
"name": "/mnt/files/DigilantDaily_2.gz",
"layout": "layout_2"
},
{
"name": "/mnt/files/DigilantDaily_3.gz",
"layout": "layout_3"
},
{
"name": "/mnt/files/DigilantDaily_4.gz",
"layout": "layout_4"
}
]
}
List<SourceFile> sourceFiles = new ArrayList<>();
JSONObject jsFiles = (JSONObject) new JSONParser().parse(new FileReader(new File(inData)));
Iterator<JSONObject> iterator = ((JSONArray)jsFiles.get("files")).iterator();
while (iterator.hasNext()){
SourceFile sf = new SourceFile();
JSONObject js = iterator.next();
sf.FilePath = (String) js.get("name");
sf.MetaPath = (String) js.get("layout");
sourceFiles.add(sf);
}
SparkConf sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("spark-app");
final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
try {
final Validator validator = new Validator();
ExecutorService pool = Executors.newFixedThreadPool(4);
for(final SourceFile f : sourceFiles)
{
pool.execute(new Runnable() {
@Override
public void run() {
final Path inFile = Paths.get(f.FilePath);
JavaRDD<String> d1 = sparkContext
.textFile(f.FilePath)
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) throws Exception {
return validator.parseRow(s);
}
});
JavaPairRDD<String, Integer> d2 = d1.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) throws Exception {
String userAgent = validator.getUserAgent(s);
return new Tuple2<>(DeviceType.deviceType(userAgent), 1);
}
});
JavaPairRDD<String, Integer> d3 = d2.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer val1, Integer val2) throws Exception {
return val1 + val2;
}
});
d3.coalesce(1, true)
.saveAsTextFile(outFolder + "/" + inFile.getFileName().toString());//, org.apache.hadoop.io.compress.GzipCodec.class);
}
});
}
pool.shutdown();
pool.awaitTermination(60, TimeUnit.MINUTES);
} catch (Exception e) {
throw e;
} finally {
if (sparkContext != null) {
sparkContext.stop();
}
}
But this code fails with the following exception:
Exception in thread "pool-13-thread-2" Exception in thread "pool-13-thread-3" Exception in thread "pool-13-thread-1" Exception in thread "pool-13-thread-4" java.lang.Error: org.apache.spark.SparkException: Task not serializable
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.filter(RDD.scala:334)
at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
at append.dev.App$1.run(App.java:87)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
I would like to know where my mistake is.
Thanks for your help!

I have used a similar multithreaded approach with good results. I believe the problem lies in the inner classes you define.
Create your Runnable/Callable in a separate class and make sure it gets shipped to Spark with your submitted jars. Also, implement Serializable, as you are implicitly passing state to your function (f.FilePath).
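For example, a minimal sketch of that idea (hypothetical class name; it assumes your Validator is, or can be made, Serializable, since it travels with the function to the executors):
import org.apache.spark.api.java.function.Function;

// Standalone filter function instead of an anonymous inner class.
// Spark's Function interface already extends Serializable; any state it
// carries (here the Validator) must be serializable as well.
public class RowFilter implements Function<String, Boolean> {
    private final Validator validator;

    public RowFilter(Validator validator) {
        this.validator = validator;
    }

    @Override
    public Boolean call(String s) throws Exception {
        return validator.parseRow(s);
    }
}
Inside the Runnable you would then call something like:
JavaRDD<String> d1 = sparkContext.textFile(f.FilePath).filter(new RowFilter(validator));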

You could use sc.wholeTextFiles(dirname) to get an RDD of (filename, content) pairs and map over that.
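A minimal sketch of that approach (assuming sc is the JavaSparkContext from the question, and that each file fits comfortably in memory, since wholeTextFiles reads every file as a single value):
// Each record is (full file path, entire file content), so per-file
// processing always knows which file the data came from.
JavaPairRDD<String, String> filesRdd = sc.wholeTextFiles("/source");
JavaRDD<String> results = filesRdd.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> file) throws Exception {
        String path = file._1();     // source file name
        String content = file._2();  // whole file content
        // parse 'content' according to the layout of this particular file
        return path + ": " + content.split("\n").length + " lines";
    }
});
Note that this reads each file whole, so it suits many small-to-medium files rather than very large ones; for the one-result-file-per-source-file requirement you would still write each result out under its own path.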

Related

Can't seem to transform KStream<A,B> to KTable<X,Y>

This is my first attempt at using a KTable. I have a Kafka stream that contains Avro-serialized objects of type A,B, and this works fine. I can write a Consumer that consumes just fine, or a simple KStream that simply counts records.
The B object has a field containing a country code. I'd like to supply that code to a KTable so it can count the number of records that contain a particular country code. To do so I'm trying to convert the stream into a stream of X,Y (or really: country-code, count). Eventually I look at the contents of the table and extract an array of KV pairs.
The code I have (included) always errors out with the following (see the line with 'Caused by'):
2018-07-26 13:42:48.688 [com.findology.tools.controller.TestEventGeneratorController-16d7cd06-4742-402e-a679-898b9ef78c41-StreamThread-1; AssignedStreamsTasks] ERROR -- stream-thread [com.findology.tools.controller.TestEventGeneratorController-16d7cd06-4742-402e-a679-898b9ef78c41-StreamThread-1] Failed to process stream task 0_0 due to the following error:
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=com.findology.model.traffic.CpaTrackingCallback, partition=0, offset=962649
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:240)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:922)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:802)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.ByteArraySerializer / value: org.apache.kafka.common.serialization.ByteArraySerializer) is not compatible to the actual key or value type (key type: java.lang.Integer / value type: java.lang.Integer). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:92)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.kstream.internals.KStreamFilter$KStreamFilterProcessor.process(KStreamFilter.java:43)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:211)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.kstream.internals.KStreamTransform$KStreamTransformProcessor.process(KStreamTransform.java:59)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:211)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:224)
... 6 more
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to [B
at org.apache.kafka.common.serialization.ByteArraySerializer.serialize(ByteArraySerializer.java:21)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:146)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:94)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:87)
... 19 more
And here is the code I'm using. I've omitted certain classes for brevity. Note that I'm not using the Confluent KafkaAvro classes.
private synchronized void createStreamProcessor2() {
if (streams == null) {
try {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, getClass().getName());
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
StreamsConfig config = new StreamsConfig(props);
StreamsBuilder builder = new StreamsBuilder();
Map<String, Object> serdeProps = new HashMap<>();
serdeProps.put("schema.registry.url", schemaRegistryURL);
AvroSerde<CpaTrackingCallback> cpaTrackingCallbackAvroSerde = new AvroSerde<>(schemaRegistryURL);
cpaTrackingCallbackAvroSerde.configure(serdeProps, false);
// This is the key to telling kafka the specific Serde instance to use
// to deserialize the Avro encoded value
KStream<Long, CpaTrackingCallback> stream = builder.stream(CpaTrackingCallback.class.getName(),
Consumed.with(Serdes.Long(), cpaTrackingCallbackAvroSerde));
// provide a way to convert CpsTrackicking... info into just country codes
// (Long, CpaTrackingCallback) -> (countryCode:Integer, placeHolder:Long)
TransformerSupplier<Long, CpaTrackingCallback, KeyValue<Integer, Long>> transformer = new TransformerSupplier<Long, CpaTrackingCallback, KeyValue<Integer, Long>>() {
@Override
public Transformer<Long, CpaTrackingCallback, KeyValue<Integer, Long>> get() {
return new Transformer<Long, CpaTrackingCallback, KeyValue<Integer, Long>>() {
@Override
public void init(ProcessorContext context) {
// Not doing Punctuate so no need to store context
}
@Override
public KeyValue<Integer, Long> transform(Long key, CpaTrackingCallback value) {
return new KeyValue(value.getCountryCode(), 1);
}
@Override
public KeyValue<Integer, Long> punctuate(long timestamp) {
return null;
}
@Override
public void close() {
}
};
}
};
KTable<Integer, Long> countryCounts = stream.transform(transformer).groupByKey() //
.count(Materialized.as("country-counts"));
streams = new KafkaStreams(builder.build(), config);
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
streams.cleanUp();
streams.start();
try {
countryCountsView = waitUntilStoreIsQueryable("country-counts", QueryableStoreTypes.keyValueStore(),
streams);
}
catch (InterruptedException e) {
log.warn("Interrupted while waiting for query store to become available", e);
}
}
catch (Exception e) {
log.error(e);
}
}
}
The bare groupByKey() method on KStream uses the default serializer/deserializer (which you haven't set). Use the method groupByKey(Serialized<K,V> serialized), as in:
.groupByKey(Serialized.with(Serdes.Integer(), Serdes.Long()))
Also note that what you do in your custom TransformerSupplier can be done with a plain KStream.map call.
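For example, a sketch of the same aggregation using map (assuming getCountryCode() returns an Integer, and counting with Long values to match the KTable<Integer, Long> type):
KStream<Integer, Long> countryCodes = stream
        .map((key, value) -> KeyValue.pair(value.getCountryCode(), 1L));
KTable<Integer, Long> countryCounts = countryCodes
        .groupByKey(Serialized.with(Serdes.Integer(), Serdes.Long()))
        .count(Materialized.as("country-counts"));
Because the re-keyed stream is repartitioned before counting, passing the Integer/Long Serdes explicitly here is what avoids the ByteArraySerializer mismatch shown in the stack trace.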

Use mapPartitionsWithIndex for DStream - Spark Streaming

I want to do something very simple: check the content of each partition in the first RDD of my DStream.
This is what I'm doing now:
SparkConf sparkConfiguration= new SparkConf().setAppName("DataAnalysis").setMaster("local[*]");
JavaStreamingContext sparkStrContext=new JavaStreamingContext(sparkConfiguration, Durations.seconds(1));
JavaReceiverInputDStream<String> receiveParkingData=sparkStrContext.socketTextStream("localhost",5554);
Time time=new Time(1000);
JavaRDD<String>dataRDD= receiveParkingData.compute(time);
//I get an error in this RDD
JavaRDD<String>indexDataRDD=dataRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
@Override
public Iterator<String> call(Integer integer, Iterator<String> stringIterator) throws Exception {
return null;
}
});
indexDataRDD.collect();
So I want to print the content of each partition and its ID. However, on the indexDataRDD I get this message in my IntelliJ IDE: mapPartitionsWithIndex (Function2<Integer, Iterator<String>, Iterator<String>>, boolean) in AbstractJavaRDDLike cannot be applied to (Function2<Integer, Iterator<String>, Iterator<String>>)
Can someone help me with this issue? Is there another, easier way to get the content in each partition? I really want to know the specific content of each partition.
Thank you so much.
Here is a sample program using mapPartitionsWithIndex for your reference.
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class SparkDemo {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("SparkDemo").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
List<String> data = Arrays.asList("one","two","three","four","five");
JavaRDD<String> javaRDD = sc.parallelize(data, 2);
JavaRDD<String> mapPartitionsWithIndexRDD = javaRDD
.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
@Override
public Iterator<String> call(Integer index, Iterator<String> iterator) throws Exception {
LinkedList<String> linkedList = new LinkedList<String>();
while (iterator.hasNext()){
linkedList.add(Integer.toString(index) + "-" + iterator.next());
}
return linkedList.iterator();
}
}, false);
System.out.println("mapPartitionsWithIndexRDD " + mapPartitionsWithIndexRDD.collect());
sc.stop();
sc.close();
}
}
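With the data parallelized into 2 partitions as above, this should print something like mapPartitionsWithIndexRDD [0-one, 0-two, 1-three, 1-four, 1-five]. The second argument (false) is the preservesPartitioning flag; omitting it is what caused the "cannot be applied" error in the question.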

Create and save Sequence<Text,Byte[]> files in a foreachPartition function in Spark

I am trying to load a group of files, make some checks on them, and later save them in HDFS. I haven't found a good way to create and save these Sequence files, though. Here is my loader's main function:
SparkConf sparkConf = new SparkConf().setAppName("writingHDFS")
.setMaster("local[2]")
.set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
//JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5*1000));
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
JavaPairRDD<String, String> imageRDD = jsc.wholeTextFiles("file:///home/cloudera/Pictures/");
imageRDD.mapToPair(new PairFunction<Tuple2<String,String>, Text, Text>() {
@Override
public Tuple2<Text, Text> call(Tuple2<String, String> arg0)
throws Exception {
return new Tuple2<Text, Text>(new Text(arg0._1),new Text(arg0._2));
}
}).saveAsNewAPIHadoopFile("hdfs://localhost:8020/user/hdfs/sparkling/try.seq", Text.class, Text.class, SequenceFileOutputFormat.class);
It simply loads some images as text files, puts the name of the file as the key of the PairRDD, and uses the native saveAsNewAPIHadoopFile.
I would now like to save the files one by one in an rdd.foreach or rdd.foreachPartition, but I cannot find a proper method:
This stackoverflow answer creates a Job for the occasion. It seems to work, but it needs the file supplied as a path, while I already have an RDD made of them.
A couple of solutions I found create a directory for each file (OutputStream out = fs.create(new Path(dst));), which wouldn't be much of a problem if it weren't for the fact that I get a Mkdirs didn't work exception.
EDIT: I may have found a way, but I have a Task not serializable exception:
JavaPairRDD imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
imageByteRDD.foreach(new VoidFunction<Tuple2<String,PortableDataStream>>() {
@Override
public void call(Tuple2<String, PortableDataStream> fileTuple) throws Exception {
Text key = new Text(fileTuple._1());
BytesWritable value = new BytesWritable( fileTuple._2().toArray());
SequenceFile.Writer writer = SequenceFile.createWriter(serializableConfiguration.getConf(), SequenceFile.Writer.file(new Path("/user/hdfs/sparkling/" + key)),
SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
SequenceFile.Writer.keyClass(Text.class), SequenceFile.Writer.valueClass(BytesWritable.class));
key = new Text("MiaoMiao!");
writer.append(key, value);
IOUtils.closeStream(writer);
}
});
I have tried wrapping the entire function in a Serializable class, but no luck. Help?
The way I did it was (pseudocode; I'll try to edit this answer as soon as I get to my office):
rdd.foreachPartition{
Configuration conf = ConfigurationSingletonClass.getConfiguration();
etcetera, etcetera...
}
EDIT: I got to my office; here is the complete segment of code. The Configuration is created inside rdd.foreachPartition (foreach was a little too much). Inside the iterator the files themselves are written, in sequence file format.
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);
if(!imageByteRDD.isEmpty())
imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String,PortableDataStream>>>() {
@Override
public void call(
Iterator<Tuple2<String, PortableDataStream>> arg0)
throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", HDFS_PATH);
while(arg0.hasNext()){
Tuple2<String,PortableDataStream>fileTuple = arg0.next();
Text key = new Text(fileTuple._1());
String fileName = key.toString().split(SEP_PATH)[key.toString().split(SEP_PATH).length-1].split(DOT_REGEX)[0];
String fileExtension = fileName.split(DOT_REGEX)[fileName.split(DOT_REGEX).length-1];
BytesWritable value = new BytesWritable( fileTuple._2().toArray());
SequenceFile.Writer writer = SequenceFile.createWriter(
conf,
SequenceFile.Writer.file(new Path(DEST_PATH + fileName + SEP_KEY + getCurrentTimeStamp()+DOT+fileExtension)),
SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
SequenceFile.Writer.keyClass(Text.class), SequenceFile.Writer.valueClass(BytesWritable.class));
key = new Text(key.toString().split(SEP_PATH)[key.toString().split(SEP_PATH).length-2] + SEP_KEY + fileName + SEP_KEY + fileExtension);
writer.append(key, value);
IOUtils.closeStream(writer);
}
}
});
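One reason creating the Configuration inside foreachPartition works: Hadoop's Configuration is not Java-serializable, so building it per partition keeps it out of the closure Spark has to ship, which is presumably what triggered the Task not serializable error in the question.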
Hope this will help.

How do I load sequence files (from HDFS) in parallel and process each of them in parallel through Spark?

I need to load HDFS files in parallel and process (read and filter based on some criteria) each file in parallel. The following code loads the files serially. I am running the Spark application with three workers (4 cores each). I even tried setting the partition parameter in the parallelize method, but there was no performance improvement. I'm sure my cluster has enough resources to run the jobs in parallel. What changes should I make to process them in parallel?
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.closure.serializer", "org.apache.spark.serializer.JavaSerializer");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> files = sparkContext.parallelize(fileList);
Iterator<String> localIterator = files.toLocalIterator();
while (localIterator.hasNext())
{
String hdfsPath = localIterator.next();
long startTime = DateUtil.getCurrentTimeMillis();
JavaPairRDD<IntWritable, BytesWritable> hdfsContent = sparkContext.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
try
{
JavaRDD<Message> logs = hdfsContent.map(new Function<Tuple2<IntWritable, BytesWritable>, Message>()
{
public Message call(Tuple2<IntWritable, BytesWritable> tuple2) throws Exception
{
BytesWritable value = tuple2._2();
BytesWritable tmp = new BytesWritable();
tmp.setCapacity(value.getLength());
tmp.set(value);
return (Message) getProtos(logtype, tmp.getBytes());
}
});
final JavaRDD<Message> filteredLogs = logs.filter(new Function<Message, Boolean>()
{
public Boolean call(Message msg) throws Exception
{
FieldDescriptor fd = msg.getDescriptorForType().findFieldByName("method");
String value = (String) msg.getField(fd);
if (value.equals("POST"))
{
return true;
}
return false;
}
});
long timetaken = DateUtil.getCurrentTimeMillis() - startTime;
LOGGER.log(Level.INFO, "HDFS: {0} Total Log Count : {1} Filtered Log Count : {2} TimeTaken : {3}", new Object[] { hdfsPath, logs.count(), filteredLogs.count(), timetaken });
}
catch (Exception e)
{
LOGGER.log(Level.INFO, "Exception : ", e);
}
}
Instead of iterating over the files RDD, I also tried Spark functions like map & foreach, but they throw the following Spark exception. No external variables are referenced inside the closure, and my class (OldLogAnalyzer) already implements the Serializable interface. KryoSerializer and JavaSerializer are also configured in SparkConf. I'm puzzled as to what is not serializable in my code.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.map(RDD.scala:286)
at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:81)
at org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:32)
at com.test.logs.spark.OldLogAnalyzer.main(OldLogAnalyzer.java:423)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.api.java.JavaSparkContext, value: org.apache.spark.api.java.JavaSparkContext@68f277a2)
- field (class: com.test.logs.spark.OldLogAnalyzer$10, name: val$sparkContext, type: class org.apache.spark.api.java.JavaSparkContext)
- object (class com.test.logs.spark.OldLogAnalyzer$10, com.test.logs.spark.OldLogAnalyzer$10@2f80b005)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 15 more
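For what it's worth, the serialization stack above names the likely culprit: the anonymous function (OldLogAnalyzer$10) captures the JavaSparkContext (field val$sparkContext), and a SparkContext can never be serialized into a closure. A hedged, reconstructed sketch of the kind of code that produces exactly this trace:
// Hypothetical sketch: calling the context from inside a function that
// Spark must serialize and ship to executors.
JavaRDD<String> files = sparkContext.parallelize(fileList);
JavaRDD<Long> counts = files.map(new Function<String, Long>() {
    public Long call(String hdfsPath) throws Exception {
        // sparkContext is captured here -> "Task not serializable"
        return sparkContext
                .sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class)
                .count();
    }
});
// Nested RDD operations like this are not supported; the per-file RDDs
// have to be created on the driver, as in the loop above.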

How can I make Spark Streaming count the words in a file in a unit test?

I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.
When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.
Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is the number of words.
What am I missing?
Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:
StarterAppTest.java
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.junit.*;
import java.io.*;
public class StarterAppTest {
JavaStreamingContext ssc;
File tempDir;
@Before
public void setUp() {
ssc = new JavaStreamingContext("local", "test", new Duration(3000));
tempDir = Files.createTempDir();
tempDir.deleteOnExit();
}
@After
public void tearDown() {
ssc.stop();
ssc = null;
}
@Test
public void testInitialization() {
Assert.assertNotNull(ssc.sc());
}
@Test
public void testCountWords() {
StarterApp starterApp = new StarterApp();
try {
JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
ssc.start();
File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
writer.close();
System.err.println("===== Word Counts =======");
wordCounts.print();
System.err.println("===== Word Counts =======");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Assert.assertTrue(true);
}
}
This test compiles and starts to run; Spark Streaming prints a lot of diagnostic messages on the console, but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself it does.
I've also tried adding ssc.awaitTermination(); after ssc.start(), but nothing changed in that respect. After that I also tried to create a new file manually in the directory that this Spark Streaming application was checking, but this time it gave an error.
For completeness, below is the countWords method:
public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
}).reduceByKey((i1, i2) -> i1 + i2);
return wordCounts;
}
A few pointers:
Give at least 2 cores to the Spark Streaming context: 1 for the streaming receiver and 1 for the Spark processing. "local" -> "local[2]"
Your streaming interval is 3000 ms, so somewhere in your program you need to wait at least that long before expecting any output.
Spark Streaming needs some time to set up its listeners. The file is created immediately after ssc.start is issued, and there is no guarantee that the filesystem listener is already in place. I'd add a sleep(xx) after ssc.start.
In Streaming, it's all about the right timing; a sketch applying these pointers to the test is below.
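A rough sketch of the test changes along those lines (the sleep durations are arbitrary guesses, just long enough to cover listener setup and one 3000 ms batch):
// setUp: two local cores instead of one.
ssc = new JavaStreamingContext("local[2]", "test", new Duration(3000));

// testCountWords: register the output before start(), then give the
// directory listener and the first batch enough time to run.
JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
wordCounts.print();
ssc.start();
Thread.sleep(1000);   // let the file-system listener get set up
File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
writer.close();
Thread.sleep(5000);   // longer than the 3000 ms batch interval, so at least one batch runs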
