Use mapPartitionsWithIndex for DStream - Spark Streaming

I want to do something very simple: check the content of each partition in the first RDD of my DStream.
This is what I'm doing now:
SparkConf sparkConfiguration = new SparkConf().setAppName("DataAnalysis").setMaster("local[*]");
JavaStreamingContext sparkStrContext = new JavaStreamingContext(sparkConfiguration, Durations.seconds(1));
JavaReceiverInputDStream<String> receiveParkingData = sparkStrContext.socketTextStream("localhost", 5554);
Time time = new Time(1000);
JavaRDD<String> dataRDD = receiveParkingData.compute(time);
// I get an error on this RDD
JavaRDD<String> indexDataRDD = dataRDD.mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
    @Override
    public Iterator<String> call(Integer integer, Iterator<String> stringIterator) throws Exception {
        return null;
    }
});
indexDataRDD.collect();
So I want to print the content of each partition and its ID. However, on the indexDataRDD I get this message in my IntelliJ IDE: mapPartitionsWithIndex (Function2<Integer, Iterator<String>, Iterator<String>>, boolean) in AbstractJavaRDDLike cannot be applied to (Function2<Integer, Iterator<String>, Iterator<String>>)
Can someone help me with this issue? Is there another, easier way to get the content in each partition? I really want to know the specific content of each partition.
Thank you so much.

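The IDE message shows that, in your Spark version, the Java API declares mapPartitionsWithIndex with a second boolean argument (preservesPartitioning), so the call has to pass it explicitly. Here is a sample program that uses mapPartitionsWithIndex, for your reference.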
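import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class SparkDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<String> data = Arrays.asList("one", "two", "three", "four", "five");
        // Split the data into 2 partitions
        JavaRDD<String> javaRDD = sc.parallelize(data, 2);
        JavaRDD<String> mapPartitionsWithIndexRDD = javaRDD
            .mapPartitionsWithIndex(new Function2<Integer, Iterator<String>, Iterator<String>>() {
                @Override
                public Iterator<String> call(Integer index, Iterator<String> iterator) throws Exception {
                    // Prefix every element with the index of the partition that holds it
                    LinkedList<String> linkedList = new LinkedList<String>();
                    while (iterator.hasNext()) {
                        linkedList.add(Integer.toString(index) + "-" + iterator.next());
                    }
                    return linkedList.iterator();
                }
            }, false); // second argument: preservesPartitioning
        System.out.println("mapPartitionsWithIndexRDD " + mapPartitionsWithIndexRDD.collect());
        sc.close(); // close() also stops the context
    }
}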
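The tagging loop inside call does not depend on Spark, so it can be pulled into a plain method and checked without starting a context. A minimal sketch, assuming the hypothetical names PartitionTagger, tag and toList (none of them is part of Spark):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Spark: the same per-partition logic as a plain method.
final class PartitionTagger {

    private PartitionTagger() {}

    // Prefixes every element with the index of the partition it came from.
    static Iterator<String> tag(int index, Iterator<String> rows) {
        List<String> tagged = new ArrayList<>();
        while (rows.hasNext()) {
            tagged.add(index + "-" + rows.next());
        }
        return tagged.iterator();
    }

    // Convenience for checks: drains an iterator into a list.
    static List<String> toList(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulate partition 1 holding three elements.
        Iterator<String> partition = Arrays.asList("three", "four", "five").iterator();
        System.out.println(toList(tag(1, partition)));
    }
}
```

With a helper like this, the body of the anonymous Function2 could shrink to a single line, return PartitionTagger.tag(index, iterator);, and the same logic would apply to each RDD of a DStream.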

Related

java.io.NotSerializableException: Graph is unexpectedly null when DStream is being serialized

I am new to Spark Streaming programming. Could someone please explain what the problem is?
I think that I am iterating over a null structure, but I have a producer class that works normally.
My source code:
public class Main3 implements java.io.Serializable {
    public static JavaDStream<Double> pr;

    public void consumer() throws Exception {
        // Configure Spark to connect to Kafka running on local machine
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");
        kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
        Collection<String> topics = Arrays.asList("testing");
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka10WordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        final JavaInputDStream<ConsumerRecord<String, String>> receiver =
            KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
        JavaDStream<String> stream = receiver.map(new Function<ConsumerRecord<String, String>, String>() {
            @Override
            public String call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
                return kafkaRecord.value();
            }
        });
        stream.foreachRDD(x -> x.saveAsTextFile("/home/khouloud/Desktop/exemple/b")); // this does nothing
        stream.foreachRDD(x -> {
            x.collect().stream().forEach(n -> System.out.println("item of list: " + n));
        }); // this also prints nothing to the console
        stream.foreachRDD(rdd -> {
            if (rdd.isEmpty()) System.out.println("its empty");
        }); // nothing
        JavaPairDStream<Integer, List<Double>> points = stream.mapToPair(new PairFunction<String, Integer, List<Double>>() {
            @Override
            public Tuple2<Integer, List<Double>> call(String x) throws Exception {
                String[] item = x.split(" ");
                List<Double> l = new ArrayList<Double>();
                for (int i = 1; i < item.length; i++) {
                    l.add(Double.valueOf(item[i]));
                }
                return new Tuple2<>(Integer.valueOf(item[0]), l);
            }
        });
Error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:547)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:546)
at org.apache.spark.streaming.api.java.JavaDStreamLike$class.mapToPair(JavaDStreamLike.scala:163)
at org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapToPair(JavaDStreamLike.scala:42)
at Min.calculDegSim(Min.java:43)
at SkyRule.execute(SkyRule.java:34)
at Main3.consumer(Main3.java:159)
at Executer$2.run(Executer.java:27)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.NotSerializableException: Graph is unexpectedly null when DStream is being serialized.
Serialization stack:
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)

Getting error when invoking elasticSearch from spark

I have a use case where I need to read messages from Kafka and, for each message, extract data and query an Elasticsearch index. The response will then be used for further processing.
I am getting the error below when invoking JavaEsSpark.esJsonRDD:
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below
public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
//Setting when using JavaEsSpark.esJsonRDD
sparkConf.set("es.nodes",<NODE URL>);
sparkConf.set("es.nodes.wan.only","true");
context = new JavaSparkContext(sparkConf);
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
//Receive Message From kafka
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc,args[0], args[1], topicMap);
JavaDStream<String> jsons = messages
.map(new Function<Tuple2<String, String>, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public String call(Tuple2<String, String> tuple2){
JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>,<search string> ).values() ;
return null;
}
});
jsons.print();
jssc.start();
jssc.awaitTermination();
}
I am getting the error when invoking JavaEsSpark.esJsonRDD. Is this the correct way to do it? How do I successfully invoke ES from Spark?
I am running Kafka and Spark on Windows and invoking an external Elasticsearch index.

Processing several files in SPARK separately

I need help implementing a workflow with Apache Spark. My task is as follows:
I have several CSV files as source data. Note: these files could have different layouts.
I have metadata describing how to parse each file (this is not a problem).
Main goal: the result is the source file with several additional columns. I have to update each source file without merging them into one output. For example: 10 source files -> 10 result files, and each result file has data only from the corresponding source file.
As far as I know, Spark can open many files by mask:
var source = sc.textFile("/source/data*.gz");
But in this case I can't tell which file a line came from. If I get the list of source files and process them with the following scenario:
JavaSparkContext sc = new JavaSparkContext(...);
List<String> files = new ArrayList<>(); // full names of the source files
for(String f : files)
{
JavaRDD<String> data = sc.textFile(f);
//process this file with Spark
outRdd.coalesce(1, true).saveAsTextFile(f + "_out");
}
But in this case all files are processed sequentially.
My question is: how can I process many files in parallel? For example: one file per executor?
I tried to implement this with the following simple code and source data:
//JSON file with paths to 4 source files, saved in inData variable
{
"files": [
{
"name": "/mnt/files/DigilantDaily_1.gz",
"layout": "layout_1"
},
{
"name": "/mnt/files/DigilantDaily_2.gz",
"layout": "layout_2"
},
{
"name": "/mnt/files/DigilantDaily_3.gz",
"layout": "layout_3"
},
{
"name": "/mnt/files/DigilantDaily_4.gz",
"layout": "layout_4"
}
]
}
sourceFiles= new ArrayList<>();
JSONObject jsFiles = (JSONObject) new JSONParser().parse(new FileReader(new File(inData)));
Iterator<JSONObject> iterator = ((JSONArray)jsFiles.get("files")).iterator();
while (iterator.hasNext()){
SourceFile sf = new SourceFile();
JSONObject js = iterator.next();
sf.FilePath = (String) js.get("name");
sf.MetaPath = (String) js.get("layout");
sourceFiles.add(sf);
}
SparkConf sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("spark-app");
final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
try {
final Validator validator = new Validator();
ExecutorService pool = Executors.newFixedThreadPool(4);
for(final SourceFile f : sourceFiles)
{
pool.execute(new Runnable() {
@Override
public void run() {
final Path inFile = Paths.get(f.FilePath);
JavaRDD<String> d1 = sparkContext
.textFile(f.FilePath)
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) throws Exception {
return validator.parseRow(s);
}
});
JavaPairRDD<String, Integer> d2 = d1.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) throws Exception {
String userAgent = validator.getUserAgent(s);
return new Tuple2<>(DeviceType.deviceType(userAgent), 1);
}
});
JavaPairRDD<String, Integer> d3 = d2.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer val1, Integer val2) throws Exception {
return val1 + val2;
}
});
d3.coalesce(1, true)
.saveAsTextFile(outFolder + "/" + inFile.getFileName().toString());//, org.apache.hadoop.io.compress.GzipCodec.class);
}
});
}
pool.shutdown();
pool.awaitTermination(60, TimeUnit.MINUTES);
} catch (Exception e) {
throw e;
} finally {
if (sparkContext != null) {
sparkContext.stop();
}
}
But this code failed with an exception:
Exception in thread "pool-13-thread-2" Exception in thread "pool-13-thread-3" Exception in thread "pool-13-thread-1" Exception in thread "pool-13-thread-4" java.lang.Error: org.apache.spark.SparkException: Task not serializable
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.filter(RDD.scala:334)
at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
at append.dev.App$1.run(App.java:87)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
... 2 more
I would like to know where my mistake is.
Thanks for your help!

I have used a similar multithreaded approach with good results. I believe the problem is in the inner class you define.
Create your Runnable/Callable in a separate class and make sure it reaches Spark with your submitted jars. Also, implement Serializable, as you are implicitly passing state to your function (f.FilePath).
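Why the inner class matters can be reproduced with plain Java serialization, which is what the JavaSerializer frames in the stack trace end up using. A minimal sketch with no Spark involved; RowFilter, MinLengthFilter, Job and canSerialize are hypothetical stand-ins. An anonymous class that reads a field of its non-serializable enclosing object drags that object into serialization, while a static nested class carries only its own state:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo, no Spark involved: shows what Java serialization does with inner classes.
final class ClosureSerializationDemo {

    private ClosureSerializationDemo() {}

    // Stand-in for a serializable function interface such as Spark's Function.
    interface RowFilter extends Serializable {
        boolean accept(String row);
    }

    // Static nested class: carries only its own state, so it serializes cleanly.
    static final class MinLengthFilter implements RowFilter {
        private static final long serialVersionUID = 1L;
        private final int minLength;

        MinLengthFilter(int minLength) {
            this.minLength = minLength;
        }

        @Override
        public boolean accept(String row) {
            return row.length() >= minLength;
        }
    }

    // Deliberately NOT Serializable, like a Runnable that builds Spark functions inline.
    static final class Job {
        private final int minLength;

        Job(int minLength) {
            this.minLength = minLength;
        }

        RowFilter anonymousFilter() {
            // Reads a field of the enclosing Job, so the instance keeps a reference to it.
            return new RowFilter() {
                @Override
                public boolean accept(String row) {
                    return row.length() >= minLength;
                }
            };
        }
    }

    // Returns true if Java serialization accepts the object graph rooted at value.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("static nested: " + canSerialize(new MinLengthFilter(3)));
        System.out.println("anonymous:     " + canSerialize(new Job(3).anonymousFilter()));
    }
}
```

In this sketch the static nested filter serializes and the anonymous one does not. Applied to the question's code, that suggests moving each Function out of the Runnable into its own top-level or static nested class, and making sure anything it holds (such as the validator) is itself Serializable.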

You could use sc.wholeTextFiles(dirname) to get an RDD of (filename, content) pairs and map over that.

Spark: Two SparkContexts in a single Application Best Practice

I think I have an interesting question for all of you today. In the code below you will notice I have two contexts: one for Spark Streaming and a normal SparkContext. According to best practices, you should have only one SparkContext in a Spark application, even though it's possible to circumvent this via allowMultipleContexts in the configuration.
The problem is that I need to retrieve data from Hive and from a Kafka topic to do some logic, and whenever I submit my application it obviously returns "Cannot have 2 Spark Contexts Running on JVM".
My question is: is there a more correct way to do this than how I am doing it right now?
public class MainApp {
private final String logFile= Properties.getString("SparkLogFileDir");
private static final String KAFKA_GROUPID = Properties.getString("KafkaGroupId");
private static final String ZOOKEEPER_URL = Properties.getString("ZookeeperURL");
private static final String KAFKA_BROKER = Properties.getString("KafkaBroker");
private static final String KAFKA_TOPIC = Properties.getString("KafkaTopic");
private static final String Database = Properties.getString("HiveDatabase");
private static final Integer KAFKA_PARA = Properties.getInt("KafkaParrallel");
public static void main(String[] args){
//set settings
String sql="";
//START APP
System.out.println("Starting NPI_TWITTERAPP...." + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Configuring Settings...."+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
SparkConf conf = new SparkConf()
.setAppName(Properties.getString("SparkAppName"))
.setMaster(Properties.getString("SparkMasterUrl"));
//Set Spark/hive/sql Context
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
JavaHiveContext HiveSqlContext = new JavaHiveContext(sc);
//Check if Twitter Hive Table Exists
try {
HiveSqlContext.sql("DROP TABLE IF EXISTS "+Database+"TWITTERSTORE");
HiveSqlContext.sql("CREATE TABLE IF NOT EXISTS "+Database+".TWITTERSTORE "
+" (created_at String, id String, id_str String, text String, source String, truncated String, in_reply_to_user_id String, processed_at String, lon String, lat String)"
+" STORED AS TEXTFILE");
}catch(Exception e){
System.out.println(e);
}
//Check if Ivapp Table Exists
sql ="CREATE TABLE IF NOT EXISTS "+Database+".IVAPPGEO AS SELECT DISTINCT a.LATITUDE, a.LONGITUDE, b.ODNCIRCUIT_OLT_CLLI, b.ODNCIRCUIT_OLT_TID, a.CITY, a.STATE, a.ZIP FROM "
+Database+".T_PONNMS_SERVICE B, "
+Database+".CLLI_LATLON_MSTR A WHERE a.BID_CLLI = substr(b.ODNCIRCUIT_OLT_CLLI,0,8)";
try {
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
HiveSqlContext.sql(sql);
sql = "SELECT LATITUDE, LONGITUDE, ODNCIRCUIT_OLT_CLLI, ODNCIRCUIT_OLT_TID, CITY, STATE, ZIP FROM "+Database+".IVAPPGEO";
JavaSchemaRDD RDD_IVAPPGEO = HiveSqlContext.sql(sql).cache();
}catch(Exception e){
System.out.println(sql + new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
}
//JavaHiveContext hc = new JavaHiveContext();
System.out.println("Retrieve Data from Kafka Topic: "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put(KAFKA_TOPIC,KAFKA_PARA);
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
jssc, KAFKA_GROUPID, ZOOKEEPER_URL, topicMap);
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
private static final long serialVersionUID = 42l;
@Override
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
System.out.println("Completed Kafka Messages... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaPairDStream<Long, String> tweets = json.mapToPair(
new TwitterFilterFunction());
JavaPairDStream<Long, String> filtered = tweets.filter(
new Function<Tuple2<Long, String>, Boolean>() {
private static final long serialVersionUID = 42l;
@Override
public Boolean call(Tuple2<Long, String> tweet) {
return tweet != null;
}
}
);
JavaDStream<Tuple2<Long, String>> tweetsFiltered = filtered.map(
new TextFilterFunction());
tweetsFiltered = tweetsFiltered.map(
new StemmingFunction());
System.out.println("Finished Filtering Resultset... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
//calculate postive tweets
JavaPairDStream<Tuple2<Long, String>, Float> positiveTweets =
tweetsFiltered.mapToPair(new PositiveScoreFunction());
//calculate negative tweets
JavaPairDStream<Tuple2<Long, String>, Float> negativeTweets =
tweetsFiltered.mapToPair(new NegativeScoreFunction());
JavaPairDStream<Tuple2<Long, String>, Tuple2<Float, Float>> joined =
positiveTweets.join(negativeTweets);
//Score tweets
JavaDStream<Tuple4<Long, String, Float, Float>> scoredTweets =
joined.map(new Function<Tuple2<Tuple2<Long, String>,
Tuple2<Float, Float>>,
Tuple4<Long, String, Float, Float>>() {
private static final long serialVersionUID = 42l;
@Override
public Tuple4<Long, String, Float, Float> call(
Tuple2<Tuple2<Long, String>, Tuple2<Float, Float>> tweet)
{
return new Tuple4<Long, String, Float, Float>(
tweet._1()._1(),
tweet._1()._2(),
tweet._2()._1(),
tweet._2()._2());
}
});
System.out.println("Finished Processing Sentiment Data... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
System.out.println("Outputting Tweets Data to flat file "+Properties.getString("HdfsOutput")+" ... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
JavaDStream<Tuple5<Long, String, Float, Float, String>> result =
scoredTweets.map(new ScoreTweetsFunction());
result.foreachRDD(new FileWriter());
System.out.println("Outputting Sentiment Data to Hive... "+ new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime()));
jssc.start();
jssc.awaitTermination();
}
}

Creating SparkContext
You can create a SparkContext instance with or without creating a SparkConf object first.
Getting Existing or Creating New SparkContext (getOrCreate methods)
getOrCreate(): SparkContext
getOrCreate(conf: SparkConf): SparkContext
SparkContext.getOrCreate methods allow you to get the existing SparkContext or create a new one.
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)
Reference: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sparkcontext.html

Apparently if I use sc.close() to close the original SparkContext before executing JavaStreaming Context it works perfectly, no errors or issues.

You can use a singleton ContextManager object that decides which context to provide.
public class ContextManager {
    private static JavaSparkContext context;
    private static String currentType;

    private ContextManager() {}

    public static JavaSparkContext getContext(String type) {
        // Compare strings with equals(); == only compares references
        if (type != null && type.equals(currentType) && context != null) {
            return context;
        } else if ("streaming".equals(type)) {
            // ... clean up the current context ...
            // ... initialize the context to a streaming context ...
            currentType = type;
        } else {
            // ... clean up the current context ...
            // ... initialize the context to a normal context ...
            currentType = type;
        }
        return context;
    }
}
One drawback: in projects where you switch contexts rapidly, the overhead would be quite large.

You can access the SparkContext from your JavaStreamingContext and use that reference when creating additional contexts.
SparkConf sparkConfig = new SparkConf().setAppName("foo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConfig, Durations.seconds(30));
SQLContext sqlContext = new SQLContext(jssc.sparkContext());

Efficient Spark Cassandra Java join

I've got two tables:
my_keyspace.name with columns:
name (string) - partition key
timestamp (date) - second part of partition key
id (int) - third part of partition key
my_keyspace.data with columns:
timestamp (date) - partition key
id (int) - second part of partition key
data (string)
I'm trying to join on timestamp and id from a name table. I'm doing it by getting all timestamps and ids associated with a given name and retrieving data from data table for those entries.
It's really fast to do it in CQL. I expected Spark Cassandra to be equally fast at it, but instead it seems to be doing a full table scan. It might be due to not knowing which fields are partition/primary key. Though I don't seem to be able to find a way to tell it the mappings.
How can I make this join as efficient as it should be? Here's my code sample:
private static void notSoEfficientJoin() {
SparkConf conf = new SparkConf().setAppName("Simple Application")
.setMaster("local[*]")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<DataKey, NameRow> nameIndexRDD = javaFunctions(sc).cassandraTable("my_keyspace", "name", mapRowTo(NameRow.class)).where("name = 'John'")
.keyBy(new Function<NameRow, DataKey>() {
@Override
public DataKey call(NameRow v1) throws Exception {
return new DataKey(v1.timestamp, v1.id);
}
});
JavaPairRDD<DataKey, DataRow> dataRDD = javaFunctions(sc).cassandraTable("my_keyspace", "data", mapRowTo(DataRow.class))
.keyBy(new Function<DataRow, DataKey>() {
@Override
public DataKey call(DataRow v1) throws Exception {
return new DataKey(v1.timestamp, v1.id);
}
});
JavaRDD<String> cassandraRowsRDD = nameIndexRDD.join(dataRDD)
.map(new Function<Tuple2<DataKey, Tuple2<NameRow, DataRow>>, String>() {
@Override
public String call(Tuple2<DataKey, Tuple2<NameRow, DataRow>> v1) throws Exception {
NameRow nameRow = v1._2()._1();
DataRow dataRow = v1._2()._2();
return nameRow + " " + dataRow;
}
});
List<String> collect = cassandraRowsRDD.collect();
}

A more efficient way to do this join is to call joinWithCassandraTable. This can be done by wrapping the results with another javaFunctions call:
private static void moreEfficientJoin() {
SparkConf conf = new SparkConf().setAppName("Simple Application")
.setMaster("local[*]")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<DataKey> nameIndexRDD = sc.parallelize(javaFunctions(sc).cassandraTable("my_keyspace", "name", mapRowTo(DataKey.class))
.where("name = 'John'")
.collect());
JavaRDD<Data> dataRDD = javaFunctions(nameIndexRDD).joinWithCassandraTable("my_keyspace", "data", allColumns, someColumns("timestamp", "id"), mapRowTo(Data.class), mapToRow(DataKey.class))
.map(new Function<Tuple2<DataKey, Data>, Data>() {
@Override
public Data call(Tuple2<DataKey, Data> v1) throws Exception {
return v1._2();
}
});
List<Data> data = dataRDD.collect();
}
The important thing is to wrap a JavaRDD with javaFunctions, so it is possible to avoid calling collect and sc.parallelize on nameIndexRDD.
