Split dataset based on column values in spark - java

I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and relies less on Java-side processing.
List<Row> lsts = unq_manf.collectAsList();
for (Row lst : lsts) {
    String man = lst.toString();
    man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
    Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
    DF.show();
}
The Code, Input and Output Datasets are as shown below.
package org.sparkexample;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class GroupBy {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\winutils");
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("split datasets").getOrCreate();
        sc.setLogLevel("ERROR");

        Dataset<Row> src = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("sample.csv");

        Dataset<Row> unq_manf = src.select("Manufacturer").distinct();
        List<Row> lsts = unq_manf.collectAsList();
        for (Row lst : lsts) {
            String man = lst.toString();
            man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
            Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
            DF.show();
        }
    }
}
Input Table
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
| 632| west| Pipe Wrenches|126|
| 804| bil| Masonry Brushes|127|
| 497| west| Power Tools Other|128|
| 496| west| Power Tools Other|129|
| 495| bil| Hole Saws|130|
| 499| bil| Battery Chargers|131|
| 497| west| Power Tools Other|132|
+------+------------+--------------------+---+
Output
+------------+
|Manufacturer|
+------------+
| ael|
| west|
| bil|
+------------+
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
+------+------------+--------------------+---+
+------+------------+-----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+-----------------+---+
| 632| west| Pipe Wrenches|126|
| 497| west|Power Tools Other|128|
| 496| west|Power Tools Other|129|
| 497| west|Power Tools Other|132|
+------+------------+-----------------+---+
+------+------------+----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+----------------+---+
| 804| bil| Masonry Brushes|127|
| 495| bil| Hole Saws|130|
| 499| bil|Battery Chargers|131|
+------+------------+----------------+---+

You have two choices in this case.
First option: collect the unique manufacturer values and then map
over the resulting array:
val df = Seq(("HP", 1), ("Brother", 2), ("Canon", 3), ("HP", 5)).toDF("k", "v")
val brands = df.select("k").distinct.collect.flatMap(_.toSeq)
val BrandArray = brands.map(brand => df.where($"k" <=> brand))
BrandArray.foreach { x =>
  x.show()
  println("---------------------------------------")
}
Second option: you can also save the data frame partitioned by the manufacturer column (here k), for example:
df.write.partitionBy("k").saveAsTable("brands")
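Since the original question is in Java, here is a rough Java sketch of the first approach (collect the distinct values once, then filter per value), applied to the src dataset from the question. The variable names are illustrative; this is only a sketch of the idea above, not the answer's original code:

// Collect the distinct manufacturers once on the driver.
Dataset<Row> manufacturers = src.select("Manufacturer").distinct();

// Build one filtered view per manufacturer without string concatenation,
// which also avoids the bracket-stripping regex from the question.
for (Row row : manufacturers.collectAsList()) {
    String manufacturer = row.getString(0);
    Dataset<Row> perManufacturer = src.filter(src.col("Manufacturer").equalTo(manufacturer));
    perManufacturer.show();
}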

Instead of splitting the dataset/dataframe by manufacturer, it might be better to write the dataframe out with manufacturer as the partition key if you need to query by manufacturer frequently.
In case you still want separate dataframes based on one of the column values, one approach using PySpark and Spark 2.0+ could be:
from pyspark.sql import functions as F

df = spark.read.csv("sample.csv", header=True)

# collect the list of distinct manufacturers
manufacturers = df.select('manufacturer').distinct().collect()

# loop over the manufacturers, filter df for each one and write it out separately
for m in manufacturers:
    df1 = df.where(F.col('manufacturer') == m[0])
    # optionally repartition and/or set a write mode before writing
    df1.write.parquet(<write_path>)
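For the partition-key option mentioned above, a minimal Java sketch could look like the following (the output path is just an example):

// Write the whole dataset once, partitioned by the Manufacturer column.
// Each manufacturer ends up in its own subdirectory, so later reads that
// filter on Manufacturer only need to scan the matching partition.
src.write()
   .partitionBy("Manufacturer")
   .mode("overwrite")
   .parquet("output/by_manufacturer");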

Related

Effective record linkage

I've asked a somewhat similar question earlier today. Here it is.
Shortly: I need to do record linkage between two large datasets (1.6M & 6M rows). I was going to use Spark, thinking that the Cartesian product I was warned about would not be such a big problem. But it is. It hit performance so hard that the linkage process didn't finish in 7 hours.
Is there another library/framework/tool for doing this more effectively? Or could the performance of the solution below be improved?
The code I ended up with:
import java.time.Duration

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, substring, to_date}

object App {

  def left(col: Column, n: Int) = {
    assert(n > 0)
    substring(col, 1, n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("MatchingApp")
      .getOrCreate()
    import spark.implicits._

    val a = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/a.csv")
      .withColumn("FULL_NAME", concat_ws(" ", col("FIRST_NAME"), col("LAST_NAME")))
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "yyyy-MM-dd"))

    val b = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/b.txt")
      .withColumn("FULL_NAME", concat_ws(" ", col("FIRST_NAME"), col("LAST_NAME")))
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "dd.MM.yyyy"))

    // @formatter:off
    val condition = a
      .col("FULL_NAME").contains(b.col("FIRST_NAME"))
      .and(a.col("FULL_NAME").contains(b.col("LAST_NAME")))
      .and(a.col("BIRTH_DATE").equalTo(b.col("BIRTH_DATE"))
        .or(a.col("STREET").startsWith(left(b.col("STR"), 3))))
    // @formatter:on

    val startMillis = System.currentTimeMillis();
    val res = a.join(b, condition, "left_outer")
    val count = res
      .filter(col("B_ID").isNotNull)
      .count()
    println(s"Count: $count")

    val executionTime = Duration.ofMillis(System.currentTimeMillis() - startMillis)
    println(s"Execution time: ${executionTime.toMinutes}m")
  }
}
The condition is probably too complicated, but it has to be that way.
You may improve the performance of your current solution by changing the logic of how you perform your linkage a bit:
First, perform an inner join of the a and b dataframes on the columns that you know match. In your case, these seem to be the LAST_NAME and FIRST_NAME columns.
Then filter the resulting dataframe with your specific complex conditions; in your case, birth dates are equal or the street matches.
Finally, if you also need to keep the non-linked records, perform a right join with the a dataframe.
Your code could be rewritten as follows:
import org.apache.spark.sql.functions.{col, substring, to_date}
import org.apache.spark.sql.SparkSession
import java.time.Duration

object App {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("MatchingApp")
      .getOrCreate()

    val a = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/a.csv")
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "yyyy-MM-dd"))

    val b = spark.read
      .format("csv")
      .option("header", true)
      .option("delimiter", ";")
      .load("/home/helveticau/workstuff/b.txt")
      .withColumn("BIRTH_DATE", to_date(col("BIRTH_DATE"), "dd.MM.yyyy"))

    val condition = a.col("BIRTH_DATE").equalTo(b.col("BIRTH_DATE"))
      .or(a.col("STREET").startsWith(substring(b.col("STR"), 1, 3)))

    val startMillis = System.currentTimeMillis();
    val res = a.join(b, Seq("LAST_NAME", "FIRST_NAME"))
      .filter(condition)
      // two following lines optional if you want to only keep records with not null B_ID
      .select("B_ID", "A_ID")
      .join(a, Seq("A_ID"), "right_outer")

    val count = res
      .filter(col("B_ID").isNotNull)
      .count()
    println(s"Count: $count")

    val executionTime = Duration.ofMillis(System.currentTimeMillis() - startMillis)
    println(s"Execution time: ${executionTime.toMinutes}m")
  }
}
So you will avoid cartesian product at the price of two joins instead of only one.
Example
With file a.csv containing the following data:
"A_ID";"FIRST_NAME";"LAST_NAME";"BIRTH_DATE";"STREET"
10;John;Doe;1965-10-21;Johnson Road
11;Rebecca;Davis;1977-02-27;Lincoln Road
12;Samantha;Johns;1954-03-31;Main Street
13;Roger;Penrose;1987-12-25;Oxford Street
14;Robert;Smith;1981-08-26;Canergie Road
15;Britney;Stark;1983-09-27;Alshire Road
And b.txt having the following data:
"B_ID";"FIRST_NAME";"LAST_NAME";"BIRTH_DATE";"STR"
29;John;Doe;21.10.1965;Johnson Road
28;Rebecca;Davis;28.03.1986;Lincoln Road
27;Shirley;Iron;30.01.1956;Oak Street
26;Roger;Penrose;25.12.1987;York Street
25;Robert;Dayton;26.08.1956;Canergie Road
24;Britney;Stark;22.06.1962;Algon Road
res dataframe will be:
+----+----+----------+---------+----------+-------------+
|A_ID|B_ID|FIRST_NAME|LAST_NAME|BIRTH_DATE|STREET |
+----+----+----------+---------+----------+-------------+
|10 |29 |John |Doe |1965-10-21|Johnson Road |
|11 |28 |Rebecca |Davis |1977-02-27|Lincoln Road |
|12 |null|Samantha |Johns |1954-03-31|Main Street |
|13 |26 |Roger |Penrose |1987-12-25|Oxford Street|
|14 |null|Robert |Smith |1981-08-26|Canergie Road|
|15 |null|Britney |Stark |1983-09-27|Alshire Road |
+----+----+----------+---------+----------+-------------+
Note: if your FIRST_NAME and LAST_NAME columns are not exactly the same, you can try to make them match using Spark's built-in functions, for instance (a short sketch follows this list):
trim to remove spaces at start and end of string
lower to transform the column to lower case (and thus ignore case in comparison)
What is really important is to have the maximum number of columns that exactly match.
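For illustration, a minimal Java sketch of that normalization, assuming Java Datasets a and b loaded like in the question (this is not part of the original answer, just an example of the idea):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;
import static org.apache.spark.sql.functions.trim;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Normalize the join keys so that stray whitespace and different casing
// do not prevent the exact matches the join relies on.
Dataset<Row> aNorm = a
    .withColumn("FIRST_NAME", lower(trim(col("FIRST_NAME"))))
    .withColumn("LAST_NAME", lower(trim(col("LAST_NAME"))));
Dataset<Row> bNorm = b
    .withColumn("FIRST_NAME", lower(trim(col("FIRST_NAME"))))
    .withColumn("LAST_NAME", lower(trim(col("LAST_NAME"))));

// Then join aNorm and bNorm on the normalized LAST_NAME / FIRST_NAME columns.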

Grouping multiple columns without aggregation

I have a dataframe (Dataset<Row>) with six columns; four of them need to be grouped, and for the other two columns the grouped columns may repeat n times, depending on the varying values in those two columns.
The dataset looks like below:
id | batch | batch_Id | session_name | time | value
001| abc | 098 | course-I | 1551409926133 | 2.3
001| abc | 098 | course-I | 1551404747843 | 7.3
001| abc | 098 | course-I | 1551409934220 | 6.3
I tried something like below:
Dataset<Row> df2 = df.select("*")
    .groupBy(col("id"), col("batch_Id"), col("session_name"))
    .agg(max("time"));
I added agg to get the groupBy output, but I don't know how to achieve what I need.
Help much appreciated... Thank you.
I don't think you were too far off.
Given your first dataset:
+---+-----+--------+------------+-------------+-----+
| id|batch|batch_Id|session_name| time|value|
+---+-----+--------+------------+-------------+-----+
|001| abc| 098| course-I|1551409926133| 2.3|
|001| abc| 098| course-I|1551404747843| 7.3|
|001| abc| 098| course-I|1551409934220| 6.3|
|002| def| 097| course-II|1551409926453| 2.3|
|002| def| 097| course-II|1551404747843| 7.3|
|002| def| 097| course-II|1551409934220| 6.3|
+---+-----+--------+------------+-------------+-----+
And assuming your desired output is:
+---+--------+------------+-------------+
| id|batch_Id|session_name| max(time)|
+---+--------+------------+-------------+
|002| 097| course-II|1551409934220|
|001| 098| course-I|1551409934220|
+---+--------+------------+-------------+
I would write the following code for the aggregation:
Dataset<Row> maxValuesDf = rawDf.select("*")
    .groupBy(col("id"), col("batch_id"), col("session_name"))
    .agg(max("time"));
And the whole app would look like:
package net.jgp.books.spark.ch13.lab900_max_value;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaxValueAggregationApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    MaxValueAggregationApp app = new MaxValueAggregationApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Aggregates max values")
        .master("local[*]")
        .getOrCreate();

    // Reads a CSV file with header, called courses.csv, and stores it in a
    // dataframe
    Dataset<Row> rawDf = spark.read().format("csv")
        .option("header", true)
        .option("sep", "|")
        .load("data/misc/courses.csv");

    // Shows at most 20 rows from the dataframe
    rawDf.show(20);

    // Performs the aggregation, grouping on columns id, batch_id, and
    // session_name
    Dataset<Row> maxValuesDf = rawDf.select("*")
        .groupBy(col("id"), col("batch_id"), col("session_name"))
        .agg(max("time"));
    maxValuesDf.show(5);
  }
}
Does it help?

Connect to Hive from Spark without using "hive-site.xml"

Is there any way to connect to Hive from Spark without using "hive-site.xml"?
SparkLauncher sl = new SparkLauncher(evnProps);
sl.addSparkArg("--verbose");
sl.addAppArgs(appArgs);
sl.addFile(evnProps.get(KEY_YARN_CONF_DIR) + "/hive-site.xml");
We are passing "hive-site.xml" to SparkLauncher. I want to remove the dependency on "hive-site.xml".
Spark SQL supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory where the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of databases in the warehouse. You may need to grant write privilege to the user who starts the Spark application.
import java.io.File;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public static class Record implements Serializable {
  private int key;
  private String value;

  public int getKey() {
    return key;
  }

  public void setKey(int key) {
    this.key = key;
  }

  public String getValue() {
    return value;
  }

  public void setValue(String value) {
    this.value = value;
  }
}
// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...
// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+
// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
// The items in DataFrames are of type Row, which lets you to access each column by ordinal.
Dataset<String> stringsDS = sqlDF.map(
    (MapFunction<Row, String>) row -> "Key: " + row.get(0) + ", Value: " + row.get(1),
    Encoders.STRING());
stringsDS.show();
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...
// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<>();
for (int key = 1; key < 100; key++) {
  Record record = new Record();
  record.setKey(key);
  record.setValue("val_" + key);
  records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");
// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// ...

Mysql SQL query DATEDIFF failed in H2 where mode was MYSQL

Background: In one of my projects I am doing component testing of Spring Batch using JUnit. The application DB is MySQL. In the JUnit test execution I let the data source switch between
MySQL and
H2 (jdbc:h2:mem:MYTESTDB;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE;MODE=MYSQL)
based on configuration.
I use MySQL as the data source for debugging purposes and H2 to run the tests in isolation on build servers.
Everything worked fine until, in the application logic, I had to use a query with DATEDIFF.
Issue: Query fails with
org.h2.jdbc.JdbcSQLException: Syntax error in SQL statement
Reason: even though H2 runs in MySQL mode, it uses H2's own functions, and those functions are defined differently.
MYSQL DATEDIFF definition is DATEDIFF(expr1,expr2)
e.g. SELECT DATEDIFF('2010-11-30 23:59:59','2010-12-31')
==> 1
H2 DATEDIFF definition is DATEDIFF(unitstring, expr1, expr2)
unitstring = { YEAR | YY | MONTH | MM | WEEK | DAY | DD | DAY_OF_YEAR
| DOY | HOUR | HH | MINUTE | MI | SECOND | SS | MILLISECOND | MS }
e.g. SELECT DATEDIFF(dd, '2010-11-30 23:59:59','2010-12-31')
==> 1
Solutions tried and failed: I tried to write a custom function:
package com.asela.util;

import java.sql.Date;
import java.time.temporal.ChronoUnit;
import java.util.Objects;

public class H2Function {

    public static long dateDifference(Date date1, Date date2) {
        Objects.requireNonNull(date1);
        Objects.requireNonNull(date2);
        return ChronoUnit.DAYS.between(date1.toLocalDate(), date2.toLocalDate());
    }
}
And set it with H2
DROP ALIAS IF EXISTS DATEDIFF;
CREATE ALIAS DATEDIFF FOR "com.asela.util.H2Function.dateDifference";
The above was not able to replace the existing DATEDIFF; it still fails with
org.h2.jdbc.JdbcSQLException: Function alias "DATEDIFF" already exists; SQL statement:
Is there any other approach I can try to make this work?
I got a workaround for the problem using reflection: access H2's internal FUNCTIONS map and remove DATEDIFF from there, then add the replacement function.
package com.asela.util;

import java.lang.reflect.Field;
import java.sql.Date;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.Objects;

import org.h2.expression.Function;

public class H2Function {

    @SuppressWarnings("rawtypes")
    public static int removeDateDifference() {
        try {
            Field field = Function.class.getDeclaredField("FUNCTIONS");
            field.setAccessible(true);
            ((Map) field.get(null)).remove("DATEDIFF");
        } catch (Exception e) {
            throw new RuntimeException("failed to remove date-difference");
        }
        return 0;
    }

    public static long dateDifference(Date date1, Date date2) {
        Objects.requireNonNull(date1);
        Objects.requireNonNull(date2);
        return ChronoUnit.DAYS.between(date1.toLocalDate(), date2.toLocalDate());
    }
}
Then, in the schema:
CREATE ALIAS IF NOT EXISTS REMOVE_DATE_DIFF FOR "com.asela.util.H2Function.removeDateDifference";
CALL REMOVE_DATE_DIFF();
DROP ALIAS IF EXISTS DATEDIFF;
CREATE ALIAS DATEDIFF FOR "com.asela.util.H2Function.dateDifference";

How to save data from spark streaming to cassandra using java?

I get some entries from the stream in a Linux terminal, assign them as lines, and break them into words. But instead of printing them out I want to save them to Cassandra.
I have a keyspace named ks, with a table inside it named record.
I know that some code like CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra(); has to do the job, but I guess I am doing something wrong. Can someone help?
Here is my Cassandra ks.record schema (I added these data through CQLSH)
id | birth_date | name
----+---------------------------------+-----------
10 | 1987-12-01 23:00:00.000000+0000 | Catherine
11 | 2004-09-07 22:00:00.000000+0000 | Isadora
1 | 2016-05-10 13:00:04.452000+0000 | John
2 | 2016-05-10 13:00:04.452000+0000 | Troy
12 | 1970-10-01 23:00:00.000000+0000 | Anna
3 | 2016-05-10 13:00:04.452000+0000 | Andrew
Here is my Java code :
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.Arrays;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.*;
public class CassandraStreaming2 {

    public static void main(String[] args) {
        // Create a local StreamingContext with two working thread and batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("CassandraStreaming");
        JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = sc.socketTextStream("localhost", 9999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(
            (FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" "))
        );

        words.print();
        //CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra();

        sc.start();            // Start the computation
        sc.awaitTermination(); // Wait for the computation to terminate
    }
}
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md#saving-data-to-cassandra
As per the docs, you also need to pass a RowWriter factory. The most common way to do this is the mapToRow(Class) API; that is the missing parameter described there.
But you have an additional problem: your code doesn't yet specify the data in a way that can be written to C*. You have a JavaDStream of only Strings, and a single String cannot be made into a Cassandra row for your given schema.
Basically you are telling the connector:
Write "hello" to CassandraTable (id, birthday, value)
without telling it where the hello goes (what should the id be? what should the birthday be?).
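To make that concrete, here is a minimal, hypothetical sketch of how the write could look once each word is mapped onto a bean matching the ks.record schema. The WordRecord class and the hashCode-based id are placeholders (not the connector's prescribed approach); the essential part is the mapToRow(...) factory passed to writerBuilder:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;

import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.streaming.api.java.JavaDStream;

// Hypothetical bean whose properties line up with columns of ks.record.
// The birth_date column is not covered by this bean; extend it if you need that column.
public static class WordRecord implements Serializable {
    private Integer id;
    private String name;

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

// Map each word to a bean before writing; the id here is a placeholder,
// you would plug in your own key-generation logic.
JavaDStream<WordRecord> records = words.map(word -> {
    WordRecord r = new WordRecord();
    r.setId(word.hashCode());
    r.setName(word);
    return r;
});

// Pass the RowWriter factory (mapToRow) that was missing in the question.
CassandraStreamingJavaUtil.javaFunctions(records)
    .writerBuilder("ks", "record", mapToRow(WordRecord.class))
    .saveToCassandra();

The exact bean-to-column mapping conventions are described in the connector's Java API documentation linked above.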
