Is there any way to connect to Hive from Spark without using "hive-site.xml"?
SparkLauncher sl = new SparkLauncher(evnProps);
sl.addSparkArg("--verbose");
sl.addAppArgs(appArgs);
sl.addFile(evnProps.get(KEY_YARN_CONF_DIR) + "/hive-site.xml");
We are passing "hive-site.xml" to SparkLauncher.I want to remove dependency on "hive-site.xml"enter code here.
Spark SQL supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory where the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of databases in the warehouse. You may need to grant write privileges to the user who starts the Spark application.
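In practice this means that, when a remote metastore is reachable, the connection details can be supplied programmatically while building the SparkSession, so no hive-site.xml is needed on the classpath. The sketch below is a minimal, unverified example; the thrift URI and warehouse path are placeholders, and on some Spark versions the metastore key may need the spark.hadoop. prefix:
// Minimal sketch: enable Hive support and point Spark at the metastore directly,
// instead of relying on conf/hive-site.xml. Placeholder values throughout.
SparkSession spark = SparkSession
  .builder()
  .appName("Hive without hive-site.xml")
  .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate();
The full example from the Spark documentation then proceeds as usual: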
import java.io.File;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public static class Record implements Serializable {
  private int key;
  private String value;

  public int getKey() {
    return key;
  }

  public void setKey(int key) {
    this.key = key;
  }

  public String getValue() {
    return value;
  }

  public void setValue(String value) {
    this.value = value;
  }
}
// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...
// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+
// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
// The items in DataFrames are of type Row, which lets you access each column by ordinal.
Dataset<String> stringsDS = sqlDF.map(
    (MapFunction<Row, String>) row -> "Key: " + row.get(0) + ", Value: " + row.get(1),
    Encoders.STRING());
stringsDS.show();
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...
// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<>();
for (int key = 1; key < 100; key++) {
Record record = new Record();
record.setKey(key);
record.setValue("val_" + key);
records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");
// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// ...
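Coming back to the SparkLauncher setup from the question: instead of shipping hive-site.xml with addFile, the same settings could be passed as Spark configuration on the launcher. This is only a sketch, not verified against your cluster; the metastore URI is a placeholder, and the spark.hadoop. prefix is used so the value reaches the Hadoop/Hive configuration on the driver and executors.
// Hedged sketch: configure Hive access through the launcher instead of hive-site.xml.
SparkLauncher sl = new SparkLauncher(evnProps);
sl.addSparkArg("--verbose");
sl.addAppArgs(appArgs);
sl.setConf("spark.sql.catalogImplementation", "hive");
sl.setConf("spark.hadoop.hive.metastore.uris", "thrift://<metastore-host>:9083");
sl.setConf("spark.sql.warehouse.dir", "/user/hive/warehouse");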
I have a dataframe (Dataset<Row>) with six columns. Four of them need to be grouped, and for the other two columns the grouped values may repeat n times, depending on the varying values in those two columns.
The dataset looks like this:
id | batch | batch_Id | session_name | time | value
001| abc | 098 | course-I | 1551409926133 | 2.3
001| abc | 098 | course-I | 1551404747843 | 7.3
001| abc | 098 | course-I | 1551409934220 | 6.3
I tried something like this:
Dataset<Row> df2 = df.select("*")
.groupBy(col("id"), col("batch_Id"), col("session_name"))
.agg(max("time"));
I added agg to get the grouped output, but I don't know how to achieve what I need.
Help much appreciated... Thank you.
I don't think you were too far off.
Given your first dataset:
+---+-----+--------+------------+-------------+-----+
| id|batch|batch_Id|session_name| time|value|
+---+-----+--------+------------+-------------+-----+
|001| abc| 098| course-I|1551409926133| 2.3|
|001| abc| 098| course-I|1551404747843| 7.3|
|001| abc| 098| course-I|1551409934220| 6.3|
|002| def| 097| course-II|1551409926453| 2.3|
|002| def| 097| course-II|1551404747843| 7.3|
|002| def| 097| course-II|1551409934220| 6.3|
+---+-----+--------+------------+-------------+-----+
And assuming your desired output is:
+---+--------+------------+-------------+
| id|batch_Id|session_name| max(time)|
+---+--------+------------+-------------+
|002| 097| course-II|1551409934220|
|001| 098| course-I|1551409934220|
+---+--------+------------+-------------+
I would write the following code for the aggregation:
Dataset<Row> maxValuesDf = rawDf.select("*")
.groupBy(col("id"), col("batch_id"), col("session_name"))
.agg(max("time"));
And the whole app would look like:
package net.jgp.books.spark.ch13.lab900_max_value;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class MaxValueAggregationApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    MaxValueAggregationApp app = new MaxValueAggregationApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Aggregates max values")
        .master("local[*]")
        .getOrCreate();

    // Reads a CSV file with header, called courses.csv, stores it in a dataframe
    Dataset<Row> rawDf = spark.read().format("csv")
        .option("header", true)
        .option("sep", "|")
        .load("data/misc/courses.csv");

    // Shows at most 20 rows from the dataframe
    rawDf.show(20);

    // Performs the aggregation, grouping on columns id, batch_id, and session_name
    Dataset<Row> maxValuesDf = rawDf.select("*")
        .groupBy(col("id"), col("batch_id"), col("session_name"))
        .agg(max("time"));
    maxValuesDf.show(5);
  }
}
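If, besides the maximum, you also need to keep the individual values per group (the question mentions that the two non-grouped columns repeat), a hedged variation using collect_list could look like the following; the column names are taken from the sample data and may need adjusting:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.max;

// Sketch only: keeps every value per group alongside the maximum time.
Dataset<Row> groupedDf = rawDf
    .groupBy(col("id"), col("batch_id"), col("session_name"))
    .agg(
        max("time").alias("max_time"),
        collect_list(col("value")).alias("values"));
groupedDf.show(5, false);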
Does it help?
Background: In one of my projects I am doing component testing on Spring Batch using JUnit. The application DB is MySQL. In the JUnit test execution I let the data source switch between
MySQL and
H2 (jdbc:h2:mem:MYTESTDB;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE;MODE=MYSQL)
based on configuration.
I use MySQL as the data source for debugging purposes and H2 to run the tests in isolation on build servers.
Everything worked fine until the application logic required a query with DATEDIFF.
Issue: the query fails with
org.h2.jdbc.JdbcSQLException: Syntax error in SQL statement
Reason: even though H2 runs in MySQL mode, it uses H2's own functions, and those functions are defined differently:
MYSQL DATEDIFF definition is DATEDIFF(expr1,expr2)
e.g. SELECT DATEDIFF('2010-11-30 23:59:59','2010-12-31')
==> 1
H2 DATEDIFF definition is DATEDIFF(unitstring, expr1, expr2)
unitstring = { YEAR | YY | MONTH | MM | WEEK | DAY | DD | DAY_OF_YEAR
| DOY | HOUR | HH | MINUTE | MI | SECOND | SS | MILLISECOND | MS }
e.g. SELECT DATEDIFF(dd, '2010-11-30 23:59:59','2010-12-31')
==> 1
Solutions tried and failed: I tried writing a custom function
package com.asela.util;

import java.sql.Date;
import java.time.temporal.ChronoUnit;
import java.util.Objects;

public class H2Function {

    public static long dateDifference(Date date1, Date date2) {
        // Fail fast if either argument is null
        Objects.requireNonNull(date1);
        Objects.requireNonNull(date2);
        return ChronoUnit.DAYS.between(date1.toLocalDate(), date2.toLocalDate());
    }
}
and registered it with H2:
DROP ALIAS IF EXISTS DATEDIFF;
CREATE ALIAS DATEDIFF FOR "com.asela.util.H2Function.dateDifference";
The above was not able to replace the existing DATEDIFF; it still fails with
org.h2.jdbc.JdbcSQLException: Function alias "DATEDIFF" already exists; SQL statement:
Any other approach I can try to make this work?
I got a workaround for the problem using reflection: access H2's internal FUNCTIONS map, remove DATEDIFF from it, and then register the replacement function.
package com.asela.util;
import java.lang.reflect.Field;
import java.sql.Date;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.Objects;
import org.h2.expression.Function;
public class H2Function {

    @SuppressWarnings("rawtypes")
    public static int removeDateDifference() {
        try {
            // H2 keeps its built-in functions in a static FUNCTIONS map;
            // remove the built-in DATEDIFF so our alias can take its place.
            Field field = Function.class.getDeclaredField("FUNCTIONS");
            field.setAccessible(true);
            ((Map) field.get(null)).remove("DATEDIFF");
        } catch (Exception e) {
            throw new RuntimeException("failed to remove date-difference", e);
        }
        return 0;
    }

    public static long dateDifference(Date date1, Date date2) {
        Objects.requireNonNull(date1);
        Objects.requireNonNull(date2);
        return ChronoUnit.DAYS.between(date1.toLocalDate(), date2.toLocalDate());
    }
}
Then, in the schema script:
CREATE ALIAS IF NOT EXISTS REMOVE_DATE_DIFF FOR "com.asela.util.H2Function.removeDateDifference";
CALL REMOVE_DATE_DIFF();
DROP ALIAS IF EXISTS DATEDIFF;
CREATE ALIAS DATEDIFF FOR "com.asela.util.H2Function.dateDifference";
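For completeness, a hedged sketch of how this schema script could be run from a JUnit bootstrap with plain JDBC against the in-memory URL from the question. The class and method names here are illustrative, not part of the original setup:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical test bootstrap: swaps H2's built-in DATEDIFF for the custom alias
// before the Spring Batch tests run against the in-memory database.
public class H2FunctionBootstrap {
    public static void register() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:h2:mem:MYTESTDB;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE;MODE=MYSQL");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE ALIAS IF NOT EXISTS REMOVE_DATE_DIFF FOR \"com.asela.util.H2Function.removeDateDifference\"");
            stmt.execute("CALL REMOVE_DATE_DIFF()");
            stmt.execute("DROP ALIAS IF EXISTS DATEDIFF");
            stmt.execute("CREATE ALIAS DATEDIFF FOR \"com.asela.util.H2Function.dateDifference\"");
        }
    }
}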
We are using Apache Spark with the mongo-spark library (for connecting to MongoDB) and the spark-redshift library (for connecting to the Amazon Redshift DWH).
We are experiencing very bad performance in our job, so I am hoping to get some help understanding whether we are doing anything wrong in our program or whether this is simply what we can expect with the infrastructure we are using.
We are running our job with the Mesos resource manager on 4 AWS EC2 nodes, each with the following configuration:
RAM: 16 GB, CPU cores: 4, SSD: 200 GB
We have 3 tables in Redshift cluster:
TABLE_NAME SCHEMA NUMBER_OF_ROWS
table1 (table1Id, table2FkId, table3FkId, ...) 50M
table2 (table2Id, phonenumber, email,...) 700M
table3 (table3Id, ...) 2K
In MongoDB we have a collection of 35 million documents; a sample document is shown below (all emails and phone numbers are unique, no duplication):
{
  "_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c",
  "idKeys": {
    "email": [
      "a@gmail.com",
      "b@gmail.com"
    ],
    "phonenumber": [
      "1111111111",
      "2222222222"
    ]
  },
  "flag": false,
  ...
}
We filter and flatten this with the mongo-spark connector (see the aggregation pipeline at the end of the code) into the following format, because we need to JOIN the Redshift and Mongo data ON an email OR phonenumber match. (The other option is array_contains() in Spark SQL, which is a bit slow; a sketch of that alternative follows the sample rows below.)
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c", "email": "a@gmail.com", "phonenumber": null},
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c", "email": "b@gmail.com", "phonenumber": null},
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c", "email": null, "phonenumber": "1111111111"},
{"_id": "19ac0487-a75f-49d9-928e-c300e0ac7c7c", "email": null, "phonenumber": "2222222222"}
Spark computation steps (please refer to the code below to understand these steps better):
1. First we load all the data from the 3 Redshift tables into table1Dataset, table2Dataset, and table3Dataset respectively, using the spark-redshift connector.
2. We join these 3 tables with Spark SQL and create a new Dataset redshiftJoinedDataset. (This operation on its own finishes in 6 hours.)
3. We load the MongoDB data into mongoDataset using the mongo-spark connector.
4. We join mongoDataset and redshiftJoinedDataset. (Here is the bottleneck, as we need to join over 50 million rows from Redshift with over 100 million flattened rows from MongoDB.)
Note: mongo-spark also seems to have some internal issue with its aggregation pipeline execution, which might be making it very slow.
5. Then we do some aggregation and group the data on finalId.
Here is the code for the steps mentioned above:
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import java.util.Arrays;
public class SparkMongoRedshiftTest {

    private static SparkSession sparkSession;
    private static SparkContext sparkContext;
    private static SQLContext sqlContext;

    public static void main(String[] args) {
        sparkSession = SparkSession.builder().appName("redshift-spark-test").getOrCreate();
        sparkContext = sparkSession.sparkContext();
        sqlContext = new SQLContext(sparkContext);

        Dataset<Row> table1Dataset = executeRedshiftQuery("(SELECT table1Id,table2FkId,table3FkId FROM table1)");
        table1Dataset.createOrReplaceTempView("table1Dataset");

        Dataset<Row> table2Dataset = executeRedshiftQuery("(SELECT table2Id,phonenumber,email FROM table2)");
        table2Dataset.createOrReplaceTempView("table2Dataset");

        Dataset<Row> table3Dataset = executeRedshiftQuery("(SELECT table3Id FROM table3)");
        table3Dataset.createOrReplaceTempView("table3Dataset");

        Dataset<Row> redshiftJoinedDataset = sqlContext.sql(" SELECT a.*,b.*,c.*" +
                " FROM table1Dataset a " +
                " LEFT JOIN table2Dataset b ON a.table2FkId = b.table2Id" +
                " LEFT JOIN table3Dataset c ON a.table3FkId = c.table3Id");
        redshiftJoinedDataset.createOrReplaceTempView("redshiftJoinedDataset");

        JavaMongoRDD<Document> userIdentityRDD = MongoSpark.load(getJavaSparkContext());
        Dataset<Row> mongoDataset = userIdentityRDD.withPipeline(
                Arrays.asList(
                        Document.parse("{$match: {flag: false}}"),
                        Document.parse("{ $unwind: { path: \"$idKeys.email\" } }"),
                        Document.parse("{$group: {_id: \"$_id\",emailArr: {$push: {email: \"$idKeys.email\",phonenumber: {$ifNull: [\"$description\", null]}}},\"idKeys\": {$first: \"$idKeys\"}}}"),
                        Document.parse("{$unwind: \"$idKeys.phonenumber\"}"),
                        Document.parse("{$group: {_id: \"$_id\",phoneArr: {$push: {phonenumber: \"$idKeys.phonenumber\",email: {$ifNull: [\"$description\", null]}}},\"emailArr\": {$first: \"$emailArr\"}}}"),
                        Document.parse("{$project: {_id: 1,value: {$setUnion: [\"$emailArr\", \"$phoneArr\"]}}}"),
                        Document.parse("{$unwind: \"$value\"}"),
                        Document.parse("{$project: {email: \"$value.email\",phonenumber: \"$value.phonenumber\"}}")
                )).toDF();
        mongoDataset.createOrReplaceTempView("mongoDataset");

        Dataset<Row> joinRedshiftAndMongoDataset = sqlContext.sql(" SELECT a.* , b._id AS finalId " +
                " FROM redshiftJoinedDataset AS a INNER JOIN mongoDataset AS b " +
                " ON b.email = a.email OR b.phonenumber = a.phonenumber");

        // aggregating joinRedshiftAndMongoDataset
        // then storing to MySQL
    }

    private static Dataset<Row> executeRedshiftQuery(String query) {
        return sqlContext.read()
                .format("com.databricks.spark.redshift")
                .option("url", "jdbc://...")
                .option("query", query)
                .option("aws_iam_role", "...")
                .option("tempdir", "s3a://...")
                .load();
    }

    public static JavaSparkContext getJavaSparkContext() {
        sparkContext.conf().set("spark.mongodb.input.uri", "");
        sparkContext.conf().set("spark.sql.crossJoin.enabled", "true");
        return new JavaSparkContext(sparkContext);
    }
}
The estimated time to finish this job on the infrastructure mentioned above is over 2 months.
So to summarize the joins quantitatively:
RedshiftDataWithMongoDataJoin => (RedshiftDataJoin) INNER_JOIN (MongoData)
=> (50M LEFT_JOIN 700M LEFT_JOIN 2K) INNER_JOIN (~100M)
=> (50M) INNER_JOIN (~100M)
Any help with this will be appreciated.
After a lot of investigation we found that 90% of the data in table2 had either email or phonenumber null, and I had missed handling joins on null values in the query.
That was the main cause of the performance bottleneck.
After fixing this problem the job now runs within 2 hours.
So there are no issues with spark-redshift or mongo-spark; they are performing exceptionally well :)
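For readers hitting the same issue: the exact fix in the original job isn't shown, but it boils down to keeping null join keys out of the OR-join. A rough sketch of one possible shape of such a filter (column names as in the question):
import static org.apache.spark.sql.functions.col;

// Rough sketch: drop rows whose join keys are both null before the OR-join,
// so the join only compares usable emails/phonenumbers.
Dataset<Row> joinable = redshiftJoinedDataset
    .filter(col("email").isNotNull().or(col("phonenumber").isNotNull()));
joinable.createOrReplaceTempView("redshiftJoinedDataset");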
I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and uses less Java code.
List<Row> lsts= countsByAge.collectAsList();
for(Row lst:lsts) {
String man = lst.toString();
man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
DF.show();
}
The code, input, and output datasets are shown below.
package org.sparkexample;

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class GroupBy {

    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\winutils");
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("split datasets").getOrCreate();
        sc.setLogLevel("ERROR");

        Dataset<Row> src = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("sample.csv");

        // Distinct manufacturer values
        Dataset<Row> unq_manf = src.select("Manufacturer").distinct();

        // Collect the distinct values to the driver and filter the source once per value
        List<Row> lsts = unq_manf.collectAsList();
        for (Row lst : lsts) {
            String man = lst.toString();
            man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
            Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
            DF.show();
        }
    }
}
Input Table
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
| 632| west| Pipe Wrenches|126|
| 804| bil| Masonry Brushes|127|
| 497| west| Power Tools Other|128|
| 496| west| Power Tools Other|129|
| 495| bil| Hole Saws|130|
| 499| bil| Battery Chargers|131|
| 497| west| Power Tools Other|132|
+------+------------+--------------------+---+
Output
+------------+
|Manufacturer|
+------------+
| ael|
| west|
| bil|
+------------+
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
+------+------------+--------------------+---+
+------+------------+-----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+-----------------+---+
| 632| west| Pipe Wrenches|126|
| 497| west|Power Tools Other|128|
| 496| west|Power Tools Other|129|
| 497| west|Power Tools Other|132|
+------+------------+-----------------+---+
+------+------------+----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+----------------+---+
| 804| bil| Masonry Brushes|127|
| 495| bil| Hole Saws|130|
| 499| bil|Battery Chargers|131|
+------+------------+----------------+---+
You have two choices in this case.
First, you can collect the unique manufacturer values and then map over the resulting array:
val df = Seq(("HP", 1), ("Brother", 2), ("Canon", 3), ("HP", 5)).toDF("k", "v")
val brands = df.select("k").distinct.collect.flatMap(_.toSeq)
val BrandArray = brands.map(brand => df.where($"k" <=> brand))
BrandArray.foreach { x =>
x.show()
println("---------------------------------------")
}
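Since the question uses the Java API, a hedged Java equivalent of the snippet above (against the asker's src dataframe and Manufacturer column) might look like:
import static org.apache.spark.sql.functions.col;

// Collect the distinct manufacturers once, then filter the source per value.
List<Row> manufacturers = src.select("Manufacturer").distinct().collectAsList();
for (Row r : manufacturers) {
    Dataset<Row> perManufacturer = src.filter(col("Manufacturer").eqNullSafe(r.getString(0)));
    perManufacturer.show();
    System.out.println("---------------------------------------");
}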
You can also save the data frame partitioned by the manufacturer column (here k):
df.write.partitionBy("k").saveAsTable("parquet")
Instead of splitting the dataset/dataframe by manufacturer, it might be optimal to write the dataframe out using manufacturer as the partition key, if you need to query by manufacturer frequently.
In case you still want separate dataframes based on one of the column values, one approach using PySpark and Spark 2.0+ could be:
from pyspark.sql import functions as F
df = spark.read.csv("sample.csv",header=True)
# collect list of manufacturers
manufacturers = df.select('manufacturer').distinct().collect()
# loop through manufacturers to filter df by manufacturers and write it separately
for m in manufacturers:
    df1 = df.where(F.col('manufacturer') == m[0])
    # optionally repartition before writing: df1 = df1.repartition(repartition_col)
    df1.write.parquet(<write_path>)  # optionally pass a write mode
I get some entries from a stream in the Linux terminal, assign them as lines, and break them into words. But instead of printing them out, I want to save them to Cassandra.
I have a keyspace named ks, with a table inside it named record.
I know that some code like CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra(); should do the job, but I guess I am doing something wrong. Can someone help?
Here is my Cassandra ks.record schema (I added this data through CQLSH):
id | birth_date | name
----+---------------------------------+-----------
10 | 1987-12-01 23:00:00.000000+0000 | Catherine
11 | 2004-09-07 22:00:00.000000+0000 | Isadora
1 | 2016-05-10 13:00:04.452000+0000 | John
2 | 2016-05-10 13:00:04.452000+0000 | Troy
12 | 1970-10-01 23:00:00.000000+0000 | Anna
3 | 2016-05-10 13:00:04.452000+0000 | Andrew
Here is my Java code:
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import java.util.Arrays;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.*;
public class CassandraStreaming2 {

    public static void main(String[] args) {
        // Create a local StreamingContext with two working threads and a batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("CassandraStreaming");
        JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = sc.socketTextStream("localhost", 9999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(
                (FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" "))
        );

        words.print();
        //CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra();

        sc.start();            // Start the computation
        sc.awaitTermination(); // Wait for the computation to terminate
    }
}
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md#saving-data-to-cassandra
As per the docs, you also need to pass a RowWriterFactory. The most common way to do this is with the mapToRow(Class) API; that is the missing parameter described there.
But you have an additional problem: your code doesn't yet shape the data in a way that can be written to C*. You have a JavaDStream of only Strings, and a single String cannot be made into a Cassandra row for your given schema.
Basically you are telling the connector:
Write "hello" to CassandraTable (id, birthday, value)
without telling it where the "hello" goes (what should the id be? what should the birthday be?).
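A rough sketch of what the missing pieces could look like, assuming a simple serializable bean whose fields line up with the ks.record columns. The bean, the placeholder id/birth_date values, and the default camelCase-to-underscore column mapping are assumptions here, not taken from the question:
// Hypothetical bean matching the ks.record schema (id, birth_date, name).
public static class WordRecord implements java.io.Serializable {
    private Integer id;
    private java.util.Date birthDate;
    private String name;
    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public java.util.Date getBirthDate() { return birthDate; }
    public void setBirthDate(java.util.Date birthDate) { this.birthDate = birthDate; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

// Map each word into a WordRecord before saving; the id and birth_date values
// below are placeholders -- real code needs a meaningful way to derive them.
JavaDStream<WordRecord> records = words.map(w -> {
    WordRecord r = new WordRecord();
    r.setId(w.hashCode());
    r.setBirthDate(new java.util.Date());
    r.setName(w);
    return r;
});

CassandraStreamingJavaUtil.javaFunctions(records)
    .writerBuilder("ks", "record", mapToRow(WordRecord.class))
    .saveToCassandra();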