How to save data from Spark Streaming to Cassandra using Java?

I get some entries from a stream in a Linux terminal, assign them as lines, and break them into words. But instead of printing them out, I want to save them to Cassandra.
I have a keyspace named ks, with a table inside it named record.
I know that some code like CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra(); should do the job, but I guess I am doing something wrong. Can someone help?
Here is my Cassandra ks.record table content (I added this data through cqlsh):
 id | birth_date                      | name
----+---------------------------------+-----------
 10 | 1987-12-01 23:00:00.000000+0000 | Catherine
 11 | 2004-09-07 22:00:00.000000+0000 | Isadora
  1 | 2016-05-10 13:00:04.452000+0000 | John
  2 | 2016-05-10 13:00:04.452000+0000 | Troy
 12 | 1970-10-01 23:00:00.000000+0000 | Anna
  3 | 2016-05-10 13:00:04.452000+0000 | Andrew
Here is my Java code:
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

import java.util.Arrays;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.*;

public class CassandraStreaming2 {

    public static void main(String[] args) {

        // Create a local StreamingContext with two working threads and a batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("CassandraStreaming");
        JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = sc.socketTextStream("localhost", 9999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(
                (FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" "))
        );

        words.print();
        //CassandraStreamingJavaUtil.javaFunctions(words).writerBuilder("ks", "record").saveToCassandra();

        sc.start();              // Start the computation
        sc.awaitTermination();   // Wait for the computation to terminate
    }
}

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md#saving-data-to-cassandra
As per the docs, you also need to pass a RowWriterFactory. The most common way to do this is to use the mapToRow(Class) API; that is the missing parameter.
But you have an additional problem: your code doesn't yet specify the data in a way that can be written to C*. You have a JavaDStream of only Strings, and a single String cannot be made into a Cassandra row for your given schema.
Basically you are telling the connector
Write "hello" to the Cassandra table (id, birth_date, name)
without telling it where the "hello" goes (what should the id be? what should the birth_date be?).
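For illustration, here is one way that gap could be filled (a sketch, not from the original answer: the WordRecord bean and the choice of id and birthDate values are invented placeholders, and the connector's default JavaBean mapping is assumed to pair the birthDate property with the birth_date column):
public class WordRecord implements java.io.Serializable {
    private Integer id;
    private java.util.Date birthDate;
    private String name;

    public WordRecord() {}

    public WordRecord(Integer id, java.util.Date birthDate, String name) {
        this.id = id;
        this.birthDate = birthDate;
        this.name = name;
    }

    // Getters/setters are required by the connector's JavaBean column mapping.
    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public java.util.Date getBirthDate() { return birthDate; }
    public void setBirthDate(java.util.Date birthDate) { this.birthDate = birthDate; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

// Inside main(), after building `words`:
// id and birthDate below are arbitrary placeholders -- decide what they should
// really be for your data before writing to ks.record.
JavaDStream<WordRecord> records = words.map(
        w -> new WordRecord(w.hashCode(), new java.util.Date(), w));

CassandraStreamingJavaUtil.javaFunctions(records)
        .writerBuilder("ks", "record", mapToRow(WordRecord.class))
        .saveToCassandra();
The key point is the third argument to writerBuilder: mapToRow(WordRecord.class) supplies the RowWriterFactory that is missing from the commented-out line in the question.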

Related

Set default bin value if not present in aerospike

Suppose I have 2 bins in an Aerospike set:
1. number (key)
2. timeLeft
I want to get the timeLeft value from Aerospike for a number.
But if that particular record is not present, I want to create the record, set a default value of 6000 for timeLeft, and then get the value, all in a single transaction.
public Record someMethod(String num) {
    WritePolicy writePolicy = aerospikeRepo.getWritePolicy(null, ttl, true);
    return aerospikeRepo.operate(writePolicy, namespace, set, num, Operation.get());
}
Personally, I think the .operate() method of the Aerospike client should be used somehow, but I did not find a relevant Operation to set the default value if it is not present.
You can do it using Expressions. Here is sample code:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.WritePolicy;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.ResultCode;
import com.aerospike.client.Operation;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.exp.ExpOperation;
import com.aerospike.client.exp.ExpWriteFlags;
import com.aerospike.client.exp.Expression;
import java.util.List;

System.out.println("Client modules imported.");

AerospikeClient client = new AerospikeClient("localhost", 3000);
WritePolicy wP = new WritePolicy();
wP.respondAllOps = true;

int iNumber = 11;
int iTimeLeft = 6000;

for (int i = 0; i < 5; i++) {
    Key key = new Key("test", "testset", iNumber);
    Expression tlExp = Exp.build(Exp.val(iTimeLeft));
    Record record = client.operate(wP, key,
            // Write the default only if the bin does not already exist; otherwise skip silently.
            ExpOperation.write("timeLeft", tlExp, ExpWriteFlags.CREATE_ONLY | ExpWriteFlags.POLICY_NO_FAIL),
            //ExpOperation.write("timeLeft", tlExp, ExpWriteFlags.DEFAULT),
            Operation.get("timeLeft"));
    List<?> list = record.getList("timeLeft");
    System.out.println(list.get(1));
    iTimeLeft = iTimeLeft - 1000; // should not alter the stored bin value
}
This gives the following output:
Client modules imported.
6000
6000
6000
6000
6000
However, if I use DEFAULT instead, the stored value is modified on every iteration, which is what you don't want. The flags above (CREATE_ONLY | POLICY_NO_FAIL) silently skip the write when the bin already exists, i.e. the value is only set if the bin is absent. With DEFAULT the output becomes:
Client modules imported.
6000
5000
4000
3000
2000
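Folded back into the shape of the asker's someMethod (a sketch, not from the original answer; the client, namespace, and set fields are assumed to already exist in the surrounding class), the same expression write looks like this:
public Record someMethod(String num) {
    WritePolicy writePolicy = new WritePolicy();
    writePolicy.respondAllOps = true;
    Key key = new Key(namespace, set, num);
    // Write the default 6000 into timeLeft only when the bin is absent (CREATE_ONLY),
    // silently skipping the write otherwise (POLICY_NO_FAIL), then read the value back
    // in the same call.
    Expression defaultTimeLeft = Exp.build(Exp.val(6000));
    return client.operate(writePolicy, key,
            ExpOperation.write("timeLeft", defaultTimeLeft,
                    ExpWriteFlags.CREATE_ONLY | ExpWriteFlags.POLICY_NO_FAIL),
            Operation.get("timeLeft"));
}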

Grouping multiple columns without aggregation

I have a dataframe (Dataset<Row>) with six columns. Four of them need to be grouped, and for the other two columns the grouped columns may be repeated n times, depending on the varying values in those two columns.
The required dataset looks like this:
id  | batch | batch_Id | session_name | time          | value
001 | abc   | 098      | course-I     | 1551409926133 | 2.3
001 | abc   | 098      | course-I     | 1551404747843 | 7.3
001 | abc   | 098      | course-I     | 1551409934220 | 6.3
I tried something like this:
Dataset<Row> df2 = df.select("*")
    .groupBy(col("id"), col("batch_Id"), col("session_name"))
    .agg(max("time"));
I added agg to get the groupBy output, but I don't know how to achieve what I need.
Help much appreciated... Thank you.
I don't think you were too far off.
Given your first dataset:
+---+-----+--------+------------+-------------+-----+
| id|batch|batch_Id|session_name| time|value|
+---+-----+--------+------------+-------------+-----+
|001| abc| 098| course-I|1551409926133| 2.3|
|001| abc| 098| course-I|1551404747843| 7.3|
|001| abc| 098| course-I|1551409934220| 6.3|
|002| def| 097| course-II|1551409926453| 2.3|
|002| def| 097| course-II|1551404747843| 7.3|
|002| def| 097| course-II|1551409934220| 6.3|
+---+-----+--------+------------+-------------+-----+
And assuming your desired output is:
+---+--------+------------+-------------+
| id|batch_Id|session_name| max(time)|
+---+--------+------------+-------------+
|002| 097| course-II|1551409934220|
|001| 098| course-I|1551409934220|
+---+--------+------------+-------------+
I would write the following code for the aggregation:
Dataset<Row> maxValuesDf = rawDf.select("*")
    .groupBy(col("id"), col("batch_id"), col("session_name"))
    .agg(max("time"));
And the whole app would look like:
package net.jgp.books.spark.ch13.lab900_max_value;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaxValueAggregationApp {

    /**
     * main() is your entry point to the application.
     *
     * @param args
     */
    public static void main(String[] args) {
        MaxValueAggregationApp app = new MaxValueAggregationApp();
        app.start();
    }

    /**
     * The processing code.
     */
    private void start() {
        // Creates a session on a local master
        SparkSession spark = SparkSession.builder()
                .appName("Aggregates max values")
                .master("local[*]")
                .getOrCreate();

        // Reads a pipe-separated CSV file with header, called courses.csv, into a dataframe
        Dataset<Row> rawDf = spark.read().format("csv")
                .option("header", true)
                .option("sep", "|")
                .load("data/misc/courses.csv");

        // Shows at most 20 rows from the dataframe
        rawDf.show(20);

        // Performs the aggregation, grouping on columns id, batch_id, and session_name
        Dataset<Row> maxValuesDf = rawDf.select("*")
                .groupBy(col("id"), col("batch_id"), col("session_name"))
                .agg(max("time"));
        maxValuesDf.show(5);
    }
}
Does it help?

Connect to Hive from Spark without using "hive-site.xml"

Is there any way to connect to Hive from Spark without using "hive-site.xml"?
SparkLauncher sl = new SparkLauncher(evnProps);
sl.addSparkArg("--verbose");
sl.addAppArgs(appArgs);
sl.addFile(evnProps.get(KEY_YARN_CONF_DIR) + "/hive-site.xml");
We are passing "hive-site.xml" to SparkLauncher. I want to remove the dependency on "hive-site.xml".
Spark SQL supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory from which the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of databases in the warehouse. You may need to grant write privileges to the user who starts the Spark application.
import java.io.File;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public static class Record implements Serializable {
    private int key;
    private String value;

    public int getKey() {
        return key;
    }

    public void setKey(int key) {
        this.key = key;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }
}
// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...
// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+
// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
// The items in DataFrames are of type Row, which lets you access each column by ordinal.
Dataset<String> stringsDS = sqlDF.map(
(MapFunction<Row, String>) row -> "Key: " + row.get(0) + ", Value: " + row.get(1),
Encoders.STRING());
stringsDS.show();
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...
// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<>();
for (int key = 1; key < 100; key++) {
    Record record = new Record();
    record.setKey(key);
    record.setValue("val_" + key);
    records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");
// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// ...
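If the goal is specifically to stop shipping hive-site.xml through SparkLauncher, one commonly used alternative (a sketch, not taken from the quoted documentation; the thrift host, port, and warehouse path below are placeholders) is to pass the metastore location to the session builder programmatically:
SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        // Placeholder values -- point these at your real metastore and warehouse.
        .config("hive.metastore.uris", "thrift://metastore-host:9083")
        .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate();
Other Hive settings that would normally live in hive-site.xml can typically be supplied the same way, at the cost of hard-coding cluster details in the application or launcher configuration.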

MySQL DATEDIFF query fails in H2 when MODE=MYSQL

Background: In one of my projects I am doing component testing on Spring Batch using JUnit. The application DB is MySQL. In JUnit test execution I let the data source switch between
MySQL and
H2 (jdbc:h2:mem:MYTESTDB;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE;MODE=MYSQL)
based on configuration.
I use MySQL as the data source for debugging purposes and H2 to run the tests in isolation on build servers.
Everything worked fine until I had to use a query with DATEDIFF in the application logic.
Issue: Query fails with
org.h2.jdbc.JdbcSQLException: Syntax error in SQL statement
Reason: Even though H2 runs in MySQL mode, it uses H2 functions, and those functions are different.
The MySQL DATEDIFF definition is DATEDIFF(expr1, expr2)
e.g. SELECT DATEDIFF('2010-11-30 23:59:59','2010-12-31')
==> -31
The H2 DATEDIFF definition is DATEDIFF(unitstring, expr1, expr2)
unitstring = { YEAR | YY | MONTH | MM | WEEK | DAY | DD | DAY_OF_YEAR
| DOY | HOUR | HH | MINUTE | MI | SECOND | SS | MILLISECOND | MS }
e.g. SELECT DATEDIFF(dd, '2010-11-30 23:59:59','2010-12-31')
==> 1
Solutions tried and failed: I tried to write a custom function
package com.asela.util;

import java.sql.Date;
import java.time.temporal.ChronoUnit;
import java.util.Objects;

public class H2Function {

    public static long dateDifference(Date date1, Date date2) {
        // Fail fast if either date is missing.
        Objects.requireNonNull(date1);
        Objects.requireNonNull(date2);
        return ChronoUnit.DAYS.between(date1.toLocalDate(), date2.toLocalDate());
    }
}
And registered it with H2:
DROP ALIAS IF EXISTS DATEDIFF;
CREATE ALIAS DATEDIFF FOR "com.asela.util.H2Function.dateDifference";
The above was not able to replace the existing DATEDIFF; it still fails with
org.h2.jdbc.JdbcSQLException: Function alias "DATEDIFF" already exists; SQL statement:
Any other approach I can try to make this work?
I got a workaround for the problem using reflection: access H2's internal FUNCTIONS map, remove DATEDIFF from there, and then add the replacement function.
package com.asela.util;

import java.lang.reflect.Field;
import java.sql.Date;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.Objects;
import org.h2.expression.Function;

public class H2Function {

    @SuppressWarnings("rawtypes")
    public static int removeDateDifference() {
        try {
            // Reach into H2's internal registry of built-in functions and drop DATEDIFF
            // so that a user-defined alias with the same name can take its place.
            Field field = Function.class.getDeclaredField("FUNCTIONS");
            field.setAccessible(true);
            ((Map) field.get(null)).remove("DATEDIFF");
        } catch (Exception e) {
            throw new RuntimeException("failed to remove date-difference");
        }
        return 0;
    }

    public static long dateDifference(Date date1, Date date2) {
        // Fail fast if either date is missing.
        Objects.requireNonNull(date1);
        Objects.requireNonNull(date2);
        return ChronoUnit.DAYS.between(date1.toLocalDate(), date2.toLocalDate());
    }
}
Then in the schema:
CREATE ALIAS IF NOT EXISTS REMOVE_DATE_DIFF FOR "com.asela.util.H2Function.removeDateDifference";
CALL REMOVE_DATE_DIFF();
DROP ALIAS IF EXISTS DATEDIFF;
CREATE ALIAS DATEDIFF FOR "com.asela.util.H2Function.dateDifference";
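To sanity-check the replacement from a test (a sketch; it assumes the schema statements above have already been run against the same in-memory database in the same JVM), a plain JDBC query can exercise the two-argument form:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// inside a test method (throws SQLException)
try (Connection conn = DriverManager.getConnection(
             "jdbc:h2:mem:MYTESTDB;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE;MODE=MYSQL");
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT DATEDIFF('2010-11-30', '2010-12-31')")) {
    rs.next();
    // With the dateDifference alias above this prints 31, i.e.
    // ChronoUnit.DAYS.between(date1, date2).
    System.out.println(rs.getLong(1));
}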

Split dataset based on column values in spark

I am trying to split the Dataset into different Datasets based on the contents of the Manufacturer column. It is very slow. Please suggest a way to improve the code so that it executes faster and uses less Java code.
List<Row> lsts = countsByAge.collectAsList();
for (Row lst : lsts) {
    String man = lst.toString();
    man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
    Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
    DF.show();
}
The Code, Input and Output Datasets are as shown below.
package org.sparkexample;

import org.apache.parquet.filter2.predicate.Operators.Column;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

public class GroupBy {

    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\winutils");
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("split datasets").getOrCreate();
        sc.setLogLevel("ERROR");

        Dataset<Row> src = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .load("sample.csv");

        Dataset<Row> unq_manf = src.select("Manufacturer").distinct();
        List<Row> lsts = unq_manf.collectAsList();

        for (Row lst : lsts) {
            String man = lst.toString();
            man = man.replaceAll("[\\p{Ps}\\p{Pe}]", "");
            Dataset<Row> DF = src.filter("Manufacturer='" + man + "'");
            DF.show();
        }
    }
}
Input Table
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
| 632| west| Pipe Wrenches|126|
| 804| bil| Masonry Brushes|127|
| 497| west| Power Tools Other|128|
| 496| west| Power Tools Other|129|
| 495| bil| Hole Saws|130|
| 499| bil| Battery Chargers|131|
| 497| west| Power Tools Other|132|
+------+------------+--------------------+---+
Output
+------------+
|Manufacturer|
+------------+
| ael|
| west|
| bil|
+------------+
+------+------------+--------------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+--------------------+---+
| 804| ael|Brush & Broom Han...|123|
| 805| ael|Wheel Brush Parts...|124|
| 813| ael| Drivers Gloves|125|
+------+------------+--------------------+---+
+------+------------+-----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+-----------------+---+
| 632| west| Pipe Wrenches|126|
| 497| west|Power Tools Other|128|
| 496| west|Power Tools Other|129|
| 497| west|Power Tools Other|132|
+------+------------+-----------------+---+
+------+------------+----------------+---+
|ItemID|Manufacturer| Category name|UPC|
+------+------------+----------------+---+
| 804| bil| Masonry Brushes|127|
| 495| bil| Hole Saws|130|
| 499| bil|Battery Chargers|131|
+------+------------+----------------+---+
You have two choices in this case.
First, collect the unique manufacturer values and then map over the resulting array:
val df = Seq(("HP", 1), ("Brother", 2), ("Canon", 3), ("HP", 5)).toDF("k", "v")
val brands = df.select("k").distinct.collect.flatMap(_.toSeq)
val BrandArray = brands.map(brand => df.where($"k" <=> brand))
BrandArray.foreach { x =>
  x.show()
  println("---------------------------------------")
}
You can also save the data frame based on manufacturer.
df.write.partitionBy("k").saveAsTable("parquet")
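Translated to the Java Dataset<Row> from the question (a sketch; the output path is a placeholder), writing with the manufacturer as the partition key would look like:
// Writes one directory per manufacturer, e.g. .../Manufacturer=ael/, so later reads
// can filter on the partition column instead of scanning everything.
src.write()
        .partitionBy("Manufacturer")
        .mode("overwrite")
        .parquet("output/items_by_manufacturer");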
Instead of splitting the dataset/dataframe by manufacturer, it might be optimal to write the dataframe using manufacturer as the partition key if you need to query by manufacturer frequently.
In case you still want separate dataframes based on one of the column values, one approach using PySpark and Spark 2.0+ could be:
from pyspark.sql import functions as F

df = spark.read.csv("sample.csv", header=True)

# collect the list of distinct manufacturers
manufacturers = df.select('manufacturer').distinct().collect()

# loop through the manufacturers to filter df by manufacturer and write each subset separately
for m in manufacturers:
    df1 = df.where(F.col('manufacturer') == m[0])
    df1[.repartition(repartition_col)].write.parquet(<write_path>,[write_mode])
