I am using spark-sql 2.4.1 with Java 1.8,
together with spark-sql-kafka-0-10_2.11 2.4.3 and kafka-clients 0.10.0.0.
StreamingQuery queryComapanyRecords =
comapanyRecords
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.option("auto.create.topics.enable", "false")
.option("key.serializer","org.apache.kafka.common.serialization.StringDeserializer")
.option("value.serializer", "com.spgmi.ca.prescore.serde.MessageRecordSerDe")
.option("checkpointLocation", "/app/chkpnt/" )
.outputMode("append")
.start();
queryLinkingMessageRecords.awaitTermination();
It gives this error:
Caused by: org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:121)
I tried to fix it as below, but I am unable to send the value, which is a Java bean in my case.
StreamingQuery queryComapanyRecords =
comapanyRecords.selectExpr("CAST(company_id AS STRING) AS key", "to_json(struct(\"company_id\",\"fiscal_year\",\"fiscal_quarter\")) AS value")
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.start();
So is there any way in Java to handle/send this value (i.e. a Java bean as the record)?
The Kafka data source requires a specific schema for reading (loading) and writing (saving) datasets.
Quoting the official documentation (highlighting the most important field / column):
Each row in the source has the following schema:
...
value binary
...
In other words, you have Kafka records in the value column when reading from a Kafka topic, and you have to make the data you want to save to a Kafka topic available in the value column as well.
Whatever is or is going to be in Kafka is in the value column. The value column is where you "store" business records (the data).
On to your question:
How to write selected columns to Kafka topic?
You should "pack" the selected columns together so they can all be part of the value column. The to_json standard function is a good fit: the selected columns become a single JSON message.
Example
Let me give you an example.
Don't forget to start a Spark application or spark-shell with the Kafka data source. Mind the versions of Scala (2.11 or 2.12) and Spark (e.g. 2.4.4).
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Let's start by creating a sample dataset. Any multiple-field dataset would work.
val ns = Seq((0, "zero")).toDF("id", "name")
scala> ns.show
+---+----+
| id|name|
+---+----+
| 0|zero|
+---+----+
If we tried to write the dataset to a Kafka topic, it would error out because the value column is missing. That's what you faced initially.
scala> ns.write.format("kafka").option("topic", "in_topic").save
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$.$anonfun$validateQuery$6(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:138)
...
You have to come up with a way to "pack" multiple fields (columns) together and make them available as the value column. The struct and to_json standard functions will do it.
val vs = ns.withColumn("value", to_json(struct("id", "name")))
scala> vs.show(truncate = false)
+---+----+----------------------+
|id |name|value |
+---+----+----------------------+
|0 |zero|{"id":0,"name":"zero"}|
+---+----+----------------------+
Saving to a Kafka topic should now be a breeze.
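// kafka.bootstrap.servers is required by the Kafka sink; the address below is an assumption, use your own broker
vs.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "in_topic").save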
I have around 2 billion rows in my Cassandra database, which I filter with the isin method based on an experimentlist with 4827 Strings, as shown below. However, I noticed that after the distinct command I have only 4774 unique rows. Any ideas why 53 are missing? Does the isin method have a threshold or limitation? I have double and triple checked the experimentlist: it does have 4827 Strings, and the other 53 strings do exist in the database, as I can query them with cqlsh. Any help much appreciated!
Dataset<Row> df1 = sp.read().format("org.apache.spark.sql.cassandra")
.options(new HashMap<String, String>() {
{
put("keyspace", "mdb");
put("table", "experiment");
}
})
.load().select(col("experimentid")).filter(col("experimentid").isin(experimentlist.toArray()));
List<String> tmplist=df1.distinct().as(Encoders.STRING()).collectAsList();
System.out.println("tmplist "+tmplist.size());
Regarding the actual question about "missing data" - there could be problems when your cluster has missing writes, and repair isn't done regularly. Spark Cassandra Connector (SCC) reads data with consistency level LOCAL_ONE, and may hit nodes without all data. You can try to set consistency level to LOCAL_QUORUM (via --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM), for example, and repeat the experiment, although it's better to make sure that data is repaired.
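When you repeat the experiment after changing the consistency level, it can help to list exactly which IDs did not come back rather than only comparing counts. A minimal plain-Java sketch, assuming experimentlist and tmplist are the two lists from the question; the class and method names are made up for illustration and are not part of Spark or SCC:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

final class MissingIds {
    // Returns the requested IDs that are absent from the returned IDs, sorted.
    static SortedSet<String> missing(Collection<String> requested, Collection<String> returned) {
        SortedSet<String> diff = new TreeSet<>(requested);
        diff.removeAll(returned);
        return diff;
    }

    public static void main(String[] args) {
        // Stand-ins for experimentlist and tmplist from the question.
        Collection<String> requested = Arrays.asList("exp1", "exp2", "exp3");
        Collection<String> returned = Arrays.asList("exp3", "exp1");
        System.out.println(missing(requested, returned)); // prints [exp2]
    }
}
```

If the set of missing IDs changes between runs, that would be consistent with replicas holding different data; if it is always the same 53, the cause is more likely elsewhere.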
Another problem is that you're using the .isin function, which translates into a query of the form SELECT ... FROM table WHERE partition_key IN (list). See the execution plan:
scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("m1", "test").load()
data: org.apache.spark.sql.DataFrame = [id: int, m: map<int,string>]
scala> data.filter($"id".isin(Seq(1,2,3,4):_*)).explain
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [id#169,m#170] PushedFilters: [*In(id, [1,2,3,4])], ReadSchema: struct<id:int,m:map<int,string>>
This query is very inefficient and puts additional load on the node that performs the query. SCC 2.5.0 has some optimizations around that, but it's better to use the so-called "Direct Join", which was also introduced in SCC 2.5.0: SCC then performs requests for specific partition keys in parallel, which is more efficient and puts less load on the nodes. You can use it as follows (the only difference is that my output shows "DSE Direct Join", while OSS SCC prints "Cassandra Direct Join"):
scala> val toJoin = Seq(1,2,3,4).toDF("id")
toJoin: org.apache.spark.sql.DataFrame = [id: int]
scala> val joined = toJoin.join(data, data("id") === toJoin("id"))
joined: org.apache.spark.sql.DataFrame = [id: int, id: int ... 1 more field]
scala> joined.explain
== Physical Plan ==
DSE Direct Join [id = id#189] test.m1 - Reading (id, m) Pushed {}
+- LocalTableScan [id#189]
This direct join optimization needs to be explicitly enabled as described in the documentation.
Trying to parse a JSON document and Spark gives me an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
...
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at xxx.MyClass.xxx(MyClass.java:25)
I already tried to open the JSON doc in several online editors and it's valid.
This is my code:
Dataset<Row> df = spark.read()
.format("json")
.load("file.json");
df.show(3); // this is line 25
I am using Java 8 and Spark 2.4.
The _corrupt_record column is where Spark stores malformed records when it tries to ingest them. That could be a hint.
Spark also processes two types of JSON documents, JSON Lines and normal JSON (earlier versions of Spark could only handle JSON Lines). You can find more in this Manning article.
You can try the multiline option, as in:
Dataset<Row> df = spark.read()
.format("json")
.option("multiline", true)
.load("file.json");
to see if it helps. If not, share your JSON doc (if you can).
Set the multiline option to true. If it does not work, share your JSON.
I have data in a JavaPairRDD in the format
JavaPairRDD<String, Tuple2<String,String>>
I tried using below code
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers or suggestions?
Try running printSchema: you will see that value2 is a complex type.
Having such information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");
value2._1 refers to the first element of the tuple inside the current value2 field. We overwrite the value2 field so that it holds a single value.
Note that this will work after https://issues.apache.org/jira/browse/SPARK-24548 is merged into the master branch. Currently there is a bug in Spark: the tuple is converted to a struct with two fields both named value.
I use Spark 2.1.1.
I have the following DataSet<Row> ds1;
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate DataSet<String> ds2. In other words, when I write to a Kafka sink I want to write something like this
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like this df2.toJSON().writeStream().foreach(new KafkaSink()).start() but then it gives the following error
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple however I am not sure how to leverage them here ?
I tried the following using json_tuple() function
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use struct function followed by to_json (as toJSON was broken for streaming datasets due to SPARK-17029 that got fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given you use Java API you have 4 different variants of struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is Scala code (translating it to Java is your home exercise):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems it works perfectly.
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.
Using Spark 2 / Java / Cassandra 2.2.
I am trying to run a simple Spark SQL query, and it errors.
I tried the code below, plus variations like "'LAX'" and '=' instead of '=='.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`LAX`' given input columns: [transdate, origin]; line 1 pos 42;
'Project ['origin]
+- 'Filter (origin#1 = 'LAX)
+- SubqueryAlias origins
+- LogicalRDD [transdate#0, origin#1]
JavaRDD<TransByDate> originDateRDD = javaFunctions(sc).cassandraTable("trans", "trans_by_date", CassandraJavaUtil.mapRowTo(TransByDate.class)).select(CassandraJavaUtil.column("origin"), CassandraJavaUtil.column("trans_date").as("transdate"));
long cnt1= originDateRDD.count();
System.out.println("sqlLike originDateRDD.count: "+cnt1); // --> 406000
Dataset<Row> originDF = sparks.createDataFrame(originDateRDD, TransByDate.class);
originDF.createOrReplaceTempView("origins");
Dataset<Row> originlike = sparks.sql("SELECT origin FROM origins WHERE origin =="+ "LAX");
I have enabled Hive Support (if that helps)
Thanks
Put the string value inside single quotes; without them, LAX is parsed as a column name. Your query should look like this:
Dataset<Row> originlike = spark.sql("SELECT origin FROM origins WHERE origin == "+"'LAX'");
You can refer to Querying Cassandra data using Spark SQL in Java for more detail.
A LIKE query should look like this:
Dataset<Row> originlike = spark.sql("SELECT origin FROM origins WHERE origin like 'LA%'");
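If the value comes from a variable rather than a hard-coded constant, the quoting can be wrapped in a small helper. This is a minimal sketch in plain Java; SqlLiteral, quote and originQuery are hypothetical names, not part of Spark, and the backslash escaping assumes Spark SQL's default string-literal parsing (it would differ if spark.sql.parser.escapedStringLiterals were enabled):

```java
final class SqlLiteral {
    // Hypothetical helper: turns a Java string into a single-quoted SQL string literal.
    static String quote(String value) {
        // Escape backslashes first, then single quotes, so the added escapes are not escaped again.
        String escaped = value.replace("\\", "\\\\").replace("'", "\\'");
        return "'" + escaped + "'";
    }

    // Builds the query from the question with the value properly quoted.
    static String originQuery(String origin) {
        return "SELECT origin FROM origins WHERE origin = " + quote(origin);
    }

    public static void main(String[] args) {
        // prints: SELECT origin FROM origins WHERE origin = 'LAX'
        System.out.println(originQuery("LAX"));
    }
}
```

The resulting string can then be passed to spark.sql(...). For untrusted input, filtering through the DataFrame API avoids string concatenation altogether.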
Hive is not the problem, here is the line that is your problem:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`LAX`' given input columns: [transdate, origin]; line 1 pos 42;
What this is saying is that none of the columns is named LAX: the unquoted LAX is parsed as a column name. The Scala DSL uses === to compare a column against a value, so perhaps something similar would be better, something like originDF.filter($"origin" === "LAX").