Process binary data in Spark Structured Streaming - java

I am using Kafka and Spark Structured Streaming. I am receiving kafka messages in following format.
{"deviceId":"001","sNo":1,"data":"aaaaa"}
{"deviceId":"002","sNo":1,"data":"bbbbb"}
{"deviceId":"001","sNo":2,"data":"ccccc"}
{"deviceId":"002","sNo":2,"data":"ddddd"}
I am reading it like below.
Dataset<String> data = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option(subscribeType, topics)
.load()
.selectExpr("CAST(value AS STRING)")
.as(Encoders.STRING());
Dataset<DeviceData> ds = data.as(ExpressionEncoder.javaBean(DeviceData.class)).orderBy("deviceId","sNo");
ds.foreach(event ->
processData(event.getDeviceId(),event.getSNo(),event.getData().getBytes())
);}
private void processData(String deviceId,int SNo, byte[] data)
{
//How to check previous processed Dataset???
}
In my json message "data" is String form of byte[]. I have a requirement where I need to process the binary "data" for given "deviceId" in order of "sNo". So for "deviceId"="001", I have to process the binary data for "sNo"=1 and then "sNo"=2 and so on. How can I check state of previous processed Dataset in Structured Streaming?

If you are looking for state management like DStream.mapWithState then it is not supported yet in Structured Streaming. Work is in progress. Please Check
https://issues.apache.org/jira/browse/SPARK-19067.

Related

How to consume correctly from Kafka topic with Java Spark structured streaming

I am new to kafka-spark streaming and trying to implement the examples from spark documentation with a Protocol buffer serializer/deserializer. So far I followed the official tutorials on
https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
https://developers.google.com/protocol-buffers/docs/javatutorial
and now I stuck on with the following problem. This question might be similar with this post How to deserialize records from Kafka using Structured Streaming in Java?
I already implemented successful the serializer which writes the messages on the kafka topic. Now the task is to consume it with spark structured streaming with a custom deserializer.
public class CustomDeserializer implements Deserializer<Person> {
#Override
public Person deserialize(String topic, byte[] data) {
Person person = null;
try {
person = Person.parseFrom(data);
return person;
} catch (Exception e) {
//ToDo
}
return null;
}
Dataset<Row> dataset = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic)
.option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
.option("value.deserializer", "de.myproject.CustomDeserializer")
.load()
.select("value");
dataset.writeStream()
.format("console")
.start()
.awaitTermination();
But as output I still get the binaries.
-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+
| value|
+--------------------+
|[08 AC BD BB 09 1...|
+--------------------+
-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+
| value|
+--------------------+
|[08 82 EF D8 08 1...|
+--------------------+
Regarding the tutorial I just need to put the option for the value.deserializer to have a human readable format
.option("value.deserializer", "de.myproject.CustomDeserializer")
Did I miss something?
Did you miss this section of the documentation?
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
You'll have to register a UDF that invokes your deserializers instead
Similar to Read protobuf kafka message using spark structured streaming
You need to convert byte to String datatype.
dataset.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Then you can use functions.from_json(dataset.col("value"), StructType) to get back the actual DF.
Happy Coding :)

Weird error while parsing JSON in Apache Spark

Trying to parse a JSON document and Spark gives me an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
...
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at xxx.MyClass.xxx(MyClass.java:25)
I already tried to open the JSON doc in several online editors and it's valid.
This is my code:
Dataset<Row> df = spark.read()
.format("json")
.load("file.json");
df.show(3); // this is line 25
I am using Java 8 and Spark 2.4.
The _corrupt_record column is where Spark stores malformed records when it tries to ingest them. That could be a hint.
Spark also process two types of JSON documents, JSON Lines and normal JSON (in the earlier versions Spark could only do JSON Lines). You can find more in this Manning article.
You can try the multiline option, as in:
Dataset<Row> df = spark.read()
.format("json")
.option("multiline", true)
.load("file.json");
to see if it helps. If not, share your JSON doc (if you can).
set the multiline option to true. If it does not work share your json

window Data hourly(clockwise) basis in Apache beam

I am trying to aggregate streaming data for each hour(like 12:00 to 12:59 and 01:00 to 01:59) in DataFlow/Apache Beam Job.
Following is my use case
Data is streaming from pubsub, It has a timestamp(order date). I want to count no of orders in each hour i am getting, Also i want to allow delay of 5 hours. Following is my sample code that I am using
LOG.info("Start Running Pipeline");
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
PCollection<String> tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));
PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
.apply("Flatten Data from PubSub", Flatten.<String>pCollections());
flattenData
.apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
// Adding Window
.apply(
Window.<SalesAndUnits>into(
SlidingWindows.of(Duration.standardMinutes(15))
.every(Duration.standardMinutes(1)))
)
// Data Enrich with Dimensions
.apply(ParDo.of(new DataEnrichWithDimentions()))
// Group And Hourly Sum
.apply(new GroupAndSumSales())
.apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
pipeline.run();
LOG.info("Finish Running Pipeline");
I'd the use a window with the requirements you have. Something along the lines of
Window.into(
FixedWindows.of(Duration.standardHours(1))
).withAllowedLateness(Duration.standardHours(5)))
Possibly followed by a count as that's what I understood you need.
Hope it helps

How to reduce task size in Spark MLib?

I'm trying to implement Random Forest Classifier using Apache Spark (2.2.0) and Java.
Basically I've followed the example from the Spark documentation
For test purposes I'm using a local cluster:
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.appName(appName)
.getOrCreate();
My training/test data includes 30k rows. Data is fetched from REST APIs and transformed to Spark DataSet.
List<PreparedWUMLogFile> logs = //... get from REST API
Dataset<PreparedWUMLogFile> dataSet = spark.createDataset(logs, Encoders.bean(PreparedWUMLogFile.class));
Dataset<Row> data = dataSet.toDF();
For many stages I get the following warning message:
[warn] o.a.s.s.TaskSetManager - Stage 0 contains a task of very large size (3002 KB). The maximum recommended task size is 100 KB.
How I can reduce the task size in this case?
Edit:
To be more concrete: There are 5 of 30 stages that produce these warning messages.
rdd at StringIndexer.scala:111 (two times)
take at VectorIndexer.scala:119
rdd at VectorIndexer.scala:122
rdd at Classifier.scala:82

How to write selected columns to Kafka topic?

I am using spark-sql-2.4.1v with java 1.8.
and kafka versions spark-sql-kafka-0-10_2.11_2.4.3 and kafka-clients_0.10.0.0
StreamingQuery queryComapanyRecords =
comapanyRecords
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.option("auto.create.topics.enable", "false")
.option("key.serializer","org.apache.kafka.common.serialization.StringDeserializer")
.option("value.serializer", "com.spgmi.ca.prescore.serde.MessageRecordSerDe")
.option("checkpointLocation", "/app/chkpnt/" )
.outputMode("append")
.start();
queryLinkingMessageRecords.awaitTermination();
Giving error :
Caused by: org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:121)
I tried to fix as below, but unable to send the value i.e. which is a java bean in my case.
StreamingQuery queryComapanyRecords =
comapanyRecords.selectExpr("CAST(company_id AS STRING) AS key", "to_json(struct(\"company_id\",\"fiscal_year\",\"fiscal_quarter\")) AS value")
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.start();
So is there anyway in java how to handle/send this value( i.e. Java
bean as record) ??.
Kafka data source requires a specific schema for reading (loading) and writing (saving) datasets.
Quoting the official documentation (highlighting the most important field / column):
Each row in the source has the following schema:
...
value binary
...
In other words, you have Kafka records in the value column when reading from a Kafka topic and you have to make your data to save to a Kafka topic available in the value column as well.
In other words, whatever is or is going to be in Kafka is in the value column. The value column is where you "store" business records (the data).
On to your question:
How to write selected columns to Kafka topic?
You should "pack" the selected columns together so they can all together be part of the value column. to_json standard function is a good fit so the selected columns are going to be a JSON message.
Example
Let me give you an example.
Don't forget to start a Spark application or spark-shell with the Kafka data source. Mind the versions of Scala (2.11 or 2.12) and Spark (e.g. 2.4.4).
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Let's start by creating a sample dataset. Any multiple-field dataset would work.
val ns = Seq((0, "zero")).toDF("id", "name")
scala> ns.show
+---+----+
| id|name|
+---+----+
| 0|zero|
+---+----+
If we tried to write the dataset to a Kafka topic, it would error out due to value column missing. That's what you faced initially.
scala> ns.write.format("kafka").option("topic", "in_topic").save
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$.$anonfun$validateQuery$6(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:138)
...
You have to come up with a way to "pack" multiple fields (columns) together and make it available as value column. struct and to_json standard functions will do it.
val vs = ns.withColumn("value", to_json(struct("id", "name")))
scala> vs.show(truncate = false)
+---+----+----------------------+
|id |name|value |
+---+----+----------------------+
|0 |zero|{"id":0,"name":"zero"}|
+---+----+----------------------+
Saving to a Kafka topic should now be a breeze.
vs.write.format("kafka").option("topic", "in_topic").save

Categories