I am using Spark Streaming to receive data from a ZeroMQ queue at a specific interval, enrich it, and save it as Parquet files. I want to compare data from one streaming window with another (later in time, using the Parquet files).
How can I find the timestamp of a specific streaming window, so that I can add it as another field during enrichment to facilitate my comparisons?
JavaStreamingContext javaStreamingContext = new JavaStreamingContext(sparkConf, new Duration(duration));
inputStream = javaStreamingContext.receiverStream(new StreamReceiver( hostName, port, StorageLevel.MEMORY_AND_DISK_SER()));
JavaDStream<myPojoFormat> enrichedData = inputStream.map(new Enricher());
In a nutshell, I want the timestamp of each streaming window (at batch level, not record level).
You can use the transform method of JavaDStream, which takes a Function2 as a parameter. The Function2 receives an RDD and a Time object and returns a new RDD. The overall result is a new JavaDStream in which each RDD has been transformed according to the logic you have chosen.
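As a minimal sketch of that idea, the helper below attaches batch bounds to a single record. StampedRecord and BatchStamper are hypothetical names, and the sketch assumes that the Time handed to transform marks the end of the batch interval, which is worth checking against your Spark version. The comment shows roughly how it would be wired into transform, reusing inputStream and duration from the question.

```java
import java.io.Serializable;

// Hypothetical value type: a record plus the bounds of the batch it arrived in.
final class StampedRecord implements Serializable {
    final String payload;
    final long batchStartMs;
    final long batchEndMs;

    StampedRecord(String payload, long batchStartMs, long batchEndMs) {
        this.payload = payload;
        this.batchStartMs = batchStartMs;
        this.batchEndMs = batchEndMs;
    }
}

final class BatchStamper {
    // Assumes batchTimeMs marks the end of the batch, so the batch began
    // one batch interval earlier.
    static StampedRecord stamp(String payload, long batchTimeMs, long batchDurationMs) {
        return new StampedRecord(payload, batchTimeMs - batchDurationMs, batchTimeMs);
    }

    // With Spark on the classpath, the wiring would look roughly like:
    //   JavaDStream<StampedRecord> stamped = inputStream.transform((rdd, time) -> {
    //       long batchMs = time.milliseconds();
    //       return rdd.map(r -> BatchStamper.stamp(r.toString(), batchMs, duration));
    //   });
    public static void main(String[] args) {
        StampedRecord r = stamp("payload", 60_000L, 10_000L);
        System.out.println(r.batchStartMs + " .. " + r.batchEndMs); // prints "50000 .. 60000"
    }
}
```

Because every record in a batch gets the same pair of values, the Parquet files can later be grouped or filtered by batchEndMs to compare one window with another.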
I have a Dataflow application (Java) running in GCP that reads data from a BigQuery table and writes it to Kafka. However, the application runs in batch mode, whereas I would like it to run as a stream that continuously reads data from the BigQuery table and writes it to a Kafka topic.
BigQuery table: a partitioned table with insert_time (timestamp of when the record was inserted into the table) and message columns.
PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
"select message,processed from `myprojectid.mydatasetname.mytablename` " +
"where processed = false " +
"order by insert_time desc ")
.apply("Windowing",Window.into(FixedWindows.of(Duration.standardMinutes(1))))
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
.apply("Writing Messages", KafkaIO.<String, String>write().
withBootstrapServers(bootStrapURLs).
withTopic(options.getKafkaInputTopics()).
withKeySerializer(StringSerializer.class).
withValueSerializer(StringSerializer.class).
withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected))
);
pipeline.run();
Note: I have tried the options below, but no luck yet.
Option 1. I tried options.streaming (true); it runs as a stream, but it finishes after the first successful write.
Option 2. Applied a trigger
Window.into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(2))
.accumulatingFiredPanes();
Option 3. Making unbounded forcibly
WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)
Any solution is appreciated.
Some of the advice in Side Input Patterns in the Beam Programming Guide may be helpful here, even though you aren't using this as a side input. In particular, that article discusses using GenerateSequence to periodically emit a value and trigger a read from a bounded source.
This could allow your one time query to become a repeated query that periodically emits new records. It will be up to your query logic to determine what range of the table to scan on each query, though, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.
Emitting into the global window would look like:
PCollectionView<Map<String, String>> map =
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(
ParDo.of(
new DoFn<Long, Map<String, String>>() {
@ProcessElement
public void process(
@Element Long input,
@Timestamp Instant timestamp,
OutputReceiver<Map<String, String>> o) {
// Read from BigQuery here and for each row output a record:
o.output(PlaceholderExternalService.readTestData(timestamp));
}
}))
.apply(
Window.<Map<String, String>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(View.asSingleton());
This assumes that the size of the query result is relatively small, since the read happens entirely within a DoFn invocation.
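For the query range mentioned above, one option is to derive a half-open time range from each tick so that consecutive queries do not overlap. The sketch below is only an illustration: PollQuery and forTick are hypothetical names, the table and the insert_time column are taken from the question, and it assumes rows become visible before the tick that covers them (otherwise they would be missed). Inside the DoFn, the Joda timestamp could be converted with java.time.Instant.ofEpochMilli(timestamp.getMillis()).

```java
import java.time.Duration;
import java.time.Instant;

final class PollQuery {
    // Builds a query for rows whose insert_time falls in [tick - period, tick).
    // Consecutive ticks that are exactly one period apart give non-overlapping ranges.
    static String forTick(Instant tick, Duration period) {
        Instant from = tick.minus(period);
        return "select message, processed from `myprojectid.mydatasetname.mytablename` "
            + "where insert_time >= TIMESTAMP('" + from + "') "
            + "and insert_time < TIMESTAMP('" + tick + "')";
    }

    public static void main(String[] args) {
        // Example tick five seconds after midnight, with the 5-second period used above.
        System.out.println(forTick(Instant.parse("2020-01-01T00:00:05Z"), Duration.ofSeconds(5)));
    }
}
```

Filtering on insert_time also lets BigQuery prune partitions, since the question states the table is partitioned on that column.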
I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using BigQueryIO write/writeTableRows. I explored a couple of patterns but none of them succeeded; a few of them are below.
Using SerializableFunction to determine TableDestination
.withSchema(TableSchemaFactory.buildLineageSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) or CREATE_NEVER
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
and then calling this function inside the .to() method
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
TimePartitioning timePartitioning = new TimePartitioning();
timePartitioning.setField("processingdate");
String dest = String.format("%s.%s.%s", project, dataset, table);
return new TableDestination(dest, null, timePartitioning);
I also tried to format the partition column obtained from input and add it as part of the String location with $ annotation, like below:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
input.get("processingDate")
...convert to string MMddYYYY format
TimePartitioning timePartitioning = new TimePartitioning();
timePartitioning.setField("processingdate");
String dest = String.format("%s.%s.%s$%s", project, dataset, table, convertedDate);
return new TableDestination(dest, null, timePartitioning);
However, none of them succeeded; each attempt fails with one of the following errors:
invalid timestamp
timestamp field value out of range
You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.
The destination table's partition is not supported for streaming. You can only stream to meta-table of date partitioned tables.
Streaming to metadata partition of column based partitioning table is disallowed.
I can't seem to get the right combination. Has anyone encountered the same issue before? Can anyone point me in the right direction or give me some pointers? What I want to achieve is to load the streaming data based on the defined date column, not on processing time.
Thank you!
I expect most of these issues will be solved if you drop the partition decorator from dest. In most cases the BigQuery APIs for loading data will be able to figure out the right partition based on the messages themselves.
So try changing your definition of dest to:
String dest = String.format("%s.%s.%s", project, dataset, table);
Today I found a very strange thing in a Kafka state store. I googled a lot but didn't find the reason for the behavior.
Consider the state store below, written in Java:
private KeyValueStore<String, GenericRecord> userIdToUserRecord;
There are two processors that use this state store.
topology.addStateStore(userIdToUserRecord, ALERT_PROCESSOR_NAME, USER_SETTING_PROCESSOR_NAME)
USER_SETTING_PROCESSOR_NAME puts data into the state store:
userIdToUserRecord.put("user-12345", record);
ALERT_PROCESSOR_NAME gets data from the state store:
userIdToUserRecord.get("user-12345");
Adding source to UserSettingProcessor
userSettingTopicName = user-setting-topic;
topology.addSource(sourceName, userSettingTopicName)
.addProcessor(processorName, UserSettingProcessor::new, sourceName);
Adding source to AlertEngineProcessor
alertTopicName = alert-topic;
topology.addSource(sourceName, alertTopicName)
.addProcessor(processorName, AlertEngineProcessor::new, sourceName);
Case 1:
Produce records using the Kafka producer in Java.
First, produce a record to the topic user-setting-topic using Java; this adds the user record to the state store.
Second, produce a record to the topic alert-topic using Java; this takes the record from the state store using the user id: userIdToUserRecord.get("user-12345");
This worked fine. I am using a Kafka Avro producer to produce records to both topics.
Case 2:
First, produce a record to the topic user-setting-topic using Python; this adds the user record to the state store: userIdToUserRecord.put("user-100", GenericRecord);
Second, produce a record to the topic alert-topic using Java; this takes the record from the state store using the user id: userIdToUserRecord.get("user-100");
The strange thing happens here: userIdToUserRecord.get("user-100") returns null.
I also checked the scenario like this:
I produced a record to user-setting-topic using Python, and the process method of UserSettingProcessor was triggered. In debug mode I tried to get the user record from the state store with userIdToUserRecord.get("user-100"), and it worked fine: inside UserSettingProcessor I am able to get the data from the state store.
Then I produced a record to alert-topic using Java and tried userIdToUserRecord.get("user-100"); it returned null.
I don't understand this strange behavior. Can anyone explain it?
Python code:
value_schema = avro.load('user-setting.avsc')
value = {
"user-id":"user-12345",
"client_id":"5cfdd3db-b25a-4e21-a67d-462697096e20",
"alert_type":"WORK_ORDER_VOLUME"
}
print("------------------------Kafka Producer------------------------------")
avroProducer = AvroProducer(
{'bootstrap.servers': 'localhost:9092', 'schema.registry.url': 'http://localhost:8089'},
default_value_schema=value_schema)
avroProducer.produce(topic="user-setting-topic", value=value)
print("------------------------Sucess Producer------------------------------")
avroProducer.flush()
Java Code:
Schema schema = new Schema.Parser().parse(schemaString);
GenericData.Record record = new GenericData.Record(schema);
record.put("alert_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
record.put("alert_created_at",123449437L);
record.put("alert_type","WORK_ORDER_VOLUME");
record.put("client_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
//record.put("property_key","property_key-"+i);
record.put("alert_data","{\"alert_trigger_info\":{\"jll_value\":1.4,\"jll_category\":\"internal\",\"name\":\"trade_Value\",\"current_value\":40,\"calculated_value\":40.1},\"work_order\":{\"locations\":{\"country_name\":\"value\",\"state_province\":\"value\",\"city\":\"value\"},\"property\":{\"name\":\"property name\"}}}");
return record;
The problem is that the Java producer and the Python producer (which is based on the C producer) use different default hash functions for data partitioning. You will need to provide a custom partitioner to one (or both) to make sure they use the same partitioning strategy.
Unfortunately, the Kafka protocol does not specify what the default partitioning hash function should be, and thus clients can use whatever they want by default.
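To make the mismatch concrete, here is a small sketch that maps the same key to a partition in two ways. It is an illustration under assumptions, not either client's actual code: the murmur2 routine is written out from the published algorithm as a stand-in for the Java producer's default, the CRC32 path stands in for librdkafka's default consistent_random partitioner, keys are assumed to be UTF-8 strings, and the class and method names are hypothetical. It matters for state stores because a Kafka Streams task only sees the store contents for its own partitions, so the two topics must place the same key on the same partition number.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class PartitionerComparison {

    // 32-bit Murmur2, written out here for illustration only.
    static int murmur2(byte[] data) {
        final int m = 0x5bd1e995;
        final int r = 24;
        int length = data.length;
        int h = 0x9747b28c ^ length;

        // Mix four bytes at a time (little-endian).
        for (int i = 0; i + 4 <= length; i += 4) {
            int k = (data[i] & 0xff)
                | ((data[i + 1] & 0xff) << 8)
                | ((data[i + 2] & 0xff) << 16)
                | ((data[i + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }

        // Mix the remaining 1 to 3 bytes; the cases fall through on purpose.
        int tail = length & ~3;
        switch (length % 4) {
            case 3: h ^= (data[tail + 2] & 0xff) << 16;
            case 2: h ^= (data[tail + 1] & 0xff) << 8;
            case 1: h ^= data[tail] & 0xff;
                    h *= m;
        }

        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Murmur2 of the key bytes, made non-negative, modulo the partition count.
    static int javaStylePartition(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return (murmur2(bytes) & 0x7fffffff) % numPartitions;
    }

    // CRC32 of the key bytes modulo the partition count.
    static int crc32StylePartition(String key, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6; // assumed partition count, for illustration only
        for (String key : new String[] {"user-100", "user-12345"}) {
            System.out.println(key + ": murmur2 -> " + javaStylePartition(key, partitions)
                + ", crc32 -> " + crc32StylePartition(key, partitions));
        }
    }
}
```

On the Python side, librdkafka documents a partitioner setting whose murmur2_random value is described as compatible with the Java producer's default, so adding 'partitioner': 'murmur2_random' to the producer configuration is one way to align the two. Both producers also need to send the same key bytes; the Python snippet above passes only a value to produce.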
I am new to Spark Streaming and Elasticsearch. I am trying to read data from a Kafka topic using Spark and store the data as an RDD. I want to append a timestamp to the RDD as soon as new data comes in, and then push it to Elasticsearch.
lines.foreachRDD(rdd -> {
if(!rdd.isEmpty()){
// rdd.collect().forEach(System.out::println);
String timeStamp = new
SimpleDateFormat("yyyy::MM::dd::HH::mm::ss").format(new Date());
List<String> myList = new ArrayList<String>(Arrays.asList(timeStamp.split("\\s+")));
List<String> f = rdd.collect();
Map<List<String>, ?> rddMaps = ImmutableMap.of(f, 1);
Map<List<String>, ?> myListrdd = ImmutableMap.of(myList, 1);
JavaRDD<Map<List<String>, ?>> javaRDD = sc.parallelize(ImmutableList.of(rddMaps));
JavaEsSpark.saveToEs(javaRDD, "sample/docs");
}
});
Spark?
As far as I understand, Spark Streaming is for real-time streaming data computation, such as map, reduce, join and window. There seems to be no need for such a powerful tool when all we need is to add a timestamp to each event.
Logstash?
If this is the situation, Logstash may be more suitable for our case.
Logstash records the timestamp when an event arrives, and it also has a persistent queue and dead letter queues that ensure data resiliency. It has native support for pushing data to ES (after all, they belong to the same family of products), which makes it very easy to push data there.
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash-%{type}-%{+YYYY.MM.dd}"
}
}
More
For more about Logstash, here is an introduction.
Here is a sample Logstash config file.
Hope this is helpful.
Ref
Deploying and Scaling Logstash
If all you're using Spark Streaming for is getting the data from Kafka to Elasticsearch, a neater way (and one not needing any coding) would be to use Kafka Connect.
There is an Elasticsearch Kafka Connect sink. Depending on what you want to do with a Timestamp (e.g. for index routing, or to add as a field) you can use Single Message Transforms (there's an example of them here).
We receive some text files every minute and aggregate them using Spark Streaming:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(60));
JavaDStream<String> file = ssc.textFileStream(inputDir);
However, once the aggregations are done, we want to join the aggregated JavaPairDStream<> with another feed, which is reference data and arrives every hour.
Is it possible in Spark Streaming to join 2 feeds arriving at different time-intervals?
Has anyone done this?
You should persist the result of the aggregation of your first feed as an RDD on disk, so that the DStream can remove the batch data from memory and keep going.
When your other DStream starts a batch, you can join it with the RDD created previously.
See Spark documentation on join
We looked at the Spark documentation on stream-dataset joins:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
and figured out that transformToPair has two overloads:
<K2,V2> JavaPairDStream<K2,V2> transformToPair(Function<R,JavaPairRDD<K2,V2>> transformFunc)
<K2,V2> JavaPairDStream<K2,V2> transformToPair(Function2<R,Time,JavaPairRDD<K2,V2>> transformFunc)
The overload that accepts the Time parameter can be used to refresh the dataset during each streaming interval. This solved our problem.
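A minimal sketch of the refresh logic, kept free of Spark so it can be run on its own: TimedRefresher is a hypothetical helper that reloads a cached value once it is older than a given age, measured against the batch time. The comment shows roughly how it might be combined with transformToPair; the names aggregated and reference are assumptions, and the sketch does not address serialization of the function, which matters if checkpointing is enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: returns a cached value and reloads it once it is
// older than maxAgeMs, measured against the batch time passed in.
final class TimedRefresher<T> {
    private final Supplier<T> loader;
    private final long maxAgeMs;
    private T cached;
    private long loadedAtMs;
    private boolean loaded;

    TimedRefresher(Supplier<T> loader, long maxAgeMs) {
        this.loader = loader;
        this.maxAgeMs = maxAgeMs;
    }

    synchronized T get(long batchTimeMs) {
        if (!loaded || batchTimeMs - loadedAtMs >= maxAgeMs) {
            cached = loader.get();
            loadedAtMs = batchTimeMs;
            loaded = true;
        }
        return cached;
    }

    // With Spark on the classpath, and reference being a
    // TimedRefresher<JavaPairRDD<K, R>> whose loader re-reads the hourly feed,
    // the join would look roughly like:
    //   aggregated.transformToPair(
    //       (rdd, time) -> rdd.join(reference.get(time.milliseconds())));
    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        TimedRefresher<String> ref =
            new TimedRefresher<>(() -> "reference-v" + loads.incrementAndGet(), 3_600_000L);
        System.out.println(ref.get(0L));         // first batch triggers a load
        System.out.println(ref.get(60_000L));    // one minute later: cached value
        System.out.println(ref.get(3_600_000L)); // one hour later: reloaded
    }
}
```

The function passed to transformToPair is evaluated once per batch, so the age check runs once per streaming interval, while the reference data itself is only re-read when it has expired.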