Spark streaming - join JavaDStream feed (textFileStream) with a reference data file

Spark streaming - join JavaDStream feed (textFileStream) with a reference data file - java

We are getting some text files every 1-minute and we aggregate it using Spark Streaming
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(60);
JavaDStream<String> file = ssc.textFileStream(inputDir)
However, upon aggregations are done we want to join the aggregated JavaPairDStream<> with another feed which is a reference data and arrives every 1-hour.
Is it possible in Spark Streaming to join 2 feeds arriving at different time-intervals?
Has anyone done this?

You should persist the result of the aggregation of your first feed as a RDD on disk so the DStream could remove the batch data from memory and keep going.
When your other DStream starts a batch, you can join it with the RDD created previously.
See Spark documentation on join

We looked at Spark Documentation on the Stream-DatSet joins:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
and figured out that there is a overloaded method in
<K2,V2> JavaPairDStream<K2,V2> transformToPair(Function<R,JavaPairRDD<K2,V2>> transformFunc)
<K2,V2> JavaPairDStream<K2,V2> transformToPair(Function2<R,Time,JavaPairRDD<K2,V2>> transformFunc)
which accepts the time param and it can actually refresh the dataset during each streaming interval. This solved our problem.

Related

GCP - Bigquery to Kafka as streaming

I have a dataflow application(java) which is running in gcp and able to read the data from bigquery table and write to Kafka. But the application running as a batch mode, where as I would like make application as stream to read the data continuously from bigquery table and write to kafka topic.
Bigquery Table: Partitioned table with insert_time ( timestamp of record inserted intable) and message column
PCollection<TableRow> tablesRows = BigQueryUtil.readFromTable(pipeline,
"select message,processed from `myprojectid.mydatasetname.mytablename` " +
"where processed = false " +
"order by insert_time desc ")
.apply("Windowing",Window.into(FixedWindows.of(Duration.standardMinutes(1))));
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn()))
.apply("Writing Messages", KafkaIO.<String, String>write().
withBootstrapServers(bootStrapURLs).
withTopic(options.getKafkaInputTopics()).
withKeySerializer(StringSerializer.class).
withValueSerializer(StringSerializer.class).
withProducerFactoryFn(new ProducerFactoryFn(sslConfig, projected))
);
pipeline.run();
Note: I have tried below options but no luck yet
Options 1. I tried the options of options.streaming (true); its running as stream but it will finish on the first success write.
Options 2. Applied trigger
Window.into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(2))
.accumulatingFiredPanes();
Option 3. Making unbounded forcibly
WindowingStrategy<?, ?> windowingStrategy = tablesRows.setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED).getWindowingStrategy();
.apply("Converting to writable message", ParDo.of(new ProcessRowDoFn())).setIsBoundedInternal(PCollection.IsBounded.UNBOUNDED)
Any solution is appreciated.

Some of the advice in Side Input Patterns in the Beam Programming Guide may be helpful here, even though you aren't using this as a side input. In particular, that article discusses using GenerateSequence to periodically emit a value and trigger a read from a bounded source.
This could allow your one time query to become a repeated query that periodically emits new records. It will be up to your query logic to determine what range of the table to scan on each query, though, and I expect it will be difficult to avoid emitting duplicate records. Hopefully your use case can tolerate that.
Emitting into the global window would look like:
PCollectionView<Map<String, String>> map =
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(
ParDo.of(
new DoFn<Long, Map<String, String>>() {
#ProcessElement
public void process(
#Element Long input,
#Timestamp Instant timestamp,
OutputReceiver<Map<String, String>> o) {
// Read from BigQuery here and for each row output a record: o.output(PlaceholderExternalService.readTestData(timestamp)
);
}
}))
.apply(
Window.<Map<String, String>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(View.asSingleton());
This assumes that the size of the query result is relatively small, since the read happens entirely within a DoFn invocation.

Is it possible to import in Dataflow streaming pipeline written in Python the Java method `wrapBigQueryInsertError`?

I'm trying to create a Dataflow streaming pipeline with Python3 that reads messages from a Pub/Sub topic to end up writing them on a BigQuery table "from scratch". I've seen in the Dataflow Java template named PubSubToBigQuery.java (that carries out what I'm looking for) a piece of code in the 3th step to handle those Pub/Sub messages transformed into table rows that fail when you try to insert them into the BigQuery table. Finally, in the code pieces of the steps 4 and 5, those are flatten and inserted in an error table:
Step 3:
PCollection<FailsafeElement<String, String>> failedInserts =
writeResult
.getFailedInsertsWithErr()
.apply(
"WrapInsertionErrors",
MapElements.into(FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor())
.via((BigQueryInsertError e) -> wrapBigQueryInsertError(e)))
.setCoder(FAILSAFE_ELEMENT_CODER);
Steps 4 & 5
PCollectionList.of(
ImmutableList.of(
convertedTableRows.get(UDF_DEADLETTER_OUT),
convertedTableRows.get(TRANSFORM_DEADLETTER_OUT)))
.apply("Flatten", Flatten.pCollections())
.apply(
"WriteFailedRecords",
ErrorConverters.WritePubsubMessageErrors.newBuilder()
.setErrorRecordsTable(
ValueProviderUtils.maybeUseDefaultDeadletterTable(
options.getOutputDeadletterTable(),
options.getOutputTableSpec(),
DEFAULT_DEADLETTER_TABLE_SUFFIX))
.setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
.build());
failedInserts.apply(
"WriteFailedRecords",
ErrorConverters.WriteStringMessageErrors.newBuilder()
.setErrorRecordsTable(
ValueProviderUtils.maybeUseDefaultDeadletterTable(
options.getOutputDeadletterTable(),
options.getOutputTableSpec(),
DEFAULT_DEADLETTER_TABLE_SUFFIX))
.setErrorRecordsTableSchema(ResourceUtils.getDeadletterTableSchemaJson())
.build());
In order to do this, I suspect that the key to making this possible lies in the first imported library in the template:
package com.google.cloud.teleport.templates;
import static com.google.cloud.teleport.templates.TextToBigQueryStreaming.wrapBigQueryInsertError;
Is this method available in Python?
If not, there is some way to perform the same in Python that is not to check that the structure and the data type of fields of the records that should be inserted corresponds to what the BigQuery table expects?
This kind of workaround slows down my streaming pipeline too much.

In Beam Python, when performing a streaming BigQuery write, the rows which failed during the BigQuery write are returned by the transform. See https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1248
So you can process these in the same way as the Java template.

How can I append timestamp to rdd and push to elasticsearch

I am new to spark streaming and elasticsearch, I am trying to read data from kafka topic using spark and storing data as rdd. In the rdd I want to append time stamp, as soon as new data comes and then push to elasticsearch.
lines.foreachRDD(rdd -> {
if(!rdd.isEmpty()){
// rdd.collect().forEach(System.out::println);
String timeStamp = new
SimpleDateFormat("yyyy::MM::dd::HH::mm::ss").format(new Date());
List<String> myList = new ArrayList<String>(Arrays.asList(timeStamp.split("\\s+")));
List<String> f = rdd.collect();
Map<List<String>, ?> rddMaps = ImmutableMap.of(f, 1);
Map<List<String>, ?> myListrdd = ImmutableMap.of(myList, 1);
JavaRDD<Map<List<String>, ?>> javaRDD = sc.parallelize(ImmutableList.of(rddMaps));
JavaEsSpark.saveToEs(javaRDD, "sample/docs");
}
});

Spark?
As far as I understand, spark streaming is for real time streaming data computation, like map, reduce, join and window. It seems no need to use such a powerful tool, in the case that what we need is just add a timestamp for event.
Logstash?
If this is the situation, Logstash may be more suitable for our case.
Logstash will record the timestamp when event coming and it also has persistent queue and Dead Letter Queues that ensure the data resiliency. It has the native support for push data to ES (after all they are belong to a serial of products), which make it is very easy to push data to.
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "logstash-%{type}-%{+YYYY.MM.dd}"
}
}
More
for more about Logstash, here is introduction.
here is a sample logstash config file.
Hope this is helpful.
Ref
Deploying and Scaling Logstash

If all you're using Spark Streaming for is getting the data from Kafka to Elasticsearch a neater way–and not needing any coding–would be to use Kafka Connect.
There is an Elasticsearch Kafka Connect sink. Depending on what you want to do with a Timestamp (e.g. for index routing, or to add as a field) you can use Single Message Transforms (there's an example of them here).

Getting Streaming window timestamp for spark

I am using Spark-streaming to receive data from a zero MQ Queue at an specific interval , enrich it and save it as parquet files . I want to compare data from one streaming window to another.(later in time using parquet files)
How can I find the timestamps a specific streaming window , which I can add as another filed while enrichment to facilitate my comparisons.
JavaStreamingContext javaStreamingContext = new JavaStreamingContext(sparkConf, new Duration(duration));
inputStream = javaStreamingContext.receiverStream(new StreamReceiver( hostName, port, StorageLevel.MEMORY_AND_DISK_SER()));
JavaDStream<myPojoFormat> enrichedData = inputStream.map(new Enricher());
In a nutshell I want time stamp of each streaming window .( Not record level but batch level)

You can use the transform method of JavaDStream which gets a Function2 s parameter. The Function2 gets a RDD and a Time object and returns a new RDD. The overall result will be a new JavaDStream in which RDD have been trasformed accord the logic you have chosen.

produce hfiles for multiple tables to bulk load in a single map reduce

I am using mapreduce and HfileOutputFormat to produce hfiles and bulk load them directly into the hbase table.
Now, while reading the input files, I want to produce hfiles for two tables and bulk load the outputs in a single mapreduce.
I searched the web and see some links about MultiHfileOutputFormat and couldn't find a real solution to that.
Do you think that it is possible?

My way is :
use HFileOutputFormat as well, when the job is completed , doBulkLoad, write into table1.
set a List puts in mapper, and a MAX_PUTS value in global.
when puts.size()>MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
notice:you mast have a cleanup function to write the left puts .

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Spark streaming - join JavaDStream feed (textFileStream) with a reference data file - java

You should persist the result of the aggregation of your first feed as a RDD on disk so the DStream could remove the batch data from memory and keep going. When your other DStream starts a batch, you can join it with the RDD created previously. See Spark documentation on join

Related

GCP - Bigquery to Kafka as streaming

Is it possible to import in Dataflow streaming pipeline written in Python the Java method `wrapBigQueryInsertError`?

How can I append timestamp to rdd and push to elasticsearch

Getting Streaming window timestamp for spark

produce hfiles for multiple tables to bulk load in a single map reduce

Categories

Resources