Can Spark Streaming do Anything Other Than Word Count?

I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I want to do something more than a word count on a text file/stream/Kafka queue, which seems to be the only thing the docs cover.
I want to listen to an incoming Kafka message stream, group messages by key, and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, and then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
    List<MyThing> myThingsList = new ArrayList<>();
    MyCalculationCode myCalc = new MyCalculationCode();
    rdd.foreachPartition(partition -> {
        while (partition.hasNext()) {
            Tuple2<String, byte[]> keyAndMessage = partition.next();
            MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
            myThingsList.add(aSingleMyThing);
        }
    });
    List<MyResult> results = myCalc.doTheStuff(myThingsList);
    //other code here to write results to file
});
When debugging I see that inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results because myThingsList is a different instance of the List.
I'd like a solution to this problem, but I would prefer a reference to documentation that helps me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but the relevant section/paragraph, or preferably a link to the JavaDoc rather than Scala examples with non-functional commented code).

The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the Executor handling the processing of that partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
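If you do want the parsed objects back on the driver inside foreachRDD, a minimal sketch (reusing the question's own classes, and assuming the collected data is small enough to fit in driver memory) is to do the parsing on the executors and collect() the results, instead of mutating a driver-side list from inside foreachPartition:
groupByKeyList.foreachRDD(rdd -> {
    // The map runs on the executors; collect() ships the parsed objects back to the driver.
    List<MyThing> myThingsList = rdd
            .map(keyAndMessage -> MyThing.parseFrom(keyAndMessage._2))
            .collect();
    List<MyResult> results = new MyCalculationCode().doTheStuff(myThingsList);
    // write results to file, etc.
});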
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? That means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
    .combineByKey(/* implement logic */)
    .flatMap(x -> x._2.iterator())
    .map(proto -> MyThing.parseFrom(proto))
    .map(myThing -> myCalc.doStuff(myThing))
    .foreachRDD(/* After all the processing, do stuff with result */)
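For completeness, a minimal sketch of what that combineByKey step might look like in the Java API. This is a hedged illustration, not code from the question: the combiner lambdas, the HashPartitioner(4) and the variable names are assumptions, and it assumes the Spark 2.x Java API.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.HashPartitioner;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Group the raw byte[] payloads per key into a List, then parse them.
JavaPairDStream<String, List<byte[]>> grouped = kafkaStream.combineByKey(
        bytes -> {                 // createCombiner: start a new list for the first value of a key
            List<byte[]> list = new ArrayList<>();
            list.add(bytes);
            return list;
        },
        (list, bytes) -> {         // mergeValue: add another value to an existing list
            list.add(bytes);
            return list;
        },
        (list1, list2) -> {        // mergeCombiners: merge lists built on different partitions
            list1.addAll(list2);
            return list1;
        },
        new HashPartitioner(4));   // a partitioner is required by the Java API; 4 is arbitrary

JavaDStream<MyThing> parsed = grouped
        .flatMap(keyAndList -> keyAndList._2.iterator())
        .map(MyThing::parseFrom);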

Is it safe for a Flink application to have multiple data/key streams in a job, all sharing the same Kafka source and sink?

(Goal Updated)
My goal on each data stream is:
filter different messages
have different event-time session window gaps
consume from a topic and produce to another topic
A fan-out -> fan-in like DAG.
var fanoutStreamOne = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamTwo = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamThree = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreams = Set.of(fanoutStreamOne, fanoutStreamTwo, fanoutStreamThree);
var source = new FlinkKafkaConsumer<>(...);
var sink = new FlinkKafkaProducer<>(...);
// creates streams from same source to same sink (Using union())
new streamingJob(source, sink, fanoutStreams).execute();
I am just curious if this affects recovery/checkpoints or performance of the Flink application.
Has anyone had success with this implementation?
And should I have the watermark strategy up front before filtering?
Thanks in advance!
Okay, I don't think the different time gaps are possible. I tried it a year ago, with Flink 1.7, and I couldn't do it: the watermark is global to the application.
For the other problems: if you are using Kafka, you can read from several topics using a regex and get the topic name via the proper deserialization schema (here).
To filter the messages, I think you can use filter functions together with side output streams :) (here)
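A minimal sketch of the side-output idea, assuming kafkaSourceStream is the DataStream<String> built from the shared Kafka source; the tag names and the contains() checks are placeholders for your real filter logic:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// One tag per fan-out branch; the anonymous subclass keeps the type information.
final OutputTag<String> typeATag = new OutputTag<String>("type-a") {};
final OutputTag<String> typeBTag = new OutputTag<String>("type-b") {};

SingleOutputStreamOperator<String> mainStream = kafkaSourceStream
        .process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String msg, Context ctx, Collector<String> out) {
                if (msg.contains("A")) {          // placeholder filter
                    ctx.output(typeATag, msg);
                } else if (msg.contains("B")) {   // placeholder filter
                    ctx.output(typeBTag, msg);
                } else {
                    out.collect(msg);             // everything else stays on the main stream
                }
            }
        });

DataStream<String> typeAStream = mainStream.getSideOutput(typeATag);
DataStream<String> typeBStream = mainStream.getSideOutput(typeBTag);
Each branch can then get its own windowing, processing and sink before being unioned back together.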

Java mono repeat call until collected results complete

I'm picking up Java/Reactor after moving over from C#. I'm well versed in the C# async-await approach to non-blocking calls and am struggling to adapt to Flux/Mono.
I'm implementing a solution where I need to make a call to ElasticSearch via the Java SDK, get results, apply additional filters to strip out ES results, and keep paging through ES until my final collection of results is complete.
The ES SDK doesn't support Reactor, but there are examples of Java adapter code that take the ES callback and convert it to a Mono (I see a direct correlation to C# async-await here, as this is a non-blocking call to ES). What I then struggle with is the next bit: I need to take the results from the ES Mono and filter them.
I do this by calling out to other external services to get additional data based on the results from the ES call, so I need to know the ids of each page of content from the ES Mono result before I can apply the filtering (effectively a kind of block). I then apply the in-memory filters, and if I don't have enough content, go back to ES to get the next page... repeating until I have enough data or there are no more results from ES.
This appears to be very difficult to achieve compared to C# but I probably just don't understand the Java paradigm correctly.
My problem is that I can't use "block()" as this throws an error in Reactor 3.2, so I don't really know how to "wait" until the Mono calls to ES and the external services are complete before continuing. In C#, this would be as simple as a call to an async method with an await to handle the implicit callbacks.
My blocking version (works in IntelliJ, fails when published via Maven and then run in a webserver) is effectively:
do {
    var sr = GetSearchRequest(xxxx);
    this.elasticsearch.results(sr)
            .map(r -> chunk.add(r))
            .block();
    if (chunk.size() == 0) {
        isComplete = true;
    } else {
        var filtered = postFilterResults(chunk);
        finalResults.add(filtered);
        if (finalResults.size() == MAXIMUM_RESULTS) {
            isComplete = true;
        }
        esPage = esPage + 1;
    }
} while (isComplete == false);
If I try subscribe() or other non-blocking Reactor calls, then (obviously) the code skips over the "get ES" bit and hits the do-while, looping repeatedly until the callback from ES finally happens and the subscribed map is invoked.
I think I need to perform an "async block" for each ES call but I don't know how.
To answer my own question... The underlying issue, IMO, is that Flux/Mono simply is not like any existing programming style, in that it absolutely forces you to work within the fluent style that Reactor mandates. This is very similar to C# LINQ, but it's almost a "false friend", as even things like loops need to be expressed in Reactor.
In this case, the key issue to solve is paging, and keeping doing it within a loop. It is very unclear how to achieve this, as a subscription to a Flux "locks in" the original parameters, so repeating the subscription call simply gets the same page again. The solution is to use the Flux.defer method, which forces lazy building of the subscription on each repeated invocation. You then need AtomicIntegers to keep track of the page counter across different calls. Again, this is something that C# handles for you, so it can catch a .NET developer out.
Something like:
//The response from the elasticsearch adapter is a Flux<T> but we do not want to filter
//results on a row by row basis as this incurs one call for each row to the DB/Network
//(as appropriate). We choose to batch these up
var result = new SearchResult();
var page = new AtomicInteger();
var chunkSize = new AtomicInteger();

//Use a defer so we recalculate the subscription to the search with the new page count
var results = Flux.defer(() -> elasticsearch.results(GetSearchRequest(request, lc, pf, page.get()))
        .doOnComplete(() -> {
            chunkSize.set(0);
            page.getAndAdd(1);
        })
        .collectList()
        .map(chunk -> {
            chunkSize.set(chunk.size());
            return chunk;
        })
        .map(chunk -> postFilterResults(request, chunk, pf))
        .map(filtered -> result.getDocuments().addAll(filtered)));

//Repeat the deferred flux (recalculating each time) until we have enough content
//or we don't get anything from the search engine
return results
        .repeat()
        .takeUntil(r -> chunkSize.get() == 0
                || result.getDocuments().size() >= this.elasticsearch.getMaximumSearchResults())
        .take(this.elasticsearch.getMaximumSearchResults())
        .collectList()
        .flatMap(r -> {
            result.setTotalHits(result.getDocuments().size());
            return Mono.just(result);
        });

Write to Cloud Storage every X messages from Pubsub

I am new to Cloud Dataflow / Apache Beam, so the concept/programming is still hazy to me.
What I want to do is have Dataflow listen to Pubsub and get messages in JSON of this format:
{
    "productId": "...",
    "productName": "..."
}
And transform that to:
{
    "productId": "...",
    "productName": "...",
    "sku": "...",
    "inventory": {
        "revenue": <some Double>,
        "stocks": <some Integer>
    }
}
So the steps needed are:
(IngestFromPubsub) Get records from Pubsub by listening to a topic (1 Pubsub message = 1 record)
(EnrichDataFromAPI)
a. Deserialize the payload's JSON string into a Java object
b. By calling an external API, using the sku, I can enrich the data of each record by adding the inventory attribute.
c. Serialize the records again.
(WriteToGCS) Then every x records (x can be parameterized), I need to write these to Cloud Storage.
Please also consider the trivial case where x=1.
(Is x=1 a good idea? I am afraid there will be too many Cloud Storage writes.)
Even though I am a Python guy, I am already having difficulty doing this in Python, and more so now that I need to write it in Java. I am getting a headache reading Beam's examples in Java; they're too verbose and difficult to follow. All I understand is that each step is an .apply to the PCollection.
So far, here is the result of my puny effort:
public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);
    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("IngestFromPubsub", PubsubIO.readStrings().fromTopic(options.getTopic()))
        // I don't really understand the next part, I just copied from official documentation and filled in some values
        .apply(Window.<String>into(FixedWindows.of(Duration.millis(5000)))
            .withAllowedLateness(Duration.millis(5000))
            .triggering(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.millis(1000)))
            .discardingFiredPanes()
        )
        .apply("EnrichDataFromAPI", ParDo.of(
            new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.element();
                    // help on this part, I heard I need to use Jackson but I don't know, for API HttpClient is sufficient
                    // ... deserialize, call API, serialize again ...
                    c.output(enrichedJSONString);
                }
            }
        ))
        .apply("WriteToGCS",
            TextIO.write().withWindowedWrites().withNumShards(1).to(options.getOutput()));

    PipelineResult result = pipeline.run();
}
Please fill in the missing parts, and also give me a tip on windowing (e.g., what's the appropriate configuration, etc.) and at which steps I should insert/apply it.
I don't think you need any of the windowing in your IngestFromPubsub and EnrichDataFromAPI. The purpose of windowing is to group records that are nearby in time into windows so you can compute aggregates over them. But since you are not doing any aggregate computations and are interested in dealing with each record independently, you don't need windows.
Since you are always converting one input record to one output record, your EnrichDataFromAPI should be a MapElements. This should make the code easier.
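A minimal sketch of that, assuming messages is the PCollection<String> produced by the PubsubIO read and enrichJson() is a placeholder for the deserialize / call API / serialize logic:
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> enriched = messages.apply("EnrichDataFromAPI",
        MapElements.via(new SimpleFunction<String, String>() {
            @Override
            public String apply(String json) {
                return enrichJson(json); // placeholder for your enrichment logic
            }
        }));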
There are resources out there for processing JSON in Apache Beam Java: Apache Beam stream processing of json data
You don't necessarily need to use Jackson to map the JSON to a Java object. You might be able to manipulate the JSON directly with any small JSON library (for example javax.json or Gson) to parse/manipulate/serialize it.
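For example, a hedged sketch of such a helper using Gson (2.8.6+); callInventoryApi() and lookupSku() are hypothetical placeholders for your HttpClient calls, not real library functions:
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Parse the incoming payload, attach the enrichment fields, and serialize it back to a String.
static String enrichJson(String payload) {
    JsonObject record = JsonParser.parseString(payload).getAsJsonObject();
    String sku = lookupSku(record.get("productId").getAsString());   // placeholder lookup
    JsonObject inventory = callInventoryApi(sku);                    // placeholder API call returning {"revenue": ..., "stocks": ...}
    record.addProperty("sku", sku);
    record.add("inventory", inventory);
    return record.toString();
}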

Two questions about Flink deserializing

I'm very new to Flink and cluster computing. I spent all day trying to correctly parse a simple stream from Kafka in Flink, with NO results: it's a bit frustrating...
I have in Kafka a stream of JSON-LD messages identified by a string key. I would simply like to retrieve them in Flink and then separate the messages with different keys.
1)
Initially I considered sending the messages as String instead of JSON-LD. I thought it would be easier...
I tried every deserialiser but none works. The simple String deserialiser obviously works, but it completely ignores the keys.
I believed I had to use the following (Flink apparently has just two deserialisers which support keys):
DataStream<Object> stream = env
        .addSource(new FlinkKafkaConsumer010<>("topicTest",
                new TypeInformationKeyValueSerializationSchema(String.class, String.class, env.getConfig()),
                properties))
        .rebalance();
stream.print();
But I obtain:
06/12/2017 02:09:12 Source: Custom Source(4/4) switched to FAILED
java.io.EOFException
at org.apache.flink.runtime.util.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:306)
How can I receive the stream messages without losing the keys?
2)
My Kafka producer is implemented in JavaScript; since Flink supports JSON deserialization, I thought I would send JSON objects to Kafka directly.
I'm not sure that works correctly with JSON-LD, but I've used:
JSON.parse(jsonld_message)
to serialize the message as JSON. Then I sent it with the usual string key.
But in Flink this code doesn't work:
DataStream<ObjectNode> stream = env
        .addSource(new FlinkKafkaConsumer010<>("topicTest", new JSONKeyValueDeserializationSchema(false), properties))
        .rebalance();
stream.print();
raising a JsonParserException.
I think the first approach is simpler, and I prefer it because it allows me to consider one problem at a time (first: receive the data; second: convert the string back to JSON-LD with an external library, I guess).
SOLVED:
Finally I decided to implement a custom deserializer implementing the KeyedDeserializationSchema interface.
In order to use Flink's TypeInformationKeyValueSerializationSchema to read data from Kafka, the data must be written in a compatible way. Assuming that your key and value are of type String, the key and value must be written in a way that Flink's StringSerializer understands.
Consequently, you have to make sure that your Kafka producer writes the data in a compatible way. Otherwise Flink won't be able to read the data.
I faced a similar issue. Ideally, TypeInformationKeyValueSerializationSchema with String types for keys and values should have been able to read my Kafka record, which has both keys and values as Strings, but it was not able to and hit an EOF exception as pointed out in the post above. So this issue is easily reproducible and needs to be fixed. Please let me know if I can be of any help in this process. In the meantime, I implemented a custom deserializer using the Kafka deserialization schema. Here is the code, as there is little documentation on how to read the keys/values and additional things:
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class CustomKafkaSerializer implements KafkaDeserializationSchema<Tuple2<String, String>> {

    @Override
    public boolean isEndOfStream(Tuple2<String, String> stringStringPair) {
        return false;
    }

    @Override
    public Tuple2<String, String> deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception {
        String key = new String(consumerRecord.key());
        String value = new String(consumerRecord.value());
        return new Tuple2<>(key, value);
    }

    @Override
    public TypeInformation<Tuple2<String, String>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple2<String, String>>(){});
    }
}
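A usage sketch for wiring this schema into a consumer; the topic name and properties are assumptions, and it assumes a Flink version where the universal FlinkKafkaConsumer and KafkaDeserializationSchema are available:
import java.util.Properties;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092"); // assumption
properties.setProperty("group.id", "test-group");              // assumption

// Each element is a (key, value) Tuple2, so the keys are no longer lost.
DataStream<Tuple2<String, String>> stream = env
        .addSource(new FlinkKafkaConsumer<>("topicTest", new CustomKafkaSerializer(), properties));

stream.print();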

How to store values from Trident/Storm in a List (using the Java API)

I'm trying to create a few Unit Tests to verify that certain parts of my Trident topology are doing what they are supposed to.
I'd like to be able to retrieve all the values produced by running the topology and put them in a List so I can "see" them and check conditions on them.
FeederBatchSpout feederSpout = new FeederBatchSpout("some_time_field", "foo_id");
TridentTopology topology = new TridentTopology();
topology.newStream("spout1", feederSpout)
        .groupBy(new Fields("some_time_field", "foo_id"))
        .aggregate(new Fields("foo_id"), new FooAggregator(),
                new Fields("aggregated_foos"))
// Soo... how do I retrieve the "aggregated_foos" from here?
I am running the topology as a TrackedTopology (I got the code from another S.O. question, thank you @brianghig for asking it and @Thomas Kielbus for the reply).
This is how I "launch" the topology and how I feed sample values into it:
TrackedTopology tracked = Testing.mkTrackedTopology(cluster, topology.build());
cluster.submitTopology("unit_tests", config, tracked.getTopology());
feederSpout.feed(new Values(MyUtils.makeSampleFoo(1)));
feederSpout.feed(new Values(MyUtils.makeSampleFoo(2)));
When I do this, I can see in the log messages that the topology is running correctly, and that the values are calculated properly, but I'd like to "fish" the results out into a List (or any structure, at this point) so I can actually put some Asserts in my tests.
I've been trying a ton of different approaches, but none of them work.
The latest idea was adding a bolt after the aggregation so it would "persist" my values into a list:
Below you'll see the class that tries to go through all the tuples emitted by the aggregate and would put them in a list that I had previously initialized:
class FieldFetcherStateUpdater extends BaseStateUpdater<FieldFetcherState> {

    final List<AggregatedFoo> results;

    public FieldFetcherStateUpdater(List<AggregatedFoo> results) {
        this.results = results;
    }

    @Override
    public void updateState(FieldFetcherState state, List<TridentTuple> tuples,
                            TridentCollector collector) {
        for (TridentTuple tuple : tuples) {
            results.add((AggregatedFoo) tuple.getValue(0));
        }
    }
}
So now the code would look like:
// ...
List<AggregatedFoo> results = new ArrayList<>();
topology.newStream("spout1", feederSpout)
        .groupBy(new Fields("some_time_field", "foo_id"))
        .aggregate(new Fields("foo_id"), new FooAggregator(),
                new Fields("aggregated_foos"))
        .partitionPersist(new FieldFetcherFactory(),
                new Fields("aggregated_foos"),
                new FieldFetcherStateUpdater(results));
LOGGER.info("Done. Checkpoint results={}", results);
But nothing... The logs show Done. Checkpoint results=[] (empty list)
Is there a way to get that? I imagine it must be doable, but I haven't been able to figure out a way...
Any hint or link to pages or anything of the like will be appreciated. Thank you in advance.
You need to use a static member variable for results. If you have multiple parallel tasks running (i.e., parallelism_hint > 1), you also need to synchronize write access to results.
In your case, results will be empty because Storm internally creates a new instance of your bolt (including a new instance of the ArrayList). Using a static variable ensures that you get access to the correct object (as there will be only one across all instances of your bolt).
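A minimal sketch of that variant, mirroring the FieldFetcherStateUpdater above (same imports assumed, plus java.util.Collections); the synchronizedList wrapper only matters when parallelism_hint > 1:
class FieldFetcherStateUpdater extends BaseStateUpdater<FieldFetcherState> {

    // Shared across every instance Storm creates when it deserializes the topology.
    static final List<AggregatedFoo> RESULTS =
            Collections.synchronizedList(new ArrayList<>());

    @Override
    public void updateState(FieldFetcherState state, List<TridentTuple> tuples,
                            TridentCollector collector) {
        for (TridentTuple tuple : tuples) {
            RESULTS.add((AggregatedFoo) tuple.getValue(0));
        }
    }
}
The test then asserts on FieldFetcherStateUpdater.RESULTS after feeding the spout, instead of passing a list into the constructor.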
