I have a topology that does two kstream joins, the problem im facing is when trying to unit test with the TopologyTestDriver sending a couple of ConsumerRecords with pipeInput and then readOutput. It seems not to be working.
Im thinking this might be because the joins is using the internal rocksdb in the actual kafka which we dont use in the tests.
So i've been looking around for a solution for this but cant find any.
Note: This method of testing works perfectly fine when removing the kstream-kstream joins.
I have a topology that does two kstream joins, the problem im facing is when trying to unit test with the TopologyTestDriver sending a couple of ConsumerRecords with pipeInput and then readOutput. It seems not to be working.
By design, but unfortunately in your case, the TopologyTestDriver isn't a 100% accurate model of how the Kafka Streams engine works at runtime. Notably, there are some differences in the processing order of new, incoming events.
This can indeed cause problems when trying to test, for example, certain joins because these operations depend on a certain processing order (e.g., in a stream-table join, the table should already have an entry for key 'alice' before a stream-side event for 'alice' arrives, otherwise the join output for the stream-side 'alice' will not include any table-side data).
So i've been looking around for a solution for this but cant find any.
What I suggest is to use tests that spin up an embedded Kafka cluster, and then run your tests against that cluster using the "real" Kafka Streams engine (i.e., not the TopologyTestDriver). Effectively, this means you are changing your tests from unit tests to integration/system tests: your test will launch a full-fledged Kafka Streams topology that talks to the embedded Kafka cluster that runs on the same machine as your test.
See the integration tests for Kafka Streams in the Apache Kafka project, where EmbeddedKafkaCluster and IntegrationTestUtils are the center pieces for the tooling. A concrete test example for joins is StreamTableJoinIntegrationTest (there are a few join related integration tests) with its parent AbstractJoinIntegrationTest. (For what it's worth, there are further integration tests examples at https://github.com/confluentinc/kafka-streams-examples#examples-integration-tests, which includes tests that also cover Confluent Schema Registry when using Apache Avro as your data format, etc.)
However, unless I am mistaken, the integration tests and their tooling are not included in the test utilities artifact of Kafka Streams (i.e., org.apache.kafka:kafka-streams-test-utils). So you'd have to do some copy-pasting into your own code base.
Have you had a look at the Kafka Streams unit tests [1]? It's about piping in the data and checking the end result with a mock processor.
For example for the following stream join:
stream1 = builder.stream(topic1, consumed);
stream2 = builder.stream(topic2, consumed);
joined = stream1.outerJoin(
stream2,
MockValueJoiner.TOSTRING_JOINER,
JoinWindows.of(ofMillis(100)),
StreamJoined.with(Serdes.Integer(), Serdes.String(), Serdes.String()));
joined.process(supplier);
You can then start piping input items into the first or second topic and check with each successive piping of input, what the processor can check:
// push two items to the primary stream; the other window is empty
// w1 = {}
// w2 = {}
// --> w1 = { 0:A0, 1:A1 }
// w2 = {}
for (int i = 0; i < 2; i++) {
inputTopic1.pipeInput(expectedKeys[i], "A" + expectedKeys[i]);
}
processor.checkAndClearProcessResult(EMPTY);
// push two items to the other stream; this should produce two items
// w1 = { 0:A0, 1:A1 }
// w2 = {}
// --> w1 = { 0:A0, 1:A1 }
// w2 = { 0:a0, 1:a1 }
for (int i = 0; i < 2; i++) {
inputTopic2.pipeInput(expectedKeys[i], "a" + expectedKeys[i]);
}
processor.checkAndClearProcessResult(new KeyValueTimestamp<>(0, "A0+a0", 0),
new KeyValueTimestamp<>(1, "A1+a1", 0));
I hope this helps.
References:
[1] https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/kstream/internals/KStreamKStreamJoinTest.java#L279
Related
(Goal Updated)
My goal on each data stream is:
filter different msgs
have different event time defined window session gaps
consumer from topic and produce to another topic
A fan-out -> fan-in like DAG.
var fanoutStreamOne = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamTwo = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamThree = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreams = Set.of(fanoutStreamOne, fanoutStreamTwo, fanoutStreamThree)
var source = new FlinkKafkaConsumer<>(...);
var sink = new FlinkKafkaProducer<>(...);
// creates streams from same source to same sink (Using union())
new streamingJob(source, sink, fanoutStreams).execute();
I am just curious if this affects recovery/checkpoints or performance of the Flink application.
Has anyone had success with this implementation?
And should I have the watermark strategy up front before filtering?
Thank in advance!
Okay, the differenced time gaps are not possible, I think so. I tried it a year ago, with flink 1.7 , and I can't do it. The watermark is global to the application.
To the other problems, if you are using Kafka, yo can read from some topics using regex, and get the topic using the properly deserialization schema (here).
To filter the messages, I think you can use the filter functions with the dide output streams :) (here)
I'm currently trying to run a batch processing job in groovy with Gmongo driver, the collection is about 8 gigs my problem is that my script tries to load everything in-memory, ideally I'd like to be able to process this in batch similar to what Spring Boot Batch does but in groovy scripts
I've tried batchSize(), but this function still retrieves the entire collection into memory only to apply it to my logic in batch-process.
here's my example
momngoDb.collection.find().collect() it -> {
//logic
}
according to official doc:
https://docs.mongodb.com/manual/tutorial/iterate-a-cursor/#read-operations-cursors
def myCursor = db.collection.find()
while (myCursor.hasNext()) {
print( myCursor.next() }
}
After deliberation I found this solution to works best for the following reasons.
Unlike the Cursor it doesn't retrieve documents on a singular basis for processing (which can be terribly slow)
Unlike the Gmongo batch funstion, it also doesn't try to upload the the entire collection in memory only to cut it up in batches for process, this tends to be heavy on machine resources.
code below is efficient and light on resource depending on your batch size.
def skipSize = 0
def limitSize = Integer.valueOf(1000) batchSize (if your going to hard code the batch size then you dont need the int convertion)
def dbSize = Db.collectionName.count()
def dbRunCount = (dbSize / limitSize).round()
dbRunCount.times { it ->
dstvoDsEpgDb.schedule.find()
.skip(skipSize)
.limit(limitSize)
.collect { event ->
//run your business logic processing
}
//calculate the next skipSize
skipSize += limitSize
}
If I run an example flink application like below:
DataStream ds;
ds.map(new MapFunction1()).print();
ds.map(new MapFunction2()).print();
Will flink send twice for each records from ds to downstream operators(MapFunction1 and MapFunction2) internally?
I know that data exchange in flink is happened in taskmanager level instead of operator level.
Yes, try:
StreamExecutionEnvironment environment =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> someIntegers = environment.generateSequence(0, 0);
someIntegers.map(aLong -> aLong + 1).print();
someIntegers.map(aLong -> aLong + 2).print();
environment.execute();
OutPut:
1> 1
1> 2
The job graph for this application looks like this, and the whole application runs in a single thread, in one taskmanager. I disabled operator chaining to get the Flink webui to generate this diagram, but if I hadn't done that, there'd have been no networking involved at all.
I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples I wish to do something more than a word count on a text file/stream/Kafka queue which is the only thing we're allowed to understand from the docs.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process; get the stream of messages from Kafka, reduce by key to group messages by message key then to process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging I see that in the while (partition.hasNext()) the myThingsList has a different memory address than the declared List<MyThing> myThingsList in the outer forEachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called there are no results because the myThingsList is a different instance of the List.
I'd like a solution to this problem but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation but also section/paragraph or preferably still, a link to 'JavaDoc' that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is because Spark doesn't execute foreachPartition locally on the driver, it has to serialize the function and send it over the Executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with you code has to do with your reduceByKey which takes two byte arrays and returns the first, is that really what you want to do? That means you're effectively dropping parts of the data, perhaps you're looking for combineByKey which will allow you to return a JavaPairDStream<String, List<byte[]>.
Regarding parsing of your protobuf, looks to me like you don't want foreachRDD, you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
I am using MapReduce (just map, really) to do a data processing task in four phases. Each phase is one MapReduce job. I need them to run in sequence, that is, don't start phase 2 until phase 1 is done, etc. Does anyone have experience doing this that can share?
Ideally we'd do this 4-job sequence overnight, so making it
cron-able would be a fine thing as well.
thank you
As Daniel mentions, the appengine-pipeline library is meant to solve this problem. I go over chaining mapreduce jobs together in this blog post, under the section "Implementing your own Pipeline jobs".
For convenience, I'll paste the relevant section here:
Now that we know how to launch the predefined MapreducePipeline, let’s take a look at implementing and running our own custom pipeline jobs. The pipeline library provides a low-level library for launching arbitrary distributed computing jobs within appengine, but, for now, we’ll talk specifically about how we can use this to help us chain mapreduce jobs together. Let’s extend our previous example to also output a reverse index of characters and IDs.
First, we define the parent pipeline job.
class ChainMapReducePipeline(mapreduce.base_handler.PipelineBase):
def run(self):
deduped_blob_key = (
yield mapreduce.mapreduce_pipeline.MapreducePipeline(
"test_combiner",
"main.map",
"main.reduce",
"mapreduce.input_readers.RandomStringInputReader",
"mapreduce.output_writers.BlobstoreOutputWriter",
combiner_spec="main.combine",
mapper_params={
"string_length": 1,
"count": 500,
},
reducer_params={
"mime_type": "text/plain",
},
shards=16))
char_to_id_index_blob_key = (
yield mapreduce.mapreduce_pipeline.MapreducePipeline(
"test_chain",
"main.map2",
"main.reduce2",
"mapreduce.input_readers.BlobstoreLineInputReader",
"mapreduce.output_writers.BlobstoreOutputWriter",
# Pass output from first job as input to second job
mapper_params=(yield BlobKeys(deduped_blob_key)),
reducer_params={
"mime_type": "text/plain",
},
shards=4))
This launches the same job as the first example, takes the output from that job, and feeds it into the second job, which reverses each entry. Notice that the result of the first pipeline yield is passed in to mapper_params of the second job. The pipeline library uses magic to detect that the second pipeline depends on the first one finishing and does not launch it until the deduped_blob_key has resolved.
Next, I had to create the BlobKeys helper class. At first, I didn’t think this was necessary, since I could just do:
mapper_params={"blob_keys": deduped_blob_key},
But, this didn’t work for two reasons. The first is that “generator pipelines cannot directly access the outputs of the child Pipelines that it yields”. The code above would require the generator pipeline to create a temporary dict object with the output of the first job, which is not allowed. The second is that the string returned by BlobstoreOutputWriter is of the format “/blobstore/”, but BlobstoreLineInputReader expects simply “”. To solve these problems, I made a little helper BlobKeys class. You’ll find yourself doing this for many jobs, and the pipeline library even includes a set of common wrappers, but they do not work within the MapreducePipeline framework, which I discuss at the bottom of this section.
class BlobKeys(third_party.mapreduce.base_handler.PipelineBase):
"""Returns a dictionary with the supplied keyword arguments."""
def run(self, keys):
# Remove the key from a string in this format:
# /blobstore/<key>
return {
"blob_keys": [k.split("/")[-1] for k in keys]
}
Here is the code for the map2 and reduce2 functions:
def map2(data):
# BlobstoreLineInputReader.next() returns a tuple
start_position, line = data
# Split input based on previous reduce() output format
elements = line.split(" - ")
random_id = elements[0]
char = elements[1]
# Swap 'em
yield (char, random_id)
def reduce2(key, values):
# Create the reverse index entry
yield "%s - %s\n" % (key, ",".join(values))
I'm unfamiliar with google-app-engine, however couldn't you put all of the job-configurations in a single main program and then run them in sequence? something like the following? I think this works in normal map-reduce programs, so if google-app-engine code isn't too different it should work fine.
Configuration conf1 = getConf();
Configuration conf2 = getConf();
Configuration conf3 = getConf();
Configuration conf4 = getConf();
//whatever configuration you do for the jobs
Job job1 = new Job(conf1,"name1");
Job job2 = new Job(conf2,"name2");
Job job3 = new Job(conf3,"name3");
Job job4 = new Job(conf4,"name4");
//setup for the jobs here
job1.waitForCompletion(true);
job2.waitForCompletion(true);
job3.waitForCompletion(true);
job4.waitForCompletion(true);
You need the appengine-pipeline project, which is meant for exactly this.