spark : duplicate stages while using join - java

I am using the Spark Java API and I have noticed this weird thing that I can't explain. As you can see in
the DAG visualization of my program execution, no other stage uses the computation of stage 3. Also, the three operations in stage 3 are exactly the first three operations of stage 2. So my question is: why is stage 3 computed separately? I have also run the program without the last join operation, which gives the following DAG;
notice here that there is no parallel stage like in the previous one. I believe that this unexplained stage 3 is slowing my program down.
PS: I am very new to Spark and this is my first Stack Overflow question, so please let me know if it's off topic or requires more detail.

It looks like the join operation takes two inputs:
the result of the map, which is the third operation in stage 2
the result of the flatMap, which is the sixth operation in stage 2
Spark looks at a single stage at a time when it plans its computations. It doesn't know that it will need to keep both sets of values, so the values from the third step can be discarded while the later steps are being computed. When it gets to the join, which requires both sets of values, it realizes they are missing and recomputes them, which is why you're seeing the additional stage that reproduces the first part of stage 2.
You can tell Spark to save the intermediate values for later by calling .cache() on the RDD returned by the map operation and then joining on the RDD returned by cache(). This makes Spark do a best effort to keep those values in memory. You may still see the new stage appear, but it should complete almost instantly if there was enough memory available to store the values.
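A minimal sketch of that fix with the Java API (the RDD names and transformations are hypothetical, since the original program isn't shown):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local[*]", "cache-demo");
JavaRDD<String> lines = sc.textFile("input.txt"); // stand-in for the real input

// The map result that feeds the join; cache() asks Spark to keep these
// values in memory instead of recomputing the lineage for the join.
JavaPairRDD<String, Integer> mapped = lines
        .mapToPair(s -> new Tuple2<>(s, s.length()))
        .cache();

// A second branch derived from the same upstream data (a stand-in for
// the flatMap branch in the question).
JavaPairRDD<String, Integer> flatMapped = lines
        .flatMapToPair(s -> Arrays.asList(
                new Tuple2<>(s, 1),
                new Tuple2<>(s, 2)).iterator());

// Joining on the cached RDD: the extra stage should now be a near no-op.
JavaPairRDD<String, Tuple2<Integer, Integer>> joined = mapped.join(flatMapped);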


Kafka Streams: Should we advance stream time per key to test Windowed suppression?

I learnt from this blog and this tutorial that in order to test suppression with event-time semantics, one should send dummy records to advance stream time.
I've tried to advance time by doing just that. But this does not seem to work unless time is advanced for a particular key.
I have a custom TimestampExtractor which associates my preferred "stream-time" with the records.
My stream topology pseudocode is as follows (I use the Kafka Streams DSL API):
source.mapValues(someProcessingLambda)
.flatMap(flattenRecordsLambda)
.groupByKey(Grouped.with(Serdes.ByteArray(), Serdes.ByteArray()))
.windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ZERO))
.aggregate(()->null, aggregationLambda)
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
My input is of the following format:
1 - {"stream_time":"2019-04-09T11:08:36.000-04:00", id:"1", data:"..."}
2 - {"stream_time":"2019-04-09T11:09:36.000-04:00", id:"1", data:"..."}
3 - {"stream_time":"2019-04-09T11:18:36.000-04:00", id:"2", data:"..."}
4 - {"stream_time":"2019-04-09T11:19:36.000-04:00", id:"2", data:"..."}
...
Now records 1 and 2 belong to one 10-minute window according to stream_time, and records 3 and 4 belong to another.
Within each window, records are aggregated per id.
I expected that record 3 would signal that stream time has advanced and cause suppress to emit the data corresponding to the first window.
However, the data is not emitted until I send a dummy record with id:1 to advance the stream time for that key.
Have I understood the testing instruction incorrectly? Is this expected behavior? Does the key of the dummy record matter?
I’m sorry for the trouble. This is indeed a tricky problem. I have some ideas for adding some operations to support this kind of integration testing, but it’s hard to do without breaking basic stream processing time semantics.
It sounds like you’re testing a “real” KafkaStreams application, as opposed to testing with TopologyTestDriver. My first suggestion is that you’ll have a much better time validating your application semantics with TopologyTestDriver, if it meets your needs.
It sounds to me like you might have more than one partition in your input topic (and therefore your application). In the event that key 1 goes to one partition, and key 3 goes to another, you would see what you’ve observed. Each partition of your application tracks stream time independently.
TopologyTestDriver works nicely because it only uses one partition, and also because it processes data synchronously. Otherwise, you’ll have to craft your “dummy” time advancement messages to go to the same partition as the key you’re trying to flush out.
This is going to be especially tricky because your “flatMap().groupByKey()” is going to repartition the data. You’ll have to craft the dummy message so that it goes into the right partition after the repartition. Or you could experiment with writing your dummy messages directly into the repartition topic.
If you do need to test with KafkaStreams instead of TopologyTestDriver, I guess the easiest thing is just to write a “time advancement” message per key, as you were suggesting in your question. Not because it’s strictly necessary, but because it’s the easiest way to meet all these caveats.
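If TopologyTestDriver does fit your needs, here is a rough sketch of driving stream time forward there (the topic name and serdes are assumptions; and since you use a custom TimestampExtractor, it's the stream_time field inside each payload that actually advances stream time):

import java.util.Properties;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TopologyTestDriver;

Properties props = new Properties();
// ... your usual StreamsConfig settings (app id, custom TimestampExtractor, etc.) ...
// 'topology' is the Topology built from the DSL code in the question.
try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
    TestInputTopic<byte[], byte[]> input = driver.createInputTopic(
            "input-topic", new ByteArraySerializer(), new ByteArraySerializer());

    byte[] k1 = "1".getBytes(), k2 = "2".getBytes();
    byte[] r1 = "{\"stream_time\":\"2019-04-09T11:08:36.000-04:00\",\"id\":\"1\"}".getBytes();
    byte[] r2 = "{\"stream_time\":\"2019-04-09T11:09:36.000-04:00\",\"id\":\"1\"}".getBytes();
    byte[] r3 = "{\"stream_time\":\"2019-04-09T11:18:36.000-04:00\",\"id\":\"2\"}".getBytes();

    input.pipeInput(k1, r1); // window 1 opens
    input.pipeInput(k1, r2); // still window 1
    // The driver has a single partition and processes records synchronously,
    // so this record pushes stream time past window 1's close and suppress()
    // emits the window-1 aggregate -- regardless of which key it carries.
    input.pipeInput(k2, r3);
}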
I’ll also mention that we are working on some general improvements to stream time handling in Kafka Streams that should simplify the situation significantly, but that doesn’t help you right now, of course.

Dataflow Distinct transform example

In my Dataflow pipeline I am trying to use a Distinct transform to reduce duplicates. I would like to apply this to fixed 1-minute windows initially and use another method to deal with duplicates across the windows. That latter point will likely work best if the 1-minute windows are real (processing) time.
I expect a few 1000 elements, text strings of a few KiB each.
I set up the Window and Distinct transform like this:
PCollection<String> urls = ...; // the input collection
urls.apply("Deduplication global window", Window
        .<String>into(new GlobalWindows())
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply("Deduplicate URLs in window", Distinct.<String>create());
But when I run this on GCP, I see that the Distinct transform appears to emit more elements than it receives (so by definition they cannot be distinct unless it makes something up!).
More likely, I guess, I am not setting it up properly. Does anybody have an example of how to do it? (I didn't really find much apart from the javadoc.) Thank you.
Since you want to remove duplicates within a 1-minute window, you can make use of fixed windows with the default trigger rather than a global window with a processing-time trigger:
Window.<String>into(FixedWindows.of(Duration.standardSeconds(60)))
Following this with the Distinct transform will remove any repeated elements within each 1-minute window, based on event time.
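A minimal sketch of that suggestion end to end (the method and variable names are placeholders, not from the original post):

import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// 'input' is whatever unbounded source you read the URL strings from.
static PCollection<String> dedupePerMinute(PCollection<String> input) {
    return input
        .apply("Fixed 1min windows",
               Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))))
        .apply("Deduplicate URLs in window", Distinct.<String>create());
}

Note that FixedWindows assigns elements by event time, so duplicates that land on either side of a window boundary will survive; that is the cross-window case you planned to handle separately.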

Does spark recalculate an RDD that has been persisted to disk all over again in a new job?

Let's say my Spark application is composed of 2 jobs.
Job 1: composed of a single stage, whose result is persisted:
rdd1.persist(StorageLevel.DISK_ONLY());
Job 2: uses the calculated rdd1. However, when I look at the execution DAG, I see that all the steps that lead up to rdd1 in job 1 are represented as blue boxes, although the actual RDD is coloured green.
Does that mean that the actions that lead up to the rdd are actually skipped?
No, Spark does not recompute it.
This is actually a shortcoming of the Spark UI: it shows the complete stage in blue, but only the steps after rdd1 are actually executed; the work leading up to the persisted RDD is skipped.
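A small sketch of the pattern (names and transformations are illustrative; Spark's Java API is assumed):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

JavaSparkContext sc = new JavaSparkContext("local[*]", "persist-demo");
JavaRDD<String> rdd1 = sc.textFile("input.txt")
        .filter(line -> !line.isEmpty())
        .persist(StorageLevel.DISK_ONLY());

long job1 = rdd1.count();            // job 1: computes the lineage, writes rdd1 to disk
long job2 = rdd1.distinct().count(); // job 2: reads rdd1 back from disk; textFile and
                                     // filter are skipped, even though the UI still
                                     // draws them as blue boxes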

Sort using predefined function

I've got a sorting problem. I have a function; let's call it blackbox to simplify.
As input it takes two jobs (tasks) and as output it returns the one to be processed first. For example:
input(1,2) --> output: Job 2 is first.
The problem is, this blackbox sometimes makes bad decisions.
Example: suppose we have 3 jobs: 0, 1 and 2. We test each job against the others to identify a processing order.
input(0,1) --> output: Job 0 is first
input(1,2) --> output: Job 1 is first
input(0,2) --> output: Job 2 is first (bad decision)
So here's the problem: going by the first two outputs, job 0 has to be processed before job 2, but the blackbox says otherwise.
I want to sort a set of jobs using this blackbox, taking this problem into consideration.
So, how can I sort the set of jobs?
It is easy to identify that the problem exists: build a directed graph of the decisions. If it contains a cycle, then you have a bad decision somewhere.
But it is impossible to find out which decision is the bad one. Any decision in the cycle can be bad (or even several of them).
EDIT
You can remove some edges of the graph to break the cycles (you can choose them any way you like). After that your tasks will be partially ordered (or maybe totally ordered; I need to think about it).
EDIT 2
Here is a wiki article which may help you: Feedback arc set
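A sketch of that approach in Java (the Blackbox interface is an assumption standing in for your function): build an edge a -> b whenever the blackbox says a runs first, then try a topological sort; any jobs left over when the sort stalls lie on a cycle, which is where a bad decision hides.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper around the blackbox: true if job a should run before job b.
interface Blackbox { boolean isBefore(int a, int b); }

static List<Integer> sortJobs(int n, Blackbox box) {
    // Build the decision graph: an edge a -> b means "a before b".
    List<Set<Integer>> out = new ArrayList<>();
    int[] inDegree = new int[n];
    for (int i = 0; i < n; i++) out.add(new HashSet<>());
    for (int a = 0; a < n; a++) {
        for (int b = a + 1; b < n; b++) {
            if (box.isBefore(a, b)) { out.get(a).add(b); inDegree[b]++; }
            else                    { out.get(b).add(a); inDegree[a]++; }
        }
    }
    // Kahn's algorithm: repeatedly emit a job with no unprocessed predecessors.
    Deque<Integer> ready = new ArrayDeque<>();
    for (int i = 0; i < n; i++) if (inDegree[i] == 0) ready.add(i);
    List<Integer> order = new ArrayList<>();
    while (!ready.isEmpty()) {
        int u = ready.poll();
        order.add(u);
        for (int v : out.get(u)) if (--inDegree[v] == 0) ready.add(v);
    }
    // Jobs not emitted lie on a cycle: drop one of the cycle's edges
    // (any edge you like, per the answer above) and run the sort again.
    if (order.size() < n) throw new IllegalStateException("cycle: bad decision present");
    return order;
}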

Hadoop: Is it possible to have in memory structures in map function and aggregate them?

I am currently reading a paper and I have come to a point where the authors say that they keep some arrays in memory for every map task, and when the map task ends, they output those arrays.
This is the paper I am referring to: http://research.google.com/pubs/pub36296.html
This looks like a somewhat non-MapReduce thing to do, but I am trying to implement this project and I have come to a point where this is the only solution. I have tried many ways to use the common MapReduce philosophy, which is to process each line and output a key-value pair, but that way I have many thousands of context writes for every line of input, and they take a long time to write. So my map task is the bottleneck; these context writes cost a lot.
If I do it their way, I will manage to reduce the number of key-value pairs dramatically. So I need to find a way to have in-memory structures for every map task.
I can define these structures as static in the setup function, but I can't find a way to tell when a map task ends, so that I can output the structure. I know it sounds a bit weird, but it is the only way to work efficiently.
This is what they say in that paper:
"On startup, each mapper loads the set of split points to be considered for each ordered attribute. For each node n ∈ N and attribute X, the mapper maintains a table T_{n,X} of key-value pairs. After processing all input data, the mappers output keys of the form (n, X) and values (v, T_{n,X}[v])."
Here are some edits after Sean's answer:
I am using a combiner in my job. The thing is that the context.write(Text, Text) calls in my map function are really time-consuming. My input is CSV or ARFF files; every line is one example, and my examples can have up to thousands of attributes. For every attribute I output key-value pairs of the form <(n,X,u),Y>, where n is the name of the node (I am building a decision tree), X is the name of the attribute, u is the value of the attribute, and Y is some statistic in Text format. As you can tell, if I have 100,000 attributes, I will need 100,000 context.write(Text, Text) calls for every example. Running my map task without these calls, it runs like the wind; if I add the context.write calls, it takes forever, even for a 2,000-attribute training set. It really seems like I am writing to files and not to memory. So I really need to reduce those writes; aggregating them in memory (in the map function, not in the combiner) is necessary.
Adding a different answer, since I think I see the point of the question now.
To know when the map task ends, you can override close() (or cleanup() in the newer mapreduce API). I don't know if this is what you want, though. If you have 50 mappers, the 1/50th of the input each one sees is not known or guaranteed. Is that OK for your use case -- you just need each worker to aggregate stats in memory for what it has seen and output them?
Then your procedure is fine, but I probably would not make the in-memory data structure static -- nobody said two Mappers won't run in one JVM classloader.
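As a sketch of that pattern with the newer mapreduce API (the class name and value layout are made up for illustration): the per-task table lives in an instance field, not a static one, and is flushed in cleanup(), which Hadoop calls once when the task's input is exhausted.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AggregatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Per-task (not static) in-memory table: one entry per aggregated key.
    private final Map<String, Long> table = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // Hypothetical parsing: accumulate a count per attribute key
        // instead of calling context.write() once per attribute.
        for (String attr : line.toString().split(",")) {
            table.merge(attr, 1L, Long::sum);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once when this map task has processed all its input:
        // emit the aggregated table in one pass.
        for (Map.Entry<String, Long> e : table.entrySet()) {
            context.write(new Text(e.getKey()), new Text(e.getValue().toString()));
        }
        table.clear();
    }
}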
A more common version of this pattern plays out in the Reducer where you need to collect info over some known subset of the keys coming in before you can produce one record. You can use a partitioner, and the fact the keys are sorted, to know you are seeing all of that subset on one worker, and, can know when it's done because a new different subset appears. Then it's pretty easy to collect data in memory while processing a subset, output the result and clear it when a new subset comes in.
I am not sure that works here since the bottleneck happens before the Reducer.
Without knowing a bit more about the details of what you are outputting, I can't be certain if this will help, but, this sounds like exactly what a combiner is designed to help with. It is like a miniature reducer (in fact, a combiner implementation is just another implementation of Reducer) attached to the output of a Mapper. Its purpose is to collect map output records in memory, and try to aggregate them before being written to disk and then collected by the Reducer.
The classic example is counting values. You can output "key,1" from your map and then add up the 1s in a reducer, but, this involves outputting "key,1" 1000 times from a mapper if the key appears 1000 times when "key,1000" would suffice. A combiner does that. Of course it only applies when the operation in question is associative/commutative and can be run repeatedly with no side effect -- addition is a good example.
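For completeness, a minimal sketch of wiring up a combiner for that counting case, using the stock Hadoop helper classes (the job name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

Job job = Job.getInstance(new Configuration(), "count-with-combiner");
job.setMapperClass(TokenCounterMapper.class); // emits (word, 1) per token
job.setCombinerClass(IntSumReducer.class);    // pre-sums the 1s on the map side
job.setReducerClass(IntSumReducer.class);     // final sums across all mappers
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// ... set input/output paths and submit as usual ...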
Another answer: in Mahout we implement a lot of stuff that is weird, a bit complex, and very slow if done the simple way. Pulling tricks like collecting data in memory in a Mapper is a minor and sometimes necessary sin, so there's nothing really wrong with it. It does mean you really need to know the semantics that Hadoop guarantees, test well, and think about running out of memory if you're not careful.
