Dataflow Distinct transform example - java

In my Dataflow pipeline I am trying to use a Distinct transform to reduce duplicates. I would like to try applying this to fixed 1-minute windows initially and use another method to deal with duplicates across windows. That latter part will likely work best if the 1-minute windows are real/processing time.
I expect a few thousand elements, each a text string of a few KiB.
I set up the Window and Distinct transform like this:
PCollection<String> urls = ...; // the input collection of strings

urls
    .apply("Deduplication global window", Window
        .<String>into(new GlobalWindows())
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply("Deduplicate URLs in window", Distinct.<String>create());
But when I run this on GCP, I see that the Distinct transform appears to emit more elements than it receives.
(So by definition they cannot be distinct unless it makes something up!)
More likely, I guess, I am not setting it up properly. Does anybody have an example of how to do this? (I didn't really find much apart from the Javadoc.) Thank you.

Since you want to remove duplicates within a 1-minute window, you can use fixed windows with the default trigger rather than a global window with a processing-time trigger:
Window.<String>into(FixedWindows.of(Duration.standardSeconds(60)))
Following this with the Distinct transform will remove any repeated elements within each 1-minute window, based on event time.
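Putting that together, a minimal sketch (assuming the Beam Java SDK; the urls variable and step names simply mirror the question) could look like this:

import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class DedupExample {
  // Assign each element to a fixed 1-minute event-time window,
  // then remove duplicate strings within each window.
  static PCollection<String> dedupePerWindow(PCollection<String> urls) {
    return urls
        .apply("Fixed 1 min windows",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("Deduplicate URLs in window", Distinct.<String>create());
  }
}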

Related

Prometheus query by label with range vectors

I'm defining a lot of counters in my app (using Java Micrometer), and in order to trigger alerts I tag the counters I want to monitor with "error":"alert", so a query like {error="alert"} will return multiple range vectors:
error_counter_component1{error="alert", label2="random"}
error_counter_component2{error="alert", label2="random2"}
error_counter_component3{error="none", label2="random3"}
I don't control the names of the counters; I can only add the label to the counters I want to use in my alert. The alert that I want is: if all the counters labeled with error="alert" increase by more than 3 in one hour. So I could use this kind of query: increase({error="alert"}[1h]) > 3, but I get the following error in Prometheus: Error executing query: vector cannot contain metrics with the same labelset
Is there a way to merge two range vectors, or should I include some kind of tag in the name of the counter? Or should I have a single counter for errors, with tags specifying the source, something like this:
errors_counter{source="component1", use_in_alert="yes"}
errors_counter{source="component2", use_in_alerts="yes"}
errors_counter{source="component3", use_in_alerts="no"}
The version with the source="componentX" label fits the Prometheus data model much better. This assumes the errors_counter metric really is a single metric and that, apart from the source label value, it has the same labels etc. (for example, it is emitted by the same library or framework).
Adding something like a use_in_alerts label is not a great solution. Such a label does not identify a time series.
I'd say keep the list of components to alert on wherever your alerting queries are constructed, and dynamically create separate alerting rules (without adding such a label to the raw data).
Another solution is to have a separate pseudo-metric that is only used to provide metadata about components, like:
component_alert_on{source="component2"} 1
and combine it in the alerting rule so that you only alert on the components you need. It can be generated in any way you like; one possibility is to add it via a static recording rule. The downside is that this complicates the alerting query somewhat.
But of course a use_in_alerts label will also probably work (at least while you are only alerting on this metric).
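On the application side (the question mentions Micrometer), the single-metric-with-source-label shape could be produced roughly like this sketch; the registry wiring and component names here are placeholders, not taken from the question:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

class ComponentErrors {
  // One metric name ("errors_counter"); the component is carried as a tag,
  // which Prometheus sees as the source label.
  static Counter errorsFor(MeterRegistry registry, String component) {
    return Counter.builder("errors_counter")
        .tag("source", component)
        .register(registry);
  }
}

// usage (hypothetical): errorsFor(registry, "component1").increment();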

Getting values from previous windows

I'm computing statistics (min, avg, etc.) on fixed windows of data. The data is streamed in as single points and is continuous (like temperature).
My current pipeline (simplified for this question) looks like:
read -> window -> compute stats (CombineFn) -> write
The issue with this is that each window's stats are incorrect since they don't have a baseline. By that, I mean that I want each window's statistics to include a single data point (the latest one) from the previous window's data.
One way to think about this is that each window's input PCollection should include the elements that would normally be in the window based on their timestamps, plus one extra point from the previous window's PCollection.
I'm unsure of how I should go about doing this. Here are some things I've thought of doing:
Duplicating the latest data point in every window with a modified timestamp such that it lands in the next window's timeframe
Similarly, create a PCollectionView singleton per window that includes a modified version of its latest data point, which will be consumed as a side input to be merged into the next window's input PCollection
One constraint is that if a window doesn't have any new data points, except for the one that was forwarded to it, it should re-forward that value to the next window.
It sounds like you may need to copy a value from one window into arbitrarily many future windows. The only way I know how to do this is via state and timers.
You could write a stateful DoFn that operates on globally windowed data, stores in its state the latest (by timestamp) element per window, and fires a timer at each window boundary to copy this element into the subsequent window. (You could possibly leverage the Latest combine operation to get the latest element per window rather than doing it manually.) Flattening this with your original data and then windowing should give you the values you desire.
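A rough sketch of that idea, under several assumptions that are not from the answer above (KV<String, Double> input keyed by something like a sensor id, 1-minute windows approximated with processing-time timers, and no careful handling of output timestamps), might look like:

import org.apache.beam.sdk.coders.DoubleCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Runs on globally windowed KV<sensorId, reading> data. Remembers the last
// element seen per key and periodically re-emits it so it can be flattened
// with the raw stream and re-windowed. (A real version would compare event
// timestamps, or use the Latest combine, to pick the true latest element.)
class ForwardLatestFn extends DoFn<KV<String, Double>, KV<String, Double>> {

  private static final Duration WINDOW_SIZE = Duration.standardMinutes(1); // assumed window size

  @StateId("latest")
  private final StateSpec<ValueState<KV<String, Double>>> latestSpec =
      StateSpecs.value(KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of()));

  @TimerId("boundary")
  private final TimerSpec boundarySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(ProcessContext ctx,
                      @StateId("latest") ValueState<KV<String, Double>> latest,
                      @TimerId("boundary") Timer boundary) {
    latest.write(ctx.element());
    // Arrange to re-emit around the next window boundary.
    boundary.offset(WINDOW_SIZE).setRelative();
  }

  @OnTimer("boundary")
  public void onBoundary(OnTimerContext ctx,
                         @StateId("latest") ValueState<KV<String, Double>> latest,
                         @TimerId("boundary") Timer boundary) {
    KV<String, Double> last = latest.read();
    if (last != null) {
      // Re-emit the remembered element; after flattening with the raw input
      // and re-windowing, it becomes the "baseline" point of the next window.
      ctx.output(last);
      // Keep forwarding even if no new data arrives (the re-forward constraint).
      boundary.offset(WINDOW_SIZE).setRelative();
    }
  }
}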

Spark: duplicate stages while using join

I am using the Spark Java API and I have come to notice something odd that I can't explain. As you can see in the DAG visualization of my program's execution, no other stage uses the computation of stage 3; also, the three operations in stage 3 are exactly the first three operations of stage 2. So my question is: why is stage 3 computed separately? I have also run the program without the last join operation, which gives the following DAG;
notice that here there is no parallel stage like in the previous one. I believe that this unexplained stage 3 is slowing my program down.
PS: I am very new to Spark and this is my first Stack Overflow question; please let me know if it's off topic or requires more detail.
It looks like the join operation takes two inputs:
the result of the map, which is the third operation of stage 2
the result of the flatMap, which is the sixth operation of stage 2
Spark looks at a single stage at a time when it plans its computations. It doesn't know that it will need to keep both sets of values. This can result in the values from the third step being overwritten while the later steps are being computed. When it gets to the join, which requires both sets of values, it realizes that those values are missing and recomputes them, which is why you're seeing the additional stage that reproduces the first part of stage 2.
You can tell it to save the intermediate values for later by calling .cache() on the RDD returned by the map operation, and then joining on the RDD returned from cache(). This causes Spark to make a best effort to keep those values in memory. You may still see the new stage appear, but it should complete almost instantly if there was enough memory available to store the values.
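For example, a minimal sketch (class, variable, and key names are made up for illustration) of where cache() goes relative to the join:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class CacheBeforeJoin {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "cache-before-join");

    JavaRDD<String> lines = sc.parallelize(Arrays.asList("a,1", "b,2", "a,3"));

    // The RDD that feeds the join is cached, so Spark keeps its partitions
    // around (best effort) instead of recomputing the map for the join input.
    JavaPairRDD<String, Integer> counts = lines
        .mapToPair(s -> new Tuple2<>(s.split(",")[0], 1))
        .cache();

    JavaPairRDD<String, String> labels = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>("a", "alpha"), new Tuple2<>("b", "beta")));

    // join() now reads the cached partitions of `counts` rather than
    // re-running the earlier map stage a second time.
    counts.join(labels).collect().forEach(System.out::println);

    sc.stop();
  }
}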

ElasticSearch Multiple Scrolls Java API

I want to get all the data from an index. Since the number of items is too large to fit in memory, I use the scroll API (nice feature):
client.prepareSearch(index)
    .setTypes(myType)
    .setSearchType(SearchType.SCAN)
    .setScroll(new TimeValue(60000))
    .setSize(amountPerCall)
    .setQuery(QueryBuilders.matchAllQuery())
    .execute().actionGet();
This works nicely when I then call:
client.prepareSearchScroll(scrollId)
    .setScroll(new TimeValue(600000))
    .execute().actionGet();
But when I call the former method multiple times, I get the same scrollId each time, so I cannot run multiple scrolls in parallel.
I found http://elasticsearch-users.115913.n3.nabble.com/Multiple-scrolls-simultanious-td4024191.html, which states that it is possible, though I don't know the author's affiliation with ES.
Am I doing something wrong?
After searching some more, I got the impression that this (the same scrollId) is by design, at least until the scroll timeout has expired (the timeout is reset after each call; see "Elasticsearch scan and scroll - add to new index").
So you can only have one open scroll per index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html states:
Scrolling is not intended for real time user requests, but rather for
processing large amounts of data, e.g. in order to reindex the
contents of one index into a new index with a different configuration.
So it appears that what I wanted is deliberately not an option, possibly for optimization reasons.
Update
As stated, creating multiple scrolls cannot be done, but this is only true when the query you use for scrolling is the same. If you scroll for, for instance, another type, another index, or simply another query, you can have multiple scrolls.
You can scroll the same index at the same time; this is what elasticsearch-hadoop does.
Just don't forget that, under the hood, an index is composed of multiple shards that own the data, so you can scroll each shard in parallel by using:
.setPreference("_shards:1")

Setting TTL on a MongoDB collection - in application or shell?

I would like to set the TTL for a collection once. What is the idiomatic way of achieving this when building a Java application that uses MongoDB? Do people simply apply settings like these in the shell? Or, in the application code, is it normal to check whether a collection already exists in the DB and, if it does not, create it with the desired options?
Thanks!
I never do index building in my application code anymore.
I confess that I used to. Every time my application started up it would ensure all my indexes, until one day a beginner developer got hold of my code and accidentally deleted a character in one of my index definitions.
Consequently, the entire cluster froze and went down because it was building that index in the foreground. Fortunately I had a number of delayed, non-index-building slaves to repair from, but I still lost about 12 hours all in all, and in turn 12 hours of business.
I would recommend that you don't do your index building in the application code, but instead do it carefully from the mongo console. That goes for any operation like this, even TTL indexing.
You can set a TTL on a collection as documented here.
Using the Java driver, I would try:
theTTLCollection.ensureIndex(new BasicDBObject("status", 1), new BasicDBObject("expireAfterSeconds", 3600));
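If you are on the newer (3.x+) Java driver, the rough equivalent would be something like the sketch below; the collection and field names are placeholders, and note that the indexed field must hold dates for TTL expiry to apply:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import java.util.concurrent.TimeUnit;

class TtlSetup {
  // Creates a TTL index; documents expire one hour after the value in "createdAt".
  static void createTtlIndex(MongoDatabase database) {
    MongoCollection<Document> collection = database.getCollection("events");
    collection.createIndex(
        Indexes.ascending("createdAt"),
        new IndexOptions().expireAfter(3600L, TimeUnit.SECONDS));
  }
}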
hth.
Setting a TTL is an index operation, so I guess it would not be wise, performance-wise, to do it every time your code runs.
