Getting values from previous windows - java

I'm computing statistics (min, avg, etc.) on fixed windows of data. The data is streamed in as single points and is continuous (like temperature).
My current pipeline (simplified for this question) looks like:
read -> window -> compute stats (CombineFn) -> write
The issue with this is that each window's stats are incorrect since they don't have a baseline. By that, I mean that I want each window's statistics to include a single data point (the latest one) from the previous window's data.
One way to think about this is that each window's input PCollection should include the elements that would normally fall in the window based on their timestamps, plus one extra point from the previous window's PCollection.
I'm unsure of how I should go about doing this. Here are some things I've thought of doing:
Duplicating the latest data point in every window with a modified timestamp such that it lands in the next window's timeframe
Similarly, create a PCollectionView singleton per window that includes a modified version of its latest data point, which will be consumed as a side input to be merged into the next window's input PCollection
One constraint is that if a window doesn't have any new data points, except for the one that was forwarded to it, it should re-forward that value to the next window.

It sounds like you may need to copy a value from one window into arbitrarily many future windows. The only way I know how to do this is via state and timers.
You could write a stateful DoFn that operates on globally windowed data, stores in its state the latest (by timestamp) element per window, and fires a timer at each window boundary to emit that element into the subsequent window. (You could possibly leverage the Latest combine operation to get the latest element per window rather than doing it manually.) Flattening this with your original data and then windowing should give you the values you desire.
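A minimal sketch of that idea (not tested): it assumes the readings are keyed, e.g. KV<sensorId, reading> with a String key and Double value, a fixed window size of 5 minutes, and roughly in-order arrival; the class name and window size are mine, not from any SDK.

import org.apache.beam.sdk.coders.DoubleCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.*;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Runs on globally windowed, keyed readings and re-emits the last seen value
// at every fixed-window boundary so it also lands in the following window.
class ForwardLatestFn extends DoFn<KV<String, Double>, KV<String, Double>> {
  private static final Duration WINDOW_SIZE = Duration.standardMinutes(5);

  @StateId("latest")
  private final StateSpec<ValueState<KV<String, Double>>> latestSpec =
      StateSpecs.value(KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of()));

  @TimerId("boundary")
  private final TimerSpec boundarySpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("latest") ValueState<KV<String, Double>> latest,
                      @TimerId("boundary") Timer boundary) {
    // Remember the element and arrange to re-emit it when its window closes.
    latest.write(c.element());
    long millis = c.timestamp().getMillis();
    long windowStart = millis - (millis % WINDOW_SIZE.getMillis());
    boundary.set(new Instant(windowStart).plus(WINDOW_SIZE));
  }

  @OnTimer("boundary")
  public void onBoundary(OnTimerContext c,
                         @StateId("latest") ValueState<KV<String, Double>> latest,
                         @TimerId("boundary") Timer boundary) {
    KV<String, Double> carried = latest.read();
    if (carried != null) {
      // The timer fires at the window end, i.e. the first instant of the next
      // fixed window, so this copy lands in the following window.
      c.outputWithTimestamp(carried, c.timestamp());
      // Keep looping so the value is forwarded again through empty windows.
      boundary.set(c.timestamp().plus(WINDOW_SIZE));
    }
  }
}

The looping timer in onBoundary covers the constraint that a window with no new data still re-forwards the value; flattening this DoFn's output with the original readings before applying the fixed windows and the stats CombineFn then gives every window its baseline point.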

Dataflow Distinct transform example

In my Dataflow pipeline am trying to use a Distinct transform to reduce duplicates. I would like to try applying this to fixed 1min windows initially and use another method to deal with duplicates across the windows. This latter point will likely work best if the 1min windows are real/processing time.
I expect a few 1000 elements, text strings of a few KiB each.
I set up the Window and Distinct transform like this:
PCollection<String>.apply("Deduplication global window", Window
        .<String>into(new GlobalWindows())
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply("Deduplicate URLs in window", Distinct.<String>create());
But when I run this on GCP, I see that the Distinct transform appears to emit more elements than it receives (so by definition they cannot be distinct unless it makes something up!).
More likely I guess I am not setting it up properly. Does anybody have an example of how to do it (I didn't really find much apart from the javadoc)? Thank you.
Since you want to remove duplicates within a 1 min window, you can use fixed windows with the default trigger rather than a global window with a processing-time trigger.
Window.<String>into(FixedWindows.of(Duration.standardSeconds(60)))
This, followed by the Distinct transform, will remove any repeated elements within each 1 min window based on event time.
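For illustration, a minimal sketch of that setup, assuming urls is the PCollection<String> produced by the earlier read step and that the elements already carry event timestamps:

PCollection<String> deduped = urls
    .apply("1min fixed windows",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("Deduplicate URLs in window", Distinct.<String>create());

With the default trigger and zero allowed lateness, each window's pane is emitted once when the watermark passes the end of the window, so the Distinct output for a window cannot exceed its input.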

What exactly is the use of 'withIngestionTimestamps()' in a Hazelcast Jet pipeline?

I'm running a pipeline that sources from a Kafka topic and sinks to an IMap. Every time I write one, I come across the methods withIngestionTimestamps() and withoutTimestamps() and wonder how they are useful. I understand it's all about the source adding a time to the event. The question is how do I get to use it? I don't see any method to fetch the timestamp from the event.
My IMap has a possibility of getting filled with duplicate values. Could I make use of the withIngestionTimestamps() method to keep the latest record and discard the older one?
Jet uses the event timestamps to correctly apply windowing. It must decide which event belongs to which window and when the time has come to close a window and emit its aggregated result. The timestamps are present on the events as metadata and aren't exposed to the user.
However, if you want to apply your logic that refers to the wall-clock time, you can always call System.currentTimeMillis() to check it against the timestamp explicitly stored in the IMap value. That would be equivalent to using the processing time, which is quite similar to the ingestion time that Jet applies. Ingestion time is simply the processing time valid at the source vertex of the pipeline, so applying processing time at the sink vertex is just slightly different from that, and has the same practical properties.
Jet manages the event timestamp behind the scenes; it's visible only to processors. For example, the window aggregation will use the timestamp.
If you want to see the timestamp in the code, you have to include it in your item type. You have to go without timestamps from the source, add the ingestion timestamp using a map operator and let Jet know about it:
Pipeline p = Pipeline.create();
p.drawFrom(KafkaSources.kafka(...))
.withoutTimestamps()
.map(t -> tuple2(System.currentTimeMillis(), t))
.addTimestamps(Tuple2::f0, 2000)
.drainTo(Sinks.logger());
I used an allowedLag of 2000ms. The reason for this is that the timestamps will be added in a vertex downstream of the vertex that assigned them. Stream merging can take place there, and internal skew needs to be accounted for. For example, it should account for the longest expected GC pause or network delay. See the note on the addTimestamps method.
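On the deduplication part of the question, one option is to pair each event with a wall-clock timestamp as above and let the map sink keep only the newest record per key. The following is a rough, untested sketch rather than anything from the answer above: it assumes your Jet version offers Sinks.mapWithMerging, and kafkaProps, the "events" topic, the "latest-by-key" map and extractKey() are placeholders you would define yourself (tuple2 is the static factory on com.hazelcast.jet.datamodel.Tuple2).

Pipeline p = Pipeline.create();
Sink<Tuple2<Long, String>> latestByKey = Sinks.mapWithMerging("latest-by-key",
    t -> extractKey(t.f1()),                                  // IMap key derived from the payload
    t -> t,                                                   // store (seenAt, payload) as the map value
    (oldV, newV) -> newV.f0() >= oldV.f0() ? newV : oldV);    // keep whichever record is newer
p.drawFrom(KafkaSources.<String, String>kafka(kafkaProps, "events"))
 .withoutTimestamps()
 .map(e -> tuple2(System.currentTimeMillis(), e.getValue()))  // (seenAt, payload)
 .drainTo(latestByKey);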

Kubernetes, Java and Grafana - How to display only the running containers?

I'm working on a setup where we run our Java services in docker containers hosted on a kubernetes platform.
I want to create a dashboard in Grafana where I can monitor the heap usage of all instances of a service. Writing metrics to statsd with the pattern:
<servicename>.<containerid>.<processid>.heapspace
works well; I can see all heap usages in my chart.
After a redeployment the container names change, so new values are added to the existing graph. My problem is that the old lines continue to exist at the position of the last value received, but the containers are already dead.
Is there any simple solution for this in grafana? Can I just say: if you didn't receive data for a metric for more than X seconds, abort the chart line?
Update:
Upgrading to the newest Grafana version and setting "null" as the value for "Null value" under "Stacking & Null value" didn't work.
Maybe it's a problem with statsd?
I'm sending data to statsd in form of:
felix.javaclient.machine<number>-<pid>.heap:<heapvalue>|g
Is anything wrong with this?
This can happen for two reasons: because Grafana is using the "connected" setting for null values, and/or (as is the case here) because statsd keeps sending the previously seen value for the gauge when there are no updates in the current period.
Grafana Config
You'll want to make 2 adjustments to your graph config:
First, go to the "Display" tab and under "Stacking & Null value" change "Null value" to "null"; that will cause Grafana to stop showing the lines when there is no data for a series.
Second, if you're using a legend you can go to the "Legend" tab and under "Hide series" check the "With only nulls" checkbox; that will cause items to be displayed in the legend only if they have a non-null value during the graph period.
statsd Config
The statsd documentation for gauge metrics tells us:
If the gauge is not updated at the next flush, it will send the previous value. You can opt to send no metric at all for this gauge, by setting config.deleteGauges
So, the grafana changes alone aren't enough in this case, because the values in graphite aren't actually null (since statsd keeps sending the last reading). If you change the statsd config to have deleteGauges: true then statsd won't send anything and graphite will contain the null values we expect.
Graphite Note
As a side note, a setup like this will cause your data folder to grow continuously as you create new series each time a container is launched. You'll definitely want to look into removing old series after some period of inactivity to avoid filling up the disk. If you're using graphite with whisper that can be as simple as a cron task running find /var/lib/graphite/whisper/ -name '*.wsp' -mtime +30 -delete to remove whisper files that haven't been modified in the last 30 days.
To do this, I would use
maximumAbove(transformNull(felix.javaclient.*.heap, 0), 0)
The transformNull will take any datapoint that is currently null, or unreported for that instant in time, and turn it into a 0 value.
The maximumAbove will only display the series' that have a maximum value above 0 for the selected time period.
Using maximumAbove you can see all historical containers; if you wish to see only the currently running containers, use currentAbove instead.
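For example, keeping the same transformNull wrapping as above (an untested variant of the query above):
currentAbove(transformNull(felix.javaclient.*.heap, 0), 0)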

Is it possible to dynamically generate BigQuery table names based on the timestamps of the elements of a window?

For example, if I have a Dataflow streaming job with a 5 minute window that reads from PubSub, I understand that if I assign a timestamp two days in the past to an element, there will be a window containing this element, and that if I use the example that outputs daily tables to BigQuery described in BigQueryIO.java, the job will write that two-days-old element to a BigQuery table named with the current date.
I would like to write past elements to BigQuery tables named according to the timestamps of the elements in the window instead of the current time. Is that possible?
Now I'm following the example described in DataflowJavaSDK/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/BigQueryIO.java:
PCollection<TableRow> quotes = ...
quotes.apply(Window.<TableRow>into(CalendarWindows.days(1)))
    .apply(BigQueryIO.Write
        .named("Write")
        .withSchema(schema)
        .to(new SerializableFunction<BoundedWindow, String>() {
          public String apply(BoundedWindow window) {
            String dayString = DateTimeFormat.forPattern("yyyy_MM_dd").parseDateTime(
                ((DaysWindow) window).getStartDate());
            return "my-project:output.output_table_" + dayString;
          }
        }));
If I understand correctly, you would like to make sure that BigQuery tables are created according to inherent timestamps of the elements (quotes), rather than wall-clock time when your pipeline runs.
TL;DR the code should already do what you want; if it's not, please post more details.
Longer explanation:
One of the key innovations in Dataflow is event-time processing. This means that data processing in Dataflow is almost completely decoupled from when the processing happens - what matters is when the events being processed happened. This is a key element of enabling exactly the same code to run on batch or streaming data sources (e.g. processing real-time user click events using the same code that processes historical click logs). It also enables flexible handling of late-arriving data.
Please see The world beyond batch, the section "Event time vs. processing time" for a description of this aspect of Dataflow's processing model (the whole article is very much worth a read). For a deeper description, see the VLDB paper. This is also described in a more user-facing way in the official documentation on windowing and triggers.
Accordingly, there is no such thing as a "current window" because the pipeline may be concurrently processing many different events that happened at different times and belong to different windows. In fact, as the VLDB paper points out, one of the important parts of the execution of a Dataflow pipeline is "group elements by window".
In the pipeline you showed, we will group the records you want to write to BigQuery into windows using provided timestamps on the records, and write each window to its own table, creating the table for newly encountered windows if necessary. If late data arrives into the window (see documentation on windowing and triggers for a discussion of late data), we will append to the already existing table.
The abovementioned code did not work for me anymore. There is an updated example in the Google docs, though, where DaysWindow is replaced by IntervalWindow, and that worked for me:
PCollection<TableRow> quotes = ...
quotes.apply(Window.<TableRow>into(CalendarWindows.days(1)))
    .apply(BigQueryIO.Write
        .named("Write")
        .withSchema(schema)
        .to(new SerializableFunction<BoundedWindow, String>() {
          public String apply(BoundedWindow window) {
            // The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
            String dayString = DateTimeFormat.forPattern("yyyy_MM_dd")
                .withZone(DateTimeZone.UTC)
                .print(((IntervalWindow) window).start());
            return "my-project:output.output_table_" + dayString;
          }
        }));

ElasticSearch Multiple Scrolls Java API

I want to get all data from an index. Since the number of items is too large to fit in memory, I use the Scroll (nice function):
client.prepareSearch(index)
    .setTypes(myType)
    .setSearchType(SearchType.SCAN)
    .setScroll(new TimeValue(60000))
    .setSize(amountPerCall)
    .setQuery(QueryBuilders.matchAllQuery())
    .execute().actionGet();
Which works nicely when calling:
client.prepareSearchScroll(scrollId)
    .setScroll(new TimeValue(600000))
    .execute().actionGet();
But when I call the former method multiple times, I get the same scrollId each time, so I cannot run multiple scrolls in parallel.
I found http://elasticsearch-users.115913.n3.nabble.com/Multiple-scrolls-simultanious-td4024191.html which states that it is possible - though I don't know his affiliation to ES.
Am I doing something wrong?
After searching some more, I got the impression that this (the same scrollId) is by design: the scroll stays tied to that id until its timeout expires, and the timeout is reset after each call (see Elasticsearch scan and scroll - add to new index).
So you can only get one opened scroll per index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html states:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
So it appears what I wanted is not an option, on purpose - possibly because of optimization.
Update
As stated, creating multiple scrolls cannot be done, but this is only true when the query you use for scrolling is the same. If you scroll for another type, another index, or just another query, you can have multiple scrolls.
You can scroll the same index at the same time; this is what elasticsearch-hadoop does. Just don't forget that under the hood an index is composed of multiple shards that own the data, so you can scroll each shard in parallel by using:
.setPreference("_shards:1")
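For illustration, a rough sketch of that idea, reusing index, myType and amountPerCall from the question; numberOfShards is a placeholder for the index's shard count. Each request is pinned to one shard, so each response carries its own scrollId that can be consumed independently.

for (int shard = 0; shard < numberOfShards; shard++) {
    SearchResponse scan = client.prepareSearch(index)
        .setTypes(myType)
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))
        .setSize(amountPerCall)
        .setQuery(QueryBuilders.matchAllQuery())
        .setPreference("_shards:" + shard)   // pin this scroll to a single shard
        .execute().actionGet();
    // scan.getScrollId() is now distinct per shard and can be scrolled in parallel
}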
