What exactly is the use of 'withIngestionTimestamps()' in a Hazelcast Jet Pipeline?

I'm running a pipeline that sources from a Kafka topic and sinks to an IMap. Every time I write one, I come across the methods withIngestionTimestamps() and withoutTimestamps() and wonder how they are useful. I understand it's all about the source adding a time to the event. The question is: how do I get to use it? I don't see any method to fetch the timestamp from the event.
My IMap has a possibility of getting filled with duplicate values. Could I make use of the withIngestionTimestamps() method to determine the latest record and discard the old one?

Jet uses the event timestamps to correctly apply windowing. It must decide which event belongs to which window and when the time has come to close a window and emit its aggregated result. The timestamps are present on the events as metadata and aren't exposed to the user.
However, if you want to apply your logic that refers to the wall-clock time, you can always call System.currentTimeMillis() to check it against the timestamp explicitly stored in the IMap value. That would be equivalent to using the processing time, which is quite similar to the ingestion time that Jet applies. Ingestion time is simply the processing time valid at the source vertex of the pipeline, so applying processing time at the sink vertex is just slightly different from that, and has the same practical properties.
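If the goal is the de-duplication you describe, one way to act on this advice is to carry a wall-clock timestamp inside the IMap value and let the sink keep the newer of two values for the same key. The sketch below uses Jet's Sinks.mapWithMerging for that; kafkaProps, the topic and map names, and the TimestampedPayload class are assumptions made up for the example (Util.entry comes from com.hazelcast.jet.Util, Entry is java.util.Map.Entry; other imports omitted to keep the snippet short):
class TimestampedPayload implements Serializable {
    final long timestamp;
    final String payload;
    TimestampedPayload(long timestamp, String payload) {
        this.timestamp = timestamp;
        this.payload = payload;
    }
}

Pipeline p = Pipeline.create();
p.drawFrom(KafkaSources.<String, String>kafka(kafkaProps, "orders"))
 .withoutTimestamps()
 // Attach a wall-clock (processing-time) timestamp so it is visible in the IMap value.
 .map(rec -> Util.entry(rec.getKey(), new TimestampedPayload(System.currentTimeMillis(), rec.getValue())))
 .drainTo(Sinks.mapWithMerging("latest-orders",
         Entry::getKey,                  // IMap key
         Entry::getValue,                // IMap value, carries the timestamp
         // On a duplicate key, keep whichever value has the newer timestamp.
         (oldV, newV) -> newV.timestamp >= oldV.timestamp ? newV : oldV));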

Jet manages the event timestamp behind the scenes; it's visible only to processors. For example, the window aggregation will use the timestamp.
If you want to see the timestamp in the code, you have to include it in your item type. You have to go without timestamps from the source, add the ingestion timestamp using a map operator, and let Jet know about it:
Pipeline p = Pipeline.create();
p.drawFrom(KafkaSources.kafka(...))
 .withoutTimestamps()
 .map(t -> tuple2(System.currentTimeMillis(), t))
 .addTimestamps(Tuple2::f0, 2000)
 .drainTo(Sinks.logger());
I used an allowedLag of 2000 ms. The reason for this is that the timestamps will be added in a vertex downstream of the vertex that assigned them. Stream merging can take place there, and the internal skew needs to be accounted for; for example, it should account for the longest expected GC pause or network delay. See the note on the addTimestamps method.


Kafka Streams: Should we advance stream time per key to test Windowed suppression?

I learnt from this blog and this tutorial that in order to test suppression with event-time semantics, one should send dummy records to advance stream time.
I've tried to advance time by doing just that, but it does not seem to work unless time is advanced for a particular key.
I have a custom TimestampExtractor which associates my preferred "stream-time" with the records.
My stream topology pseudocode is as follows (I use the Kafka Streams DSL API):
source.mapValues(someProcessingLambda)
      .flatMap(flattenRecordsLambda)
      .groupByKey(Grouped.with(Serdes.ByteArray(), Serdes.ByteArray()))
      .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ZERO))
      .aggregate(() -> null, aggregationLambda)
      .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
My input is of the following format:
1 - {"stream_time":"2019-04-09T11:08:36.000-04:00", id:"1", data:"..."}
2 - {"stream_time":"2019-04-09T11:09:36.000-04:00", id:"1", data:"..."}
3 - {"stream_time":"2019-04-09T11:18:36.000-04:00", id:"2", data:"..."}
4 - {"stream_time":"2019-04-09T11:19:36.000-04:00", id:"2", data:"..."}
...
Now records 1 and 2 belong to a 10 minute window according to stream_time and 3 and 4 belong to another.
Within that window, records are aggregated as per id.
I expected that record 3 would signal that stream time has advanced and cause suppress to emit the data corresponding to the first window.
However, the data is not emitted until I send a dummy record with id:1 to advance the stream time for that key.
Have I understood the testing instruction incorrectly? Is this expected behavior? Does the key of the dummy record matter?
I’m sorry for the trouble. This is indeed a tricky problem. I have some ideas for adding some operations to support this kind of integration testing, but it’s hard to do without breaking basic stream processing time semantics.
It sounds like you’re testing a “real” KafkaStreams application, as opposed to testing with TopologyTestDriver. My first suggestion is that you’ll have a much better time validating your application semantics with TopologyTestDriver, if it meets your needs.
It sounds to me like you might have more than one partition in your input topic (and therefore your application). In the event that key 1 goes to one partition, and key 3 goes to another, you would see what you’ve observed. Each partition of your application tracks stream time independently.
TopologyTestDriver works nicely because it only uses one partition, and also because it processes data synchronously. Otherwise, you’ll have to craft your “dummy” time advancement messages to go to the same partition as the key you’re trying to flush out.
This is going to be especially tricky because your “flatMap().groupByKey()” is going to repartition the data. You’ll have to craft the dummy message so that it goes into the right partition after the repartition. Or you could experiment with writing your dummy messages directly into the repartition topic.
If you do need to test with KafkaStreams instead of TopologyTestDriver, I guess the easiest thing is just to write a “time advancement” message per key, as you were suggesting in your question. Not because it’s strictly necessary, but because it’s the easiest way to meet all these caveats.
I’ll also mention that we are working on some general improvements to stream time handling in Kafka Streams that should simplify the situation significantly, but that doesn’t help you right now, of course.
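For reference, here is a rough sketch of the TopologyTestDriver route, using the Kafka 2.4+ test-utils API. Treat it as an outline only: topology is assumed to be the Topology built from your DSL code, the topic names are placeholders, and the record timestamps passed to pipeInput stand in for whatever your custom TimestampExtractor would read from the payload.
// Imports from org.apache.kafka.streams.* and kafka-streams-test-utils omitted for brevity.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "suppression-test");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
    TestInputTopic<byte[], byte[]> input = driver.createInputTopic(
            "input-topic", new ByteArraySerializer(), new ByteArraySerializer());
    TestOutputTopic<byte[], byte[]> output = driver.createOutputTopic(
            "output-topic", new ByteArrayDeserializer(), new ByteArrayDeserializer());

    byte[] key1 = "1".getBytes(StandardCharsets.UTF_8);
    byte[] key2 = "2".getBytes(StandardCharsets.UTF_8);
    byte[] payload = "{\"data\":\"...\"}".getBytes(StandardCharsets.UTF_8);

    // Records 1 and 2 fall into the 11:00-11:10 window (timestamps given in UTC).
    input.pipeInput(key1, payload, Instant.parse("2019-04-09T15:08:36Z"));
    input.pipeInput(key1, payload, Instant.parse("2019-04-09T15:09:36Z"));

    // Record 3 has a different key, but it still advances stream time past the first
    // window's end plus grace, because the test driver runs a single partition.
    input.pipeInput(key2, payload, Instant.parse("2019-04-09T15:18:36Z"));

    // The suppressed result for the first window should be visible now.
    System.out.println(output.readKeyValuesToList());
}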

Getting values from previous windows

I'm computing statistics (min, avg, etc.) on fixed windows of data. The data is streamed in as single points and are continuous (like temperature).
My current pipeline (simplified for this question) looks like:
read -> window -> compute stats (CombineFn) -> write
The issue with this is that each window's stats are incorrect since they don't have a baseline. By that, I mean that I want each window's statistics to include a single data point (the latest one) from the previous window's data.
One way to think about this is that each window's input PCollection should include the ones that would normally be in the window due to their timestamp, but also one extra point from the previous window's PCollection.
I'm unsure of how I should go about doing this. Here are some things I've thought of doing:
Duplicating the latest data point in every window with a modified timestamp such that it lands in the next window's timeframe
Similarly, create a PCollectionView singleton per window that includes a modified version of its latest data point, which will be consumed as a side input to be merged into the next window's input PCollection
One constraint is that if a window doesn't have any new data points, except for the one that was forwarded to it, it should re-forward that value to the next window.
It sounds like you may need to copy a value from one window into arbitrarily many future windows. The only way I know how to do this is via state and timers.
You could write a stateful DoFn that operates on globally windowed data, stores in its state the latest (by timestamp) element per window, and fires a timer at each window boundary to emit this element into the subsequent window. (You could possibly leverage the Latest combine operation to get the latest element per window rather than doing it manually.) Flattening this with your original data and then windowing should give you the values you desire.
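For illustration, here is a rough sketch of that state-and-timers approach in the Beam Java SDK. It assumes globally windowed KV<String, Double> input, a made-up CarryForwardFn name and a fixed 1-minute window, and it glosses over edge cases (late data, a timer being overwritten before it fires). Elements pass through unchanged, so the Flatten step is effectively folded into the DoFn.
// Imports from org.apache.beam.sdk.* and joda-time omitted for brevity.
class CarryForwardFn extends DoFn<KV<String, Double>, KV<String, Double>> {

  private static final Duration WINDOW_SIZE = Duration.standardMinutes(1);

  @StateId("latest")
  private final StateSpec<ValueState<TimestampedValue<KV<String, Double>>>> latestSpec =
      StateSpecs.value(TimestampedValue.TimestampedValueCoder.of(
          KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of())));

  @TimerId("windowEnd")
  private final TimerSpec windowEndSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("latest") ValueState<TimestampedValue<KV<String, Double>>> latest,
      @TimerId("windowEnd") Timer windowEnd) {
    // Remember the latest element (by timestamp) seen so far for this key.
    TimestampedValue<KV<String, Double>> current = latest.read();
    if (current == null || c.timestamp().isAfter(current.getTimestamp())) {
      latest.write(TimestampedValue.of(c.element(), c.timestamp()));
    }
    // Fire at the end of the fixed window that contains this element.
    windowEnd.set(windowEndFor(c.timestamp()));
    c.output(c.element());
  }

  @OnTimer("windowEnd")
  public void onWindowEnd(
      OnTimerContext c,
      @StateId("latest") ValueState<TimestampedValue<KV<String, Double>>> latest,
      @TimerId("windowEnd") Timer windowEnd) {
    TimestampedValue<KV<String, Double>> carried = latest.read();
    if (carried != null) {
      // Re-emit the carried value with a timestamp that lands in the next window.
      c.outputWithTimestamp(carried.getValue(), c.timestamp());
      latest.write(TimestampedValue.of(carried.getValue(), c.timestamp()));
      // Re-arm the timer so the value keeps forwarding through windows with no new data.
      windowEnd.set(c.timestamp().plus(WINDOW_SIZE));
    }
  }

  private static Instant windowEndFor(Instant ts) {
    long size = WINDOW_SIZE.getMillis();
    return new Instant((ts.getMillis() / size + 1) * size);
  }
}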

Is it possible to dynamically generate BigQuery table names based on the timestamps of the elements of a window?

For example, if I have a Dataflow streaming job with a 5-minute window that reads from PubSub, I understand that if I assign a timestamp two days in the past to an element, there will be a window containing this element; but if I use the example that outputs daily tables to BigQuery described in BigQueryIO.java, the job will write that two-days-old element to a BigQuery table named after the current date.
I would like to write past elements to BigQuery tables named after the timestamp of the elements of the window instead of the time of the current window. Is that possible?
Now I'm following the example described in DataflowJavaSDK/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/BigQueryIO.java:
PCollection<TableRow> quotes = ...
quotes.apply(Window.<TableRow>into(CalendarWindows.days(1)))
      .apply(BigQueryIO.Write
          .named("Write")
          .withSchema(schema)
          .to(new SerializableFunction<BoundedWindow, String>() {
            public String apply(BoundedWindow window) {
              String dayString = DateTimeFormat.forPattern("yyyy_MM_dd")
                  .print(((DaysWindow) window).getStartDate());
              return "my-project:output.output_table_" + dayString;
            }
          }));
If I understand correctly, you would like to make sure that BigQuery tables are created according to inherent timestamps of the elements (quotes), rather than wall-clock time when your pipeline runs.
TL;DR the code should already do what you want; if it's not, please post more details.
Longer explanation:
One of the key innovations in processing in Dataflow is event-time processing. This means that data processing in Dataflow is almost completely decoupled from when the processing happens - what matters is when the events being processed happened. This is a key element of enabling exactly the same code to run on batch or streaming data sources (e.g. processing real-time user click events using the same code that processes historical click logs). It also enables flexible handling of late-arriving data.
Please see The world beyond batch, the section "Event time vs. processing time" for a description of this aspect of Dataflow's processing model (the whole article is very much worth a read). For a deeper description, see the VLDB paper. This is also described in a more user-facing way in the official documentation on windowing and triggers.
Accordingly, there is no such thing as a "current window" because the pipeline may be concurrently processing many different events that happened at different times and belong to different windows. In fact, as the VLDB paper points out, one of the important parts of the execution of a Dataflow pipeline is "group elements by window".
In the pipeline you showed, we will group the records you want to write to BigQuery into windows using provided timestamps on the records, and write each window to its own table, creating the table for newly encountered windows if necessary. If late data arrives into the window (see documentation on windowing and triggers for a discussion of late data), we will append to the already existing table.
The abovementioned code did not work for me anymore. There is an updated example in the Google docs though where DaysWindow is replaced by IntervalWindow which worked for me:
PCollection<TableRow> quotes = ...
quotes.apply(Window.<TableRow>into(CalendarWindows.days(1)))
      .apply(BigQueryIO.Write
          .named("Write")
          .withSchema(schema)
          .to(new SerializableFunction<BoundedWindow, String>() {
            public String apply(BoundedWindow window) {
              // The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
              String dayString = DateTimeFormat.forPattern("yyyy_MM_dd")
                  .withZone(DateTimeZone.UTC)
                  .print(((IntervalWindow) window).start());
              return "my-project:output.output_table_" + dayString;
            }
          }));

Some questions regarding the architecture/design for this use case

My application needs to work as middleware: it receives orders (as XML) from various customers, and each order contains a supplier id. Once it gets an order, it needs to send an order request to the right supplier, also as XML. I am of two minds about three aspects of this. Here they are:
Questions:
At a high level I am planning to put each request on a JMS queue as soon as it comes in. (I am not sure whether I should create a queue per supplier or whether one queue is sufficient. I think one queue will be sufficient, as maintaining a large number of queues would be overhead.) The advantage of maintaining a separate queue per supplier is that messages can be processed faster, since there would be a separate producer on each queue.
Before putting the object on the queue I need to do some business validation. Also, the structure of the input XML I receive and of the output XML I need to send to the supplier are different. For this I am planning to convert the input XML to a Java object and then put that on the queue, so that validation can be done with ease on the consumer side. Another thought is to not convert the XML into a Java object, but instead read the element values via an XPath/XStream API, validate them, and put the XML string on the queue as-is; then, on the consumer side, convert the XML to a Java object and from that to the different XML format. Is there a good way of doing this?
My requirement is that the consumer processes the messages on the queue every 5 hours and sends the XML requests to the suppliers. I am planning to use the Quartz scheduler here, which will pick up the messages one by one and send each to the corresponding supplier based on supplierId. My concern is that if the job picks up the messages one by one and sends them to the suppliers, it will be too slow. I am planning to handle this by having the Quartz job create a thread pool with, say, ten threads that process the messages from the queue concurrently. (So there will be multiple consumers on the queue; I think that's not valid for a queue. Do I need a topic here instead of a queue?) Is the second approach better, or is there something better than both?
I am expecting a load of 50k requests per hour, which means around 15 requests per second.
Your basic requirement is:
Get the order from the customer as XML (you have not said how you receive it).
Do basic business validation.
Send the orders to the suppliers.
And you are expecting 50k requests per hour (you haven't provided an approximate order size). Assuming an average order size of 10 KB, roughly 500 MB would be needed just to hold an hour's worth of orders in the queue (irrespective of the number of queues). I am not sure which environment you are running.
For point #1:
I would choose a single queue instead of multiple queues.
- Choose the appropriate persistent store.
I am assuming you would be using a distributed queue, so that it can scale easily as you add cluster nodes.
For point #2:
I would convert to a POJO (your own format) and perform the business validation on it, so that later, if you want to extend the business validation to a rule engine or add any other conversion, it is easy to maintain.
- Basically, accept the input in any form (XML / POJO / JSON ...) and convert it into a middle format (you can write a custom validator / conversion utility on top of the middle format). Keep mappings between the common format and both the input and the output, so that you can write formatters and reuse them; a format change for any specific supplier then has no wider impact. Try to externalize the format mapping.
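As an illustration of that middle-format idea (all class names here are made up, not from your question):
// A canonical "middle format" plus per-format adapters, so a change in one supplier's
// XML dialect stays local to its formatter.
public class CanonicalOrder {
    public String orderId;
    public String supplierId;
    public String customerId;
    public java.util.Map<String, String> attributes; // remaining order fields
}

/** Parses a customer's incoming XML (e.g. via XPath or XStream) into the canonical model. */
public interface OrderParser {
    CanonicalOrder parse(String customerXml);
}

/** Validates the canonical model once, independently of input and output formats. */
public interface OrderValidator {
    void validate(CanonicalOrder order);
}

/** Renders the canonical order in the XML dialect a given supplier expects. */
public interface SupplierFormatter {
    String toSupplierXml(CanonicalOrder order);
}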
For point #3:
In your case an order needs to be processed only once, so I would go with a queue, and you can have multiple message listeners. Message listeners deliver orders asynchronously, so you can have several listeners on one queue, each running in its own thread.
Is there a problem with sending the orders as soon as they are received? It would be good for you as well as the supplier to avoid a heavy load at a particular time.
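To illustrate point #3, here is a hedged sketch of several listeners consuming from one queue, each with its own session (and therefore its own delivery thread). The queue name and the OrderSender interface are made up for the example; each message is still delivered to exactly one listener, which is why a queue rather than a topic fits this case.
import javax.jms.*;

public class OrderQueueConsumers {

    /** Hypothetical callback that converts, validates and forwards one order. */
    public interface OrderSender {
        void sendToSupplier(Message order);
    }

    public static void start(ConnectionFactory factory, OrderSender sender) throws JMSException {
        Connection connection = factory.createConnection();
        for (int i = 0; i < 10; i++) {
            // One session per listener: a JMS session is a single-threaded context.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(session.createQueue("orders.queue"));
            consumer.setMessageListener(sender::sendToSupplier);
        }
        connection.start();
    }
}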
Since you are the middleware, you should handle data quickly at the point of contact, to keep your hands free for more incoming requests. Therefore you must find a way to classify the incoming data as quickly and with as little memory as possible. Leave the processing of the data to modules more specific to the problem: a receptionist just directs the guests to the right spot.
If you really have to read and understand the received data in your specialized worker later on, use a thread pool. This way you can process the data in parallel without worrying too much about running out of memory; just choose your pool size sensibly and use only one pool. You could use a listener pattern to signal new incoming data to the worker multiton. You should avoid JAXB, or rather full deserialization of the data, if possible; it eats up memory like hell.
I would not use a topic, because your messages are relevant to only one listener each.
If possible, send the order on as soon as the worker is done with its work. If not, use persistent storage. This way you can later prove that you processed the data, and if something goes wrong or you have to update your software, you do not have to worry about volatile data.

Multiple iCalendar VEvents or VTODOs for the same meeting

My application needs a feature that allows the creation of something like a project, where you specify the total number of hours you need to work on it, the start date, the end date, and how long each activity takes (additional constraints might be included as well).
What is the best way to create multiple VEvents according to those constraints, with an option to change those VEvents later? Also, what is the best method to check against the current iCalendar whether a date is busy or not? Can I somehow retrieve all the busy dates from the ics file and then just check whether a given time gap is free or busy?
Typically, projects are described in terms of tasks (VTODO) instead of events. See http://www.calconnect.org/7_things_tasks.shtml for an introduction on tasks. This document also describes how tasks can be grouped together.
The second part of your question is a bit fuzzy. With a library like ical4j, you can make freebusy requests within a stream of vtodos. The other option would be to rely on a CalDAV server to store those.
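As a rough sketch of the ical4j route (constructor signatures as in ical4j 3.x; please verify them against the version you use, and treat the file name and time range as placeholders):
import net.fortuna.ical4j.data.CalendarBuilder;
import net.fortuna.ical4j.model.Calendar;
import net.fortuna.ical4j.model.DateTime;
import net.fortuna.ical4j.model.component.VFreeBusy;

import java.io.FileInputStream;

public class BusyCheck {
    public static void main(String[] args) throws Exception {
        // Load the project calendar (events/tasks) from an .ics file.
        Calendar calendar = new CalendarBuilder().build(new FileInputStream("project.ics"));

        // Free/busy query for the candidate time gap...
        VFreeBusy request = new VFreeBusy(
                new DateTime("20190415T090000Z"), new DateTime("20190415T170000Z"));

        // ...answered from the calendar's time-consuming components. An empty FREEBUSY
        // list in the reply means the gap is free.
        VFreeBusy reply = new VFreeBusy(request, calendar.getComponents());
        System.out.println(reply);
    }
}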
