Apache Flink: Ordered timestamps with parallelism


I have a datastream in which the order of the events is important. The time characteristic is set to EventTime as the incoming records have a timestamp within them.
In order to guarantee the ordering, I set the parallelism for the program to 1. Could that become a problem, performance wise, when my program gets more complex?
If I understand correctly, I need to assign watermarks to my events, if I want to keep the stream ordered by timestamp. This is quite simple. But I'm reading that even that doesn't guarantee order? Later on, I want to do stateful computations over that stream. So, for that I use a FlatMap function, which needs the stream to be keyed. But if I key the stream, the order is lost again. AFAIK this is because of different stream partitions, which are "caused" by parallelism.
I have two questions:
Do I need parallelism? What factors do I need to consider here?
How would I achieve "ordered parallelism" with what I described above?

Several points to consider:
Setting the parallelism to 1 for the entire job will prevent you from scaling your application, which will affect performance. Whether this actually matters depends on your application requirements, but it would certainly be a limitation, and could be a problem.
If the aggregates you've mentioned are meant to be computed globally across all the event records then operating in parallel will require doing some pre-aggregation in parallel. But in this case you will then have to reduce the parallelism to 1 in the later stages of your job graph in order to produce the ultimate (global) results.
If on the other hand these aggregates are to be computed independently for each value of some key, then it makes sense to consider keying the stream and to use that partitioning as the basis for operating in parallel.
All of the operations you mention require some state, whether computing max, min, averages, or uptime and downtime. For example, you can't compute the maximum without remembering the maximum encountered so far.
If I understand correctly how Flink's NiFi source connector works, then if the source is operating in parallel, keying the stream will result in out-of-order events.
However, none of the operations you've mentioned require that the data be delivered in-order. Computing uptime (and downtime) on an out-of-order stream will require some buffering -- these operations will need to wait for out-of-order data to arrive before they can produce results -- but that's certainly doable. That's exactly what watermarks are for; they define how long to wait for out-of-order data. You can use an event-time timer in a ProcessFunction to arrange for an onTimer callback to be called when all earlier events have been processed.
You could always sort the keyed stream. Here's an example.
The uptime/downtime calculation should be easy to do with Flink's CEP library (which sorts its input, btw).
UPDATE:
It is true that after applying a ProcessFunction to a keyed stream the stream is no longer keyed. But in this case you could safely use reinterpretAsKeyedStream to inform Flink that the stream is still keyed.
As for CEP, this library uses state on your behalf, making it easier to develop applications that need to react to patterns.
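To make the watermark-driven buffering described above concrete, here is a minimal plain-Java sketch. It is a simulation and not Flink API: WatermarkSorter, Event, onEvent and onWatermark are hypothetical names chosen for illustration. In a real job the same logic would presumably live in a ProcessFunction on the keyed stream, with the buffer held in keyed state and the release step triggered from onTimer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java simulation of sorting an out-of-order stream with watermarks (not Flink API). */
class WatermarkSorter {
    /** Hypothetical event type: an event-time timestamp plus a payload. */
    static final class Event {
        final long timestamp;
        final String payload;

        Event(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Buffered events, smallest timestamp first.
    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestamp));

    /** Buffers an event; nothing is emitted until a watermark passes it. */
    void onEvent(Event event) {
        buffer.add(event);
    }

    /**
     * A watermark w asserts that no event with timestamp <= w is still in flight,
     * so everything buffered up to w can be released in timestamp order.
     */
    List<Event> onWatermark(long watermark) {
        List<Event> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp <= watermark) {
            ready.add(buffer.poll());
        }
        return ready;
    }
}
```

Feeding events with timestamps 3, 1, 5, 2 and then calling onWatermark(3) releases 1, 2, 3 in order, while 5 stays buffered until a later watermark. The cost is latency: results are delayed by however far the watermark lags behind the newest event.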

Related

Is there a way to get Strong Consistency with RocksDb in Java?

I have a program which accesses a single RocksDB using multiple threads.
Our workflow for a given document is to read the cache, do some work, then update the cache.
My code uses chained CompletableFutures to process multiple documents in order (and processes the first document before starting the subsequent document). So my RocksDB workload consists of (read, write) repeated several times for the same key.
Most of the time we get the correct value from the cache for each run through the workflow, but occasionally we will get stale data. Each operation could run on one of many threads in the Executor, but they will never run in parallel for the same key.
Is there a way to ensure that we get strong consistency? I wrote a unit test to confirm that this happens, and it happens between 1-3% of the time. I even added a read-after-write, and that reduced the inconsistency, but did not eliminate it.

I'm not sure what you mean by strong consistency here: RocksDB is strongly consistent. There is no replication across the network, which is where you would see eventual consistency.
If you want a snapshotted read, use a snapshot sequence identifier when doing your reads.
This sounds more like a threading issue where your reads and writes are happening in a non-deterministic order.
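If the cause is indeed ordering between threads, one way to rule it out is to force every (read, write) pair for a given key to run strictly after the previous one. The sketch below is an assumption about how that could look, not the asker's code: PerKeySerializer and runDemo are hypothetical names, and a ConcurrentHashMap stands in for the RocksDB-backed cache so the example needs only the JDK.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Runs tasks for the same key one after another; different keys may still run in parallel. */
class PerKeySerializer {
    // Last future submitted per key. Entries are never removed in this sketch.
    private final ConcurrentHashMap<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final ExecutorService executor;

    PerKeySerializer(ExecutorService executor) {
        this.executor = executor;
    }

    /** Appends the task to the chain for this key; it starts only after earlier tasks for the key finish. */
    CompletableFuture<Void> submit(String key, Runnable task) {
        return tails.compute(key, (k, tail) -> {
            CompletableFuture<Void> previous = (tail == null)
                    ? CompletableFuture.<Void>completedFuture(null)
                    : tail.exceptionally(t -> null); // a failed task must not block later ones
            return previous.thenRunAsync(task, executor);
        });
    }

    /** Demo: many read-modify-write cycles on one key, spread over a thread pool. */
    static int runDemo(int updates) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            PerKeySerializer serializer = new PerKeySerializer(pool);
            Map<String, Integer> store = new ConcurrentHashMap<>(); // stand-in for the cache
            CompletableFuture<Void> last = CompletableFuture.completedFuture(null);
            for (int i = 0; i < updates; i++) {
                last = serializer.submit("doc-1", () -> {
                    int current = store.getOrDefault("doc-1", 0); // read
                    store.put("doc-1", current + 1);               // write
                });
            }
            last.join(); // wait for the final task in the chain
            return store.get("doc-1");
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each task is chained onto the completion of its predecessor, no read can start before the preceding write has finished, regardless of which pool thread runs it. If stale reads still show up with this kind of chaining in place, the problem is more likely in how the existing futures are composed (for example, a write that is started but not awaited) than in RocksDB itself.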

Thread safely loop through ConcurrentHashMap with no blocking

I'm working on an ultra low latency and high performance application.
The core is single threaded, so I don't need to worry about concurrency there.
I'm developing a scheduled logging function that logs messages periodically, to prevent the same message from flooding the log.
So the log class contains a ConcurrentHashMap: one thread updates it (putting a new key or updating an existing value), and another thread periodically loops through the map to log all the messages.
My concern is that logging while looping through the map may take time. Would it block the thread trying to update the map? Any blocking is unacceptable since our application core is single threaded.
Is there any data structure other than ConcurrentHashMap that I could use to reduce the memory footprint?
Is there a thread-safe way to iterate over the map without blocking in the read-only case? It is acceptable even if the iterated data is stale.

According to the Java API docs:
[...] even though all operations are thread-safe, retrieval operations do not entail locking [...]
Moreover, the entrySet() method documentation tells us that:
The [returned] set is backed by the map, so changes to the map are reflected in the set, and vice-versa.
That would mean that modification of the map is possible while an iteration is done over it, meaning that it indeed doesn't block the whole map.
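A small experiment illustrates this. The class and method names below are made up for the example. One thread keeps updating the map while another iterates over entrySet(); the iteration neither throws ConcurrentModificationException nor holds a lock for its duration. As far as I know, updates in current JDKs briefly lock only the individual bin they touch, so "no blocking" here means the iterating thread does not hold up the writer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class WeaklyConsistentIteration {
    /** Iterates over the map while another thread updates it; returns the number of entries seen. */
    static int iterateWhileWriting() {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        for (int i = 0; i < 1000; i++) {
            counts.put("msg-" + i, 1);
        }

        // Writer: updates existing keys and adds new ones (msg-1000 .. msg-1999).
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                counts.merge("msg-" + (i % 2000), 1, Integer::sum);
            }
        });
        writer.start();

        // Reader: the iterator is weakly consistent. It sees every entry that existed
        // when it was created and may or may not see entries added afterwards.
        int seen = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            seen++;
        }

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }
}
```

The returned count lies somewhere between 1000 and 2000 depending on timing, which is the "stale data is acceptable" trade-off from the question.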

There may be other structures that would allow you to reduce the memory footprint, reduce latency, and perhaps offer a more uniform & consistent performance profile.
One of these is a worker pattern where your main worker publishes to a nonblocking queue and the logger pulls from that queue. This should decouple the two processes, allow for multiple concurrent publishers and allow you to scale out loggers.
One possible data structure for this is ConcurrentLinkedQueue. I have very little experience with Java so I'm not sure how the performance profile of this differs from ConcurrentHashMap, but this is a very common pattern in distributed systems and in golang.
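A rough sketch of that pattern, with hypothetical names (QueueBackedLogger, log, drain): the core thread only enqueues, and the logger thread periodically drains the queue and collapses duplicates before writing them out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

class QueueBackedLogger {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    /** Called by the latency-sensitive thread; offer() on this queue does not block. */
    void log(String message) {
        queue.offer(message);
    }

    /** Called periodically by the logger thread: drains pending messages and counts duplicates. */
    Map<String, Integer> drain() {
        Map<String, Integer> counts = new HashMap<>();
        String message;
        while ((message = queue.poll()) != null) {
            counts.merge(message, 1, Integer::sum);
        }
        return counts;
    }
}
```

Note that the queue is unbounded and allocates a node per message, so if the logger falls behind, memory use grows. Whether this beats the map on footprint depends on how repetitive the messages are.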

Java Concurrency ReadWriteLock with my own timestamp

ReentrantReadWriteLock is perfect for a read-write scenario in which requests are ordered by the time at which they are received.
PUT(KEY=1,VALUE=1)
PUT(KEY=1,VALUE=2)
GET(KEY=1)
PUT(KEY=1,VALUE=1)
...
Java's ReentrantReadWriteLock will automatically synchronize all of them in order, based on the time of arrival as seen by Java itself.
However, I need to use an external timestamp that is supplied along with each request.
PUT(KEY=1,VALUE=1,TIMESTAMP=13000000000000)
PUT(KEY=1,VALUE=2,TIMESTAMP=13500000000000)
GET(KEY=1,TIMESTAMP=14000000000000)
PUT(KEY=1,VALUE=1,TIMESTAMP=15000000000000)
...
How can I design a ReadWriteLock that is ordered by an external timestamp?

Short answer:
Synchronize the channel through which you receive the timestamped data, or use an ordered concurrent data structure like ConcurrentSkipListMap on the receiving end. Meanwhile, question whether you need to maintain this ordering at all.
P.S. ReentrantReadWriteLock doesn't use comparable entities like timestamps to establish ordering, and its algorithm for fair scheduling doesn't include any reordering of entries that you could reuse. ReentrantReadWriteLock uses a CLH-based lock queue through AbstractQueuedSynchronizer.
Long answer:
While what you want to do may very well be something that you can't avoid, it's always good to question whether you really need various flavors of precision and consistency in concurrent and/or distributed systems.
Why are you concerned with this problem?
It sounds like you want to preserve fairness by using the data ordering from another layer in your system. Do you need these two layers to be separate, maybe because one of them is out of your control, or because they need to stay semantically separated? If that's the case, you can ask yourself a couple more questions here.
Is it absolutely necessary to have this ordering maintained for every request?
Is this ordering a crucial part of your business logic?
Is it actually established by a component in the upper layer that you care about?
If it is based on the time of arrival of some requests through a network outside of your control, like the Internet, chances are that you don't really care about this ordering, and relaxing your consistency requirements will probably result in higher throughput. It's not rare to see that unfairly displaced requests in a highly concurrent, unfair environment are served faster than requests in a fair environment that has problems utilizing its resources.
If it is based on a single, super-fast timestamp issuer that establishes a total order over all of the requests, you may be able to modify your system, so that the timestamp issuer becomes the single producer that serves the requests to the second layer of your system through a Disruptor or an ArrayBlockingQueue.
Would you actually receive out of order requests?
You may be solving a problem that you'll never actually face, or that you will face somewhere far in the future, and in the meantime your time could be better spent somewhere else.
If that's not the case and you actually expect to receive out of order (in your external timestamp order) requests, then the communication channel between your layers is one of the components that introduce "disorder" in your system. This may be so because someone deliberately wanted to trade consistency for throughput, or it may be because that channel doesn't fit in the needs of your system without additional work.
Consider whether it's easier to enforce strict consistency in it, or it's easier to keep it as is and order the requests after they have been received.
We already touched on the former with the approach illustrated for the SingleProducerTimestampIssuer™ - it may also be unfeasible if you don't have control over the channel.
For the latter, you can try using an ordered concurrent data structure.
A ConcurrentSkipListMap mapping timestamps to requests may be a good solution. If you are not afraid of trying to apply an idea from a paper, you may want to take a look at Concurrent Programming Without Locks and Fast Concurrent Data-Structures Through Explicit Timestamping.
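As a rough illustration of the "order the requests after they have been received" option, here is a minimal sketch. TimestampOrderedRequests, receive and drainUpTo are hypothetical names, requests are represented as plain strings, and the sketch assumes timestamps are unique (two requests with the same timestamp would overwrite each other) and that a single consumer thread does the draining.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class TimestampOrderedRequests {
    // Sorted by external timestamp; safe for concurrent producers.
    private final ConcurrentSkipListMap<Long, String> pending = new ConcurrentSkipListMap<>();

    /** May be called from any receiving thread, in any arrival order. */
    void receive(long timestamp, String request) {
        pending.put(timestamp, request);
    }

    /** Removes and returns, in timestamp order, all requests with timestamp <= upTo. */
    List<String> drainUpTo(long upTo) {
        List<String> ordered = new ArrayList<>();
        Map.Entry<Long, String> first;
        while ((first = pending.firstEntry()) != null && first.getKey() <= upTo) {
            if (pending.remove(first.getKey(), first.getValue())) {
                ordered.add(first.getValue());
            }
        }
        return ordered;
    }
}
```

The consumer still has to decide how long to wait before draining up to a given timestamp, since a request with an earlier timestamp might arrive late. That waiting policy is the real consistency versus throughput trade-off discussed above.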

Observable to batch like Lmax Disruptor

Those who are familiar with the LMAX ring buffer (Disruptor) know that one of the biggest advantages of that data structure is that it batches incoming events. When the consumer can take advantage of batching, the system automatically adjusts to the load: the more events you throw at it, the better.
I wonder whether we could achieve the same effect with an Observable (targeting the batching feature). I've tried Observable.buffer, but it is very different: buffer waits and does not emit the batch until the expected number of events has arrived. What we want is quite different.
Suppose the subscriber is waiting for a batch from an Observable<Collection<Event>>. When a single item arrives on the stream, it emits a single-element batch, which the subscriber processes. While it is processing, other elements arrive and are collected into the next batch. As soon as the subscriber finishes, it gets the next batch, containing as many events as have arrived since it started processing the previous one.
As a result, if our subscriber is fast enough to process one event at a time, it will do so. If the load gets higher, it will still process at the same frequency but with more events each time (thus solving the backpressure problem), unlike buffer, which will sit and wait for the batch to fill up.
Any suggestions? Or shall I go with the ring buffer?

RxJava and Disruptor represent two different programming approaches.
I'm not experienced with Disruptor but based on video talks, it is basically a large buffer where producers emit data like a firehose and consumers spin/yield/block until data is available.
RxJava, on the other hand, aims at non-blocking event delivery. We too have ring buffers, notably in observeOn, which acts as the async boundary between producers and consumers, but these are much smaller, and we avoid buffer overflows and buffer bloat by applying the co-routines approach. Co-routines boil down to callbacks sent to your callbacks so you can call back our callbacks to send you some data at your pace. The frequency of such requests determines the pacing.
There are data sources that don't support such co-op streaming and require one of the onBackpressureXXX operators that will buffer/drop values if the downstream doesn't request fast enough.
If you think you can process data in batches more efficiently than one-by-one, you can use the buffer operator which has overloads to specify time duration for the buffers: you can have, for example, 10 ms worth of data, independent of how many values arrive in this duration.
Controlling the batch size via request frequency is tricky and may have unforeseen consequences. The problem, generally, is that if you request(n) from a batching source, you indicate you can process n elements, but the source now has to create n buffers of size 1 (because the type is Observable<List<T>>). In contrast, if no request is called, the operator buffers the data, resulting in longer buffers. These behaviors introduce extra overhead in the processing if you really could keep up, and they also require turning the cold source into a firehose (because otherwise what you have is essentially buffer(1)), which itself can now lead to buffer bloat.
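For comparison, the "give me whatever has accumulated since I last asked" behavior from the question can be sketched without RxJava or the Disruptor, using a plain BlockingQueue. This is a swapped-in technique and not an RxJava operator; AdaptiveBatcher, publish and nextBatch are hypothetical names, and the consumer blocks while the queue is empty, which is exactly what RxJava tries to avoid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class AdaptiveBatcher<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

    /** Called by producers; the queue is unbounded, so this never blocks. */
    void publish(T event) {
        queue.add(event);
    }

    /** Blocks until at least one event exists, then returns everything that has accumulated. */
    List<T> nextBatch() {
        List<T> batch = new ArrayList<>();
        try {
            batch.add(queue.take()); // wait for the first event
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for events", e);
        }
        queue.drainTo(batch); // take whatever else has arrived, without waiting
        return batch;
    }
}
```

A consumer that loops on nextBatch() gets single-element batches under light load and larger batches as the load rises, with the batch size set by how long the previous batch took to process. Because the queue is unbounded, a consumer that cannot keep up even with batching will see the buffer bloat mentioned above.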

How to save state of a very complex and huge data processing?

Consider an implementation of the A* algorithm, for example:
A* implementation
Assume the input graph is huge and the run takes long enough that I want failure recovery in case the code crashes partway through. Failures could be of any kind: software, hardware, etc.
I am not looking for code, just a few pointers to common solutions to this kind of recovery problem.

There are several options:
You can rewrite your algorithm to support error recovery. For example, you can split it into tasks and submit these tasks to a queue, so that the main part of the algorithm just takes tasks from the queue and executes them. During execution, tasks may submit additional tasks. To recover, you just need to repeat the execution of the failed task.
Perform bytecode manipulation. Take a look at the Javaflow approach: you can suspend your code execution at a certain point and resume it later. If something goes wrong, you just retry resuming from the last point.
Note that in some cases the trouble is in the algorithm implementation itself, so restoring is simply impossible. But when something goes wrong with external components (for example, you store something in a database), repeating may help: the database may be down, or there may be a write conflict with another transaction.
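The asker did not want code, but for readers who do, the task-queue option can be sketched roughly as follows. RetryingTaskQueue, submit and runAll are hypothetical names, and the queue lives in memory only. To survive a process crash the pending queue would also have to be persisted somewhere durable, which is not shown here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

class RetryingTaskQueue<T> {
    private final Deque<T> pending = new ArrayDeque<>();

    void submit(T task) {
        pending.addLast(task);
    }

    /**
     * Runs tasks until the queue is empty and returns how many completed.
     * The handler returns follow-up tasks. A task leaves the queue only after
     * the handler succeeds, so a failed task is simply executed again.
     */
    int runAll(Function<T, List<T>> handler, int maxAttemptsPerTask) {
        int executed = 0;
        while (!pending.isEmpty()) {
            T task = pending.peekFirst();
            int attempts = 0;
            while (true) {
                try {
                    List<T> followUps = handler.apply(task);
                    pending.removeFirst();      // remove only after success
                    pending.addAll(followUps);  // tasks may submit additional tasks
                    executed++;
                    break;
                } catch (RuntimeException e) {
                    attempts++;
                    if (attempts >= maxAttemptsPerTask) {
                        throw e; // give up; the task is still at the head of the queue
                    }
                }
            }
        }
        return executed;
    }
}
```

For this to be safe, each task has to be idempotent or at least restartable, since a task that fails halfway is run again from its beginning.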

When you have to guard a large dataset against failure, the normal thing to use is a redundant database. If you have graph data, you might like to use neo4j, which now has a pretty interface but also supports redundancy and can be used embedded to minimise latency.
If you just need high-throughput persisted replication, Java Chronicle supports 5-20 million messages per second over TCP replication (up to the limit of your network bandwidth).
If none of the 150+ NoSQL databases (http://nosql-database.org/) suits your needs, you would still need to implement something like them.
