Consider the following example class, instances of which are being stored in a ListState:
class BusinessObject {
    long clientId;
    String region;
    Instant lastDealDate;
    boolean isActive;
}
The application requires that this object should not remain in Flink state if more than 1 year has passed since the last deal was made with a particular client (lastDealDate) and the client is not active, i.e. isActive == false.
What would be the proper way of going about this and letting Flink know about these two conditions so that it removes such entries automatically? Currently I read all the items in the state, clear the state, and then add back the relevant ones, but this will start to take a long time as the number of clients increases and the state grows large. Most of my searches online suggest configuring a time-to-live on the state descriptor. However, my logic can't rely on processing/event/ingestion time alone, and I also need to check whether isActive is false.
Extra info: the context is not keyed and the backend is RocksDB. The reason a ListState is used is that all of the relevant state/history matching the above conditions needs to be dumped every day.
Any suggestions?
With the RocksDB state backend, Flink can append to ListState without deserializing the existing list, but any read or modification other than an append is expensive because the entire list has to go through ser/de.
You'll be better off if you can rework things so that these BusinessObjects are stored in MapState, even if you occasionally have to iterate over the entire map. Each key/value pair in the MapState will be a separate RocksDB entry, and you'll be able to individually create/update/delete them without having to go through ser/de for the entire map (unless you do have to scan it). (For what it's worth, iterating over MapState in RocksDB proceeds through the map in serialized-key-sorted order.)
MapState is only available as keyed (or broadcast) state, so this change would require you to key the stream. Using keyBy does force a network shuffle (and ser/de), so it will be expensive, but not as expensive as using ListState.
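As an illustration, here is a minimal sketch of that MapState approach, under assumptions that are not in the question: the stream is keyed (here by region, purely as an example key), clientId is used as the map key, and the expiry check runs inline on each update. In practice you would more likely drive the daily dump/cleanup from a timer rather than scanning on every element.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class BusinessObjectCleaner
        extends KeyedProcessFunction<String, BusinessObject, BusinessObject> {

    private transient MapState<Long, BusinessObject> objectsByClient;

    @Override
    public void open(Configuration parameters) {
        // One RocksDB entry per clientId, individually readable/updatable/removable.
        objectsByClient = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("objectsByClient", Long.class, BusinessObject.class));
    }

    @Override
    public void processElement(BusinessObject value, Context ctx, Collector<BusinessObject> out)
            throws Exception {
        // Create/update touches only this client's entry, not the whole collection.
        objectsByClient.put(value.clientId, value);

        // Occasional full scan: drop inactive clients with no deal in the last year.
        Instant cutoff = Instant.now().minus(Duration.ofDays(365));
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, BusinessObject> entry : objectsByClient.entries()) {
            BusinessObject bo = entry.getValue();
            if (!bo.isActive && bo.lastDealDate.isBefore(cutoff)) {
                expired.add(entry.getKey());
            }
        }
        for (Long clientId : expired) {
            objectsByClient.remove(clientId);
        }
    }
}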
I am trying to understand the best way to attach current-time timestamps using Flink when producing a new record to Kafka.
Does Flink automatically fill the produced event with metadata containing the timestamp of the current time? Is that the best practice for the consumers, or should we put the current time inside the event?
If I really want to put the current time on a processed event, how should I do it in Java? I am running Flink in Kubernetes, so I don't know if a simple current_time() call would be the ideal way of doing it, because task managers may be on different nodes, and I am not sure whether the clocks on each of them are going to be in sync.
When initializing a KafkaSink you have to provide a KafkaRecordSerializationSchema; in its serialize method you can set the timestamp associated with each element when building the org.apache.kafka.clients.producer.ProducerRecord. The timestamp the serialize method receives depends on your pipeline configuration. You can find more information about assigning timestamps and how Flink handles time here: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/event-time/generating_watermarks/
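For concreteness, here is a hedged sketch of such a schema; the event type MyEvent, its toJson() helper and the topic name are assumptions, not part of the question.

import java.nio.charset.StandardCharsets;

import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MyEventSerializationSchema implements KafkaRecordSerializationSchema<MyEvent> {

    @Override
    public ProducerRecord<byte[], byte[]> serialize(
            MyEvent element, KafkaSinkContext context, Long timestamp) {
        // 'timestamp' is whatever timestamp Flink attached to the element (may be null);
        // swap in System.currentTimeMillis() here if you really want "produce time".
        long recordTimestamp = (timestamp != null) ? timestamp : System.currentTimeMillis();
        byte[] value = element.toJson().getBytes(StandardCharsets.UTF_8);  // assumed helper
        return new ProducerRecord<>("my-topic", null, recordTimestamp, null, value);
    }
}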
If you are not setting it, Kafka will automatically assign a timestamp to each record when receiving it (the ingestion time, which will basically be the processing time plus a slight delay).
In any case, achieving perfectly ordered processing time timestamps in a distributed application will face the problem you describe. Different nodes will have different clocks, even if all are synchronized using NTP. It is a big problem in distributed systems that requires significant effort to solve (if even possible).
A pragmatic approach that may be good enough is to have all records that belong to the same key timestamped by the same node; that way you will have perfectly ordered timestamps per key most of the time. Be aware that a rebalance or a clock correction (which NTP does periodically) will break this perfect ordering for some records from time to time. If you have a KeyedStream and you assign the timestamp in a keyed map, or let Kafka do it, you will get these mostly-ordered timestamps per key.
Does Flink automatically fill the produced event with metadata containing the timestamp of the current time? Is that the best practice for the consumers or should we put the current time inside the event?
Yes, the timestamp is set to whatever value the TimestampAssigner returned for that record. Thanks to this, Flink transformations can preserve the original records' timestamps.
I am running Flink in Kubernetes, so I don't know if a simple current_time() call would be the ideal way of doing it, because task managers may be on different nodes, and I am not sure whether the clocks on each of them are going to be in sync.
I can assure you that they won't be in sync. That's why, to simplify things in distributed systems, we don't really rely on the wall clock but on event time.
Since there is a limit on Hadoop counter size (and we don't want to increase it for just one job), I am creating a map (Map) which increments the count for a key when certain conditions are met (same as counters). There is already a DoFn (returning a custom-made object) which processes the data, so I am interested in passing a map into it and grouping the results outside based on keys.
I think a ConcurrentHashMap might work, but I have been unable to implement it.
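For what it's worth, a minimal sketch of the counting side of that idea is below, assuming the map is shared by the worker threads of a single JVM (the class and condition names are made up). In a distributed job each worker has its own map, so you would still need to emit the per-worker maps and group/sum them by key downstream, as you describe.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConditionCounters {

    private final ConcurrentMap<String, Long> counts = new ConcurrentHashMap<>();

    // Safe to call from many threads; merge() updates each key atomically.
    public void increment(String conditionName) {
        counts.merge(conditionName, 1L, Long::sum);
    }

    public ConcurrentMap<String, Long> snapshot() {
        return counts;
    }
}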
I have a clustered system set up with Hazelcast to store my data. Each node in the cluster is responsible for connecting to a service on localhost and piping data from this service into the Hazelcast cluster.
I would like this data to be stored primarily on the node that received it, and also processed on that node. I'd like the data to be readable and writable on other nodes with moderately less performance requirements.
I started with a naive implementation that does exactly as I described with no special considerations. I noticed performance suffered quite a bit (we had a separate implementation using Infinispan to compare it with). Generally speaking, there is little logical intersection between the data I'm processing from each individual service. It's stored in a Hazelcast cluster so it can be read and occasionally written from all nodes and for failover scenarios. I still need to read the last good state of the failed node if either the Hazelcast member fails on that node or the local service fails on that node.
So my first attempt at co-locating the data and reducing network chatter was to key much of the data with a serverId (a number from 1 to 3 on, say, a 3-node system) included in the key, with the key implementing PartitionAware. I didn't notice an improvement in performance, so I decided to execute the logic itself on the cluster and key it the same way (with a PartitionAware Runnable submitted to a DurableExecutorService). I figured that if I couldn't select which member the logic would be processed on, I could at least execute it on the same member consistently, co-located with the data.
That made performance even worse, as all data and all execution tasks were being stored and run on a single node. I figured this meant node #1 was getting partitions 1 to 90, node #2 was getting 91 to 180, and node #3 was getting 181 to 271 (or some variant of this, without complete knowledge of the key hash algorithm and exactly how my int serverId translates to a partition number), so hashing serverIds 1, 2, and 3 resulted in, e.g., the oldest member getting all the data and execution tasks.
My next attempt was to set backup count to (member count) - 1 and enable backup reads. That improved things a little.
I then looked into ReplicatedMap, but it doesn't support indexing or predicates. One of my motivations for moving to Hazelcast was its more comprehensive support (and, from what I've seen, better performance) for indexing and querying map data.
I'm not convinced any of these are the right approaches (especially since mapping 3 node numbers to partition numbers doesn't match up to how partitions were intended to be used). Is there anything else I can look at that would provide this kind of layout, with one member being a preferred primary for data and still having readable backups on 1 or more other members after failure?
Thanks!
Data grids provide scalability: you can add or remove storage nodes to adjust capacity, and for this to work the grid needs to be able to rebalance the data load. Rebalancing means moving some of the data from one place to another. So, as a general rule, the placement of data is out of your control and may change while the grid runs.
Partition awareness will keep related items together; if they move, they move together. A runnable/callable that accesses both can then be satisfied from a single JVM, which is more efficient.
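A minimal sketch of such a key is below; the class and field names are made up, and the PartitionAware package is com.hazelcast.partition in recent Hazelcast versions (com.hazelcast.core in 3.x).

import java.io.Serializable;

import com.hazelcast.partition.PartitionAware;

public class ServerScopedKey implements PartitionAware<String>, Serializable {

    private final String serverId;  // e.g. "server-1"
    private final String entryId;

    public ServerScopedKey(String serverId, String entryId) {
        this.serverId = serverId;
        this.entryId = entryId;
    }

    // Every key sharing a serverId hashes to the same partition, so its map entries
    // (and any task submitted via executeOnKeyOwner with such a key) land on the
    // member that owns that partition.
    @Override
    public String getPartitionKey() {
        return serverId;
    }

    // equals() and hashCode() omitted for brevity; a real map key needs them.
}

A runnable/callable that works on those entries can then be sent to the same member with something like hz.getExecutorService("workers").executeOnKeyOwner(task, key), where task is a Serializable Runnable and "workers" is just an example executor name.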
There are two possible improvements if you really need data local to a particular node: read-backup-data or near-cache. See this answer.
Both or either will help reads, but not writes.
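As a rough sketch of those two read-side options (the map name is an assumption, and API details may vary a bit between Hazelcast versions): read-backup-data lets a member answer reads from its local backup copy, and a near cache keeps recently read entries on the reading member.

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class LocalReadConfig {

    public static void main(String[] args) {
        Config config = new Config();

        MapConfig mapConfig = config.getMapConfig("valuesByServer");
        mapConfig.setBackupCount(2);        // with 3 members, every member holds a copy
        mapConfig.setReadBackupData(true);  // serve reads from the local backup copy
        mapConfig.setNearCacheConfig(new NearCacheConfig().setInvalidateOnChange(true));

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}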
For a not very big amount of data, we store all keys in one bin as a List.
But there are limitations on the size of a bin.
The scanAll function with ScanCallback in the Java client actually works very slowly, so we cannot afford it in our project. Aerospike is fast when you give it the Key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them quickly without scanAll?
Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error; it just comes back with no value at that index. (A sketch of both approaches follows below.)
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.
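Here is a hedged sketch of both options using the Aerospike Java client; the namespace, set, bin, index names and keys are all assumptions.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class SmallSetLookup {

    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Option 1: if the key space is known, batch-read the keys directly.
        Key[] keys = {
                new Key("test", "mySet", "user-1"),
                new Key("test", "mySet", "user-2")
        };
        Record[] records = client.get(null, keys);  // missing keys come back as null entries

        // Option 2: write the set name into a bin, index it once, then query on it.
        // client.createIndex(null, "test", "mySet", "setname-idx", "setname", IndexType.STRING);
        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("mySet");
        stmt.setFilter(Filter.equal("setname", "mySet"));

        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                Record record = rs.getRecord();
                // process record...
            }
        } finally {
            rs.close();
        }

        client.close();
    }
}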
I'm hoping for some advice or suggestions on how best to handle multi threaded access to a value store.
My local value storage is designed to hold onto objects which are currently in use. If the object is not in use then it is removed from the store.
A value is pumped into my store via thread1, its entry into the store is announced to listeners, and the value is stored. Values coming in on thread1 will either be totally new values or updates for existing values.
A timer is used to periodically remove any value from the store which is not currently in use, so that all that remains of such a value is its ID, held locally by an intermediary.
Now, an active element on thread2 may wake up and try to access a set of values by passing a set of value IDs which it knows about. Some values will be stored already (great) and some may not (sadface). Those values which are not already stored will be retrieved from an external source.
My main issue is that items which have not already been stored, and are currently being queried for, may arrive on thread1 before the query is complete.
I'd like to try and avoid locking access to the store whilst a query is being made as it may take some time.
It seems that you are looking for some sort of cache. Did you try investigating existing cache implementations? Maybe one of them will do.
For example, Guava's cache implementation seems to cover a lot of your requirements: http://code.google.com/p/guava-libraries/wiki/CachesExplained
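For instance, here is a minimal sketch of how a Guava LoadingCache could back the store described above; ExternalSource, Value and the expiry period are assumptions. A value that is already cached is returned immediately, a missing one is loaded through the CacheLoader, and loading blocks only callers asking for that same key rather than the whole store.

import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class ValueStore {

    interface Value { }
    interface ExternalSource { Value fetch(String id); }  // assumed fetcher for missing values

    private final LoadingCache<String, Value> cache;

    public ValueStore(ExternalSource externalSource) {
        this.cache = CacheBuilder.newBuilder()
                .expireAfterAccess(10, TimeUnit.MINUTES)  // stands in for the "not in use" timer
                .build(new CacheLoader<String, Value>() {
                    @Override
                    public Value load(String id) {
                        return externalSource.fetch(id);
                    }
                });
    }

    // thread1: pushes new values / updates into the store.
    public void put(String id, Value value) {
        cache.put(id, value);
    }

    // thread2: asks for a set of IDs; cached ones are returned as-is, missing ones are loaded.
    public Map<String, Value> getAll(Iterable<String> ids) throws ExecutionException {
        return cache.getAll(ids);
    }
}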