We have the following situation:
An existing Kafka topic with 9 partitions contains multiple record types. Records are partitioned according to a custom header (the record key is null), which is essentially a string UUID.
Data is consumed via Kafka Streams, filtered down to the record types that interest us, and repartitioned into a new topic containing only those types. The new topic has 12 partitions and key = <original id from the header>. The increased partition count is meant to allow more consumers to process this data.
This is where things seem to get a little weird.
In the original topic, we have millions of the relevant records. In each of the 9 partitions, we see roughly monotonically increasing record timestamps, which is to be expected, since records should be spread across partitions fairly evenly due to the high cardinality of the partition key.
In the new topic, we're seeing something like the following:
Seemingly, the record timestamps jump all over the place. Some discrepancies are to be expected, given that the partitioning in the original (as well as the new) topic isn't exactly round-robin. We do see a few partitions in the original topic whose offsets are ~1-2M higher/lower than others, but given that we ingest many millions of records daily, I can't explain a record with a timestamp of 5/28/2022 sitting between 6/17/2022 and 6/14/2022.
What could explain this behaviour?
Edit:
Looking at the consumer group offsets, I've found something interesting:
I was reingesting the data with multiple consumers and noticed that they have severely different lags per partition. I don't quite understand why this discrepancy would be so large. Going to investigate further...
Edit:
To add some more detail, the workflow of the Streams app is as follows:
    SpecificAvroSerde<MyEvent> specificAvroSerde = new SpecificAvroSerde<>();
    specificAvroSerde.configure(
            Collections.singletonMap(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, SCHEMA_REGISTRY_URL),
            /* isKey */ false);

    streamsBuilder
            .stream("events", Consumed.with(Serdes.Void(), Serdes.ByteArray()))
            .transform(new FilterByTypeHeaderTransformerSupplier(topicProperties))
            .transform(new MyEventAvroTransformerSupplier())
            .to(topicProperties.getOutputTopic(), Produced.with(Serdes.UUID(), specificAvroSerde));
where the FilterByTypeHeaderTransformerSupplier instantiates a transformer that does, in essence:
    public KeyValue<Void, byte[]> transform(Void key, byte[] value) {
        // checks record headers
        if (matchesFilter()) {
            return KeyValue.pair(key, value);
        }
        // skip since it is not an event type that interests us
        return null;
    }
while the other transformer does the following (which doesn't have great performance but does the job for now):
    public KeyValue<UUID, MyAvroEvent> transform(Void key, byte[] value) {
        try {
            MyEvent event = objectMapper.readValue(value, MyEvent.class);
            MyAvroEvent avroRecord = serializeAsAvro(event);
            return KeyValue.pair(event.getEventId(), avroRecord);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // readValue throws a checked IOException
        }
    }
hence I use the default timestamp extractor (FailOnInvalidTimestamp).
Most notably, as can be seen, I'm now setting a key on these records; however, it is the same value that was previously used (via the header) to partition the data across the existing 9 partitions.
I'll try removing this key first to see if the behaviour changes, but I'm doubtful that that's the reason, especially since it's the same partition key value that was used before.
I still haven't found the reason for the wildly differing consumer offsets, unfortunately. I very much hope I won't have to have a single consumer reprocess all of this to catch up, since that would take a very long time...
Edit 2:
I believe I've found the cause of this discrepancy. The original records were produced using Spring Cloud Stream, and included headers such as e.g. "scst_partition=4". However, the hashing that the producer used back then was Java-based (e.g. "keyAsString".hashCode() % numPartitions), while the Kafka clients use:

    Utils.toPositive(Utils.murmur2(keyAsBytes))
As a result, we're seeing behaviour where records from e.g. source partition 0 can land in any of the new partitions. Hence, small discrepancies in the source distribution can lead to rather large fluctuations in record ordering within the new partitions.
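To make the mismatch concrete, here is a minimal sketch (class name and sample key are made up) contrasting the Java-hashCode-based scheme described above with the murmur2-based scheme that the vanilla Kafka producer applies to the serialized key bytes:

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class PartitioningComparison {
        public static void main(String[] args) {
            String key = "0a1b2c3d-0000-0000-0000-000000000000"; // illustrative UUID string
            int sourcePartitions = 9;
            int targetPartitions = 12;

            // Java-hashCode-based scheme used when the records were originally produced
            // (made non-negative here for illustration)
            int javaHashPartition = Math.abs(key.hashCode()) % sourcePartitions;

            // DefaultPartitioner-style scheme used by the Kafka clients (murmur2 over key bytes)
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            int murmurPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % targetPartitions;

            System.out.printf("javaHash=%d murmur2=%d%n", javaHashPartition, murmurPartition);
        }
    }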
I'm not quite sure how to deal with this in a sensible manner. For now, I've tried simple round-robin partitioning in the target topic to see whether the distribution is more even in that case.
The reason this is a problem is that the data will be written to object storage via e.g. Kafka Connect. If I want the data stored in e.g. a daily layout, then old data trickling in all the time would keep buffers open that should have been closed long ago, increasing memory consumption. It doesn't make sense to use any kind of windowing for late data here, since this isn't a real-time aggregation but simply consumption of historical data.
Ideally, for the new partitioning I'd want something like: given that the number of partitions in the target topic is a multiple of the number in the source topic, records from source partition 0 go to either partition 0 or 9, from partition 1 to either 1 or 10, and so on (perhaps even chosen randomly between the two).
This would require some more work in the form of a custom partitioner (a sketch of such a partitioner appears in the final edit below), but I can't foresee whether it would cause other problems down the line.
I've also tried setting the partition id header ("kafka_partitionId" - as far as I know, documentation on this isn't easy to find), but it is seemingly not used.
I'll investigate a bit further...
Final edit:
For what it's worth, the problem boiled down to the following two issues:
My original data, written by Spring Cloud Stream, was partitioned differently than how a vanilla Kafka producer (which Kafka Streams uses internally) would partition it. This led to data jumping all over the place from a "record-time" point of view.
Because of that, I had to choose a number of partitions that is a multiple of the previous number of partitions, as well as use a custom partitioner that does it the "Spring Cloud Stream" way.
The requirement that the new number be a multiple of the previous one is a result of modular arithmetic. If I want deterministic partitioning for my existing data, using a multiple means that a key which previously hashed to a particular partition can now only end up in one of two well-defined new partitions, instead of exactly one as before.
E.g. with 9 -> 18 partitions:
id 1 -> previously hashed to partition 0, now hashes to either 0 or 9 (mod 18)
id 2 -> previously hashed to partition 1, now hashes to either 1 or 10 (mod 18)
Hence my requirement for higher parallelism is met, and the data inside a single partition is ordered as desired, since a target partition is only fed from at most one source partition.
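For what it's worth, a custom partitioner mimicking the Spring Cloud Stream scheme could look roughly like the following (the class name is made up, and MyAvroEvent is the Avro type from the topology above); this is a sketch of the idea rather than the exact code used:

    import java.util.UUID;
    import org.apache.kafka.streams.processor.StreamPartitioner;

    public class SpringCloudStreamStylePartitioner implements StreamPartitioner<UUID, MyAvroEvent> {

        @Override
        public Integer partition(String topic, UUID key, MyAvroEvent value, int numPartitions) {
            // Same Java-hashCode-based scheme the original producer used, so with
            // 18 = 2 x 9 partitions each target partition is fed by at most one source partition.
            return Math.abs(key.toString().hashCode()) % numPartitions;
        }
    }

It would then be plugged into the topology via Produced.with(Serdes.UUID(), specificAvroSerde).withStreamPartitioner(new SpringCloudStreamStylePartitioner()).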
I'm sure there might have been a simpler way to go about this all, but this works for now.
For further context/info, see also this Q&A.
Related
I have a clustered system set up with Hazelcast to store my data. Each node in the cluster is responsible for connecting to a service on localhost and piping data from this service into the Hazelcast cluster.
I would like this data to be stored primarily on the node that received it, and also processed on that node. I'd like the data to be readable and writable on other nodes, with somewhat relaxed performance requirements there.
I started with a naive implementation that does exactly as I described with no special considerations. I noticed performance suffered quite a bit (we had a separate implementation using Infinispan to compare it with). Generally speaking, there is little logical intersection between the data I'm processing from each individual service. It's stored in a Hazelcast cluster so it can be read and occasionally written from all nodes and for failover scenarios. I still need to read the last good state of the failed node if either the Hazelcast member fails on that node or the local service fails on that node.
So my first attempt at co-locating the data and reducing network chatter was to key much of the data with a serverId (number from 1 to 3 on, say, a 3-node system) and include this in the key. The key then implements PartitionAware. I didn't notice an improvement in performance so I decided to execute the logic itself on the cluster and key it the same way (with a PartitionAware/Runnable submitted to a DurableExecutorService). I figured if I couldn't select which member the logic could be processed on, I could at least execute it on the same member consistently and co-located with the data.
That made performance even worse, as all data and all execution tasks were being stored and run on a single node. I figured this meant node #1 was getting partitions 1 to 90, node #2 was getting 91 to 180, and node #3 was getting 181 to 271 (or some variant of this, without complete knowledge of the key hash algorithm and exactly how my int serverId translates to a partition number). So hashing serverIds 1, 2, and 3 resulted in e.g. the oldest member getting all the data and execution tasks.
My next attempt was to set backup count to (member count) - 1 and enable backup reads. That improved things a little.
I then looked into ReplicatedMap, but it doesn't support indexing or predicates. One of my motivations for moving to Hazelcast was its more comprehensive support (and, from what I've seen, better performance) for indexing and querying map data.
I'm not convinced any of these are the right approaches (especially since mapping 3 node numbers to partition numbers doesn't match up to how partitions were intended to be used). Is there anything else I can look at that would provide this kind of layout, with one member being a preferred primary for data and still having readable backups on 1 or more other members after failure?
Thanks!
Data grids provide scalability: you can add or remove storage nodes to adjust capacity, and for this to work the grid needs to be able to rebalance the data load. Rebalancing means moving some of the data from one place to another. So, as a general rule, the placement of data is out of your control and may change while the grid runs.
Partition awareness will keep related items together; if they move, they move together. A runnable/callable accessing both can then do so from within a single JVM, so it will be more efficient.
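For illustration, a PartitionAware key along the lines of what the question describes might look like this (a hedged sketch; class and field names are assumptions, and the PartitionAware package location may differ between Hazelcast versions):

    import java.io.Serializable;
    import com.hazelcast.partition.PartitionAware;

    public class ServerScopedKey implements PartitionAware<Integer>, Serializable {

        private final int serverId;   // grouping attribute, e.g. 1..3
        private final String entryId; // identifier of the actual entry

        public ServerScopedKey(int serverId, String entryId) {
            this.serverId = serverId;
            this.entryId = entryId;
        }

        @Override
        public Integer getPartitionKey() {
            // Hazelcast routes the entry by this value, so everything sharing a
            // serverId lands on the same partition (and therefore the same member).
            return serverId;
        }

        // equals() and hashCode() over both fields omitted for brevity
    }

Note that with only a handful of distinct partition-key values there is no guarantee that those few partitions spread evenly across members, which matches the behaviour described in the question.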
There are two possible improvements if you really need data local to a particular node, read-backup-data or near-cache. See this answer.
Both or either will help reads, but not writes.
Right now I have one stream application in SCDF that is pulling data from multiple tables in a database and replicating it to another database. Currently, our goal is to reduce the amount of work that a given stream is doing, so we want to split the stream out into multiple streams and continue replicating the data into the second database.
Are there any recommended design patterns for funneling the processing of these various streams into one?
If I understand this requirement correctly, you'd want to split the ingest piece by DB/Table per App and then merge them all into a single "payload type" for downstream processing.
If you really do want to split the ingest by DB/Table, you can, but you may want to weigh the pros and cons. One obvious benefit is granularity: you can update each App independently and in isolation, and perhaps reuse it. Of course, it brings other challenges as well: maintenance, fixes, and releases for the individual apps, to name a few.
That said, you can fan-in data to a single consumer. Here's an example:
    foo1 = jdbc | transform | hdfs
    foo2 = jdbc > :foo1.jdbc
    foo3 = jdbc > :foo1.jdbc
    foo4 = jdbc > :foo1.jdbc
Here, foo1 is the primary pipeline reading data from a particular DB/Table combination. Likewise, foo2, foo3, and foo4 could read from other DB/Table combinations. However, these 3 streams are writing the consumed data to a named-destination, which in this case happens to be foo1.jdbc (aka: topic name). This destination is automatically created by SCDF when deploying the foo1 pipeline; specifically to connect "jdbc" and "transform" Apps with the foo1.jdbc topic.
In summary, we are routing the different tables' data to land in the same destination, so the downstream App (in this case the transform processor) gets the data from the different tables.
If the correlation of data is important, you can partition the data at the producer by a unique key (e.g., customer-id = 1001) at each jdbc source, so that context-specific information lands at the same transform processor instance (assuming you have "n" processor instances for scaled-out processing).
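For illustration, the deployment could declare the partition key roughly like this (property names follow SCDF's deployment-property conventions; the payload field and instance count are assumptions):

    stream deploy foo1 --properties "deployer.transform.count=3,app.jdbc.producer.partitionKeyExpression=payload.customerId"

With that in place, records sharing the same customerId are always routed to the same transform instance.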
I'm learning spring-batch. I'm currently working with biological data that look like this:
    interface Variant {
        public String getChromosome();
        public int getPosition();
        public Set<String> getGenes();
    }
(A Variant is a position on the genome which may overlap some genes.)
I've already written some ItemReaders/ItemWriters.
Now I would like to run some analysis per gene. Thus I would like to split my workflow for each gene (gene1, gene2,... geneN) to do some statistics about all the variants linked to one gene.
What is the best way to implement a Partitioner for this (and is it the correct class anyway)? All the examples I've seen use some 'indexes' or a finite gridSize. Furthermore, must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and will Spring Batch run no more than gridSize jobs in parallel? And how can I join the data at the end?
thanks
EDIT: or maybe I should look at MultiResourceItemWriter?
When using Spring Batch's partitioning capabilities, there are two main classes involved, the Partitioner and the PartitionHandler.
Partitioner
The Partitioner interface is responsible for dividing up the data to be processed into partitions. It has a single method Partitioner#partition(int gridSize) that is responsible for analyzing the data that is to be partitioned and returning a Map with one entry per partition. The gridSize parameter is really just a piece of input into the overall calculation that can be used or ignored. For example, if the gridSize is 5, I may choose to return exactly 5 partitions, I may choose to overpartition and return some multiple of 5, or I may analyze the data and realize that I only need 3 partitions and completely ignore the gridSize value.
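To make that concrete for the gene use case in the question, a Partitioner could produce one partition per gene and ignore the gridSize entirely. The sketch below assumes the set of gene names is obtained elsewhere (class and key names are illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    public class GenePartitioner implements Partitioner {

        private final Set<String> geneNames;

        public GenePartitioner(Set<String> geneNames) {
            this.geneNames = geneNames;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            // one partition per gene; gridSize is deliberately ignored here
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (String gene : geneNames) {
                ExecutionContext context = new ExecutionContext();
                // each worker step can read this via step-scoped late binding,
                // e.g. @Value("#{stepExecutionContext['gene']}")
                context.putString("gene", gene);
                partitions.put("partition-" + gene, context);
            }
            return partitions;
        }
    }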
PartitionHandler
The PartitionHandler is responsible for the delegation of the partitions returned by the Partitioner to workers. Within the Spring ecosystem, there are three provided PartitionHandler implementations, a TaskExecutorPartitionHandler that delegates the work to threads internal to the current JVM, a MessageChannelPartitionHandler that delegates work to remote workers listening on some form of messaging middleware, and a DeployerPartitionHandler out of the Spring Cloud Task project that launches new workers dynamically to execute the provided partitions.
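For example, the thread-based option can be wired through the step builder roughly as follows (bean and step names are assumptions; the builder creates a TaskExecutorPartitionHandler under the hood when a TaskExecutor is supplied):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    @Configuration
    public class PartitionedStepConfig {

        @Bean
        public Step partitionedStep(StepBuilderFactory stepBuilderFactory,
                                    Step workerStep,
                                    Partitioner genePartitioner) {
            return stepBuilderFactory.get("partitionedStep")
                    // each partition's ExecutionContext is handed to a separate execution of workerStep
                    .partitioner("workerStep", genePartitioner)
                    .step(workerStep)
                    .gridSize(4) // passed to Partitioner#partition as a hint
                    .taskExecutor(new SimpleAsyncTaskExecutor())
                    .build();
        }
    }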
With all the above laid out, to answer your specific questions:
What is the best way to implement a Partitioner for this (and is it the correct class anyway)? That typically depends on the data you're partitioning and the store it's in. Without further insight into how you are storing the gene data, I can't really comment on what the best approach is.
Must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and will Spring Batch run no more than gridSize jobs in parallel? You can return as many items in the Map as you see fit. As mentioned above, the gridSize is really just meant as a guide.
How can I join the data at the end? A partitioned step is expected to have each partition processed independently of the others. If you want some form of join at the end, you'll typically do that in a step after the partitioned step.
I have 5-partitions-RDD and 5 workers/executors.
How can I ask Spark to save each RDD's partition on the different worker (IP)?
Am I right in saying that Spark can save a few partitions on one worker and zero partitions on other workers?
Means, I can specify the number of partitions, but Spark still can cache everything on a single node.
Replication is not an option since RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterwards Spark will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that from Java; you need to write Scala code, or at least Scala internals wrapped with Java code, i.e.:
    class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {

      val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(nodeIPs(split.index % nodeIPs.length))
    }
SparkContext's makeRDD
SparkContext has a makeRDD method. This method lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations will be discarded on the first shuffle/join/cogroup operation.
Both approaches share the drawback that a too-high spark.locality.wait can cause your cluster to starve if some of the nodes become unavailable.
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. Each sales-XXX.parquet can vary from a few KB to a few GB. All sales-XXX.parquet files together can take up to tens or hundreds of GB on HDFS.
I need a full-text search through all sales. I have to index each sales-XXX.parquet one-by-one with Lucene. And now I have two options:
Keep the Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
Keep the Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires that each worker node keeps an equal amount of data. How can I ensure that Spark keeps an equal amount of data on each worker node?
Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job searches large datasets for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over sorted data and check whether adjacent values (for the same key) are increasing in at least N consecutive observations, without resorting to the groupByKey method?
I have designed an algorithm to do it with reduceByKey, but there is one problem: reduce seems to ignore data ordering and yields completely wrong results at the end.
Any ideas appreciated.
There are a few ways you can approach this problem:
repartitionAndSortWithinPartitions with a custom partitioner and ordering (see the sketch after this list):
keyBy (name, timestamp) pairs
create custom partitioner which considers only the name
repartitionAndSortWithinPartitions using custom partitioner
use mapPartitions to iterate over data and yield matching sequences
sortByKey - this is similar to the first solution but provides higher granularity at the cost of additional post-processing.
keyBy (name, timestamp) pairs
sortByKey
process individual partitions using mapPartitionsWithIndex keeping track of leading / trailing patterns for each partition
adjust the final results to include patterns which span more than one partition
create fixed-size windows over sorted data using sliding from mllib.rdd.RDDFunctions:
sortBy (name, timestamp)
create sliding RDD and filter windows which cover multiple names
check if any window contains desired pattern.
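For what it's worth, a rough sketch of the first approach using Spark's Java API could look like the following (the record layout, class names, and partition count are assumptions, and the final mapPartitions pass that extracts the increasing runs is omitted):

    import java.io.Serializable;
    import java.util.Comparator;

    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class IncreasingRunsSketch {

        // Partitioner that considers only the name part of the (name, timestamp) key.
        static class NamePartitioner extends Partitioner {
            private final int numPartitions;
            NamePartitioner(int numPartitions) { this.numPartitions = numPartitions; }

            @Override public int numPartitions() { return numPartitions; }

            @Override public int getPartition(Object key) {
                @SuppressWarnings("unchecked")
                Tuple2<String, Long> k = (Tuple2<String, Long>) key;
                return Math.abs(k._1().hashCode() % numPartitions);
            }
        }

        // Serializable comparator: order by name, then by timestamp.
        static class NameThenTime implements Comparator<Tuple2<String, Long>>, Serializable {
            @Override public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
                int byName = a._1().compareTo(b._1());
                return byName != 0 ? byName : Long.compare(a._2(), b._2());
            }
        }

        static JavaPairRDD<Tuple2<String, Long>, Double> sorted(
                JavaPairRDD<Tuple2<String, Long>, Double> keyed, int partitions) {
            // All records of one name end up in the same partition, sorted by timestamp;
            // a subsequent mapPartitions pass can then detect the increasing runs.
            return keyed.repartitionAndSortWithinPartitions(new NamePartitioner(partitions),
                                                            new NameThenTime());
        }
    }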