I'm learning spring-batch. I'm currently working with biological data that look like this:
interface Variant {
    public String getChromosome();
    public int getPosition();
    public Set<String> getGenes();
}
(A Variant is a position on the genome which may overlap some genes.)
I've already written some ItemReaders/ItemWriters.
Now I would like to run some analysis per gene, so I would like to split my workflow by gene (gene1, gene2, ... geneN) and compute some statistics about all the variants linked to each gene.
What is the best way to implement a Partitioner for this (is it the correct class anyway)? All the examples I've seen use some 'indexes' or a finite gridSize. Furthermore, must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and rely on Spring Batch to run no more than gridSize jobs in parallel? And how can I join the data at the end?
thanks
EDIT: or maybe I should look at MultiResourceItemWriter?
When using Spring Batch's partitioning capabilities, there are two main classes involved, the Partitioner and the PartitionHandler.
Partitioner
The Partitioner interface is responsible for dividing up the data to be processed into partitions. It has a single method Partitioner#partition(int gridSize) that is responsible for analyzing the data that is to be partitioned and returning a Map with one entry per partition. The gridSize parameter is really just a piece of input into the overall calculation that can be used or ignored. For example, if the gridSize is 5, I may choose to return exactly 5 partitions, I may choose to overpartition and return some multiple of 5, or I may analyze the data and realize that I only need 3 partitions and completely ignore the gridSize value.
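For example, a per-gene Partitioner for the Variant use case could look roughly like the following (a minimal sketch; GeneRepository and the context key name are hypothetical placeholders for however you look up your genes):

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class GenePartitioner implements Partitioner {

    private final GeneRepository geneRepository; // hypothetical DAO that lists all gene names

    public GenePartitioner(GeneRepository geneRepository) {
        this.geneRepository = geneRepository;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // One partition per gene; the gridSize hint is deliberately ignored here.
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (String gene : geneRepository.findAllGeneNames()) {
            ExecutionContext context = new ExecutionContext();
            context.putString("gene", gene); // the worker step reads this key
            partitions.put("partition-" + gene, context);
        }
        return partitions;
    }
}

A step-scoped ItemReader in the worker step can then pick the gene up from the step execution context, e.g. with @Value("#{stepExecutionContext['gene']}").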
PartitionHandler
The PartitionHandler is responsible for the delegation of the partitions returned by the Partitioner to workers. Within the Spring ecosystem, there are three provided PartitionHandler implementations, a TaskExecutorPartitionHandler that delegates the work to threads internal to the current JVM, a MessageChannelPartitionHandler that delegates work to remote workers listening on some form of messaging middleware, and a DeployerPartitionHandler out of the Spring Cloud Task project that launches new workers dynamically to execute the provided partitions.
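For the common local-threads case, a rough configuration sketch (step and bean names are made up, and this assumes the Java-config step builders) could look like this; the builder wires a TaskExecutorPartitionHandler under the covers:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionConfig {

    @Bean
    public Step partitionedGeneStep(StepBuilderFactory steps,
                                    Partitioner genePartitioner,
                                    Step geneWorkerStep) {
        // Each partition runs the worker step on its own thread in this JVM.
        return steps.get("partitionedGeneStep")
                .partitioner("geneWorkerStep", genePartitioner)
                .step(geneWorkerStep)
                .gridSize(4) // the hint passed to Partitioner#partition(int)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}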
With all the above laid out, to answer your specific questions:
What is the best way to implement a Partitioner for this (is it the correct class anyway)? That typically depends on the data you're partitioning and the store it's in. Without further insight into how you are storing the gene data, I can't really comment on what the best approach is.
Must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and have Spring Batch run no more than gridSize jobs in parallel? You can return as many items in the Map as you see fit. As mentioned above, the gridSize is really meant as a guide.
How can I join the data at the end? A partitioned step is expected to have each partition processed independently of the others. If you want some form of join at the end, you'll typically do that in a step after the partitioned step.
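To make that last point concrete, a sketch of the job flow (continuing the hypothetical bean names above): the aggregation simply becomes the step that follows the partitioned step.

@Bean
public Job geneAnalysisJob(JobBuilderFactory jobs,
                           Step partitionedGeneStep,
                           Step aggregateResultsStep) {
    // The partitioned step fans out per gene; the aggregate step runs once
    // afterwards and joins/combines whatever the workers produced.
    return jobs.get("geneAnalysisJob")
            .start(partitionedGeneStep)
            .next(aggregateResultsStep)
            .build();
}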
Related
We have the following situation:
Existing topic with 9 partitions in Kafka contains multiple record types. These are partitioned according to a custom header (key = null) which is basically a string UUID.
Data is consumed via Kstreams, filtered by the type that interests us and repartitioned into a new topic containing only specific record types. The new topic contains 12 partitions and has key=<original id in header>. The increased partition count is to allow more consumers to process this data.
This is where things seem to get a little weird.
In the original topic, we have millions of the relevant records. In each of the 9 partitions, we see relatively monotonically increasing record times, which is to be expected as the partitions should be assigned relatively randomly due to the high cardinality of the partition key.
In the new topic, we're seeing something like the following:
Seemingly the record timestamps are jumping all over the place. Some discrepancies are to be expected seeing how the partitioning in the original (as well as the new) topic isn't exactly round-robin. We're seeing a few partitions in our original topic which have offsets that are ~1-2M higher/lower than others, but seeing how we have many millions of records of ingest daily, I can't explain the one record with a time stamp of 5/28/2022 between 6/17/2022 and 6/14/2022.
What could explain this behaviour?
Edit:
Looking at the consumer group offsets, I've found something interesting:
I was reingesting the data with multiple consumers and noted that they had severely different lags per partition. I don't quite understand why this discrepancy would be so large. Going to investigate further...
Edit:
To add some more detail, the workflow of the Streams app is as follows:
SpecificAvroSerde<MyEvent> specificAvroSerde = new SpecificAvroSerde<>();
specificAvroSerde.configure(
        Collections.singletonMap(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, SCHEMA_REGISTRY_URL),
        /* isKey */ false);

streamsBuilder
        .stream("events", Consumed.with(Serdes.Void(), Serdes.ByteArray()))
        .transform(new FilterByTypeHeaderTransformerSupplier(topicProperties))
        .transform(new MyEventAvroTransformerSupplier())
        .to(topicProperties.getOutputTopic(), Produced.with(Serdes.UUID(), specificAvroSerde));
where the FilterByTypeHeaderTransformerSupplier instantiates a transformer that does, in essence:
public KeyValue<Void, byte[]> transform(Void key, byte[] value) {
    // checks record headers
    if (matchesFilter()) {
        return KeyValue.pair(key, value);
    }
    // skip since it is not an event type that interests us
    return null;
}
while the other transformer does the following (which doesn't have great performance but does the job for now):
public KeyValue<UUID, MyAvroEvent> transform(Void key, byte[] value) {
    MyEvent event = objectMapper.readValue(value, MyEvent.class);
    MyAvroEvent avroRecord = serializeAsAvro(event);
    return KeyValue.pair(event.getEventId(), avroRecord);
}
hence I use the default timestamp extractor (FailOnInvalidTimestamp).
Most notably, as can be seen, I'm adding a key to this record; however, this key is the same one that was previously used to partition the data into the existing 9 partitions.
I'll try removing this key first to see if the behaviour changes, but I'm kind of doubtful that that's the reason, especially since it's the same partition key value that was used previously.
I still haven't found the reason for the wildly differing consumer offsets, unfortunately. I very much hope that I don't have to have a single consumer reprocess this once to catch up, since that would take a very long time...
Edit 2:
I believe I've found the cause of this discrepancy. The original records were produced using Spring Cloud Stream - these records included headers such as "scst_partition=4". However, the hashing that was used by the producer back then was Java-based (e.g. "keyAsString".hashCode() % numPartitions), while the Kafka clients use:
Utils.toPositive(Utils.murmur2(keyAsBytes))
As a result, we're seeing behaviour where records in e.g. source partition 0 could land in any one of the new partitions. Hence, small discrepancies in the source distribution could lead to rather large fluctuations in record ordering in the new partitions.
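To make the difference concrete, here is a rough comparison sketch (the key string and partition count are just for illustration, and the UTF-8 bytes stand in for whatever bytes the configured key serializer actually produces):

import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class PartitioningComparison {
    public static void main(String[] args) {
        String key = "3f1f0c1e-8a2b-4a47-9c5d-1234567890ab"; // example UUID string
        int numPartitions = 12;

        // Java-based hashing as described above (abs() only to keep the sketch non-negative)
        int javaHashPartition = Math.abs(key.hashCode()) % numPartitions;

        // Default Kafka producer partitioner: murmur2 over the serialized key bytes
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int kafkaDefaultPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        System.out.println(javaHashPartition + " vs " + kafkaDefaultPartition);
    }
}

The two values generally differ, which is why records from one source partition end up spread across arbitrary target partitions.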
I'm not quite sure how to deal with this in a sensible manner. Currently I've tried using a simple round-robin partitioning in the target topic to see if the distribution is a bit more even in that case.
The reason why this is a problem is that this data will be put on an object storage via e.g. Kafka Connect. If I want this data stored in e.g. a daily format, then old data coming in all the time would cause buffers that should've been closed a long time ago to be kept open, increasing memory consumption. It doesn't make sense to use any kind of windowing for late data in this case, seeing how it's not a real-time aggregation but simply consumption of historical data.
Ideally for the new partitioning I'd want something like: given the number of partitions in the target topic is a multiple of the number of partitions in the source topic, have records in partition 0 go to either partition 0 or 9, from 1 to either 1 or 10, etc. (perhaps even randomly)
This would require some more work in the form of a custom partitioner, but I can't foresee if this would cause other problems down the line.
I've also tried setting the partition Id header ("kafka_partitionId" - as far as I know, documentation here isn't quite easy to find) but it is seemingly not used.
I'll investigate a bit further...
Final edit:
For what it's worth, the problem boiled down to the following two issues:
My original data, written by Spring Cloud Stream, was partitioned differently than how a vanilla Kafka producer (which Kafka Streams uses internally) would partition it. This led to data jumping all over the place from a "record-time" point of view.
Due to the above, I had to choose a number of partitions that is a multiple of the previous number of partitions, as well as use a custom partitioner which does the partitioning the "Spring Cloud Stream" way (sketched below).
The requirement that the new number be a multiple of the previous one is a result of modular arithmetic. If I wished to have deterministic partitioning for my existing data, having a multiple would allow data to go into one of two possible new partitions as opposed to only one as in the previous case.
E.g. with 9 -> 18 partitions:
id 1 -> previously hashed to partition 0, now hashes to either 0 or 9 (mod 18)
id 2 -> previously hashed to partition 1, now hashes to either 1 or 10 (mod 18)
Hence my requirement for higher parallelism is met and the data inside a single partition is ordered as desired, since a target partition is only supplied from at most one source partition.
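For reference, a minimal sketch of the kind of custom partitioner described above (it simply reproduces the hashCode-based placement over the key's String form; whether this matches your original producer's exact algorithm is something to verify):

import java.util.UUID;

import org.apache.kafka.streams.processor.StreamPartitioner;

public class LegacyHashStreamPartitioner implements StreamPartitioner<UUID, MyAvroEvent> {

    @Override
    public Integer partition(String topic, UUID key, MyAvroEvent value, int numPartitions) {
        // Reproduce the Java-hashCode-based placement the original producer used.
        return Math.abs(key.toString().hashCode()) % numPartitions;
    }
}

It can then be passed as the third argument to Produced.with(Serdes.UUID(), specificAvroSerde, new LegacyHashStreamPartitioner()) in the .to() call shown earlier.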
I'm sure there might have been a simpler way to go about this all, but this works for now.
For further context/info, see also this Q&A.
I have a clustered system set up with Hazelcast to store my data. Each node in the cluster is responsible for connecting to a service on localhost and piping data from this service into the Hazelcast cluster.
I would like this data to be stored primarily on the node that received it, and also processed on that node. I'd like the data to be readable and writable on other nodes with moderately less performance requirements.
I started with a naive implementation that does exactly as I described with no special considerations. I noticed performance suffered quite a bit (we had a separate implementation using Infinispan to compare it with). Generally speaking, there is little logical intersection between the data I'm processing from each individual service. It's stored in a Hazelcast cluster so it can be read and occasionally written from all nodes and for failover scenarios. I still need to read the last good state of the failed node if either the Hazelcast member fails on that node or the local service fails on that node.
So my first attempt at co-locating the data and reducing network chatter was to key much of the data with a serverId (number from 1 to 3 on, say, a 3-node system) and include this in the key. The key then implements PartitionAware. I didn't notice an improvement in performance so I decided to execute the logic itself on the cluster and key it the same way (with a PartitionAware/Runnable submitted to a DurableExecutorService). I figured if I couldn't select which member the logic could be processed on, I could at least execute it on the same member consistently and co-located with the data.
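In essence the key looked something like the following (a sketch; the names are made up, and the PartitionAware package is com.hazelcast.partition in Hazelcast 4+/5 but com.hazelcast.core in 3.x):

import java.io.Serializable;

import com.hazelcast.partition.PartitionAware;

public class ServerScopedKey implements Serializable, PartitionAware<Integer> {

    private final int serverId;   // 1..3 on a 3-node system
    private final String entryId; // the actual business identifier

    public ServerScopedKey(int serverId, String entryId) {
        this.serverId = serverId;
        this.entryId = entryId;
    }

    @Override
    public Integer getPartitionKey() {
        // Every entry with the same serverId hashes to the same partition,
        // and therefore to whichever member happens to own that partition.
        return serverId;
    }

    // equals()/hashCode() over serverId + entryId omitted for brevity
}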
That made performance even worse, as all data and all execution tasks were being stored and run on a single node. I figured this meant node #1 was getting partitions 1 to 90, node #2 was getting 91 to 180, and node #3 was getting 181 to 271 (or some variant of this, without complete knowledge of the key hash algorithm and exactly how my int serverId translates to a partition number). So hashing serverId 1, 2, 3 resulted in e.g. the oldest member getting all the data and execution tasks.
My next attempt was to set backup count to (member count) - 1 and enable backup reads. That improved things a little.
I then looked into ReplicatedMap but it doesn't support indexing or predicates. One of my motivations to moving to Hazelcast was its more comprehensive support (and, from what I've seen, better performance) for indexing and querying map data.
I'm not convinced any of these are the right approaches (especially since mapping 3 node numbers to partition numbers doesn't match up to how partitions were intended to be used). Is there anything else I can look at that would provide this kind of layout, with one member being a preferred primary for data and still having readable backups on 1 or more other members after failure?
Thanks!
Data grids provide scalability: you can add or remove storage nodes to adjust capacity, and for this to work the grid needs to be able to rebalance the data load. Rebalancing means moving some of the data from one place to another. So as a general rule, the placement of data is out of your control and may change while the grid runs.
Partition awareness will keep related items together; if they move, they move together. A runnable/callable accessing both can do so from the one JVM, so it will be more efficient.
There are two possible improvements if you really need data local to a particular node: read-backup-data or near-cache. See this answer.
Both or either will help reads, but not writes.
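For completeness, a minimal sketch of those two knobs in programmatic config (the map name is made up):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.NearCacheConfig;

public class LocalReadsConfig {

    public static Config build() {
        Config config = new Config();

        MapConfig mapConfig = config.getMapConfig("serviceData");
        // Allow gets to be served from a local backup replica instead of the owner.
        mapConfig.setBackupCount(2);
        mapConfig.setReadBackupData(true);
        // Or keep a near-cache of recently read entries on each member.
        mapConfig.setNearCacheConfig(new NearCacheConfig());

        return config;
    }
}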
In a JSR-352 batch I want to use partitioning. I can define the number of partitions via configuration or implement a PartitionMapper to do that.
Then, there are the JobContext and StepContext injectables to provide context information to my processing. However, there is no PartitionContext or the like which maintains and provides details about the partition I'm running in.
Hence the question:
How do I tell each partitioned instance of a chunk which partition it is running in so that its ItemReader can read only those items which belong to that particular partition?
If I don't do that, each partition would perform the same work on the same data instead of splitting up the input data set into n distinct partitions.
I know I can store some ID in the partition plan's properties which I can then use to set another property in the step's configuration like <property name="partitionId" value="#{partitionPlan['partitionId']}" />. But this seems overly complicated and fragile because I'd have to know the name of the property from the partition plan and must remember to always set another property to this value for each step.
Isn't there another, clean, standard way to provide partition information to steps?
Or, how else should I be splitting work by partitions and assign it to different ItemReader instances in the same partitioned chunk?
Update:
It appears that jberet has the org.jberet.cdi.PartitionScoped CDI scope, but it's not part of the JSR standard.
When defining a partition with either a partition plan (XML) or a partition mapper (programmatic), include this information as partition properties, and then reference these partition properties within item reader/processor/writer properties.
This is the standard way to tell the item reader and other batch artifacts what resource to handle, where to begin, and where to end. It is not much different from non-partitioned chunk configuration, where you also need to configure the source and range of input data with batch properties.
For example, please see org.jberet.test.chunkPartitionFailComplete.xml from one of the jberet test apps.
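For illustration (names made up): each partition in the plan can define e.g. <property name="firstRow" value="0"/> and <property name="lastRow" value="499"/>, the reader element then declares <property name="firstRow" value="#{partitionPlan['firstRow']}"/> (and likewise for lastRow), and the reader picks the values up through standard batch-property injection. A minimal sketch of such a reader:

import java.io.Serializable;

import javax.batch.api.BatchProperty;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Inject;
import javax.inject.Named;

@Named
public class RangeItemReader extends AbstractItemReader {

    @Inject
    @BatchProperty(name = "firstRow")
    private String firstRow; // resolved per partition from the plan

    @Inject
    @BatchProperty(name = "lastRow")
    private String lastRow;

    private int current;

    @Override
    public void open(Serializable checkpoint) {
        current = Integer.parseInt(firstRow);
    }

    @Override
    public Object readItem() {
        // Each partition only reads its own [firstRow, lastRow] slice of the input.
        if (current > Integer.parseInt(lastRow)) {
            return null; // end of this partition's data
        }
        return current++;
    }
}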
I have 5-partitions-RDD and 5 workers/executors.
How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?
Am I right if I say Spark can save a few partitions on one worker and 0 partitions on other workers?
That is, I can specify the number of partitions, but Spark can still cache everything on a single node.
Replication is not an option since RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterwards Spark will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that from Java; you need to write Scala code, or at least wrap the Scala internals with Java code. For example:
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {

  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
SparkContext's makeRDD
SparkContext has a makeRDD method. This method lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations will be discarded on the first shuffle/join/cogroup operation.
Both approaches have the drawback that a too-high spark.locality.wait can cause your cluster to starve if some of the nodes become unavailable.
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. Each sales-XXX.parquet could vary from a few KB to a few GB. All sales-XXX.parquet files together could take up to tens or hundreds of GB in HDFS.
I need a full-text search through all sales. I have to index each sales-XXX.parquet one-by-one with Lucene. And now I have two options:
Keep the Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
Keep the Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires that each worker node keep an equal amount of data. How could I ensure Spark keeps an equal amount of data on each worker node?
I'm learning Hadoop using the book Hadoop in Practice, and while reading chapter 1 I came across this diagram:
From the Hadoop docs (http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html):
1. Shuffle
Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
2. Sort
The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
While I understand that shuffle and sort happen at the same time, it's not clear to me how the framework decides which reducer receives which mapper output. From the docs, it seems that each reducer has a way to know which map output to collect, but I can't understand how.
So my question is: given the mapper output above, is the final result always the same for each reducer? If so, what are the steps to achieve this result?
Thanks for any clarifications!
It is the Partitioner that decides how to distribute the output of mappers to different reducers.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
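For illustration, the default behaviour corresponds to a hash partitioner along these lines (a simplified sketch of what Hadoop's HashPartitioner does; the key/value types here are just an example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // The same key always yields the same partition number, and therefore
        // goes to the same reducer, regardless of which mapper emitted it.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A custom implementation can be plugged in with job.setPartitionerClass(WordPartitioner.class); otherwise the default HashPartitioner is used.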