I've got an RDD in Spark which I've cached. Before I cache it, I repartition it. This works, and I can see in the Storage tab of the Spark UI that it has the expected number of partitions.
This is what the stages look like on subsequent runs:
It's skipping a bunch of the work that went into building my cached RDD, which is great. What I'm wondering, though, is why Stage 18 starts with a repartition. You can see that it's already done at the end of Stage 17.
The steps I do in the code are:
List<Tuple2<String, Integer>> rawCounts = rdd
        .flatMap(...)
        .mapToPair(...)
        .reduceByKey(...)
        .collect();
To get the RDD, I grab it out of the session context. I also have to wrap it since I'm using Java:
JavaRDD<...> javaRdd = sc.emptyRDD();
return javaRdd.wrapRDD((RDD<...>)rdd);
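For reference, a minimal sketch of the repartition-and-cache step described above (the partition count, element type, and variable names are placeholders, not taken from the actual job):

// Hedged sketch only: repartition first, then cache, then reuse the cached RDD.
JavaRDD<MyRecord> cached = javaRdd
        .repartition(100)   // placeholder partition count
        .cache();           // appears under the Storage tab once an action materializes it
// the flatMap/mapToPair/reduceByKey/collect pipeline above then runs against `cached`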
Edit
I don't think this is specific to repartitioning. I've removed the repartitioning, and now I'm seeing some of the other operations I do prior to caching appearing after the skipped stages. E.g.
The green dot and everything before it should have already been worked out and cached.
Related
I'm using the Java API to run queries. And I understand that the QueryResults object that's returned by datastore.run() uses a lazy-loading iterator, so the time to iterate through all the results is quite long when retrieving a large set of results.
I'm already using a Cursor for most operations where paging is a possibility, and that works around the issue in those cases. I'm also using datastore.get() instead of queries whenever I know the entity keys in advance (and with that method, I can manually separate the query into smaller chunks, and run those in parallel using Kotlin coroutines).
However, there are several cases where I have to use a query, and I also need to get all the results at once because there's some back-end processing involved with those results. And in those cases, iterating through all the results becomes pretty time-intensive. The dataset is relatively small now (around 5,000 entities), but it'll grow progressively higher, so I'd like to set up a better solution than just brute-force iterating through all the results, and having to wait for it to finish.
Ideally, I'd like to be able to chunk the query into smaller sets of results (maybe 500 - 1000), and then iterate through all the chunks in parallel (again, using coroutines). But I don't see anything that allows me to do that. Cursors require serial processing, because you can't get the cursor for the next query without first iterating through all the results of the current one.
Is there any way to run a chunked/split query? Or any other way to improve the performance when I absolutely have to retrieve all the results for a query?
Ok, I found a way to make it happen, and I have to give full credit to this blog post for giving me an idea for how to do it...
The basic premise is that running a key query is much faster than running a full entity query. So, instead of running a normal query, we run the key query, which gives us the keys for all of the results:
val keyQuery = Query.newKeyQueryBuilder()
    .setKind("my_tem_kind")
    .setFilter(myFilter)
    // set limit, etc.
val queryResults = store.run(keyQuery.build())
val keys = queryResults.asSequence().toSet()
and now that we have a set of Keys, we can split it into chunks, and run them in parallel:
val jobs = mutableListOf<Job>()
keys.chunked(500).forEach { chunk ->
    jobs.add(CoroutineScope(Dispatchers.IO).launch {
        val results = store.get(chunk)
        while (results.hasNext()) {
            val entity = results.next()
            // process the entity
        }
    })
}
runBlocking { jobs.joinAll() }
We have the following situation:
Existing topic with 9 partitions in Kafka contains multiple record types. These are partitioned according to a custom header (key = null) which is basically a string UUID.
Data is consumed via Kafka Streams, filtered by the record type that interests us, and repartitioned into a new topic containing only that specific record type. The new topic contains 12 partitions and has key=<original id in header>. The increased partition count is to allow more consumers to process this data.
This is where things seem to get a little weird.
In the original topic, we have millions of the relevant records. In each of the 9 partitions, we see roughly monotonically increasing record timestamps, which is to be expected, since partitions should be assigned relatively randomly due to the high cardinality of the partition key.
In the new topic, we're seeing something like the following:
Seemingly, the record timestamps are jumping all over the place. Some discrepancies are to be expected, given that the partitioning in the original (as well as the new) topic isn't exactly round-robin. We're seeing a few partitions in our original topic whose offsets are ~1-2M higher/lower than others, but given that we ingest many millions of records daily, I can't explain the one record with a timestamp of 5/28/2022 sitting between 6/17/2022 and 6/14/2022.
What could explain this behaviour?
Edit:
Looking at the consumer group offsets, I've found something interesting:
I was reingesting the data with multiple consumers and noted that they have severely different lags per partition. I don't quite understand why this discrepancy would be so large. Going to investigate further...
Edit:
To add some more detail, the workflow of the Streams app is as follows:
SpecificAvroSerde<MyEvent> specificAvroSerde = new SpecificAvroSerde<>();
specificAvroSerde.configure(Collections.singletonMap(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, SCHEMA_REGISTRY_URL), /*isKey*/ false);
streamsBuilder
        .stream("events", Consumed.with(Serdes.Void(), Serdes.ByteArray()))
        .transform(new FilterByTypeHeaderTransformerSupplier(topicProperties))
        .transform(new MyEventAvroTransformerSupplier())
        .to(topicProperties.getOutputTopic(), Produced.with(Serdes.UUID(), specificAvroSerde));
where the FilterByTypeHeaderTransformerSupplier instantiates a transformer that does, in essence:
public KeyValue<Void, byte[]> transform(Void key, byte[] value) {
    // checks record headers
    if (matchesFilter()) {
        return KeyValue.pair(key, value);
    }
    // skip since it is not an event type that interests us
    return null;
}
while the other transformer does the following (which doesn't have great performance but does the job for now):
public KeyValue<UUID, MyAvroEvent> transform(Void key, byte[] value) {
    try {
        MyEvent event = objectMapper.readValue(value, MyEvent.class);
        MyAvroEvent avroRecord = serializeAsAvro(event);
        return KeyValue.pair(event.getEventId(), avroRecord);
    } catch (IOException e) {
        // readValue throws a checked IOException, so wrap it
        throw new UncheckedIOException(e);
    }
}
hence I use the default timestamp extractor (FailOnInvalidTimestamp).
Most notably, as can be seen, I'm adding a key to this record; however, this key is the same one that was previously used to partition the data across the existing 9 partitions.
I'll try removing this key first to see if the behaviour changes, but I'm kind of doubtful that that's the reason, especially since it's the same partition key value that was used previously.
I still haven't found the reason for the wildly differing consumer offsets, unfortunately. I very much hope that I don't have to have a single consumer reprocess this once to catch up, since that would take a very long time...
Edit 2:
I believe I've found the cause of this discrepancy. The original records were produced using Spring Cloud Stream - these records included headers such as "scst_partition=4". However, the hashing that was used by that producer back then was Java-based (e.g. "keyAsString".hashCode() % numPartitions), while the Kafka clients use:
Utils.toPositive(Utils.murmur2(keyAsBytes)) % numPartitions
As a result, we're seeing behaviour where records in e.g. source partition 0 could land in any one of the new partitions. Hence, small discrepancies in the source distribution could lead to rather large fluctuations in record ordering in the new partitions.
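To make the difference concrete, here's an illustration (not the app code; the UTF-8 key bytes and the Math.abs guard are simplifying assumptions, and Utils is Kafka's org.apache.kafka.common.utils.Utils):

import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.common.utils.Utils;

// Illustration only: the same key string generally maps to different partitions
// under the two hashing schemes, which is why records "move around" after repartitioning.
public class PartitionerComparison {
    public static void main(String[] args) {
        String key = UUID.randomUUID().toString();
        int numPartitions = 9;

        // Scheme reportedly used by the original Spring Cloud Stream producer
        // (Math.abs added here only to keep the result non-negative):
        int hashCodeBased = Math.abs(key.hashCode()) % numPartitions;

        // Scheme used by the vanilla Kafka producer's default partitioner,
        // applied to the serialized key bytes (UTF-8 here as a simplification):
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int murmur2Based = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        System.out.printf("hashCode-based: %d, murmur2-based: %d%n", hashCodeBased, murmur2Based);
    }
}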
I'm not quite sure how to deal with this in a sensible manner. Currently I've tried using a simple round-robin partitioning in the target topic to see if the distribution is a bit more even in that case.
The reason why this is a problem is that this data will be put into object storage via e.g. Kafka Connect. If I want this data stored in e.g. a daily layout, then old data arriving all the time would cause buffers that should have been closed long ago to be kept open, increasing memory consumption. It doesn't make sense to use any kind of windowing for late data in this case, seeing as this isn't a real-time aggregation but simply consumption of historical data.
Ideally for the new partitioning I'd want something like: given the number of partitions in the target topic is a multiple of the number of partitions in the source topic, have records in partition 0 go to either partition 0 or 9, from 1 to either 1 or 10, etc. (perhaps even randomly)
This would require some more work in the form of a custom partitioner, but I can't foresee if this would cause other problems down the line.
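For what it's worth, a sketch of what such a custom partitioner could look like, assuming the original producer's hashing really is the hashCode-based scheme described above (the class name is made up, MyAvroEvent is the Avro type from the topology, and the Math.abs guard is an assumption):

import java.util.UUID;
import org.apache.kafka.streams.processor.StreamPartitioner;

// Hedged sketch: reuse the hashCode-based scheme, so that with a target partition
// count that is a multiple of the source count, a record from source partition p
// can only land in partition p, p + 9, p + 18, ...
public class LegacyHashPartitioner implements StreamPartitioner<UUID, MyAvroEvent> {

    @Override
    public Integer partition(String topic, UUID key, MyAvroEvent value, int numPartitions) {
        // Same hashing as the original producer: String hashCode modulo partition count.
        return Math.abs(key.toString().hashCode()) % numPartitions;
    }
}

It would be wired into the existing topology via the three-argument Produced overload, e.g. Produced.with(Serdes.UUID(), specificAvroSerde, new LegacyHashPartitioner()).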
I've also tried setting the partition Id header ("kafka_partitionId" - as far as I know, documentation here isn't quite easy to find) but it is seemingly not used.
I'll investigate a bit further...
Final edit:
For what it's worth, the problem boiled down to the following two issues:
My original data, written by Spring Cloud Stream, was partitioned differently than how a vanilla Kafka producer (which Kafka Streams uses internally) would partition it. This led to data jumping all over the place from a "record-time" point of view.
Due to the above, I had to choose a number of partitions that was a multiple of the previous number of partitions, as well as use a custom partitioner that partitions the "Spring Cloud Stream" way (i.e. the Java hashCode-based scheme).
The requirement that the new number be a multiple of the previous one is a result of modular arithmetic. If I wished to have deterministic partitioning for my existing data, having a multiple would allow data to go into one of two possible new partitions as opposed to only one as in the previous case.
E.g. with 9 -> 18 partitions:
id 1 -> previously hashed to partition 0, now hashes to either 0 or 9 (mod 18)
id 2 -> previously hashed to partition 1, now hashes to either 1 or 10 (mod 18)
Hence my requirement for higher parallelism is met, and the data inside a single partition is ordered as desired, since a target partition is only supplied from at most one source partition.
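A quick sanity check of that modular-arithmetic argument (plain Java, nothing app-specific): for any non-negative hash h, h % 18 is always either h % 9 or (h % 9) + 9.

// Brute-force check that each "old" partition maps to exactly two candidate "new" partitions.
public class ModuloCheck {
    public static void main(String[] args) {
        for (int h = 0; h < 1_000_000; h++) {
            int oldPartition = h % 9;
            int newPartition = h % 18;
            if (newPartition != oldPartition && newPartition != oldPartition + 9) {
                throw new AssertionError("counterexample at h=" + h);
            }
        }
        System.out.println("holds for all tested hash values");
    }
}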
I'm sure there might have been a simpler way to go about this all, but this works for now.
For further context/info, see also this Q&A.
I am running the Spark job on Hadoop YARN Cluster.
I am using the saveAsTextFile() method to store the RDD to a text file.
I can see more than 150 empty part files created out of 250 files.
Is there a way we can avoid this?
Each partition is written to its own file. Empty partitions will be written as empty files.
In order to avoid writing the empty files you can either coalesce or repartition your RDD into a smaller number of partitions.
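For example, a minimal sketch (the target partition count and output path are placeholders):

// Hedged sketch: shrink the number of partitions before writing so no empty part files appear.
// coalesce avoids a full shuffle; use repartition(n) instead if you also want the data rebalanced.
JavaRDD<String> output = rdd.coalesce(100);          // placeholder target count
output.saveAsTextFile("hdfs:///path/to/output");     // placeholder path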
If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can happen either due to a filtering step which removed all the elements from some partitions, or due to a bad hash function. If the hashCode() for your RDD's elements doesn't distribute the elements well, it's possible to end up with an unbalanced RDD that has empty partitions.
How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried putting cache() on the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS. So, it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
TL;DR:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
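A minimal sketch of that idea (inputRdd and uploadToHdfs are hypothetical stand-ins for the RDD and upload logic from the question):

// Hedged sketch: map alone is lazy; the count() action forces the lineage to run.
JavaRDD<String> processed = inputRdd.map(record -> {
    uploadToHdfs(record);   // hypothetical side-effecting upload
    return record;
});
processed.count();          // action: triggers the map (and the upload) above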
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is both an action and has clean semantics. What is also important: unlike map, it doesn't imply referential transparency.
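A sketch of the foreach variant (again, inputRdd and uploadToHdfs are hypothetical stand-ins):

// Hedged sketch: foreach is an action, so it runs eagerly and is the idiomatic
// choice for side effects such as writing to HDFS.
inputRdd.foreach(record -> uploadToHdfs(record));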
spring-batch newbie: I have a series of batches that
read all new records (since the last execution) from some sql tables
upload all the new records to hadoop
run a series of map-reduce (pig) jobs on all the data (old and new)
download all the output to local and run some other local processing on all the output
The point is, I don't have any obvious "item" - I don't want to relate to specific lines of text in my data; I work with all of it as one big chunk and don't want any commit intervals and such...
However, I do want to keep all these steps loosely coupled - as in, steps a+b+c might succeed for several days and accumulate processed stuff while step d keeps failing, and then when it finally succeeds it will read and process all of the output of its previous steps.
So: is my "item" a fictive "working item" that signifies the entire batch of new data? Do I maintain a series of queues myself and pass these fictive working items between them?
thanks!
People always assume that the only use of Spring Batch is chunk processing. That is a huge feature, but what's overlooked is the visibility of the processing and the job control.
Give 5 people the same task without Spring Batch and they're going to implement flow control and visibility their own way. Give 5 people the same task with Spring Batch and you may end up with custom tasklets all done differently, but access to the job metadata and starting and stopping jobs is going to be consistent. From my perspective it's a great tool for job management. If you already have your jobs written, you can implement them as custom tasklets if you don't want to rewrite them to conform to the 'item' paradigm. You'll still see benefits.
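For illustration, a minimal sketch of wrapping existing code as a custom Tasklet (ExistingHadoopUpload and its run() method are hypothetical placeholders for whatever code you already have):

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Hedged sketch: reuse existing code inside a Tasklet so the job still gets
// Spring Batch's metadata, restartability, and start/stop control.
public class UploadToHadoopTasklet implements Tasklet {

    private final ExistingHadoopUpload upload = new ExistingHadoopUpload(); // hypothetical

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        upload.run();                    // the code you already have
        return RepeatStatus.FINISHED;    // one-shot step, no chunk/commit interval
    }
}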
I don't see the problem. Your scenario seems like a classic application of Spring Batch to me.
read all new records (since the last execution) from some sql tables
Here, an item is a record
upload all the new records to hadoop
Same here
run a series of map-reduce (pig) jobs on all the data (old and new)
Sounds like a StepListener or ChunkListener
download all the output to local and run some other local processing on all the output
That's the next step.
The only problem I see is if you don't have Domain Objects for your records. But even then, you can work with maps or arrays, while still using ItemReaders and ItemWriters.
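For instance, a small sketch of a reader that emits Map items straight from JDBC (the SQL, table name, and the lastRunTimestamp parameter are placeholders):

import java.sql.Timestamp;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.ColumnMapRowMapper;

// Hedged sketch: no domain class required; each item is a Map<String, Object>.
public class NewRecordsReaderFactory {

    public JdbcCursorItemReader<Map<String, Object>> newRecordsReader(DataSource dataSource,
                                                                      Timestamp lastRunTimestamp) {
        JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
        reader.setName("newRecordsReader");
        reader.setDataSource(dataSource);
        reader.setSql("SELECT * FROM source_table WHERE created_at > ?");       // placeholder SQL
        reader.setPreparedStatementSetter(ps -> ps.setTimestamp(1, lastRunTimestamp));
        reader.setRowMapper(new ColumnMapRowMapper());                          // generic Map row mapper
        return reader;
    }
}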