How to convert DataFrame to JavaRDD with distributed copy? - java

I'm new to Spark optimization.
I'm reading Hive data into a DataFrame, then converting the DataFrame to a JavaRDD and running a map function on top of it.
The problem I'm facing is that the transformations running on top of this JavaRDD execute with a single task. To parallelize them, I've repartitioned the JavaRDD. Is there a better way to do this, since repartition takes extra time to shuffle the data?
DataFrame tempDf = df.sqlContext().sql("SELECT * FROM my_table");
// Without repartitioning, the next transformation runs with only 1 task.
JavaRDD<IMSSummaryPOJO> inputData = tempDf.toJavaRDD().flatMap(new FlatMapFunction<Row, IMSSummaryPOJO>() {
    // map operation
}).repartition(repartition);
// Even though I have extra executors, if the previous transformation (inputData) is not repartitioned, this transformation runs with a single task.
JavaPairRDD<Text,IMSMetric> inputRecordRdd = inputData.flatMapToPair(new IMSInputRecordFormat(dimensionName,hllCounterPValue,hllCounterKValue,dimensionConfigMapBroadCast));
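One variation worth testing (a sketch under assumptions, not a verified answer) is to repartition the DataFrame itself before the conversion, so the flatMap already runs with the desired parallelism and the flat-mapped records don't need a second shuffle afterwards. Here numPartitions and mapRow are placeholders; on Spark 2.x the FlatMapFunction would return an Iterator instead of an Iterable.
DataFrame tempDf = df.sqlContext().sql("SELECT * FROM my_table");
// Shuffle the raw rows once, before the conversion, so the flatMap below
// runs with numPartitions tasks instead of a single one.
JavaRDD<IMSSummaryPOJO> inputData = tempDf
        .repartition(numPartitions)
        .toJavaRDD()
        .flatMap(new FlatMapFunction<Row, IMSSummaryPOJO>() {
            @Override
            public Iterable<IMSSummaryPOJO> call(Row row) {
                return mapRow(row); // mapRow: placeholder for the question's map operation
            }
        });
// Downstream transformations such as flatMapToPair then inherit the same
// parallelism without a separate repartition of inputData.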

Related

Can the Google Cloud Datastore query results iterator be made to run faster (preferably, parallelized with chunked queries)?

I'm using the Java API to run queries. And I understand that the QueryResults object that's returned by datastore.run() uses a lazy-loading iterator, so the time to iterate through all the results is quite long when retrieving a large set of results.
I'm already using a Cursor for most operations where paging is a possibility, and that works around the issue in those cases. I'm also using datastore.get() instead of queries whenever I know the entity keys in advance (and with that method, I can manually separate the query into smaller chunks, and run those in parallel using Kotlin coroutines).
However, there are several cases where I have to use a query, and I also need to get all the results at once because there's some back-end processing involved with those results. And in those cases, iterating through all the results becomes pretty time-intensive. The dataset is relatively small now (around 5,000 entities), but it'll grow progressively higher, so I'd like to set up a better solution than just brute-force iterating through all the results, and having to wait for it to finish.
Ideally, I'd like to be able to chunk the query into smaller sets of results (maybe 500 - 1000), and then iterate through all the chunks in parallel (again, using coroutines). But I don't see anything that allows me to do that. Cursors require serial processing, because you can't get a cursor for the next query without first iterating through all the results of the current one.
Is there any way to run a chunked/split query? Or any other way to improve the performance when I absolutely have to retrieve all the results for a query?
Ok, I found a way to make it happen, and I have to give full credit to this blog post for giving me an idea for how to do it...
The basic premise is that running a key query is much faster than running a full entity query. So, instead of running a normal query, we run the key query, which gives us the keys for all of the results:
val keyQuery = Query.newKeyQueryBuilder()
    .setKind("my_tem_kind")
    .setFilter(myFilter)
    // set limit, etc.
    .build()
val queryResults = store.run(keyQuery)
val keys = queryResults.asSequence().toSet()
Now that we have a set of keys, we can split it into chunks and run them in parallel:
val jobs = mutableListOf<Job>()
keys.chunked(500).forEach { chunk ->
    jobs.add(CoroutineScope(Dispatchers.IO).launch {
        val results = store.get(chunk) // batched get for this chunk of keys
        while (results.hasNext()) {
            val entity = results.next()
            // process the result
        }
    })
}
runBlocking { jobs.joinAll() }

Force Spark to use more executors, one per partition

Spark is executing too many partitions within a single task, instead of distributing them across executors.
We are ingesting fairly large volumes of data from HBase into a Spark dataset.
Due to an incompatibility we are unable to use HBase-Spark, and have resorted to using the basic Java API client for HBase.
To help parallelize the ingest from HBase we placed the "startRows" into a dataset, re-partitioned the dataset to give 16 partitions, each containing 4 start rows.
We then used mapPartitions() to query the 4 start rows and return an iterator of the actual row data.
It does result in all rows being fetched, however even though we are sure the data is uniformly distributed between those start rows Spark insists on moving most of the partitions to 3 or 4 executors, instead of 16.
I'm fairly sure this is because Spark is unaware of the actual data we are loading and is optimizing solely on the startRows in the dataset.
Is there any way to force Spark to execute these as one task, one executor, per partition?
List<String> keys = new ArrayList<>();
for (int salt = 0; salt < maxSalt; salt++) { // maxSalt = 64
    keys.add(extractStartRow(mainKey, String.valueOf(salt)));
}
Dataset<String> saltSeed = sparkSession.createDataset(keys, Encoders.STRING());
int partitions = 16;
saltSeed = saltSeed.repartition(partitions);
Dataset<Results> results = saltSeed.mapPartitions(new Ingestor(mainKey), Encoders.bean(Results.class));
// Ingestor does the actual read from HBase for the given salted start rows.
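The Ingestor itself is not shown in the question. Purely for illustration, a mapPartitions function of that shape might look roughly like the sketch below; the table name, the toResults mapping, and the use of the Results bean are assumptions, and error handling is omitted.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Ingestor implements MapPartitionsFunction<String, Results> {
    private final String mainKey;

    public Ingestor(String mainKey) {
        this.mainKey = mainKey;
    }

    @Override
    public Iterator<Results> call(Iterator<String> startRows) throws Exception {
        List<Results> out = new ArrayList<>();
        // One HBase connection per partition; each partition scans only its own salted start rows.
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            while (startRows.hasNext()) {
                String startRow = startRows.next();
                // On older HBase clients this would be setStartRow; a matching stop row would normally be set too.
                Scan scan = new Scan().withStartRow(Bytes.toBytes(startRow));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        out.add(toResults(row));
                    }
                }
            }
        }
        return out.iterator();
    }

    private Results toResults(Result row) {
        // Mapping an HBase Result into the question's Results bean is application-specific.
        return new Results();
    }
}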
We would like to find a way to get more tasks/executors working on the problem of reading from HBase. Whatever we try, Spark reduces the workload down to only a few executors; the rest get no partitions and no data to ingest, and the active executors take hours.
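No accepted fix is shown here, but one commonly suggested thing to try (an assumption about the cause, not a guaranteed remedy) is to rule out locality-based scheduling and dynamic allocation as the reason tasks pile onto a few executors. The configuration keys below are standard Spark settings; the values are illustrative.
SparkConf conf = new SparkConf()
        .set("spark.locality.wait", "0s")                 // don't hold tasks back waiting for a "preferred" executor
        .set("spark.executor.instances", "16")            // aim for one executor per intended partition
        .set("spark.dynamicAllocation.enabled", "false"); // keep the executor count stable during the ingest
SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();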

DataSet to Kafka using Flink? Is it possible?

I have a use case where I need to move records from Hive to Kafka. I couldn't find a way to directly add a Kafka sink to a Flink DataSet.
So I used a workaround: I call a map transformation on the Flink DataSet, and inside the map function I call kafkaProducer.send() for the given record.
The problem I'm facing is that I have no way to execute kafkaProducer.flush() on every worker node, so the number of records written to Kafka is always slightly lower than the number of records in the DataSet.
Is there an elegant way to handle this? Is there a way to add a Kafka sink to a DataSet in Flink, or a way to call kafkaProducer.flush() as a finalizer?
You could simply create a sink that uses a KafkaProducer under the hood and writes the data to Kafka.
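For the DataSet API, the usual way to plug in a custom sink is an OutputFormat. A minimal sketch of that idea follows; the class name, the use of String records, and the property values are illustrative, and error handling is omitted.
import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

// One KafkaProducer per parallel task, flushed and closed in close(), so no
// records are left sitting in the producer's buffer when the job finishes.
public class KafkaOutputFormat implements OutputFormat<String> {
    private final String bootstrapServers;
    private final String topic;
    private transient KafkaProducer<String, String> producer;

    public KafkaOutputFormat(String bootstrapServers, String topic) {
        this.bootstrapServers = bootstrapServers;
        this.topic = topic;
    }

    @Override
    public void configure(Configuration parameters) {
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    @Override
    public void writeRecord(String record) {
        producer.send(new ProducerRecord<>(topic, record));
    }

    @Override
    public void close() {
        if (producer != null) {
            producer.flush(); // the step that was missing in the map() workaround
            producer.close();
        }
    }
}
// Usage: dataSet.output(new KafkaOutputFormat("broker:9092", "my_topic"));
// followed by env.execute(), since a DataSet sink only runs when the job is executed.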

Spark repartition operation duplicated

I've got an RDD in Spark which I've cached. Before I cache it I repartition it. This works, and I can see in the storage tab in spark that it has the expected number of partitions.
This is what the stages look like on subsequent runs:
It's skipping a bunch of work which I've done to my cached RDD which is great. What I'm wondering though is why Stage 18 starts with a repartition. You can see that it's done at the end of Stage 17.
The steps I do in the code are:
List<Tuple2<String, Integer>> rawCounts = rdd
.flatMap(...)
.mapToPair(...)
.reduceByKey(...)
.collect();
To get the RDD, I grab it out of the session context. I also have to wrap it since I'm using Java:
JavaRDD<...> javaRdd = sc.emptyRDD();
return javaRdd.wrapRDD((RDD<...>)rdd);
Edit
I don't think this is specific to repartitioning. I've removed the repartitioning, and now I'm seeing some of the other operations I do prior to caching appearing after the skipped stages. E.g.
The green dot and everything before it should have already been worked out and cached.

How can I force Spark to execute code?

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried adding cache() to the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer answer:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
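For example (the lines RDD and the length computation are illustrative), caching plus a count materializes a lazy map once and keeps the result around for later actions:
JavaRDD<Integer> lengths = lines.map(s -> s.length()).cache(); // nothing is computed yet: map and cache are lazy
long total = lengths.count();  // count() is an action: the map runs here and the cache is populated
long again = lengths.count();  // normally served from the cached partitions, without recomputing the map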
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and has clean semantics. Also important: unlike map, it doesn't imply referential transparency.
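A rough sketch of the difference, where saveRecordToHdfs stands in for whatever side effect the map was performing:
// map is lazy and only describes a transformation; on its own, nothing ever runs.
records.map(r -> { saveRecordToHdfs(r); return r; });
// foreach is an action: it executes immediately and exists precisely for side effects.
records.foreach(r -> saveRecordToHdfs(r));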
