Spark is executing too many partitions on a single executor instead of distributing them.
We are ingesting fairly large volumes of data from HBase into a Spark dataset.
Due to incompatibility we are unable to use HBase-Spark, so we have resorted to using the basic HBase Java client.
To help parallelize the ingest from HBase, we placed the "startRows" into a dataset and repartitioned it into 16 partitions, each containing 4 start rows.
We then used mapPartitions() to query the 4 start rows in each partition and return an iterator over the actual row data.
It does result in all rows being fetched; however, even though we are sure the data is uniformly distributed across those start rows, Spark insists on assigning most of the partitions to 3 or 4 executors instead of spreading them over 16.
I'm fairly sure this is because Spark is unaware of the actual data we are loading and is optimizing solely on the start rows in the dataset.
Is there any way to force Spark to execute these as one task, on one executor, per partition?
List<String> keys = new ArrayList<>();
for (int salt = 0; salt < maxSalt; salt++) { // maxSalt = 64
    keys.add(extractStartRow(mainKey, String.valueOf(salt)));
}

// One start row per salt value, held in a small dataset used only to drive parallelism.
Dataset<String> saltSeed = sparkSession.createDataset(keys, Encoders.STRING());

int partitions = 16;
Dataset<String> saltRange = saltSeed.repartition(partitions); // 16 partitions, 4 start rows each

// Ingestor does the actual read from HBase for the given salted start rows.
Dataset<Results> results = saltRange.mapPartitions(new Ingestor(mainKey), Encoders.bean(Results.class));
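For reference, the Ingestor looks roughly like this (a simplified sketch, not our exact code; the table name and the bean mapping are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.function.MapPartitionsFunction;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Ingestor implements MapPartitionsFunction<String, Results> {

    // kept because the real Ingestor uses it when building row keys / beans
    private final String mainKey;

    public Ingestor(String mainKey) {
        this.mainKey = mainKey;
    }

    @Override
    public Iterator<Results> call(Iterator<String> saltedStartRows) throws Exception {
        List<Results> fetched = new ArrayList<>();
        // One connection per task; hbase-site.xml must be on the executor classpath.
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("my_table"))) { // illustrative name
            while (saltedStartRows.hasNext()) {
                String startRow = saltedStartRows.next();
                // Scan every row sharing this salted prefix.
                Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes(startRow));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        fetched.add(toResultsBean(result));
                    }
                }
            }
        }
        return fetched.iterator();
    }

    // Placeholder: copies the cells we need from the HBase Result into the Results bean.
    private Results toResultsBean(Result result) {
        Results bean = new Results();
        return bean;
    }
}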
We would like to find a way to get more tasks/executors working on the problem of reading from HBase. Whatever we try, Spark collapses the workload onto only a few executors; the rest get no partitions and no data to ingest, and the active executors take hours.
So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am also using the Spark-Cassandra Connector 3.0.0.
I have a Spark Dataset with 4 partition keys and I want to do a DirectJoin with a Cassandra table.
Should I use repartitionByCassandraReplica? Is there a recommended number of partition keys for which it would make sense to use repartitionByCassandraReplica before a DirectJoin?
Is there also a recommended number for the partitionsPerHost parameter? How could I get just 4 Spark partitions in total if I have 4 partition keys, so that rows with the same partition key end up in the same Spark partition?
If I do not use repartitionByCassandraReplica, I can see from the Spark UI that a DirectJoin is performed. However, if I use repartitionByCassandraReplica on the same partition keys, I do not see any DirectJoin in the DAG, just a CassandraPartitionedRDD and later on a HashAggregate. It also takes about 5 times longer than without repartitionByCassandraReplica. Any idea why, and what is happening?
Does converting an RDD to a Spark Dataset after repartitionByCassandraReplica change the number or location of the partitions?
How can I see if repartitionByCassandraReplica is working properly? I am using nodetool getendpoints to see where the data is stored, but is there anything else I can check?
Please let me know if you need any more info. I just tried to summarize my questions from Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?
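For context, the join is set up roughly like this (a simplified sketch; keyspace, table and column names are made up, and the directJoinSetting key reflects my understanding of the connector's configuration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;

public class DirectJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("direct-join-sketch")
                // Catalyst extensions from the connector are needed for DirectJoin planning.
                .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
                // Assumption: "on" forces DirectJoin; "auto" (the default) decides by size ratio.
                .config("directJoinSetting", "on")
                .getOrCreate();

        // The Cassandra table, read through the connector.
        Dataset<Row> cassandraTable = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_keyspace")   // illustrative
                .option("table", "my_table")         // illustrative
                .load();

        // The small driver-side Dataset with the 4 partition key values.
        Dataset<Row> keys = spark.createDataset(
                        Arrays.asList("k1", "k2", "k3", "k4"), Encoders.STRING())
                .toDF("partition_key");

        Dataset<Row> joined = keys.join(cassandraTable,
                keys.col("partition_key").equalTo(cassandraTable.col("partition_key")));

        joined.explain(); // "Cassandra Direct Join" should show up in the physical plan
    }
}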
I'm new to Spark optimization.
I'm trying to read Hive data into a DataFrame. Then I'm converting the DataFrame to a JavaRDD and running a map function on top of it.
The problem I'm facing is that the transformations running on top of this JavaRDD run with a single task. To parallelize them, I've repartitioned the JavaRDD. Is there a better way to do this, since repartition takes more time to shuffle the data?
DataFrame tempDf = df.sqlContext().sql("SELECT * FROM my_table");
// without repartition, the next transformation will run with 1 task only.
JavaRDD<IMSSummaryPOJO> inputData = tempDf.toJavaRDD().flatMap(new FlatMapFunction<Row, IMSSummaryPOJO>() {
//map operation
}).repartition(repartition);
// Even though I have extra executors, if the previous transformation (inputData) is not repartitioned, this transformation runs with a single task.
JavaPairRDD<Text,IMSMetric> inputRecordRdd = inputData.flatMapToPair(new IMSInputRecordFormat(dimensionName,hllCounterPValue,hllCounterKValue,dimensionConfigMapBroadCast));
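One variant I'm considering (just a sketch, not validated against our data): do the repartition once at the DataFrame level, right after the Hive read, so every downstream JavaRDD transformation inherits that parallelism and the repartition() on the JavaRDD can be dropped. The numPartitions value below is illustrative.

int numPartitions = 48; // illustrative: roughly the total number of executor cores

// One shuffle here; tempDf.toJavaRDD() keeps these partitions, so the flatMap and the
// later flatMapToPair both run with numPartitions tasks instead of one.
DataFrame tempDf = df.sqlContext()
        .sql("SELECT * FROM my_table")
        .repartition(numPartitions);

Would that be the better place to repartition?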
I have an RDD with 5 partitions and 5 workers/executors.
How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?
Am I right in saying that Spark can save several partitions on one worker and 0 partitions on other workers?
That means I can specify the number of partitions, but Spark can still cache everything on a single node.
Replication is not an option since the RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try during spark.locality.wait, but afterwards it will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that with Java, you need to write Scala code, or at least Scala internals wrapped with Java code. I.e.:
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")
  // prefer one of the known worker IPs for each partition, round-robin
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
  // pass-through of the parent's partitions and data
  override protected def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split, context)
}
SparkContext's makeRDD
SparkContext has a makeRDD method. This method lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations will be discarded on the first shuffle/join/cogroup operation.
Both approaches share the drawback that setting spark.locality.wait too high can cause your cluster to starve if some of the nodes become unavailable.
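For completeness, the configuration half of the workaround is just a SparkConf setting (the value below is an arbitrary stand-in for "very high"):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("node-affinity")
        // effectively "wait forever" for the preferred node; see the starvation caveat above
        .set("spark.locality.wait", "1000000s");
JavaSparkContext sc = new JavaSparkContext(conf);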
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. Each sales-XXX.parquet could vary from a few KB to a few GB. All sales-XXX.parquet files together could take up tens or hundreds of GB on HDFS.
I need a full-text search through all sales. I have to index each sales-XXX.parquet one-by-one with Lucene. And now I have two options:
Keep Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
Keep Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires each worker node to keep an equal amount of data. How could I ensure Spark keeps an equal amount of data on each worker node?
Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push them to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm building a custom application for this. So I was wondering: are there any tools that can be put together to do all this? Either some form of ETL, or stream processing, or something?
While streaming would make records more readily available than bulk processing, and would reduce the overhead in the Java container for large-object management, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum for the bulk size. For this I follow these steps:
1) Build a streaming bulk insert (so stream, but still send more than 1 record, or build more than 1 JSON document in your case, at a time).
2) Experiment with several bulk sizes, for example 10, 100, 1000, 10000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it can be that size 10 is extremely fast per record, but that there is an incremental insert overhead (for example, as with primary key maintenance in SQL Server). If you run the same total number of records for every test, it should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between your best values from run 2.
Then use the final result as your optimal stream bulk insertion size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk sizes one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out pretty well.
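To make steps 1 and 2 concrete, here is a minimal sketch of a bulk indexer with a tunable batch size, assuming the Elasticsearch high-level REST client; the host, index name and document format are placeholders, not part of your setup:

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

public class BulkIndexer {

    private final RestHighLevelClient client;
    private final int bulkSize; // the value you are experimenting with: 10, 100, 1000, ...

    public BulkIndexer(String host, int bulkSize) {
        this.client = new RestHighLevelClient(
                RestClient.builder(new HttpHost(host, 9200, "http")));
        this.bulkSize = bulkSize;
    }

    // Streams documents (id -> JSON) and flushes a bulk request every bulkSize documents.
    public void index(Iterator<Map.Entry<String, String>> documents) throws IOException {
        BulkRequest bulk = new BulkRequest();
        while (documents.hasNext()) {
            Map.Entry<String, String> doc = documents.next();
            bulk.add(new IndexRequest("denormalized_docs") // index name is illustrative
                    .id(doc.getKey())
                    .source(doc.getValue(), XContentType.JSON));
            if (bulk.numberOfActions() >= bulkSize) {
                client.bulk(bulk, RequestOptions.DEFAULT);
                bulk = new BulkRequest();
            }
        }
        if (bulk.numberOfActions() > 0) { // flush the tail
            client.bulk(bulk, RequestOptions.DEFAULT);
        }
    }

    public void close() throws IOException {
        client.close();
    }
}

Run the same indexer with different bulkSize values (and later several instances in parallel) and record documents per second for each run.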
I need to compute aggregate over HBase table.
Say I have this HBase table: 'metadata', column family: M, column: n.
Here the metadata object has a list of strings:
class metadata {
    List<String> tags;
}
I need to compute the count of tags, for which I was thinking of either using MapReduce or scanning over HBase directly.
The result has to be returned on the fly. So which one should I use in this scenario: scan over HBase and compute the aggregate, or MapReduce?
MapReduce is ultimately going to scan HBase and compute the count anyway.
What are the pros and cons of using either of these?
I suspect you're not aware of the pros and cons of HBase: it's not suited for computing real-time aggregations of large datasets.
Let's start by saying that MapReduce is a scheduled job in itself; you won't be able to return the response on the fly. Expect no less than 15 seconds for the TaskTracker to initialize the job.
In the end, the MapReduce job will do exactly the same thing: an HBase scan. The difference between performing the scan right away and via MapReduce is just the parallelization and data locality, which excel when you have millions/billions of rows. If your queries only need to read a few thousand consecutive rows to aggregate them, sure, you could just do a scan and it will probably have an acceptable response time, but for larger datasets it's just going to be impossible to do that at query time.
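For a small, bounded range, that direct scan is as simple as something like this (a rough sketch; the row-key bounds and the tag-list deserialization are placeholders for whatever you actually store):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class TagCountScan {

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metadata"))) {

            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("start-key")) // placeholder bounds
                    .withStopRow(Bytes.toBytes("stop-key"))
                    .addColumn(Bytes.toBytes("M"), Bytes.toBytes("n"));

            long tagCount = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    byte[] value = result.getValue(Bytes.toBytes("M"), Bytes.toBytes("n"));
                    tagCount += deserializeTags(value).size();
                }
            }
            System.out.println("tag count = " + tagCount);
        }
    }

    // Placeholder: replace with however your metadata/tag list is actually serialized.
    private static List<String> deserializeTags(byte[] value) {
        return Collections.emptyList();
    }
}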
HBase is best suited for handling tons of atomic reads and writes; that way, you can maintain those aggregations in real time, no matter how many pre-aggregated counters you need or how many requests you're going to receive: with a proper row key design and split policy you can scale to satisfy the demand.
Think of it as a word count: you could store all the words in a list and count them at query time when requested, or you could process that list at insert time and store the number of times each word is used in the document, as a global counter, and in daily, monthly, yearly, per-country, per-author tables (or even column families).
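As a rough sketch of that insert-time approach (table, family and qualifier names are made up), every write bumps the counters atomically with incrementColumnValue, and reading an aggregate becomes a single Get instead of a scan:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;

public class TagCounters {

    // Called at insert time, once per ingested document's tag list.
    public static void increment(Connection connection, List<String> tags) throws IOException {
        try (Table counters = connection.getTable(TableName.valueOf("tag_counters"))) {
            String day = LocalDate.now().toString();
            for (String tag : tags) {
                // Global counter: one row per tag.
                counters.incrementColumnValue(
                        Bytes.toBytes(tag), Bytes.toBytes("c"), Bytes.toBytes("total"), 1L);
                // Daily counter: prefix the row key with the day.
                counters.incrementColumnValue(
                        Bytes.toBytes(day + "#" + tag), Bytes.toBytes("c"), Bytes.toBytes("daily"), 1L);
            }
        }
    }

    // Called at query time: the aggregate is just a read, no scan needed.
    public static long readTotal(Connection connection, String tag) throws IOException {
        try (Table counters = connection.getTable(TableName.valueOf("tag_counters"))) {
            byte[] value = counters.get(new Get(Bytes.toBytes(tag)))
                    .getValue(Bytes.toBytes("c"), Bytes.toBytes("total"));
            return value == null ? 0L : Bytes.toLong(value);
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            increment(connection, Arrays.asList("spark", "hbase"));
            System.out.println("spark tag count = " + readTotal(connection, "spark"));
        }
    }
}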