So, I have a 16-node cluster where every node has both Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am also using Spark-Cassandra Connector 3.0.0.
I have a Spark Dataset with 4 partition keys, and I want to do a DirectJoin with a Cassandra table.
Should I use repartitionByCassandraReplica? Is there a recommended number of partition keys above which it makes sense to use repartitionByCassandraReplica before a DirectJoin?
Is there also a recommended value for the partitionsPerHost parameter? How could I get just 4 Spark partitions in total if I have 4 partition keys, so that rows with the same partition key end up in the same Spark partition?
If I do not use repartitionByCassandraReplica, I can see from the Spark UI that a DirectJoin is performed. However, if I use repartitionByCassandraReplica on the same partition keys, I do not see any DirectJoin in the DAG, just a CassandraPartitionedRDD and later on a HashAggregate. It also takes roughly 5 times longer than without repartitionByCassandraReplica. Any idea why, and what is happening?
Does converting an RDD to a Spark Dataset after repartitionByCassandraReplica change the number or location of its partitions?
How can I verify that repartitionByCassandraReplica is working properly? I am using nodetool getendpoints to see where the data is stored, but is there anything else?
Please let me know if you need any more info. I have just tried to summarize my questions from "Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?".
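For reference, a minimal sketch (in Scala for brevity, assuming an existing SparkSession named spark) of the RDD-level flow I am asking about; the keyspace/table names my_ks.my_table and the single partition-key column id are placeholders:

import com.datastax.spark.connector._

// Placeholder case class matching the table's partition key.
case class Key(id: Int)

// The 4 partition keys to join on.
val keys = spark.sparkContext.parallelize(Seq(Key(1), Key(2), Key(3), Key(4)))

// Group the keys so that each Spark partition only holds keys whose replicas
// live on the node the partition will be scheduled on.
val localKeys = keys.repartitionByCassandraReplica("my_ks", "my_table", partitionsPerHost = 1)

// RDD analogue of the DirectJoin: only the matching Cassandra partitions are read.
val joined = localKeys.joinWithCassandraTable("my_ks", "my_table")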
Related
Spark is executing too many partitions within a single task, instead of distributing them.
We are ingesting fairly large volumes of data from HBase into a Spark dataset.
Due to an incompatibility, we are unable to use the HBase-Spark connector and have resorted to using the basic Java API client for HBase.
To help parallelize the ingest from HBase, we placed the "startRows" into a dataset and repartitioned the dataset into 16 partitions, each containing 4 start rows.
We then used mapPartitions() to query the 4 start rows and return an iterator over the actual row data.
This does result in all rows being fetched; however, even though we are sure the data is uniformly distributed between those start rows, Spark insists on assigning most of the partitions to 3 or 4 executors instead of 16.
I'm fairly sure this is because Spark is unaware of the actual data we are loading and is optimizing solely on the startRows in the dataset.
Is there any way to force Spark to execute these as one task, on one executor, per partition?
List<String> keys = new ArrayList<>();
for (int salt = 0; salt < maxSalt; salt++) {   // maxSalt = 64
    keys.add(extractStartRow(mainKey, String.valueOf(salt)));
}

// One dataset row per salted start row.
Dataset<String> saltSeed = sparkSession.createDataset(keys, Encoders.STRING());

int partitions = 16;
Dataset<String> saltRange = saltSeed.repartition(partitions);

// Ingestor does the actual read from HBase for the given salted start rows.
Dataset<Results> results = saltRange.mapPartitions(new Ingestor(mainKey), Encoders.bean(Results.class));
We would like to find a way to get more tasks/executors working on the problem of reading from HBase. Whatever we try, Spark reduces the workload down to only a few executors; the rest get no partitions and no data to ingest, while the active executors take hours.
I have some data that I have to ingest into Solr every day; each day's data is around 10-12 GB, and I also have to run a catch-up job for the last year, with roughly 10-12 GB per day as well.
I am using Java, and I need scoring in my data by doing a partial update whenever the same unique key arrives again; I used docValues with a TextField.
https://github.com/grossws/solr-dvtf
Initially, I used a sequential approach, which took a lot of time (reading from S3 and adding to Solr in batches of 60k).
I found this repo:
https://github.com/lucidworks/spark-solr,
but I couldn't understand the implementation, and since I needed to modify field data for some scoring logic, I wrote custom Spark code.
Then I created 4 Solr nodes (on the same IP) and used Spark to insert the data. Initially, the partitions created by Spark greatly outnumbered the Solr nodes, and the number of executors specified was also larger than the number of nodes, so it took much more time.
Then I repartitioned the RDD into 4 (the number of Solr nodes) and specified 4 executors; the insertion took less time and succeeded. But when I ran the same job for a month of data, one or more Solr nodes kept going down. I have enough free disk space, and my RAM usage rarely fills up.
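For reference, the custom Spark job roughly has this shape (sketched in Scala for brevity; rawRdd and the SolrJ client calls are placeholders for the real code):

// Placeholder for the real RDD of documents read from S3.
val rawRdd = sc.parallelize(Seq.empty[Map[String, String]])

// Repartition to one Spark partition per Solr node, then push each partition in batches of 60k.
val docs = rawRdd.repartition(4)
docs.foreachPartition { partition =>
  // open a SolrJ client for this partition's target node/collection (elided)
  partition.grouped(60000).foreach { batch =>
    // add the batch of documents to Solr and commit (elided)
    ()
  }
  // close the client (elided)
}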
Please suggest a way to solve this problem (I have an 8-core CPU),
or should I use a different machine for each Solr node?
Thanks!
I am not sure Spark would be the best way to load that much data into Solr.
Your possible options for loading data into Solr are:
Through the hbase-indexer, also called the batch indexer, which syncs data between your HBase table and your Solr index.
You can also implement an hbase-lily-indexer, which is almost real time.
You can also use Solr's JDBC utility - THE BEST in my opinion. What you can do is read the data from S3 and load it into a Hive table through Spark (see the sketch below). Then you can implement a Solr JDBC connection to your Hive table, and trust me, it is very fast.
Let me know if you want more information on any of these.
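Here is a rough sketch of the Spark part (S3 to Hive table), assuming Spark 2.x or later; the bucket path, input format, and table name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Read the raw data from S3 (adjust the format/options to the real files).
val df = spark.read.json("s3a://my-bucket/solr-input/")

// Write it out as a Hive table; the Solr-side loading described above then works against this table.
df.write.mode("overwrite").saveAsTable("staging.solr_input")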
I have an RDD with 5 partitions and 5 workers/executors.
How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?
Am I right in saying that Spark can save several partitions on one worker and 0 partitions on other workers?
Meaning, I can specify the number of partitions, but Spark can still cache everything on a single node.
Replication is not an option, since the RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try for spark.locality.wait, but afterward it will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that from Java; you need to write Scala code, or at least Scala internals wrapped with Java code. I.e.:
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length)) // round-robin partitions over the nodes
  override protected def getPartitions: Array[Partition] = firstParent[U].partitions
  override def compute(split: Partition, context: TaskContext): Iterator[U] = firstParent[U].iterator(split, context)
}
SparkContext's makeRDD
SparkContext has a makeRDD method. This method lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations will be discarded on the first shuffle/join/cogroup operation.
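A minimal sketch of what that looks like (the element values and hostnames are placeholders; sc is the SparkContext):

// Each element is paired with its preferred hosts.
val rdd = sc.makeRDD(Seq(
  ("sales-001.parquet", Seq("192.168.2.140")),
  ("sales-002.parquet", Seq("192.168.2.157")),
  ("sales-003.parquet", Seq("192.168.2.77"))
))
// rdd.preferredLocations(rdd.partitions(0)) now reports the requested host,
// but only until the first shuffle/join/cogroup rebuilds the partitions.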
Both approaches share the drawback that setting spark.locality.wait too high can cause your cluster to starve if some of the nodes become unavailable.
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. Each sales-XXX.parquet can vary from a few KB to a few GB. All sales-XXX.parquet files together can take up to tens or hundreds of GB on HDFS.
I need full-text search across all sales. I have to index each sales-XXX.parquet one by one with Lucene. Now I have two options:
Keep the Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
Keep the Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires each worker node to keep an equal amount of data. How can I ensure that Spark keeps an equal amount of data on each worker node?
I am working with Spark on top of an HDFS cluster.
Before a join operation in Java Spark between two (key, value) PairRDDs, I partition the data of both files with a HashPartitioner so that elements with the same key end up on the same machine. That works fine for each file independently, but as mentioned in a previous post (When create two different Spark Pair RDD with same key set, will Spark distribute partition with same key to the same machine?), partitioned RDDs are not necessarily co-located, so the same keys from both RDDs may not be on the same machine.
Is there a way to force Spark to do that, i.e. to be sure that the same keys from different PairRDDs are on the same machine?
By the way, since we use hash partitioning, for both RDDs the same keys end up in the same partition number, but the order of the machines on which the partitions are saved differs from one RDD to the other.
For example, if we have 'N' machines, is there a way to assign a number to each machine (I guess Spark does this internally),
and then simply force partition number 'P' to be written to machine 'P modulo N'?
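For reference, a minimal sketch of the current setup (in Scala for brevity; the data and the number of partitions are placeholders):

import org.apache.spark.HashPartitioner

// Placeholder pair RDDs; in the real job these come from the two files.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val right = sc.parallelize(Seq((1, "x"), (2, "y"), (3, "z")))

// Same partitioner (same number of partitions) on both sides.
val partitioner = new HashPartitioner(16)
val leftPart  = left.partitionBy(partitioner).persist()
val rightPart = right.partitionBy(partitioner).persist()

// Because both sides share the partitioner, the join adds no extra shuffle,
// but Spark still does not guarantee that partition i of both RDDs sits on the same machine.
val joined = leftPart.join(rightPart)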
I would like to use server-side data selection and filtering with the Cassandra Spark connector. We have many sensors that send values every second, and we are interested in aggregating these data by month, day, hour, etc.
I have proposed the following data model:
CREATE TABLE project1(
year int,
month int,
load_balancer int,
day int,
hour int,
estimation_time timestamp,
sensor_id int,
value double,
...
PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id)
);
Then we wanted to aggregate the data for December 2014 with load_balancer IN (0, 1, 2, 3), i.e. 4 different partitions.
We are using the Cassandra Spark connector version 1.1.1, and we used a combine-by-key query to get the mean of all values, aggregated by hour.
For 4,341,390 tuples, Spark takes 11 minutes to return the result.
Now the issue is that although we are using 5 nodes, Spark uses only one worker to execute the task.
Could you please suggest an update to the query or data model in order to enhance the performance?
The Spark Cassandra Connector has this feature; it is SPARKC-25. You can just create an arbitrary RDD with the key values and then use it as a source of keys to fetch data from the Cassandra table - in other words, join an arbitrary RDD to the Cassandra RDD. In your case, that arbitrary RDD would include 4 tuples with the different load balancer values. Look at the documentation for more info. SCC 1.2 has been released recently, and it is probably compatible with Spark 1.1 (though it is designed for Spark 1.2).
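A minimal sketch of that join (the keyspace name my_ks is a placeholder; the tuple fields line up with the table's partition key year, month, load_balancer):

import com.datastax.spark.connector._

// One tuple per Cassandra partition to read: (year, month, load_balancer).
val keys = sc.parallelize(Seq((2014, 12, 0), (2014, 12, 1), (2014, 12, 2), (2014, 12, 3)))

// Joins the key RDD against the table's partition key, so only those 4 partitions are read,
// and the reads are spread over the executors that hold the key RDD's partitions.
val rows = keys.joinWithCassandraTable("my_ks", "project1")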