I am running a Spark job on a Hadoop YARN cluster.
I am using the saveAsTextFile() method to store the RDD as text files.
Out of the 250 part files created, more than 150 are empty.
Is there a way to avoid this?
Each partition is written to its own file. Empty partitions will be written as empty files.
To avoid writing empty files, you can either coalesce or repartition your RDD into a smaller number of partitions.
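A minimal sketch of that approach with the Java API; the paths, app name, and the target of 100 partitions are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SaveWithoutEmptyFiles {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("save-without-empty-files"));

        JavaRDD<String> records = sc.textFile("hdfs:///input/records");

        // coalesce() merges partitions without a full shuffle; use repartition(100)
        // instead if the remaining partitions should also be rebalanced.
        records.coalesce(100).saveAsTextFile("hdfs:///output/records");

        sc.stop();
    }
}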
If you didn't expect to have empty partitions, it may be worth investigating why you have them. Empty partitions can happen either due to a filtering step which removed all the elements from some partitions, or due to a bad hash function. If the hashCode() for your RDD's elements doesn't distribute the elements well, it's possible to end up with an unbalanced RDD that has empty partitions.
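If it is unclear where the empty partitions come from, a quick diagnostic (continuing inside the same main method as the sketch above, and assuming the Spark 2.x Java API, plus java.util.List and java.util.Collections imports) is to count the elements per partition before saving:

// Count the elements in each partition of "records"; zeros are the
// partitions that would become empty part files.
List<Integer> partitionSizes = records.mapPartitions(it -> {
    int count = 0;
    while (it.hasNext()) { it.next(); count++; }
    return Collections.singletonList(count).iterator();
}).collect();
System.out.println(partitionSizes);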
I have a scenario where, for each request, I have to make a batch get of at least 1000 keys.
Currently I'm getting 2000 requests per minute and this is expected to rise.
Also, I've read that Aerospike's batch get internally makes individual requests to the server, concurrently or sequentially.
I am using Aerospike as a cluster (running on SSD). So is it efficient to write a UDF (user-defined function) in Lua to make the batch request and aggregate the results at the server level, instead of multiple hits from the client?
Kindly suggest whether Aerospike's default batch get will be efficient, or whether I have to do something else.
Batch read is the right way to do it. Results are returned in the order of the keys specified in the list. Records not found will return null. The client parallelizes the keys by node, waits (there is no callback in the client, unlike Secondary Index or Scan), collects the returns from all nodes, and presents them back in the original order. Make sure you have adequate memory in the client to hold all the returned batch results.
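A minimal sketch with the Aerospike Java client; the host, namespace, set, and user keys are placeholders, and maxConcurrentThreads is just one possible tuning choice:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.BatchPolicy;

public class BatchGetExample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Build the batch of keys (about 1000 per request in this scenario).
        Key[] keys = new Key[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = new Key("test", "demo", "user-" + i);
        }

        BatchPolicy policy = new BatchPolicy();
        policy.maxConcurrentThreads = 0;  // issue the per-node requests in parallel threads

        // Records come back in the same order as the keys; not-found entries are null.
        Record[] records = client.get(policy, keys);
        for (int i = 0; i < records.length; i++) {
            if (records[i] != null) {
                System.out.println(keys[i].userKey + " -> " + records[i].bins);
            }
        }

        client.close();
    }
}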
To UDF or Not to UDF?
First thing, you cannot do batch reads as a UDF, at least not in any way that's remotely efficient.
You have two kinds of UDF. The first is a record UDF, which is limited to operating on a single record. The record is locked as your UDF executes, so it can either read or modify the data, but it is sandboxed from accessing other records. The second is a stream UDF, which is read-only, and runs against either a query or a full scan of a namespace or set. Its purpose is to allow you to implement aggregations. Even if you're retrieving 1000 keys at a time, using stream UDFs to just pick a batch of keys from a much larger set or namespace is very inefficient. That aside, UDFs will always be slower than the native operations provided by Aerospike, and this is true for any database.
Batch Reads
Read the documentation for batch operations, and specifically the section on the batch-index protocol. There is a great pair of FAQs in the community forum you should read:
FAQ - Differences between getting single record versus batch
FAQ - batch-index tuning parameters
Capacity Planning
Finally, if you are getting 2000 requests per second at your application, and each of those turns into a batch read of 1000 keys, you need to make sure that your cluster is sized properly to handle 2000 * 1000 = 2M reads per second. Tuning the batch-index parameters will help, but if you don't have enough aggregate SSD capacity to support those 2 million reads per second, your problem is one of capacity planning.
I have a 5-partition RDD and 5 workers/executors.
How can I ask Spark to save each of the RDD's partitions on a different worker (IP)?
Am I right that Spark can save several partitions on one worker, and zero partitions on other workers?
That is, I can specify the number of partitions, but Spark can still cache everything on a single node.
Replication is not an option, since the RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try to honor it during spark.locality.wait, but afterwards it will cache the partition on a different node.
As a workaround, you can set a very high value for spark.locality.wait and override getPreferredLocations. The bad news: you cannot do that with Java; you need to write Scala code, or at least Scala internals wrapped with Java code. For example:
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {

  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  // Pin each partition to a worker, round-robin over the known node IPs.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
SparkContext's makeRDD
SparkContext has a makeRDD method. This method lacks documentation. As I understand it, I can specify preferred locations and then set a high value for spark.locality.wait. The bad news: the preferred locations will be discarded on the first shuffle/join/cogroup operation.
Both approaches have the drawback that a too-high spark.locality.wait can cause your cluster to starve if some of the nodes become unavailable.
P.S. More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. Each sales-XXX.parquet can vary from a few KB to a few GB. All sales-XXX.parquet files together can take up tens or hundreds of GB on HDFS.
I need full-text search through all sales, so I have to index each sales-XXX.parquet one by one with Lucene. Now I have two options:
Keep Lucene indexes in Spark. There is already a solution for this, but it looks pretty suspicious. Are there any better solutions?
Keep Lucene indexes on the local file system. Then I can map-reduce over the results of each worker's index lookup. But this approach requires each worker node to keep an equal amount of data. How can I ensure Spark keeps an equal amount of data on each worker node?
I've got an RDD in Spark which I've cached. Before I cache it, I repartition it. This works, and I can see in the Storage tab in Spark that it has the expected number of partitions.
This is what the stages look like on subsequent runs:
It's skipping a bunch of work that I've done on my cached RDD, which is great. What I'm wondering, though, is why Stage 18 starts with a repartition. You can see that it's done at the end of Stage 17.
The steps I do in the code are:
List<Tuple2<String, Integer>> rawCounts = rdd
    .flatMap(...)
    .mapToPair(...)
    .reduceByKey(...)
    .collect();
To get the RDD, I grab it out of the session context. I also have to wrap it since I'm using Java:
JavaRDD<...> javaRdd = sc.emptyRDD();
return javaRdd.wrapRDD((RDD<...>)rdd);
Edit
I don't think this is specific to repartitioning. I've removed the repartitioning, and now I'm seeing some of the other operations I do prior to caching appearing after the skipped stages. E.g.
The green dot and everything before it should have already been worked out and cached.
Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job searches large datasets for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing intermediate results to disk, but the application runs much slower now (which is expected due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over sorted data and check whether adjacent values (for the same key) are increasing for at least N consecutive observations, without resorting to the groupByKey method?
I have designed an algorithm to do it with reduceByKey, but there is one problem: reduce seems to ignore data ordering and yields completely wrong results at the end.
Any ideas appreciated.
There are a few ways you can approach this problem:
repartitionAndSortWithinPartitions with a custom partitioner and ordering (a sketch of this approach follows after the three options):
keyBy (name, timestamp) pairs
create custom partitioner which considers only the name
repartitionAndSortWithinPartitions using custom partitioner
use mapPartitions to iterate over data and yield matching sequences
sortBy(Key) - this is similar to the first solution but provides higher granularity at the cost of additional post-processing.
keyBy (name, timestamp) pairs
sortByKey
process individual partitions using mapPartitionsWithIndex keeping track of leading / trailing patterns for each partition
adjust final results to include patterns which span more than one partition
create fixed sized windows over sorted data using sliding from mllib.rdd.RDDFunctions.
sortBy (name, timestamp)
create sliding RDD and filter windows which cover multiple names
check if any window contains the desired pattern.
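A sketch of the first approach in the Java API (assuming Spark 2.x; the (name, timestamp) keys, the partition count, and the run length minRun are placeholders for your data):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

public class ConsecutiveIncreases {

    // Partition by the series name only, so every observation of one series
    // lands in the same partition.
    static class NamePartitioner extends Partitioner {
        private final int partitions;
        NamePartitioner(int partitions) { this.partitions = partitions; }
        @Override public int numPartitions() { return partitions; }
        @Override public int getPartition(Object key) {
            Tuple2<?, ?> k = (Tuple2<?, ?>) key;
            return Math.floorMod(k._1().hashCode(), partitions);
        }
    }

    // Sort by (name, timestamp) within each partition; must be Serializable.
    static class NameTimeComparator
            implements Comparator<Tuple2<String, Long>>, Serializable {
        @Override public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
            int byName = a._1().compareTo(b._1());
            return byName != 0 ? byName : Long.compare(a._2(), b._2());
        }
    }

    // keyed maps (name, timestamp) -> measured value; returns the names of
    // series that contain at least minRun consecutive increases.
    static JavaRDD<String> findIncreasingSeries(
            JavaPairRDD<Tuple2<String, Long>, Double> keyed,
            int numPartitions, int minRun) {

        JavaPairRDD<Tuple2<String, Long>, Double> sorted =
            keyed.repartitionAndSortWithinPartitions(
                new NamePartitioner(numPartitions), new NameTimeComparator());

        // One ordered pass per partition; no groupByKey and no full-series buffering.
        return sorted.mapPartitions(iter -> {
            List<String> matches = new ArrayList<>();
            String currentName = null;
            double lastValue = 0.0;
            int increases = 0;
            while (iter.hasNext()) {
                Tuple2<Tuple2<String, Long>, Double> t = iter.next();
                String name = t._1()._1();
                double value = t._2();
                if (name.equals(currentName) && value > lastValue) {
                    increases++;
                    if (increases == minRun) {  // report each qualifying run once
                        matches.add(name);
                    }
                } else {
                    increases = 0;
                }
                currentName = name;
                lastValue = value;
            }
            return matches.iterator();
        });
    }
}

Because the partitioner groups whole series into a single partition, no cross-partition stitching is needed, unlike the sortByKey variant in the second option.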
I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be inter-related, i.e. the processing of such records MUST follow the order in which they appear in the file. Using the regular sequential approach (i.e. a single thread for the entire file) gives me poor performance, therefore I want to use the partitioning feature. Due to my processing requirement, inter-related records MUST be in the same partition (as well as in the order they appear in the file). I thought about using a hash-based partitioning algorithm with a carefully chosen hash function (so that near-equally-sized partitions are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented for such a case? According to one of the Spring Batch authors/developers, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line, skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple of options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it into one file per group of records that must be processed in sequence. From there, you can use the MultiResourcePartitioner to process each file in parallel (a sketch follows below).
Load the file into a staging table - This is the easier method, IMHO. From there, you can partition the processing based on any number of factors.
In either case, the result allows you to scale the process out as wide as you need to achieve the required performance.
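A minimal sketch of the first option as Java configuration; the bean names, the file pattern, and the pre-existing workerStep are placeholders:

import java.io.IOException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedJobConfig {

    // One partition per pre-split file; each file holds one group of
    // inter-related records in their original order.
    @Bean
    public MultiResourcePartitioner filePartitioner() throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(
            new PathMatchingResourcePatternResolver()
                .getResources("file:/data/split/records-group-*.csv"));
        return partitioner;
    }

    // Master step that runs "workerStep" once per file, in parallel threads.
    @Bean
    public Step partitionedStep(StepBuilderFactory steps, Step workerStep) throws IOException {
        return steps.get("partitionedStep")
                .partitioner("workerStep", filePartitioner())
                .step(workerStep)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}

The worker step's reader would then be step-scoped and pick up its file from the step execution context; MultiResourcePartitioner exposes it under the fileName key by default.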
FlatFileItemReader is not thread-safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
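A minimal sketch of such a synchronizing delegator; the delegate is assumed to be the configured, non-thread-safe FlatFileItemReader (or any other reader):

import org.springframework.batch.item.ItemReader;

public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        // Only the read is serialized; processing and writing still run in parallel.
        return delegate.read();
    }
}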
I think your question is somewhat of a duplicate of this one: multithreaded item reader