Detecting repeating consecutive values in large datasets with Spark

Detecting repeating consecutive values in large datasets with Spark - java

Recently I have being trying out Spark and do far I have observed quite interesting results, but currently I am stuck with famous groupByKey OOM problem. Basically what the job does it tries to search in the large datasets the periods where measured value is increasing consecutively for at least N times. I managed to get rid of the problem by writing the results to the disk, but the application is running much slower now (which is expected due to the disk IO). Now the question: is there any other memory efficient strategy where I can run sorted data and check whether adjacent values(for the same key) are increasing in at least N consecutive observations, without recurring to groupByKey method?
I have designed an algorithm to do it with reduceByKey, but there is one problem, reduce seems to ignore data ordering and yells completely wrong results at the end.
Any ideas appreciated.

There are a few ways you can approach this problem:
repartitionAndSortWithinPartitions with custom partitioner and ordering:
keyBy (name, timestamp) pairs
create custom partitioner which considers only the name
repartitionAndSortWithinPartitions using custom partitioner
use mapPartitions to iterate over data and yield matching sequences
sortBy(Key) - this is similar to the first solution but provides higher granularity at the cost of additional post-processing.
keyBy (name, timestamp) pairs
sortByKey
process individual partitions using mapPartitionsWithIndex keeping track of leading / trailing patterns for each partition
adjust final results to include patterns which span over more than one partitions
create fixed sized windows over sorted data using sliding from mllib.rdd.RDDFunctions.
sortBy (name, timestamp)
create sliding RDD and filter windows which cover multiple names
check if any window contains desired pattern.

Related

Is it recommended to use aerospike to filter on some field

I have around 2 millions records, each record having 10-12 fields(mostly are string). Now I want to filter records on the basis of some field. Is it advisable to do this using secondary index or some other better option is available? Also, how much time would it take to get all the records/just keys (after applying the filter)?
Thanks in advance.

You can do a scan with predicate filter - which is quite versatile (you can even do regex) or secondary index query which only honors equality filter on strings.
Scans are more reliable and will be even better in the next upcoming release (Mar/Apr 2020) in terms of managing their progress. Scans do require reading all records from disk first and then applying the filter.
SI will be faster because you are filtering (in-memory secondary index) before you fetch the record from disk but less reliable if underlying cluster nodes are not stable - i.e. if you lose or add a node during SI query. The query runs in parallel on all cluster nodes and pipelines the results back to the client in no particular order. You can mitigate that by using "failOnClusterChange" option and restarting when cluster is stable. (Scans also have the same option available.)
Which is better? do A/B test on your specific problem.

Enforce partition be stored on the specific executor

I have 5-partitions-RDD and 5 workers/executors.
How can I ask Spark to save each RDD's partition on the different worker (IP)?
Am I right if I say Spark can save few partitions on one worker, and 0 partitions on other workers?
Means, I can specify the number of partitions, but Spark still can cache everything on a single node.
Replication is not an option since RDD is huge.
Workarounds I have found
getPreferredLocations
RDD's getPreferredLocations method does not provide a 100% warranty that partition will be stored on a specified node. Spark will try during spark.locality.wait, but afterward, Spark will cache partition on a different node.
As a workarround, you can set very high value to spark.locality.wait and override getPreferredLocations. The bad news - you can not do that with Java, you need to write Scala code. At least Scala internals wrapped with Java code. I.e:
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {
val nodeIPs = Array("192.168.2.140","192.168.2.157","192.168.2.77")
override def getPreferredLocations(split: Partition): Seq[String] =
Seq(nodeIPs(split.index % nodeIPs.length))
}
SparkContext's makeRDD
SparkContext has makeRDD method. This method lack documentation. As I understand, I can specify preferred locations, and then set a high value to spark.locality.wait. The bad news - preferred location will be discarded on the first shuffle/join/cogroup operation.
Both approaches have the drawback of too high spark.locality.wait can cause your cluster to starve if some of the nodes will be unavailable.
P.S. More context
I have up to 10,000 of sales-XXX.parquet files, each represents sales of different goods in the different regions. Each sales-XXX.parquet could vary from a few KBs to a few GBs. All sales-XXX.parquets together could take up to tens or hundreds of GBs at HDFS.
I need a full-text search through all sales. I have to index each sales-XXX.parquet one-by-one with Lucene. And now I have two options:
Keep Lucene indexes in Spark. There is already solution for this, but it looks pretty suspicious. Is there any better solutions?
Keep Lucene indexes at the local file system. Then I can map-reduce on the results of each worker's index lookup. But this approach requires each worker node keeps an equal amount of data. How could I ensure Spark will keep equal amount of data on each worker node?

How keep keys in aerospike effectively?

For not very big amount of data we store all keys in one bin with List.
But there are limitations on the size of bin.
Function scanAll with ScanCallback in Java client, actually works very slowly, so we cannot afford it in our project. Aerospike works fast when you give him the Key.
Now we have some sets where are a lot of records and keys. What is the best way to store all keys, or maybe there are some way to get it fast and without scanAll ?

Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error, it just comes back with no value in the specific index.
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.

Trying bulk/ingest "large" amount of documents SQL Db to Elasticsearch

Hi I need to read multiple tables from my databases and join the tables. Once the tables are joined I would like to push them to Elasticsearch.
The tables are joined from an external process as the data can come from multiple sources. This is not an issue in fact I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, which then a single JsonDocument is produced for each key.
Then there is a separate process reads the denormalized JsonDocuments and bulks them to Elasticsearch at an average of 3000 documents per second.
I'm having troubles trying to find a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3000 documents per second. I was thinking somehow split the multimap that holds the Joined json docs.
Anyways I'm building a custom application for this. So I was wondering is there any tools that can be put together to do all this? Either some form of ETL, or stream processing or something?

While streaming would make records more readily available then bulk processing, and would reduce the overhead in the java container regarding large object management, you can have a hit on the latency. Usually in these kind of scenarios you have to find an optimum for the bulk size. In this I follow the following steps:
1) Build a streaming bulk insert (so stream but still get more then 1 record (or build more then 1 JSON in your case at the time)
2) Experiment with several bulk sizes: 10,100,1000,10000 for example and plot them in a quick graph. Run a sufficient amount of records to see if performance does not go down over time: It can be that the 10 is extremely fast per record, but that there is an incremental insert overhead (for example the case in SQL Server on the primary key maintenance). If you run the same number of total records for every test, it should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between your best values of run 2
Then use the final result as your optimal stream bulk insertion size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This then fills the gaps in you process a bit. Watch the throughput and adjust your bulk sizes maybe one more time.
This approach once helped me with a multi TB import process to speed up from 2 days to about 12hrs, so it can work out pretty positive.

Hbase scan vs Mapreduce for on the fly computation

I need to compute aggregate over HBase table.
Say I have this hbase table: 'metadata' Column family:M column:n
Here metadata object has a list of strings
class metadata
{
List tags;
}
I need to compute the count of tags for which I was thinking of using either using mapreduce or scan over hbase directly.
The result has to be returned on the fly. So which one can I use in this scenario? Scan over hbase and compute the aggregate or mapreduce?
Mapreduce ultimately is going to scan hbase and compute the count.
What are the pros and cons of using either of these?

I suspect you're not aware about what are the pros and cons of HBase, it's not suited for computing realtime aggregations of large datasets.
Let's start by saying that MapReduce is a scheduled job by itself, you won't be able to return the response on the fly, expect no less than 15 seconds for the Task Tracker to initialize the job.
In the end, the MapReduce Job will do exactly the same thing: a HBase scan, the difference between performing the scan right-away and the MapReduce it's just the paralellization and data locality, which excels when you have millions/billions of rows. If your queries only needs to read a few thousand consecutive rows to aggregate them, sure, you could just do a scan and it will probably have an acceptable response time, but for larger datasets it's just going to be impossible to do that at query time.
HBase is best suited for handling tons of atomic reads and writes, that way, you can maintain those aggregations in real time, no matter how many pre-aggregated counters you'll need or how many requests you're going to receive: with a proper row key design and split policy you can scale to satisfy the demand.
Think of it as a word count, you could store all the words in a list and count them at query-time when requested or you can process that list at insert-time and store the number of times each word is used in the document, as global counter, and in a daily, monthly, yearly, per-country, per-author tables (or even families).

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.