Serialize/Broadcast large Map in Spark + Scala

My dataset is made up of data points which are 5000-element arrays (of Doubles) and each data point has a clusterId assigned to it.
For the purposes of the problem I am solving, I need to aggregate those arrays (element-wise) per clusterId and then do a dot product calculation between each data point and its respective aggregate cluster array.
The total number of data points I am dealing with is 4.8 million, and they are split across ~50k clusters.
I use reduceByKey to get the aggregated arrays per clusterId (which is my key). From there, I have two distinct options:
join the aggregate (clusterId, aggregateVector) pairs to the original dataset - so that each aggregateVector is available to each partition
collect the rdd of (clusterId, aggregateVector) locally and serialize it back to my executors - again, so that I can make the aggregateVectors available to each partition
My understanding is that a join causes re-partitioning based on the join key; in my case the key has ~50k unique values, so I expect this to be quite slow.
What I tried is the 2nd approach - I managed to collect the RDD locally and turn it into a Map with clusterId as the key and the 5000-element Array[Double] as the value.
However, when I try to broadcast/serialize this variable into a closure, I get a "java.lang.OutOfMemoryError: Requested array size exceeds VM limit".
My question is - given the nature of my problem where I need to make aggregate data available to each executor, what is the best way to approach this, given that the aggregate dataset (in my case 50k x 5000) could be quite large?
Thanks

I highly recommend the join. 50,000 clusters x 5,000 values x 8 bytes per value is already 2 GB, which is manageable, but it's definitely in the "seriously slow things down, and maybe break some stuff" ballpark.
You are right that repartitioning can sometimes be slow, but I think you are more concerned about it than necessary. It's still an entirely parallel/distributed operation, which makes it essentially infinitely scalable. Collecting things into the driver is not.
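For illustration, here is a minimal sketch of the join approach using Spark's Java API (the Scala version is analogous); the class name, the dataPoints parameter and the plain double[] representation are assumptions, not your actual schema:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public final class ClusterDotProducts {

    // Element-wise sum of two equal-length vectors.
    private static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // dataPoints: (clusterId, 5000-element vector) for every data point.
    public static JavaRDD<Double> dotProducts(JavaPairRDD<Integer, double[]> dataPoints) {
        // 1. Aggregate the vectors per clusterId (element-wise sum).
        JavaPairRDD<Integer, double[]> aggregates =
                dataPoints.reduceByKey(ClusterDotProducts::add);

        // 2. Join each data point with its cluster's aggregate vector.
        JavaPairRDD<Integer, Tuple2<double[], double[]>> joined = dataPoints.join(aggregates);

        // 3. Dot product between each point and its cluster aggregate.
        return joined.values().map(pair -> {
            double[] point = pair._1();
            double[] agg = pair._2();
            double dot = 0.0;
            for (int i = 0; i < point.length; i++) {
                dot += point[i] * agg[i];
            }
            return dot;
        });
    }
}
The shuffles behind reduceByKey and join are keyed by clusterId and stay fully distributed, so nothing ever has to fit on the driver.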

Related

Alternate approach to storing complex data types in room database

Existing approach: Currently we use TypeConverters to help the database store and retrieve complex data types (POJO class objects). But that involves serializing and deserializing the objects, which seems unnecessary when we only need a simple primitive data type like an int, string, or float.
My approach: I am thinking of breaking the complex data type down into primitive ones and creating separate columns to store them. When we need a simple primitive type from the database, we won't have to go through the process of deserializing complex objects.
I have tried my approach and it is working, but I'm not sure of the corner cases that may arise when implementing this approach in big projects.
I am still new to this and need help finding the pros and cons of my approach.
There are some who advocate storing a representation of objects as a single column. This is fine if you just want to store and retrieve the objects and then work with the objects. The code itself can often be shorter.
If you want to manipulate the underlying values (fields in the object) embedded in the representation via the power of SQLite then matters can get quite complex and perhaps inefficient due to the much higher likelihood of full table scans due to the likely lack of available/usable indexes.
e.g. if myvalue is a value within a representation (typically JSON), then to find rows with this value you would have to use something like
@Query("SELECT * FROM the_table WHERE the_one_for_many_values_column LIKE '%myvalue%'")
or
@Query("SELECT * FROM the_table WHERE instr(the_one_for_many_values_column, 'myvalue')")
as opposed to myvalue being stored in a column of its own (say the_value), then
@Query("SELECT * FROM the_table WHERE the_value LIKE 'myvalue'")
The former two have the flaw that if myvalue is stored elsewhere within the representation then that row is also included. Other than the fact that LIKE is case-independent, the third is an exact match.
An index on the the_value column may improve performance.
Additionally, the representation will undoubtedly add bloat (separators and descriptors of the values) and will thus require more storage. This is compounded because the same data would often be stored multiple times, whilst a normalised relational approach may well store just one instance of the data (with just up to 8 bytes per repetition for a 1-many relationship, indexes excluded).
With just 32 bytes of bloat you already exceed the maximum needed to cater for a many-many relationship (8 bytes for the parent, 8 bytes for the child, and two 8-byte columns in the mapping table).
As the SQLite API utilises Cursors (buffering) to retrieve extracted data, a greater storage requirement means fewer rows can be held at once by a CursorWindow (the limited-size buffer that is loaded with x rows of output). There is also greater potential, again due to the bloat, of a single row being larger than the CursorWindow permits.
In short, for smaller, simpler projects not greatly concerned with performance, storing representations via TypeConverters could be the more convenient and practical approach. For larger, more complex projects, unleashing the relational aspects of SQLite via data that is related rather than embedded within a representation could well be the way to go.
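For illustration, a minimal sketch of the separate-columns approach; the table, entity and DAO names here are made up, and the point is simply that a dedicated, indexed column supports exact, index-friendly queries:
import androidx.room.ColumnInfo;
import androidx.room.Dao;
import androidx.room.Entity;
import androidx.room.Index;
import androidx.room.PrimaryKey;
import androidx.room.Query;

import java.util.List;

// Instead of one serialized-POJO column, each primitive field gets its own
// column, so it can be queried and indexed directly by SQLite.
@Entity(tableName = "the_table", indices = {@Index("the_value")})
public class TheTableRow {
    @PrimaryKey(autoGenerate = true)
    public long id;

    @ColumnInfo(name = "the_value")
    public String theValue;

    @ColumnInfo(name = "the_count")
    public int theCount;
}

@Dao
interface TheTableDao {
    // An exact lookup on the dedicated column, able to use the index above.
    @Query("SELECT * FROM the_table WHERE the_value = :value")
    List<TheTableRow> findByValue(String value);
}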

is it possible to create a parallel operations inside one partition of spark?

I am new to Spark and its concepts, so please bear with me and help me clear up my doubts; I'll give you an example to help you understand my question.
I have one JavaPairRDD "rdd" which contains tuples like
Tuple2<Integer, String[]>
Let's assume that String[].length == 3, meaning it contains 3 elements besides the key. What I want to do is update each element of the vector using 3 RDDs and 3 operations: "R1" and "operation1" are used to modify the first element, "R2" and "operation2" are used to modify the second element, and "R3" and "operation3" are used to modify the third element.
R1, R2 and R3 are the RDDs that provide the new values of the elements.
I know that Spark divides the data (in this example "rdd") into many partitions, but what I am asking is: is it possible to do different operations in the same partition at the same time?
According to my example, because I have 3 operations, I could take 3 tuples at a time instead of taking only one to operate on:
The treatment that I want is (t refers to the time):
at t=0:
*tuple1 = use operation1 to modify element 1
*tuple2 = use operation2 to modify element 2
*tuple3 = use operation3 to modify element 3
at t=1:
*tuple1 = use operation2 to modify element 2
*tuple2 = use operation3 to modify element 3
*tuple3 = use operation1 to modify element 1
at t=2:
*tuple1 = use operation3 to modify element 3
*tuple2 = use operation1 to modify element 1
*tuple3 = use operation2 to modify element 2
After finishing updating the first 3 tuples, I take other tuples (3 at a time) from the same partition to treat them, and so on.
Please be kind, it's just a thought that crossed my mind, and I want to know whether it is possible to do this or not. Thank you for your help.
Spark doesn't guarantee the order of execution.
You decide how individual elements of RDD should be transformed and Spark is responsible for applying the transformation to all elements in a way that it decides is the most efficient.
Depending on how many executors (i.e. threads or servers or both) are available in your environment, Spark will actually process as many tuples as possible at the same time.
First of all, welcome to the Spark community.
To add to @Tomasz Błachut's answer, Spark's execution context does not identify nodes (e.g. one computing PC) as individual processing units but rather their cores. Therefore, one job may be assigned to two cores on a 22-core Xeon instead of the whole node.
Spark EC does consider nodes as computing units when it comes to their efficiency and performance, though; as this is relevant for dividing bigger jobs among nodes of varying performance or blacklisting them if they are slow or fail often.
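To make the element-wise model above concrete, here is a minimal sketch in the Java API; it assumes R1, R2 and R3 are pair RDDs keyed by the same Integer key, and each "operation" is simplified to replacing one position of the array (the real operations would go where the assignments are):
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public final class UpdateVectors {

    // rdd: (key, String[3]); r1/r2/r3: (key, new value for positions 0, 1 and 2).
    public static JavaPairRDD<Integer, String[]> updateAll(
            JavaPairRDD<Integer, String[]> rdd,
            JavaPairRDD<Integer, String> r1,
            JavaPairRDD<Integer, String> r2,
            JavaPairRDD<Integer, String> r3) {

        // One element-wise transformation describes all three updates at once;
        // Spark then applies it to as many tuples in parallel as there are cores.
        return rdd.join(r1).join(r2).join(r3)
                .mapValues(t -> {
                    String[] vector = t._1()._1()._1().clone();
                    vector[0] = t._1()._1()._2(); // "operation1": update element 1
                    vector[1] = t._1()._2();      // "operation2": update element 2
                    vector[2] = t._2();           // "operation3": update element 3
                    return vector;
                });
    }
}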

How is data retrieved from hash tables for collisions

I understand that hash tables are designed for easy storage and retrieval of data when storing massive amounts of it. However, when retrieving a specific piece of data, how do they retrieve it if it was stored in an alternative location due to a collision?
Say there are 10 indexes, data A is stored at index 3, and data E runs into a collision because data A is already stored at index 3, so collision prevention puts it at index 7 instead. When it comes time to retrieve data E, how does it retrieve E instead of using the first hash function and retrieving A instead?
Sorry if this is a dumb question. I'm still somewhat new to programming.
I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult if not impossible to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
I believe that the list has now been replaced with a tree. This ensures O(log n) performance for lookup, insertion, etc. within a bucket in the worst case, whereas with a list the worst-case performance was O(n).
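To illustrate the chaining idea (conceptually what java.util.HashMap does, not the actual JDK code), here is a toy hash table where colliding entries stay in the same bucket and retrieval walks the chain comparing keys with equals():
import java.util.LinkedList;
import java.util.Objects;

public class ChainedHashTable<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[10];

    // The same hash function is used for put and get, so both land on the same bucket.
    private int bucketIndex(K key) {
        return Math.floorMod(Objects.hashCode(key), buckets.length);
    }

    public void put(K key, V value) {
        int i = bucketIndex(key);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        for (Entry<K, V> e : buckets[i]) {
            if (Objects.equals(e.key, key)) { // key already present: overwrite
                e.value = value;
                return;
            }
        }
        buckets[i].add(new Entry<>(key, value)); // collision: appended to the same bucket's chain
    }

    public V get(K key) {
        int i = bucketIndex(key);
        if (buckets[i] == null) {
            return null;
        }
        for (Entry<K, V> e : buckets[i]) {
            if (Objects.equals(e.key, key)) { // equals() on the key, not the hash alone, picks the right entry
                return e.value;
            }
        }
        return null;
    }
}
So in your example, A and E would both live in bucket 3, and looking up E walks that bucket's chain until the stored key equals E's key.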

pig - I get "Error: Java heap space" with hundreds of thousands tuples

I have three sets of data separated by their type; usually there are only a few hundred tuples for each uid. But (probably due to some bug) there are a few uids with up to 200,000-300,000 rows of data.
StuffProcessor sometimes throws a heap space error when there are too many tuples in a single databag. How should I fix this? Can I somehow check whether there are, for example, 100,000+ tuples for a single uid and then split the data into smaller batches?
I am completely new to Pig and have almost no idea what I am doing.
-- Create union of the three stuffs
stuff = UNION stuff1, stuff2, stuff3;
-- Group data by uid
stuffGrouped = group stuff by (long)$0;
-- Process data
processedStuff = foreach stuffGrouped generate StuffProcessor(stuff);
-- Flatten the UID groups into single table
flatProcessedStuff = foreach processedStuff generate FLATTEN($0);
-- Separate into different datasets by type, these are all schemaless
processedStuff1 = filter flatProcessedStuff by (int)$5 == 9;
processedStuff2 = filter flatProcessedStuff by (int)$5 == 17;
processedStuff3 = filter flatProcessedStuff by (int)$5 == 20;
-- Store everything into separate files into HDFS
store processedStuff1 into '$PROCESSING_DIR/stuff1.txt';
store processedStuff2 into '$PROCESSING_DIR/stuff2.txt';
store processedStuff3 into '$PROCESSING_DIR/stuff3.txt';
The Cloudera cluster should have 4 GB of heap space allocated.
This might actually have something to do with Cloudera users, since I haven't been able to reproduce this problem with certain users (the piggy user vs the hdfs user).
If your UDF doesn't really need to see all the tuples belonging to a key at the same time, you may want to implement the Accumulator interface in order to process them in smaller batches. You can also consider implementing the Algebraic interface to speed up the process.
The built-in COUNT is a perfect example.
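As a rough sketch (not your actual StuffProcessor), a COUNT-like UDF implementing Accumulator could look like the following; Pig can then feed it each uid's tuples in batches instead of materialising the whole databag in memory:
import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// A COUNT-like UDF: because it implements Accumulator, Pig may pass the
// tuples of one group in small batches rather than as one huge bag.
public class AccumulatingCount extends EvalFunc<Long> implements Accumulator<Long> {

    private long count = 0;

    // Called repeatedly, once per batch of tuples for the current key.
    @Override
    public void accumulate(Tuple input) throws IOException {
        DataBag batch = (DataBag) input.get(0);
        count += batch.size();
    }

    // Called once per key, after all batches have been accumulated.
    @Override
    public Long getValue() {
        return count;
    }

    // Called between keys so the same instance can be reused.
    @Override
    public void cleanup() {
        count = 0;
    }

    // Fallback used when Pig does not run the UDF in accumulative mode.
    @Override
    public Long exec(Tuple input) throws IOException {
        cleanup();
        accumulate(input);
        Long result = getValue();
        cleanup();
        return result;
    }
}
Your own UDF would keep whatever per-group state it needs in fields the same way, updating it in accumulate() and emitting the result in getValue().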

Efficient solution for grouping same values in a large dataset

At my job I was to develop and implement a solution for the following problem:
Given a dataset of 30M records extract (key, value) tuples from the particular dataset field, group them by key and value storing the number of same values for each key. Write top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in a form of serialized XML.
I came up with a solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples dump them on disk. Each tuple goes to a file with the name pattern '/chunk-', thus all values for a specified key are stored in one directory. Within one file values are stored sorted.
Step 2. Iterate over all '' directories and merge their chunk files into one grouping same values. Since the values are stored sorted, it is trivial to merge them for O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words, for each key) sequentially read its values, using a PriorityQueue to maintain the top 5000 values without loading all the values into memory. Write the queue contents to the database.
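As a sketch of step 3, a bounded min-heap keeps memory at O(N) per key no matter how many values are streamed in; the class name and the Map.Entry representation of (value, count) pairs are only illustrative:
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public final class TopNCollector {

    // Keeps the top n (value, count) pairs: a min-heap of size n whose head
    // is the weakest entry currently retained.
    public static List<Map.Entry<String, Long>> topN(
            Iterable<Map.Entry<String, Long>> valueCounts, int n) {

        Comparator<Map.Entry<String, Long>> byCount =
                Comparator.comparingLong(Map.Entry::getValue);
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(n, byCount);

        for (Map.Entry<String, Long> e : valueCounts) {
            if (heap.size() < n) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();     // evict the current weakest entry
                heap.offer(e);   // admit the stronger one
            }
        }
        return new ArrayList<>(heap); // unordered; sort by count afterwards if needed
    }
}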
I spent about a week on this task, mainly because I had not worked with Spring-Batch previously and because I tried to put emphasis on scalability, which requires an accurate implementation of the multi-threading part.
The problem is that my manager considers this task way too easy to spend that much time on.
And the question is - do you know of a more efficient solution, or maybe a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware about MapReduce-like frameworks, but I can't use them because the application is supposed to be run on a simple PC with 3 cores and 1GB for Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being the project manager or at least the task reviewer, would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over for each key? You are doing a lot of file-management tasks in this solution, which is definitely not free.
As you have three cores, you could parse three keys at the same time (as long as the file system can handle the load).
Your solution seems reasonable and efficient; however, I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL 2008, but the concept should be workable in almost any modern RDBMS).
The SQL between /* START */ and /* END */ would be the statements you need to execute in your code.
BEGIN
    -- database table
    DECLARE @tbl TABLE (
        k INT -- key
        , v INT -- value
        , c INT -- count
        , UNIQUE CLUSTERED (k, v)
    )
    -- insertion loop (for testing)
    DECLARE @x INT
    SET @x = 0
    SET NOCOUNT OFF
    WHILE (@x < 1000000)
    BEGIN
        --
        SET @x = @x + 1
        DECLARE @k INT
        DECLARE @v INT
        SET @k = CAST(RAND() * 10 as INT)
        SET @v = CAST(RAND() * 100 as INT)
        -- the INSERT / UPDATE code
        /* START this is the sql you'd run for each row */
        UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
        IF @@ROWCOUNT = 0
            INSERT INTO @tbl VALUES (@k, @v, 1)
        /* END */
        --
    END
    SET NOCOUNT ON
    -- final select
    DECLARE @topN INT
    SET @topN = 50
    /* START this is the sql you'd run once at the end */
    SELECT
        a.k
        , a.v
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
            , k
            , v
        FROM @tbl
    ) a
    WHERE a.rid <= @topN -- top @topN values per key
    /* END */
END
Gee, it doesn't seem like much work to try the old-fashioned way of just doing it in memory.
I would try just doing it first, and then, if you run out of memory, try one key per run (as per @Storstamp's answer).
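As a sketch of that in-memory approach (class and method names are made up), a nested HashMap keyed by key and then by value is all the bookkeeping needed; whether it fits in a 1 GB heap with ~30M rows is exactly what trying it first would tell you:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// key -> (value -> count), built up in one pass over the extracted tuples.
public final class InMemoryGrouping {

    private final Map<String, Map<String, Long>> countsByKey = new HashMap<>();

    public void add(String key, String value) {
        countsByKey
                .computeIfAbsent(key, k -> new HashMap<>())
                .merge(value, 1L, Long::sum); // increment the count for (key, value)
    }

    public Map<String, Long> countsFor(String key) {
        return countsByKey.getOrDefault(key, Collections.emptyMap());
    }
}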
If using the "simple" solution is not an option due to the size of the data, my next choice would be to use an SQL database. However, as most of these require quite much memory (and coming down to a crawl when heavily overloaded in RAM), maybe you should redirect your search into something like a NoSQL database such as MongoDB that can be quite efficient even when mostly disk-based. (Which your environment basically requires, having only 1GB of heap available).
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all indexes, sorting it), and may probably do it a bit more efficient than your solution, due to the fact that all data may be sorted and indexed already when inserted, removing the extra steps of sorting the lines in the /chunk- files, merging them etc.
You will end up with a solution that is probably much easier to administrate, and it will also allow you to set up different kind of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect however, I would object due to the solution being a bit hard to maintain, and for not using proven technologies that basically does partially the same thing as you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.
