I am trying to model a genetics problem we are working on, building up to it in steps. I can successfully run the PiAverage example from the Spark examples. That example "throws darts" at a circle (10^6 of them in our case) and counts the number that "land in the circle" to estimate Pi.
Let's say I want to repeat that process 1000 times (in parallel) and average all those estimates. I am trying to work out the best approach: does it take two calls to parallelize? Nested calls? Is there a way to chain map or reduce calls together? I can't see it.
I want to know the wisdom of something like the idea below. I thought of tracking the resulting estimates using an accumulator. jsc is my SparkContext, the full code of a single run is at the end of the question, thanks for any input!
Accumulator<Double> accum = jsc.accumulator(0.0);
// make a list 1000 long to pass to parallelize (no for loops in Spark, right?)
List<Integer> numberOfEstimates = new ArrayList<Integer>(HOW_MANY_ESTIMATES);
// pass this "dummy list" to parallelize, which then
// calls a pieceOfPI method to produce each individual estimate
// accumulating the estimates. PieceOfPI would contain a
// parallelize call too with the individual test in the code at the end
jsc.parallelize(numberOfEstimates).foreach(accum.add(pieceOfPI(jsc, numList, slices, HOW_MANY_ESTIMATES)));
// get the value of the total of PI estimates and print their average
double totalPi = accum.value();
// output the average of averages
System.out.println("The average of " + HOW_MANY_ESTIMATES + " estimates of Pi is " + totalPi / HOW_MANY_ESTIMATES);
None of the matrix or other answers I see on SO seem to address this specific question. I have done several searches, but I am not seeing how to do this without "parallelizing the parallelization." Is that a bad idea?
(and yes I realize mathematically I could just do more estimates and effectively get the same results :) Trying to build a structure my boss wants, thanks again!
I have put my entire single-test program here if that helps, sans an accumulator I was testing out. The core of this would become PieceOfPI():
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.Accumulable;
import org.apache.spark.Accumulator;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.SparkConf;
public class PiAverage implements Serializable {
public static void main(String[] args) {
PiAverage pa = new PiAverage();
pa.go();
}
public void go() {
// this should be a parameter, like all of these finals should be
// int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
final int SLICES = 16;
// how many "darts" are thrown at the circle to get one single Pi estimate
final int HOW_MANY_DARTS = 1000000;
// how many "dartboards" to collect to average the Pi estimate, which we hope converges on the real Pi
final int HOW_MANY_ESTIMATES = 1000;
SparkConf sparkConf = new SparkConf().setAppName("PiAverage")
.setMaster("local[4]");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// setup "dummy" ArrayList of size HOW_MANY_DARTS -- how many darts to throw
List<Integer> throwsList = new ArrayList<Integer>(HOW_MANY_DARTS);
for (int i = 0; i < HOW_MANY_DARTS; i++) {
throwsList.add(i);
}
// setup "dummy" ArrayList of size HOW_MANY_ESTIMATES
List<Integer> numberOfEstimates = new ArrayList<Integer>(HOW_MANY_ESTIMATES);
for (int i = 0; i < HOW_MANY_ESTIMATES; i++) {
numberOfEstimates.add(i);
}
JavaRDD<Integer> dataSet = jsc.parallelize(throwsList, SLICES);
long totalPi = dataSet.filter(new Function<Integer, Boolean>() {
public Boolean call(Integer i) {
double x = Math.random();
double y = Math.random();
return x * x + y * y < 1;
}
}).count();
System.out.println(
"The average of " + HOW_MANY_DARTS + " estimates of Pi is " + 4 * totalPi / (double)HOW_MANY_DARTS);
jsc.stop();
jsc.close();
}
}
Let me start with your "background question". Transformation operations like map, join, groupBy, etc. fall into two categories: those that require a shuffle of data as input from all the partitions, and those that don't. Operations like groupBy and join require a shuffle, because you need to bring together all records from all of the RDD's partitions with the same keys (think of how SQL JOIN and GROUP BY operations work). On the other hand, map, flatMap, filter, etc. don't require shuffling, because each operation works fine on the output of the previous step's partition. They work on single records at a time, not on groups of them with matching keys. Hence, no shuffling is necessary.
This background is necessary to understand that an "extra map" does not add significant overhead. A sequence of operations like map, flatMap, etc. is "squashed" together into a "stage" (shown when you look at the details of a job in the Spark web console) so that only one RDD is materialized, the one at the end of the stage.
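A small sketch of what that pipelining looks like in the Java API (jsc as in your code, java.util.Arrays imported, Java 8 lambdas assumed; nothing here is specific to the Pi problem):
// map and filter are narrow transformations, so Spark fuses them into one
// stage; no RDD is materialized until the count() action runs.
JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
long kept = numbers
    .map(i -> i * 2)           // narrow: no shuffle
    .filter(i -> i % 3 == 0)   // narrow: same stage as the map
    .count();                  // action: the single fused stage executes here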
On to your first question. I wouldn't use an accumulator for this. They are intended for "side-band" data, like counting how many bad lines you parsed. In your case, for example, you might use accumulators to count how many (x, y) pairs were inside the radius of 1 vs. outside.
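As a hedged illustration of that side-band use, reusing jsc and the dataSet RDD from the code at the end of your question (Java 8 lambdas assumed):
// The accumulator only keeps a tally on the side; the "real" dataflow is
// still whatever transformations and actions you run on dataSet.
Accumulator<Integer> insideCount = jsc.accumulator(0);
dataSet.foreach(i -> {
    double x = Math.random();
    double y = Math.random();
    if (x * x + y * y < 1) {
        insideCount.add(1);
    }
});
System.out.println("Darts inside the circle: " + insideCount.value());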
The JavaPiSpark example in the Spark distribution is about as good as it gets; you should study why it works. It is the right dataflow model for Big Data systems. You could use "aggregators": in the Javadocs, click the "index" and look at the agg, aggregate, and aggregateByKey functions. However, they are not any easier to understand, and they are not necessary here. They do provide greater flexibility than map followed by reduce, so they are worth knowing.
The problem with your code is that you are effectively trying to tell Spark what to do, rather than expressing your intent and letting Spark optimize how it does it for you.
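To make that concrete, here is a minimal sketch, not the JavaPiSpark code itself, of expressing your 1000-estimate average as a single parallelize followed by map and reduce; it reuses jsc, SLICES, HOW_MANY_DARTS and HOW_MANY_ESTIMATES from your program and assumes Java 8 lambdas:
// One parallelize over the estimate indices: each task throws its own darts
// locally, then a plain reduce sums the estimates. No nested parallelize,
// no accumulator.
List<Integer> estimateIds = new ArrayList<>();
for (int i = 0; i < HOW_MANY_ESTIMATES; i++) {
    estimateIds.add(i);
}
double sumOfEstimates = jsc.parallelize(estimateIds, SLICES)
    .map(id -> {
        int inside = 0;
        for (int d = 0; d < HOW_MANY_DARTS; d++) {
            double x = Math.random();
            double y = Math.random();
            if (x * x + y * y < 1) {
                inside++;
            }
        }
        return 4.0 * inside / HOW_MANY_DARTS;   // one Pi estimate
    })
    .reduce((a, b) -> a + b);
System.out.println("The average of " + HOW_MANY_ESTIMATES + " estimates of Pi is "
    + sumOfEstimates / HOW_MANY_ESTIMATES);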
Finally, I suggest you buy and study O'Reilly's "Learning Spark". It does a good job explaining the internal details, like staging, and it shows lots of example code you can use, too.
As I said, I am working on Euler problem 12 (https://projecteuler.net/problem=12). I believe this program will give the correct answer, but it is too slow; I tried to wait it out, but even after 9 minutes it still can't finish. How can I modify it to run faster?
package highlydivisibletriangularnumber_ep12;
public class HighlyDivisibleTriangularNumber_EP12 {
public static void findTriangular(int triangularNum){
triangularValue = triangularNum * (triangularNum + 1)/2;
}
static long triangularValue = 0l;
public static void main(String[] args) {
long n = 1l;
int counter = 0;
int i = 1;
while(true){
findTriangular(i);
while(n<=triangularValue){
if(triangularValue%n==0){
counter++;
}
n++;
}
if(counter>500){
break;
}else{
counter = 0;
}
n=1;
i++;
}
System.out.println(triangularValue);
}
}
Just two simple tricks:
When x%n == 0, then also x%m == 0 with m = x/n. This way you need to consider only n <= Math.ceil(sqrt(x)), which is a huge speed up. With each divisor smaller than the square root, you get another one for free. Beware of the case of equality. The speed gain is huge.
As your x is a product of two numbers i and i+1, you can generate all its divisors as product of the divisors of i and i+1. What makes it more complicated is the fact that in general, the same product can be created using different factors. Can it happen here? Do you need to generate products or can you just count them? Again, the speed gain is huge.
You could use prime factorization, but I'm sure these two tricks alone are sufficient.
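A rough sketch of both tricks in plain Java (the method names are mine, not from your program); for this problem you would call firstTriangularWithMoreThan(500):
// countDivisors uses the sqrt pairing trick: every divisor d <= sqrt(n) pairs
// with n/d. The search uses the fact that i and i + 1 are coprime, so the
// divisor counts of the two halves of i * (i + 1) / 2 simply multiply.
static long countDivisors(long n) {
    long count = 0;
    for (long d = 1; d * d <= n; d++) {
        if (n % d == 0) {
            count += (d * d == n) ? 1 : 2;   // a perfect-square divisor pairs with itself
        }
    }
    return count;
}

static long firstTriangularWithMoreThan(int divisorLimit) {
    for (long i = 1; ; i++) {
        // pull the factor of 2 out of whichever of i and i + 1 is even
        long divisors = (i % 2 == 0)
            ? countDivisors(i / 2) * countDivisors(i + 1)
            : countDivisors(i) * countDivisors((i + 1) / 2);
        if (divisors > divisorLimit) {
            return i * (i + 1) / 2;          // the triangular number itself
        }
    }
}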
It appears to me that your algorithm is a bit too brute-force, and because of that it will consume an enormous amount of CPU time regardless of how you rearrange it.
What is needed is an algorithm that implements a formula that calculates at least part of the solution, instead of brute-forcing the whole thing.
If you get stuck, you can use your favorite search engine to find a number of solutions, with varying degrees of efficiency.
So I have this method that picks at random an object from a list of 2 objects. I would like to write a JUnit test (@Test) asserting, based on a confidence level, that there is a 50% chance for each of the 2 objects to be picked.
The piece of code under test:
public MySepecialObj pickTheValue(List<MySepecialObj> objs, Random shufflingFactor) {
// this could probably be done in a more efficient way
// but my point is asserting on the 50% chance of the
// two objects inside the input list
Collections.shuffle(objs, shufflingFactor);
return objs.get(0);
}
In the test I would like to provide 2 mocks (firstMySepecialObjMock and secondMySepecialObjMock) as input objects of type MySepecialObj and new Random() as the input shuffling parameter, then assert that the firstMySepecialObjMock happens to be the choice 50% of the times and secondMySepecialObjMock happens to be the choice in the other 50% of the times.
Something like:
@Test
public void myTestShouldCheckTheConfidenceInterval() {
// using Mockito here
MySepecialObj firstMySepecialObjMock = mock(MySepecialObj.class);
MySepecialObj secondMySepecialObjMock = mock(MySepecialObj.class);
// using some helpers from Guava to build the input list
List<MySepecialObj> theListOfTwoElements = Lists.newArrayList(firstMySepecialObjMock, secondMySepecialObjMock);
// call the method (multiple times? how many?) like:
MySepecialObj chosenValue = pickTheValue(theListOfTwoElements, new Random());
// assert somehow on all the choices using a confidence level
// verifying that firstMySepecialObjMock was picked ~50% of the times
// and secondMySepecialObjMock was picked the other ~50% of the times
}
I am not sure about the statistics theory here, so maybe I should provide a different instance of Random with different parameters to its constructor?
I would also like to have a test where I could set the confidence level as a parameter (I guess usually is 95%, but it could be another value?).
What could be a pure java solution/setup of the test involving a confidence level parameter?
What could be an equivalent solution/setup of the test involving some helper library like the Apache Commons?
First of all, this is the normal way to pick a random element from a List in Java (nextInt(objs.size()) produces a random integer between 0 inclusive and objs.size() exclusive):
public MySepecialObj pickTheValue(List<MySepecialObj> objs, Random random) {
int i = random.nextInt(objs.size());
return objs.get(i);
}
You can read on Wikipedia about how many times you should perform an experiment with 2 possible outcomes for a given confidence level. E.g. for a confidence level of 95% the critical value (z) is 1.9599; that is the value stored in confidenceInterval in the code below. You also need to choose a maximum error, say 0.01. Then the number of times to perform the experiment is:
double confidenceInterval = 1.9599;
double maxError = 0.01;
int numberOfPicks = (int) (Math.pow(confidenceInterval, 2)/(4*Math.pow(maxError, 2)));
which results in numberOfPicks = 9603. That's how many times you should call pickTheValue.
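For reference, that line is just the usual sample-size formula for estimating a proportion, with the worst case p = 0.5, z the critical value and E the maximum error:
n = \frac{z^2 \, p(1-p)}{E^2} = \frac{z^2}{4E^2} = \frac{1.9599^2}{4 \times 0.01^2} \approx 9603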
This would be how I recommend you perform the experiment multiple times (Note that random is being reused):
Random random = new Random();
double timesFirstWasPicked = 0;
double timesSecondWasPicked = 0;
for (int i = 0; i < numberOfPicks; ++i) {
MySepecialObj chosenValue = pickTheValue(theListOfTwoElements, random);
if (chosenValue == firstMySepecialObjMock) {
++timesFirstWasPicked;
} else {
++timesSecondWasPicked;
}
}
double probabilityFirst = timesFirstWasPicked / numberOfPicks;
double probabilitySecond = timesSecondWasPicked / numberOfPicks;
Then assert that probabilityFirst and probabilitySecond are each no further than maxError from 0.5.
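In plain JUnit that final check could look like this (a sketch assuming JUnit 4's assertEquals(double, double, double) overload, where the third argument is the allowed delta):
// the observed frequencies must land within maxError of the ideal 0.5
assertEquals(0.5, probabilityFirst, maxError);
assertEquals(0.5, probabilitySecond, maxError);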
I found a BinomialTest class in apache-commons-math but I don't see how it can help in your case. It can calculate the confidence level from the number of experiments. You want the reverse of that.
I am creating javafx.Text objects (maintained in an instance of LinkedList) and placing them on javafx.Group (i.e: sourceGroup.getChildren().add(Text)). Each Text instance holds only one letter (not an entire word).
I have a click event that returns the x and y coordinates of the click. I want to drop a cursor in front of the letter that was clicked. This needs to be done in constant time, so I can't just iterate over my LinkedList and examine the Text x and y values.
There are certain restrictions on the libraries I can use. I can essentially only use javafx stuffs and java.util stuffs.
I was reading that HashMap lookups essentially take place in constant time. My idea to drop the cursor is to:
1) While adding Text to the LinkedList instance, update four HashMaps: one for the upper X value, one for the lower X value, and the same for the Y values.
2) When it comes time to drop a cursor, grab the x and y coordinates of the mouse click and perform a series of intersections (this part I'm not sure how to do yet) which should return the Text or subset of Texts that fall between the X range and the Y range.
My Question:
Is there a better/more efficient way to do this? Am I being terribly inefficient with this idea?
Just add a click listener to each text item, and, when the mouse is clicked on the text, reposition the cursor based upon the text bounds in parent.
It's your homework, so you may or may not wish to look at the following code...
import javafx.application.Application;
import javafx.geometry.*;
import javafx.scene.Scene;
import javafx.scene.control.ScrollPane;
import javafx.scene.layout.FlowPane;
import javafx.scene.layout.Pane;
import javafx.scene.shape.Line;
import javafx.scene.text.Text;
import javafx.stage.Stage;
import java.util.stream.Collectors;
public class SillySong extends Application {
private static final String lyrics =
"Mares eat oats and does eat oats and little lambs eat ivy. ";
private static final int CURSOR_HEIGHT = 16;
private static final int INSET = 2;
private static final int N_LYRIC_REPEATS = 10;
private Line cursor = new Line(INSET, INSET, INSET, INSET + CURSOR_HEIGHT);
@Override
public void start(Stage stage) {
FlowPane textPane = new FlowPane();
for (int i = 0; i < N_LYRIC_REPEATS; i++) {
lyrics.codePoints()
.mapToObj(this::createTextNode)
.collect(Collectors.toCollection(textPane::getChildren));
}
textPane.setPadding(new Insets(INSET));
Pane layout = new Pane(textPane, cursor) {
@Override
protected void layoutChildren() {
super.layoutChildren();
layoutInArea(textPane, 0, 0, getWidth(), getHeight(), 0, new Insets(0), HPos.LEFT, VPos.TOP);
}
};
ScrollPane scrollPane = new ScrollPane(layout);
scrollPane.setFitToWidth(true);
stage.setScene(new Scene(scrollPane, 200, 150));
stage.show();
}
private Text createTextNode(int c) {
Text text = new Text(new String(Character.toChars(c)));
text.setOnMouseClicked(event -> {
Bounds bounds = text.getBoundsInParent();
cursor.setStartX(bounds.getMinX());
cursor.setStartY(bounds.getMinY());
cursor.setEndX(bounds.getMinX());
cursor.setEndY(bounds.getMinY() + CURSOR_HEIGHT);
});
return text;
}
public static void main(String[] args) {
launch(args);
}
}
This was just a basic sample, if you want to study something more full featured, look at the source of RichTextFX.
Truly, new TextField() is simpler :-)
So, what's really going on in the sample above? Where did all your fancy hash tables for click support go? How is JavaFX determining you clicked on a given text node? Is it using some kind of tricky algorithm for spatial indexing such as a quadtree or a kdtree?
Nah, it is just doing a straight depth first search of the scene graph tree and returning the first node it finds that intersects the click point, taking care to loop through children in reverse order so that the last added child to a parent group receives click processing priority over earlier children if the two children overlap.
For a parent node (Parent.java source):
@Deprecated
@Override protected void impl_pickNodeLocal(PickRay pickRay, PickResultChooser result) {
double boundsDistance = impl_intersectsBounds(pickRay);
if (!Double.isNaN(boundsDistance)) {
for (int i = children.size()-1; i >= 0; i--) {
children.get(i).impl_pickNode(pickRay, result);
if (result.isClosed()) {
return;
}
}
if (isPickOnBounds()) {
result.offer(this, boundsDistance, PickResultChooser.computePoint(pickRay, boundsDistance));
}
}
}
For a leaf node (Node.java):
@Deprecated
protected void impl_pickNodeLocal(PickRay localPickRay, PickResultChooser result) {
impl_intersects(localPickRay, result);
}
So you don't need to implement your own geometry processing and pick handling with a complicated supporting algorithm, as JavaFX already provides an appropriate structure (the scene graph) and is fully capable of processing click handling events for it.
Addressing additional questions or concerns
I know that searching trees is fast and efficient, but it isn't constant time right?
Searching trees is not necessarily fast nor efficient. Search speed depends upon the depth and width of the tree and whether the tree is ordered, allowing a binary search. The scene graph is not a [binary search tree](https://en.wikipedia.org/wiki/Binary_search_tree), a red-black tree, a b-tree, or any other kind of tree that is optimized for search. The hit-testing algorithm that JavaFX uses, as can be seen above, is a depth-first traversal of the tree, which is linear in time: O(n).
If you wanted, you could subclass Parent, Region, Pane, or Group and implement your own search algorithm for picking by overriding functions such as impl_pickNodeLocal. For example, if you constrain your field to a fixed-width font, calculating which letter a click will hit is a trivial function that could be done in constant time via a simple mathematical equation and no additional data structures.
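For example, here is a hypothetical constant-time lookup for a single line of fixed-width text; charWidth, leftInset and letterCount are illustrative assumptions, not values taken from the code above:
// Which letter was clicked, assuming every glyph occupies the same cell width.
static int letterIndexAt(double clickX, double charWidth, double leftInset, int letterCount) {
    int column = (int) ((clickX - leftInset) / charWidth);  // column hit by the click
    return Math.max(0, Math.min(column, letterCount - 1));  // clamp to a valid index
}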
Starting to get really off-topic aside
But, even if you can do implement a custom hit processing algorithm, you need to consider whether you really should. Obviously the default hit testing algorithm for JavaFX is sufficient for most applications and further optimizations of it for the general use case are currently deemed unnecessary. If there existed some well-known golden algorithm or data structure that greatly improved its quality and there was sufficient demand for it, the hit-testing algorithm would have been further optimized already. So it is probably best to use the default implementation unless you are experiencing performance issues and you have a specific use case (such as a mapping application), where an alternate algorithm or data structure (such as an r-tree), can be used to effect a performance boost. Even then, you would want to benchmark various approaches on different sizes and types of data sets to validate the performance on those data sets.
You can see evidence of the optimization approach I described above in the multiply algorithm for BigIntegers in the JDK. You might think that number multiplication is a constant-time operation, but it's not for large numbers, because the digits of the numbers are spread across many bytes and it is necessary to process all of those bytes to perform the multiplication. There are various algorithms for processing the bytes for multiplication, but the choice of the "most efficient" one depends upon the properties of the numbers themselves (e.g. their size). For smaller numbers, a straight loop for long multiplication is the most efficient; for larger numbers, the Karatsuba algorithm is used; and for even larger numbers the Toom-Cook algorithm is used. The thresholds for how large a number needs to be before switching to a different algorithm were chosen via analysis and benchmarking. Also, if the number is being multiplied by itself (it is being squared), a more efficient algorithm can be used to perform the square (so that is an example of a special edge case that is specifically optimized for).
/**
* Returns a BigInteger whose value is {@code (this * val)}.
*
* @implNote An implementation may offer better algorithmic
* performance when {@code val == this}.
*
* @param val value to be multiplied by this BigInteger.
* @return {@code this * val}
*/
public BigInteger multiply(BigInteger val) {
if (val.signum == 0 || signum == 0)
return ZERO;
int xlen = mag.length;
if (val == this && xlen > MULTIPLY_SQUARE_THRESHOLD) {
return square();
}
int ylen = val.mag.length;
if ((xlen < KARATSUBA_THRESHOLD) || (ylen < KARATSUBA_THRESHOLD)) {
int resultSign = signum == val.signum ? 1 : -1;
if (val.mag.length == 1) {
return multiplyByInt(mag,val.mag[0], resultSign);
}
if (mag.length == 1) {
return multiplyByInt(val.mag,mag[0], resultSign);
}
int[] result = multiplyToLen(mag, xlen,
val.mag, ylen, null);
result = trustedStripLeadingZeroInts(result);
return new BigInteger(result, resultSign);
} else {
if ((xlen < TOOM_COOK_THRESHOLD) && (ylen < TOOM_COOK_THRESHOLD)) {
return multiplyKaratsuba(this, val);
} else {
return multiplyToomCook3(this, val);
}
}
}
I'm trying to use the apache commons math library version 3.5+ to solve an optimization problem. Basically, I'm trying to fit a (gamma) distribution to some data points. I can't seem to find any simple examples of how to use the new (version 3.5) optimization tools, such as SimplexSolver, SimplexOptimizer, or OptimizationData, to solve a trivial optimization problem.
Similar questions have been asked here before, but all the answers seem to be for older versions of Apache commons math; in 3.5 things were restructured and none of the example code I could find works.
Does anyone have a working example how to use the new optimizers or solvers? I'm most interested in SimplexOptimizer, but at this point anything would be useful.
Indeed, the optimizers may be hard to use: there are lots of parameters, different combinations of which are required for the different types of optimizers, and they are all hidden in the generic OptimizationData array that the optimizers receive. Unless you start matching the code with the papers that it refers to, you can hardly get any results out of them whatsoever.
When I also wanted to give some of these solvers/optimizers a try, the main source of reliable, working "examples" for me turned out to be the unit tests of these classes, which usually are quite elaborate and cover many cases. For example, regarding the SimplexOptimizer, you may want to have a look at the org/apache/commons/math4/optim/nonlinear/scalar/noderiv/ test cases, which contain the test classes SimplexOptimizerMultiDirectionalTest.java and SimplexOptimizerNelderMeadTest.java.
(Sorry, maybe this is not what you expected or hoped for, but ... I found these tests tremendously helpful when I tried to figure out which OptimizationData these optimizers actually need...)
EDIT
Just for reference, a complete example, extracted from one of the basic unit tests:
import java.util.Arrays;
import org.apache.commons.math3.analysis.MultivariateFunction;
import org.apache.commons.math3.optim.InitialGuess;
import org.apache.commons.math3.optim.MaxEval;
import org.apache.commons.math3.optim.PointValuePair;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;
import org.apache.commons.math3.optim.nonlinear.scalar.ObjectiveFunction;
import org.apache.commons.math3.optim.nonlinear.scalar.noderiv.NelderMeadSimplex;
import org.apache.commons.math3.optim.nonlinear.scalar.noderiv.SimplexOptimizer;
import org.apache.commons.math3.util.FastMath;
public class SimplexOptimizerExample
{
public static void main(String[] args)
{
SimplexOptimizer optimizer = new SimplexOptimizer(1e-10, 1e-30);
final FourExtrema fourExtrema = new FourExtrema();
final PointValuePair optimum =
optimizer.optimize(
new MaxEval(100),
new ObjectiveFunction(fourExtrema),
GoalType.MINIMIZE,
new InitialGuess(new double[]{ -3, 0 }),
new NelderMeadSimplex(new double[]{ 0.2, 0.2 }));
System.out.println(Arrays.toString(optimum.getPoint()) + " : "
+ optimum.getSecond());
}
private static class FourExtrema implements MultivariateFunction
{
// The following function has 4 local extrema.
final double xM = -3.841947088256863675365;
final double yM = -1.391745200270734924416;
final double xP = 0.2286682237349059125691;
final double yP = -yM;
final double valueXmYm = 0.2373295333134216789769; // Local maximum.
final double valueXmYp = -valueXmYm; // Local minimum.
final double valueXpYm = -0.7290400707055187115322; // Global minimum.
final double valueXpYp = -valueXpYm; // Global maximum.
public double value(double[] variables)
{
final double x = variables[0];
final double y = variables[1];
return (x == 0 || y == 0) ? 0 : FastMath.atan(x)
* FastMath.atan(x + 2) * FastMath.atan(y) * FastMath.atan(y)
/ (x * y);
}
}
}
I have a pair RDD with millions of key-value pairs, where every value is a list which may contain a single element or billions of elements. This leads to poor performance, since the large groups block the nodes of the cluster for hours, while groups that would take a few seconds cannot be processed in parallel because the whole cluster is already busy.
Is there any way to improve this?
EDIT:
The operation that is giving me problems is a flatMap where the whole list for a given key is analyzed. The key is not touched, and the operation compares every element in the list to the rest of the list, which takes a huge amount of time but unfortunately it has to be done. This means that the WHOLE list needs to be in the same node at the same time. The resulting RDD will contain a sublist depending on a value calculated in the flatMap.
I cannot use broadcast variables in this scenario, as no common data is shared between the different key-value pairs. As for a partitioner: according to the O'Reilly Learning Spark book, this kind of operation will not benefit from a partitioner since no shuffle is involved (although I am not sure if this is true). Can a partitioner help in this situation?
SECOND EDIT:
This is an example of my code:
public class MyFunction implements FlatMapFunction
<Tuple2<String, Iterable<Bean>>, ProcessedBean> {
public Iterable<ProcessedBean> call(Tuple2<String, Iterable<Bean>> input) throws Exception {
List<ProcessedBean> output = new ArrayList<ProcessedBean>();
List<Bean> listToProcess = CollectionsUtil.makeList(input._2());
// In some cases size == 2, in others size > 100,000
for (int i = 0; i < listToProcess.size() - 1; i++) {
for (int j = i + 1; j < listToProcess.size(); j++) {
ProcessedBean processed = processData(listToProcess.get(i), listToProcess.get(j));
if (processed != null) {
output.add(processed);
}
}
}
return output;
}
}
The double for loop will run n(n-1)/2 times, but this cannot be avoided.
The order in which the keys get processed has no effect on the total computation time. The only issue from variance (some values are small, others are large) I can imagine is at the end of processing: one large task is still running while all other nodes are already finished.
If this is what you are seeing, you could try increasing the number of partitions. This would reduce the size of tasks, so a super large task at the end is less likely.
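A hedged sketch of that; pairRdd and the multiplier of 4 are illustrative, not taken from your code:
// More, smaller partitions before the quadratic flatMap, so the last
// straggling task covers less data; the factor is arbitrary and worth tuning.
JavaPairRDD<String, Iterable<Bean>> spread =
    pairRdd.repartition(jsc.defaultParallelism() * 4);
JavaRDD<ProcessedBean> processed = spread.flatMap(new MyFunction());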
Broadcast variables and partitioners will not help with the performance. I think you should focus on making the everything-to-everything comparison step as efficient as possible. (Or better yet, avoid it. I don't think quadratic algorithms are really sustainable in big data.)
If 'processData' is expensive, it's possible that you could parallelize that step and pick up some gains there.
In pseudo-code, it would be something like:
def processData(bean1: Bean, bean2: Bean): Option[ProcessedBean] = { ... }
val rdd: RDD[(Key, List[Bean])] = ...
val pairs: RDD[(Bean, Bean)] = rdd.flatMap { case (key, beans) =>
val output = mutable.ListBuffer[(Bean, Bean)]()
val len = beans.length
for (i <- 0 until len - 1) {
for (j <- i + 1 until len) {
output += ((beans(i), beans(j)))
}
}
output
}.repartition(someNumber)
val result: RDD[ProcessedBean] = pairs
.map(beans => processData(beans._1, beans._2))
.filter(_.isDefined)
.map(_.get)
The flatMap step will still be bounded by your biggest list, and you'll incur a shuffle when you repartition, but moving the processData step outside of that N^2 step could gain you some parallelism.
Skew like this is often domain specific. You could create your value data as an RDD and join on it. Or you could try using broadcast variables. Or you could write a custom partitioner that might help split the data differently.
But, ultimately, it is going to depend on the computation and specifics of the data.
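If you do go down the custom partitioner route, a minimal sketch might look like the class below. The idea of a fixed list of known "heavy" keys is an assumption for illustration; your data may not offer such a list.
import java.util.List;
import org.apache.spark.Partitioner;

// Gives each known "heavy" key its own partition and hashes the rest over the
// remaining partitions, so one huge group no longer shares a task with others.
// Assumes numPartitions > heavyKeys.size().
public class SkewAwarePartitioner extends Partitioner {
    private final List<String> heavyKeys;    // keys known to carry huge value lists
    private final int numPartitions;

    public SkewAwarePartitioner(List<String> heavyKeys, int numPartitions) {
        this.heavyKeys = heavyKeys;
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        int heavy = heavyKeys.indexOf(key);
        if (heavy >= 0) {
            return heavy;                                // dedicated partition per heavy key
        }
        int buckets = numPartitions - heavyKeys.size();  // partitions left for normal keys
        int hashed = key.hashCode() % buckets;
        if (hashed < 0) {
            hashed += buckets;
        }
        return heavyKeys.size() + hashed;
    }
}
You would then apply it with something like JavaPairRDD.partitionBy(new SkewAwarePartitioner(heavyKeys, 200)) before the expensive step, but as noted above, whether this helps depends entirely on the computation and the specifics of your data.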