NESTED PARALLELIZATIONS?
Let's say I am trying to do the equivalent of "nested for loops" in Spark. As in a regular language, suppose I have a routine in the inner loop that estimates Pi the way the Pi Average Spark example does (see Estimating Pi):
int iLimit = 1000;
long jLimit = 1000000; // 10^6
double counter = 0.0;
for (int i = 0; i < iLimit; i++)
    for (int j = 0; j < jLimit; j++)
        counter += PiEstimator();
double estimateOfAllAverages = counter / (iLimit * jLimit); // average over all estimates
Can I nest parallelize calls in Spark? I am trying and have not worked out the kinks yet. Would be happy to post errors and code but I think I am asking a more conceptual question about whether this is the right approach in Spark.
I can already parallelize a single Spark Example / Pi Estimate; now I want to do that 1000 times to see if it converges on Pi. (This relates to a larger problem we are trying to solve; if something closer to an MVCE is needed, I'd be happy to add it.)
BOTTOM LINE QUESTION: I just need someone to answer directly: is using nested parallelize calls the right approach? If not, please advise something specific. Thanks! Here's a pseudo-code sketch of what I think the right approach will be:
// use an accumulator to keep track of each Pi estimate result
sparkContext.parallelize(arrayOf1000, slices).map { // function call
    sparkContext.parallelize(arrayOf10^6, slices).map {
        // do the 10^6 thing here and update the accumulator with each result
    }
}
// take the average of the accumulator to see if all 1000 Pi estimates converge on Pi
BACKGROUND: I had asked this question before and got a general answer, but it did not lead to a solution; after some waffling I decided to post a new question with a different characterization. I also tried to ask this on the Spark user mailing list, but no dice there either. Thanks in advance for any help.
This is not even possible, as SparkContext is not serializable. If you want a nested for loop, then your best option is to use cartesian:
val nestedForRDD = rdd1.cartesian(rdd2)
nestedForRDD.map { case (rdd1TypeVal, rdd2TypeVal) =>
  // do your inner-nested evaluation code here
}
Keep in mind that, just as with a double for loop, this comes at a size cost: the cartesian product contains one element for every pairing of the two RDDs' elements.
In the Pi example, you can get the same answer as the nested for loop by making a single pass through the process i * j times, summing over all of the results, and then dividing by i * j at the end. If you have steps that you want to apply in the outer loop, do them within the loop, but create different groups by assigning specific keys to each inner-loop group. Without knowing what kinds of things you want to do in the outer loop, it's hard to give an example here.
For the simple case of just averaging to improve convergence, it's relatively easy. Instead of doing the nested loop, just make an RDD with i * j elements and then apply the function to each element.
In pySpark, this might look like the following (f is whatever function you want to apply; remember that it will be passed each element in the RDD, so define f with an input parameter even if you don't use it):
from pyspark.mllib.random import RandomRDDs
from operator import add

x = RandomRDDs.uniformRDD(sc, i * j)      # one element per virtual (i, j) iteration
function_values = x.map(f)
sum_of_values = function_values.reduce(add)
averaged_value = sum_of_values / (i * j)  # overall average across all i * j samples
If you want to perform actions in the outer loop, I'd assign an index (zipWithIndex) and then create a key using the index modulo j. Then each different key would be a single virtual inner-loop cycle, and you can use operators like aggregateByKey, foldByKey, or reduceByKey to perform actions only on those records. This will probably take a bit of a performance hit if the different keys are distributed to different partitions.
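To make the keying idea concrete, here is a hedged sketch using Spark's Java API; samples, i, and j are assumed names (j an effectively-final local), and it keys by integer division rather than the modulo described above, so that each virtual inner-loop cycle is a contiguous block of j samples:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// samples is an assumed JavaRDD<Double> holding all i * j function values
JavaPairRDD<Long, Double> groupSums = samples
        .zipWithIndex()                                    // (value, globalIndex)
        .mapToPair(t -> new Tuple2<>(t._2() / j, t._1()))  // key = virtual outer-loop index
        .reduceByKey((a, b) -> a + b);                     // one sum per inner-loop cycle
// dividing each sum by j yields one Pi estimate per virtual outer-loop iteration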
An alternative would be to repartition the RDD into j partitions and then use foreachPartition to apply a function to each partition.
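A hedged sketch of that variant, with the same assumed names (samples as above, jSlices the desired partition count):

samples.repartition(jSlices)
       .foreachPartition(it -> {
           double sum = 0;
           long n = 0;
           while (it.hasNext()) { // it is a java.util.Iterator<Double> over one partition
               sum += it.next();
               n++;
           }
           // sum / n is this partition's estimate; persist or accumulate it here
       });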
A third option would be to run the inner loop j times in parallel, concatenate the results into one distributed file, and then do the outer loop operations after reading this into Spark.
No. You can't.
SparkContext is only accessible from the Spark driver node. The inner parallelize() calls would try to use the SparkContext from the worker nodes, which do not have access to it.
Related
I am trying to find out how many elements of a Vector start with a given letter or number.
I've tried vector.indexOf("A"); and vector.lastIndexOf("A"); but of course they are useless for what I am trying to do, because they look for an element that is exactly "A" and nothing else.
I wanted to know if there is a Java method to do this or if I need to do it myself; if so, a little guidance on how to go about it would be appreciated.
If you do not want to (or cannot) use streams or lambdas, you can also use this little loop:
int count = 0;
for (int i = 0; i < vec.size(); i++) {
    if (vec.get(i).charAt(0) == 'A') {
        count++;
    }
}
No big thing, just checking each element to see if it starts with 'A' and counting up.
In Java 8 you can use streams to access functional-style operations such as filter() and count(). Use the stream() method to get a Stream from your collection.
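A minimal sketch, assuming a Vector<String> named vec as in the loop above:

long count = vec.stream()
                .filter(s -> s.startsWith("A")) // keep only elements starting with "A"
                .count();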
I am trying to get my app to determine the best solution for grouping 20 golfers into foursomes.
I have data that shows when a golfer played, what date and the others in the group.
I would like the groups made up of golfers who haven't played together or, when everyone has played with everyone, of those who played together the longest ago. In other words, I want groups made up of players who haven't played together in a while, as opposed to last time out.
Creating a list of all the permutations of the 20 golfers to determine the lowest-scoring combinations didn't work well.
Is there another solution that I am not thinking of?
@Salix-alba's answer is spot on to get you started. Basically, you need a way to figure out how much time has already been spent together by members of your golfing group. I'll assume for illustration that you have a method to determine how much time two golfers have spent together. The algorithm can then be summed up as:
1. Compute the total time spent together by every group of 4 golfers (see Salix-alba's answer), storing the results in an ordered fashion.
2. Pick the group of 4 golfers with the least time together as your first group.
3. Continue to pick groups from your ordered list of possible groups such that no member of the next group picked is a member of any prior group picked.
4. Halt when all golfers have a group, which will always happen before you run out of possible combinations.
By way of a quick example, not promised to compile (I wrote it directly in the answer window):
Let's assume you have a method time(a, b), where a and b are the golfer identities and the result is how much time the two golfers have spent together.
Let's also assume that we will use a TreeMap<Integer, Collection<Integer>> to keep track of "weights" associated with groups, in a sorted manner.
Now, let's construct the weights of our groups using the above assumptions:
TreeMap<Integer, Collection<Integer>> options = new TreeMap<Integer, Collection<Integer>>();
for (int i = 0; i < 17; ++i) {
    for (int j = i + 1; j < 18; ++j) {
        for (int k = j + 1; k < 19; ++k) {
            for (int l = k + 1; l < 20; ++l) {
                int timeTogether = time(i, j) + time(i, k) + time(i, l)
                                 + time(j, k) + time(j, l) + time(k, l);
                Collection<Integer> group = new HashSet<Integer>();
                group.add(i);
                group.add(j);
                group.add(k);
                group.add(l);
                // note: groups with equal totals collide here; a real implementation
                // would map each total to a list of groups
                options.put(timeTogether, group);
            }
        }
    }
}
Collection<Integer> golferLeft = new HashSet<Integer>(); // to quickly determine if we should consider a group
for (int a = 0; a < maxGolfers; a++) {
    golferLeft.add(a);
}
Collection<Collection<Integer>> finalPicks = new ArrayList<Collection<Integer>>();
Map.Entry<Integer, Collection<Integer>> least;
do {
    least = options.pollFirstEntry();
    if (least != null && golferLeft.containsAll(least.getValue())) {
        finalPicks.add(least.getValue());
        golferLeft.removeAll(least.getValue());
    }
} while (golferLeft.size() > 0 && least != null);
And at the end of the final loop, finalPicks will have a number of collections, with each collection representing a play-group.
Obviously, you can tweak the weight function to get different results -- say you would rather be concerned with minimizing the time since members of the group played together. In that case, instead of using play time, sum up time since last game for each member of the group with some arbitrarily large but reasonable value to indicate if they have never played, and instead of finding the least group, find the largest. And so on.
I hope this has been a helpful primer!
There are 20C4 possible groupings, which is 4845. It should be possible to generate these combinations quite easily with four nested for loops:
int count = 0;
for (int i = 0; i < 17; ++i) {
    for (int j = i + 1; j < 18; ++j) {
        for (int k = j + 1; k < 19; ++k) {
            for (int l = k + 1; l < 20; ++l) {
                System.out.println("" + i + "\t" + j + "\t" + k + "\t" + l);
                ++count;
            }
        }
    }
}
System.out.println("Count " + count);
You can quickly loop through all of these and use some objective function to work out which is the most optimal grouping. Your problem definition is a little fuzzy, so I'm not sure how to tell which is the best combination.
That's just the number of ways of picking four golfers out of 20; you really need five groups of 4, which I think is 20C4 * 16C4 * 12C4 * 8C4 = 305,540,235,000. This is still in the realm of exhaustive computation, though you might need to wait a few minutes.
Another approach would be probabilistic: just pick the groups at random, rejecting illegal combinations and those which don't meet your criteria. Keep picking random groups until you find one that is good enough.
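A hedged sketch of that probabilistic idea; timeTogether is an assumed helper that sums the pairwise play time within one group of four:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

List<Integer> golfers = new ArrayList<Integer>();
for (int g = 0; g < 20; g++) {
    golfers.add(g);
}
List<List<Integer>> best = null;
int bestScore = Integer.MAX_VALUE;
for (int trial = 0; trial < 100000; trial++) {
    Collections.shuffle(golfers);                  // a random legal grouping
    List<List<Integer>> grouping = new ArrayList<List<Integer>>();
    int score = 0;
    for (int g = 0; g < 20; g += 4) {              // slice into five groups of four
        List<Integer> group = new ArrayList<Integer>(golfers.subList(g, g + 4));
        grouping.add(group);
        score += timeTogether(group);              // assumed scoring helper
    }
    if (score < bestScore) {                       // keep the best grouping seen so far
        bestScore = score;
        best = grouping;
    }
}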
I am working to create a crawler, a Java web app, in which users can define crawl jobs that extract and store information from specific websites.
As part of this work, there is a 'loop' construct: it has a list portion, which is evaluated initially and typically represents a list of values. After that comes the loop body, which is executed once for each item in the list (from the list portion mentioned previously).
Note that there can be a loop construct within another loop construct, and so on.
The problem is, sometimes one list can contain millions of rows of data - and the body is to be executed for each row in this list. The body has a start index value, upper bound for the index, and is incremented by one.
What I want to do is, for a single level loop, initially calculate the list value and store it in database. After that, instead of executing the body in one go, split it up into different sections so that different sections of the list are processed in parallel.
However, how do I split up a job for an n-level loop (i.e., one loop within another, and so on)?
Is there some recommended way of doing such processing? Any tutorial or guide you could point me to would be very helpful.
I suggest packing the processing logic for one element of the list into a Runnable or Callable, and then passing them to an Executor for execution. This will run the tasks in parallel on different worker threads. Of course, how "parallel" this really is depends on how many cores your machine has.
If each element of the list can be processed completely independently of all the others, then this would be the way to go for me, instead of messing around with threads myself and dividing the list into sublists, etc.
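A minimal sketch of that approach; processElement and list are assumed names, with processElement holding the loop body for one list element:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
for (final Object element : list) {        // list is the evaluated list portion
    executor.submit(new Runnable() {
        @Override
        public void run() {
            processElement(element);       // assumed: the loop body for one element
        }
    });
}
executor.shutdown();                       // accept no new tasks; queued work still completes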
From your description, I gather that you are fetching the source code of a website and scraping data from it.
You can use XPath and regular expressions for this kind of task, as they suit it best. Use JSOUP for that; it helps a lot.
As far as parallelization is concerned, you can use JSOUP's select, getElementById, and getElementsByClass methods (it's open source). Then simply use a loop along these lines, which processes two elements per pass:
for (int i = 0; i < length; i += 2) {
    // fetch element i
    if (i + 1 >= length) {
        break;
    }
    // fetch element i + 1
}
hope this helps: http://jsoup.org
This sounds like a great candidate for the Java 7 fork/join framework.
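To make that concrete, here is a hedged sketch; the threshold and the processElement helper are assumptions. A RecursiveAction splits the index range in half until it is small enough to process directly:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class LoopTask extends RecursiveAction {
    private static final int THRESHOLD = 1000; // assumed cutoff for direct work
    private final int lo, hi;

    LoopTask(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            for (int i = lo; i < hi; i++) {
                processElement(i);             // assumed: the loop body for element i
            }
        } else {
            int mid = (lo + hi) >>> 1;         // split the range and recurse in parallel
            invokeAll(new LoopTask(lo, mid), new LoopTask(mid, hi));
        }
    }

    private void processElement(int i) {
        // ...
    }
}

// usage: new ForkJoinPool().invoke(new LoopTask(0, listSize));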
Let's say you create 3 threads: T1, T2, T3, and the following is the looping construct, for example:
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
        for (int k = 0; k < 100; k++) {
            // do some processing.
        }
    }
}
Modify the increment part to i += numberOfThreads. In this case it will be i += 3.
Thus, the initial value of i will vary for each thread:
For T1: i = 0;
For T2: i = 1;
For T3: i = 2;
Similarly, the loop limits can be set.
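A hedged sketch of that striding scheme; only the outer index is striped across the threads:

final int numThreads = 3;
for (int t = 0; t < numThreads; t++) {
    final int start = t;                       // T1 starts at 0, T2 at 1, T3 at 2
    new Thread(new Runnable() {
        @Override
        public void run() {
            for (int i = start; i < 100; i += numThreads) {
                for (int j = 0; j < 100; j++) {
                    for (int k = 0; k < 100; k++) {
                        // do some processing.
                    }
                }
            }
        }
    }).start();
}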
I'm not sure if I'm using the right nomenclature, so I'll try to make my question as specific as possible. That said, I imagine this problem comes up all the time, and there are probably several different ways to deal with it.
Let's say I have an array (vector) called main of 1000 random years between 1980 and 2000, and that I want to make 20 separate arrays (vectors) out of it. These arrays would be named array1980, array1981, etc., would also have length 1000, but would contain 1s at the indices where the corresponding element of main equals the year in the name and 0s elsewhere. In other words:
for (int i = 0; i < 1000; i++) {
    if (main[i] == 1980) {
        array1980[i] = 1;
    } else {
        array1980[i] = 0;
    }
}
Of course, I don't want to have to write twenty of these, so it'd be good if I could create new variable names inside a loop. The problem is that you can't generally assign variable names to expressions with operators, e.g.,
String("array"+ j)=... # returns an error
I'm currently using Matlab the most, but I can also do a little in Java, c++ and python, and I'm trying to get an idea for how people go about solving this problem in general. Ideally, I'd like to be able to manipulate the individual variables (or sub-arrays) in some way that the year remains in the variable name (or array index) to reduce the chance for error and to make things easier to deal with in general.
I'd appreciate any help.
boolean[][] flags = new boolean[1000][21]; // one slot per year, 1980 through 2000
for (int i = 0; i < 1000; i++) {
    flags[i][main[i] - 1980] = true;       // main[i] is the year at index i
}
In many cases a map will be a good solution, but here you could use a 2-dimensional array of booleans, since the range of second indices is known beforehand (0-20), continuous, and enumerable.
Some languages initialize an array of booleans to false for every element, so you would just need to set to true the slot that main[i] points to.
Since main[i] returns numbers from 1980 to 2000, main[i] - 1980 will return 1980 - 1980 = 0 to 2000 - 1980 = 20. To recover the year from the second index, you have to add 1980 back, of course.
The general solution to this is to not create variables with dynamic names, but to instead create a map. Exactly how that's done will vary by language.
For Java, it's worth looking at the map section of the Sun collections tutorial for a start.
Don Roby's answer is correct, but I would like to complete it.
You can use maps for this purpose, and it would look something like this:
Map<Integer, ArrayList<Integer>> yearMap = new HashMap<Integer, ArrayList<Integer>>();
yearMap.put(1980, new ArrayList<Integer>());
for (int i = 0; i < 1000; i++) {
    yearMap.get(1980).add(0);
}
yearMap.get(1980).set(999, 1);
System.out.println(yearMap.get(1980).get(999));
But there is probably a better way to solve the problem that you have. You should not ask how to use X to solve Y, but how to solve Y.
So, what is it, that you are trying to solve?
I have a 10x10 array in Java, some items in which are not used, and I need to traverse all the elements as part of a method. What would be better to do:
Go through all elements with two for loops and check for null to avoid errors, e.g.
for (int y = 0; y < 10; y++) {
    for (int x = 0; x < 10; x++) {
        if (array[x][y] != null) {
            // perform task here
        }
    }
}
Or would it be better to keep a list of all the used addresses, say an ArrayList of points?
Something different I haven't mentioned.
I look forward to any answers :)
Any solution you try needs to be tested in controlled conditions resembling as much as possible the production conditions. Because of the nature of Java, you need to exercise your code a bit to get reliable performance stats, but I'm sure you know that already.
This said, there are several things you may try, which I've used to optimize my Java code with success (but not on the Android JVM):
for (int y = 0; y < 10; y++) {
    for (int x = 0; x < 10; x++) {
        if (array[x][y] != null) {
            // perform task here
        }
    }
}
should in any case be reworked into
for (int x = 0; x < 10; x++) {
    for (int y = 0; y < 10; y++) {
        if (array[x][y] != null) {
            // perform task here
        }
    }
}
Often you will get a performance improvement from caching the row reference. Let us assume the array is of the type Foo[][]:
for (int x = 0; x < 10; x++) {
    final Foo[] row = array[x];
    for (int y = 0; y < 10; y++) {
        if (row[y] != null) {
            // perform task here
        }
    }
}
Using final with variables was supposed to help the JVM optimize the code, but I think modern JIT Java compilers can in many cases figure out on their own whether a variable is changed in the code or not. On the other hand, sometimes this may be more efficient, although it definitely takes us into the realm of micro-optimizations:
Foo[] row;
for (int x = 0; x < 10; x++) {
    row = array[x];
    for (int y = 0; y < 10; y++) {
        if (row[y] != null) {
            // perform task here
        }
    }
}
If you don't need to know the element's indices in order to perform the task on it, you can write this as
for (final Foo[] row : array) {
    for (final Foo elem : row) {
        if (elem != null) {
            // perform task here
        }
    }
}
Another thing you may try is to flatten the array and store the elements in a one-dimensional Foo[] array, ensuring maximum locality of reference. You have no inner loop to worry about, but you need to do some index arithmetic when referencing particular array elements (as opposed to looping over the whole array). Depending on how often you do it, it may or may not be beneficial.
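A minimal sketch of the flattened layout, keeping the Foo element type used above (the class and method names are mine):

class Grid {
    static final int WIDTH = 10, HEIGHT = 10;
    final Foo[] flat = new Foo[WIDTH * HEIGHT];

    Foo get(int x, int y) {
        return flat[x * HEIGHT + y]; // row-major index arithmetic replaces array[x][y]
    }

    void forEachElement() {
        for (Foo elem : flat) {      // one loop covers the whole grid
            if (elem != null) {
                // perform task here
            }
        }
    }
}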
Since most of the elements will be non-null, keeping them in a sparse structure is not beneficial for you, as you would lose locality of reference.
Another problem is the null test. The null test itself doesn't cost much, but the conditional statement following it does, as you get a branch in the code and lose time on wrong branch predictions. What you can do is to use a "null object", on which the task can still be performed but amounts to a no-op or something equally benign. Depending on the task you want to perform, it may or may not work for you.
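A minimal sketch of what such a null object might look like if Foo were a simple value holder (an assumption for illustration); empty cells share one instance whose contribution is harmless, so the loop needs no branch:

class Foo {
    static final Foo NULL = new Foo(0.0); // shared placeholder for empty cells
    private final double value;

    Foo(double value) {
        this.value = value;
    }

    double value() {
        return value;                     // NULL contributes 0 to any sum
    }
}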
Hope this helps.
You're better off using a List than an array, especially since you may not use the whole set of data. This has several advantages:
You're not checking for nulls and won't accidentally try to use a null object.
It's more memory efficient, in that you're not allocating memory which may not be used.
For a hundred elements, it's probably not worth using any of the classic sparse array implementations. However, you don't say how sparse your array is, so profile it and see how much time you spend skipping null items compared to whatever processing you're doing.
(As Tom Hawtin - tackline mentions) you should, when using an array of arrays, try to loop over members of each array rather than looping over the same index of different arrays. Not all algorithms allow you to do that, though.
for ( int x = 0; x < 10; ++x ) {
    for ( int y = 0; y < 10; ++y ) {
        if ( array[x][y] != null ) {
            // perform task here
        }
    }
}
or
for ( Foo[] row : array ) {
    for ( Foo item : row ) {
        if ( item != null ) {
            // perform task here
        }
    }
}
You may also find it better to use a null object rather than testing for null, depending on the complexity of the operation you're performing. Don't use the polymorphic version of the pattern (a polymorphic dispatch will cost at least as much as a test and branch), but if you were summing properties, having an object with a zero is probably faster on many CPUs.
double sum = 0;
for ( Foo[] row : array ) {
    for ( Foo item : row ) {
        sum += item.value();
    }
}
As to what applies on Android, I'm not sure; again, you need to test and profile for any optimisation.
Holding an ArrayList of points would be over-engineering the problem. You have a multi-dimensional array; the best way to iterate over it is with two nested for loops. Unless you can change the representation of the data, that's roughly as efficient as it gets.
Just make sure you go in row order, not column order.
Depends on how sparse/dense your matrix is.
If it is sparse, you're better off storing a list of points; if it is dense, go with the 2D array. If in between, you can use a hybrid solution storing a list of sub-matrices.
This implementation detail should be hidden within a class anyway, so your code can also anytime convert between any of these representations.
I would discourage you from settling on any of these solutions without profiling with your real application.
I agree an array with a null test is the best approach unless you expect sparsely populated arrays.
Reasons for this:
1- More memory efficient for dense arrays (a list needs to store the index)
2- More computationally efficient for dense arrays (you need only compare the value you just retrieved to null, instead of having to also get the index from memory).
Also, a small suggestion: in Java especially, you are often better off faking a multi-dimensional array with a 1D array where possible (for square/rectangular arrays in 2D). Bounds checking then happens only once per iteration, instead of twice. I'm not sure if this still applies in the Android VMs, but it has traditionally been an issue. Regardless, you can ignore it if the loop is not a bottleneck.