(What's possible in Scala should be possible in Java, right? But I would take Scala suggestions as well)
I am not trying to iterate over an RDD; instead, I need to build one with n elements from a random/simulator class of a type called DropResult. DropResult can't be cast into anything else.
I thought the Spark "find Pi" example had me on the right track, but no luck. Here's what I am trying:
On a one-time basis a DropResult is made like this:
// make a single DropResult from pld (PipeLinkageData)
DropResult dropResultSeed = pld.doDrop();
I am trying something like this:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRangeList(1, getSimCount())).foreach(pld.doDrop());
I just need to run pld.doDrop() about 10^6 times on the cluster and put the results in a Spark RDD for the next operation, also on the cluster. I can't figure out what kind of function to use on "parallelize" to make this work.
makeRangeList:
private List<Integer> makeRangeList(int lower, int upper) {
    List<Integer> range = IntStream.range(lower, upper).boxed().collect(Collectors.toList());
    return range;
}
(FWIW I was trying to use the Pi example from http://spark.apache.org/examples.html as a model of how to do a for loop to create a JavaRDD)
int count = spark.parallelize(makeRange(1, NUM_SAMPLES)).filter(new Function<Integer, Boolean>() {
    public Boolean call(Integer i) {
        double x = Math.random();
        double y = Math.random();
        return x*x + y*y < 1;
    }
}).count();
System.out.println("Pi is roughly " + 4.0 * count / NUM_SAMPLES);
Yeah, it seems like you should be able to do this pretty easily. It sounds like you just need to parallelize an RDD of 10^6 integers so that you can map them to 10^6 DropResult objects in an RDD.
If that's the case, I don't think you need to explicitly create a list as above. It seems like you should just be able to use makeRange() the way the Spark Pi example does, like this:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1, getSimCount()))
    .map(new Function<Integer, DropResult>() {
        public DropResult call(Integer i) {
            return pld.doDrop();
        }
    });
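For what it's worth, on Java 8 the anonymous Function collapses to a lambda, `.map(i -> pld.doDrop())`, provided `pld` is serializable and effectively final. The shape of the transformation (map each index in a range to a freshly generated object) can be sketched without a cluster using plain streams; `DropResult` and `doDrop` below are minimal stand-ins for the question's classes, not the real ones:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DropSim {
    // Stand-in for the question's DropResult; the real class is opaque to us.
    static class DropResult {
        final int id;
        DropResult(int id) { this.id = id; }
    }

    // Stand-in for pld.doDrop(): produces one result per invocation.
    static DropResult doDrop(int i) { return new DropResult(i); }

    // Same shape as parallelize(makeRange(...)).map(i -> pld.doDrop()):
    // turn a range of n indices into n generated objects.
    static List<DropResult> simulate(int n) {
        return IntStream.range(0, n)
                .mapToObj(DropSim::doDrop)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(simulate(5).size()); // 5
    }
}
```

On the cluster, Spark's `map` plays the role of `mapToObj` here, distributing the 10^6 generations across executors.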
Here is my Java code:
static Map<BigInteger, Integer> cache = new ConcurrentHashMap<>();

static Integer minFinder(BigInteger num) {
    if (num.equals(BigInteger.ONE)) {
        return 0;
    }
    if (num.mod(BigInteger.valueOf(2)).equals(BigInteger.ZERO)) {
        // focus on what's happening inside this block, since with the given inputs it won't reach the last return
        return 1 + cache.computeIfAbsent(num.divide(BigInteger.valueOf(2)),
                n -> minFinder(n));
    }
    return 1 + Math.min(cache.computeIfAbsent(num.subtract(BigInteger.ONE), n -> minFinder(n)),
            cache.computeIfAbsent(num.add(BigInteger.ONE), n -> minFinder(n)));
}
I tried to memoize a function that returns the minimum number of actions (divide by 2, subtract one, or add one) needed to reach 1.
The problem I'm facing is when I call it with smaller inputs such as:
minFinder(new BigInteger("32"))
it works, but with bigger values like:
minFinder(new BigInteger("64"))
it throws an IllegalStateException: "Recursive update".
Is there any way to increase the recursion size to prevent this exception, or any other way to solve this?
From the API docs of Map.computeIfAbsent():
The mapping function should not modify this map during computation.
The API docs of ConcurrentHashMap.computeIfAbsent() make that stronger:
The mapping function must not modify this map during computation.
(Emphasis added)
You are violating that by using your minFinder() method as the mapping function. That it seems nevertheless to work for certain inputs is irrelevant. You need to find a different way to achieve what you're after.
Is there any way to increase the recursion size to prevent this exception, or any other way to solve this?
You could avoid computeIfAbsent() and instead do the same thing the old-school way:
BigInteger halfNum = num.divide(BigInteger.valueOf(2));
Integer cachedValue = cache.get(halfNum);
if (cachedValue == null) {
    cachedValue = minFinder(halfNum);
    cache.put(halfNum, cachedValue);
}
return 1 + cachedValue;
But that's not going to be sufficient if the computation loops. You could perhaps detect that by putting a sentinel value into the map before you recurse, so that you can recognize loops.
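Putting both ideas together, a single-threaded sketch might look like the following. The sentinel value and the plain HashMap are assumptions of this sketch, not part of the original code; the map is marked "in progress" before each recursion, so a re-entrant lookup of the same key is detected instead of silently corrupting the cache:

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

public class MinFinder {
    // Sentinel marking a value currently being computed. Real results are
    // always >= 0, so -1 can never collide with a legitimate answer.
    private static final Integer IN_PROGRESS = -1;
    private static final Map<BigInteger, Integer> cache = new HashMap<>();

    static int minFinder(BigInteger num) {
        if (num.equals(BigInteger.ONE)) return 0;
        Integer cached = cache.get(num);
        if (cached != null) {
            if (cached.equals(IN_PROGRESS)) {
                throw new IllegalStateException("cycle detected at " + num);
            }
            return cached;
        }
        cache.put(num, IN_PROGRESS); // mark before recursing
        int result;
        if (num.testBit(0)) { // odd: try num - 1 and num + 1
            result = 1 + Math.min(minFinder(num.subtract(BigInteger.ONE)),
                                  minFinder(num.add(BigInteger.ONE)));
        } else {              // even: halve
            result = 1 + minFinder(num.shiftRight(1));
        }
        cache.put(num, result); // replace the sentinel with the real answer
        return result;
    }

    public static void main(String[] args) {
        System.out.println(minFinder(new BigInteger("64"))); // 6
    }
}
```

Because the cache is touched only with get/put, no computeIfAbsent contract is violated, and inputs like 64 no longer throw.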
Background
I have a small function that can return Either<String, Float>. If it succeeds it returns a float, otherwise an error string.
My objective is to perform a series of operations in a pipeline, and to achieve railway oriented programming using Either.
Code
import java.util.function.Function;
import io.vavr.control.Either;
@Test
public void run() {
Function<Float, Either<String, Float>> either_double = num -> {
    if (num == 4.0f)
        return Either.left("I don't like this number");
    return Either.right(num * 2);
};
Function<Float, Float> incr = x -> x + 1.0f;
Float actual =
    Either.right(2f)
        .map(incr)
        .map(either_double)
        .get();

Float expected = 6.0f;
assertEquals(expected, actual);
}
This code does a series of simple operations. First I create a right Either with value 2, then I increment it, and I finish by doubling it. The result of these operations is 6.
Problem
The result of the mathematical operations is 6.0f, but that's not what I get. Instead I get Right(6.0f).
This is an issue that prevents the code from compiling. I have a value boxed inside the Either Monad, but after checking their API for Either I didn't find a way to unbox it and get the value as is.
I thought about using getOrElseGet but even that method returns a Right.
Question
How do I access the real value stored inside the Either Monad?
Use flatMap(either_double) instead of map(either_double).
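The reason: `map` with a function that itself returns an Either produces a nested `Either<String, Either<String, Float>>`, while `flatMap` flattens that back to a single `Either<String, Float>`. The same map-vs-flatMap distinction exists on the JDK's own `Optional`, which makes for a dependency-free sketch of the pattern (Optional merely stands in for vavr's Either here):

```java
import java.util.Optional;
import java.util.function.Function;

public class FlatMapDemo {
    static Float pipeline(float start) {
        // A function that itself returns the wrapper type, like either_double.
        Function<Float, Optional<Float>> maybeDouble =
                num -> num == 4.0f ? Optional.<Float>empty() : Optional.of(num * 2);
        Function<Float, Float> incr = x -> x + 1.0f;

        // map(maybeDouble) would yield Optional<Optional<Float>>;
        // flatMap flattens it back to Optional<Float>.
        return Optional.of(start)
                .map(incr)            // plain value -> plain value: map
                .flatMap(maybeDouble) // value -> wrapper: flatMap
                .orElse(Float.NaN);   // unwrap with a safe default
    }

    public static void main(String[] args) {
        System.out.println(pipeline(2f)); // 6.0
    }
}
```

With vavr itself, once the pipeline uses `flatMap`, the Right value can be unwrapped with `get()`, or more defensively with `getOrElse`/`fold`, which also give you a handle on the Left case.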
So I have an array
static final int N = maxNofActiveThreads;
static final int[] arr = new int[N * nofEntries];
Where the N threads write to mutually exclusive regions of the array.
I should now like to add a monitoring thread that will periodically collect the results for decision-making by simply summing up all the threads' tables.
I.e. in pseudo-code
int[] snapshot = arr[0 : nofEntries] + arr[nofEntries : 2*nofEntries] + ... + arr[(N-1) * nofEntries : N*nofEntries]
The obvious choice would be to simply create
int[] snapshot = new int[nofEntries];
System.arraycopy(arr, 0, snapshot, 0, nofEntries);
and then walk through the rest of arr, adding one value at a time.
Is there a smarter/more efficient way?
Oh, and we don't care if we miss an update every so often, it will eventually show up on the next pass and that's fine. No need for any synchronisation.
(I should also mention that the project I'm working on requires me to use Java 7.)
The best I can imagine is to use the Arrays class's static method parallelSetAll (note that it was added in Java 8, so it won't help if you're truly stuck on Java 7). As it tries to parallelize the operation, it may be more efficient. The Javadoc for another parallel method says:
... computation is usually more efficient than sequential loops for large arrays.
So you could use:
private int tsum(int index) {
    int val = 0;
    for (int t = 0; t < N; t++) {
        val += arr[index + t * nofEntries];
    }
    return val;
}
and:
IntUnaryOperator generator = x -> tsum(x);
int[] snapshot = new int[nofEntries];
Arrays.parallelSetAll(snapshot, generator);
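Since the question is pinned to Java 7, where parallelSetAll and lambdas aren't available, the same per-index sum can be written as a plain sequential loop. This sketch assumes, as in the question, that arr is laid out as N consecutive regions of nofEntries ints each:

```java
public class SnapshotDemo {
    // Sum the per-thread regions of arr entry-by-entry into a fresh array:
    // snap[i] = arr[i] + arr[nofEntries + i] + ... + arr[(n-1)*nofEntries + i]
    static int[] snapshot(int[] arr, int nofThreads, int nofEntries) {
        int[] snap = new int[nofEntries];
        for (int t = 0; t < nofThreads; t++) {
            int base = t * nofEntries;
            for (int i = 0; i < nofEntries; i++) {
                snap[i] += arr[base + i];
            }
        }
        return snap;
    }

    public static void main(String[] args) {
        // Two threads' regions of three entries each: [1,2,3] and [10,20,30].
        int[] arr = {1, 2, 3, 10, 20, 30};
        int[] snap = snapshot(arr, 2, 3);
        System.out.println(java.util.Arrays.toString(snap)); // [11, 22, 33]
    }
}
```

Iterating region-by-region (outer loop over t) also keeps the reads sequential in memory, which tends to be cache-friendly; given the stated tolerance for occasionally stale values, no synchronisation is needed.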
I solved this problem using a graph, but unfortunately I'm now stuck with having to use a 2d array, and I have questions about the best way to go about this:
public class Data {
    int[][] structure;

    public Data(int x, int y) {
        structure = new int[x][y];
    }

    public <<TBD>> generateRandom() {
        // This is what my question is about
    }
}
I have a controller/event handler class:
public class Handler implements EventHandler {
    @Override
    public void onEvent(Event<T> e) {
        this.dataInstance.generateRandom();
        // ... other stuff
    }
}
Here is what each method will do:
Data.generateRandom() will generate a random value at a random location in the 2d int array if there exists a value in the structure that is not initialized or a value exists that is equal to zero
If there is no available spot in the structure, the structure's state is final (i.e. in the literal sense, not the Java declaration)
This is what I'm wondering:
What is the most efficient way to check if the board is full? Using a graph, I was able to check whether the board was full in O(1) and get an available yet random location in worst-case O(n^2 - 1), best case O(1). Obviously with an array, improving on n^2 is tough, so I'm now just focusing on execution speed and LOC. Would the fastest way now be to check the entire 2d array using streams, like:
Arrays.stream(board).flatMapToInt(tile -> tile.getX()).map(x -> x > 0).count() > board.getWidth() * board.getHeight()
(1) You can definitely use a parallel stream to safely perform read only operations on the array. You can also do an anyMatch call since you are only caring (for the isFull check) if there exists any one space that hasn't been initialized. That could look like this:
Arrays.stream(structure)
    .parallel()
    .flatMapToInt(Arrays::stream)
    .anyMatch(i -> i == 0)
However, that is still an n^2 solution. What you could do, though, is keep a counter of the number of spaces possible that you decrement when you initialize a space for the first time. Then the isFull check would always be constant time (you're just comparing an int to 0).
public class Data {
    private int numUninitialized;
    private int[][] structure;

    public Data(int x, int y) {
        if (x <= 0 || y <= 0) {
            throw new IllegalArgumentException("You can't create a Data object with an argument that isn't a positive integer.");
        }
        structure = new int[x][y];
        numUninitialized = x * y; // assign the field; redeclaring it here would leave the field at 0
    }

    public void generateRandom() {
        if (isFull()) {
            // do whatever you want when the array is full
        } else {
            // Calculate the random space you want to set a value for
            int x = ThreadLocalRandom.current().nextInt(structure.length);
            int y = ThreadLocalRandom.current().nextInt(structure[0].length);
            if (structure[x][y] == 0) {
                // A new, uninitialized space
                numUninitialized--;
            }
            // Populate the space with a random nonzero value (writing 0 would
            // make an initialized space look uninitialized again)
            int value;
            do {
                value = ThreadLocalRandom.current().nextInt(Integer.MIN_VALUE, Integer.MAX_VALUE);
            } while (value == 0);
            structure[x][y] = value;
        }
    }

    public boolean isFull() {
        return 0 == numUninitialized;
    }
}
Now, this is with my understanding that each time you call generateRandom you take a random space (including ones already initialized). If you are supposed to ONLY choose a random uninitialized space each time it's called, then you'd do best to hold an auxiliary data structure of all the possible grid locations so that you can easily find the next random open space and to tell if the structure is full.
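That auxiliary structure could be as simple as a pre-shuffled list of coordinates that you pop from. This is a hypothetical sketch, not part of the original answer; each call yields a distinct random open space in O(1), and fullness is just an emptiness check:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OpenSpaces {
    // Every still-uninitialized (x, y) coordinate, in random order.
    private final List<int[]> open = new ArrayList<>();

    OpenSpaces(int width, int height) {
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                open.add(new int[] {x, y});
        Collections.shuffle(open); // randomize once up front
    }

    boolean isFull() { return open.isEmpty(); }

    // Returns and removes a random uninitialized coordinate, or null if full.
    // Removing from the end of an ArrayList is O(1).
    int[] nextOpen() {
        return isFull() ? null : open.remove(open.size() - 1);
    }

    public static void main(String[] args) {
        OpenSpaces s = new OpenSpaces(2, 2);
        while (!s.isFull()) {
            int[] xy = s.nextOpen();
            System.out.println(xy[0] + "," + xy[1]);
        }
    }
}
```

The trade-off is O(n^2) extra memory for the coordinate list, in exchange for constant-time picks with no rejection sampling.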
(2) What notification method is appropriate for letting other classes know the array is now immutable? It's kind of hard to say as it depends on the use case and the architecture of the rest of the system this is being used in. If this is an MVC application with a heavy use of notifications between the data model and a controller, then an observer/observable pattern makes a lot of sense. But if your application doesn't use that anywhere else, then perhaps just having the classes that care check the isFull method would make more sense.
(3) Java is efficient at creating and freeing short lived objects. However, since the arrays can be quite large, allocating a new array object (and copying the data) each time you alter the array seems inefficient at best. Java can do some functional styles of programming (especially with the inclusion of lambdas in Java 8), but using only immutable objects and a purely functional style is kind of like forcing a square peg into Java's round hole.
I have a function that processes vectors. The size of the input vector can be anything up to a few million elements. The problem is that the function can only process vectors no bigger than 100k elements without problems.
I would like to call the function in smaller parts if the vector has too many elements:
Vector<Stuff> process(Vector<Stuff> input) {
    Vector<Stuff> output;
    while (true) {
        if (input.size() > 50000) {
            output.addAll(doStuff(input.pop_front_50k_first_ones_as_subvector()));
        } else {
            output.addAll(doStuff(input));
            break;
        }
    }
    return output;
}
How should I do this?
Not sure if a Vector with millions of elements is a good idea, but Vector implements List, and thus there is subList which provides a lightweight (non-copy) view of a section of the Vector.
You may have to update your code to work with the interface List instead of only the specific implementation Vector, though (because the sublist returned is not a Vector, and it is just good practice in general).
You probably want to rewrite your doStuff method to take a List rather than a Vector argument,
public Collection<Output> doStuff(List<Stuff> v) {
    // calculation
}
(and notice that Vector<T> is a List<T>)
and then change your process method to something like
Vector<Stuff> process(Vector<Stuff> input) {
    Vector<Stuff> output = new Vector<>();
    int startIdx = 0;
    while (startIdx < input.size()) {
        int endIdx = Math.min(startIdx + 50000, input.size());
        output.addAll(doStuff(input.subList(startIdx, endIdx)));
        startIdx = endIdx;
    }
    return output;
}
This should work as long as the input Vector isn't being concurrently updated while the process method is running.
If you can't change the signature of doStuff, you're probably going to need to wrap a new Vector around the result of subList,
output.addAll(doStuff(new Vector<Stuff>(input.subList(startIdx, endIdx))));
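To illustrate the chunking loop end to end, here is a self-contained version where doStuff just echoes its input and the 50000 chunk size is shrunk to a parameter for the demo; the names mirror the answer's code but the class itself is only a sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
    // Stand-in for the real doStuff: here it simply returns its chunk,
    // so the output should equal the input after reassembly.
    static List<Integer> doStuff(List<Integer> chunk) { return chunk; }

    // Process input in fixed-size windows via subList views (no copying).
    static List<Integer> process(List<Integer> input, int chunkSize) {
        List<Integer> output = new ArrayList<>();
        int startIdx = 0;
        while (startIdx < input.size()) {
            int endIdx = Math.min(startIdx + chunkSize, input.size());
            output.addAll(doStuff(input.subList(startIdx, endIdx)));
            startIdx = endIdx;
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> in = new ArrayList<>();
        for (int i = 0; i < 10; i++) in.add(i);
        System.out.println(process(in, 3).size()); // 10
    }
}
```

Because subList is a view, each doStuff call sees a window into the original backing array; only addAll copies element references into the output, which is what the method is supposed to produce anyway.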