I am migrating some map-reduce code into Spark, and I am having problems constructing an Iterable to return from the function.
In the MR code, I had a reduce function that grouped by key and then iterated the values, writing each one via multipleOutputs (the multiple-outputs part is unimportant). Simplified, it looked like this:
reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml : values) {
        multipleOutputs.write(key, xml, directory);
    }
}
However, in Spark I have translated a map and this reduce into a sequence of:
mapToPair -> groupByKey -> flatMap
as recommended... in some book.
mapToPair basically adds a Key via functionMap, which creates a Key for each record based on some of its values. Sometimes a key may have very high cardinality.
JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() {
    public Tuple2<Key, String> call(String value) {
        //...
        return functionMap.call(value);
    }
});
RDD.groupByKey() is then applied to rddPaired to get the RDD that feeds the flatMap function:
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();
Once grouped, a flatMap call does the reduce. Here, operation is a transformation:
public Iterable<String> call(Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) {
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}
It works fine... with keys that don't have too many records. In fact, it breaks with an OutOfMemoryError when a key with a lot of values enters the "else" branch of the reduce.
Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.
It is clear that keeping all of the grouped values in the "out" list won't scale if a key has millions of records, because they are all held in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and gets none; it's not a very memory-expensive operation, though).
Is there any way to avoid this so that it scales? Either by replicating the behaviour with other operations to reach the same output in a more scalable way, or by handing Spark the values to merge (just as I used to do with MR)...
It's inefficient to check the condition inside the flatMap operation. You should check it outside, creating two distinct RDDs, and deal with them separately.
rddPaired.cache();
// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on group of all values with the same key and return the result
rddGrouped.mapValues(processGroupedValuesFunction);
// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
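For illustration, a sketch of what those functions could look like with Java 8 lambdas, reusing the question's Key, Grouper, operation(...) and someConditionOnKey(...) as placeholders (adjust to your actual types and logic):
rddPaired.cache();

// keys whose values must be processed as a whole group
JavaPairRDD<Key, String> groupedResults = rddPaired
        .filter(t -> someConditionOnKey(t._1()))
        .groupByKey()
        .mapValues(values -> {
            Grouper grouper = new Grouper();
            for (String xml : values) {
                grouper.add(xml);
            }
            return operation(grouper);   // one result per key
        });

// keys whose values can be processed one by one -- no grouping, nothing accumulated in memory
JavaPairRDD<Key, String> ungroupedResults = rddPaired
        .filter(t -> !someConditionOnKey(t._1()))
        .mapValues(xml -> operation(xml));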
Related
I have a class with a collection of Seed elements. One of Seed's methods returns Optional<Pair<Boolean, String>>.
I'm trying to loop over all seeds, find whether any boolean value is true, and at the same time collect all the String values into a set. My input is in the form Optional<Pair<Boolean, String>>; the output should be Optional<Signal>, where Signal looks like:
class Signal {
    public boolean exposure;
    public Set<String> alarms;
    // constructor and getters (can add anything to this class, it's just a bag)
}
This is what I currently have that works:
// Seed::hadExposure yields Optional<Pair<Boolean, String>> where Pair has key/value or left/right
public Optional<Signal> withExposure() {
    if (seeds.stream().map(Seed::hadExposure).flatMap(Optional::stream).findAny().isEmpty()) {
        return Optional.empty();
    }
    final var exposure = seeds.stream()
            .map(Seed::hadExposure)
            .flatMap(Optional::stream)
            .anyMatch(Pair::getLeft);
    final var alarms = seeds.stream()
            .map(Seed::hadExposure)
            .flatMap(Optional::stream)
            .map(Pair::getRight)
            .filter(Objects::nonNull)
            .collect(Collectors.toSet());
    return Optional.of(new Signal(exposure, alarms));
}
Now I have time to make it better, because Seed::hadExposure could become an expensive call, so I was trying to see if I could do all of this in only one pass. I've tried (following some suggestions from previous questions) with reduce and with collectors (Collectors.collectingAndThen, Collectors.partitioningBy, etc.), but nothing has worked so far.
It's possible to do this in a single stream() expression, using map to convert each non-empty exposure to a Signal and then reduce to combine the Signals:
Signal signal = exposures.stream()
        .map(exposure ->
                new Signal(
                        exposure.getLeft(),
                        exposure.getRight() == null
                                ? Collections.emptySet()
                                : Collections.singleton(exposure.getRight())))
        .reduce(
                new Signal(false, new HashSet<>()),
                (leftSig, rightSig) -> {
                    HashSet<String> alarms = new HashSet<>();
                    alarms.addAll(leftSig.alarms);
                    alarms.addAll(rightSig.alarms);
                    return new Signal(
                            leftSig.exposure || rightSig.exposure, alarms);
                });
However, if you have a lot of alarms it would be expensive because it creates a new Set and adds the new alarms to the accumulated alarms for each exposure in the input.
In a language that was designed from the ground up to support functional programming, like Scala or Haskell, you'd have a Set data type that lets you efficiently create a new set identical to an existing set but with an added element, so there'd be no efficiency worries:
filteredSeeds.foldLeft((false, Set[String]())) { (result, exposure) =>
(result._1 || exposure.getLeft, result._2 + exposure.getRight)
}
But Java doesn't come with anything like that out of the box.
You could create just a single Set for the result and mutate it in your stream's reduce expression, but some would regard that as poor style because you'd be mixing a functional paradigm (map/reduce over a stream) with a procedural one (mutating a set).
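For illustration, a sketch of that mutable-reduction idea; it uses the three-argument collect (which is intended for mutable containers) rather than reduce, and it still leaves the "nothing matched, return Optional.empty()" check to be handled separately:
// Single pass, one mutable Signal accumulator; Seed, Pair and Signal are the types from the question.
Signal signal = seeds.stream()
        .map(Seed::hadExposure)
        .flatMap(Optional::stream)
        .collect(
                () -> new Signal(false, new HashSet<>()),   // supplier: one mutable accumulator
                (sig, pair) -> {                            // accumulator: fold one exposure in
                    sig.exposure |= pair.getLeft();
                    if (pair.getRight() != null) {
                        sig.alarms.add(pair.getRight());
                    }
                },
                (left, right) -> {                          // combiner (used for parallel streams)
                    left.exposure |= right.exposure;
                    left.alarms.addAll(right.alarms);
                });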
Personally, in Java, I'd just ditch the functional approach and use a for loop in this case. It'll be less code, more efficient, and IMO clearer.
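A single-pass loop over the seeds could look roughly like this, using the same Seed, Pair and Signal types as above:
// hadExposure() is called exactly once per seed
public Optional<Signal> withExposure() {
    boolean any = false;
    boolean exposure = false;
    Set<String> alarms = new HashSet<>();
    for (Seed seed : seeds) {
        Optional<Pair<Boolean, String>> maybe = seed.hadExposure();
        if (maybe.isPresent()) {
            any = true;
            Pair<Boolean, String> pair = maybe.get();
            exposure |= pair.getLeft();
            if (pair.getRight() != null) {
                alarms.add(pair.getRight());
            }
        }
    }
    return any ? Optional.of(new Signal(exposure, alarms)) : Optional.empty();
}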
If you have enough space to store an intermediate result, you could do something like:
List<Pair<Boolean, String>> exposures =
seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.collect(Collectors.toList());
Then you'd only be calling the expensive Seed::hadExposure method once per item in the input list.
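The rest of the method can then run both reductions over that in-memory list instead of re-invoking hadExposure(), along these lines:
// Both passes now read the materialized list, so hadExposure() is not called again.
if (exposures.isEmpty()) {
    return Optional.empty();
}
boolean exposure = exposures.stream().anyMatch(Pair::getLeft);
Set<String> alarms = exposures.stream()
        .map(Pair::getRight)
        .filter(Objects::nonNull)
        .collect(Collectors.toSet());
return Optional.of(new Signal(exposure, alarms));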
I have a HashMap whose keys are List<Dto> and whose values are List<List<String>>:
Map<List<Dto>, List<List<String>>> mapData = new HashMap<>();
and an ArrayList<Dto>.
I want to iterate over this map, get the keys (key1, key2, etc.), get the value out of each, set it on the Dto object, and then add the Dto to a List. I am able to do this successfully with a foreach loop, but I am not able to get it done correctly using Java 8 streams, so I need some help with that. Here is the sample code:
List<DTO> dtoList = new ArrayList();
DTO dto = new DTO();
mapData.entrySet().stream().filter(e -> {
    if (e.getKey().equals("key1")) {
        dto.setKey1(e.getValue())
    }
    if (e.getKey().equals("key2")) {
        dto.setKey2(e.getValue())
    }
});
Here e.getValue() is a List<List<String>>, so the first thing is that I need to iterate over it to set the value.
The second is that I need to add dto to the ArrayList dtoList. How can I achieve this?
Here is a basic snippet that I tried without adding to a HashMap, where column has the keys, multiList has the values, and the Dto list is where I finally add the results:
for (List<Dto> dtoList : column) {
    if ("Key1".equalsIgnoreCase(column.getName())) {
        index = dtoList.indexOf(column);
    }
}
for (List<String> listoflists : multiList) {
    if (listoflists.contains(index)) {
        for (String s : listoflists) {
            dto.setKey1(s);
        }
        dtoList.add(dto);
    }
}
See https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html
Stream operations are divided into intermediate and terminal operations, and are combined to form stream pipelines. A stream pipeline consists of a source (such as a Collection, an array, a generator function, or an I/O channel); followed by zero or more intermediate operations such as Stream.filter or Stream.map; and a terminal operation such as Stream.forEach or Stream.reduce.
So in your snippet above, filter isn't really doing anything. To trigger it, you'd add a collect operation at the end. Notice that the filter lambda function needs to return a boolean for your code to compile in the first place.
mapData.entrySet().stream().filter(entry -> {
// do something here
return true;
}).collect(Collectors.toList());
Of course you don't need to abuse intermediate operations - or generate a bunch of new objects - for straightforward tasks; something like this should suffice:
mapData.entrySet().stream().forEach(entry -> {
// do something
});
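If the goal is really to pick values by key name and put them on one DTO, here is a minimal sketch. It assumes (unlike the declaration above) that the map is keyed by String such as "key1"/"key2", and that the DTO setters accept a List<List<String>>; adjust to your real types:
// Assumed shape: String keys instead of List<Dto> keys
Map<String, List<List<String>>> mapData = new HashMap<>();
List<DTO> dtoList = new ArrayList<>();

DTO dto = new DTO();
mapData.forEach((key, value) -> {
    if ("key1".equalsIgnoreCase(key)) {
        dto.setKey1(value);
    } else if ("key2".equalsIgnoreCase(key)) {
        dto.setKey2(value);
    }
});
dtoList.add(dto);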
Let's say I have this piece of code. As far as I know, with the code below, if I have 10 queries running at the same time and each query returns 10M results, I have to wait for all 100M rows to be fetched from the database before the group function starts.
My problem: the cardinality of the Country and City Cartesian product is low, while the number of rows I have to fetch from the database is huge. I want to calculate the group result immediately as each row is fetched from the database. How can I do that using Java Streams?
myqueries
    .parallelStream()
    .map(m -> {
        // queryresult is a stream which returns database rows
        return queryresult;
    })
    .flatMap(fm -> fm)
    .collect(Collectors.groupingBy(g -> {
        List<Object> objects = Arrays.<Object>asList(
            g.getCountry(),
            g.getCity());
        return objects;
    }, Collectors.toList()))
    .entrySet().stream().map(m -> {
        MyResultClass item = new MyResultClass();
        item.setCountry((String) m.getKey().get(0));
        item.setCity((String) m.getKey().get(1));
        item.setSumField1(m.getValue().stream().mapToDouble(m2 -> m2.getSumField1()).sum());
        item.setSumField2(m.getValue().stream().mapToDouble(m2 -> m2.getSumField2()).sum());
        item.setSumField3(m.getValue().stream().mapToDouble(m2 -> m2.getSumField3()).sum());
        return item;
    }).forEach(f -> {
        // print the MyResultClass fields
    });
The problem with your solution is that you are collecting all the data into a list just to do a further reduction, so it accumulates all the data in memory. You can combine both reductions into a single one using toMap, like this:
myqueries
    .parallelStream()
    .flatMap(m -> {
        // queryresult is a stream which returns database rows
        return queryresult;
    })
    .collect(Collectors.toMap(
        g -> Arrays.<Object>asList(g.getCountry(), g.getCity()),
        v -> {
            MyResultClass item = new MyResultClass();
            item.setCountry(v.getCountry());
            item.setCity(v.getCity());
            item.setSumField1(v.getSumField1());
            item.setSumField2(v.getSumField2());
            item.setSumField3(v.getSumField3());
            return item;
        },
        (t, u) -> {
            t.setSumField1(t.getSumField1() + u.getSumField1());
            t.setSumField2(t.getSumField2() + u.getSumField2());
            t.setSumField3(t.getSumField3() + u.getSumField3());
            return t;
        }
    ))
    .values().forEach(f -> {
        // print the MyResultClass fields
    });
Also, note that using parallelStream here does not mean all queries will run in parallel. Parallelism depends on the number of queries, the number of cores on your machine, and the runtime environment. If you want to control the concurrent query behaviour, it's better to use an ExecutorService.
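For the ExecutorService suggestion, here is a rough sketch; Query, Row, runQuery(...), toResult(...) and MyResultClass.merge(...) are hypothetical stand-ins for your own query and accumulation code:
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Stream;

void runAll(List<Query> myqueries) throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(Math.min(myqueries.size(), 10));
    ConcurrentMap<List<Object>, MyResultClass> results = new ConcurrentHashMap<>();

    List<Future<?>> futures = new ArrayList<>();
    for (Query q : myqueries) {
        futures.add(pool.submit(() -> {
            try (Stream<Row> rows = runQuery(q)) {                // rows are streamed lazily
                rows.forEach(row -> results.merge(
                        Arrays.<Object>asList(row.getCountry(), row.getCity()),
                        toResult(row),                            // initial value for a new key
                        MyResultClass::merge));                   // combine with the existing value
            }
        }));
    }
    for (Future<?> f : futures) {
        f.get();                                                  // wait for completion, propagate failures
    }
    pool.shutdown();
    results.values().forEach(System.out::println);                // print the MyResultClass fields
}
This way each query's rows are folded into the shared map as they arrive, instead of being buffered first.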
Another point to note is that execution also depends on how you create the Stream from the query result in the first place. If you wait until you have all the results and only then create the Stream, you defeat the purpose of the question itself.
I am using Flink 1.4.0.
Suppose I have a POJO as follows:
public class Rating {
    public String name;
    public String labelA;
    public String labelB;
    public String labelC;
    ...
}
and a JOIN function:
public class SetLabelA implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
    @Override
    public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> labelA) {
        rating.f1.setLabelA(labelA.f1);
        return rating;
    }
}
and suppose I want to apply a JOIN operation to set the values of each field in a DataSet<Tuple2<String, Rating>>, which I can do as follows:
DataSet<Tuple2<String, Rating>> ratings = // [...]
DataSet<Tuple2<String, String>> aLabels = // [...]
DataSet<Tuple2<String, String>> bLabels = // [...]
DataSet<Tuple2<String, String>> cLabels = // [...]
...
DataSet<Tuple2<String, Rating>> newRatings =
    ratings.leftOuterJoin(aLabels, JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE)
        // key of the first input
        .where("f0")
        // key of the second input
        .equalTo("f0")
        // applying the JoinFunction on joining pairs
        .with(new SetLabelA());
Unfortunately, this is necessary because both ratings and all the xLabels are very big DataSets, and I am forced to look into each of the xLabels to find the field values I require, while at the same time not all rating keys exist in every xLabels DataSet.
This practically means that I have to perform a leftOuterJoin per xlabel, for which I need to also create the respective JoinFunction implementation that utilises the correct setter from the Rating POJO.
Is there a more efficient way to solve this that anyone can think of?
As far as the partitioning strategy goes, I have made sure to sort the DataSet<Tuple2<String, Rating>> ratings with:
DataSet<Tuple2<String, Rating>> sorted_ratings = ratings.sortPartition(0, Order.ASCENDING).setParallelism(1);
By setting parallelism to 1 I can be sure that the whole dataset will be ordered. I then use .partitionByRange:
DataSet<Tuple2<String, Rating>> partitioned_ratings = sorted_ratings.partitionByRange(0).setParallelism(N);
where N is the number of cores I have on my VM. Another side question I have here is whether the first .setParallelism which is set to 1 is restrictive in terms of how the rest of the pipeline is executed, i.e. can the follow up .setParallelism(N) change how the DataSet is processed?
Finally, I did all this so that when partitioned_ratings is joined with an xlabels DataSet, the JOIN operation is done with JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE. According to the Flink docs for v1.4.0:
REPARTITION_SORT_MERGE: The system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. This strategy is good if one or both of the inputs are already sorted.
So in my case, ratings is sorted (I think) and each of the xlabels DataSets is not, hence it makes sense that this is the most efficient strategy. Is anything wrong with this? Are there alternative approaches?
So far I haven't been able to make this strategy work. It seems that relying on JOINs is too troublesome, as they are expensive operations and one should avoid them unless they are really necessary.
For instance, JOINs should be used if both DataSets are very big. If they are not, a convenient alternative is to use broadcast variables, whereby one of the two DataSets (the smaller one) is broadcast across workers for whatever purpose it is needed. An example appears below (copied from this link for convenience):
DataSet<Point> points = env.readCsv(...);
DataSet<Centroid> centroids = ...; // some computation

points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
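Applied to the ratings/labels case above, a rough sketch (assuming aLabels, as a DataSet<Tuple2<String, String>>, is small enough to broadcast) might look like this:
// Sketch only: replaces the leftOuterJoin with a broadcast lookup of labelA per key
DataSet<Tuple2<String, Rating>> withLabelA = ratings
    .map(new RichMapFunction<Tuple2<String, Rating>, Tuple2<String, Rating>>() {
        private Map<String, String> labelAByKey;

        @Override
        public void open(Configuration parameters) {
            // build a lookup map from the broadcast label DataSet
            List<Tuple2<String, String>> broadcast =
                    getRuntimeContext().getBroadcastVariable("aLabels");
            labelAByKey = new HashMap<>();
            for (Tuple2<String, String> t : broadcast) {
                labelAByKey.put(t.f0, t.f1);
            }
        }

        @Override
        public Tuple2<String, Rating> map(Tuple2<String, Rating> rating) {
            String label = labelAByKey.get(rating.f0);   // null if this key has no label
            if (label != null) {
                rating.f1.labelA = label;
            }
            return rating;
        }
    })
    .withBroadcastSet(aLabels, "aLabels");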
Also, since populating the fields of a POJO implies that quite similar code will be leveraged repeatedly, one should definitely use jlens to avoid code repetition and write a more concise and easy-to-follow solution.
I've used a map step to create a JavaRDD object containing some objects I need. Based on those objects I want to create a global hashmap containing some stats, but I can't figure out which RDD operation to use. At first I thought reduce would be the solution, but then I saw that you have to return the same type of objects. I'm not interested in reducing the items, but in gathering all the stats from all the machines (they can be computed separately and then just added up).
For example:
I have an RDD of objects containing an integer array among other stuff, and I want to compute how many times each of the integers has appeared in the array by putting them into a hashtable. Each machine should compute its own hashtable, and then they should all be merged in the driver.
Often when you think you want to end up with a Map, you'd need to transform your records in the RDD into key-value pairs, and use reduceByKey.
Your specific example sounds exactly like the famous wordcount example (see first example here), only you want to count integers from an array within an object, instead of counting words from a sentence (String). In Scala, this would translate to:
import org.apache.spark.rdd.RDD
import scala.collection.Map

class Example {
  case class MyObj(ints: Array[Int], otherStuff: String)

  def countInts(input: RDD[MyObj]): Map[Int, Int] = {
    input
      .flatMap(_.ints)    // flatMap maps each record into several records - in this case, each int becomes a record
      .map(i => (i, 1))   // turn into key-value map, with preliminary value 1 for each key
      .reduceByKey(_ + _) // aggregate values by key
      .collectAsMap()     // collects data into a Map
  }
}
Generally, you should let Spark perform as much of the operation as possible in a distributed manner, and delay the collection into memory as much as possible - if you collect the values before reducing, often you'll run out of memory, unless your dataset is small enough to begin with (in which case, you don't really need Spark).
Edit: and here's the same code in Java (much longer, but identical...):
static class MyObj implements Serializable {
    Integer[] ints;
    String otherStuff;
}

Map<Integer, Integer> countInts(JavaRDD<MyObj> input) {
    return input
        .flatMap(new FlatMapFunction<MyObj, Integer>() {
            @Override
            public Iterable<Integer> call(MyObj myObj) throws Exception {
                return Arrays.asList(myObj.ints);
            }
        }) // flatMap maps each record into several records - in this case, each int becomes a record
        .mapToPair(new PairFunction<Integer, Integer, Integer>() {
            @Override
            public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
                return new Tuple2<>(integer, 1);
            }
        }) // turn into key-value map, with preliminary value 1
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        }) // aggregate values by key
        .collectAsMap(); // collects data into a Map
}
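For what it's worth, with Java 8 lambdas the same pipeline shrinks considerably. A sketch against the same Spark 1.x Java API used above (where FlatMapFunction returns an Iterable):
Map<Integer, Integer> countInts(JavaRDD<MyObj> input) {
    return input
            .flatMap(myObj -> Arrays.asList(myObj.ints))  // each int becomes a record
            .mapToPair(i -> new Tuple2<>(i, 1))           // key-value pairs with preliminary value 1
            .reduceByKey((v1, v2) -> v1 + v2)             // aggregate values by key
            .collectAsMap();                              // collect into a Map
}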