I am using Flink 1.4.0.
Suppose I have a POJO as follows:
public class Rating {
public String name;
public String labelA;
public String labelB;
public String labelC;
...
}
and a JOIN function:
public class SetLabelA implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
    @Override
    public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> labelA) {
        rating.f1.setLabelA(labelA.f1);
        return rating;
    }
}
and suppose I want to apply a JOIN operation to set the values of each field in a DataSet<Tuple2<String, Rating>>, which I can do as follows:
DataSet<Tuple2<String, Rating>> ratings = // [...]
DataSet<Tuple2<String, String>> aLabels = // [...]
DataSet<Tuple2<String, String>> bLabels = // [...]
DataSet<Tuple2<String, String>> cLabels = // [...]
...
DataSet<Tuple2<String, Rating>> newRatings =
ratings.leftOuterJoin(aLabels, JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE)
// key of the first input
.where("f0")
// key of the second input
.equalTo("f0")
// applying the JoinFunction on joining pairs
.with(new SetLabelA());
Unfortunately, this is necessary: both ratings and all the xLabels DataSets are very big, I have to look into each of the xLabels sets to find the field values I require, and at the same time not every rating key exists in every xLabels set.
This practically means that I have to perform one leftOuterJoin per xLabels DataSet, and for each of them create a JoinFunction implementation that uses the correct setter of the Rating POJO.
Is there a more efficient way to solve this that anyone can think of?
As far as the partitioning strategy goes, I have made sure to sort the DataSet<Tuple2<String, Rating>> ratings with:
DataSet<Tuple2<String, Rating>> sorted_ratings = ratings.sortPartition(0, Order.ASCENDING).setParallelism(1);
By setting parallelism to 1 I can be sure that the whole dataset will be ordered. I then use .partitionByRange:
DataSet<Tuple2<String, Rating>> partitioned_ratings = sorted_ratings.partitionByRange(0).setParallelism(N);
where N is the number of cores on my VM. Another side question I have here is whether the first .setParallelism, which is set to 1, is restrictive in terms of how the rest of the pipeline is executed, i.e. can the follow-up .setParallelism(N) change how the DataSet is processed?
Finally, I did all this so that when partitioned_ratings is joined with an xLabels DataSet, the JOIN operation will be done with JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE. According to the Flink 1.4.0 docs:
REPARTITION_SORT_MERGE: The system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. This strategy is good if one or both of the inputs are already sorted.
So in my case, ratings is sorted (I think) and each of the xLabels DataSets is not, hence it makes sense that this is the most efficient strategy. Is anything wrong with this? Are there alternative approaches?
So far I haven't been able to make this strategy work. It seems that relying on JOINs is too troublesome: they are expensive operations and should be avoided unless really necessary.
JOINs make sense when both DataSets are very big. If one of them is not, a convenient alternative is a broadcast variable, by which the smaller of the two DataSets is broadcast to all workers and used there directly. An example appears below (copied from this link for convenience):
DataSet<Point> points = env.readCsvFile(...);
DataSet<Centroid> centroids = ...; // some computation

points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
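Applied to the Rating example above, a minimal sketch of the same idea (assuming one of the label DataSets, say aLabels, is small enough to broadcast; setLabelA and the Tuple2<String, Rating> layout come from the question, everything else is illustrative):
DataSet<Tuple2<String, Rating>> withLabelA = ratings
    .map(new RichMapFunction<Tuple2<String, Rating>, Tuple2<String, Rating>>() {
        private Map<String, String> labelAByKey;

        @Override
        public void open(Configuration parameters) {
            // build a lookup map from the broadcast label set, once per task
            List<Tuple2<String, String>> broadcast =
                getRuntimeContext().getBroadcastVariable("aLabels");
            labelAByKey = new HashMap<>();
            for (Tuple2<String, String> label : broadcast) {
                labelAByKey.put(label.f0, label.f1);
            }
        }

        @Override
        public Tuple2<String, Rating> map(Tuple2<String, Rating> rating) {
            // not every rating key exists in aLabels, so guard against misses
            String labelA = labelAByKey.get(rating.f0);
            if (labelA != null) {
                rating.f1.setLabelA(labelA);
            }
            return rating;
        }
    })
    .withBroadcastSet(aLabels, "aLabels");
This replaces one leftOuterJoin with a map-side lookup; whether it is viable depends entirely on whether the label set fits in memory on every worker.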
Also, since populating the fields of a POJO means writing quite similar code repeatedly, one should definitely use jlens to avoid code repetition and write a more concise and easier-to-follow solution.
Related
I have a class with a collection of Seed elements. One of Seed's methods returns Optional<Pair<Boolean, String>>.
I'm trying to loop over all seeds, find whether any boolean value is true and, at the same time, build a set of all the String values. So, with input in the form Optional<Pair<Boolean, String>>, the output should be Optional<Signal>, where Signal looks like:
class Signal {
public boolean exposure;
public Set<String> alarms;
// constructor and getters (can add anything to this class, it's just a bag)
}
This is what I currently have that works:
// Seed::hadExposure yields Optional<Pair<Boolean, String>> where Pair have key/value or left/right
public Optional<Signal> withExposure() {
if (seeds.stream().map(Seed::hadExposure).flatMap(Optional::stream).findAny().isEmpty()) {
return Optional.empty();
}
final var exposure = seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.anyMatch(Pair::getLeft);
final var alarms = seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.map(Pair::getRight)
.filter(Objects::nonNull)
.collect(Collectors.toSet());
return Optional.of(new Signal(exposure, alarms));
}
Now I have time to make it better, because Seed::hadExposure could become an expensive call, so I'm trying to see if I can do all of this in only one pass. I've tried (following suggestions from previous questions) with reduce and with collectors (Collectors.collectingAndThen, Collectors.partitioningBy, etc.), but nothing has worked so far.
It's possible to do this in a single stream() expression using map to convert the non-empty exposure to a Signal and then a reduce to combine the signals:
Signal signal = exposures.stream()
.map(exposure ->
new Signal(
exposure.getLeft(),
exposure.getRight() == null
? Collections.emptySet()
: Collections.singleton(exposure.getRight())))
.reduce(
new Signal(false, new HashSet<>()),
(leftSig, rightSig) -> {
HashSet<String> alarms = new HashSet<>();
alarms.addAll(leftSig.alarms);
alarms.addAll(rightSig.alarms);
return new Signal(
leftSig.exposure || rightSig.exposure, alarms);
});
However, if you have a lot of alarms it would be expensive because it creates a new Set and adds the new alarms to the accumulated alarms for each exposure in the input.
In a language designed from the ground up to support functional programming, like Scala or Haskell, you'd have a persistent Set data type that lets you efficiently create a new set identical to an existing one but with an added element, so there'd be no efficiency worries:
filteredSeeds.foldLeft((false, Set[String]())) { (result, exposure) =>
(result._1 || exposure.getLeft, result._2 + exposure.getRight)
}
But Java doesn't come with anything like that out of the box.
You could create just a single Set for the result and mutate it in your stream's reduce expression, but some would regard that as poor style because you'd be mixing a functional paradigm (map/reduce over a stream) with a procedural one (mutating a set).
Personally, in Java, I'd just ditch the functional approach and use a for loop in this case. It'll be less code, more efficient, and IMO clearer.
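For illustration, a single-pass loop version might look like this (a sketch, reusing the Seed, Pair and Signal types from the question):
public Optional<Signal> withExposure() {
    boolean anyPresent = false;
    boolean exposure = false;
    Set<String> alarms = new HashSet<>();
    for (Seed seed : seeds) {
        // hadExposure() is invoked exactly once per seed
        Optional<Pair<Boolean, String>> result = seed.hadExposure();
        if (result.isPresent()) {
            anyPresent = true;
            exposure = exposure || result.get().getLeft();
            if (result.get().getRight() != null) {
                alarms.add(result.get().getRight());
            }
        }
    }
    return anyPresent ? Optional.of(new Signal(exposure, alarms)) : Optional.empty();
}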
If you have enough space to store an intermediate result, you could do something like:
List<Pair<Boolean, String>> exposures =
seeds.stream()
.map(Seed::hadExposure)
.flatMap(Optional::stream)
.collect(Collectors.toList());
Then you'd only be calling the expensive Seed::hadExposure method once per item in the input list (this exposures list is what the stream-and-reduce example above operates on).
Given a list in which each entry is an object that looks like
class Entry {
public String id;
public Object value;
}
Multiple entries could have the same id. I need a map where I can access all values that belong to a certain id:
Map<String, List<Object>> map;
My algorithm to achieve this:
for (Entry entry : listOfEntries) {
    List<Object> listOfValues;
    if (map.containsKey(entry.id)) {
        listOfValues = map.get(entry.id);
    } else {
        listOfValues = new ArrayList<>();
        map.put(entry.id, listOfValues);
    }
    listOfValues.add(entry.value);
}
Simply: I transform a list that looks like
ID | VALUE
---+------------
a | foo
a | bar
b | foobar
To a map that looks like
a--+- foo
'- bar
b---- foobar
As you can see, containsKey is called for each entry of the source list. That's why I wonder whether I could improve my algorithm by pre-sorting the source list and then doing this:
List<Object> listOfValues = new ArrayList<>();
String prevId = null;
for (Entry entry : listOfEntries) {
    if (prevId != null && !prevId.equals(entry.id)) {
        map.put(prevId, listOfValues);
        listOfValues = new ArrayList<>();
    }
    listOfValues.add(entry.value);
    prevId = entry.id;
}
if (prevId != null) map.put(prevId, listOfValues);
The second solution has the advantage that I don't need to call map.containsKey() for every entry, but the disadvantage that I have to sort first. Furthermore, the first algorithm is easier to implement and less error-prone, since the second requires extra code after the actual loop.
Therefore my question is: Which method has better performance?
The examples are written in Java pseudo code but the actual question applies to other programming languages as well.
If you have a hash map and a very large number of entries, then inserting items one by one will be faster than sorting and then inserting them list by list (O(n) vs. O(n log n)). If you use a tree-based map, the complexity is the same for both approaches.
However, I really doubt you have enough entries for memory access patterns and the speed of the compare and hash functions to come into effect. You have two options: ignore it, since the difference will not be significant, or benchmark both options and see which one works better on your system. If you don't have millions of entries, I would ignore the issue and go with whatever is easier to understand.
Don't presort. Even fast sorting algorithms like quicksort take, on average, O(n log n) for n items. Afterwards, you still need O(n) to walk the list. containsKey on a (hash) map takes constant time (check out this question), so don't worry about it. Walk the list in linear time and use containsKey.
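For what it's worth, the linear walk can be written even more compactly with Map.computeIfAbsent, which folds the containsKey check and the put into one call (same algorithm as above, assuming the Entry class from the question):
Map<String, List<Object>> map = new HashMap<>();
for (Entry entry : listOfEntries) {
    // creates the value list on the first occurrence of an id, then reuses it
    map.computeIfAbsent(entry.id, id -> new ArrayList<>()).add(entry.value);
}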
I'd like to offer another solution, using streams:
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;
Map<String, List<Object>> map = listOfEntries.stream()
    .collect(groupingBy(entry -> entry.id, mapping(entry -> entry.value, toList())));
This code is more declarative: it only specifies that the List should be transformed into a Map.
It is then the library's responsibility to perform that transformation efficiently.
I am migrating some map-reduce code to Spark and having problems constructing an Iterable to return from the function.
In the MR code, I had a reduce function that grouped by key and then (using multipleOutputs) iterated the values and used write (to multiple outputs, but that's unimportant), with code like this (simplified):
reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml : values) {
        multipleOutputs.write(key, xml, directory);
    }
}
However, in Spark I have translated a map and this reduce into a sequence of:
mapToPair -> groupByKey -> flatMap
as recommended... in some book.
mapToPair basically adds a Key via functionMap, which creates a Key for each record based on some of its values. Sometimes a key may have very high cardinality.
JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() {
public Tuple2<Key, String> call(String value) {
//...
return functionMap.call(value);
}
});
RDD.groupByKey() is then applied to rddPaired to get the RDD that feeds the flatMap function:
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();
Once grouped, a flatMap call does the reduce. Here, operation is a transformation:
public Iterable<String> call(Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) {
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}
It works fine... with keys that don't have too many records. It actually breaks with an OutOfMemory error when a key with lots of values enters the "else" branch of the reduce.
Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.
It is clear that, having to keep all of the grouped values in the "out" list, this won't scale if a key has millions of records, because they are all kept in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is given; it's not a very memory-expensive operation though).
Is there any way to avoid this in order to scale? Either by replicating the behaviour with some other directives to reach the same output in a more scalable way, or by being able to hand Spark the values for merging (just as I used to do with MR)...
It's inefficient to evaluate the condition inside the flatMap operation. You should check the condition outside, create two distinct RDDs, and deal with them separately.
rddPaired.cache();
// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on the group of all values with the same key and return the result
JavaPairRDD<Key, String> rddGroupedResults = rddGrouped.mapValues(processGroupedValuesFunction);

// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
JavaPairRDD<Key, String> rddNoGroupedResults = rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
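For illustration, the two filter functions could be plain predicates on the key (org.apache.spark.api.java.function.Function, written as Java 8 lambdas); needsGrouping is a hypothetical stand-in for the "someConditionOnKey" check from the question:
// hypothetical: decides whether a key's values must be processed as one group
Function<Tuple2<Key, String>, Boolean> groupFilterFunc =
        keyValue -> needsGrouping(keyValue._1());
Function<Tuple2<Key, String>, Boolean> nogroupFilterFunc =
        keyValue -> !needsGrouping(keyValue._1());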
I've used a map step to create a JavaRDD object containing some objects I need. Based on those objects I want to create a global hashmap containing some stats, but I can't figure out which RDD operation to use. At first I thought reduce would be the solution, but then I saw that you have to return the same type of objects. I'm not interested in reducing the items, but in gathering all the stats from all the machines (they can be computed separately and then just added up).
For example:
I have an RDD of objects containing an integer array, among other stuff, and I want to compute how many times each integer appears in the array by putting the counts into a hashtable. Each machine should compute its own hashtable and then they should all be merged in one place in the driver.
Often when you think you want to end up with a Map, you'd need to transform your records in the RDD into key-value pairs, and use reduceByKey.
Your specific example sounds exactly like the famous wordcount example (see first example here), only you want to count integers from an array within an object, instead of counting words from a sentence (String). In Scala, this would translate to:
import org.apache.spark.rdd.RDD
import scala.collection.Map
class Example {
case class MyObj(ints: Array[Int], otherStuff: String)
def countInts(input: RDD[MyObj]): Map[Int, Int] = {
input
.flatMap(_.ints) // flatMap maps each record into several records - in this case, each int becomes a record
.map(i => (i, 1)) // turn into key-value map, with preliminary value 1 for each key
.reduceByKey(_ + _) // aggregate values by key
.collectAsMap() // collects data into a Map
}
}
Generally, you should let Spark perform as much of the operation as possible in a distributed manner, and delay the collection into memory as much as possible - if you collect the values before reducing, often you'll run out of memory, unless your dataset is small enough to begin with (in which case, you don't really need Spark).
Edit: and here's the same code in Java (much longer, but identical...):
static class MyObj implements Serializable {
Integer[] ints;
String otherStuff;
}
Map<Integer, Integer> countInts(JavaRDD<MyObj> input) {
return input
.flatMap(new FlatMapFunction<MyObj, Integer>() {
@Override
public Iterable<Integer> call(MyObj myObj) throws Exception {
return Arrays.asList(myObj.ints);
}
}) // flatMap maps each record into several records - in this case, each int becomes a record
.mapToPair(new PairFunction<Integer, Integer, Integer>() {
@Override
public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
return new Tuple2<>(integer, 1);
}
}) // turn into key-value map, with preliminary value 1
.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2;
}
}) // aggregate values by key
.collectAsMap(); // collects data into a Map
}
In my program I have a List of Plants; each plant has a measurement (String), day (int), camera (int), and replicate number (int). I obtain a List of all the plants I want by using filters:
List<Plant> selectPlants = allPlants.stream().filter(plant -> passesFilters(plant, filters)).collect(Collectors.toList());
What I would like to do now is take all Plants that have the same camera, measurement, and replicate values and combine them in order of day. So if I have days 1, 2, 3, 5 I would want to find all similar plants and append their values (returned by getValues) to one plant.
I added a method to Plant that appends values by just using addAll( new plant values ).
Is there any way of doing this without iterating through the list over and over to find the similar plants, then sorting each time by day and appending? I'm sorry for the horrible wording of this question.
While Vakh’s answer is correct, it is unnecessarily complex.
Often, the work of implementing your own key class does not pay off. You can use a List as a key which implies a slight overhead due to boxing primitive values but given the fact that we do operations like hashing here, it will be negligible.
And sorting doesn’t have to be done by using a for loop and, even worse, an anonymous inner class for the Comparator. Why do we have streams and lambdas? You can implement the Comparator using a lambda like (p1,p2) -> p1.getDay()-p2.getDay() or, even better, Comparator.comparing(Plant::getDay).
Further, you can do the entire operation in one step. The sort step will create an ordered stream and the collector will maintain the encounter order, so you can use one stream to sort and group:
Map<List<?>, List<Plant>> groupedPlants = allPlants.stream()
.filter(plant -> passesFilters(plant, filters))
.sorted(Comparator.comparing(Plant::getDay))
.collect(Collectors.groupingBy(p ->
Arrays.asList(p.getMeasurement(), p.getCamera(), p.getReplicateNumber())));
That’s all.
Using Collectors.groupingBy:
private static class PlantKey {
    private String measurement;
    private int camera;
    private int replicateNumber;
    // + constructor, getters, setters and hashCode/equals
}
Map<PlantKey, List<Plant>> groupedPlants =
    allPlants.stream().filter(plant -> passesFilters(plant, filters))
        .collect(Collectors.groupingBy(p ->
            new PlantKey(p.getMeasurement(),
                p.getCamera(),
                p.getReplicateNumber())));
// order each list
for (List<Plant> values : groupedPlants.values()) {
    Collections.sort(values, new Comparator<Plant>() {
        @Override
        public int compare(Plant p1, Plant p2) {
            return p1.getDay() - p2.getDay();
        }
    });
}
I would group them by the common characteristics and compare similar results.
for (List<Plant> plantGroup : allPlants.stream().collect(Collectors.groupingBy(
        p -> p.camera + "/" + p.measurement + "/" + p.replicate)).values()) {
// compare the plants in the same group
}
There is a function called sorted which operates on a stream:
selectPlants.stream().sorted(Comparator.comparingInt(i -> i.day)).collect(Collectors.toList());