Sorting an RDD by values after using groupByKey - Java

I have a JavaPairRDD built like this:
JavaPairRDD<String, Iterable<Row>> rdd = mydataset.orderBy("orderfield1", "orderfield2").javaRDD().mapToPair(row -> new Tuple2<>(row.getAs("id").toString(), row)).groupByKey();
Since groupByKey() doesn't maintain order, the orderBy has no effect here.
I want to order each Iterable<Row> by some of the fields from the dataset.

You could transform the Iterable into a List and then sort that list like below. I assume that your sorting field is called x and that it is of type String but you can obviously adapt that to your specific case.
String sortingField = "x";
JavaPairRDD<String, List<Row>> rdd = mydataset
.javaRDD()
.mapToPair(row -> new Tuple2<>(row.getAs("id").toString(), row))
.groupByKey()
.mapValues(it -> {
List<Row> rows = new ArrayList<>();
it.forEach(rows::add);
rows.sort(
(Row a, Row b) -> a.<String>getAs(sortingField).compareTo(b.<String>getAs(sortingField))
);
return rows;
});
Note that this is much simpler to write in Scala:
val rdd = mydataset
.rdd
.map(row => (row.getAs("id").toString, row))
.groupByKey
.mapValues( _.toSeq.sortBy(_.getAs[String]("x")))
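The question mentions two ordering fields, while the answer sorts on one. A chained Comparator covers the two-field case. The sketch below is plain Java so it can run without Spark: Rec is a hypothetical stand-in for Row, and groupAndSort imitates the groupByKey plus mapValues steps on an in-memory list. Inside the real mapValues lambda, a comparator built the same way from a.<String>getAs(...) calls should work equally well, assuming the fields are comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the groupByKey + mapValues step (no Spark needed to run it).
final class GroupSortSketch {

    // Hypothetical replacement for a Spark Row with columns id, orderfield1, orderfield2.
    static final class Rec {
        final String id;
        final String orderfield1;
        final int orderfield2;

        Rec(String id, String orderfield1, int orderfield2) {
            this.id = id;
            this.orderfield1 = orderfield1;
            this.orderfield2 = orderfield2;
        }

        @Override
        public String toString() {
            return orderfield1 + "/" + orderfield2;
        }
    }

    // Compare on orderfield1 first, then on orderfield2 to break ties.
    static final Comparator<Rec> BY_BOTH_FIELDS =
            Comparator.comparing((Rec r) -> r.orderfield1)
                    .thenComparingInt(r -> r.orderfield2);

    // Group the records by id, then sort every group with the chained comparator.
    static Map<String, List<Rec>> groupAndSort(List<Rec> records) {
        Map<String, List<Rec>> groups = new LinkedHashMap<>();
        for (Rec r : records) {
            groups.computeIfAbsent(r.id, k -> new ArrayList<>()).add(r);
        }
        groups.values().forEach(rows -> rows.sort(BY_BOTH_FIELDS));
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
                new Rec("k1", "b", 1),
                new Rec("k1", "a", 2),
                new Rec("k2", "c", 5),
                new Rec("k1", "a", 1));
        System.out.println(groupAndSort(records)); // {k1=[a/1, a/2, b/1], k2=[c/5]}
    }
}
```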


Have Java Streams GroupingBy result Map include a key for each value of an enum, even if value is an empty List

This question is about Java Streams' groupingBy capability.
Suppose I have a class, WorldCup:
public class WorldCup {
int year;
Country champion;
// all-arg constructor, getter/setters, etc
}
and an enum, Country:
public enum Country {
Brazil, France, USA
}
and the following code snippet:
WorldCup wc94 = new WorldCup(1994, Country.Brazil);
WorldCup wc98 = new WorldCup(1998, Country.France);
List<WorldCup> wcList = new ArrayList<WorldCup>();
wcList.add(wc94);
wcList.add(wc98);
Map<Country, List<Integer>> championsMap = wcList.stream()
.collect(Collectors.groupingBy(WorldCup::getCountry, Collectors.mapping(WorldCup::getYear, Collectors.toList())));
After running this code, championsMap will contain:
Brazil: [1994]
France: [1998]
Is there a succinct way to have this list include an entry for all of the values of the enum? What I'm looking for is:
Brazil: [1994]
France: [1998]
USA: []
There are several approaches you can take.
The map used to accumulate the stream data can be prepopulated with an entry for every enum constant. To get all constants of the enum, you can use the values() method or EnumSet.allOf().
This can be done with the three-argument version of collect() or with a custom collector created via Collector.of().
Map<Country, List<Integer>> championsMap = wcList.stream()
.collect(
() -> EnumSet.allOf(Country.class).stream() // supplier
.collect(Collectors.toMap(
Function.identity(),
c -> new ArrayList<>()
)),
(Map<Country, List<Integer>> map, WorldCup next) -> // accumulator
map.get(next.getCountry()).add(next.getYear()),
(left, right) -> // combiner
right.forEach((k, v) -> left.get(k).addAll(v))
);
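The Collector.of() route mentioned above is not shown in the answer, so here is a rough sketch of how it could look. The WorldCup and Country types are re-declared only to keep the example self-contained, and yearsByEveryCountry is a made-up name. An EnumMap is used as the container here, which is a choice of this sketch and not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collector;
import java.util.stream.Stream;

final class EnumGroupingSketch {

    enum Country { Brazil, France, USA }

    // Minimal stand-in for the WorldCup class from the question.
    static final class WorldCup {
        final int year;
        final Country champion;

        WorldCup(int year, Country champion) {
            this.year = year;
            this.champion = champion;
        }

        int getYear() { return year; }
        Country getCountry() { return champion; }
    }

    // Collector whose container starts with an empty list for every Country constant.
    static Collector<WorldCup, ?, Map<Country, List<Integer>>> yearsByEveryCountry() {
        return Collector.<WorldCup, Map<Country, List<Integer>>>of(
                () -> {                                                   // supplier
                    Map<Country, List<Integer>> map = new EnumMap<>(Country.class);
                    for (Country c : Country.values()) {
                        map.put(c, new ArrayList<>());
                    }
                    return map;
                },
                (map, wc) -> map.get(wc.getCountry()).add(wc.getYear()),  // accumulator
                (left, right) -> {                                        // combiner
                    right.forEach((k, v) -> left.get(k).addAll(v));
                    return left;
                });
    }

    public static void main(String[] args) {
        Map<Country, List<Integer>> champions = Stream.of(
                        new WorldCup(1994, Country.Brazil),
                        new WorldCup(1998, Country.France))
                .collect(yearsByEveryCountry());
        System.out.println(champions); // {Brazil=[1994], France=[1998], USA=[]}
    }
}
```

Wrapping the logic in a method that returns a Collector makes it reusable across streams, which the inline three-argument collect() call is not.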
Another option is to add the missing entries to the map after the stream reduction has finished.
For that we can use the built-in collector collectingAndThen().
Map<Country, List<Integer>> championsMap = wcList.stream()
.collect(Collectors.collectingAndThen(
Collectors.groupingBy(WorldCup::getCountry,
Collectors.mapping(WorldCup::getYear,
Collectors.toList())),
map -> {
EnumSet.allOf(Country.class)
.forEach(country -> map.computeIfAbsent(country, k -> new ArrayList<>())); // if you're not going to mutate these lists - use Collections.emptyList()
return map;
}
));

Convert Map<Integer, List<String>> to Map<String, List<Integer>>

I'm having a hard time converting a Map containing some integers as keys and a list of random strings as values.
E.g. :
1 = ["a", "b", "c"]
2 = ["a", "b", "z"]
3 = ["z"]
I want to transform it into a Map with the distinct strings as keys and lists of integers as values.
E.g. :
a = [1, 2]
b = [1, 2]
c = [1]
z = [2,3]
Here's what I have so far:
Map<Integer, List<String>> integerListMap; // initial list is already populated
List<String> distinctStrings = new ArrayList<>();
SortedMap<String, List<Integer>> stringListSortedMap = new TreeMap<>();
for(Integer i: integers) {
integerListMap.put(i, strings);
distinctStrings.addAll(strings);
}
distinctStrings = distinctStrings.stream().distinct().collect(Collectors.toList());
for(String s : distinctStrings) {
stringListSortedMap.put(s, ???);
}
Iterate over your source map's value and put each value into the target map.
final Map<String, List<Integer>> target = new HashMap<>();
for (final Map.Entry<Integer, List<String>> entry : source.entrySet()) {
for (final String s : entry.getValue()) {
target.computeIfAbsent(s, k -> new ArrayList<>()).add(entry.getKey());
}
}
Or with the Stream API by abusing Map.Entry to build the inverse:
final Map<String, List<Integer>> target = source.entrySet()
.stream()
.flatMap(e -> e.getValue().stream().map(s -> Map.entry(s, e.getKey())))
.collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
Although this might not be as clear as introducing a new custom type to hold the inverted mapping.
Another alternative would be using a bidirectional map. Many libraries come with implementations of these, such as Guava.
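For anyone who wants to try the loop-based inversion end to end, here is a small self-contained sketch. The class and method names are made up. A TreeMap is used as the target because the question asks for a SortedMap, and the sample input is held in a LinkedHashMap so that the integers arrive in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class InvertMapSketch {

    // Inverts number -> strings into string -> numbers, with the strings kept sorted.
    static SortedMap<String, List<Integer>> invert(Map<Integer, List<String>> source) {
        SortedMap<String, List<Integer>> target = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> entry : source.entrySet()) {
            for (String s : entry.getValue()) {
                target.computeIfAbsent(s, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the integer lists come out ascending.
        Map<Integer, List<String>> source = new LinkedHashMap<>();
        source.put(1, Arrays.asList("a", "b", "c"));
        source.put(2, Arrays.asList("a", "b", "z"));
        source.put(3, Arrays.asList("z"));
        System.out.println(invert(source)); // {a=[1, 2], b=[1, 2], c=[1], z=[2, 3]}
    }
}
```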
There's no need to apply distinct() since you're storing the data into the Map and keys are guaranteed to be unique.
You can flatten the entries of the source map, so that only one string (let's call it name) and a single integer (let's call it number) would correspond to a stream element, and then group the data by string.
To solve this problem using streams, we can use the flatMap() operation to perform a one-to-many transformation. It's good practice to define a custom type for that purpose, such as a Java 16 record or a class (you can also use a Map.Entry, but a custom type is more advantageous because it allows writing self-documenting code).
To collect the data into a TreeMap, you can use the three-argument version of groupingBy(), which lets you specify a mapFactory.
record NameNumber(String name, Integer number) {}
Map<Integer, List<String>> dataByProvider = Map.of(
1, List.of("a", "b", "c"),
2, List.of("a", "b", "z"),
3, List.of("z")
);
NavigableMap<String, List<Integer>> numbersByName = dataByProvider.entrySet().stream()
.flatMap(entry -> entry.getValue().stream()
.map(name -> new NameNumber(name, entry.getKey()))
)
.collect(Collectors.groupingBy(
NameNumber::name,
TreeMap::new,
Collectors.mapping(NameNumber::number, Collectors.toList())
));
numbersByName.forEach((name, numbers) -> System.out.println(name + " -> " + numbers));
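Output (the order of the numbers within each list follows the iteration order of Map.of(), which is unspecified and may differ between runs):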
a -> [2, 1]
b -> [2, 1]
c -> [1]
z -> [3, 2]
Sidenote: when using a TreeMap, it's more beneficial to use NavigableMap as the abstract type because it gives access to methods like higherKey(), lowerKey(), firstEntry(), lastEntry(), etc., which are declared in the NavigableMap interface; SortedMap, for example, has no higherKey() or lowerKey().
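To make the sidenote concrete, here is a sketch that builds the inverted map as a NavigableMap and calls a few of those navigation methods. NameNumber is written as a small class instead of a record so the example also compiles on older JDKs. Sorting the source entries by key before flattening is an addition of this sketch, so that the numbers in each list come out in ascending order regardless of how the source map iterates.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class NavigableInvertSketch {

    // Class-based equivalent of the NameNumber record.
    static final class NameNumber {
        private final String name;
        private final Integer number;

        NameNumber(String name, Integer number) {
            this.name = name;
            this.number = number;
        }

        String name() { return name; }
        Integer number() { return number; }
    }

    static NavigableMap<String, List<Integer>> invert(Map<Integer, List<String>> source) {
        return source.entrySet().stream()
                // Sorting by key makes the order inside each list independent of the source map.
                .sorted(Map.Entry.comparingByKey())
                .flatMap(e -> e.getValue().stream().map(name -> new NameNumber(name, e.getKey())))
                .collect(Collectors.groupingBy(
                        NameNumber::name,
                        TreeMap::new,
                        Collectors.mapping(NameNumber::number, Collectors.toList())));
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> source = new HashMap<>();
        source.put(3, Arrays.asList("z"));
        source.put(1, Arrays.asList("a", "b", "c"));
        source.put(2, Arrays.asList("a", "b", "z"));

        NavigableMap<String, List<Integer>> byName = invert(source);
        System.out.println(byName);                // {a=[1, 2], b=[1, 2], c=[1], z=[2, 3]}
        System.out.println(byName.higherKey("b")); // c
        System.out.println(byName.lowerKey("a"));  // null
        System.out.println(byName.lastEntry());    // z=[2, 3]
    }
}
```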

Grouping By without using a POJO in java 8

I have a use case where I need to read a file and get the grouping of a sequence and a list of values associated with the sequence. The records in the file have the format sequence-val, for example:
10-A
10-B
11-C
11-A
I want the output to be a map (Map<String,List<String>>) with the sequence as the key and list of values associated with it as value, like below
10,[A,B]
11,[C,A]
Is there a way I can do this without creating a POJO for these records? I have been trying to explore the usage of Collectors.groupingBy and most of the examples I see are based on creating a POJO.
I have been trying to write something like this
Map<String, List<String>> seqCpcGroupMap = pendingCpcList.stream().map(rec ->{
String[] cpcRec = rec.split("-");
return new Tuple2<>(cpcRec[0],cpcRec[1])
}).collect(Collectors.groupingBy(x->x.))
or
Map<String, List<String>> seqCpcGroupMap = pendingCpcList.stream().map(rec ->{
String[] cpcRec = rec.split("-");
return Arrays.asList(cpcRec[0],cpcRec[1]);
}).collect(Collectors.groupingBy(x->(ArrayList<String>)x[0]));
I am unable to provide any key on which the groupingBy can happen for the groupingBy function, is there a way to do this or do I have to create a POJO to use groupingBy?
You can do it like this:
Map<String, List<String>> result = source.stream()
.map(s -> s.split("-"))
.collect(Collectors.groupingBy(a -> a[0],
Collectors.mapping(a -> a[1], Collectors.toList())));
Alternatively, you can use Map.computeIfAbsent directly:
List<String> pendingCpcList = List.of("10-A","10-B","11-C","11-A");
Map<String, List<String>> seqCpcGroupMap = new HashMap<>();
pendingCpcList.stream().map(rec -> rec.split("-"))
.forEach(a -> seqCpcGroupMap.computeIfAbsent(a[0], k -> new ArrayList<>()).add(a[1]));
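Both answers assume every record contains exactly one dash. If the file might contain malformed lines, or values that themselves contain a dash, a slightly more defensive variant could look like the sketch below. The split limit of 2, the filter, and the LinkedHashMap (to keep the sequences in first-seen order) are additions of this sketch, not part of the answers above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SequenceGroupingSketch {

    // Groups "sequence-value" records by sequence without an intermediate POJO.
    static Map<String, List<String>> group(List<String> records) {
        return records.stream()
                // Limit 2: only the first dash separates the key from the value.
                .map(rec -> rec.split("-", 2))
                // Skip records that contain no dash at all.
                .filter(parts -> parts.length == 2)
                .collect(Collectors.groupingBy(
                        parts -> parts[0],
                        LinkedHashMap::new,
                        Collectors.mapping(parts -> parts[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("10-A", "10-B", "11-C", "11-A", "garbage");
        System.out.println(group(records)); // {10=[A, B], 11=[C, A]}
    }
}
```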

How to efficiently join an arbitrary number of RDDs?

Joining two RDDs is simple with RDD1.join(RDD2). However, if I keep an arbitrary number of RDDs in a List<JavaRDD>, how can I join them efficiently?
First, please note that you cannot join JavaRDD. You would need to obtain a JavaPairRDD by using:
groupBy() (or keyBy())
cartesian()
[flat]mapToPair()
zipWithIndex() (useful because it adds index where there is none)
etc.
Then, once you have your list, you can join them all like this:
JavaPairRDD<Integer, String> linesA = sc.parallelizePairs(Arrays.asList(
new Tuple2<>(1, "a1"),
new Tuple2<>(2, "a2"),
new Tuple2<>(3, "a3"),
new Tuple2<>(4, "a4")));
JavaPairRDD<Integer, String> linesB = sc.parallelizePairs(Arrays.asList(
new Tuple2<>(1, "b1"),
new Tuple2<>(5, "b5"),
new Tuple2<>(3, "b3")));
JavaPairRDD<Integer, String> linesC = sc.parallelizePairs(Arrays.asList(
new Tuple2<>(1, "c1"),
new Tuple2<>(5, "c6"),
new Tuple2<>(6, "c3")));
// the list of RDDs
List<JavaPairRDD<Integer, String>> allLines = Arrays.asList(linesA, linesB, linesC);
// since we probably don't want to modify any of the datasets in the list, we will
// copy the first one in a separate variable to keep the result
JavaPairRDD<Integer, String> res = allLines.get(0);
for (int i = 1; i < allLines.size(); ++i) { // note we skip position 0 !
res = res.join(allLines.get(i))
/*[1]*/ .mapValues(tuple -> tuple._1 + ':' + tuple._2);
}
The line marked [1] is the important one, because it maps a
JavaPairRDD<Integer, Tuple2<String,String>> back into a
JavaPairRDD<Integer,String>, which makes it compatible with further joins.
Based on chrisw's answer, this could be put into "one line" like this:
JavaPairRDD<Integer, String> res;
res = allLines.stream()
.reduce((rdd1, rdd2) -> rdd1.join(rdd2).mapValues(tup -> tup._1 + ':' + tup._2))
.get(); // get value from Optional<JavaPairRDD>
Finally, some thoughts on performance. In the above example, I used string concatenation to reduce the result of the join back to an RDD of the same type. If you have a lot of RDDs, you could probably speed this up a bit by using the for loop version with JavaPairRDD<Integer, StringBuilder> res, where you do the first join by hand. I will post more details if required.
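The shape of that fold can be tried without a Spark cluster. The sketch below uses in-memory maps as a stand-in for JavaPairRDD, so it is only an analogy: join here is a hand-written inner join, and unlike an RDD each key can appear only once per map. The names JoinFoldSketch, join, joinAll and lines are all made up for illustration, and the sample data only resembles the data above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class JoinFoldSketch {

    // Inner join of two maps: keeps keys present in both and concatenates the values.
    static Map<Integer, String> join(Map<Integer, String> left, Map<Integer, String> right) {
        Map<Integer, String> joined = new TreeMap<>();
        left.forEach((key, value) -> {
            if (right.containsKey(key)) {
                joined.put(key, value + ":" + right.get(key));
            }
        });
        return joined;
    }

    // Folds the whole list with the pairwise join, as the reduce() one-liner does.
    static Map<Integer, String> joinAll(List<Map<Integer, String>> all) {
        return all.stream()
                .reduce(JoinFoldSketch::join)
                .orElse(Collections.<Integer, String>emptyMap());
    }

    // Helper that builds sample data such as {1=a1, 2=a2, ...}.
    static Map<Integer, String> lines(String prefix, int... keys) {
        Map<Integer, String> map = new TreeMap<>();
        for (int key : keys) {
            map.put(key, prefix + key);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Map<Integer, String>> allLines = Arrays.asList(
                lines("a", 1, 2, 3, 4),
                lines("b", 1, 5, 3),
                lines("c", 1, 5, 6));
        System.out.println(joinAll(allLines)); // {1=a1:b1:c1}
    }
}
```

Because every step is an inner join, only keys present in all inputs survive, which is also what the chained RDD joins produce.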
I'm not familiar with the JavaRDD class/interface but perhaps you could solve this problem using the higher-order function reduce in Java 8, see https://docs.oracle.com/javase/tutorial/collections/streams/reduction.html
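final List<JavaPairRDD<Integer, String>> list = getList(); // where getList returns your list of pair RDDs (key and value types are assumed here)
// join() is defined on JavaPairRDD rather than on JavaRDD, and reduce() without an identity returns an Optional
final Optional<JavaPairRDD<Integer, String>> rdd = list.stream()
.reduce((left, right) -> left.join(right).mapValues(t -> t._1() + ":" + t._2()));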
An example with the String class would be something like:
Stream.of("foo", "bar", "baz").reduce(String::concat);
which produces an Optional containing
foobarbaz
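One detail worth spelling out: reduce with only a BinaryOperator returns an Optional, because the stream may be empty. The overload that takes an identity value returns the plain result. A short sketch of both, with made-up method names:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class ReduceOptionalSketch {

    // No identity value: the result is empty when the list is empty.
    static Optional<String> concatAll(List<String> parts) {
        return parts.stream().reduce(String::concat);
    }

    // With "" as identity: an empty list simply yields "".
    static String concatAllOrEmpty(List<String> parts) {
        return parts.stream().reduce("", String::concat);
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("foo", "bar", "baz");
        System.out.println(concatAll(parts));                                   // Optional[foobarbaz]
        System.out.println(concatAll(Collections.<String>emptyList()));         // Optional.empty
        System.out.println(concatAllOrEmpty(parts));                            // foobarbaz
    }
}
```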

How to map to multiple elements with Java 8 streams?

I have a class like this:
class MultiDataPoint {
private DateTime timestamp;
private Map<String, Number> keyToData;
}
and I want to produce, for each MultiDataPoint,
class DataSet {
public String key;
List<DataPoint> dataPoints;
}
class DataPoint{
DateTime timeStamp;
Number data;
}
Of course, a 'key' can be the same across multiple MultiDataPoints.
So given a List<MultiDataPoint>, how do I use Java 8 streams to convert to List<DataSet>?
This is how I am currently doing the conversion without streams:
Collection<DataSet> convertMultiDataPointToDataSet(List<MultiDataPoint> multiDataPoints)
{
Map<String, DataSet> setMap = new HashMap<>();
multiDataPoints.forEach(pt -> {
Map<String, Number> data = pt.getData();
data.entrySet().forEach(e -> {
String seriesKey = e.getKey();
DataSet dataSet = setMap.get(seriesKey);
if (dataSet == null)
{
dataSet = new DataSet(seriesKey);
setMap.put(seriesKey, dataSet);
}
dataSet.dataPoints.add(new DataPoint(pt.getTimestamp(), e.getValue()));
});
});
return setMap.values();
}
It's an interesting question, because it shows that there are a lot of different approaches to achieve the same result. Below I show three different implementations.
Default methods in Collection Framework: Java 8 added some methods to the collection classes that are not directly related to the Stream API. Using these methods, you can significantly simplify the non-stream implementation:
Collection<DataSet> convert(List<MultiDataPoint> multiDataPoints) {
Map<String, DataSet> result = new HashMap<>();
multiDataPoints.forEach(pt ->
pt.keyToData.forEach((key, value) ->
result.computeIfAbsent(
key, k -> new DataSet(k, new ArrayList<>()))
.dataPoints.add(new DataPoint(pt.timestamp, value))));
return result.values();
}
Stream API with flatten and intermediate data structure: The following implementation is almost identical to the solution provided by Stuart Marks. In contrast to his solution, the following implementation uses an anonymous inner class as intermediate data structure.
Collection<DataSet> convert(List<MultiDataPoint> multiDataPoints) {
return multiDataPoints.stream()
.flatMap(mdp -> mdp.keyToData.entrySet().stream().map(e ->
new Object() {
String key = e.getKey();
DataPoint dataPoint = new DataPoint(mdp.timestamp, e.getValue());
}))
.collect(
collectingAndThen(
groupingBy(t -> t.key, mapping(t -> t.dataPoint, toList())),
m -> m.entrySet().stream().map(e -> new DataSet(e.getKey(), e.getValue())).collect(toList())));
}
Stream API with map merging: Instead of flattening the original data structures, you can also create a Map for each MultiDataPoint, and then merge all maps into a single map with a reduce operation. The code is a bit simpler than the above solution:
Collection<DataSet> convert(List<MultiDataPoint> multiDataPoints) {
return multiDataPoints.stream()
.map(mdp -> mdp.keyToData.entrySet().stream()
.collect(toMap(e -> e.getKey(), e -> asList(new DataPoint(mdp.timestamp, e.getValue())))))
.reduce(new HashMap<>(), mapMerger())
.entrySet().stream()
.map(e -> new DataSet(e.getKey(), e.getValue()))
.collect(toList());
}
You can find an implementation of the map merger within the Collectors class. Unfortunately, it is a bit tricky to access it from the outside. Following is an alternative implementation of the map merger:
<K, V> BinaryOperator<Map<K, List<V>>> mapMerger() {
return (lhs, rhs) -> {
Map<K, List<V>> result = new HashMap<>();
lhs.forEach((key, value) -> result.computeIfAbsent(key, k -> new ArrayList<>()).addAll(value));
rhs.forEach((key, value) -> result.computeIfAbsent(key, k -> new ArrayList<>()).addAll(value));
return result;
};
}
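A quick way to see what the merger does is to run it on two small maps. The sketch below repeats the merger so it compiles on its own and adds a hypothetical mergeAll helper that performs the same reduce as above. Note that the merger copies both inputs into a fresh map, so neither argument (including the identity map passed to reduce) is modified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

final class MapMergerSketch {

    // Same idea as the merger above: copy both inputs into a fresh map of fresh lists.
    static <K, V> BinaryOperator<Map<K, List<V>>> mapMerger() {
        return (lhs, rhs) -> {
            Map<K, List<V>> result = new HashMap<>();
            lhs.forEach((key, value) -> result.computeIfAbsent(key, k -> new ArrayList<>()).addAll(value));
            rhs.forEach((key, value) -> result.computeIfAbsent(key, k -> new ArrayList<>()).addAll(value));
            return result;
        };
    }

    // Folds a list of partial maps into one, starting from an empty identity map.
    static <K, V> Map<K, List<V>> mergeAll(List<Map<K, List<V>>> maps) {
        return maps.stream().reduce(new HashMap<K, List<V>>(), MapMergerSketch.<K, V>mapMerger());
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> first = new HashMap<>();
        first.put("a", Arrays.asList(1));
        first.put("b", Arrays.asList(2));
        Map<String, List<Integer>> second = Collections.singletonMap("a", Arrays.asList(3));

        List<Map<String, List<Integer>>> parts = Arrays.asList(first, second);
        Map<String, List<Integer>> merged = mergeAll(parts);
        System.out.println(merged.get("a")); // [1, 3]
        System.out.println(merged.get("b")); // [2]
        System.out.println(first.get("a"));  // [1]
    }
}
```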
To do this, I had to come up with an intermediate data structure:
class KeyDataPoint {
String key;
DateTime timestamp;
Number data;
// obvious constructor and getters
}
With this in place, the approach is to "flatten" each MultiDataPoint into a list of (timestamp, key, data) triples and stream together all such triples from the list of MultiDataPoint.
Then, we apply a groupingBy operation on the string key in order to gather the data for each key together. Note that a simple groupingBy would result in a map from each string key to a list of the corresponding KeyDataPoint triples. We don't want the triples; we want DataPoint instances, which are (timestamp, data) pairs. To do this we apply a "downstream" collector of the groupingBy which is a mapping operation that constructs a new DataPoint by getting the right values from the KeyDataPoint triple. The downstream collector of the mapping operation is simply toList which collects the DataPoint objects of the same group into a list.
Now we have a Map<String, List<DataPoint>> and we want to convert it to a collection of DataSet objects. We simply stream out the map entries and construct DataSet objects, collect them into a list, and return it.
The code ends up looking like this:
Collection<DataSet> convertMultiDataPointToDataSet(List<MultiDataPoint> multiDataPoints) {
return multiDataPoints.stream()
.flatMap(mdp -> mdp.getData().entrySet().stream()
.map(e -> new KeyDataPoint(e.getKey(), mdp.getTimestamp(), e.getValue())))
.collect(groupingBy(KeyDataPoint::getKey,
mapping(kdp -> new DataPoint(kdp.getTimestamp(), kdp.getData()), toList())))
.entrySet().stream()
.map(e -> new DataSet(e.getKey(), e.getValue()))
.collect(toList());
}
I took some liberties with constructors and getters, but I think they should be obvious.
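For completeness, here is a self-contained sketch of the same pipeline with the constructors filled in. To keep it free of external libraries, the timestamp is a plain long instead of DateTime, and fields are accessed directly instead of through getters; both are simplifications of this sketch. Grouping into a TreeMap is also an assumption made here so that the resulting DataSet list has a predictable order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class DataSetConversionSketch {

    static final class MultiDataPoint {
        final long timestamp; // stand-in for DateTime
        final Map<String, Number> keyToData;
        MultiDataPoint(long timestamp, Map<String, Number> keyToData) {
            this.timestamp = timestamp;
            this.keyToData = keyToData;
        }
    }

    static final class DataPoint {
        final long timestamp;
        final Number data;
        DataPoint(long timestamp, Number data) {
            this.timestamp = timestamp;
            this.data = data;
        }
    }

    static final class DataSet {
        final String key;
        final List<DataPoint> dataPoints;
        DataSet(String key, List<DataPoint> dataPoints) {
            this.key = key;
            this.dataPoints = dataPoints;
        }
    }

    // Intermediate (key, timestamp, data) triple used while flattening.
    static final class KeyDataPoint {
        final String key;
        final long timestamp;
        final Number data;
        KeyDataPoint(String key, long timestamp, Number data) {
            this.key = key;
            this.timestamp = timestamp;
            this.data = data;
        }
    }

    static List<DataSet> convert(List<MultiDataPoint> points) {
        // Flatten to triples, then group by key while mapping each triple to a DataPoint.
        Map<String, List<DataPoint>> byKey = points.stream()
                .flatMap(mdp -> mdp.keyToData.entrySet().stream()
                        .map(e -> new KeyDataPoint(e.getKey(), mdp.timestamp, e.getValue())))
                .collect(Collectors.groupingBy(
                        kdp -> kdp.key,
                        TreeMap::new,
                        Collectors.mapping(kdp -> new DataPoint(kdp.timestamp, kdp.data),
                                Collectors.toList())));
        // Turn every map entry into a DataSet, in key order.
        List<DataSet> result = new ArrayList<>();
        byKey.forEach((key, dataPoints) -> result.add(new DataSet(key, dataPoints)));
        return result;
    }
}
```

With two input points, one at time 1 carrying "cpu" and "mem" and one at time 2 carrying only "cpu", convert returns a "cpu" DataSet with two data points followed by a "mem" DataSet with one.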
