stream of objects containing multiple properties question - java

I am trying to solve the following:
Given a list of Data objects try, in a 'one shot' like operation, stream the list, such that the end result will be a generic object or a data object where each prop get its own sum/max/min:
class Data {
    int prop1;
    int prop2;
    ...
    // constructor
    // getters and setters
}
For example, given a list of 2 Data objects as follows:
List<Data> list = Arrays.asList(new Data(1,2), new Data(3,4));
If I apply max to the first property and sum to the second one, the result is an object with prop1=3 and prop2=6, i.e. Data(3, 6).
Thanks for helping!

I am trying, in a 'one shot' like operation
You are looking for the Teeing Collector introduced in Java 12. Given a list of Data, where a Data class is something like:
@AllArgsConstructor
@Getter
@ToString
public class Data {
    int prop1;
    int prop2;
}
and a list:
List<Data> data = List.of(
        new Data(1, 10),
        new Data(2, 20),
        new Data(3, 30),
        new Data(4, 40)
);
the end result will be a generic object or a data object
if I apply max to the first prop and sum to the second
You can use Collectors.teeing to get a new Data object with the result of your operations:
Data result = data.stream().collect(Collectors.teeing(
        Collectors.reducing(BinaryOperator.maxBy(Comparator.comparing(Data::getProp1))),
        Collectors.summingInt(Data::getProp2),
        (optionalMax, sum) -> new Data(optionalMax.get().getProp1(), sum)
)); // with the list above: Data(prop1=4, prop2=100)
Or something else, for example a Map<String, Integer>:
Map<String, Integer> myMap = data.stream().collect(Collectors.teeing(
        Collectors.reducing(BinaryOperator.maxBy(Comparator.comparing(Data::getProp1))),
        Collectors.summingInt(Data::getProp2),
        (optionalMax, sum) -> {
            Map<String, Integer> map = new HashMap<>();
            map.put("max_prop1", optionalMax.get().getProp1());
            map.put("sum_prop2", sum);
            return map;
        }
)); // with the list above: {max_prop1=4, sum_prop2=100}
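For reference, these snippets only need the standard library imports (plus Lombok for the annotations on Data); a minimal set would be:
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;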

EDIT
After reading your comments I better understood what you meant and what you need. If your final goal is to create a Data which holds the results of multiple computations, then the stream operation teeing is still a viable solution.
However, since teeing accepts only 2 downstreams and a merger BiFunction to merge their results, you need to nest your teeing calls: the first downstream of each call performs one of the operations you need, while the second downstream is another teeing call. Basically, every second downstream of each nested call uses a teeing operation until you're left with only two computations. Then, the merger function of every outer call takes the first downstream's result and the nested call's result, merges them together, and creates a new Data object with them.
Here is an example with a hypothetical Data class whose properties represent: min value, max value, average and sum:
@lombok.Data
class Data {
    private @NonNull int prop1Min;
    private @NonNull int prop2Max;
    private @NonNull int prop3Avg;
    private @NonNull int prop4Sum;
}
public class Main {
    public static void main(String[] args) {
        List<Data> data = List.of(
                new Data(1, 10, 100, 1000),
                new Data(2, 20, 200, 2000),
                new Data(3, 30, 300, 3000),
                new Data(4, 40, 400, 4000)
        );

        Data result = data.stream().collect(Collectors.teeing(
                Collectors.minBy(Comparator.comparing(Data::getProp1Min)),
                Collectors.teeing(
                        Collectors.maxBy(Comparator.comparing(Data::getProp2Max)),
                        Collectors.teeing(
                                Collectors.averagingInt(Data::getProp3Avg),
                                Collectors.summingInt(Data::getProp4Sum),
                                (avg, sum) -> new Data(0, 0, avg.intValue(), sum.intValue())),
                        (max, d) -> new Data(0, max.get().getProp2Max(), d.getProp3Avg(), d.getProp4Sum())
                ),
                (min, d) -> new Data(min.get().getProp1Min(), d.getProp2Max(), d.getProp3Avg(), d.getProp4Sum())
        ));

        System.out.println(result);
    }
}
Output
Data(prop1Min=1, prop2Max=40, prop3Avg=250, prop4Sum=10000)
Previous Answer
It sounds like you're trying to retrieve some statistics from a stream of elements.
I am trying [...] to stream the list, such that the end result will be a generic object or a data object where each prop get its own sum/max/min etc.
For this purpose, there is already the IntSummaryStatistics class which includes a set of statistics gathered from a set of int elements. To obtain this information, you just need to stream your elements and invoke the collect operation by supplying Collectors.summarizingInt(); this will return the statistics of your elements. Moreover, Java also provides LongSummaryStatistics and DoubleSummaryStatistics to retrieve statistics of long and double types.
List<Integer> list = new ArrayList<>(List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));
IntSummaryStatistics stats = list.stream()
        .collect(Collectors.summarizingInt(Integer::intValue));
System.out.println("Count: " + stats.getCount());
System.out.println("Sum: " + stats.getSum());
System.out.println("Min Value: " + stats.getMin());
System.out.println("Max Value: " + stats.getMax());
System.out.println("Average: " + stats.getAverage());
// If Data cannot simply be replaced by IntSummaryStatistics and is actually a needed class,
// then you can set the properties you need from the IntSummaryStatistics
Data d = new Data();
d.setMinProp(stats.getMin());
d.setMaxProp(stats.getMax());
d.setSumProp((int) stats.getSum()); // getSum() returns a long
//--------- Data class ---------
class Data {
    private int minProp, maxProp, sumProp;
    //... rest of the implementation ...
    public void setMinProp(int minProp) {
        this.minProp = minProp;
    }
    public void setMaxProp(int maxProp) {
        this.maxProp = maxProp;
    }
    public void setSumProp(int sumProp) {
        this.sumProp = sumProp;
    }
}
Output
Count: 10
Sum: 45
Min Value: 0
Max Value: 9
Average: 4.5

Well, if they're only reducing functions, then you could utilize reduce as well:
Data someNewData = someData.stream()
        .reduce((Data l, Data r) -> {
            int a = l.prop1() + r.prop1();          // Find sum of prop1
            int b = Math.max(l.prop2(), r.prop2()); // Find max value of prop2
            int c = Math.min(l.prop3(), r.prop3()); // Find min value of prop3
            return new Data(a, b, c);
        })
        .orElseThrow(); // Throws if someData is empty

Related

group objects using stream collector

I have a list of objects, let's say of class Document:
class Document {
    private final String id;
    private final int length;

    public Document(String id, int length) {
        this.id = id;
        this.length = length;
    }

    public int getLength() {
        return length;
    }
}
The task at hand is to group them into Envelopes so that the number of pages (Document.length) does not exceed a certain number.
class Envelope {
    private final List<Document> documents = new ArrayList<>();
}
So for example, if I had the following List of Documents:
Document doc0 = new Document("doc0", 2);
Document doc1 = new Document("doc1", 5);
Document doc2 = new Document("doc2", 5);
Document doc3 = new Document("doc3", 5);
and the max page count per envelope is, let's say, 7, then I expect 3 envelopes with the following documents:
Assert.assertEquals(3, envelopeList.size());
Assert.assertEquals(2, envelopeList.get(0).getDocuments().size()); // doc0, doc1
Assert.assertEquals(1, envelopeList.get(1).getDocuments().size()); // doc2
Assert.assertEquals(1, envelopeList.get(2).getDocuments().size()); // doc3
I have implemented this with a traditional for loop and a bunch of ifs, but the question is: is it possible to do this in a more elegant way with streams and collectors?
thank you and best regards
Dalibor
For batching the documents based on their lengths, we need to maintain the state of the accumulated length. Streams are not the best choice when external state needs to be maintained, and a custom loop is the simpler and more efficient option, as sketched below.
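Here is a minimal sketch of that custom loop (assuming the Envelope class from the question exposes a getDocuments() accessor, and that MAX is the page limit, 7 in the example):
static List<Envelope> splitWithLoop(List<Document> docs) {
    List<Envelope> envelopes = new ArrayList<>();
    Envelope current = new Envelope();
    int accumulated = 0;
    for (Document doc : docs) {
        // Start a new envelope once adding this document would exceed the limit
        if (!current.getDocuments().isEmpty() && accumulated + doc.getLength() > MAX) {
            envelopes.add(current);
            current = new Envelope();
            accumulated = 0;
        }
        current.getDocuments().add(doc);
        accumulated += doc.getLength();
    }
    if (!current.getDocuments().isEmpty()) {
        envelopes.add(current); // Flush the last, partially filled envelope
    }
    return envelopes;
}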
If we force-fit streams for this scenario, the implementation would change as below:
public static List<Envelope> splitDocuments(List<Document> docs) {
    // MAX is the maximum page count per envelope (7 in the example above)
    IntUnaryOperator helper = new IntUnaryOperator() {
        private int bucketIndex = 0;
        private int accumulated = 0;

        public synchronized int applyAsInt(int length) {
            if (length + accumulated > MAX) {
                bucketIndex++;
                accumulated = 0;
            }
            accumulated += length;
            return bucketIndex;
        }
    };
    return new ArrayList<>(docs.stream()
            .map(d -> new AbstractMap.SimpleEntry<>(helper.applyAsInt(d.getLength()), d))
            .collect(Collectors.groupingBy(AbstractMap.SimpleEntry::getKey,
                    TreeMap::new, // keeps the buckets in index order
                    Collector.of(Envelope::new,
                            (e, entry) -> e.getDocuments().add(entry.getValue()),
                            (e1, e2) -> { e1.getDocuments().addAll(e2.getDocuments()); return e1; })))
            .values());
}
Explanation:
helper maintains the accumulated length and provides a new bucket index when it exceeds the max. I have used the IntUnaryOperator interface here. Alternatively, we can use any interface that takes an int parameter and returns an int.
Regarding the stream,
Each Document is mapped to a SimpleEntry of bucketIndex and Document.
The stream of SimpleEntry is first grouped based on the bucketIndex. Another Collector transforms the stream of Document for a particular bucketIndex into an Envelope. The output of collect() is a Map<Integer, Envelope>.
Finally, the Collection of Envelope values is converted to a list and returned.
Note: For this implementation, I removed the front parameter and included it as part of the docs list.

Apache Spark Decision Tree Predictions

I have the following code for classification using decision trees. I need to get the predictions for the test dataset into a Java array and print them. Can someone help me extend this code for that? I need to have a 2D array of predicted and actual labels, and to print the predicted labels.
public class DecisionTreeClass {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("DecisionTreeClass").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // Load and parse the data file.
        String datapath = "/home/thamali/Desktop/tlib.txt";
        // A training example used in supervised learning is called a "labeled point" in MLlib.
        JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();

        // Split the data into training and test sets (30% held out for testing)
        JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
        JavaRDD<LabeledPoint> trainingData = splits[0];
        JavaRDD<LabeledPoint> testData = splits[1];

        // Set parameters.
        // Empty categoricalFeaturesInfo indicates all features are continuous.
        Integer numClasses = 12;
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
        String impurity = "gini";
        Integer maxDepth = 5;
        Integer maxBins = 32;

        // Train a DecisionTree model for classification.
        final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses,
                categoricalFeaturesInfo, impurity, maxDepth, maxBins);

        // Evaluate model on test instances and compute test error
        JavaPairRDD<Double, Double> predictionAndLabel =
                testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
                    @Override
                    public Tuple2<Double, Double> call(LabeledPoint p) {
                        return new Tuple2<>(model.predict(p.features()), p.label());
                    }
                });
        Double testErr =
                1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
                    @Override
                    public Boolean call(Tuple2<Double, Double> pl) {
                        return !pl._1().equals(pl._2());
                    }
                }).count() / testData.count();
        System.out.println("Test Error: " + testErr);
        System.out.println("Learned classification tree model:\n" + model.toDebugString());
    }
}
You basically have exactly that in the predictionAndLabel variable. If you really need a list of 2D double arrays, you can change the method that you use to:
JavaRDD<double[]> valuesAndPreds = testData.map(point -> new double[]{model.predict(point.features()), point.label()});
and run collect on that RDD to get a list of 2D double arrays.
List<double[]> values = valuesAndPreds.collect();
I would take a look at the docs here: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html. You can also change the data to get additional statistical performance measurements of your model with classes like MulticlassMetrics. This requires changing the mapToPair function to a map function and changing the generics to Object. So something like:
JavaRDD<Tuple2<Object, Object>> valuesAndPreds = testData.map(point -> new Tuple2<>(model.predict(point.features()), point.label()));
Then running:
MulticlassMetrics multiclassMetrics = new MulticlassMetrics(JavaRDD.toRDD(valuesAndPreds));
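From there you can, for example, print the confusion matrix (a minimal sketch; the full set of available metrics is in the linked docs):
System.out.println("Confusion matrix:\n" + multiclassMetrics.confusionMatrix());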
All of this stuff is very well documented in Spark's MLlib documentation. Also, you mentioned needing to print the results. If this is homework, I will let you figure out that part, since it would be a good exercise to learn how to do that from a list.
Edit:
Also, I noticed that you are using Java 7, and what I have above is from Java 8. To answer your main question of how to turn this into a 2D double array, you would do:
JavaRDD<double[]> valuesAndPreds = testData.map(new org.apache.spark.api.java.function.Function<LabeledPoint, double[]>() {
    @Override
    public double[] call(LabeledPoint point) {
        return new double[]{model.predict(point.features()), point.label()};
    }
});
Then run collect to get a list of double arrays. Also, to give a hint on the printing part, take a look at the java.util.Arrays.toString implementation.
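Following that hint, a minimal sketch of the printing step (using the valuesAndPreds RDD from above) could be:
List<double[]> values = valuesAndPreds.collect();
for (double[] pair : values) {
    // Arrays.toString renders each {prediction, label} pair, e.g. [1.0, 1.0]
    System.out.println(java.util.Arrays.toString(pair));
}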

Java - Basic streams usage of forEach

I have a class called Data which has only one method:
public boolean isValid()
I have a List of Data and I want to loop through them via a Java 8 stream. I need to count how many valid Data objects there are in this List and print out only the valid entries.
Below is how far I've gotten, but I don't understand how to finish it.
List<Data> ar = new ArrayList<>();
...
// ar is now full of Data objects.
...
int count = ar.stream()
        .filter(Data::isValid)
        .forEach(System.out::println)
        .count(); // Compiler error: forEach() returns void, not a Stream.
My second attempt: (horrible code)
List<Data> ar = new ArrayList<>();
...
// Must be final or a compiler error will happen via the inner class.
final AtomicInteger counter = new AtomicInteger();
ar.stream()
        .filter(Data::isValid)
        .forEach(d ->
        {
            System.out.println(d);
            counter.incrementAndGet();
        });
System.out.printf("There are %d/%d valid Data objects.%n", counter.get(), ar.size());
If you don't need the original ArrayList, containing a mixture of valid and invalid objects, later on, you might simply perform a Collection operation instead of the Stream operation:
ar.removeIf(d -> !d.isValid());
ar.forEach(System.out::println);
int count = ar.size();
Otherwise, you can implement it like
List<Data> valid = ar.stream().filter(Data::isValid).collect(Collectors.toList());
valid.forEach(System.out::println);
int count = valid.size();
Having storage for something you need multiple times is not so bad. If the list is really large, you can reduce the storage memory by (typically) a factor of 32, since a BitSet stores one bit per index rather than an object reference, using
BitSet valid = IntStream.range(0, ar.size())
        .filter(index -> ar.get(index).isValid())
        .collect(BitSet::new, BitSet::set, BitSet::or);
valid.stream().mapToObj(ar::get).forEach(System.out::println);
int count = valid.cardinality();
Though, of course, you can also use
int count = 0;
for (Data d : ar) {
    if (d.isValid()) {
        System.out.println(d);
        count++;
    }
}
peek is similar to forEach, except that it lets you continue the stream:
long count = ar.stream()
        .filter(Data::isValid)
        .peek(System.out::println)
        .count();
Beware that since Java 9, count() may skip executing the pipeline when the element count is directly computable from the source; the filter here prevents that shortcut, but relying on peek for side effects is generally fragile.

In java 8 using stream API, how to return instance from Map with multiple calculations required

Suppose there is a class like this:
class A {
    long sent;
    long received;
    double val; // given as max {(double)sent/someDenominator, (double)received/someDenominator}
}
of which there are a number of instances referenced in a Map<String, A>.
Is it possible, in one go using the Stream API, to return an instance of class A with the following properties:
sent = sum of sent fields from all instances
received = sum of received fields from all instances in Map
val = maximum value of val, given all entries where val = max {sent/someDenominator,received/someDenominator}
This would be a trivial task using a standard for loop and one iteration, but I don't have a clue how to achieve it with the Stream API.
You could use reduce:
Optional<A> a = map.values()
        .stream()
        .reduce((a1, a2) -> new A(a1.sent + a2.sent, a1.received + a2.received, Math.max(a1.val, a2.val)));
If your A objects are mutable, then a more efficient solution is possible, based on the collect() method. Add a method to A which describes the merging strategy:
class A {
    long sent;
    long received;
    double val;

    void merge(A other) {
        sent += other.sent;
        received += other.received;
        val = Math.max(val, other.val);
    }
}
Now you can write
A a = map.values().stream().collect(A::new, A::merge, A::merge);
This way you will not have to create an intermediate A object for every reduction step: a single common object will be reused instead.
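For instance, a quick usage sketch (assuming, hypothetically, that A also has an all-args constructor A(long sent, long received, double val), and sticking to Java 8 APIs):
Map<String, A> map = new HashMap<>();
map.put("x", new A(1, 10, 0.5));
map.put("y", new A(2, 20, 0.7));
A merged = map.values().stream().collect(A::new, A::merge, A::merge);
// merged.sent == 3, merged.received == 30, merged.val == 0.7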

Encounter Order wrong when sorting a parallel stream

I have a Record class:
public class Record implements Comparable<Record>
{
    private String myCategory1;
    private int myCategory2;
    private String myCategory3;
    private String myCategory4;
    private int myValue1;
    private double myValue2;

    public Record(String category1, int category2, String category3, String category4,
                  int value1, double value2)
    {
        myCategory1 = category1;
        myCategory2 = category2;
        myCategory3 = category3;
        myCategory4 = category4;
        myValue1 = value1;
        myValue2 = value2;
    }

    // Getters here
}
I create a big list of a lot of records. Only the second and fifth values, i / 10000 and i, are used later, by the getters getCategory2() and getValue1() respectively.
List<Record> list = new ArrayList<>();
for (int i = 0; i < 115000; i++)
{
    list.add(new Record("A", i / 10000, "B", "C", i, (double) i / 100 + 1));
}
Note that first 10,000 records have a category2 of 0, then next 10,000 have 1, etc., while the value1 values are 0-114999 sequentially.
I create a Stream that is both parallel and sorted.
Stream<Record> stream = list.stream()
        .parallel()
        .sorted(
            //(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
        )
        //.parallel()
        ;
I have a ForkJoinPool that maintains 8 threads, which is the number of cores I have on my PC.
ForkJoinPool pool = new ForkJoinPool(8);
I use the trick described here to submit a stream processing task to my own ForkJoinPool instead of the common ForkJoinPool.
List<Record> output = pool.submit(() ->
        stream.collect(Collectors.toList())
).get();
I expected that the parallel sorted operation would respect the encounter order of the stream, and that it would be a stable sort, because the Spliterator returned by ArrayList is ORDERED.
However, simple code that prints out the elements of the resultant List output in order shows that it's not quite the case.
for (Record record : output)
{
    System.out.println(record.getValue1());
}
Output, condensed:
0
1
2
3
...
69996
69997
69998
69999
71875 // discontinuity!
71876
71877
71878
...
79058
79059
79060
79061
70000 // discontinuity!
70001
70002
70003
...
71871
71872
71873
71874
79062 // discontinuity!
79063
79064
79065
79066
...
114996
114997
114998
114999
The size() of output is 115000, and all elements appear to be there, just in a slightly different order.
So I wrote some checking code to see if the sort was stable. If it's stable, then all of the value1 values should remain in order. This code verifies the order, printing any discrepancies.
int prev = -1;
boolean verified = true;
for (Record record : output)
{
    int curr = record.getValue1();
    if (prev != -1)
    {
        if (prev + 1 != curr)
        {
            System.out.println("Warning: " + prev + " followed by " + curr + "!");
            verified = false;
        }
    }
    prev = curr;
}
System.out.println("Verified: " + verified);
System.out.println("Verified: " + verified);
Output:
Warning: 69999 followed by 71875!
Warning: 79061 followed by 70000!
Warning: 71874 followed by 79062!
Warning: 99999 followed by 100625!
Warning: 107811 followed by 100000!
Warning: 100624 followed by 107812!
Verified: false
This condition persists if I do any of the following:
Replace the ForkJoinPool with a ThreadPoolExecutor.
ThreadPoolExecutor pool = new ThreadPoolExecutor(8, 8, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(10));
Use the common ForkJoinPool by processing the Stream directly.
List<Record> output = stream.collect(Collectors.toList());
Call parallel() after I call sorted.
Stream<Record> stream = list.stream().sorted().parallel();
Call parallelStream() instead of stream().parallel().
Stream<Record> stream = list.parallelStream().sorted();
Sort using a Comparator. Note that this sort criterion is different from the "natural" order I defined via the Comparable interface, although since the results are already in order from the beginning, the result should still be the same.
Stream<Record> stream = list.stream().parallel().sorted(
    (r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
);
I can only get this to preserve the encounter order if I do one of the following on the Stream:
Don't call parallel().
Don't call any overload of sorted().
Interestingly, the parallel() without a sort preserved the order.
In both of the above cases, the output is:
Verified: true
My Java version is 1.8.0_05. This anomaly also occurs on Ideone, which appears to be running Java 8u25.
Update
I've upgraded my JDK to the latest version as of this writing, 1.8.0_45, and the problem is unchanged.
Question
Is the record order in the resultant List (output) wrong because the sort is somehow not stable, because the encounter order is not preserved, or for some other reason?
How can I ensure that the encounter order is preserved when I create a parallel stream and sort it?
It looks like Arrays.parallelSort isn't stable in some circumstances. Well spotted. The stream parallel sort is implemented in terms of Arrays.parallelSort, so it affects streams as well. Here's a simplified example:
public class StableSortBug {
    static final int SIZE = 50_000;

    static class Record implements Comparable<Record> {
        final int sortVal;
        final int seqNum;

        Record(int i1, int i2) { sortVal = i1; seqNum = i2; }

        @Override
        public int compareTo(Record other) {
            return Integer.compare(this.sortVal, other.sortVal);
        }
    }

    static Record[] genArray() {
        Record[] array = new Record[SIZE];
        Arrays.setAll(array, i -> new Record(i / 10_000, i));
        return array;
    }

    static boolean verify(Record[] array) {
        return IntStream.range(1, array.length)
                        .allMatch(i -> array[i-1].seqNum + 1 == array[i].seqNum);
    }

    public static void main(String[] args) {
        Record[] array = genArray();
        System.out.println(verify(array));
        Arrays.sort(array);
        System.out.println(verify(array));
        Arrays.parallelSort(array);
        System.out.println(verify(array));
    }
}
On my machine (2 core x 2 threads) this prints the following:
true
true
false
Of course, it's supposed to print true three times. This is on the current JDK 9 dev builds. I wouldn't be surprised if it occurs in all the JDK 8 releases thus far, given what you've tried. Curiously, reducing the size or the divisor will change the behavior. A size of 20,000 and a divisor of 10,000 is stable, and a size of 50,000 and a divisor of 1,000 is also stable. It seems like the problem has to do with a sufficiently large run of values comparing equal versus the parallel split size.
The OpenJDK issue JDK-8076446 covers this bug.
