Aggregate information using Java 8 streams

I'm still trying to fully grasp working with the Stream package in Java 8 and was hoping for some help.
I have a class, described below, instances of which I receive in a list as part of a database call.
class VisitSummary {
    String source;
    DateTime timestamp;
    Integer errorCount;
    Integer trafficCount;
    // Other fields
}
To generate some possibly useful information about this, I have a class VisitSummaryBySource which holds the sum total of all visits (for a given timeframe):
class VisitSummaryBySource {
    String sourceName;
    Integer recordCount;
    Integer errorCount;
}
I was hoping to construct a List<VisitSummaryBySource> collection which, as the name suggests, holds the VisitSummaryBySource objects containing the total of records and errors encountered for each distinct source.
Is there a way I can achieve this using streams in a single operation? Or do I necessarily need to break this down into multiple operations? The best I could come up with is:
Map<String, Integer> recordsBySrc =
    data.parallelStream()
        .collect(Collectors.groupingBy(VisitSummaryBySource::getSource,
            Collectors.summingInt(VisitSummaryBySource::getRecordCount)));
and to calculate the errors:
Map<String, Integer> errorsBySrc =
    data.parallelStream()
        .collect(Collectors.groupingBy(VisitSummaryBySource::getSource,
            Collectors.summingInt(VisitSummaryBySource::getErrorCount)));
and merging the two maps to come up with the list I'm looking for.

You're on the right track. The uses of Collectors.summingInt are examples of downstream collectors of the outer groupingBy collector. This operation extracts one of the integer values from each VisitSummaryBySource instance in the same group, and sums them. This is essentially a reduction over integers.
The problem, as you note, is that you can extract/reduce only one of the integer values, so you have to perform a second pass to extract/reduce the other integer values.
The key is to consider reduction not over the individual integer values but over the entire VisitSummaryBySource object. Reduction takes a BinaryOperator, which takes two instances of the type in question and combines them into one. Here's how to do that, by adding a static method to VisitSummaryBySource:
static VisitSummaryBySource merge(VisitSummaryBySource a, VisitSummaryBySource b) {
    assert a.getSource().equals(b.getSource());
    return new VisitSummaryBySource(a.getSource(),
                                    a.getRecordCount() + b.getRecordCount(),
                                    a.getErrorCount() + b.getErrorCount());
}
Note that we're not actually merging the source names. Since this reduction is only performed within a group, where the source names are the same, we assert that we can only merge two instances whose names are the same. We also assume the obvious constructor taking a name, record count, and error count, and call that to create the merged object, containing the sums of the counts.
Now our stream looks like this:
Map<String, Optional<VisitSummaryBySource>> map =
    data.stream()
        .collect(groupingBy(VisitSummaryBySource::getSource,
                            reducing(VisitSummaryBySource::merge)));
Note that this reduction produces map values of type Optional<VisitSummaryBySource>. This is somewhat odd; we'll deal with it below. We could avoid the Optional by using another form of the reducing collector that takes an identity value. This is possible but somewhat nonsensical, as there's no good value to use for the source name of the identity. (We could use something like the empty string, but we'd have to abandon our assertion that we merge only objects whose source names are equal.)
We don't really care about the map; it only needs to be kept around long enough to reduce the VisitSummaryBySource instances. Once that's done, we can just pull out the map values using values() and throw away the map.
We can also turn this back into a stream and unwrap the Optional by mapping them through Optional::get. This is safe, because a value never ends up in the map unless there's at least one member of the group.
Finally, we collect the results into a list.
The final code looks like this:
List<VisitSummaryBySource> output =
    data.stream()
        .collect(groupingBy(VisitSummaryBySource::getSource,
                            reducing(VisitSummaryBySource::merge)))
        .values().stream()
        .map(Optional::get)
        .collect(toList());
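If the Optional wrapping feels awkward, a hedged alternative sketch (assuming the same merge method and a static import of Collectors.toMap) is toMap with a merge function; no Optional appears, because the merge function is invoked only when two values collide on a key:
List<VisitSummaryBySource> output = new ArrayList<>(
    data.stream()
        .collect(toMap(VisitSummaryBySource::getSource,  // key: the source name
                       v -> v,                           // value: the instance itself
                       VisitSummaryBySource::merge))     // called only on key collision
        .values());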

Related

Iterating over a List of nested objects and storing them into a HashMap using Stream API

I am trying to iterate over a nested Java object and, from the nested objects, populate a map using Java 8.
Here is the list and the structure of the objects:
IssueTypeDto.java
public class IssueTypeDto {
    private String id;
    private String name;
    private List<CustomFieldDto> customFields;
}
CustomFieldDto.java
public class CustomFieldDto {
    private String id;
    private TypeDto type;
}
I am getting a List of IssueTypeDto as a response. Each IssueTypeDto object can have a list of CustomFieldDto objects. My requirement is to iterate over the list of IssueTypeDto objects and, for each CustomFieldDto object, get the id and the corresponding TypeDto object and insert them into a map as key and value. Here is what I am trying currently:
Map<String, TypeDto> map = issueTypes.stream()
    .flatMap(issueType -> issueType.getCustomFields().stream()
        .collect(Collectors.toMap(
            CustomFieldDto::getId,
            CustomFieldDto::getType)));
But I am getting a compile-time error: Type mismatch: cannot convert from Stream<Object> to Map<String,TypeDto>.
I am new to Java 8 streams. So I am not able to figure out the issue. Any help would be appreciated.
Let's start with the general concepts. You want to transform one object into another; the relationship can be 1-to-1 or 1-to-many. For 1-to-1 you use .map; for 1-to-many, .flatMap. In other words, map converts an object A into an object B, while flatMap converts an object A into a Stream of B. You were close: you just misplaced a ')'.
Your case should be like this:
issueTypes.stream()
    .flatMap(issueType -> issueType.getCustomFields().stream())
    .collect(Collectors.toMap(CustomFieldDto::getId, CustomFieldDto::getType));
Also, please make sure there are no duplicate keys across the different objects; otherwise Collectors.toMap() will throw an exception due to the key duplication.
The code you've provided does not compile for two reasons:
flatMap() is an intermediate operation which expects a function that returns a stream as an argument. Instead, you've created a nested stream that generates a map. Therefore, your function doesn't match what flatMap() expects.
Your stream pipeline lacks a terminal operation, therefore this assignment is incorrect:
Map<String, TypeDto> map = issueTypes.stream().flatMap(...);
There's a map on the left side and a stream on the right (because flatMap() is an intermediate operation and returns a stream).
To fix it, you need to make the function passed into the flatMap() return a stream, and add a terminal operation to the pipeline, i.e. produce the result from the main stream by moving collect() out from the flatMap().
That's how it can be done using Collectors.toMap() (in the comments you've mentioned that there could be duplicate ids):
Map<String, TypeDto> map = issueTypes.stream()
    .flatMap(issueType -> issueType.getCustomFields().stream())
    .collect(Collectors.toMap(
        CustomFieldDto::getId,    // mapping a key
        CustomFieldDto::getType,  // mapping a value
        (left, right) -> left     // resolving duplicates: preserve the first encountered value
    ));
Another way of handling duplicates is to preserve all of them, by storing the values mapped to the same key in a list. Note that a plain Collectors.groupingBy(CustomFieldDto::getId) would collect the whole CustomFieldDto objects, so a mapping() downstream collector is needed to get the TypeDto values:
Map<String, List<TypeDto>> map = issueTypes.stream()
    .flatMap(issueType -> issueType.getCustomFields().stream())
    .collect(Collectors.groupingBy(
        CustomFieldDto::getId,
        Collectors.mapping(CustomFieldDto::getType, Collectors.toList())));

What is the benefit of using a custom class over a map? [duplicate]

This question already has answers here:
Class Object vs Hashmap
(3 answers)
I have some code that returns the min and max values from some input. What are the benefits of using a custom class with minimum and maximum fields over a map that holds these two values?
//this is the class that holds the min and max values
public class MaxAndMinValues {
    private double minimum;
    private double maximum;
    //rest of the class code omitted
}
//this is the map that holds the min and max values
Map<String, Double> minAndMaxValuesMap
The most apparent answer would be the object-oriented aspects: the possibility to combine data with functionality, and the possibility to derive from that class.
But let's assume for the moment that is not a major factor, and that your example is so simplistic that I wouldn't use a Map either. What I would use is the Pair class from Apache Commons: https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/tuple/Pair.html
(ImmutablePair):
https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/tuple/ImmutablePair.html
The Pair class is generic and has two generic types, one for each field. You can define a Pair of something and get type safety, IDE support, autocompletion, and the big benefit of knowing what is inside. A Pair also offers things a Map cannot; for example, a Pair is potentially Comparable. See also ImmutablePair if you want to use it as a key in another Map.
public Pair<Double, Double> foo(...) {
    // ...
    Pair<Double, Double> range = Pair.of(minimum, maximum);
    return range;
}
The big advantage of this class is that the type you return exposes the contained types. So if you need to, you could return different types from a single method execution (without using a map or a complicated inner class).
e.g. Pair<String, Double> or Pair<String, List<Double>>...
In a simple situation where you just need to store the min and max values from user input, your custom class is a better choice than a Map. The reason: in Java, a Map object will be a HashMap, a LinkedHashMap, or a TreeMap, and it takes extra work to push your data into that structure and to get the values back out. So in a simple case like the one you describe, just use your custom class; moreover, you can add methods to the class to process the user input, which a Map cannot do for you.
I would look at this from the perspective of how the language is used. In any language there are multiple ways to achieve a result (easy, bad, complicated, performant...). In an object-oriented language like Java, this question is really about the design of your solution.
Think of accessibility.
The values in a Map are effectively public: any part of the code can modify the contents. If min and max must stay in the range [-100, 100] and some part of your code inserts 200 into the map, you have a bug. You could cover it with validation, but how many copies of that validation would you write? With an object, you always have encapsulation.
Think of re-use.
If you had the same requirement in another part of the code, you would have to rewrite the map logic again (probably with all the validations?). That doesn't look good, right?
Think of extensibility.
If you wanted one more value, like a median or an average, you would either have to pollute the map with extra keys or create a new map. An object is always easy to extend.
So it all comes down to design. For one-time usage a map may do (though it's not a clean design in any case; technically and functionally, a map should hold one kind of data).
Last but not least, think of readability and cognitive complexity: objects with clear responsibilities always beat unclear generic storage.
Hope I made some sense!
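To make the accessibility point concrete, here is a minimal sketch (the [-100, 100] range check is the hypothetical constraint mentioned above, and the getters are assumed):
public class MaxAndMinValues {
    private final double minimum;
    private final double maximum;

    public MaxAndMinValues(double minimum, double maximum) {
        // The invariant is enforced once, here, for every caller.
        if (minimum > maximum || minimum < -100 || maximum > 100) {
            throw new IllegalArgumentException("invalid range: " + minimum + ".." + maximum);
        }
        this.minimum = minimum;
        this.maximum = maximum;
    }

    public double getMinimum() { return minimum; }
    public double getMaximum() { return maximum; }
}
A Map<String, Double> offers no place to hang such a check; every writer would have to repeat it.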
The benefit is simple: it makes your code clearer and more robust.
The MaxAndMinValues name and its class definition (two fields) convey a min and a max value, but above all the class guarantees it will accept only those two things, and its API is self-explanatory about how to store and retrieve them.
Map<String, Double> minAndMaxValuesMap also conveys the idea that a min and a max value are stored in it, but it has multiple drawbacks in terms of design:
we don't know how to retrieve the values without looking at how they were added.
relatedly, how do we name the keys when we add entries to the map? The String key type is too broad: "MIN", "min", and "Minimum" would all be accepted. An enum would solve that particular issue, but not all of them.
we cannot ensure that both values (min and max) were actually added (whereas a constructor with arguments can enforce that).
we can add any other value to the map, since it is a Map and not a fixed data structure.
Beyond the general idea of clearer code, I would add that if MaxAndMinValues were used only as an implementation detail inside a specific method or lambda, a Map or even an array {15F, 20F} would be acceptable. But if these data are passed between methods, you should make their meaning as clear as possible.
We have also used a custom class over a HashMap when we needed to sort a map based on its values.

Is using a HashMap the simplest solution for storing an Object that has an ID?

I have a class in some code, ChatChannel (some unnecessary code omitted), that I'm having a bit of trouble with.
public class ChatChannel {
    private static HashMap<String, ChatChannel> registeredChannels = new HashMap<>(); // ChannelID, ChatChannel object

    public static void registerChannel(ChatChannel channel) {
        registeredChannels.put(channel.getId(), channel);
    }

    public static ChatChannel getChannelById(String id) {
        return registeredChannels.getOrDefault(id, null);
    }

    /** The actual ChatChannel item is defined BELOW THIS LINE **/
    private String name;
    private String id;

    public ChatChannel(String name, String id) {
        this.name = name;
        this.id = id;
    }

    public String getId() { // not static: it returns the instance field
        return id;
    }
}
Essentially, this class will allow me to separate messages sent by users into "channels." Users may only receive messages in joined channels, and may only send a message to their active channel. Channels should be accessible using their ID (for example, global).
However, my problem is I don't know whether I should use a HashMap or Collection in order to keep the code light and simple. Ideally, I'd like to be able to reference any ChatChannel by its id at any point in the code, so I don't need to constantly pass around these ChatChannels. What, if any, would the performance gain of using HashMap (and external IDs) be? Would it be roughly equal to using a Collection and then iterating through it using my getId() method? If so, which is considered "proper" Java?
To answer the stated question "Should I be using a HashMap or Collection for performance?": you can't and won't use a "Collection" in this sense, because Collection is an abstract concept, represented in Java as an interface.
A Collection could be a List or a Set, among other things (a Map is part of the collections framework, but is technically a separate interface hierarchy). You can write a method that, for example, accepts any kind of Collection and performs an operation on everything in it, but in your case you must decide what kind of structure to use in your implementation.
Since you're retrieving a channel given an identifier String, a Map is a useful choice because it is a key-to-value mapping; you don't have to iterate through it to find the element that has the desired key.
You should generally declare things generically, then instantiate them with a specific implementation. That is, when working with it in your code you don't care what sort of Map it is, just that it's a Map. The actual map that you allocate could be a HashMap or a LinkedHashMap or a TreeMap — since maintaining the insertion order or keeping things sorted doesn't seem to matter here, the plain HashMap appears appropriate.
private static Map<String, ChatChannel> registeredChannels = new HashMap<>();
//             ^^^ generic declaration                       specific implementation ^^^
You might also know something about how many channels there are likely to be, or at least the size of the starting set of channels, so you may also consider the initialCapacity and the loadFactor parameters to the constructor, for example
// Allocate with room for 10 initial channels; expand the map when 75% full
private static Map<String, ChatChannel> registeredChannels =
        new HashMap<>(10, 0.75f); // the load factor is a float
It is quite likely that you have IDs from a contiguous range, like 1, 2, 3, 4... or 110, 111, 112, 113, 114..., with maybe some holes. It then becomes easy to hash such a sequence onto 0, 1, 2, 3, 4....
Now you can use a plain array(!), which is as fast as lookup gets: the numbers 0..n map directly to array indices, and each slot holds a reference to the session data.
Basically, an array is a map: the key is the index, and the value is whatever the slot contains or points to.
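A minimal sketch of that idea (Session is a placeholder type and BASE the lowest ID in use; both are assumptions for illustration):
// IDs in a contiguous range starting at BASE map straight to array slots.
static final int BASE = 110;
static Session[] sessions = new Session[1024];

static Session byId(int id) {
    int index = id - BASE; // the "hash" is a plain offset
    return (index >= 0 && index < sessions.length) ? sessions[index] : null;
}
Note, though, that the question's channel IDs are Strings such as "global", so this trick only applies if you can switch to numeric IDs.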

Write a Collector or nest existing ones

I have a method in a data structure that I wish to use to pass in various collectors and apply them to my object. The following is the method:
public <R> R applyCollector(String key, Collector a) {
    this.key = key;
    this.a = a;
    R result = (R) this.stateList.stream()
        .map(state -> state.getKey(key))
        .collect(a);
    return result;
}
The method takes a "key" and a Collector, and applies the collector over the values retrieved by that key. This is the way I'm using it:
Collector stringToListCollector = Collectors.toList();
List<String> values = myObject.applyCollector("key", stringToListCollector);
This works fine for simple things like getting a count, an average, etc.
But what if I wish to send something more complex, like a nested Collector?
For example, say my "key" returns a String which is actually an IP address or even an Integer. What I'd like to do is send a collector that first converts the String to an Integer via Integer::parseInt and then does the toList.
Right now I have to first retrieve the list named values (defined above) and then do values.stream().map(Integer::parseInt).collect(Collectors.averagingInt(i -> i)).
Since I might need to do this operation multiple times, I have two options:
1. Wrap the toList, map, and collect in a function and call that. This defeats the purpose of using lambdas.
2. Write or nest existing Collectors to do it for me directly. This option looks neater to me, because if I can do it correctly I'll do everything in one pass instead of the two passes it takes now, and maybe save the memory needed to first create a list.
How do I write a Collector that takes an object, runs Integer::parseInt on it, and then performs an averaging operation?
For your example it would look like:
applyCollector("key", mapping(Integer::parseInt, averagingInt(i -> i)))
Collectors can be composed to some extent:
Collectors.mapping applies a function before collecting;
Collectors.collectingAndThen applies a function after collecting;
additionally, some collectors accept downstream collectors, e.g. groupingBy.
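Putting those three composition points together, a hedged sketch (the variable names are mine, and a static import of java.util.stream.Collectors.* is assumed):
// mapping: parse each String before the downstream collector sees it
Collector<String, ?, Double> averageOfParsed =
    mapping(Integer::parseInt, averagingInt(i -> i));

// collectingAndThen: post-process the finished result
Collector<String, ?, List<Integer>> unmodifiableInts =
    collectingAndThen(mapping(Integer::parseInt, toList()),
                      Collections::unmodifiableList);

// groupingBy with a downstream collector
Map<Integer, Long> countByLength =
    Stream.of("10", "200", "3000")
          .collect(groupingBy(String::length, counting()));
Any of these composed collectors could be passed to applyCollector as-is.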

Java 8 Streams - collect vs reduce

When would you use collect() vs reduce()? Does anyone have good, concrete examples of when it's definitely better to go one way or the other?
Javadoc mentions that collect() is a mutable reduction.
Given that it's a mutable reduction, I assume it requires synchronization (internally) which, in turn, can be detrimental to performance. Presumably reduce() is more readily parallelizable at the cost of having to create a new data structure for return after every step in the reduce.
The above statements are guesswork however and I'd love an expert to chime in here.
reduce is a "fold" operation, it applies a binary operator to each element in the stream where the first argument to the operator is the return value of the previous application and the second argument is the current stream element.
collect is an aggregation operation where a "collection" is created and each element is "added" to that collection. Collections in different parts of the stream are then added together.
The document you linked gives the reason for having two different approaches:
If we wanted to take a stream of strings and concatenate them into a single long string, we could achieve this with ordinary reduction:
String concatenated = strings.reduce("", String::concat)
We would get the desired result, and it would even work in parallel. However, we might not be happy about the performance! Such an implementation would do a great deal of string copying, and the run time would be O(n^2) in the number of characters. A more performant approach would be to accumulate the results into a StringBuilder, which is a mutable container for accumulating strings. We can use the same technique to parallelize mutable reduction as we do with ordinary reduction.
So the point is that the parallelisation is the same in both cases but in the reduce case we apply the function to the stream elements themselves. In the collect case we apply the function to a mutable container.
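To see that difference in code, here is a hedged sketch of the StringBuilder approach the quote describes, spelled out with the three-argument collect (assuming strings is a List<String>):
List<String> strings = Arrays.asList("a", "b", "c");

// Ordinary (immutable) reduction: every step allocates a brand-new String.
String viaReduce = strings.stream().reduce("", String::concat);

// Mutable reduction: elements are appended into per-thread StringBuilders,
// which are then merged, avoiding the quadratic copying.
String viaCollect = strings.stream().collect(
        StringBuilder::new,     // supplier: a fresh mutable container
        StringBuilder::append,  // accumulator: fold one element into it
        StringBuilder::append)  // combiner: merge partial containers
    .toString();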
The reason is simply that:
collect() can only work with mutable result objects.
reduce() is designed to work with immutable result objects.
"reduce() with immutable" example
public class Employee {
    private Integer salary;

    public Employee(String aSalary) {
        this.salary = new Integer(aSalary);
    }

    public Integer getSalary() {
        return this.salary;
    }
}
@Test
public void testReduceWithImmutable() {
    List<Employee> list = new LinkedList<>();
    list.add(new Employee("1"));
    list.add(new Employee("2"));
    list.add(new Employee("3"));

    Integer sum = list.stream()
        .map(Employee::getSalary)
        .reduce(0, (Integer a, Integer b) -> Integer.sum(a, b));

    assertEquals(Integer.valueOf(6), sum);
}
"collect() with mutable" example
E.g. if you would like to manually calculate a sum using collect(), it cannot work with BigDecimal, but only with a mutable type such as MutableInt from org.apache.commons.lang.mutable. See:
public class Employee {
    private MutableInt salary;

    public Employee(String aSalary) {
        this.salary = new MutableInt(aSalary);
    }

    public MutableInt getSalary() {
        return this.salary;
    }
}
@Test
public void testCollectWithMutable() {
    List<Employee> list = new LinkedList<>();
    list.add(new Employee("1"));
    list.add(new Employee("2"));

    MutableInt sum = list.stream().collect(
        MutableInt::new,
        (MutableInt container, Employee employee) ->
            container.add(employee.getSalary().intValue()),
        MutableInt::add);

    assertEquals(new MutableInt(3), sum);
}
This works because the accumulator container.add(employee.getSalary().intValue()) is not supposed to return a new object with the result, but to change the state of the mutable container of type MutableInt.
If you wanted to use BigDecimal for the container instead, you could not use the collect() method, as container.add(employee.getSalary()) would not change the container, because BigDecimal is immutable.
(Apart from this, BigDecimal::new would not work, as BigDecimal has no no-arg constructor.)
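For completeness, the immutable BigDecimal case is exactly where reduce() fits; a hedged sketch, assuming a hypothetical Employee whose getSalary() returns BigDecimal:
BigDecimal total = employees.stream()
        .map(Employee::getSalary)
        .reduce(BigDecimal.ZERO, BigDecimal::add); // each step yields a new immutable BigDecimal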
The normal reduction is meant to combine two immutable values such as int, double, etc. and produce a new one; it’s an immutable reduction. In contrast, the collect method is designed to mutate a container to accumulate the result it’s supposed to produce.
To illustrate the problem, let's suppose you want to achieve Collectors.toList() using a simple reduction like
List<Integer> numbers = stream.reduce(
    new ArrayList<Integer>(),
    (List<Integer> l, Integer e) -> {
        l.add(e);
        return l;
    },
    (List<Integer> l1, List<Integer> l2) -> {
        l1.addAll(l2);
        return l1;
    });
This is the equivalent of Collectors.toList(). However, in this case you mutate the List<Integer>. As we know, ArrayList is not thread-safe, and it is not safe to add or remove values from it while iterating, so when the accumulator adds integers to the list, or the combiner tries to merge the lists, you may get a concurrent-modification exception, an ArrayIndexOutOfBoundsException, or some other exception (especially when run in parallel), because you are mutating the list as you accumulate into it. If you wanted to make this thread-safe, you would need to pass a new list each time, which would impair performance.
In contrast, Collectors.toList() works in a similar fashion, but it guarantees thread safety when you accumulate the values into the list. From the documentation for the collect method:
Performs a mutable reduction operation on the elements of this stream using a Collector. If the stream is parallel, and the Collector is concurrent, and either the stream is unordered or the collector is unordered, then a concurrent reduction will be performed. When executed in parallel, multiple intermediate results may be instantiated, populated, and merged so as to maintain isolation of mutable data structures. Therefore, even when executed in parallel with non-thread-safe data structures (such as ArrayList), no additional synchronization is needed for a parallel reduction.
So to answer your question:
When would you use collect() vs reduce()?
if you have immutable values such as ints, doubles, and Strings, then normal reduction works just fine. However, if you have to reduce your values into, say, a List (a mutable data structure), then you need to use mutable reduction with the collect method.
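The safe counterpart of the broken reduce above is either Collectors.toList() or the three-argument collect, where the framework keeps each thread's container isolated; a sketch:
// Each parallel worker gets its own ArrayList from the supplier, so the
// accumulator never mutates a shared list; partial lists are merged by addAll.
List<Integer> numbers = stream.collect(
        ArrayList::new,     // supplier
        ArrayList::add,     // accumulator
        ArrayList::addAll); // combiner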
Let the stream be a <- b <- c <- d
In reduction,
you will have ((a # b) # c) # d
where # is that interesting operation that you would like to do.
In collection,
your collector will have some kind of collecting structure K.
K consumes a.
K then consumes b.
K then consumes c.
K then consumes d.
At the end, you ask K what the final result is.
K then gives it to you.
They are very different in the potential memory footprint during the runtime. While collect() collects and puts all data into the collection, reduce() explicitly asks you to specify how to reduce the data that made it through the stream.
For example, if you want to read some data from a file, process it, and put it into some database, you might end up with java stream code similar to this:
streamDataFromFile(file)
.map(data -> processData(data))
.map(result -> database.save(result))
.collect(Collectors.toList());
In this case, we use collect() to force Java to stream the data through and save the results into the database. Without collect(), the data is never read and never stored.
This code happily generates a java.lang.OutOfMemoryError: Java heap space runtime error, if the file size is large enough or the heap size is low enough. The obvious reason is that it tries to stack all the data that made it through the stream (and, in fact, has already been stored in the database) into the resulting collection and this blows up the heap.
However, if you replace collect() with reduce(), this is no longer a problem, as the reduction will discard the data as it flows through.
In the presented example, just replace the collect() with a reduce:
.reduce(0L, (aLong, result) -> aLong, (aLong1, aLong2) -> aLong1);
You do not even need to make the calculation depend on the result: Java is not a pure functional programming language and cannot optimize away the data that is unused at the bottom of the stream, because of possible side effects.
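That said, if the goal is purely the side effect (saving to the database), a terminal forEach avoids accumulating anything at all; a sketch reusing the hypothetical names from the example above:
streamDataFromFile(file)
    .map(data -> processData(data))
    .forEach(result -> database.save(result)); // terminal operation; nothing is collected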
Here is a code example:
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
int sum = list.stream().reduce((x, y) -> {
    System.out.println(String.format("x=%d,y=%d", x, y));
    return (x + y);
}).get();
System.out.println(sum);
Here is the execution result:
x=1,y=2
x=3,y=3
x=6,y=4
x=10,y=5
x=15,y=6
x=21,y=7
28
The reduce function handles two parameters: the first is the previous return value in the stream, and the second is the current value. Their sum becomes the first parameter in the next calculation.
According to the docs
The reducing() collectors are most useful when used in a multi-level reduction, downstream of groupingBy or partitioningBy. To perform a simple reduction on a stream, use Stream.reduce(BinaryOperator) instead.
So basically you'd use reducing() only when forced within a collect.
Here's another example: given a stream of Person, to calculate the longest last name of residents in each city:
Comparator<String> byLength = Comparator.comparing(String::length);
Map<String, String> longestLastNameByCity =
    personList.stream().collect(groupingBy(Person::getCity,
        reducing("", Person::getLastName, BinaryOperator.maxBy(byLength))));
According to this tutorial, reduce is sometimes less efficient:
The reduce operation always returns a new value. However, the accumulator function also returns a new value every time it processes an element of a stream. Suppose that you want to reduce the elements of a stream to a more complex object, such as a collection. This might hinder the performance of your application. If your reduce operation involves adding elements to a collection, then every time your accumulator function processes an element, it creates a new collection that includes the element, which is inefficient. It would be more efficient for you to update an existing collection instead. You can do this with the Stream.collect method, which the next section describes...
So the identity is "re-used" in a reduce scenario, making it slightly more efficient to go with .reduce if possible.
There is a very good reason to always prefer collect() over the reduce() method: using collect() is much more performant, as explained here:
Java 8 tutorial
A mutable reduction operation (such as Stream.collect()) collects the stream elements in a mutable result container (a collection) as it processes them. Mutable reduction operations provide much improved performance compared to an immutable reduction operation (such as Stream.reduce()). This is because the container holding the result at each step of the reduction is mutable for a Collector and can be reused in the next step. A Stream.reduce() operation, on the other hand, uses immutable result containers and therefore needs to instantiate a new instance of the container at every intermediate step of the reduction, which degrades performance.
