java: Collect and combine data in a list

In my program I have a List of Plants; each plant has a measurement (String), day (int), camera (int), and replicate number (int). I obtain a List of all the plants I want by using filters:
List<Plant> selectPlants = allPlants.stream().filter(plant -> passesFilters(plant, filters)).collect(Collectors.toList());
What I would like to do now is take all Plants that have the same camera, measurement, and replicate values, and combine them in order of day. So if I have days 1, 2, 3, 5, I want to find all similar plants and append their values (the list returned by getValues()) onto one plant.
I added a method to Plant that appends values by just calling addAll with the other plant's values.
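The method looks something like this (a sketch; the List<Double> value type is just what I happen to use):
public void appendValues(Plant other) {
    // values is this plant's List<Double> of measurements
    values.addAll(other.getValues());
}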
Is there any way of doing this without iterating through the list over and over to find the similar plants, sorting each group by day, and then appending? I'm sorry for the horrible wording of this question.

While Vakh’s answer is correct, it is unnecessarily complex.
Often, the work of implementing your own key class does not pay off. You can use a List as a key which implies a slight overhead due to boxing primitive values but given the fact that we do operations like hashing here, it will be negligible.
And sorting doesn’t have to be done by using a for loop and, even worse, an anonymous inner class for the Comparator. Why do we have streams and lambdas? You can implement the Comparator using a lambda like (p1,p2) -> p1.getDay()-p2.getDay() or, even better, Comparator.comparing(Plant::getDay).
Further, you can do the entire operation in one step. The sort step will create an ordered stream and the collector will maintain the encounter order, so you can use one stream to sort and group:
Map<List<?>, List<Plant>> groupedPlants = allPlants.stream()
    .filter(plant -> passesFilters(plant, filters))
    .sorted(Comparator.comparing(Plant::getDay))
    .collect(Collectors.groupingBy(p ->
        Arrays.asList(p.getMeasurement(), p.getCamera(), p.getReplicateNumber())));
That’s all.
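If you also want to carry out the combining step the question asks about, you can go on from the grouped map. A possible follow-up (a sketch, not part of the original answer; it assumes the appendValues method from the question and that mutating the first plant of each group is acceptable):
List<Plant> combined = groupedPlants.values().stream()
    .map(group -> {
        Plant first = group.get(0);
        group.stream().skip(1).forEach(first::appendValues);
        return first;
    })
    .collect(Collectors.toList());
Because the stream was sorted before grouping and the collector preserves encounter order, each group's list is already in day order, so the values are appended in the right sequence.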

Using Collectors.groupingBy:
private static class PlantKey {
    private String measurement;
    private int camera;
    private int replicateNumber;
    // + constructor, getters, setters and hashCode/equals
}
Map<PlantKey, List<Plant>> groupedPlants =
    allPlants.stream().filter(plant -> passesFilters(plant, filters))
        .collect(Collectors.groupingBy(p ->
            new PlantKey(p.getMeasurement(),
                         p.getCamera(),
                         p.getReplicateNumber())));
// order the lists
for (List<Plant> values : groupedPlants.values()) {
    Collections.sort(values, new Comparator<Plant>() {
        @Override
        public int compare(Plant p1, Plant p2) {
            return p1.getDay() - p2.getDay();
        }
    });
}

I would group them by the common characteristics and compare similar results.
for (List<Plant> plantGroup : allPlants.stream().collect(Collectors.groupingBy(
        p -> p.camera + "/" + p.measurement + "/" + p.replicate)).values()) {
    // compare the plants in the same group
}

There is a method called sorted which operates on a stream:
selectPlants.stream().sorted(Comparator.comparingInt(i -> i.day)).collect(Collectors.toList());

Related

How to Add up the values in a Nested Collection using Streams

I have the following TicketDTO Object:
public class TicketDTO {
    private LocalDate date;
    private Set<OffenceDTO> offences;
}
And every OffenceDTO has an int field - penalty points.
public class OffenceDTO {
    private int penaltyPoints;
}
I would like to add up the penalty points into a single int value by streaming the Set of Offences of each Ticket, but only if the ticket's date falls within the last two years.
I have collected the tickets from the last two years, but now I have a problem with how to go through the offences and count their points.
This is what I've written so far:
tickets().stream()
    .filter(ticketEntity -> isDateBetween(LocalDate.now(), ticketEntity.getDate()))
    .collect(Collectors.toList());
I would like to collect the penalty points into a single int value by streaming the set of tickets.
It can be done in the following steps:
Turn the stream of filtered tickets into a stream of OffenceDTO using flatMap();
Extract penalty points from OffenceDTO with mapToInt(), which will transform a stream of objects into an IntStream;
Apply sum() to get the total.
int totalPenalty = tickets().stream()
    .filter(ticketEntity -> isDateBetween(LocalDate.now(), ticketEntity.getDate()))
    .flatMap(ticketDTO -> ticketDTO.getOffences().stream())
    .mapToInt(OffenceDTO::getPenaltyPoints)
    .sum();
Assuming that tickets() is a method that returns a List of TicketDTO, you could stream the List and filter its elements with your custom method isDateBetween (as you were doing).
Then flat-map each ticket to its corresponding offences. This will give you a stream of the OffenceDTOs whose TicketDTO falls within the last two years (according to your isDateBetween method).
Finally, you can collect the points of each OffenceDTO by summing them with the summingInt method of the Collectors class.
int res = tickets().stream()
    .filter(ticketEntity -> isDateBetween(LocalDate.now(), ticketEntity.getDate()))
    .flatMap(ticketDTO -> ticketDTO.getOffences().stream())
    .collect(Collectors.summingInt(OffenceDTO::getPenaltyPoints));

Partition Strategy for applying multiple JOINs on a Flink DataSet

I am using Flink 1.4.0.
Suppose I have a POJO as follows:
public class Rating {
    public String name;
    public String labelA;
    public String labelB;
    public String labelC;
    ...
}
and a JOIN function:
public class SetLabelA implements JoinFunction<Tuple2<String, Rating>, Tuple2<String, String>, Tuple2<String, Rating>> {
    @Override
    public Tuple2<String, Rating> join(Tuple2<String, Rating> rating, Tuple2<String, String> labelA) {
        rating.f1.setLabelA(labelA.f1);
        return rating;
    }
}
and suppose I want to apply a JOIN operation to set the values of each field in a DataSet<Tuple2<String, Rating>>, which I can do as follows:
DataSet<Tuple2<String, Rating>> ratings = // [...]
DataSet<Tuple2<String, String>> aLabels = // [...]
DataSet<Tuple2<String, String>> bLabels = // [...]
DataSet<Tuple2<String, String>> cLabels = // [...]
...
DataSet<Tuple2<String, Rating>> newRatings =
    ratings.leftOuterJoin(aLabels, JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE)
           // key of the first input
           .where("f0")
           // key of the second input
           .equalTo("f0")
           // applying the JoinFunction on joining pairs
           .with(new SetLabelA());
Unfortunately, this is necessary: both ratings and all of the xLabels are very big DataSets, I am forced to look into each of the xLabels to find the field values I require, and at the same time not every rating key exists in each xLabels DataSet.
This practically means that I have to perform a leftOuterJoin per xLabels DataSet, for which I also need to create the respective JoinFunction implementation that uses the correct setter from the Rating POJO.
Is there a more efficient way to solve this that anyone can think of?
As far as the partitioning strategy goes, I have made sure to sort the DataSet<Tuple2<String, Rating>> ratings with:
DataSet<Tuple2<String, Rating>> sorted_ratings = ratings.sortPartition(0, Order.ASCENDING).setParallelism(1);
By setting parallelism to 1 I can be sure that the whole dataset will be ordered. I then use .partitionByRange:
DataSet<Tuple2<String, Rating>> partitioned_ratings = sorted_ratings.partitionByRange(0).setParallelism(N);
where N is the number of cores I have on my VM. Another side question I have here is whether the first .setParallelism which is set to 1 is restrictive in terms of how the rest of the pipeline is executed, i.e. can the follow up .setParallelism(N) change how the DataSet is processed?
Finally, I did all this so that when partitioned_ratings is joined with an xLabels DataSet, the JOIN operation will be done with JoinOperatorBase.JoinHint.REPARTITION_SORT_MERGE. According to the Flink docs for v1.4.0:
REPARTITION_SORT_MERGE: The system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. This strategy is good if one or both of the inputs are already sorted.
So in my case, ratings is sorted (I think) and each of the xLabels DataSets is not, hence it makes sense that this is the most efficient strategy. Anything wrong with this? Any alternative approaches?
So far I haven't been able to pull through this strategy. It seems like relying on JOINs is too troublesome as they are expensive operations and one should avoid them unless they are really necessary.
For instance, JOINs should be used if both DataSets are very big in size. If they are not, a convenient alternative is the use of broadcast variables, by which one of the two DataSets (the smallest) is broadcast across workers for whatever purpose it is used. An example appears below (adapted from the Flink documentation for convenience):
DataSet<Point> points = env.readCsv(...);
DataSet<Centroid> centroids = ...; // some computation
points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
Also, since populating the fields of a POJO means that quite similar code will be leveraged repeatedly, one should definitely use jlens to avoid code repetition and write a more concise and easy-to-follow solution.
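For what it's worth, short of a lens library, one generic JoinFunction parameterized with a setter reference could replace the per-field classes. A sketch (my own, not from the question; it assumes Rating has conventional setters, and the RatingSetter interface extends Serializable because Flink ships functions to the workers):
import java.io.Serializable;
import java.util.function.BiConsumer;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// A Serializable setter type, so method references survive shipping to workers.
interface RatingSetter extends BiConsumer<Rating, String>, Serializable {}

public class SetLabel implements JoinFunction<Tuple2<String, Rating>,
        Tuple2<String, String>, Tuple2<String, Rating>> {

    private final RatingSetter setter;

    public SetLabel(RatingSetter setter) {
        this.setter = setter;
    }

    @Override
    public Tuple2<String, Rating> join(Tuple2<String, Rating> rating,
                                       Tuple2<String, String> label) {
        if (label != null) { // the outer side of a left join can be null
            setter.accept(rating.f1, label.f1);
        }
        return rating;
    }
}
It would then be used as .with(new SetLabel(Rating::setLabelA)), .with(new SetLabel(Rating::setLabelB)), and so on, one leftOuterJoin per label DataSet.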

Removing duplicates from list where duplication logic is based on custom field

I have a list of following info
public class TheInfo {
    private int id;
    private String fieldOne;
    private String fieldTwo;
    private String fieldThree;
    private String fieldFour;
    // Standard getters, setters, equals, hashCode, toString methods
}
The list is required to be processed in such a way that
Among duplicates, select the one with minimum ID, and remove others. In this particular case, entries are considered duplicate when their values of fieldOne and fieldTwo are equal.
Get concatenated value of fieldThree and fieldFour.
I want to process this list with Java 8 Streams. Currently I don't know how to remove duplicates based on custom fields. I think I can't use distinct(), because I can't change the equals/hashCode methods, as this logic is just for this specific case.
How can I achieve this?
Assuming you have
List<TheInfo> list;
you can use
List<TheInfo> result = new ArrayList<>(list.stream().collect(
    Collectors.groupingBy(info -> Arrays.asList(info.getFieldOne(), info.getFieldTwo()),
        Collectors.collectingAndThen(
            Collectors.minBy(Comparator.comparingInt(TheInfo::getId)),
            Optional::get))).values());
the groupingBy collector produces groups according to a function whose results determine equality. A List already implements this for a sequence of values, so Arrays.asList(info.getFieldOne(), info.getFieldTwo()) produces a suitable key. In Java 9, you would most probably use List.of(info.getFieldOne(), info.getFieldTwo()) instead.
The second argument to groupingBy is another collector determining how to process the groups, Collectors.minBy(…) will fold them to the minimum element according to a comparator and Comparator.comparingInt(TheInfo::getId) is the right comparator for getting the element with the minimum id.
Unfortunately, the minBy collector produces an Optional that would be empty if there are no elements, but since we know that the groups can’t be empty (groups without elements wouldn’t be created in the first place), we can unconditionally call get on the optional to retrieve the actual value. This is what wrapping this collector in Collectors.collectingAndThen(…, Optional::get) does.
Now, the result of the grouping is a Map mapping from the keys created by the function to the TheInfo instance with the minimum id. Calling values() on the Map gives us a Collection<TheInfo>, and since you want a List, a final new ArrayList<>(collection) will produce it.
Thinking about it, this might be one of the cases, where the toMap collector is simpler to use, especially as the merging of the group elements doesn’t benefit from mutable reduction:
List<TheInfo> result = new ArrayList<>(list.stream().collect(
    Collectors.toMap(
        info -> Arrays.asList(info.getFieldOne(), info.getFieldTwo()),
        Function.identity(),
        BinaryOperator.minBy(Comparator.comparingInt(TheInfo::getId)))).values());
This uses the same function for determining the key and another function determining a single value, which is just an identity function and a reduction function that will be called, if a group has more than one element. This will again be a function returning the minimum according to the ID comparator.
Using streams, you can process it using just the collector, if you provide it with proper classifier:
private static <T> T min(T first, T second, Comparator<? super T> cmp) {
    return cmp.compare(first, second) <= 0 ? first : second;
}

private static void process(Collection<TheInfo> data) {
    Comparator<TheInfo> cmp = Comparator.comparing(info -> info.id);
    Map<List<String>, TheInfo> byKey = data.stream()
        .collect(Collectors.toMap(
            // Your classifier uses a tuple. The closest thing in the JDK currently
            // would be a List or some custom class; List is used here for brevity.
            info -> Arrays.asList(info.fieldOne, info.fieldTwo),
            info -> info, // or Function.identity()
            (a, b) -> min(a, b, cmp) // what to do with duplicates: take the min according to the Comparator
        ));
}
The above stream is collected into a Map<List<String>, TheInfo>, which maps each key (a list of two strings) to the minimal element for that key. You can extract map.values() and return them in a new collection, or use them however you need.
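If you need the surviving elements outside the method, a usage sketch (byKey is the variable introduced above):
List<TheInfo> deduplicated = new ArrayList<>(byKey.values());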

Manipulating data in java List<Object> in a structured way

I have the following class :
class Students {
    int age;
    int debt;
}
Let's say I have a List<Students> and I want to manipulate the list by doing simple calculations, like calculating the mean, calculating the middle value (e.g. (age + debt) / 2), finding the closest value to the mean, and so on. How can I do this in a structured way? I want to be in a position where I can use different combinations on the list, e.g. calculate the mean of age, calculate the mean of the middle value from age/debt, find the closest value to the mean of age, etc.
How should I approach this? Would appreciate it if someone could point me in the right direction.
Apache Commons Math has a nice Descriptive Statistics package that does this sort of thing.
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
If you're using Java 8 this works well with Lambdas:
DescriptiveStatistics stats = new DescriptiveStatistics();
students.forEach(s -> stats.addValue(s.age));
double mean = stats.getMean();
And to filter etc:
//Only students with an age > 18
students.stream().filter(s -> s.age > 18).forEach(s -> stats.addValue(s.age));
If you're not using Java 8 then simply foreach it.
You can create a separate class (StudentCalculator) that will require a List of Students (perhaps pass the List in the constructor) and have the instance methods perform calculations on the List.
Or you can create a utility (e.g. StudentCalculatorUtility) where you would define a series of methods that would accept a List of Students as a parameter, that would perform all the calculations you would need on the students(middle value,closest to mean, etc.)
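A minimal sketch of the utility approach (class and method names are illustrative; it assumes the debt field and Java 8):
import java.util.Comparator;
import java.util.List;
import java.util.NoSuchElementException;

final class StudentCalculatorUtility {
    private StudentCalculatorUtility() {}

    // Mean age over all students.
    static double meanAge(List<Students> students) {
        return students.stream().mapToInt(s -> s.age).average().orElse(Double.NaN);
    }

    // Mean of the per-student middle value (age + debt) / 2.
    static double meanMiddleValue(List<Students> students) {
        return students.stream().mapToDouble(s -> (s.age + s.debt) / 2.0).average().orElse(Double.NaN);
    }

    // Student whose age is closest to the mean age.
    static Students closestToMeanAge(List<Students> students) {
        double mean = meanAge(students);
        return students.stream()
                .min(Comparator.comparingDouble(s -> Math.abs(s.age - mean)))
                .orElseThrow(NoSuchElementException::new);
    }
}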
There is a concept where you step through a list and perform an operation on each item in turn, carrying an accumulated result along.
In this case, you want a method that takes an item from the list and the running total, does some arithmetic, and returns the new running total.
int sumItems(Students stu, int sum) {
    return sum + (stu.age + stu.debt) / 2;
}
To use this method, use either a forEach or an iterator.
Iterator<Students> itr = students.iterator(); // assuming List<Students> students = new ArrayList<>()
int sum = 0;
while (itr.hasNext()) {
    sum = sumItems(itr.next(), sum);
}
Now do something with your sum.
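For comparison, the same running total as a single stream expression (a sketch, assuming the debt field and a List<Students> students):
int sum = students.stream()
                  .mapToInt(stu -> (stu.age + stu.debt) / 2)
                  .sum();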

How do I add to a set based on a specific predicate in Java?

I have a set in Java containing people:
Set<Person> uniquePeople = new HashSet<Person>();
I also have a list of a ton of people (of whom some possess the same name, eg. there is more than one "Bob" in the world).
List<Person> theWorld = // ... a BIG list of people
I want to iterate through this list and add a person to the uniquePeople set if and only if their name doesn't exist in the set, eg:
for (Person person : theWorld) {
    uniquePeople.add(person IFF uniquePeople.doesNotContain(person.name));
}
Is there an easy way to do this in Java? Also, Guava might do this (?) but I haven't used it at all so I would appreciate a point in the right direction.
A better option would be to abandon using a Set and instead use a Map<String, Person> (keyed off of the name).
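A minimal sketch of the Map approach (it assumes name is accessible and Java 8's Map.putIfAbsent; the first Person seen for each name wins):
Map<String, Person> uniqueByName = new HashMap<>();
for (Person person : theWorld) {
    uniqueByName.putIfAbsent(person.name, person);
}
Collection<Person> uniquePeople = uniqueByName.values();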
If you want to use a set, I suggest you use a new object type (that will just contain a name and maybe a reference to a Person).
Make sure you override equals so that it will only compare the names and then you can get a set of all unique people.
You could also subclass person to override the equals to do what you want.
Sets by definition will not do what you want with just a Person, since they depend entirely on equals, so these are your workaround options. You could also use a set that takes a comparator instead of relying on equals: standard Java's TreeSet accepts a Comparator in its constructor and treats elements that compare as equal as duplicates, as sketched below.
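For example, a sketch with TreeSet (it assumes name is accessible):
Set<Person> uniquePeople = new TreeSet<>(Comparator.comparing(p -> p.name));
uniquePeople.addAll(theWorld); // a Person whose name is already present is not added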
Use Guava's Equivalence to wrap your objects if you don't want to (or can't) override equals and hashCode:
Set<Equivalence.Wrapper<Person>> set = Sets.newHashSet();
Equivalence<Person> personEquivalence = Equivalence.equals().onResultOf(
    new Function<Person, String>() {
        @Override public String apply(Person p) {
            return p.name;
        }
    });
set.add(personEquivalence.wrap(new Person("Joe", "Doe")));
set.add(personEquivalence.wrap(new Person("Joe", "Doe")));
set.add(personEquivalence.wrap(new Person("Jane", "Doe")));
System.out.println(set);
// [PersonEquivalence@8813f2.wrap(Person{firstName=Jane, lastName=Doe}),
//  PersonEquivalence@8813f2.wrap(Person{firstName=Joe, lastName=Doe})]
@DanielWilliams has a good idea too, but using Equivalence.Wrapper is more self-documenting; after all, you don't want to create a new object type other than the wrapper.
I am not sure why people got downvoted here.
You absolutely want a Set. Not only do your requirements match the definition and functionality of a Set, but Set implementations are designed to quickly identify duplicates, either via hashing or via comparison.
Let's say you had a List implementation that took a delegate and a predicate:
List<Person> uniquePeople = new PredicatedList<>(new ArrayList<>(), UniquePersonPredicate.getInstance());
public class PredicatedList<T> implements List<T> {
    private List<T> delegate = null;
    private Predicate<T> predicate;

    public PredicatedList(List<T> delegate, Predicate<T> p) {
        this.delegate = delegate;
        this.predicate = p;
    }

    // implement the List methods here and apply 'predicate' before calling
    // your insertion functions
    public boolean add(T t) {
        if (predicate.apply(t))
            return delegate.add(t);
        return false;
    }
}
For this to work you would need a predicate that iterates over the list to find an equal element, which is an O(n) operation. If you use a HashSet, the identity check is amortized O(1): somewhere between O(1) and O(n) depending on the load factor, but usually much closer to O(1).
If you use a TreeSet you get O(log n), because the elements are sorted by identity and a binary search takes only O(log n) time.
Define hashCode()/equals() based on 'name' or whatever you want and use a HashSet, or use a TreeSet and define a Comparable/Comparator.
If your return type MUST be a List then do:
Set<Person> uniquePeople = new HashSet<>();
uniquePeople.add(...);
List<Person> people = new LinkedList<>(uniquePeople);
You could do it with Guava; the only thing is that Person is going to need equals/hashCode methods.
ImmutableSet<String> smallList = ImmutableSet.of("Eugene","Bob");
ImmutableSet<String> bigList = ImmutableSet.of("Eugene","Bob","Alex","Bob","Alex");
System.out.println(Iterables.concat(smallList, Sets.difference(bigList, smallList)));
//output is going to be : [Eugene, Bob, Alex]
