group objects using stream collector - java

I have a list of objects, let's say of class Document:
class Document {
private final String id;
private final int length;
public Document(String id, int length) {
this.id = id;
this.length = length;
}
public int getLength() {
return length;
}
}
The task at hand is to group them into Envelopes so that the number of pages (Document.length) per envelope does not exceed a certain number.
class Envelope {
private final List<Document> documents = new ArrayList<>();
}
So for example, if I had the following List of Documents:
Document doc0 = new Document("doc0", 2);
Document doc1 = new Document("doc1", 5);
Document doc2 = new Document("doc2", 5);
Document doc3 = new Document("doc3", 5);
and the max page count per envelope is, let's say, 7, then I expect 3 envelopes with the following documents:
Assert.assertEquals(3, envelopeList.size());
Assert.assertEquals(2, envelopeList.get(0).getDocuments().size()); // doc0, doc1
Assert.assertEquals(1, envelopeList.get(1).getDocuments().size()); // doc2
Assert.assertEquals(1, envelopeList.get(2).getDocuments().size()); // doc3
I have implemented this with a traditional for loop and a bunch of ifs, but the question is: is it possible to do this in a more elegant way with streams and collectors?
thank you and best regards
Dalibor

Batching the documents by length requires keeping track of the accumulated length. Streams are not a good fit when such external state has to be maintained; a custom loop is the simpler and more efficient option.
If we force-fit streams onto this scenario anyway, the DocumentSpliterator would change as below:
public static List<Couvert> splitDocuments(List<Document> docs) {
IntUnaryOperator helper = new IntUnaryOperator() {
private int bucketIndex = 0;
private int accumulated = 0;
public synchronized int applyAsInt(int length) {
if (length + accumulated > MAX) {
bucketIndex++;
accumulated = 0;
}
accumulated += length;
return bucketIndex;
}
};
return new ArrayList<>(docs.stream()
.map(d -> new AbstractMap.SimpleEntry<>(helper.applyAsInt(d.getLength()), d))
.collect(Collectors.groupingBy(AbstractMap.SimpleEntry::getKey,
Collector.of(Couvert::new,
(c, e) -> c.getDocuments().add(e.getValue()),
(c1, c2) -> {c1.getDocuments().addAll(c2.getDocuments());return c1;})))
.values());
}
Explanation:
helper maintains the accumulated length and provides a new bucket index when the maximum would be exceeded. I have used the IntUnaryOperator interface here; alternatively, we can use any interface that takes an int parameter and returns an int. Because helper is stateful, this only gives the expected result on a sequential stream.
Regarding the stream,
Document is mapped to a SimpleEntry of bucketIndex and Document.
The stream of SimpleEntry is first grouped by the bucketIndex. Another Collector transforms the stream of Document for a particular bucketIndex into a Couvert. The output of collect() is a Map<Integer,Couvert>.
Finally, the Collection of Couvert are converted to a list and returned.
Note: For this implementation, I removed the front parameter and included it as part of the docs list.
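For comparison, the plain loop that the answer recommends could look like the sketch below. It is only an illustration under stated assumptions: the class name EnvelopeBatcher, the split method and its maxPages parameter are made up here, Envelope is the question's name for what the answer calls Couvert, and a document longer than maxPages is simply given an envelope of its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EnvelopeBatcher {

    // Minimal stand-ins for the classes from the question.
    static final class Document {
        private final String id;
        private final int length;
        Document(String id, int length) { this.id = id; this.length = length; }
        String getId() { return id; }
        int getLength() { return length; }
    }

    static final class Envelope {
        private final List<Document> documents = new ArrayList<>();
        List<Document> getDocuments() { return documents; }
    }

    // Greedy, order-preserving batching: open a new envelope whenever adding
    // the next document would push the page count past maxPages.
    // Assumption: an oversized document still gets its own envelope.
    static List<Envelope> split(List<Document> docs, int maxPages) {
        List<Envelope> envelopes = new ArrayList<>();
        Envelope current = null;
        int pages = 0;
        for (Document doc : docs) {
            if (current == null || pages + doc.getLength() > maxPages) {
                current = new Envelope();
                envelopes.add(current);
                pages = 0;
            }
            current.getDocuments().add(doc);
            pages += doc.getLength();
        }
        return envelopes;
    }

    public static void main(String[] args) {
        List<Document> docs = Arrays.asList(
                new Document("doc0", 2), new Document("doc1", 5),
                new Document("doc2", 5), new Document("doc3", 5));
        for (Envelope e : split(docs, 7)) {
            System.out.println(e.getDocuments().size() + " document(s)");
        }
    }
}
```

The accumulated page count lives in a local variable, so no stateful lambda is needed and the intent stays readable.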

Related

stream of objects containing multiple properties question

I am trying to solve the following:
Given a list of Data objects try, in a 'one shot' like operation, stream the list, such that the end result will be a generic object or a data object where each prop get its own sum/max/min:
class Data {
int prop1;
int prop2;
...
// constructor
// getters and setters
}
For example, given a list of 2 Data objects as follows:
List<Data> list = Arrays.asList(new Data(1,2), new Data(3,4));
If I apply max to the first property and sum to the second one the result is an object with prop1=3 and prop2=6 or Data(3,6)
Thanks for helping!
I am trying, in a 'one shot' like operation
You are looking for the Teeing Collector introduced in Java 12. Given a list of Data, where a Data class is something like:
@AllArgsConstructor
@Getter
@ToString
public class Data {
int prop1;
int prop2;
}
and a list:
List<Data> data = List.of(
new Data(1,10),
new Data(2,20),
new Data(3,30),
new Data(4,40)
);
the end result will be an generic object or a data object
if I apply max to the first prop and sum to the second
You can use Collectors.teeing to get a new Data object with the result of your operations
Data result =
data.stream().collect(Collectors.teeing(
Collectors.reducing(BinaryOperator.maxBy(Comparator.comparing(Data::getProp1))),
Collectors.summingInt(Data::getProp2),
(optionalMax, sum) -> new Data(optionalMax.get().getProp1(), sum)
));
Or something else, for example a Map<String,Integer>
Map<String,Integer> myMap =
data.stream().collect(Collectors.teeing(
Collectors.reducing(BinaryOperator.maxBy(Comparator.comparing(Data::getProp1))),
Collectors.summingInt(Data::getProp2),
(optionalMax, sum) -> {
HashMap<String, Integer> map = new HashMap<>();
map.put("max_prop1", optionalMax.get().getProp1());
map.put("sum_prop2", sum);
return map;
}
));
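If Java 12 is not available, roughly the same single-pass result can be obtained with a hand-written collector. The following is a minimal sketch, not part of the original answer: the class MaxSumCollectorSketch and the method maxProp1SumProp2 are hypothetical names, Data is a plain stand-in without Lombok, and an empty stream is assumed to yield Integer.MIN_VALUE for the maximum.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;

class MaxSumCollectorSketch {

    // Plain stand-in for the Data class (no Lombok).
    static final class Data {
        private final int prop1;
        private final int prop2;
        Data(int prop1, int prop2) { this.prop1 = prop1; this.prop2 = prop2; }
        int getProp1() { return prop1; }
        int getProp2() { return prop2; }
        @Override public String toString() { return "Data(" + prop1 + "," + prop2 + ")"; }
    }

    // Accumulates into an int[2]: index 0 holds the running max of prop1,
    // index 1 the running sum of prop2.
    // Assumption: for an empty stream the max stays Integer.MIN_VALUE.
    static Collector<Data, int[], Data> maxProp1SumProp2() {
        return Collector.of(
                () -> new int[] {Integer.MIN_VALUE, 0},
                (acc, d) -> {
                    acc[0] = Math.max(acc[0], d.getProp1());
                    acc[1] += d.getProp2();
                },
                (a, b) -> {
                    a[0] = Math.max(a[0], b[0]);
                    a[1] += b[1];
                    return a;
                },
                acc -> new Data(acc[0], acc[1]));
    }

    public static void main(String[] args) {
        List<Data> list = Arrays.asList(new Data(1, 2), new Data(3, 4));
        System.out.println(list.stream().collect(maxProp1SumProp2()));
    }
}
```

Unlike teeing, this does not compose existing collectors, so every additional statistic needs another slot in the accumulator.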
EDIT
After reading your comments I better understood what you meant and need. If your final goal is to create a Data which holds the result of multiple computations, then the stream operation teeing is still a viable solution.
However, since the operation teeing accepts only 2 downstreams and a merger BiFunction to merge their results, you need to nest your teeing calls to include in the first downstream one of the operations you need to perform; while in the second downstream another teeing call. Basically, every second downstream of each nested call uses a teeing operation until you're left with only two computations. Then, the merger function of every outer call takes the first downstream's result and the nested call's result, merges them together, and creates a new Data object with them.
Here is an example with a hypothetical Data class whose properties represent: min value, max value, average and sum:
@lombok.Data
class Data {
private @NonNull int prop1Min;
private @NonNull int prop2Max;
private @NonNull int prop3Avg;
private @NonNull int prop4Sum;
}
public class Main {
public static void main(String[] args) {
List<Data> data = List.of(
new Data(1, 10, 100, 1000),
new Data(2, 20, 200, 2000),
new Data(3, 30, 300, 3000),
new Data(4, 40, 400, 4000)
);
Data result = data.stream().collect(Collectors.teeing(
Collectors.minBy(Comparator.comparing(Data::getProp1Min)),
Collectors.teeing(
Collectors.maxBy(Comparator.comparing(Data::getProp2Max)),
Collectors.teeing(
Collectors.averagingInt(Data::getProp3Avg),
Collectors.summingInt(Data::getProp4Sum),
(avg, sum) -> new Data(0, 0, avg.intValue(), sum.intValue())),
(max, d) -> new Data(0, max.get().getProp2Max(), d.getProp3Avg(), d.getProp4Sum())
),
(min, d) -> new Data(min.get().getProp1Min(), d.getProp2Max(), d.getProp3Avg(), d.getProp4Sum())
));
System.out.println(result);
}
}
Output
Data(prop1Min=1, prop2Max=40, prop3Avg=250, prop4Sum=10000)
Previous Answer
It sounds like you're trying to retrieve some statistics from a stream of elements.
I am trying [...] to stream the list, such that the end result will be a generic object or a data object where each prop get its own sum/max/min etc.
For this purpose, there is already the IntSummaryStatistics class which includes a set of statistics gathered from a set of int elements. To obtain this information, you just need to stream your elements and invoke the collect operation by supplying Collectors.summarizingInt(); this will return the statistics of your elements. Moreover, Java also provides LongSummaryStatistics and DoubleSummaryStatistics to retrieve statistics of long and double types.
List<Integer> list = new ArrayList<>(List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));
IntSummaryStatistics stats = list.stream()
.collect(Collectors.summarizingInt(Integer::intValue));
System.out.println("Count: " + stats.getCount());
System.out.println("Sum: " + stats.getSum());
System.out.println("Min Value: " + stats.getMin());
System.out.println("Max Value: " + stats.getMax());
System.out.println("Average: " + stats.getAverage());
//In case Data had not been designed in place of IntSummaryStatistics and it's an actual needed class,
//then you could set up the properties you need from the IntSummaryStatistics
Data d = new Data();
d.setMinProp(stats.getMin());
d.setMaxProp(stats.getMax());
d.setSumProp(Math.toIntExact(stats.getSum())); // getSum() returns a long
//--------- Data class ---------
class Data {
private int minProp, maxProp, sumProp;
//... rest of the implementation ...
public void setMinProp(int minProp) {
this.minProp = minProp;
}
public void setMaxProp(int maxProp) {
this.maxProp = maxProp;
}
public void setSumProp(int sumProp) {
this.sumProp = sumProp;
}
}
Output
Count: 10
Sum: 45
Min Value: 0
Max Value: 9
Average: 4.5
Well, if they're only reducing functions, then you could utilize reduce as well:
Data someNewData = someData.stream()
.reduce((Data l, Data r) -> {
int a = l.prop1() + r.prop1(); // Find sum of prop1
int b = Math.max(l.prop2(), r.prop2()); // Find max value of prop2
int c = Math.min(l.prop3(), r.prop3()); // Find min value of prop3
return new Data(a, b, c);
})
.orElseThrow();

Find a successive sequence of String attributes in a Set of objects (by stream API)

I have to write a method that takes a SortedSet<MyEvent> and a List<String>. It has to determine if there is a successive sequence of MyEvent representing the given List<String> by a certain class attribute.
Let's assume there was the following (code-)situation:
List<String> list = new ArrayList<String>();
list.add("AAA");
list.add("BBB");
and a Set<MyEvent>
SortedSet<MyEvent> events = new TreeSet<MyEvent>();
with objects of type MyEvent which implements Comparable<MyEvent> (comparison by LocalDateTime only).
The given List<String> represents a sequence of abbreviations and I need to find the most recent occurrence of a sequence of MyEvents whose class attributes abbreviation have the values of the sequence.
This is what I have done so far:
public static void main(String[] args) {
SortedSet<MyEvent> events = generateSomeElements();
List<String> sequence = new ArrayList<>();
sequence.add("AAA");
sequence.add("BBB");
MyEvent desired = getMostRecentLastEventOfSequence(events, sequence);
System.out.println("Result: " + desired.toString());
}
public static MyEvent getMostRecentLastEventOfSequence(SortedSet<MyEvent> events,
List<String> sequence) {
// "convert" the events to a List in order to be able to access indexes
List<MyEvent> myEvents = new ArrayList<MyEvent>();
events.forEach(event -> myEvents.add(event));
// provide a temporary data structure for possible results
SortedSet<MyEvent> possibleReturnValues = new TreeSet<MyEvent>();
// iterate the events in order to find those with a specified predecessor
for (int i = 0; i < myEvents.size(); i++) {
if (i > 0) {
// consider only successive elements
MyEvent a = myEvents.get(i - 1);
MyEvent b = myEvents.get(i);
// check if there is a
if (a.getAbbreviation().equals(sequence.get(0))
&& b.getAbbreviation().equals(sequence.get(1))) {
// if a sequence was found, add the last element to the possible results
possibleReturnValues.add(b);
}
}
}
// check if there were possible results
if (possibleReturnValues.size() == 0) {
return null;
} else {
// if there are any, return the most recent / latest one
return possibleReturnValues.stream().max(MyEvent::compareTo).orElse(null);
}
}
The method is working (for this 2-element sequence, at least).
Is it possible to do that in a single call using the stream API (and for an unknown size of the sequence)?
Your task is not so hard: just create a Stream, apply a filter, and ask for the maximum value. There is the obstacle that we need the previous element in the predicate, but we have our hands on the source collection, which can provide it.
In practice, every SortedSet is also a NavigableSet which provides a lower method to get the previous element, if there is one, but since your requirement is to support a SortedSet input, we have to provide a fall-back for the theoretical case of a SortedSet not being a NavigableSet.
Then, the operation can be implemented as
public static MyEvent getMostRecentLastEventOfSequence(
SortedSet<MyEvent> events, List<String> sequence) {
String first = sequence.get(0), second = sequence.get(1);
UnaryOperator<MyEvent> previous;
if (events instanceof NavigableSet) {
NavigableSet<MyEvent> navigableSet = (NavigableSet<MyEvent>) events;
previous = navigableSet::lower;
}
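else previous = event -> {
// last() throws NoSuchElementException on an empty head set (i.e. for the first element)
SortedSet<MyEvent> head = events.headSet(event);
return head.isEmpty() ? null : head.last();
};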
return events.stream()
.filter(event -> event.getAbbreviation().equals(second))
.filter(event -> {
MyEvent p = previous.apply(event);
return p != null && p.getAbbreviation().equals(first);
})
.max(Comparator.naturalOrder()).orElse(null);
}
But we can do better than that. Since we know we are searching for a maximum in a sorted input, we know that the first match is sufficient when iterating backwards. Again, it is much smoother when the input is actually a NavigableSet:
public static MyEvent getMostRecentLastEventOfSequence(
SortedSet<MyEvent> events, List<String> sequence) {
String first = sequence.get(0), second = sequence.get(1);
UnaryOperator<MyEvent> previous;
Stream<MyEvent> stream;
if (events instanceof NavigableSet) {
NavigableSet<MyEvent> navigableSet = (NavigableSet<MyEvent>) events;
previous = navigableSet::lower;
stream = navigableSet.descendingSet().stream();
}
else {
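previous = event -> {
// last() throws NoSuchElementException on an empty head set (i.e. for the first element)
SortedSet<MyEvent> head = events.headSet(event);
return head.isEmpty() ? null : head.last();
};
// events.last() would also throw on an empty set
stream = events.isEmpty() ? Stream.empty()
: Stream.iterate(events.last(), previous).limit(events.size());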
}
return stream
.filter(event -> event.getAbbreviation().equals(second))
.filter(event -> {
MyEvent p = previous.apply(event);
return p != null && p.getAbbreviation().equals(first);
})
.findFirst().orElse(null);
}
So this method will search backwards and stop at the first match, which will already be the maximum element, without the need to traverse all elements.
Another solution is to use indexes. The size of the sequence is then not limited to 2.
public static MyEvent getMostRecentLastEventOfSequence(SortedSet<MyEvent> events, List<String> sequence) {
final List<MyEvent> eventList = new ArrayList<>(events);
Collections.reverse(eventList);
final int seqLength = sequence.size();
OptionalInt first = IntStream.range(0, eventList.size() - seqLength + 1)
.filter(i -> IntStream.range(0, seqLength)
.allMatch(j -> eventList.get(i + j).getAbbreviation().equals(sequence.get(seqLength - j - 1))))
.findFirst();
return first.isPresent() ? eventList.get(first.getAsInt()) : null;
}
Here is sample test code:
@Test
void test_56005015() throws Exception {
List<String> sequence = Arrays.asList("AAA", "BBB", "CCC");
SortedSet<MyEvent> events = new TreeSet<>();
events.add(new MyEvent("AAA", LocalDateTime.now().plusDays(1)));
events.add(new MyEvent("BBB", LocalDateTime.now().plusDays(2)));
events.add(new MyEvent("CCC", LocalDateTime.now().plusDays(3)));
events.add(new MyEvent("AAA", LocalDateTime.now().plusDays(4)));
events.add(new MyEvent("BBB", LocalDateTime.now().plusDays(5)));
events.add(new MyEvent("CCC", LocalDateTime.now().plusDays(6)));
MyEvent result = getMostRecentLastEventOfSequence(events, sequence);
System.out.println(result);
}
with MyEvent class annotated with Lombok.
@Getter
@Setter
@AllArgsConstructor
@ToString
public static class MyEvent implements Comparable<MyEvent> {
private String abbreviation;
private LocalDateTime eventTime;
@Override
public int compareTo(MyEvent o) {
return eventTime.compareTo(o.eventTime);
}
}
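For a sequence of arbitrary length, the search can also be delegated to Collections.lastIndexOfSubList, which returns the start index of the last occurrence of one list inside another, or -1 if there is none. The sketch below is an illustration under assumptions: SubListSearchSketch is a made-up wrapper class, MyEvent is written out without Lombok, and an empty sequence is treated as "no match".

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.stream.Collectors;

class SubListSearchSketch {

    // Stand-in for the MyEvent class, ordered by event time only.
    static final class MyEvent implements Comparable<MyEvent> {
        private final String abbreviation;
        private final LocalDateTime eventTime;
        MyEvent(String abbreviation, LocalDateTime eventTime) {
            this.abbreviation = abbreviation;
            this.eventTime = eventTime;
        }
        String getAbbreviation() { return abbreviation; }
        LocalDateTime getEventTime() { return eventTime; }
        @Override public int compareTo(MyEvent o) { return eventTime.compareTo(o.eventTime); }
        @Override public String toString() { return abbreviation + "@" + eventTime; }
    }

    static MyEvent getMostRecentLastEventOfSequence(SortedSet<MyEvent> events,
                                                    List<String> sequence) {
        if (sequence.isEmpty()) {
            return null; // assumption: an empty sequence never matches
        }
        // Copy to a list (ascending by time) and project to the abbreviations.
        List<MyEvent> ordered = new ArrayList<>(events);
        List<String> abbreviations = ordered.stream()
                .map(MyEvent::getAbbreviation)
                .collect(Collectors.toList());
        // Start index of the most recent occurrence, or -1 if there is none.
        int start = Collections.lastIndexOfSubList(abbreviations, sequence);
        return start < 0 ? null : ordered.get(start + sequence.size() - 1);
    }

    public static void main(String[] args) {
        LocalDateTime base = LocalDateTime.of(2020, 1, 1, 0, 0);
        SortedSet<MyEvent> events = new TreeSet<>();
        String[] abbreviations = {"AAA", "BBB", "CCC", "AAA", "BBB", "CCC"};
        for (int i = 0; i < abbreviations.length; i++) {
            events.add(new MyEvent(abbreviations[i], base.plusDays(i + 1)));
        }
        List<String> sequence = List.of("AAA", "BBB", "CCC");
        System.out.println(getMostRecentLastEventOfSequence(events, sequence));
    }
}
```

This trades the early exit of the backwards search for brevity, since the whole set is copied and mapped before the search starts.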

Java how to build a stream on top of an existing interface

I have an existing interface that allows me to access a theoretically infinite collection as follows:
List<Element> retrieve(int start, int end);
//example
retrieve(5, 10); // retrieves the elements 5 through 10.
Now I would like to build a Java stream on top of this existing interface so that I can stream as many elements as I need without requesting a large list at once.
How would I go about doing this?
I looked at examples of Java streams and all I can find are examples of how to create stream from collections that are completely in memory. I currently load in 30 elements at a time and do the necessary processing but it would be cleaner if I could abstract that logic away and just return a stream instead.
class Chunk implements Supplier<Element> {
private final Generator generator;
private final int chunkSize;
private List<Element> list = Collections.emptyList();
private int index = 0;
public Chunk(Generator generator, int chunkSize) {
assert chunkSize > 0;
this.generator = generator;
this.chunkSize = chunkSize;
}
@Override
public Element get() {
if (list.isEmpty()) {
list = generator.retrieve(index, index + chunkSize);
index += chunkSize;
}
return list.remove(0);
}
}
Here I'm assuming retrieve returns a mutable list. If not then you'd need to create a new ArrayList or equivalent at this point.
This can be used as Stream.generate(new Chunk(generator, 30)). It generates an infinite stream starting at index 0. You could add a constructor that allows the starting index to be set if that would be useful.
I assume you can't edit retrieve method.
You can do this:
IntStream.iterate(1, x -> x + 1).mapToObj(x -> retrieve(x, x).get(0))
If one term of the sequence depends on the previous term, this would mean recalculating every term up to n if you want the nth term.
This mitigates the problem by fetching the elements in chunks of 100:
IntStream.iterate(1, x -> x + 1).mapToObj(x -> retrieve(1 + (x - 1) * 100, x * 100)).flatMap(List::stream)
If you can edit what's behind that interface, you can just make that return a Stream<Element>, using IntStream.iterate as above.
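Another way to hide the chunking is to wrap it in an Iterator and hand that to StreamSupport, which also lets the stream end if the source ever runs dry. The sketch below makes several assumptions that the question does not confirm: Source is a hypothetical stand-in for the existing interface, retrieve(start, end) is taken to cover the half-open range [start, end), and an empty list is taken to mean that no more elements exist.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class ChunkedStreamSketch {

    // Hypothetical stand-in for the existing interface.
    // Assumption: returns the elements in [start, end), or an empty list at the end.
    interface Source<E> {
        List<E> retrieve(int start, int end);
    }

    static <E> Stream<E> stream(Source<E> source, int chunkSize) {
        Iterator<E> iterator = new Iterator<E>() {
            private Iterator<E> chunk = Collections.emptyIterator();
            private int nextStart = 0;
            private boolean exhausted = false;

            @Override
            public boolean hasNext() {
                // Fetch the next chunk only when the current one is used up.
                while (!chunk.hasNext() && !exhausted) {
                    List<E> list = source.retrieve(nextStart, nextStart + chunkSize);
                    nextStart += chunkSize;
                    if (list.isEmpty()) {
                        exhausted = true;
                    } else {
                        chunk = list.iterator();
                    }
                }
                return chunk.hasNext();
            }

            @Override
            public E next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return chunk.next();
            }
        };
        // Sequential, ordered stream of unknown size backed by the iterator.
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        // Toy source that just returns the requested indexes.
        Source<Integer> numbers = (start, end) ->
                IntStream.range(start, end).boxed().collect(Collectors.toList());
        stream(numbers, 30).limit(5).forEach(System.out::println);
    }
}
```

The iterator never mutates the returned lists, so it does not matter whether retrieve hands out mutable or immutable lists.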

Java - Basic streams usage of forEach

I have a class called Data which has only one method:
public boolean isValid()
I have a List of Data and I want to loop through them via a Java 8 stream. I need to count how many valid Data objects there are in this List and print out only the valid entries.
Below is how far I've gotten, but I don't understand how to finish it.
List<Data> ar = new ArrayList<>();
...
// ar is now full of Data objects.
...
int count = ar.stream()
.filter(Data::isValid)
.forEach(System.out::println)
.count(); // Compiler error, forEach() does not return type stream.
My second attempt: (horrible code)
List<Data> ar = new ArrayList<>();
...
// Must be final or compiler error will happen via inner class.
final AtomicInteger counter = new AtomicInteger();
ar.stream()
.filter(Data::isValid)
.forEach(d ->
{
System.out.println(d);
counter.incrementAndGet();
});
System.out.printf("There are %d/%d valid Data objects.%n", counter.get(), ar.size());
If you don’t need the original ArrayList, containing a mixture of valid and invalid objects, later-on, you might simply perform a Collection operation instead of the Stream operation:
ar.removeIf(d -> !d.isValid());
ar.forEach(System.out::println);
int count = ar.size();
Otherwise, you can implement it like
List<Data> valid = ar.stream().filter(Data::isValid).collect(Collectors.toList());
valid.forEach(System.out::println);
int count = valid.size();
Having storage for something you need multiple times is not so bad. If the list is really large, you can reduce the memory needed by (typically) a factor of 32, using
BitSet valid = IntStream.range(0, ar.size())
.filter(index -> ar.get(index).isValid())
.collect(BitSet::new, BitSet::set, BitSet::or);
valid.stream().mapToObj(ar::get).forEach(System.out::println);
int count = valid.cardinality();
Though, of course, you can also use
int count = 0;
for(Data d: ar) {
if(d.isValid()) {
System.out.println(d);
count++;
}
}
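peek is similar to forEach, except that it is an intermediate operation, so you can continue the stream afterwards. Note that the Javadoc describes peek as existing mainly to support debugging.
long count = ar.stream().filter(Data::isValid)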
.peek(System.out::println)
.count();

Encounter Order wrong when sorting a parallel stream

I have a Record class:
public class Record implements Comparable<Record>
{
private String myCategory1;
private int myCategory2;
private String myCategory3;
private String myCategory4;
private int myValue1;
private double myValue2;
public Record(String category1, int category2, String category3, String category4,
int value1, double value2)
{
myCategory1 = category1;
myCategory2 = category2;
myCategory3 = category3;
myCategory4 = category4;
myValue1 = value1;
myValue2 = value2;
}
// Getters here
}
I create a big list of a lot of records. Only the second and fifth values, i / 10000 and i, are used later, by the getters getCategory2() and getValue1() respectively.
List<Record> list = new ArrayList<>();
for (int i = 0; i < 115000; i++)
{
list.add(new Record("A", i / 10000, "B", "C", i, (double) i / 100 + 1));
}
Note that the first 10,000 records have a category2 of 0, the next 10,000 have 1, etc., while the value1 values are 0-114999 sequentially.
I create a Stream that is both parallel and sorted.
Stream<Record> stream = list.stream()
.parallel()
.sorted(
//(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
)
//.parallel()
;
I have a ForkJoinPool that maintains 8 threads, which is the number of cores I have on my PC.
ForkJoinPool pool = new ForkJoinPool(8);
I use the trick described here to submit a stream processing task to my own ForkJoinPool instead of the common ForkJoinPool.
List<Record> output = pool.submit(() ->
stream.collect(Collectors.toList()
)).get();
I expected that the parallel sorted operation would respect the encounter order of the stream, and that it would be a stable sort, because the Spliterator returned by ArrayList is ORDERED.
However, simple code that prints out the elements of the resultant List output in order shows that it's not quite the case.
for (Record record : output)
{
System.out.println(record.getValue1());
}
Output, condensed:
0
1
2
3
...
69996
69997
69998
69999
71875 // discontinuity!
71876
71877
71878
...
79058
79059
79060
79061
70000 // discontinuity!
70001
70002
70003
...
71871
71872
71873
71874
79062 // discontinuity!
79063
79064
79065
79066
...
114996
114997
114998
114999
The size() of output is 115000, and all elements appear to be there, just in a slightly different order.
So I wrote some checking code to see if the sort was stable. If it's stable, then all of the value1 values should remain in order. This code verifies the order, printing any discrepancies.
int prev = -1;
boolean verified = true;
for (Record record : output)
{
int curr = record.getValue1();
if (prev != -1)
{
if (prev + 1 != curr)
{
System.out.println("Warning: " + prev + " followed by " + curr + "!");
verified = false;
}
}
prev = curr;
}
System.out.println("Verified: " + verified);
Output:
Warning: 69999 followed by 71875!
Warning: 79061 followed by 70000!
Warning: 71874 followed by 79062!
Warning: 99999 followed by 100625!
Warning: 107811 followed by 100000!
Warning: 100624 followed by 107812!
Verified: false
This condition persists if I do any of the following:
Replace the ForkJoinPool with a ThreadPoolExecutor.
ThreadPoolExecutor pool = new ThreadPoolExecutor(8, 8, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(10));
Use the common ForkJoinPool by processing the Stream directly.
List<Record> output = stream.collect(Collectors.toList());
Call parallel() after I call sorted.
Stream<Record> stream = list.stream().sorted().parallel();
Call parallelStream() instead of stream().parallel().
Stream<Record> stream = list.parallelStream().sorted();
Sort using a Comparator. Note that this sort criterion is different from the "natural" order I defined for the Comparable interface, although, since the records are already in order from the beginning, the result should still be the same.
Stream<Record> stream = list.stream().parallel().sorted(
(r1, r2) -> Integer.compare(r1.getCategory2(), r2.getCategory2())
);
The encounter order is only preserved if I do one of the following on the Stream:
Don't call parallel().
Don't call any overload of sorted.
Interestingly, the parallel() without a sort preserved the order.
In both of the above cases, the output is:
Verified: true
My Java version is 1.8.0_05. This anomaly also occurs on Ideone, which appears to be running Java 8u25.
Update
I've upgraded my JDK to the latest version as of this writing, 1.8.0_45, and the problem is unchanged.
Question
Is the record order in the resultant List (output) out of order because the sort is somehow not stable, because the encounter order is not preserved, or some other reason?
How can I ensure that the encounter order is preserved when I create a parallel stream and sort it?
It looks like Arrays.parallelSort isn't stable in some circumstances. Well spotted. The stream parallel sort is implemented in terms of Arrays.parallelSort, so it affects streams as well. Here's a simplified example:
public class StableSortBug {
static final int SIZE = 50_000;
static class Record implements Comparable<Record> {
final int sortVal;
final int seqNum;
Record(int i1, int i2) { sortVal = i1; seqNum = i2; }
@Override
public int compareTo(Record other) {
return Integer.compare(this.sortVal, other.sortVal);
}
}
static Record[] genArray() {
Record[] array = new Record[SIZE];
Arrays.setAll(array, i -> new Record(i / 10_000, i));
return array;
}
static boolean verify(Record[] array) {
return IntStream.range(1, array.length)
.allMatch(i -> array[i-1].seqNum + 1 == array[i].seqNum);
}
public static void main(String[] args) {
Record[] array = genArray();
System.out.println(verify(array));
Arrays.sort(array);
System.out.println(verify(array));
Arrays.parallelSort(array);
System.out.println(verify(array));
}
}
On my machine (2 core x 2 threads) this prints the following:
true
true
false
Of course, it's supposed to print true three times. This is on the current JDK 9 dev builds. I wouldn't be surprised if it occurs in all the JDK 8 releases thus far, given what you've tried. Curiously, reducing the size or the divisor will change the behavior. A size of 20,000 and a divisor of 10,000 is stable, and a size of 50,000 and a divisor of 1,000 is also stable. It seems like the problem has to do with a sufficiently large run of values comparing equal versus the parallel split size.
The OpenJDK issue JDK-8076446 covers this bug.
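Independently of that bug, one way to stop depending on sort stability is to make the comparator a total order by adding a tie-breaker. The sketch below is only an illustration and not part of the answer above: TieBreakSortSketch and sortParallel are made-up names, Record is cut down to the two fields used in the question, and it assumes that value1 reflects the original position of each record, as it does in the question's test data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TieBreakSortSketch {

    // Reduced stand-in for the question's Record class.
    static final class Record {
        private final int category2;
        private final int value1;
        Record(int category2, int value1) {
            this.category2 = category2;
            this.value1 = value1;
        }
        int getCategory2() { return category2; }
        int getValue1() { return value1; }
    }

    // Sorts by category2 and breaks ties by value1, so no two distinct records
    // compare equal and the result does not depend on the sort being stable.
    // Assumption: value1 encodes the original encounter order.
    static List<Record> sortParallel(List<Record> records) {
        return records.parallelStream()
                .sorted(Comparator.comparingInt(Record::getCategory2)
                        .thenComparingInt(Record::getValue1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Record> list = new ArrayList<>();
        for (int i = 0; i < 115000; i++) {
            list.add(new Record(i / 10000, i));
        }
        List<Record> output = sortParallel(list);
        boolean verified = true;
        for (int i = 0; i < output.size(); i++) {
            verified &= output.get(i).getValue1() == i;
        }
        System.out.println("Verified: " + verified);
    }
}
```

If no field records the original position, an index would have to be attached to each element before sorting, or the sort could be done sequentially.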
