I need to process events coming from a Flux in groups (by id) so that within an individual group each event is processed sequentially, but groups are processed in parallel. As far as I know, this can be achieved with groupBy and concatMap.
When I implemented this, my tests started to hang indefinitely for large numbers of unique ids. I isolated the problem to the code below and found the specific number at which the code starts to hang: 256. I don't understand why this happens at all, or where 256 comes from.
Here is the code which hangs:
@ParameterizedTest
@ValueSource(ints = {250, 251, 252, 253, 254, 255, 256})
void freezeTest(int uniqueStringsCount) {
var scheduler = Schedulers
.newBoundedElastic(
1000,
1000,
"really-big-scheduler"
);
Flux.range(0, uniqueStringsCount)
.map(Object::toString)
.repeat()
// this represents "a lot of events"
.take(50_000)
.groupBy(x -> x)
// this gets the same results
// .parallel(400)
.parallel()
.flatMap(group ->
group.concatMap(e ->
// this represents a processing operation on each event
Mono.fromRunnable(() -> {
try {
Thread.sleep(0);
} catch (InterruptedException ex) {
throw new RuntimeException(ex);
}
})
// this also doesn't work
// Mono.delay(Duration.ofMillis(0))
// Mono.empty()
// big scheduler doesn't help either
// ).subscribeOn(scheduler)
)
// ).runOn(scheduler)
).runOn(Schedulers.parallel())
.then()
.block();
}
We first construct a Flux with a lot of Strings (50k, just as an example). There are only a limited number of unique strings in that Flux, so it is split into that number of groups, which are processed in parallel, while events within each group are processed sequentially via concatMap. This code hangs only with 256 unique strings.
Initially, I thought that some thread pool somewhere is exhausted, so I added a really-big-scheduler to test that - but it only executes slower and also hangs on 256.
Then I tried removing the blocking Thread.sleep (I started with that since my real implementation is possibly blocking), but it also hangs on 256.
Also, changing parallelism (400 in the code above) doesn't change anything.
Flux.groupBy needs extra care when dealing with a large amount of groups, as stated in its javadoc:
Note that groupBy works best with a low cardinality of groups, so chose your keyMapper function accordingly.
The groups need to be drained and consumed downstream for groupBy to work correctly. Notably when the criteria produces a large amount of groups, it can lead to hanging if the groups are not suitably consumed downstream (eg. due to a flatMap with a maxConcurrency parameter that is set too low).
Here the prefetch amount is set too low: by default it is set to Queues.SMALL_BUFFER_SIZE, which is 256 by default (this can be changed with the property reactor.bufferSize.small). Flux.groupBy has an overload that sets the prefetch amount manually, Flux.groupBy(Function, int), so I suggest replacing your operator with .groupBy(x -> x, 1024) or another suitably high amount.
The prefetch amount matters because it is the number of uncompleted items groupBy can hold. In your case, the first 255 Scheduler.createWorker() calls are made, each item is put on a Worker, and the item and the created GroupedFlux are then put in groupBy's internal queues, waiting for the Worker to complete. When the 256th item appears before any Worker completes, groupBy is unable to put it in the queues, and hangs.
I have part of a system that processes a BlockingQueue of input items within a worker thread, and puts the results on a BlockingQueue of output items, where the relevant code (simplified) looks something like this:
while (running()) {
InputObject a=inputQueue.take(); // Get from input BlockingQueue
OutputObject b=doProcessing(a); // Process the item
outputQueue.put(b); // Place on output BlockingQueue
}
doProcessing is the main performance bottleneck in this code, but the processing of queue items could be parallelised since the processing steps are all independent of each other.
I would therefore like to improve this so that items can be processed concurrently by multiple threads, with the constraint that this must not change the order of outputs (e.g. I can't simply have 10 threads running the loop above, because that might result in outputs being ordered differently depending on processing times).
What is the best way to achieve this in pure, idiomatic Java?
Parallel streams from List preserve ordering:
List<T> input = ...
List<T> output = input.parallelStream()
.filter(this::running)
.map(this::doProcessing)
.collect(Collectors.toList());
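As a minimal sketch of the ordering claim above (the class name OrderedParallelMap and the method slowSquare are hypothetical stand-ins for the real processing step, not part of the original code), the following maps a list in parallel while later inputs finish sooner than earlier ones. Because the source list is ordered and the result is collected with Collectors.toList(), the output should still follow the input order.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OrderedParallelMap {
    // Hypothetical stand-in for doProcessing: later inputs finish sooner.
    static int slowSquare(int x) {
        try {
            Thread.sleep(Math.max(0, 20 - x));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return x * x;
    }

    static List<Integer> process(List<Integer> input) {
        // The map calls may run on several threads, but collecting an
        // ordered stream into a list keeps the encounter order of the input.
        return input.parallelStream()
                .map(OrderedParallelMap::slowSquare)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = IntStream.range(0, 20).boxed().collect(Collectors.toList());
        System.out.println(process(input));
    }
}
```

Note that this only works on a finite batch of items that is already available as a list, not on a queue that is still being filled.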
PriorityBlockingQueue can be used if your work items can be compared to one another, and you will wait until running() is false before reading from the output queue:
outputQueue = new PriorityBlockingQueue<>();
Or you could order them after they have all been processed (if they can be compared to one another):
outputQueue.drainTo(outputList);
outputList.sort(null);
A simple way to implement the comparison would be to assign a sequential ID to each element as it is put into the input queue.
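The sequential-ID idea can be sketched roughly as follows. The names SequencedOutput, Sequenced and processAll are hypothetical, and toUpperCase() merely stands in for doProcessing. As described above, the output queue is only read after all processing has finished.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

class SequencedOutput {
    // Hypothetical wrapper pairing a result with the position of its input.
    static final class Sequenced {
        final long seq;
        final String value;

        Sequenced(long seq, String value) {
            this.seq = seq;
            this.value = value;
        }
    }

    static List<String> processAll(List<String> inputs, int threads) {
        PriorityBlockingQueue<Sequenced> out = new PriorityBlockingQueue<>(
                11, Comparator.comparingLong((Sequenced s) -> s.seq));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long next = 0;
        for (String input : inputs) {
            final long id = next++; // ID assigned in input order
            // toUpperCase() is a placeholder for the real processing step.
            pool.execute(() -> out.put(new Sequenced(id, input.toUpperCase())));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("processing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // All work is done, so polling now yields results in ID order.
        List<String> ordered = new ArrayList<>();
        for (Sequenced s = out.poll(); s != null; s = out.poll()) {
            ordered.add(s.value);
        }
        return ordered;
    }
}
```

Reading from the priority queue while workers are still running would not be safe in this sketch, because a result with a smaller ID could still arrive later.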
Create X event-loop threads, where X is the amount of steps that can be processed in parallel.
The steps will run in parallel, but each on a different item: while one step is being carried out on one item, the previous step is being carried out on the next item in the queue, and so on.
To further optimize it, you can use concurrent queues provided by JCTools, which are optimized for Single-Producer Single-Consumer scenarios (JDK's BlockingQueue implementations support Multiple-Producer Multiple-Consumer).
// Thread 1
while (running()) {
InputObject a = inputQueue.take();
OutputObject b = doProcessingStep1(a);
queue1.put(b);
}
// Thread 2
while (running()) {
InputObject a = queue1.take();
OutputObject b = doProcessingStep2(a);
queue2.put(b);
}
// Thread 3
while (running()) {
InputObject a = queue2.take();
OutputObject b = doProcessingStep3(a);
outputQueue.put(b);
}
I want to change my code to use a single subscriber. Currently I have:
auctionFlux.window(Duration.ofSeconds(120), Duration.ofSeconds(120)).subscribe(
s -> s.groupBy(Auction::getItem).subscribe( longAuctionGroupedFlux -> longAuctionGroupedFlux.reduce(new ItemDumpStats(), this::calculateStats )
));
This code works correctly; the reduce method is very simple. I tried to change my code to use a single subscriber:
auctionFlux.window(Duration.ofSeconds(120), Duration.ofSeconds(120))
.flatMap(window -> window.groupBy(Auction::getItem))
.flatMap(longAuctionGroupedFlux -> longAuctionGroupedFlux.reduce(new ItemDumpStats(), this::calculateStats))
.subscribe(itemDumpStatsMono -> log.info(itemDumpStatsMono.toString()));
This is my code, and it does not work: no errors and no results. After debugging I found that the code gets stuck on the second flatMap, where I reduce the stream. I think the problem is in the flatMap merging, which gets stuck resolving the Mono. Does anyone know how to fix this problem and use only a single subscriber?
To replicate, you can use another class or create one. With a small input it works, but with a bigger one it hangs:
List<Auction> auctionList = new ArrayList<>();
for (int i = 0;i<100000;i++){
Auction a = new Auction((long) i, "test");
a.setItem((long) (i%50));
auctionList.add(a);
}
Flux.fromIterable(auctionList).groupBy(Auction::getId).flatMap(longAuctionGroupedFlux ->
longAuctionGroupedFlux.reduce(new ItemDumpStats(), (itemDumpStats, auction) -> itemDumpStats)).collectList().subscribe(itemDumpStats -> System.out.println(itemDumpStats.toString()));
This approach gives an instant result, but I am using 3 subscribers:
Flux.fromIterable(auctionList)
.groupBy(Auction::getId)
.subscribe(
auctionIdAuctionGroupedFlux -> auctionIdAuctionGroupedFlux.reduce(new ItemDumpStats(), (itemDumpStats, auction) -> itemDumpStats).subscribe(itemDumpStats -> System.out.println(itemDumpStats.toString()
)
));
I think the behavior you described is related to the interaction between groupBy chained with flatMap.
Check groupBy documentation. It states that:
The groups need to be drained and consumed downstream for groupBy to work correctly. Notably when the criteria produces a large amount of groups, it can lead to hanging if the groups are not suitably consumed downstream (eg. due to a flatMap with a maxConcurrency parameter that is set too low).
By default, maxConcurrency (flatMap) is set to 256 (I checked the source code of 3.2.2). So, selecting more than 256 groups may cause the execution to hang (particularly when all execution happens on the same thread).
The following code helps in understanding what happens when you chain the operators groupBy and flatMap:
@Test
public void groupAndFlatmapTest() {
val groupCount = 257;
val groupSize = 513;
val list = rangeClosed(1, groupSize * groupCount).boxed().collect(Collectors.toList());
val source = Flux.fromIterable(list)
.groupBy(i -> i % groupCount)
.flatMap(Flux::collectList);
StepVerifier.create(source).expectNextCount(groupCount).expectComplete().verify();
}
The execution of this code hangs. Changing groupCount to 256 or less makes the test pass (for every value of groupSize).
So, regarding your original problem, it is very possible that you are creating a large amount of groups with your key-selector Auction::getItem.
Adding parallel fixed the problem, but I am still looking for an explanation of why reduce slows flatMap down so dramatically.
If the input size is too small the library automatically serializes the execution of the maps in the stream, but this automation doesn't and can't take into account how heavy the map operation is. Is there a way to force parallelStream() to actually parallelize CPU heavy maps?
There seems to be a fundamental misunderstanding. The linked Q&A discusses that the stream apparently doesn’t work in parallel, due to the OP not seeing the expected speedup. The conclusion is that there is no benefit in parallel processing if the workload is too small, not that there was an automatic fallback to sequential execution.
It’s actually the opposite. If you request parallel, you get parallel, even if it actually reduces the performance. The implementation does not switch to the potentially more efficient sequential execution in such cases.
So if you are confident that the per-element workload is high enough to justify the use of a parallel execution regardless of the small number of elements, you can simply request a parallel execution.
As can easily be demonstrated:
Stream.of(1, 2).parallel()
.peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
.forEach(System.out::println);
On Ideone, it prints
processing 2 in Thread[main,5,main]
2
processing 1 in Thread[ForkJoinPool.commonPool-worker-1,5,main]
1
but the order of messages and details may vary. It may even be possible that in some environments, both tasks happen to get executed by the same thread, if it can steal the second task before another thread gets started to pick it up. But of course, if the tasks are expensive enough, this won’t happen. The important point is that the overall workload has been split and enqueued to be potentially picked up by other worker threads.
If execution by a single thread happens in your environment for the simple example above, you may insert simulated workload like this:
Stream.of(1, 2).parallel()
.peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
.map(x -> {
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(3));
return x;
})
.forEach(System.out::println);
Then, you may also see that the overall execution time will be shorter than “number of elements”דprocessing time per element” if the “processing time per element” is high enough.
Update: the misunderstanding might be caused by Brian Goetz’ misleading statement: “In your case, your input set is simply too small to be decomposed”.
It must be emphasized that this is not a general property of the Stream API, but the Map that has been used. A HashMap has a backing array and the entries are distributed within that array depending on their hash code. It might be the case that splitting the array into n ranges doesn’t lead to a balanced split of the contained element, especially, if there are only two. The implementors of the HashMap’s Spliterator considered searching the array for elements to get a perfectly balanced split to be too expensive, not that splitting two elements was not worth it.
Since the HashMap’s default capacity is 16 and the example had only two elements, we can say that the map was oversized. Simply fixing that would also fix the example:
long start = System.nanoTime();
Map<String, Supplier<String>> input = new HashMap<>(2);
input.put("1", () -> {
System.out.println(Thread.currentThread());
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
return "a";
});
input.put("2", () -> {
System.out.println(Thread.currentThread());
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
return "b";
});
Map<String, String> results = input.keySet()
.parallelStream().collect(Collectors.toConcurrentMap(
key -> key,
key -> input.get(key).get()));
System.out.println("Time: " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime()- start));
On my machine, it prints
Thread[main,5,main]
Thread[ForkJoinPool.commonPool-worker-1,5,main]
Time: 2058
The conclusion is that the Stream implementation always tries to use parallel execution, if you request it, regardless of the input size. But it depends on the input’s structure how well the workload can be distributed to the worker threads. Things could be even worse, e.g. if you stream lines from a file.
If you think that the benefit of a balanced splitting is worth the cost of a copying step, you could also use new ArrayList<>(input.keySet()).parallelStream() instead of input.keySet().parallelStream(), as the distribution of elements within ArrayList always allows a perfectly balanced split.
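To see the difference in splitting behaviour for yourself, a small sketch like the following can help. The class SplitBalance and its method splitOnce are hypothetical helpers: they split a Spliterator once and count how many elements end up in each part. For the two-element ArrayList copy, the split yields one element per part. How the default-capacity HashMap key set divides depends on where the keys hash to in the backing array, so the sketch only prints that result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Spliterator;

class SplitBalance {
    // Splits once and returns {elements in the split-off part, elements left}.
    static long[] splitOnce(Spliterator<String> rest) {
        Spliterator<String> prefix = rest.trySplit();
        long[] sizes = new long[2];
        if (prefix != null) {
            prefix.forEachRemaining(k -> sizes[0]++);
        }
        rest.forEachRemaining(k -> sizes[1]++);
        return sizes;
    }

    public static void main(String[] args) {
        Map<String, Integer> input = new HashMap<>(); // default capacity of 16
        input.put("1", 1);
        input.put("2", 2);

        long[] fromMap = splitOnce(input.keySet().spliterator());
        long[] fromList = splitOnce(new ArrayList<>(input.keySet()).spliterator());

        // The HashMap split depends on bucket positions and may be uneven.
        System.out.println("HashMap keySet: " + fromMap[0] + " / " + fromMap[1]);
        // The ArrayList split is by index range, so two elements split 1 / 1.
        System.out.println("ArrayList copy: " + fromList[0] + " / " + fromList[1]);
    }
}
```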
According to the documentation of groupBy:
Note: A GroupedObservable will cache the items it is to emit until such time as it is subscribed to. For this reason, in order to avoid memory leaks, you should not simply ignore those GroupedObservables that do not concern you. Instead, you can signal to them that they may discard their buffers by applying an operator like take(int)(0) to them.
There's a RxJava tutorial which says:
Internally, every Rx operator does 3 things:
1. It subscribes to the source and observes the values.
2. It transforms the observed sequence according to the operator's purpose.
3. It pushes the modified sequence to its own subscribers, by calling onNext, onError and onCompleted.
Let's take a look at the following code block which extracts only even numbers from range(0, 10):
Observable.range(0, 10)
.groupBy(i -> i % 2)
.filter(g -> g.getKey() % 2 == 0)
.flatMap(g -> g)
.subscribe(System.out::println, Throwable::printStackTrace);
My questions are:
1. Does it mean the filter operator already implies a subscription to every group resulting from groupBy, or just to the Observable<GroupedObservable> one?
2. Will there be a memory leak in this case?
3. If so, how do I properly discard those groups? Replace filter with a custom one, which does a take(0) followed by a return Observable.empty()? You may ask why I don't just return take(0) directly: it's because filter doesn't necessarily follow right after groupBy, but can be anywhere in the chain and involve more complex conditions.
Apart from the memory leak, the current implementation may end up hanging completely due to internal request coordination problems.
Note that when using take(0), the group may be recreated all the time. I'd instead use ignoreElements, which drops the values: no items reach flatMap, and the group itself won't be recreated all the time.
Your suspicions are correct in that to properly handle the grouped observable each of the inner observables (g) must be subscribed to. As filter is subscribing to the outer observable only it's a bad idea. Just do what you need in the flatMap using ignoreElements to filter out undesired groups.
Observable.range(0, 10)
.groupBy(i -> i % 2)
.flatMap(g -> {
if (g.getKey() % 2 == 0)
return g;
else
return g.ignoreElements();
})
.subscribe(System.out::println, Throwable::printStackTrace);
I am working on a large scale dataset and after building a model, I use multithreading (whole project in Java) as follows:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();
// For each entry in the test file, do whatever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
KDDCupDataModel.getTestFile(dataFileDirectory))) {
PreferenceArray userTest = tests.getFirst();
callables.add(new Track1Callable(recommender, userTest));
i++;
}
ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();
for (Future<byte[]> result : results) {
for (byte estimate : result.get()) {
out.write(estimate);
}
}
out.flush();
out.close();
When I receive the result from each callable, I output it to a file. Is the output written in the exact order in which the list of initial Callables was built, even though some complete before others? It seems it should be, but I am not sure.
Also, I expect a total of 6.2 million bytes to be written to the outfile, but I get an additional 2000 bytes (yeah, for free). That messes up my submission, and I think it is because of some concurrency issue. I tested this on a small dataset and it seems to work fine there (264 bytes expected and received).
Am I doing anything wrong with the Executor framework or Futures?
Q: Is the order the same as the one specified for the tasks? Yes.
From the API:
Returns: A list of Futures representing the tasks, in the same sequential order as produced by the iterator for the given task list. If the operation did not time out, each task will have completed. If it did time out, some of these tasks will not have completed.
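A minimal sketch to check this ordering guarantee yourself (the class InvokeAllOrder and its tasks are hypothetical, not the asker's Track1Callable): earlier tasks deliberately sleep longer, so they complete last, yet the Futures returned by invokeAll still line up with the order of the task list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllOrder {
    static List<Integer> run(int threads) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            tasks.add(() -> {
                Thread.sleep((8 - id) * 10L); // earlier tasks take longer
                return id;
            });
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Integer> results = new ArrayList<>();
            // The Futures come back in task-list order, not completion order.
            for (Future<Integer> future : executor.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(4));
    }
}
```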
As for the "extra" bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).
The order in which the callables are executed doesn't matter in the code you have here. You write the results in the order in which you store the futures in the list. Even if they were executed in reverse order, the file should appear the same, as your file writing is single threaded.
I suspect your callables are interacting with each other and you get different results depending on the number of cores you use, e.g. you might be using SimpleDateFormat.
I suggest you run this twice in the same program with a dataset which completes in a short time. Run it first with only one thread in the thread pool and a second time with 24 threads. You should be able to compare the results from both runs with Arrays.equals(byte[], byte[]) and see that you get exactly the same results.
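The comparison could look roughly like the sketch below. RunComparison, task and runWith are hypothetical names, and the task is a simple pure function standing in for Track1Callable. If the real callables share mutable state, the two byte arrays may differ, which is exactly what this check is meant to reveal.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RunComparison {
    // Hypothetical stand-in for Track1Callable: depends only on its input.
    static Callable<byte[]> task(int n) {
        return () -> new byte[] {(byte) n, (byte) (n * 31)};
    }

    // Runs all tasks on a pool of the given size and concatenates the results
    // in task-list order, as the file-writing loop in the question does.
    static byte[] runWith(int threads, int taskCount) {
        List<Callable<byte[]>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            tasks.add(task(i));
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (Future<byte[]> future : executor.invokeAll(tasks)) {
                byte[] estimate = future.get();
                out.write(estimate, 0, estimate.length);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] single = runWith(1, 1000);
        byte[] parallel = runWith(24, 1000);
        System.out.println("identical: " + Arrays.equals(single, parallel));
    }
}
```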