Does windowAll operator in Flink scale down the parallelization to 1?

Does windowAll operator in Flink scale down the parallelization to 1? - java

I have a stream in Flink which sends cubes from a source, make a transformation on the cube(adding 1 to each element in the cube), then lastly send it downstream to print the throughput of every second.
The stream is parallelized over 4 threads.
If I understand correctly the windowAll operator is a non-parallel transformation and should therefore scale down the parallelization back to 1, and by using it together with TumblingProcessingTimeWindows.of(Time.seconds(1)), sum the throughput of all parallelized subtasks within the latest second and print it. I'm unsure if I get correct output since the throughput every second is printed like this:
1> 25
2> 226
3> 354
4> 372
1> 382
2> 403
3> 363
...
Question: Does the stream printer print the throughput from each thread(1,2,3 & 4), or is it only that it chooses e.g. thread 3 to print the throughput sum of all the subtasks on?
When I set the parallelism of the environment to 1 in the beginningenv.setParallelism(1), I don't get the "x> " before the throughput, but I seem to get the same(or even better) throughput as when it is set to 4. Like this:
45
429
499
505
1
503
524
530
...
Here is a code-snippet of the program:
imports...
public class StreamingCase {
public static void main(String[] args) throws Exception {
int parallelism = 4;
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.setParallelism(parallelism);
DataStream<Cube> start = env
.addSource(new CubeSource());
DataStream<Cube> adder = start
.map(new MapFunction<Cube, Cube>() {
#Override
public Cube map(Cube cube) throws Exception {
return cube.cubeAdd(1);
}
});
DataStream<Integer> throughput = ((SingleOutputStreamOperator<Cube>) adder)
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
.apply(new AllWindowFunction<Cube, Integer, TimeWindow>() {
#Override
public void apply(TimeWindow tw,
Iterable<Cube> values,
Collector<Integer> out) throws Exception {
int sum = 0;
for (Cube c : values)
sum++;
out.collect(sum);
}
});
throughput.print();
env.execute("Cube Stream of Sweetness");
}
}

If the parallelism of the environment is set to 3 and you are using a WindowAll operator, only the window operator runs in parallelism 1. The sink will still be running with parallelism 3. Hence, the plan looks as follows:
In_1 -\ /- Out_1
In_2 --- WindowAll_1 --- Out_2
In_3 -/ \- Out_3
The WindowAll operator emits its output to its subsequent tasks using a round-robin strategy. That's the reason for the different threads emitting the result records of program.
When you set the environment parallelism to 1, all operators run with a single task.

Related

How to block until next data is emitted from hot Flux?

I have some function that return some Flux<Integer>. This flux is hot, it is being emitting live data. After some time of execution, I want to block until the next Integer is emitted, and assign to a variable. This Integer may not be the first and will not be the last.
I considered blockFirst(), but this would block indefinitely as the Flux has already emitted an Integer. I do not care if the Integer is the first or last in the Flux, I just want to block till the next Integer is emitted and assign it to a variable.
How would I do this? I think I could subscribe to the Flux, and after the first value, unsubscribe, but I am hoping there is a better way to do this.

It depends on the replay/buffering behavior of your hot flux. Both blockFirst() and next() operator do the same things: they wait for the first value received in the current subscription.
It is very important to understand that, because in the case of hot fluxes, subscription is independent of source data emission. The first value is not necessarily the first value emitted upstream. It is the first value received by your current subscriber, and that depends on the upstream flow behaviour.
In case of hot fluxes, how they pass values to the subscribers depends both on their buffering and broadcast strategies. For this answer, I will focus only on the buffering aspect:
If your hot flux does not buffer any emitted value (Ex: Flux.share(), Sinks.multicast().directBestEffort()), then both blockFirst() and next().block() operators meet your requirement: wait until the next emitted live data in a blocking fashion.
NOTE: next() has the advantage to allow to become non-blocking if replacing block with cache and subscribe
If your upstream flux does buffer some past values, then your subscriber / downsream flow will not only receive live stream. Before it, it will receive (part of) upstream history. In such case, you will have to use a more advanced strategy to skip values until the one you want.
From your question, I would say that skipping values until an elapsed time has passed can be done using skipUntilOther(Mono.delay(wantedDuration)).
But be careful, because the delay starts from your subscription, not from upstream subscription (to do so, you would require the upstream to provide timed elements, and switch to another strategy).
It is also important to know that Reactor forbids calling block() from some Threads (the one used by non-elastic schedulers).
Let's demonstrate all of that with code. In the below program, there's 4 examples:
Use next/blockFirst directly on a non-buffering hot flux
Use skipUntilOther on a buffering hot flux
Show that blocking can fail sometimes
Try to avoid block operation
All examples are commented for clarity, and launched sequentially in a main function:
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
public class HotFlux {
public static void main(String[] args) throws Exception {
System.out.println("== 1. Hot flux without any buffering");
noBuffer();
System.out.println("== 2. Hot flux with buffering");
withBuffer();
// REMINDER: block operator is not always accepted by Reactor
System.out.println("== 3. block called from a wrong context");
blockFailsOnSomeSchedulers();
System.out.println("== 4. Next value without blocking");
avoidBlocking();
}
static void noBuffer() {
// Prepare a hot flux thanks to share().
var hotFlux = Flux.interval(Duration.ofMillis(100))
.share();
// Prepare an operator that fetch the next value from live stream after a delay.
var nextValue = Mono.delay(Duration.ofMillis(300))
.then(hotFlux.next());
// Launch live data emission
var livestream = hotFlux.subscribe(i -> System.out.println("Emitted: "+i));
try {
// Trigger value fetching after a delay
var value = nextValue.block();
System.out.println("next() -> " + value);
// Immediately try to block until next value is available
System.out.println("blockFirst() -> " + hotFlux.blockFirst());
} finally {
// stop live data production
livestream.dispose();
}
}
static void withBuffer() {
// Prepare a hot flux replaying all values emitted in the past to each subscriber
var hotFlux = Flux.interval(Duration.ofMillis(100))
.cache();
// Launch live data emission
var livestream = hotFlux.subscribe(i -> System.out.println("Emitted: "+i));
try {
// Wait half a second, then get next emitted value.
var value = hotFlux.skipUntilOther(Mono.delay(Duration.ofMillis(500)))
.next()
.block();
System.out.println("skipUntilOther + next: " + value);
// block first can also be used
value = hotFlux.skipUntilOther(Mono.delay(Duration.ofMillis(500)))
.blockFirst();
System.out.println("skipUntilOther + blockFirst: " + value);
} finally {
// stop live data production
livestream.dispose();
}
}
public static void blockFailsOnSomeSchedulers() throws InterruptedException {
var hotFlux = Flux.interval(Duration.ofMillis(100)).share();
var barrier = new CountDownLatch(1);
var forbiddenInnerBlock = Mono.delay(Duration.ofMillis(200))
// This block will fail, because delay op above is scheduled on parallel scheduler by default.
.then(Mono.fromCallable(() -> hotFlux.blockFirst()))
.doFinally(signal -> barrier.countDown());
forbiddenInnerBlock.subscribe(value -> System.out.println("Block success: "+value),
err -> System.out.println("BLOCK FAILED: "+err.getMessage()));
barrier.await();
}
static void avoidBlocking() throws InterruptedException {
var hotFlux = Flux.interval(Duration.ofMillis(100)).share();
var asyncValue = hotFlux.skipUntilOther(Mono.delay(Duration.ofMillis(500)))
.next()
// time wil let us verify that the value has been fetched once then cached properly
.timed()
.cache();
asyncValue.subscribe(); // launch immediately
// Barrier is required because we're in a main/test program. If you intend to reuse the mono in a bigger scope, you do not need it.
CountDownLatch barrier = new CountDownLatch(2);
// We will see that both subscribe methods return the same timestamped value, because it has been launched previously and cached
asyncValue.subscribe(value -> System.out.println("Get value (1): "+value), err -> barrier.countDown(), () -> barrier.countDown());
asyncValue.subscribe(value -> System.out.println("Get value (2): "+value), err -> barrier.countDown(), () -> barrier.countDown());
barrier.await();
}
}
This program gives the following output:
== 1. Hot flux without any buffering
Emitted: 0
Emitted: 1
Emitted: 2
Emitted: 3
next() -> 3
Emitted: 4
blockFirst() -> 4
== 2. Hot flux with buffering
Emitted: 0
Emitted: 1
Emitted: 2
Emitted: 3
Emitted: 4
Emitted: 5
skipUntilOther + next: 5
Emitted: 6
Emitted: 7
Emitted: 8
Emitted: 9
Emitted: 10
Emitted: 11
skipUntilOther + blockFirst: 11
== 3. block called from a wrong context
BLOCK FAILED: block()/blockFirst()/blockLast() are blocking, which is not supported in thread parallel-6
== 4. Next value without blocking
Get value (1): Timed(4){eventElapsedNanos=500247504, eventElapsedSinceSubscriptionNanos=500247504, eventTimestampEpochMillis=1674831275873}
Get value (2): Timed(4){eventElapsedNanos=500247504, eventElapsedSinceSubscriptionNanos=500247504, eventTimestampEpochMillis=1674831275873}

Java Spliterator Continually Splits Parallel Stream

I found some surprising behavior with Java parallel streams. I made my own Spliterator, and the resulting parallel stream gets divided up until each stream has only one element in it. That seems way too small and I wonder what I'm doing wrong. I'm hoping there's some characteristics I can set to correct this.
Here's my test code. The Float here is just a dummy payload, my real stream class is somewhat more complicated.
public static void main( String[] args ) {
TestingSpliterator splits = new TestingSpliterator( 10 );
Stream<Float> test = StreamSupport.stream( splits, true );
double total = test.mapToDouble( Float::doubleValue ).sum();
System.out.println( "Total: " + total );
}
This code will continually split this stream until each Spliterator has only one element. That seems way too much to be efficient.
Output:
run:
Split on count: 10
Split on count: 5
Split on count: 3
Split on count: 5
Split on count: 2
Split on count: 2
Split on count: 3
Split on count: 2
Split on count: 2
Total: 5.164293184876442
BUILD SUCCESSFUL (total time: 0 seconds)
Here's the code of the Spliterator. My main concern is what characteristics I should be using, but perhaps there's a problem somewhere else?
public class TestingSpliterator implements Spliterator<Float> {
int count;
int splits;
public TestingSpliterator( int count ) {
this.count = count;
}
#Override
public boolean tryAdvance( Consumer<? super Float> cnsmr ) {
if( count > 0 ) {
cnsmr.accept( (float)Math.random() );
count--;
return true;
} else
return false;
}
#Override
public Spliterator<Float> trySplit() {
System.err.println( "Split on count: " + count );
if( count > 1 ) {
splits++;
int half = count / 2;
TestingSpliterator newSplit = new TestingSpliterator( count - half );
count = half;
return newSplit;
} else
return null;
}
#Override
public long estimateSize() {
return count;
}
#Override
public int characteristics() {
return IMMUTABLE | SIZED;
}
}
So how can I get the stream to be split in to much larger chunks? I was hoping in the neighborhood of 10,000 to 50,000 would be better.
I know I can return null from the trySplit() method, but that seems like a backwards way of doing it. It seems like the system should have some notion of number of cores, current load, and how complex the code is that uses the stream, and adjust itself accordingly. In other words, I want the stream chunk size to be externally configured, not internally fixed by the stream itself.
EDIT: re. Holger's answer below, when I increase the number of elements in the original stream, the stream splits are somewhat less, so StreamSupport does stop splitting eventually.
At an initial stream size of 100 elements, StreamSupport stops splitting when it reaches a stream size of 2 (the last line I see on my screen is Split on count: 4).
And for an initial stream size of 1000 elements, the final size of the individual stream chunks is about 32 elements.
Edit part deux: After looking at the output of the above, I changed my code to list out the individual Spliterators created. Here's the changes:
public static void main( String[] args ) {
TestingSpliterator splits = new TestingSpliterator( 100 );
Stream<Float> test = StreamSupport.stream( splits, true );
double total = test.mapToDouble( Float::doubleValue ).sum();
System.out.println( "Total Spliterators: " + testers.size() );
for( TestingSpliterator t : testers ) {
System.out.println( "Splits: " + t.splits );
}
}
And to the TestingSpliterator's ctor:
static Queue<TestingSpliterator> testers = new ConcurrentLinkedQueue<>();
public TestingSpliterator( int count ) {
this.count = count;
testers.add( this ); // OUCH! 'this' escape
}
The result of this code is that the first Spliterator gets split 5 times. The nextSpliterator gets split 4 times. The next set of Spliterators get split 3 times. Etc. The result is that 36 Spliterators get made and the stream is split into as many parts. On typical desktop systems this seems to be the way that the API thinks is the best for parallel operations.
I'm going to accept Holger's answer below, which is essentially that the StreamSupport class is doing the right thing, don't worry, be happy. Part of the issue for me was that I was doing my early testing on very small stream sizes and I was surprised at the number of splits. Don't make the same mistake yourself.

You are looking on it from the wrong angle. The implementation did not split “until each spliterator has one element”, it rather split “until having ten spliterators”.
A single spliterator instance can only be processed by one thread. A spliterator is not required to support splitting after its traversal has been started. Therefore any splitting opportunity that has not been used beforehand may lead to limited parallel processing capabilities afterwards.
It’s important to keep in mind that the Stream implementation received a ToDoubleFunction with an unknown workload¹. It doesn’t know that it is as simple as Float::doubleValue in your case. It could be a function taking a minute to evaluate and then, having a spliterator per CPU core would be righteous right. Even having more than CPU cores is a valid strategy to handle the possibility that some evaluations take significantly longer than others.
A typical number of initial spliterators will be “number of CPU cores” × 4, though here might be more split operations later when more knowledge about actual workloads exist. When your input data has less than that number, it’s not surprising when it gets split down until one element per spliterator is left.
You may try with new TestingSpliterator( 10000 ) or 1000 or 100 to see that the number of splits will not change significantly, once the implementation assumes to have enough chunks to keep all CPU cores busy.
Since your spliterator does not know anything about the per-element workload of the consuming stream either, you shouldn’t be concerned about this. If you can smoothly support splitting down to single elements, just do that.
¹ It doesn’t have special optimizations for the case that no operations have been chained, though.

Unless I am missing the obvious, you could always pass a bufferSize in the constructor and use that for your trySplit:
#Override
public Spliterator<Float> trySplit() {
if( count > 1 ) {
splits++;
if(count > bufferSize) {
count = count - bufferSize;
return new TestingSpliterator( bufferSize, bufferSize);
}
}
return null;
}
And with this:
TestingSpliterator splits = new TestingSpliterator(12, 5);
Stream<Float> test = StreamSupport.stream(splits, true);
test.map(x -> new AbstractMap.SimpleEntry<>(
x.doubleValue(),
Thread.currentThread().getName()))
.collect(Collectors.groupingBy(
Map.Entry::getValue,
Collectors.mapping(
Map.Entry::getKey,
Collectors.toList())))
.forEach((x, y) -> System.out.println("Thread : " + x + " processed : " + y));
You would see that there are 3 threads. Two of them process 5 elements and one 2.

RxJava: onBackpressureBlock() strange behavior

I am playing around with RxJava (RxKotlin to be precise). Here I have the following Observables:
fun metronome(ms: Int) = observable<Int> {
var i = 0;
while (true) {
if (ms > 0) {
Thread.sleep(ms.toLong())
}
if (it.isUnsubscribed()) {
break
}
it.onNext(++i)
}
}
And I'd like to have a few of them merged and running concurrently. They ignore backpressure so the backpressure operators have to be applied to them.
Then I create
val cores = Runtime.getRuntime().availableProcessors()
val threads = Executors.newFixedThreadPool(cores)
val scheduler = Schedulers.from(threads)
And then I merge the metronomes:
val o = Observable.merge(listOf(metronome(0),
metronome(1000).map { "---------" })
.map { it.onBackpressureBlock().subscribeOn(scheduler) })
.take(5000, TimeUnit.MILLISECONDS)
The first one is supposed to emit items incessantly.
If I do so in the last 3 seconds of the run I get the following output:
...
[RxComputationThreadPool-5]: 369255
[RxComputationThreadPool-5]: 369256
[RxComputationThreadPool-5]: 369257
[RxComputationThreadPool-5]: ---------
[RxComputationThreadPool-5]: ---------
[RxComputationThreadPool-5]: ---------
Seems that the Observables are subscribed on the same one thread, and the first observable is blocked for 3+ seconds.
But when I swap onBackpressureBlock() and subscribeOn(scheduler) calls the output becomes what I expected, the output gets merged during the whole execution.
It's obvious to me that calls order matters in RxJava, but I don't quite understand what happens in this particular situation.
So what happens when onBackpressureBlock operator is applied before subscribeOn and what if after?

The onBackpressureBlock operator is a failed experiment; it requires care where to apply. For example, subscribeOn().onBackpressureBlock() works but not the other way around.
RxJava has non-blocking periodic timer called interval so you don't need to roll your own.

Best method for parallel log aggregation

My program needs to analyze a bunch of log files daily which are generated on a hourly basis from each application server.
So if I have 2 app servers I will be processing 48 files (24 files * 2 app servers).
file sizes range 100-300 mb. Each line in every file is a log entry which is of the format
[identifier]-[number of pieces]-[piece]-[part of log]
for example
xxx-3-1-ABC
xxx-3-2-ABC
xxx-3-3-ABC
These can be distributed over the 48 files which I mentioned, I need to merge these logs like so
xxx-PAIR-ABCABCABC
My implementation uses a thread pool to read through files in parallel and then aggregate them using a ConcurrentHashMap
I define a class LogEvent.scala
class LogEvent (val id: String, val total: Int, var piece: Int, val json: String) {
var additions: Long = 0
val pieces = new Array[String](total)
addPiece(json)
private def addPiece (json: String): Unit = {
pieces(piece) = json
additions += 1
}
def isDone: Boolean = {
additions == total
}
def add (slot: Int, json: String): Unit = {
piece = slot
addPiece(json)
}
The main processing happens over multiple threads and the code is something on the lines of
//For each file
val logEventMap = new ConcurrentHashMap[String, LogEvent]().asScala
Future {
Source.fromInputStream(gis(file)).getLines().foreach {
line =>
//Extract the id part of the line
val idPart: String = IDPartExtractor(line)
//Split line on '-'
val split: Array[String] = idPart.split("-")
val id: String = split(0) + "-" + split(1)
val logpart: String = JsonPartExtractor(line)
val total = split(2) toInt
val piece = split(3) toInt
def slot: Int = {
piece match {
case x if x - 1 < 0 => 0
case _ => piece - 1
}
}
def writeLogEvent (logEvent: LogEvent): Unit = {
if (logEvent.isDone) {
//write to buffer
val toWrite = id + "-PAIR-" + logEvent.pieces.mkString("")
logEventMap.remove(logEvent.id)
writer.writeLine(toWrite)
}
}
//The LOCK
appendLock {
if (!logEventMap.contains(id)) {
val logEvent = new LogEvent(id, total, slot, jsonPart)
logEventMap.put(id, logEvent)
//writeLogEventToFile()
}
else {
val logEvent = logEventMap.get(id).get
logEvent.add(slot, jsonPart)
writeLogEvent(logEvent)
}
}
}
}
The main thread blocks till all the futures complete
Using this approach I have been able to cut the processing time from an hour+ to around 7-8 minutes.
My questions are as follows -
Can this be done in a better way, I am reading multiple files using different threads and I need to lock at the block where the aggregation happens, are there better ways of doing this?
The Map grows very fast in memory, any suggestions for off heap storage for such a use case
Any other feedback.
Thanks

A common way to do this is to sort each file and then merge the sorted files. The result is a single file that has the individual items in the order that you want them. Your program then just needs to do a single pass through the file, combining adjacent matching items.
This has some very attractive benefits:
The sort/merge is done by standard tools that you don't have to write
Your aggregator program is very simple. Or, there might even be a standard tool that will do it.
Memory requirements are lessened. The sort/merge programs know how to manage memory, and your aggregation program's memory requirements are minimal.
There are, of course some drawbacks. You'll use more disk space and the process will be somewhat slower due to the I/O cost.
When I'm faced with something like this, I almost always go with using the standard tools and a simple aggregator program. The increased performance I get from a custom program just doesn't justify the time it takes to develop the thing.

For this sort of thing, if you can, use Splunk, if not, copy what it does which is index the log files for aggregation on demand at a later point.
For off heap storage, look at distributed caches - Hazelcast or Coherence. Both support provide java.util.Map implementations that are stored over multiple JVMs.

What could cause a java process to get gradually decreasing share of CPU?

I have a very simple java program that prints out 1 million random numbers. In linux, I observed the %CPU that this program takes during its lifespan, it starts off at 98% then gradually decreases to 2%, thus causing the program to be very slow. What are some of the factors that might cause the program to gradually get less CPU time?
I've tried running it with nice -20 but I still see the same results.
EDIT: running the program with /usr/bin/time -v I'm seeing an unusual amount of involuntary context switches (588 voluntary vs 16478 involuntary), which suggests that the OS is letting some other higher priority process run.

It boils down to two things:
I/O is expensive, and
Depending on how you're storing the numbers as you go along, that can have an adverse effect on performance as well.
If you're mainly doing System.out.println(randInt) in a loop a million times, then that can get expensive. I/O isn't one of those things that comes for free, and writing to any output stream costs resources.

I would start by profiling via JConsole or VisualVM to see what it's actually doing when it has low CPU %. As mentioned in comments there's a high chance it's blocking, e.g. waiting for IO (user input, SQL query taking a long time, etc.)

If your application is I/O bound - for example waiting for responses from network calls, or disk read/write

If you want to try and balance everything, you should create a queue to hold numbers to print, then have one thread generate them (the producer) and the other read and print them (the consumer). This can easily be done with a LinkedBlockingQueue.
public class PrintQueueExample {
private BlockingQueue<Integer> printQueue = new LinkedBlockingQueue<Integer>();
public static void main(String[] args) throws InterruptedException {
PrinterThread thread = new PrinterThread();
thread.start();
for (int i = 0; i < 1000000; i++) {
int toPrint = ...(i) ;
printQueue.put(Integer.valueOf(toPrint));
}
thread.interrupt();
thread.join();
System.out.println("Complete");
}
private static class PrinterThread extends Thread {
#Override
public void run() {
try {
while (true) {
Integer toPrint = printQueue.take();
System.out.println(toPrint);
}
} catch (InterruptedException e) {
// Interruption comes from main, means processing numbers has stopped
// Finish remaining numbers and stop thread
List<Integer> remainingNumbers = new ArrayList<Integer>();
printQueue.drainTo(remainingNumbers);
for (Integer toPrint : remainingNumbers)
System.out.println(toPrint);
}
}
}
}
There may be a few problems with this code, but this is the gist of it.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.