Parallelizing deserialization step


There is the following pipeline:
item is produced (the producer is external to the pipeline);
item is deserialized (JSON to Java object);
item is processed;
At the moment it all happens synchronously in a single thread:
while(producer.next()) {
var item = gson.deserialize(producer.item());
processItem(item);
}
Or schematically:
PRODUCER -> DESERIALIZATION -> CONSUMER
(sync) (sync) (sync)
Since the deserialization step has no side effects, it could be parallelized to save some wall-clock time.
The overall code should look like the following:
var pipeline = new Pipeline<Item>();
pipeline.setProducer(producer);
pipeline.setDeserialization(gson::deserialize);
pipeline.setConsumer(item -> {
...
});
pipeline.run();
Or schematically:
-> DESERIALIZATION
-> DESERIALIZATION
-> DESERIALIZATION
PRODUCER -> ... -> CONSUMER
-> DESERIALIZATION
-> DESERIALIZATION
-> DESERIALIZATION
(sync) (parallel) (sync)
Important: deserialized items must be delivered to the consumer:
synchronously;
in the same order the original producer produces encoded items.
Q. Is there a standardized way to code such a pipeline?

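Try
while (producer.next()) {
// Read the raw item on the producer thread, before next() moves on
var raw = producer.item();
CompletableFuture.supplyAsync(() -> gson.deserialize(raw))
// thenAcceptAsync receives the result; thenRunAsync takes a Runnable and cannot
.thenAcceptAsync(item -> processItem(item));
}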
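The snippet above starts each decode asynchronously, but nothing in it ties consumption to the producer's order, which the question requires. One possible way to keep that order with CompletableFuture alone is to chain every consume step onto the previous one. The following is a minimal sketch under assumptions: the class name OrderedAsync and the parameters decode, consume and threads are hypothetical stand-ins for gson.deserialize, processItem and the pool size.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper, not part of any library.
final class OrderedAsync {
    // Decodes raw items on a pool, but runs the consume steps one at a time,
    // in the order the raw items were supplied.
    static <R, T> void run(Iterable<R> rawItems, Function<R, T> decode,
                           Consumer<T> consume, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // "tail" always represents the most recently scheduled consume step.
            CompletableFuture<Void> tail = CompletableFuture.completedFuture(null);
            for (R raw : rawItems) {
                // Decoding starts right away and may finish in any order.
                CompletableFuture<T> decoded =
                        CompletableFuture.supplyAsync(() -> decode.apply(raw), pool);
                // This consume step runs only after the previous consume step
                // and this item's decode have both completed.
                tail = tail.thenAcceptBoth(decoded, (ignored, item) -> consume.accept(item));
            }
            // Wait for the last consume step; throws CompletionException
            // if any decode or consume step failed.
            tail.join();
        } finally {
            pool.shutdown();
        }
    }
}
```

Consumption is sequential and ordered here, but not pinned to one thread: each consume step runs on whichever thread completes the later of its two inputs. The sketch also submits every item up front, so it assumes the input is small enough to hold in memory.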

One way you can achieve your pattern is to:
Construct a multi-threaded executor to process the decoding requests
Have a consumer queue; each time you submit an item to be decoded, also add the corresponding Future object to the consumer queue
Have a consumer thread sit waiting to take items off the queue [which therefore consumes them in the order they were posted], call the corresponding get() method [which waits for the item to be decoded]
So the 'consumer' would look like this:
BlockingQueue<Future<Item>> consumerQueue = new LinkedBlockingDeque<>();
Thread consumerThread = new Thread(() -> {
try {
while (true) {
Future<Item> item = consumerQueue.take();
try {
// Get the next decoded item that's ready
Item decodedItem = item.get();
// 'Consume' the item
...
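} catch (ExecutionException ex) {
// Decoding failed; ex.getCause() holds the original exception
}
}
} catch (InterruptedException irr) {
// Interrupted while waiting: restore the flag and let the thread exit
Thread.currentThread().interrupt();
}
});
consumerThread.start();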
Meanwhile, the 'producer' end, with its multi-threaded 'decoder', would look like this:
ExecutorService decoder = Executors.newFixedThreadPool(4);
while (producer.hasNext()) {
Item item = producer.next();
// Submit the decode job for asynchronous processing
Future<Item> p = decoder.submit(() -> {
item.decode();
}, item);
// Also queue this decode job for future consumption once complete
consumerQueue.add(p);
}
As a separate matter, I wonder if you will actually see much benefit in practice, since by insisting on consumption in the same order, you are inherently introducing a serial condition on the process. But technically, this is one way that you could achieve what you are after.
P.S. If you didn't want a separate consumer thread, then the same 'producer' thread could poll the queue for completed items and consume them inline.
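The single-threaded variant from the P.S. could look roughly like the sketch below. It is an illustration under assumptions, not a definitive implementation: InlineOrderedPipeline, decode, consume and maxInFlight are hypothetical names, and the maxInFlight bound is an addition that keeps the number of undecoded or unconsumed items limited.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper, not part of any library.
final class InlineOrderedPipeline {
    static <R, T> void run(Iterator<R> producer, Function<R, T> decode,
                           Consumer<T> consume, int threads, int maxInFlight) {
        ExecutorService decoder = Executors.newFixedThreadPool(threads);
        // FIFO of decode jobs; only the calling thread touches it.
        Queue<Future<T>> pending = new ArrayDeque<>();
        try {
            while (producer.hasNext()) {
                R raw = producer.next();
                Callable<T> job = () -> decode.apply(raw);
                pending.add(decoder.submit(job));
                // Consume, in order, whatever is already decoded at the head.
                // If the window is full, block on the head instead of reading more.
                while (!pending.isEmpty()
                        && (pending.peek().isDone() || pending.size() >= maxInFlight)) {
                    consume.accept(await(pending.remove()));
                }
            }
            // Producer exhausted: drain the remaining jobs in order.
            while (!pending.isEmpty()) {
                consume.accept(await(pending.remove()));
            }
        } finally {
            decoder.shutdown();
        }
    }

    // Waits for one decode job and converts checked exceptions to unchecked ones.
    private static <T> T await(Future<T> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Because consume runs on the calling thread, both the producer and the consumer stay synchronous and only the decode step is parallel.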

Related

Leverage PriorityBlockingQueue to build producer-consumer pattern in Java Reactor

In my project, a Spring scheduler periodically scans "TO BE DONE" tasks from the DB and then distributes them to a task consumer for subsequent handling. The current implementation uses a Reactor Sinks instance between producer and consumer.
Sinks.Many<Task> taskSink = Sinks.many().multicast().onBackpressureBuffer(1000, false);
Producer:
Flux<Date> dates = loadDates();
dates.filterWhen(...)
.concatMap(date -> taskManager.getTaskByDate(date))
.doOnNext(taskSink::tryEmitNext)
.subscribe();
Consumer:
taskProcessor.process(taskSink.asFlux())
.subscribeOn(Schedulers.boundedElastic())
.subscribe();
Using the Sink works fine in most cases. But when the system is under heavy load, the system maintainer would want to know:
How many tasks are still sitting in the Sink?
Is it possible to clear all tasks within the Sink?
Is it possible to prioritize tasks within the Sink?
Unfortunately, the Sink cannot fulfill all of the needs mentioned above.
So I created a wrapper class that includes a Map and a PriorityBlockingQueue. I referenced the implementation from this link: https://stackoverflow.com/a/71009712/19278017.
After that, I revised the original producer-consumer code as below:
Task queue:
MergingQueue<Task> taskQueue = new PriorityMergingQueue();
Producer:
Flux<Date> dates = loadDates();
dates.filterWhen(...)
.concatMap(date -> taskManager.getTaskByDate(date))
.doOnNext(taskQueue::enqueue)
.subscribe();
Consumer:
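taskProcessor.process(Flux.<Task>create(sink -> {
sink.onRequest(n -> {
try {
Task task;
// Emit up to n tasks, stopping early if the subscriber cancels
while (!sink.isCancelled() && n > 0) {
// Parentheses matter: assign first, then compare with null
if ((task = taskQueue.poll(1, TimeUnit.SECONDS)) != null) {
sink.next(task);
n--;
}
}
} catch (Exception e) {
....
}
});
}))
.subscribeOn(Schedulers.boundedElastic())
.subscribe();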
I have some questions:
Is calling .poll() in this code a problem? I came across a thread hang during longevity testing, and I'm not sure whether it's due to the poll() call.
Is there an alternative in Reactor that works like a PriorityBlockingQueue?

The goal of reactive programming is to avoid blocking operations. The timed poll(1, TimeUnit.SECONDS) call is a problem because it blocks the calling thread for up to the timeout while waiting for the next element.
There is however an alternative solution in Reactor: the unicast version of Sinks.Many allows using an arbitrary Queue for buffering, via Sinks.many().unicast().onBackpressureBuffer(Queue<T>). By using a PriorityQueue instantiated outside of the Sink, you can fulfill all three requirements.
Here is a short demo where I emit a Task every 100ms:
public record Task(int prio) {}
private static void log(Object message) {
System.out.println(LocalTime.now(ZoneOffset.UTC).truncatedTo(ChronoUnit.MILLIS) + ": " + message);
}
public void externalBufferDemo() throws InterruptedException {
Queue<Task> taskQueue = new PriorityQueue<>(Comparator.comparingInt(Task::prio).reversed());
Sinks.Many<Task> taskSink = Sinks.many().unicast().onBackpressureBuffer(taskQueue);
taskSink.asFlux()
.delayElements(Duration.ofMillis(100))
.subscribe(task -> log(task));
for (int i = 0; i < 10; i++) {
taskSink.tryEmitNext(new Task(i));
}
// Show amount of tasks sitting in the Sink:
log("Nr of tasks in sink: " + taskQueue.size());
// Clear all tasks in the sink after 350ms:
Thread.sleep(350);
taskQueue.clear();
log("Nr of tasks after clear: " + taskQueue.size());
Thread.sleep(1500);
}
Output:
09:41:11.347: Nr of tasks in sink: 9
09:41:11.450: Task[prio=0]
09:41:11.577: Task[prio=9]
09:41:11.687: Task[prio=8]
09:41:11.705: Nr of tasks after clear: 0
09:41:11.799: Task[prio=7]
Note that delayElements has an internal queue of size 1, which is why Task 0 was picked up before Task 1 was emitted, and why Task 7 was picked up after the clear.
If multicast is required, you can transform your flux using one of the many operators enabling multicasting.

How to check when all CompletableFuture are done?

I have a Stream<Item> which I'm mapping to CompletableFuture<ItemResult>.
What I'd like to do is to know when all the futures are completed.
One may suggest to:
collect all the futures into an array and use CompletableFuture.allOf(). This is somewhat problematic since there could be hundreds of thousands of items.
just continue with forEach(CompletableFuture::join). This is problematic too, as calling forEach with join blocks the stream, making the processing essentially serial rather than concurrent.
inject a poisoned item at the end of the stream. This could work, but it's not that elegant in my view.
check if the executor queue is empty. This is quite limiting because I might use more than one executor in the future. Also, the queue can be momentarily empty.
monitor the database instead and check the number of new items.
I feel like all the suggested solutions aren't good enough.
What is the appropriate way to monitor the futures?
Thanks
EDIT:
Another (vague) idea I had in mind is to use a counter and wait for it to go down to zero. But again, I would need to make sure it's not just momentarily zero.

Disclaimer: I'm not sure whether Phaser is the right tool here, and if yes, whether it's better to have one root with multiple children or to chain them like I'm proposing below, so feel free to correct me.
Here's one approach that uses Phaser.
A Phaser has a limited number of parties, so we need to create a new child Phaser if that limit is about to be reached:
private Phaser register(Phaser phaser) {
if (phaser.getRegisteredParties() < 65534) {
// warning: side-effect,
// conflicts with AtomicReference#updateAndGet recommendation,
// might not fit well if the Stream is parallel:
phaser.register();
return phaser;
} else {
return new Phaser(phaser, 1);
}
}
Register each CompletableFuture against that Phaser chain, and deregister once done:
private void register(CompletableFuture<?> future, AtomicReference<Phaser> phaser) {
Phaser registeredPhaser = phaser.updateAndGet(this::register);
future
.thenRun(registeredPhaser::arriveAndDeregister)
.exceptionally(e -> {
// log e?
registeredPhaser.arriveAndDeregister();
return null;
});
}
Wait for all futures to be finished:
private <T> void await(Stream<CompletableFuture<T>> futures) {
Phaser rootPhaser = new Phaser(1);
AtomicReference<Phaser> phaser = new AtomicReference<>(rootPhaser);
futures.forEach(future -> register(future, phaser));
rootPhaser.arriveAndAwaitAdvance();
rootPhaser.arriveAndDeregister();
}
Example:
ExecutorService executor = Executors.newFixedThreadPool(500);
// creating fake stream with 500,000 futures:
Stream<CompletableFuture<Integer>> stream = IntStream
.rangeClosed(1, 500_000)
.mapToObj(i -> CompletableFuture.supplyAsync(() -> {
try {
TimeUnit.MILLISECONDS.sleep(10);
if (i % 50_000 == 0) {
System.out.println(Thread.currentThread().getName() + ": " + i);
}
return i;
} catch (InterruptedException e) {
throw new IllegalStateException(e);
}
}, executor));
// usage:
await(stream);
System.out.println("Done");
Outputs:
pool-1-thread-348: 50000
pool-1-thread-395: 100000
pool-1-thread-333: 150000
pool-1-thread-30: 200000
pool-1-thread-120: 250000
pool-1-thread-10: 300000
pool-1-thread-241: 350000
pool-1-thread-340: 400000
pool-1-thread-283: 450000
pool-1-thread-176: 500000
Done
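The counter idea from the question's EDIT can borrow the same trick the Phaser code uses with new Phaser(1): start the count at one, and release that extra count only after the whole stream has been registered, so a momentary zero in the middle cannot be mistaken for completion. The following is a minimal sketch under that assumption; FutureCounter and its method names are hypothetical, and an instance is meant to be used for a single stream.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library. Use one instance per stream.
final class FutureCounter {
    // Starts at 1: the extra count belongs to the stream itself and is given
    // back only once every future has been registered.
    private final AtomicLong pending = new AtomicLong(1);
    private final CompletableFuture<Void> allDone = new CompletableFuture<>();

    private void release() {
        if (pending.decrementAndGet() == 0) {
            allDone.complete(null);
        }
    }

    // Registers every future of the stream and returns a future that completes
    // once all of them have finished, successfully or not.
    <T> CompletableFuture<Void> track(Stream<CompletableFuture<T>> futures) {
        futures.forEach(future -> {
            pending.incrementAndGet();
            // whenComplete fires on success and on failure alike
            future.whenComplete((result, error) -> release());
        });
        release(); // stream exhausted: give back the initial count
        return allDone;
    }
}
```

A caller would then write something like new FutureCounter().track(stream).join(). Unlike the Phaser version there is no party limit to work around, since the count is a plain long.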

How to create blocking backpressure with rxjava Flowables?

I have a function that returns a Flowable which continually reads from a database and emits the results.
public Flowable<String> scan() {
Flowable<String> flow = Flowable.create((FlowableOnSubscribe<String>) emitter -> {
Result result = new Result();
while (!result.hasData()) {
result = request.query(skip, limit);
result.getResult()
.getFeatures().forEach(feature -> emitter.onNext(feature));
}
}, BackpressureStrategy.BUFFER)
.subscribeOn(Schedulers.io());
return flow;
}
Then I have another object that can call this method.
myObj.scan()
.parallel()
.runOn(Schedulers.computation())
.map(feature -> {
//Heavy Computation
})
.sequential()
.blockingSubscribe(msg -> {
logger.debug("Successfully processed " + msg);
}, (e) -> {
logger.error("Failed to process features because of error with scan", e);
});
My heavy computation section could potentially take a very long time. So long in fact that there is a good chance that the database requests will load the whole database into memory before the consumer finishes the first couple entries.
I have read up on backpressure with rxjava but the only 4 options essentially make me drop data or replace it with the last.
Is there a way to make it so that when I call emitter.onNext(feature) the call blocks until there is more room in the Flowable?
I.e., I want to treat the Flowable as a blocking queue where push will sleep if the queue is past its capacity.
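The blocking-queue behaviour described above can be illustrated with plain java.util.concurrent, without RxJava. This is only a sketch of the desired semantics, not an RxJava solution: BoundedHandOff, the END marker and the 1 ms sleep standing in for the heavy computation are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo class, not part of any library.
final class BoundedHandOff {
    private static final Integer END = Integer.MIN_VALUE; // hypothetical end-of-data marker

    // Pushes 0..count-1 through a queue holding at most `capacity` items and
    // returns what the consumer received, in arrival order.
    static List<Integer> run(int count, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    queue.put(i); // sleeps here while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<Integer> received = new ArrayList<>();
        try {
            while (true) {
                Integer next = queue.take(); // sleeps here while the queue is empty
                if (next.equals(END)) {
                    break;
                }
                Thread.sleep(1); // stand-in for the heavy computation
                received.add(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

With a capacity of, say, 4, the producer can never be more than four items ahead of the consumer, which is the memory bound the question asks for.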

Call multiple synchronous tasks asynchronously using RxJava

I have async tasks represented by Futures executing in a separate thread pool that I want to join using RxJava. The "old" way of doing it using Java 5 constructs would be something like this (omitting collecting the results):
final Future<Response> future1 = wsClient.callAsync();
final Future<Response> future2 = wsClient.callAsync();
final Future<Response> future3 = wsClient.callAsync();
final Future<Response> future4 = wsClient.callAsync();
future1.get();
future2.get();
future3.get();
future4.get();
This would block my current thread until all futures are completed, but the calls would run in parallel and the whole operation would only take as long as the longest call.
I want to do the same using RxJava, but I'm a bit of a noob when it comes to modelling it correctly.
I've tried the following, and it seems to work:
Observable.from(Arrays.asList(1,2,3,4))
.flatMap(n -> Observable.from(wsClient.callAsync(), Schedulers.io()))
.toList()
.toBlocking()
.single();
The problem with this approach is that I introduce the Schedulers.io threadpool which causes unnecessary thread switching as I'm already blocking the current thread (using toBlocking()).
Is there any way I can model the Rx flow to execute the tasks in parallel, and block until all have completed?

You should use the zip function.
For example like this:
Observable.zip(
Observable.from(wsClient.callAsync(), Schedulers.io()),
Observable.from(wsClient.callAsync(), Schedulers.io()),
Observable.from(wsClient.callAsync(), Schedulers.io()),
Observable.from(wsClient.callAsync(), Schedulers.io()),
(response1, response2, response3, response4) -> {
// This is a zipping function...
// You'll end up here when you've got all responses
// Do what you want with them and return a combined result
// ...
return null; //combined result instead of null
})
.subscribe(combinedResult -> {
// Use the combined result
});
Observable.zip can also work with an Iterable, so you can collect your Observable.from(wsClient.callAsync(), Schedulers.io()) calls into one (containing four of those) and pass that instead.

Passing a Set of Objects between threads

The current project I am working on requires that I implement a way to efficiently pass a set of objects from one thread, that runs continuously, to the main thread. The current setup is something like the following.
I have a main thread which creates a new thread. This new thread operates continuously and calls a method based on a timer. This method fetches a group of messages from an online source and organizes them in a TreeSet.
This TreeSet then needs to be passed back to the main thread so that the messages it contains can be handled independent of the recurring timer.
For better reference my code looks like the following
// Called by the main thread on start.
void StartProcesses()
{
if(this.IsWindowing)
{
return;
}
this._windowTimer = Executors.newSingleThreadScheduledExecutor();
Runnable task = new Runnable() {
public void run() {
WindowCallback();
}
};
this.CancellationToken = false;
_windowTimer.scheduleAtFixedRate(task,
0, this.SQSWindow, TimeUnit.MILLISECONDS);
this.IsWindowing = true;
}
/////////////////////////////////////////////////////////////////////////////////
private void WindowCallback()
{
ArrayList<Message> messages = new ArrayList<Message>();
//TODO create Monitor
if((!CancellationToken))
{
try
{
//TODO fix epochWindowTime
long epochWindowTime = 0;
int numberOfMessages = 0;
Map<String, String> attributes;
// Setup the SQS client
AmazonSQS client = new AmazonSQSClient(new
ClasspathPropertiesFileCredentialsProvider());
client.setEndpoint(this.AWSSQSServiceUrl);
// get the NumberOfMessages to optimize how to
// Receive all of the messages from the queue
GetQueueAttributesRequest attributesRequest =
new GetQueueAttributesRequest();
attributesRequest.setQueueUrl(this.QueueUrl);
attributesRequest.withAttributeNames(
"ApproximateNumberOfMessages");
attributes = client.getQueueAttributes(attributesRequest).
getAttributes();
numberOfMessages = Integer.valueOf(attributes.get(
"ApproximateNumberOfMessages")).intValue();
// determine if we need to Receive messages from the Queue
if (numberOfMessages > 0)
{
if (numberOfMessages < 10)
{
// just do it inline it's less expensive than
//spinning threads
ReceiveTask(numberOfMessages);
}
else
{
//TODO Create a multithreading version for this
ReceiveTask(numberOfMessages);
}
}
if (!CancellationToken)
{
//TODO testing
_setLock.lock();
Iterator<Message> _setIter = _set.iterator();
//TODO
while(_setIter.hasNext())
{
Message temp = _setIter.next();
Long value = Long.valueOf(temp.getAttributes().
get("Timestamp"));
if(value.longValue() < epochWindowTime)
{
messages.add(temp);
_setIter.remove(); // remove via the iterator; _set.remove(temp) here can throw ConcurrentModificationException
}
}
_setLock.unlock();
// TODO deduplicate the messages
// TODO reorder the messages
// TODO raise new Event with the results
}
if ((!CancellationToken) && (messages.size() > 0))
{
if (messages.size() < 10)
{
Pair<Integer, Integer> range =
new Pair<Integer, Integer>(Integer.valueOf(0),
Integer.valueOf(messages.size()));
DeleteTask(messages, range);
}
else
{
//TODO Create a way to divide this work among
//several threads
Pair<Integer, Integer> range =
new Pair<Integer, Integer>(Integer.valueOf(0),
Integer.valueOf(messages.size()));
DeleteTask(messages, range);
}
}
}catch (AmazonServiceException ase){
ase.printStackTrace();
}catch (AmazonClientException ace) {
ace.printStackTrace();
}
}
}
As can be seen by some of the commenting, my current preferred way to handle this is by creating an event in the timer thread if there are messages. The main thread will then be listening for this event and handle it appropriately.
Presently I am unfamiliar with how Java handles events, or how to create/listen for them. I also do not know if it is possible to create events and have the information contained within them passed between threads.
Can someone please give me some advice/insight on whether or not my methods are possible? If so, where might I find some information on how to implement them as my current searching attempts are not proving fruitful.
If not, can I get some suggestions on how I would go about this, keeping in mind I would like to avoid having to manage sockets if at all possible.
EDIT 1:
The main thread will also be responsible for issuing commands based on the messages it receives, or issuing commands to get required information. For this reason the main thread cannot wait on receiving messages, and should handle them in an event based manner.

Producer-Consumer Pattern:
One thread (the producer) continuously puts objects (messages) into a queue.
Another thread (the consumer) reads and removes objects from the queue.
If your problem fits this pattern, try a BlockingQueue.
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/BlockingQueue.html
It is easy and effective.
If the queue is empty, the consumer is blocked, which means the thread waits (and so uses no CPU time) until the producer puts some objects. Otherwise the consumer keeps consuming objects.
Conversely, if the queue is full, the producer is blocked until the consumer removes some objects and makes room in the queue.
Here's an example:
(the queue must be the same object in both the producer and the consumer)
(Producer thread)
Message message = createMessage();
queue.put(message);
(Consumer thread)
Message message = queue.take();
handleMessage(message);
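Since the question's EDIT says the main thread cannot wait for messages, the consumer side could use the non-blocking drainTo instead of take(). The following is a minimal sketch under assumptions: WindowHandOff and its method names are hypothetical, and String stands in for the Message type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper, not part of any library.
final class WindowHandOff {
    // Unbounded queue shared by the timer thread and the main thread.
    private final BlockingQueue<TreeSet<String>> batches = new LinkedBlockingQueue<>();

    // Timer thread: hand over a finished batch. The caller should build a fresh
    // TreeSet for every window and not modify it after publishing.
    void publish(TreeSet<String> batch) {
        if (!batch.isEmpty()) {
            batches.offer(batch); // never blocks on an unbounded queue
        }
    }

    // Main thread: collect whatever batches are ready, without waiting.
    // Returns an empty list when nothing has arrived yet.
    List<TreeSet<String>> drainReady() {
        List<TreeSet<String>> ready = new ArrayList<>();
        batches.drainTo(ready);
        return ready;
    }
}
```

The timer callback would call publish(...) at the end of each window, and the main thread would call drainReady() between its other duties and handle each returned set.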
