Akka Streams - define timeout for mapAsync - java

I am using Akka Streams with an external service whose methods take a long time to run.
I don't want these methods to block, so I defined a flow that uses mapAsyncUnordered as follows:
Flow.of(SomeClass.class).mapAsyncUnordered(numThreads, someService::someMethod);
I want to define a timeout for the method running in the mapAsyncUnordered stage, so that if the method takes too long, it won't fill the mapAsyncUnordered queue and prevent other elements from passing.
The only option I saw is to define a separate actor and use its "ask" method, but as described in the documentation, if the timeout is reached, the whole stream is terminated with a failure. That is not good for me (even though the failure can be recovered), since I can't lose data.
https://doc.akka.io/docs/akka/current/stream/futures-interop.html
Is there another option?
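One possible direction, as a minimal sketch rather than a confirmed answer: wrap the CompletionStage returned by the service so that it completes with an "empty" result after a timeout instead of failing. This assumes Java 9 or later (for CompletableFuture.completeOnTimeout) and that someMethod returns a CompletionStage; TimeoutSupport and withTimeout are hypothetical names, not part of Akka.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not an Akka API.
final class TimeoutSupport {
    private TimeoutSupport() {}

    /**
     * Returns a stage that yields the original result wrapped in an Optional,
     * or Optional.empty() if the original stage has not completed in time.
     * The underlying call is NOT cancelled; only its result is ignored.
     */
    static <T> CompletionStage<Optional<T>> withTimeout(
            CompletionStage<T> stage, long timeout, TimeUnit unit) {
        return stage.toCompletableFuture()
                .thenApply(Optional::ofNullable)
                .completeOnTimeout(Optional.empty(), timeout, unit);
    }
}
```

In the flow this could be used as mapAsyncUnordered(numThreads, x -> TimeoutSupport.withTimeout(someService.someMethod(x), 5, TimeUnit.SECONDS)), followed by a stage that handles the empty results (for example by retrying or logging them), so a slow call frees its slot without failing the stream.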

Related

Is it a bad practice to use a BlockingObservable in this context?

I have a use case where I'm calling four separate downstream endpoints and they can all be called in parallel. After every call is completed, I return a container object from the lambda function, its only purpose being to contain the raw responses from the downstream calls on it. From there, the container object will be transformed into the required model for the consumer.
Here's the structure of the code, roughly speaking:
Observable.zip(o1, o2, o3, o4,
(resp1, resp2, resp3, resp4)
-> new RawResponseContainer(resp1, resp2, resp3, resp4)
).toBlocking().first();
Is there a better way to do this? I 100% need every observable to complete; otherwise, the transformation of the consumer model will be incomplete. While I suppose I could transform each individual response from each observable "on the fly", rather than waiting to transform every response at once, I still need every call to finish before the transformation's done.
I've read it's a bad practice to ever use toBlocking() when using rx (aside from for 'legacy' apps), so any help's appreciated.
This is not an answer, just a comment:
You are essentially asking a sequential vs. parallel processing question. What you are doing is sequential processing (by blocking); what is recommended is parallel. Which one is better depends entirely on the context. In your case you need all the responses, so even in the parallel model every call has to complete successfully. If even one fails, the entire processing is for naught. In parallel, every call is made, and if one fails, the other three go to waste. In sequential, the error would surface in the middle. If you can live with the latency that sequential processing brings, stay with it. Sequential implementations are, in general, less complicated.

How to make multiple calls of a @Transactional method run in a single transaction

I have a method
@Transactional
public void updateSharedStateByCommunity(List[]idList)
This method is called from the following REST API:
@RequestMapping(method = RequestMethod.POST)
public ret_type updateUser(param) {
// call updateSharedStateByCommunity
}
Now the ID lists are very large, around 200000 entries. When I try to process one, it takes a long time and a timeout error occurs on the client side.
So I want to split it into two calls with a list size of 100000 each.
The problem is that they are treated as 2 independent transactions.
NB: The 2 calls are just an example; the list can be divided into more parts if the number of IDs is larger.
I need the separate calls to run in a single transaction. If any one of the calls fails, all operations should be rolled back.
Also, on the client side we need to show a progress dialog, so I can't simply increase the timeout.
The most obvious direct answer to your question IMO is to slightly change the code:
@RequestMapping(method = RequestMethod.POST)
public ret_type updateUser(param) {
updateSharedStateByCommunityBlocks(resolveIds);
}
...
And in the service, introduce a new method (if you can't change the code of the service, provide an intermediate class with the following functionality and call it from the controller):
@Transactional
public void updateSharedStateByCommunityBlocks(List<String> resolveIds) {
List<List<String>> blocks = split(resolveIds, 100000); // 100000 - bulk size
for(List<String> block :blocks) {
updateSharedStateByCommunity(block);
}
}
If this method is in the same service, the @Transactional on the original updateSharedStateByCommunity won't do anything, so it will work. If you put this code into some other class, it will still work, since the default propagation level of Spring transactions is "Required".
So this addresses the strict requirement: you wanted a single transaction, and you've got it. Now all the code runs in the same transaction. Each call now runs with 100000 ids rather than all of them, and everything is synchronous :)
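The split helper used above is not shown in the answer. A minimal sketch of what it might look like, assuming it returns a list of blocks (List<List<T>>) rather than an array; the class name Blocks is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "split" call from the answer.
final class Blocks {
    private Blocks() {}

    /** Splits ids into consecutive blocks of at most blockSize elements. */
    static <T> List<List<T>> split(List<T> ids, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<List<T>> blocks = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += blockSize) {
            int end = Math.min(start + blockSize, ids.size());
            // Copy the sub-list view so each block is independent of the source list.
            blocks.add(new ArrayList<>(ids.subList(start, end)));
        }
        return blocks;
    }
}
```

With 200000 ids and a block size of 100000 this yields two blocks; the last block is simply shorter when the sizes don't divide evenly.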
However, this design is problematic for many different reasons.
It doesn't allow to track the progress (show it to the user) as you've stated by yourself in the last sentence of the question. REST is synchronous.
It assumes that network is reliable and waiting for 30 minutes is technically not a problem (leaving alone the UX and 'nervous' user that will have to wait :) )
In addition to that, the network equipment can force closing the connection (like load balancers with pre-configured request timeout).
That's why people suggest some kind of asynchronous flow.
You can still use an async flow: spawn the task and, after each block, update some shared state (in memory in the case of a single instance, or persistent, such as a database, in the case of a cluster).
So that the interaction with the client will change:
Client calls "updateUser" with 200000 ids
Service responds "immediately" with something like "I've got your request, here is a request Id, ping me once in a while to see what happens".
Service starts an async task and processes the data chunk by chunk in a single transaction
Client calls "get" method with that id and server reads the progress from the shared state.
Once ready, the "Get" methods will respond "done".
If something fails during the transaction execution, the rollback is done, and the process updates the database status with "failure".
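The interaction above could be sketched roughly as follows. This is an in-memory, single-instance illustration only: BulkJobTracker, Status and the state names are hypothetical, and the transaction handling is indicated by comments rather than implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical sketch of "submit, get a request id, poll for progress".
final class BulkJobTracker {
    /** Immutable progress snapshot returned to the polling client. */
    static final class Status {
        final int processedBlocks;
        final int totalBlocks;
        final String state; // "RUNNING", "DONE" or "FAILED"

        Status(int processedBlocks, int totalBlocks, String state) {
            this.processedBlocks = processedBlocks;
            this.totalBlocks = totalBlocks;
            this.state = state;
        }
    }

    private final Map<String, Status> statuses = new ConcurrentHashMap<>();
    private final Executor executor;

    BulkJobTracker(Executor executor) {
        this.executor = executor;
    }

    /** Starts the job and returns a request id without waiting for the work. */
    <T> String submit(List<List<T>> blocks, Consumer<List<T>> processBlock) {
        String requestId = UUID.randomUUID().toString();
        int total = blocks.size();
        statuses.put(requestId, new Status(0, total, "RUNNING"));
        executor.execute(() -> {
            int done = 0;
            try {
                // In a real service this whole loop would run inside one transaction.
                for (List<T> block : blocks) {
                    processBlock.accept(block);
                    done++;
                    statuses.put(requestId, new Status(done, total, "RUNNING"));
                }
                statuses.put(requestId, new Status(done, total, "DONE"));
            } catch (RuntimeException e) {
                // The transaction would roll back here; only the status survives.
                statuses.put(requestId, new Status(done, total, "FAILED"));
            }
        });
        return requestId;
    }

    /** What the "get" endpoint would return; null for an unknown id. */
    Status status(String requestId) {
        return statuses.get(requestId);
    }
}
```

In a cluster the map would have to be replaced by persistent shared state (a database table, for instance), as the answer notes.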
You can also use more modern technologies to notify the client (web sockets, for example), but that's somewhat out of scope for this question.
Another thing to consider here: from what I know, processing 200000 objects should take much less than 30 minutes; it's not that much for modern RDBMSs.
Of course, without knowing your use case it's hard to tell what happens there, but maybe you can optimize the flow itself (using bulk operations, reducing the number of requests to the db, caching and so forth).
My preferred approach in those scenarios is to make the call asynchronous (Spring Boot allows this using the @Async annotation), so the client doesn't wait for the HTTP response. The notification could be done via a WebSocket that pushes a progress message to the client after every X items processed.
Surely it will add more complexity to your application, but if you design the mechanism properly, you'll be able to reuse it for any other similar operation you may face in the future.
The @Transactional annotation accepts a timeout (although not all underlying implementations support it). I would argue against trying to split the IDs into two calls, and instead try to fix the timeout (after all, what you really want is a single, all-or-nothing transaction). You can also set timeouts for the whole application instead of on a per-method basis.
From a technical point of view, it can be done with the org.springframework.transaction.annotation.Propagation#NESTED propagation. The NESTED behavior makes nested Spring transactions use the same physical transaction but sets savepoints between nested invocations, so inner transactions may roll back independently of the outer transaction or let the rollback propagate. The limitation is that it only works with org.springframework.jdbc.datasource.DataSourceTransactionManager.
But for a really large dataset, the processing still takes time and keeps the client waiting, so from a solution point of view an async approach may be better, depending on your requirements.

Tracking Async Lambda execution on AWS

I am trying to build a process that invokes AWS lambda, which then utilizes AWS SNS to send messages that trigger more lambdas. Each such triggered lambdas write an output file to S3. The process is as depicted below -
My question is this - How can I know that all lambdas are done with writing files? I want to execute another process that collects all these files and does merging. I could think of two obvious ways -
Constantly monitor S3 until there are as many output files as SNS messages. Once the total count is reached, invoke the final merging lambda.
Use a db as a sync source: write counts for that particular job/session and keep monitoring it until the count reaches the number of SNS messages.
Both solutions require constant polling, which I would like to avoid. I want to do this in an event-driven manner. I was hoping Amazon SQS would come to my rescue with some sort of "empty queue lambda trigger", but SQS only supports triggering lambdas on new messages. Is there any known way to achieve this in an event-driven manner in AWS? Your suggestions/comments/answers are much appreciated.
I would propose a couple of options here:
Step Functions:
This is a managed service for state machines. It's great for co-ordinating workflows.
Atomic Counting:
If you know the number of things in advance, you could initialize an Atomic Counter in DynamoDB and then atomically decrement it as work completes. Use DynamoDB Streams to trigger Lambda invocation when the counter is mutated, and trigger your next phase (or end of work) when the counter hits zero. Note that whenever an application creates, updates, or deletes items in the table, DynamoDB Streams writes a stream record, so every mutation of the counter would trigger your Lambda.
Note that DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
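The countdown logic itself is small. The sketch below shows it in memory only, with an AtomicInteger standing in for the DynamoDB atomic counter and a plain Runnable standing in for the stream-triggered Lambda; it does not use the AWS SDK, and CompletionCounter is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicInteger;

// In-memory stand-in for the "atomic counter + trigger at zero" idea.
final class CompletionCounter {
    private final AtomicInteger remaining;
    private final Runnable onAllDone;

    CompletionCounter(int expectedWorkers, Runnable onAllDone) {
        if (expectedWorkers <= 0) {
            throw new IllegalArgumentException("expectedWorkers must be positive");
        }
        this.remaining = new AtomicInteger(expectedWorkers);
        this.onAllDone = onAllDone;
    }

    /**
     * Called once per finished worker. The atomic decrement guarantees that
     * exactly one caller observes zero and starts the next phase.
     */
    int workerFinished() {
        int left = remaining.decrementAndGet();
        if (left == 0) {
            onAllDone.run();
        }
        return left;
    }
}
```

In the DynamoDB version, the decrement would be an atomic update on the counter item, and the zero check would live in the Lambda that receives the stream record.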
AWS Step Functions (a managed state machine service) would be the obvious choice. AWS has some examples as starting points. I remember one being a looping state that you could probably apply to this use case.
Another idea off top of my head...
Create an "Orchestration Lambda" that has the list of your files...
Orchestration Lambda invokes a "File Writer Lambda" in a loop, passing the file info. The invokeAsync(InvokeRequest request) returns a Future object. Orchestration Lambda can check the future object state for completion.
Orchestration Lambda can make a similar call to the "File Writer Lambda" but instead use the more flexible method: invokeAsync(InvokeRequest request, AsyncHandler asyncHandler). You can make an inner class that implements this AsyncHandler and monitor the completion there in the Orchestration Lambda. It is a little cleaner than all the loops.
There are probably many ways to solve this problem, but here are two ideas.
Personally, I prefer the idea with "Step Functions".
But if you want to simplify your architecture, you could create a triggered lambda function. Choose 'S3 trigger' on the left side of the lambda function designer and configure it at the bottom.
Check out more - Using AWS Lambda with Amazon S3
But in this case you have to create a more sophisticated lambda function which checks that all the appropriate files have been uploaded to S3 and only then starts your merge.
The stated problem seems a suitable candidate for the Saga Pattern.
Basically, a Saga is any long-running, distributed process.
As mentioned earlier, the AWS platform allows using Step Functions to implement a Saga.

Background a task then end connection before task completion in Java (8)

I've spent a lot of time looking at this and there are a tonne of ways to background in Java (I'm specifically looking at Java 8 solutions, it should be noted).
Ok, so here is my (generic) situation - please note this is an example, so don't spend time over the way it works/what it's doing:
Someone requests something via an API call
The API retrieves some data from a datastore
However, I want to cache this aggregated response in some caching system
I need to call a cache API (via REST) to cache this response
I do not want to wait until this call is done before returning the response to the original API call
Some vague code structure:
@GET
// api definitions
public Response myAPIMethod(){
// get data from datastore
Object data = getData();
// submit request to cache data, without blocking
saveDataToCache();
// return the response to the Client
return Response.ok(data).build();
}
What is the "best" (optimal, safest, standard) way to run saveDataToCache in the background without having to wait before returning data? Note that this caching should not occur too often (maybe a couple of times a second).
I attempted this a couple of ways, specifically with CompletableFutures but when I put in some logging it seemed that it always waited before returning the response (I did not call get).
Basically the connection from the client might close, before that caching call has finished - but I want it to have finished :) I'm not sure if the rules are the same as this is during the lifetime of a client connection.
Thanks in advance for any advice, let me know if anything is unclear... I tried to define it in a way understandable to those without the domain knowledge of what I'm trying to do (which I cannot disclose).
You could consider adding the objects to cache into a BlockingQueue and have a separate thread taking from the queue and storing into cache.
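A minimal sketch of that idea, assuming the actual cache call is passed in as a Consumer (the real cache API is not shown in the question); BackgroundCacheWriter is a hypothetical name:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: one daemon thread drains a queue and writes to the cache.
final class BackgroundCacheWriter<T> {
    private final BlockingQueue<T> pending = new LinkedBlockingQueue<>();
    private final Thread worker;

    BackgroundCacheWriter(Consumer<T> saveToCache) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    T item = pending.take(); // blocks until something is queued
                    try {
                        saveToCache.accept(item);
                    } catch (RuntimeException e) {
                        // A failed cache write should not kill the worker thread.
                        System.err.println("cache write failed: " + e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit the loop on shutdown
            }
        }, "cache-writer");
        worker.setDaemon(true);
        worker.start();
    }

    /** Returns immediately; the request thread never waits for the cache. */
    void enqueue(T item) {
        pending.add(item);
    }

    void shutdown() {
        worker.interrupt();
    }
}
```

The API method would call enqueue(data) right before returning the response. Because the worker is a daemon thread, items still queued at JVM exit are dropped, which may or may not be acceptable for a cache.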
As per the comments, the cache API is already asynchronous (it actually returns a Future). I suppose it creates and manages an internal ExecutorService or receives one at startup.
My point is that there's no need to take care of the objects to cache, but of the returned Futures. Asynchronous behavior is actually provided by the cache client.
One option would be to just ignore the Future returned by this client. The problem with this approach is that you lose the chance to take corrective action if an error occurs when attempting to store the object in the cache. In fact, you would never know that something went wrong.
Another option would be to take care of the returned Future. One way is with a Queue, as suggested in another answer, though I'd use a ConcurrentLinkedQueue instead, since it's unbounded and you have mentioned that adding objects to the cache would happen as much as twice a second. You could offer() the Future to the queue as soon as the cache client returns it and then, in another thread, that would be running an infinite loop, you could poll() the queue for a Future and, if a non null value is returned, invoke isDone() on it. (If the queue returns null it means it's empty, so you might want to sleep for a few milliseconds).
If isDone() returns true, you can safely invoke get() on the future, surrounded by a try/catch block that catches any ExecutionException and handles it as you wish. (You could retry the operation on the cache, log what happened, etc).
If isDone() returns false, you could simply offer() the Future to the queue again.
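The polling scheme described above might look roughly like this. It is a sketch under assumptions: FutureWatcher is a hypothetical name, the error handler is supplied by the caller, and the background thread that calls checkOnce() in a loop (sleeping briefly when nothing is pending) is left out.

```java
import java.util.Queue;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch of the offer()/poll()/isDone() loop from the answer.
final class FutureWatcher {
    private final Queue<Future<?>> inFlight = new ConcurrentLinkedQueue<>();
    private final Consumer<Throwable> onError;

    FutureWatcher(Consumer<Throwable> onError) {
        this.onError = onError;
    }

    /** Called by the request thread right after the cache client returns. */
    void watch(Future<?> future) {
        inFlight.offer(future);
    }

    /** One pass over the futures queued so far; returns how many are still pending. */
    int checkOnce() {
        int toInspect = inFlight.size();
        for (int i = 0; i < toInspect; i++) {
            Future<?> future = inFlight.poll();
            if (future == null) {
                break; // queue drained
            }
            if (!future.isDone()) {
                inFlight.offer(future); // not finished yet, look again later
                continue;
            }
            try {
                future.get(); // does not block: the future is already done
            } catch (ExecutionException e) {
                onError.accept(e.getCause()); // retry, log, etc.
            } catch (CancellationException e) {
                onError.accept(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return inFlight.size();
    }
}
```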
Now, here we're talking about handling errors from asynchronous operations of a cache. I wouldn't do anything and let the future returned by the cache client go in peace. If something goes wrong, the worst thing that may happen is that you'd have to go to the datastore again to retrieve the object.

Best practices with Akka in Scala and third-party Java libraries

I need to use the memcached Java API in my Scala/Akka code. This API gives you both synchronous and asynchronous methods. The asynchronous ones return java.util.concurrent.Future. There is already a question here about dealing with Java Futures in Scala: How do I wrap a java.util.concurrent.Future in an Akka Future?. However, in my case I have two options:
Using the synchronous API, wrapping the blocking code in a Future and marking it as blocking:
Future {
blocking {
cache.get(key) //synchronous blocking call
}
}
Using the asynchronous Java API and polling the Java Future every n ms to check whether it has completed (as described in one of the answers to the linked question above).
Which one is better? I am leaning towards the first option because polling can dramatically impact response times. Shouldn't the blocking { } block prevent the whole pool from being blocked?
I always go with the first option, but I do it in a slightly different way. I don't use the blocking feature. (Actually, I have not thought about it yet.) Instead I provide a custom execution context to the Future that wraps the synchronous blocking call. So it looks basically like this:
val ecForBlockingMemcachedStuff = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100)) // whatever number you think is appropriate
// I create a separate ec for each blocking client/resource/api I use
Future {
cache.get(key) //synchronous blocking call
}(ecForBlockingMemcachedStuff) // or mark the execution context implicit. I like to mention it explicitly.
So all the blocking calls will use a dedicated execution context (= thread pool). It is kept separate from your main execution context, which is responsible for the non-blocking stuff.
This approach is also explained in an online training video for Play/Akka provided by Typesafe. There is a video in lesson 4 about how to handle blocking calls. It is explained by Nilanjan Raychaudhuri (I hope I spelled it correctly), who is a well-known author of Scala books.
Update: I had a discussion with Nilanjan on twitter. He explained what the difference between the approach with blocking and a custom ExecutionContext is. The blocking feature just creates a special ExecutionContext. It provides a naive approach to the question how many threads you will need. It spawns a new thread every time, when all the other existing threads in the pool are busy. So it is actually an uncontrolled ExecutionContext. It could create lots of threads and lead to problems like an out of memory error. So the solution with the custom execution context is actually better, because it makes this problem obvious. Nilanjan also added that you need to consider circuit breaking for the case this pool gets overloaded with requests.
TLDR: Yeah, blocking calls suck. Use a custom/dedicated ExecutionContext for blocking calls. Also consider circuit breaking.
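For readers on the Java side, the same idea (a dedicated, bounded pool for blocking calls) can be sketched with CompletableFuture. This is an analogue of the Scala snippet above, not a translation of any Akka API; BlockingCalls and the thread name are hypothetical, and the pool size is something you would tune yourself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch: run blocking calls on their own bounded thread pool.
final class BlockingCalls {
    private final ExecutorService blockingPool;

    BlockingCalls(int threads) {
        // Fixed size gives an upper bound on concurrently blocked threads.
        blockingPool = Executors.newFixedThreadPool(threads, runnable -> {
            Thread thread = new Thread(runnable, "blocking-cache");
            thread.setDaemon(true);
            return thread;
        });
    }

    /** Runs the blocking call on the dedicated pool, never on the common pool. */
    <T> CompletableFuture<T> run(Supplier<T> blockingCall) {
        return CompletableFuture.supplyAsync(blockingCall, blockingPool);
    }

    void shutdown() {
        blockingPool.shutdown();
    }
}
```

A call such as blockingCalls.run(() -> cache.get(key)) would then keep the blocking memcached call off the threads that serve non-blocking work. Note that a fixed pool still has an unbounded task queue, which is where the circuit breaking mentioned above comes in.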
The Akka documentation provides a few suggestions on how to deal with blocking calls:
In some cases it is unavoidable to do blocking operations, i.e. to put
a thread to sleep for an indeterminate time, waiting for an external
event to occur. Examples are legacy RDBMS drivers or messaging APIs,
and the underlying reason is typically that (network) I/O occurs under
the covers. When facing this, you may be tempted to just wrap the
blocking call inside a Future and work with that instead, but this
strategy is too simple: you are quite likely to find bottlenecks or
run out of memory or threads when the application runs under increased
load.
The non-exhaustive list of adequate solutions to the “blocking
problem” includes the following suggestions:
Do the blocking call within an actor (or a set of actors managed by a router), making sure to configure a thread pool which is either
dedicated for this purpose or sufficiently sized.
Do the blocking call within a Future, ensuring an upper bound on the number of such calls at any point in time (submitting an unbounded
number of tasks of this nature will exhaust your memory or thread
limits).
Do the blocking call within a Future, providing a thread pool with an upper limit on the number of threads which is appropriate for the
hardware on which the application runs.
Dedicate a single thread to manage a set of blocking resources (e.g. a NIO selector driving multiple channels) and dispatch events as they
occur as actor messages.
The first possibility is especially well-suited for resources which
are single-threaded in nature, like database handles which
traditionally can only execute one outstanding query at a time and use
internal synchronization to ensure this. A common pattern is to create
a router for N actors, each of which wraps a single DB connection and
handles queries as sent to the router. The number N must then be tuned
for maximum throughput, which will vary depending on which DBMS is
deployed on what hardware.
