Lettuce StatefulRedisConnection async command execution order - java

I'm a bit confused about the order of Redis command execution when using the Lettuce driver.
Examples use code like this:
private val cacheConnection: StatefulRedisConnection<String, String>
// (...)
cacheConnection.async().getset(keyStr, json)
cacheConnection.async().expire(keyStr, expireAfterWrite)
https://github.com/lettuce-io/lettuce-core/issues/1627
https://www.baeldung.com/java-redis-lettuce
However, the documentation states
good example is the async API. Every invocation on the async API returns a Future (response handle) after the command is written to the netty pipeline. A write to the pipeline does not mean, the command is written to the underlying transport. Multiple commands can be written without awaiting the response. Invocations to the API (sync, async and starting with 4.0 also reactive API) can be performed by multiple threads.
(https://github.com/lettuce-io/lettuce-core/wiki/Pipelining-and-command-flushing)
This does not specify when the commands are put in the pipeline. Shouldn't I use thenAccept instead?
cacheConnection.async().getset(keyStr, json)
    .thenAccept { cacheConnection.async().expire(keyStr, expireAfterWrite) }
That would mean that all these examples are wrong, which is... improbable?
Can you please explain how this works? Is execution order preservation just a systematic coincidence (i.e. an implementation detail)?
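For concreteness, the chained variant I have in mind would look roughly like this (shown in Java; a sketch only, where thenCompose issues EXPIRE only after the GETSET response has arrived):

RedisAsyncCommands<String, String> commands = cacheConnection.async();
commands.getset(keyStr, json)
    .thenCompose(previous -> commands.expire(keyStr, expireAfterWrite));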

Related

Sink in Flink blocks the task execution

I have a Sink in Flink, which extends from RichSinkFunction.
It delays the execution of the whole Flink job (if I remove it, the runtime drops by half, from 10 minutes to less than 5). This is its configuration:
OutputTag<List<SessionSinkModel>> inProgressSessionOutputTag =
    new OutputTag<>(ProcessorConstants.IN_PROGRESS_SESSIONS_SINK_NAME) {};

SingleOutputStreamOperator<SessionAccumulator> aggregatedSessionStream =
    collectionMessageDataStream
        .keyBy(CollectionMessage::getSessionId)
        .process(sessionKeyedProcessFunction)
        .uid("SessionWindow")
        .name("Session Window")
        .setParallelism(4);

DataStream<List<SessionSinkModel>> inProgressSessionStream =
    aggregatedSessionStream.getSideOutput(inProgressSessionOutputTag);

inProgressSessionStream
    .broadcast()
    .addSink(new SessionAPISink(config))
    .uid("Sessions side output")
    .name("Sessions side output");
This Sink sends a large amount of data by POST to an endpoint; the POST call is asynchronous (as far as I know, like the Sink call itself). I emit to the side output in the standard way, using the KeyedBroadcastProcessFunction.ReadOnlyContext ctx:
ctx.output(outputTag, message);
How can I make this Sink not block the task execution?
There are two issues that I see with the workflow...
You shouldn't be doing an inProgressSessionStream.broadcast()
For efficient async IO, you want to use Flink's AsyncIO support, and then follow that with a DiscardingSink.
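For illustration, a rough sketch of that combination (it assumes a Java 11 HttpClient for the POST, a hypothetical toJson(...) serializer, and an assumed endpoint URL; the open(Configuration) signature and AsyncIO parameters depend on your Flink version):

// imports omitted: org.apache.flink.streaming.api.datastream.AsyncDataStream,
// org.apache.flink.streaming.api.functions.async.*, org.apache.flink.streaming.api.functions.sink.DiscardingSink,
// org.apache.flink.configuration.Configuration, java.net.http.*, java.net.URI, java.util.*, java.util.concurrent.TimeUnit

class SessionApiAsyncFunction extends RichAsyncFunction<List<SessionSinkModel>, Integer> {
    private transient HttpClient httpClient;

    @Override
    public void open(Configuration parameters) {
        httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(List<SessionSinkModel> sessions, ResultFuture<Integer> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/sessions")) // assumed endpoint
            .POST(HttpRequest.BodyPublishers.ofString(toJson(sessions)))                         // hypothetical serializer
            .build();
        httpClient.sendAsync(request, HttpResponse.BodyHandlers.discarding())
            .whenComplete((response, error) -> {
                if (error != null) {
                    resultFuture.completeExceptionally(error);
                } else {
                    resultFuture.complete(Collections.singleton(response.statusCode()));
                }
            });
    }
}

// Replace broadcast() + addSink(new SessionAPISink(config)) with async I/O followed by a DiscardingSink:
AsyncDataStream
    .unorderedWait(inProgressSessionStream, new SessionApiAsyncFunction(), 30, TimeUnit.SECONDS, 100)
    .addSink(new DiscardingSink<>())
    .name("Sessions side output");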

Passing an Akka stream to an upstream service to populate

I need to call an upstream service (Azure Blob Service) to push data to an OutputStream, which I then need to turn around and push back to the client, through Akka. Without Akka (and just servlet code), I'd just get the ServletOutputStream and pass it to the Azure service's method.
The closest I could stumble upon, and clearly this is wrong, is something like this:
Source<ByteString, OutputStream> source = StreamConverters.asOutputStream().mapMaterializedValue(os -> {
blobClient.download(os);
return os;
});
ResponseEntity responseEntity = HttpEntities.create(ContentTypes.APPLICATION_OCTET_STREAM, preAuthData.getFileSize(), source);
sender().tell(new RequestResult(responseEntity, StatusCodes.OK), self());
The idea is that I'm calling an upstream service to get an OutputStream populated by calling
blobClient.download(os);
It seems like the lambda function gets called and returns, but then afterwards it fails because there's no data or something. As if I'm not supposed to have that lambda function do the work, but perhaps return some object that does the work? Not sure.
How does one do this?
The real issue here is that the Azure API is not designed for back-pressuring. There is no way for the output stream to signal back to Azure that it is not ready for more data. To put it another way: if Azure pushes data faster than you are able to consume it, there will have to be some ugly buffer overflow failure somewhere.
Accepting this fact, the next best thing we can do is:
Use Source.lazySource to only start downloading data when there is downstream demand (i.e. the source is being run and data is being requested).
Put the download call in some other thread so that it continues executing without blocking the source from being returned. One way to do this is with a Future (I'm not sure what Java best practices are, but it should work fine either way). Although it won't matter initially, you may need to choose an execution context other than system.dispatcher - it all depends on whether download is blocking or not.
I apologize in advance if this Java code is malformed - I use Akka with Scala, so this is all from looking at the Akka Java API and Java syntax reference.
ResponseEntity responseEntity = HttpEntities.create(
    ContentTypes.APPLICATION_OCTET_STREAM,
    preAuthData.getFileSize(),
    // Wait until there is downstream demand to initialize the source...
    Source.lazySource(() -> {
        // Pre-materialize the OutputStream before the source starts running
        Pair<OutputStream, Source<ByteString, NotUsed>> pair =
            StreamConverters.asOutputStream().preMaterialize(system);
        // Start writing into the download stream in a separate thread
        Futures.future(() -> { blobClient.download(pair.first()); return pair.first(); }, system.getDispatcher());
        // Return the source - it should start running since `lazySource` indicated demand
        return pair.second();
    })
);
sender().tell(new RequestResult(responseEntity, StatusCodes.OK), self());
The OutputStream in this case is the "materialized value" of the Source and it will only be created once the stream is run (or "materialized" into a running stream). Running it is out of your control since you hand the Source to Akka HTTP and that will later actually run your source.
.mapMaterializedValue(matval -> ...) is usually used to transform the materialized value, but since it is invoked as part of materialization, you can use it to perform side effects such as sending the matval in a message - just like you have figured out. There isn't necessarily anything wrong with that, even if it looks funky. It is important to understand that the stream will not complete its materialization and become running until that lambda completes. This means problems if download() is blocking rather than forking off work on a different thread and immediately returning.
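A minimal sketch of that side-effecting style (StreamReady is a hypothetical message class; the point is only that the lambda runs when the stream is materialized):

Source<ByteString, OutputStream> source = StreamConverters.asOutputStream()
    .mapMaterializedValue(os -> {
        self().tell(new StreamReady(os), self()); // hypothetical message carrying the materialized value
        return os;
    });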
There is however another solution: Source.preMaterialize(), it materializes the source and gives you a Pair of the materialized value and a new Source that can be used to consume the already started source:
Pair<OutputStream, Source<ByteString, NotUsed>> pair =
StreamConverters.asOutputStream().preMaterialize(system);
OutputStream os = pair.first();
Source<ByteString, NotUsed> source = pair.second();
Note that there are a few additional things to think about in your code. Most importantly, if the blobClient.download(os) call blocks until it is done and you call it from the actor, you must make sure that your actor does not starve the dispatcher and stop other actors in your application from executing (see the Akka docs: https://doc.akka.io/docs/akka/current/typed/dispatchers.html#blocking-needs-careful-management ).
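If download does block, one option is to run it on a dedicated dispatcher instead of the default one (a sketch building on the pair from the snippet above; "blocking-io-dispatcher" is an assumed dispatcher name that would need to be configured):

Futures.future(() -> {
    blobClient.download(pair.first());  // blocking call runs on the dedicated dispatcher
    pair.first().close();               // close the stream to complete the source when the download finishes
    return Done.getInstance();
}, system.dispatchers().lookup("blocking-io-dispatcher"));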

How do I use transactions in Spring Data Redis Reactive?

I'm trying to use ReactiveRedisOperations from spring-data-redis 2.1.8 to do transactions, for example:
WATCH mykey
val = GET mykey
val = val + 1
MULTI
SET mykey $val
EXEC
But I cannot seem to find a way to do this when browsing the docs or the ReactiveRedisOperations. Is this not available in the reactive client, or how can you achieve this?
TL;DR: There's no proper support for Redis Transactions using the Reactive API
The reason lies in the execution model: How Redis executes transactions and how the reactive API is supposed to work.
When using transactions, a connection enters a transactional state, then commands are queued and finally executed with EXEC. Executing queued commands with EXEC makes the execution of the individual commands conditional on the EXEC command.
Consider the following snippet (Lettuce code):
RedisReactiveCommands<String, String> commands = …;
commands.multi().then(commands.set("key", "value")).then(commands.exec());
This sequence shows command invocation in a somewhat linear fashion:
Issue MULTI
Once MULTI completes, issue a SET command
Once SET completes, call EXEC
The caveat is with SET: SET only completes after calling EXEC. So this means we have a forward reference to the exec command. We cannot listen to a command that is going to be executed in the future.
You could apply a workaround:
RedisReactiveCommands<String, String> commands = …;

Mono<TransactionResult> tx = commands.multi()
    .flatMap(ignore -> {
        commands.set("key", "value").doOnNext(…).subscribe();
        return commands.exec();
    });
The workaround would incorporate command subscription within your code (Attention: This is an anti-pattern in reactive programming). After calling exec(), you get the TransactionResult in return.
Please also note: Although you can retrieve results via Mono<TransactionResult>, the actual SET command also emits its result (see doOnNext(…)).
That being said, it allows us to circle back to the actual question: because these concepts do not work well together, there's no API for transactional use in Spring Data Redis.
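For completeness, applying that same (anti-pattern) workaround to the WATCH/GET/MULTI/SET/EXEC sequence from the question could look roughly like this with the Lettuce reactive API (a sketch; it assumes mykey exists and holds a number stored as a string):

RedisReactiveCommands<String, String> commands = …;

Mono<TransactionResult> tx = commands.watch("mykey")
    .then(commands.get("mykey"))
    .flatMap(val -> {
        String next = String.valueOf(Long.parseLong(val) + 1);
        return commands.multi().flatMap(ignore -> {
            commands.set("mykey", next).subscribe();   // completes only once EXEC runs
            return commands.exec();
        });
    });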

How to use CompletableFuture with AWS Glue job status?

I have a requirement where I need to get the status of an AWS Glue crawler, which is an async request, and fire certain events when the job completes. The catch here is that I do not want to use polling. On looking further, the AWS docs suggest using a CompletableFuture object to deal with async requests in AWS. But when I try that, I am not able to create the CompletableFuture object, as it gives me a type mismatch. I have this code:
GetCrawlerMetricsRequest metricsRequest =
new GetCrawlerMetricsRequest().withCrawlerNameList(Arrays.asList("myJavaCrawler"));
GetCrawlerMetricsResult jsonOb = awsglueClient.getCrawlerMetrics(metricsRequest);
CompletableFuture<GetCrawlerMetricsResult> futureResponse = CompletableFuture<GetCrawlerMetricsResult>awsglueClient.getCrawlerMetricsAsync(metricsRequest);
But the futureResponse object shows an error stating that FutureTask cannot be cast to CompletableFuture.
I am following the approach given here
I am not sure how I can make this work. Based on this futureResponse object, I can then use the .whenApply function to trigger the job I want to execute, such as pushing the above response into a Kafka queue. Any ideas?
It seems like you are using AWS SDK v1, while the doc you mentioned shows how to do it using v2 (which has 'Developer Preview' status, so it's not recommended for production). Here is a doc showing how to make async calls in v1.
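For example, with the v1 async client you can bridge the SDK's Future to a CompletableFuture through the AsyncHandler callback (a sketch; class and method names are from the v1 SDK's Glue async client):

AWSGlueAsync glueAsync = AWSGlueAsyncClientBuilder.defaultClient();

CompletableFuture<GetCrawlerMetricsResult> futureResponse = new CompletableFuture<>();
glueAsync.getCrawlerMetricsAsync(metricsRequest,
    new AsyncHandler<GetCrawlerMetricsRequest, GetCrawlerMetricsResult>() {
        @Override
        public void onSuccess(GetCrawlerMetricsRequest request, GetCrawlerMetricsResult result) {
            futureResponse.complete(result);
        }

        @Override
        public void onError(Exception e) {
            futureResponse.completeExceptionally(e);
        }
    });

// e.g. push the metrics to Kafka once the call completes
futureResponse.thenAccept(result -> { /* send result to Kafka */ });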
For your use case I would recommend another approach. Glue posts a few types of events, and one of them is "Crawler Succeeded". So you can create a CloudWatch rule to catch these events and trigger a Lambda which will make a call to start the appropriate job.
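A rough sketch of the Lambda side (SDK v1; GLUE_JOB_NAME is an assumed environment variable, and the CloudWatch rule itself would be configured separately to match Glue crawler state-change events):

public class CrawlerSucceededHandler implements RequestHandler<Map<String, Object>, Void> {
    private final AWSGlue glue = AWSGlueClientBuilder.defaultClient();

    @Override
    public Void handleRequest(Map<String, Object> event, Context context) {
        // The event payload carries the crawler name and state; start the follow-up job here.
        glue.startJobRun(new StartJobRunRequest().withJobName(System.getenv("GLUE_JOB_NAME")));
        return null;
    }
}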

How do I make an async call to Hive in Java?

I would like to execute a Hive query on the server in an asynchronous manner. The Hive query will likely take a long time to complete, so I would prefer not to block on the call. I am currently using Thrift to make a blocking call (blocks on client.execute()), but I have not seen an example of how to make a non-blocking call. Here is the blocking code:
TSocket transport = new TSocket("hive.example.com", 10000);
transport.setTimeout(999999999);

TBinaryProtocol protocol = new TBinaryProtocol(transport);
Client client = new ThriftHive.Client(protocol);
transport.open();

client.execute(hql); // Omitted HQL

List<String> rows;
while ((rows = client.fetchN(1000)) != null) {
    for (String row : rows) {
        // Do stuff with row
    }
}

transport.close();
The code above is missing try/catch blocks to keep it short.
Does anyone have any ideas how to do an async call? Can Hive/Thrift support it? Is there a better way?
Thanks!
AFAIK, at the time of writing Thrift does not generate asynchronous clients. The reason, as explained in this link (search the text for "asynchronous"), is that Thrift was designed for the data centre, where latency is assumed to be low.
Unfortunately, as you know, the latency experienced between call and result is not always caused by the network, but by the logic being performed! We have this problem calling into the Cassandra database from a Java application server where we want to limit the total number of threads.
Summary: for now all you can do is make sure you have sufficient resources to handle the required numbers of blocked concurrent threads and wait for a more efficient implementation.
It is now possible to make an asynchronous call in a Java Thrift client after this patch was put in:
https://issues.apache.org/jira/browse/THRIFT-768
Generate the async Java client using the new Thrift and initialize your client as follows:
TNonblockingTransport transport = new TNonblockingSocket("127.0.0.1", 9160);
TAsyncClientManager clientManager = new TAsyncClientManager();
TProtocolFactory protocolFactory = new TBinaryProtocol.Factory();
Hive.AsyncClient client = new Hive.AsyncClient(protocolFactory, clientManager, transport);
Now you can execute methods on this client as you would on a synchronous interface. The only change is that all methods take an additional parameter of a callback.
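For illustration only, a call with the callback might look roughly like this (the exact generated class names, such as execute_call, depend on the Thrift IDL and the Thrift version used to generate the client):

client.execute(hql, new AsyncMethodCallback<ThriftHive.AsyncClient.execute_call>() {
    public void onComplete(ThriftHive.AsyncClient.execute_call response) {
        // the query finished; fetch rows or signal completion here
    }

    public void onError(Exception exception) {
        // handle the failure
    }
});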
I know nothing about Hive, but as a last resort, you can use Java's concurrency library:
ExecutorService executorService = Executors.newSingleThreadExecutor();
Callable<SomeResult> c = new Callable<SomeResult>() {
    public SomeResult call() {
        // your Hive code here
        return null; // replace with the actual query result
    }
};
Future<SomeResult> result = executorService.submit(c);
// when you need the result, this will block
result.get();
Or, if you do not need to wait for the result, use Runnable instead of Callable.
After talking to the Hive mailing list, I learned that Hive does not support async calls using Thrift.
I don't know about Hive in particular, but any blocking call can be turned into an async call by spawning a new thread and using a callback. You could look at java.util.concurrent.FutureTask, which has been designed to allow easy handling of such asynchronous operations.
We fire off asynchronous calls to AWS Elastic MapReduce. AWS MapReduce can run Hadoop/Hive jobs on Amazon's cloud with a call to the AWS MapReduce web services.
You can also monitor the status of your jobs and grab the results off S3 once the job is completed.
Since the calls to the web services are asynchronous in nature, we never block our other operations. We continue to monitor the status of our jobs in a separate thread and grab the results when the job is complete.