passing an Akka stream to an upstream service to populate - java

I need to call an upstream service (Azure Blob Service) that pushes data into an OutputStream, and I then need to push that data back to the client through Akka. Without Akka (with plain servlet code), I'd just get the ServletOutputStream and pass it to the Azure service's method.
The closest I have managed to get, and clearly this is wrong, is something like this:
Source<ByteString, OutputStream> source = StreamConverters.asOutputStream().mapMaterializedValue(os -> {
    blobClient.download(os);
    return os;
});
ResponseEntity responseEntity = HttpEntities.create(ContentTypes.APPLICATION_OCTET_STREAM, preAuthData.getFileSize(), source);
sender().tell(new RequestResult(responseEntity, StatusCodes.OK), self());
The idea is that I'm calling an upstream service to get an OutputStream populated by calling
blobClient.download(os);
It seems like the lambda function gets called and returns, but afterwards it fails, because there's no data or something. As if I'm not supposed to have that lambda function do the work, but perhaps return some object that does the work? Not sure.
How does one do this?

The real issue here is that the Azure API is not designed for back-pressuring. There is no way for the output stream to signal back to Azure that it is not ready for more data. To put it another way: if Azure pushes data faster than you are able to consume it, there will have to be some ugly buffer overflow failure somewhere.
Accepting this fact, the next best thing we can do is:
Use Source.lazySource to only start downloading data when there is downstream demand (i.e. the source is being run and data is being requested).
Put the download call on some other thread so that it continues executing without blocking the source from being returned. One way to do this is with a Future (I'm not sure what the Java best practices are, but it should work fine either way). Although it won't matter initially, you may need to choose an execution context other than system.dispatcher; it all depends on whether download is blocking or not.
I apologize in advance if this Java code is malformed - I use Akka with Scala, so this is all from looking at the Akka Java API and Java syntax reference.
ResponseEntity responseEntity = HttpEntities.create(
    ContentTypes.APPLICATION_OCTET_STREAM,
    preAuthData.getFileSize(),
    // Wait until there is downstream demand to initialize the source...
    Source.lazySource(() -> {
        // Pre-materialize the output stream before the source starts running
        Pair<OutputStream, Source<ByteString, NotUsed>> pair =
            StreamConverters.asOutputStream().preMaterialize(system);
        // Start writing into the download stream in a separate thread
        Futures.future(() -> { blobClient.download(pair.first()); return pair.first(); }, system.getDispatcher());
        // Return the source - it should start running since `lazySource` indicated demand
        return pair.second();
    })
);
sender().tell(new RequestResult(responseEntity, StatusCodes.OK), self());
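The "download on another thread while the consumer keeps reading" idea can be illustrated without Akka. The following is only a JDK-only analogue, not the Akka solution itself: piped streams play the role of the pre-materialized OutputStream/Source pair, and `BlockingDownload` is a hypothetical stand-in for a blocking call such as `blobClient.download(os)`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BackgroundDownloadSketch {

    /** Hypothetical stand-in for a blocking call such as blobClient.download(os). */
    interface BlockingDownload {
        void writeTo(OutputStream os) throws IOException;
    }

    /** Runs the download on its own thread and drains the bytes on the calling thread. */
    static byte[] drain(BlockingDownload download) {
        // Dedicated thread for the blocking producer (an assumption of this sketch).
        ExecutorService blockingPool = Executors.newSingleThreadExecutor();
        try (PipedInputStream in = new PipedInputStream(8192)) {
            PipedOutputStream out = new PipedOutputStream(in);

            // Producer: blocks inside writeTo, but on blockingPool, not on the caller.
            CompletableFuture<Void> producer = CompletableFuture.runAsync(() -> {
                try (OutputStream os = out) { // closing signals end-of-stream to the reader
                    download.writeTo(os);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }, blockingPool);

            // Consumer: reads chunks as they arrive.
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                collected.write(buffer, 0, read);
            }
            producer.join(); // surface any failure from the producer thread
            return collected.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            blockingPool.shutdown();
        }
    }

    public static void main(String[] args) {
        byte[] bytes = drain(os -> os.write("hello blob".getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Because the pipe's buffer is bounded, the producer thread simply blocks while the reader is slow, which is one more reason to keep the download off any shared dispatcher thread.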

The OutputStream in this case is the "materialized value" of the Source and it will only be created once the stream is run (or "materialized" into a running stream). Running it is out of your control since you hand the Source to Akka HTTP and that will later actually run your source.
.mapMaterializedValue(matval -> ...) is usually used to transform the materialized value, but since it is invoked as part of materialization you can also use it for side effects such as sending the matval in a message, just as you have figured out. There isn't necessarily anything wrong with that, even if it looks funky. It is important to understand that the stream will not complete its materialization and start running until that lambda completes. This causes problems if download() blocks rather than forking off the work to a different thread and returning immediately.
There is, however, another solution: Source.preMaterialize(). It materializes the source and gives you a Pair of the materialized value and a new Source that can be used to consume the already started source:
Pair<OutputStream, Source<ByteString, NotUsed>> pair =
StreamConverters.asOutputStream().preMaterialize(system);
OutputStream os = pair.first();
Source<ByteString, NotUsed> source = pair.second();
Note that there are a few additional things to think about in your code. Most importantly, if the blobClient.download(os) call blocks until it is done and you call it from the actor, you must make sure that your actor does not starve the dispatcher and stop other actors in your application from executing (see the Akka docs: https://doc.akka.io/docs/akka/current/typed/dispatchers.html#blocking-needs-careful-management ).

Related

Lettuce StatefulRedisConnection async command execution order

I'm a bit confused about the order of Redis command execution when using the Lettuce driver.
Examples use code like
private val cacheConnection: StatefulRedisConnection<String, String>
// (...)
cacheConnection.async().getset(keyStr, json)
cacheConnection.async().expire(keyStr, expireAfterWrite)
https://github.com/lettuce-io/lettuce-core/issues/1627
https://www.baeldung.com/java-redis-lettuce
However, the documentation states
good example is the async API. Every invocation on the async API returns a Future (response handle) after the command is written to the netty pipeline. A write to the pipeline does not mean, the command is written to the underlying transport. Multiple commands can be written without awaiting the response. Invocations to the API (sync, async and starting with 4.0 also reactive API) can be performed by multiple threads.
(https://github.com/lettuce-io/lettuce-core/wiki/Pipelining-and-command-flushing)
This does not specify when the commands are put in the pipeline. Shouldn't I use thenAccept instead?
cacheConnection.async().getset(keyStr, json)
.thenAccept { expire(keyStr, expireAfterWrite) }
That would mean that all these examples are wrong which is... improbable?
Can you please explain how it works? Is execution order preservation just a systematic coincidence (i.e. an implementation detail)?

Spark Stream new Job after stream start

I have a situation where I am trying to stream using spark streaming from kafka. The stream is a direct stream. I am able to create a stream and then start streaming, also able to get any updates (if any) on kafka via the streaming.
The issue comes in when I get a new request to stream a new topic. Since there can be only one Spark StreamingContext per JVM, I cannot create a new stream for every new request.
The approaches I have come up with are:
First, once a DStream is created and Spark streaming is already in progress, just attach a new stream to it. This does not seem to work: the createDStream call (for a new topic2) does not return a stream and further processing stops. The streaming continues for the first request (say topic1).
Second, I thought to stop the stream, create the DStream and then start streaming again. I cannot use the same streaming context (it throws an exception saying that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old stream's topic (topic1) is lost and it streams only the new one.
Here is the code, have a look
JavaStreamingContext javaStreamingContext;
if (null == javaStreamingContext) {
    javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
    StreamingContextState streamingContextState = javaStreamingContext.getState();
    if (streamingContextState == StreamingContextState.STOPPED) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    }
}
Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());
KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
    .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
    .foreachRDD(impl);
if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl; it is a custom class which is the implementation of VoidFunction.
The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; control just terminates without an error. The subsequent steps are not executed.
I have tried many things and read multiple articles, and I believe this is a very common production-level issue: streaming has to be done on multiple different topics, and each of them is handled differently.
Please help
The thing is that the Spark master sends the code out to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I can think of:
Spark Job Server: Every time you want to subscribe to or stream from a different topic, start a new job instead of touching the already running one. You can supply the parameters or the topic name in your API request body. If you want to stop streaming from a specific topic, just stop the respective job. This gives you a lot of flexibility and control over resources.
[Theoretical] Topic filter: Subscribe to all the topics you think you will want, and when records are pulled for a batch duration, filter them against a list of topics. Manipulate this list through an API to widen or narrow your scope of topics; it could be a broadcast variable as well. This is just an idea; I have not tried this option at all.
Another workaround is to relay your topic2 data to topic1 using a microservice whenever you need it, and stop the relay when you don't.
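The topic-filter option is untested by its author, so the following is only a plain-Java sketch of the filtering logic, with no Spark involved. `Record` is a hypothetical stand-in for a Kafka ConsumerRecord, and `enable`/`disable` represent whatever API would adjust the list. In a real Spark job the set of active topics would have to be re-read on the driver for every batch (or shipped as a broadcast variable, as suggested above), because closures are serialized to the executors.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

final class TopicFilterSketch {

    /** Hypothetical minimal record: only the topic and the value. */
    static final class Record {
        final String topic;
        final String value;

        Record(String topic, String value) {
            this.topic = topic;
            this.value = value;
        }
    }

    // Thread-safe so that an API thread can change it while batches are processed.
    private final Set<String> activeTopics = ConcurrentHashMap.newKeySet();

    void enable(String topic) {
        activeTopics.add(topic);
    }

    void disable(String topic) {
        activeTopics.remove(topic);
    }

    /** Keeps only the values whose topic is active at the start of the batch. */
    List<String> valuesForActiveTopics(List<Record> batch) {
        Set<String> snapshot = new HashSet<>(activeTopics); // one consistent view per batch
        return batch.stream()
            .filter(record -> snapshot.contains(record.topic))
            .map(record -> record.value)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TopicFilterSketch filter = new TopicFilterSketch();
        List<Record> batch = Arrays.asList(new Record("topic1", "a"), new Record("topic2", "b"));

        filter.enable("topic1");
        System.out.println(filter.valuesForActiveTopics(batch));

        filter.enable("topic2"); // a "new request" widens the scope without restarting anything
        System.out.println(filter.valuesForActiveTopics(batch));
    }
}
```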

How to use CompletableFuture with AWS Glue job status?

I have a requirement where I need to get the status of an AWS Glue crawler, which is an async request, and fire certain events based on when the jobs complete. The catch here is that I do not want to use polling. On looking further, the AWS docs suggest using a CompletableFuture object to deal with async requests in AWS. But when I try to use it, I am not able to form a CompletableFuture object, as it gives me a type mismatch. I have this code:
GetCrawlerMetricsRequest metricsRequest =
new GetCrawlerMetricsRequest().withCrawlerNameList(Arrays.asList("myJavaCrawler"));
GetCrawlerMetricsResult jsonOb = awsglueClient.getCrawlerMetrics(metricsRequest);
CompletableFuture<GetCrawlerMetricsResult> futureResponse = (CompletableFuture<GetCrawlerMetricsResult>) awsglueClient.getCrawlerMetricsAsync(metricsRequest);
But the futureResponse line fails with an error stating that FutureTask cannot be cast to CompletableFuture.
I am following the approach given here
I am not sure how I can make this work. Based on this futureResponse object, I could then use the .whenApply function to trigger the job I want to execute, such as pushing the above response into a Kafka queue. Any ideas?
It seems like you are using AWS SDK v1, while the doc you mentioned shows how to do it using v2 (which has 'Developer Preview' status, so it's not recommended for production). Here is a doc showing how to make async calls in v1.
For your use case I would recommend another approach. Glue posts a few types of events, and one of them is "Crawler Succeeded". So you can create a CloudWatch rule to catch these events and trigger a Lambda which makes a call to start the appropriate job.
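As for the type mismatch itself: a plain java.util.concurrent.Future (which is what the error message indicates the call returns) cannot be cast to a CompletableFuture, but it can be adapted. The sketch below uses only the JDK; `toCompletable` is a hypothetical helper, and the `pool.submit(...)` call merely stands in for an SDK method that returns a Future. Note that this adapter parks one thread per pending call, so it changes the programming model rather than removing the wait.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FutureAdapterSketch {

    /** Adapts a plain Future by blocking on it on the given executor. */
    static <T> CompletableFuture<T> toCompletable(Future<T> future, Executor waiter) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return future.get(); // blocks a 'waiter' thread, not the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new CompletionException(e);
            } catch (ExecutionException e) {
                throw new CompletionException(e.getCause());
            }
        }, waiter);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Stand-in for an SDK call that returns java.util.concurrent.Future.
            Future<String> plain = pool.submit(() -> "metrics");

            toCompletable(plain, pool)
                .thenApply(String::toUpperCase)   // transform the result
                .thenAccept(System.out::println)  // e.g. publish it somewhere
                .join();
        } finally {
            pool.shutdown();
        }
    }
}
```

If the SDK in use also offers a callback-style overload, completing a CompletableFuture from that callback would avoid the parked thread; whether that is available depends on the client type.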

Http Websocket as Akka Stream Source

I'd like to listen on a websocket using akka streams. That is, I'd like to treat it as nothing but a Source.
However, all official examples treat the websocket connection as a Flow.
My current approach is using the websocketClientFlow in combination with a Source.maybe. This eventually results in the upstream failing due to a TcpIdleTimeoutException, when there are no new Messages being sent down the stream.
Therefore, my question is twofold:
Is there a way – which I obviously missed – to treat a websocket as just a Source?
If using the Flow is the only option, how does one handle the TcpIdleTimeoutException properly? The exception can not be handled by providing a stream supervision strategy. Restarting the source by using a RestartSource doesn't help either, because the source is not the problem.
Update
So I tried two different approaches, setting the idle timeout to 1 second for convenience
application.conf
akka.http.client.idle-timeout = 1s
Using keepAlive (as suggested by Stefano)
Source.<Message>maybe()
.keepAlive(Duration.apply(1, "second"), () -> (Message) TextMessage.create("keepalive"))
.viaMat(Http.get(system).webSocketClientFlow(WebSocketRequest.create(websocketUri)), Keep.right())
{ ... }
When doing this, the Upstream still fails with a TcpIdleTimeoutException.
Using RestartFlow
However, I found out about this approach, using a RestartFlow:
final Flow<Message, Message, NotUsed> restartWebsocketFlow = RestartFlow.withBackoff(
    Duration.apply(3, TimeUnit.SECONDS),
    Duration.apply(30, TimeUnit.SECONDS),
    0.2,
    () -> createWebsocketFlow(system, websocketUri)
);
Source.<Message>maybe()
    .viaMat(restartWebsocketFlow, Keep.right()) // One can treat this part of the resulting graph as a `Source<Message, NotUsed>`
{ ... }
(...)
private Flow<Message, Message, CompletionStage<WebSocketUpgradeResponse>> createWebsocketFlow(final ActorSystem system, final String websocketUri) {
    return Http.get(system).webSocketClientFlow(WebSocketRequest.create(websocketUri));
}
This works in that I can treat the websocket as a Source (although artificially, as explained by Stefano) and keep the TCP connection alive by restarting the websocketClientFlow whenever an exception occurs.
This doesn't feel like the optimal solution though.
No. WebSocket is a bidirectional channel, and Akka-HTTP therefore models it as a Flow. If in your specific case you care only about one side of the channel, it's up to you to form a Flow with a "muted" side, by using either Flow.fromSinkAndSource(Sink.ignore, mySource) or Flow.fromSinkAndSource(mySink, Source.maybe), depending on the case.
As per the documentation:
Inactive WebSocket connections will be dropped according to the
idle-timeout settings. In case you need to keep inactive connections
alive, you can either tweak your idle-timeout or inject ‘keep-alive’
messages regularly.
There is an ad-hoc combinator to inject keep-alive messages, see the example below and this Akka cookbook recipe. NB: this should happen on the client side.
src.keepAlive(1.second, () => TextMessage.Strict("ping"))
I hope I understand your question correctly. Are you looking for asSourceOf?
path("measurements") {
entity(asSourceOf[Measurement]) { measurements =>
// measurement has type Source[Measurement, NotUsed]
...
}
}

How do I make an async call to Hive in Java?

I would like to execute a Hive query on the server in an asynchronous manner. The Hive query will likely take a long time to complete, so I would prefer not to block on the call. I am currently using Thrift to make a blocking call (it blocks on client.execute()), but I have not seen an example of how to make a non-blocking call. Here is the blocking code:
TSocket transport = new TSocket("hive.example.com", 10000);
transport.setTimeout(999999999);
TBinaryProtocol protocol = new TBinaryProtocol(transport);
Client client = new ThriftHive.Client(protocol);
transport.open();
client.execute(hql); // Omitted HQL
List<String> rows;
while ((rows = client.fetchN(1000)) != null) {
    for (String row : rows) {
        // Do stuff with row
    }
}
transport.close();
The code above is missing try/catch blocks to keep it short.
Does anyone have any ideas how to do an async call? Can Hive/Thrift support it? Is there a better way?
Thanks!
AFAIK, at the time of writing Thrift does not generate asynchronous clients. The reason as explained in this link here (search text for "asynchronous") is that Thrift was designed for the data centre where latency is assumed to be low.
Unfortunately as you know the latency experienced between call and result is not always caused by the network, but by the logic being performed! We have this problem calling into the Cassandra database from a Java application server where we want to limit total threads.
Summary: for now all you can do is make sure you have sufficient resources to handle the required numbers of blocked concurrent threads and wait for a more efficient implementation.
It is now possible to make an asynchronous call in a Java thrift client after this patch was put in:
https://issues.apache.org/jira/browse/THRIFT-768
Generate the async java client using the new thrift and initialize your client as follows:
TNonblockingTransport transport = new TNonblockingSocket("127.0.0.1", 9160);
TAsyncClientManager clientManager = new TAsyncClientManager();
TProtocolFactory protocolFactory = new TBinaryProtocol.Factory();
Hive.AsyncClient client = new Hive.AsyncClient(protocolFactory, clientManager, transport);
Now you can execute methods on this client as you would on a synchronous interface. The only change is that all methods take an additional parameter of a callback.
I know nothing about Hive, but as a last resort, you can use Java's concurrency library:
Callable<SomeResult> c = new Callable<SomeResult>() {
    public SomeResult call() {
        // your Hive code here
    }
};
Future<SomeResult> result = executorService.submit(c);
// when you need the result, this will block
result.get();
Or, if you do not need to wait for the result, use Runnable instead of Callable.
After talking to the Hive mailing list: Hive does not support async calls using Thrift.
I don't know about Hive in particular, but any blocking call can be turned into an async call by spawning a new thread and using a callback. You could look at java.util.concurrent.FutureTask, which has been designed to allow easy handling of such asynchronous operations.
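The thread-plus-callback idea can be sketched as follows. This version uses CompletableFuture (Java 8+) rather than FutureTask, because it lets the callback be attached directly. The `query` supplier is a hypothetical stand-in for the blocking client.execute(hql) and fetchN(...) calls, and the pool size is an arbitrary assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class AsyncQuerySketch {

    /** Runs a blocking query on the given pool and returns immediately. */
    static CompletableFuture<List<String>> runAsync(Supplier<List<String>> blockingQuery,
                                                    ExecutorService pool) {
        return CompletableFuture.supplyAsync(blockingQuery, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Hypothetical stand-in for the blocking Hive calls.
            Supplier<List<String>> query = () -> Arrays.asList("row1", "row2");

            CompletableFuture<Void> done = runAsync(query, pool)
                .thenAccept(rows -> rows.forEach(System.out::println)) // callback on success
                .exceptionally(error -> {                              // callback on failure
                    System.err.println("query failed: " + error);
                    return null;
                });

            // The caller is free to do other work here; join() only keeps the demo alive.
            done.join();
        } finally {
            pool.shutdown();
        }
    }
}
```

Each in-flight query still occupies one pool thread for its whole duration, so the pool size bounds how many queries can run concurrently.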
We fire off asynchronous calls to AWS Elastic MapReduce. AWS MapReduce can run hadoop/hive jobs on Amazon's cloud with a call to the AWS MapReduce web services.
You can also monitor the status of your jobs and grab the results off S3 once the job is completed.
Since the calls to the web services are asynchronous in nature, we never block our other operations. We continue to monitor the status of our jobs in a separate thread and grab the results when the job is complete.
