Sink in Flink blocks the task execution - java

I have a Sink in Flink, which extends from RichSinkFunction.
It delays the execution of the whole Flink job (if I remove it, the runtime drops by half, from 10 minutes to less than 5). This is its configuration:
OutputTag<List<SessionSinkModel>> inProgressSessionOutputTag =
    new OutputTag<>(ProcessorConstants.IN_PROGRESS_SESSIONS_SINK_NAME) {};

SingleOutputStreamOperator<SessionAccumulator> aggregatedSessionStream =
    collectionMessageDataStream
        .keyBy(CollectionMessage::getSessionId)
        .process(sessionKeyedProcessFunction)
        .uid("SessionWindow")
        .name("Session Window")
        .setParallelism(4);

DataStream<List<SessionSinkModel>> inProgressSessionStream =
    aggregatedSessionStream.getSideOutput(inProgressSessionOutputTag);

inProgressSessionStream
    .broadcast()
    .addSink(new SessionAPISink(config))
    .uid("Sessions side output")
    .name("Sessions side output");
This Sink POSTs large amounts of data to an endpoint; the POST call is asynchronous (as far as I know, so is the Sink call). I emit to the side output in the standard way, using the output from KeyedBroadcastProcessFunction.ReadOnlyContext ctx:
ctx.output(outputTag, message);
How can I make this Sink not block the task execution?

There are two issues that I see with the workflow...
You shouldn't be doing an inProgressSessionStream.broadcast().
For efficient async IO, you want to use Flink's AsyncIO support, and then follow that with a DiscardingSink.
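A rough sketch of what that could look like with Flink's AsyncDataStream (the SessionAPIAsyncFunction class, the SessionAPIAsyncClient and its postAsync() method are illustrative placeholders, not from the original post):

// Hypothetical async function wrapping the POST call; the client and its
// postAsync() method (returning a CompletableFuture) are placeholders.
class SessionAPIAsyncFunction extends RichAsyncFunction<List<SessionSinkModel>, String> {

    private final Config config;
    private transient SessionAPIAsyncClient client;

    SessionAPIAsyncFunction(Config config) {
        this.config = config;
    }

    @Override
    public void open(Configuration parameters) {
        client = new SessionAPIAsyncClient(config); // assumed non-blocking HTTP client
    }

    @Override
    public void asyncInvoke(List<SessionSinkModel> batch, ResultFuture<String> resultFuture) {
        client.postAsync(batch).whenComplete((response, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singleton(response));
            }
        });
    }
}

// Replace broadcast() + addSink(new SessionAPISink(config)) with async I/O and a DiscardingSink:
AsyncDataStream
    .unorderedWait(inProgressSessionStream, new SessionAPIAsyncFunction(config),
                   30, TimeUnit.SECONDS, 100)   // timeout and capacity values are illustrative
    .name("Sessions side output")
    .addSink(new DiscardingSink<>());

This way the POST requests overlap instead of being issued one at a time from the sink, and back-pressure kicks in once the configured capacity of in-flight requests is reached.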

Related

Lettuce StatefulRedisConnection async command execution order

I'm a bit confused about the order of Redis command execution when using the Lettuce driver.
Examples use code like
private val cacheConnection: StatefulRedisConnection<String, String>
// (...)
cacheConnection.async().getset(keyStr, json)
cacheConnection.async().expire(keyStr, expireAfterWrite)
https://github.com/lettuce-io/lettuce-core/issues/1627
https://www.baeldung.com/java-redis-lettuce
However, the documentation states
A good example is the async API. Every invocation on the async API returns a Future (response handle) after the command is written to the netty pipeline. A write to the pipeline does not mean the command is written to the underlying transport. Multiple commands can be written without awaiting the response. Invocations to the API (sync, async and starting with 4.0 also reactive API) can be performed by multiple threads.
(https://github.com/lettuce-io/lettuce-core/wiki/Pipelining-and-command-flushing)
This does not specify when the commands are put in the pipeline. Shouldn't I use thenAccept instead?
cacheConnection.async().getset(keyStr, json)
.thenAccept { expire(keyStr, expireAfterWrite) }
That would mean all these examples are wrong, which is... improbable?
Can you please explain how this works? Is execution order preservation just a systematic coincidence (i.e. an implementation detail)?
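For reference, the explicitly chained variant from the question would look roughly like this in Java (Lettuce's RedisFuture implements CompletionStage; the variable names are just illustrative):

// Only issue EXPIRE after GETSET has completed, by chaining the futures explicitly.
RedisAsyncCommands<String, String> async = cacheConnection.async();

async.getset(keyStr, json)
     .thenCompose(previousValue -> async.expire(keyStr, expireAfterWrite))
     .exceptionally(error -> {
         // handle or log the failure; returning false is just a placeholder
         return false;
     });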

passing an Akka stream to an upstream service to populate

I need to call an upstream service (Azure Blob Service) to push data into an OutputStream, which I then need to turn around and push back to the client through Akka. Without Akka (just servlet code), I'd get the ServletOutputStream and pass it to the Azure service's method.
The closest I've been able to stumble upon, and clearly this is wrong, is something like this:
Source<ByteString, OutputStream> source =
    StreamConverters.asOutputStream().mapMaterializedValue(os -> {
        blobClient.download(os);
        return os;
    });

ResponseEntity responseEntity =
    HttpEntities.create(ContentTypes.APPLICATION_OCTET_STREAM, preAuthData.getFileSize(), source);

sender().tell(new RequestResult(responseEntity, StatusCodes.OK), self());
The idea is that I'm calling an upstream service to get an OutputStream populated by calling
blobClient.download(os);
It seems like the lambda function gets called and returns, but then it fails afterwards because there's no data or something. As if I'm not supposed to have that lambda function do the work, but perhaps return some object that does the work? Not sure.
How does one do this?
The real issue here is that the Azure API is not designed for back-pressuring. There is no way for the output stream to signal back to Azure that it is not ready for more data. To put it another way: if Azure pushes data faster than you are able to consume it, there will have to be some ugly buffer overflow failure somewhere.
Accepting this fact, the next best thing we can do is:
Use Source.lazySource to only start downloading data when there is downstream demand (i.e. the source is being run and data is being requested).
Put the download call in some other thread so that it continues executing without blocking the source from being returned. One way to do this is with a Future (I'm not sure what Java best practices are, but it should work fine either way). Although it won't matter initially, you may need to choose an execution context other than system.dispatcher - it all depends on whether download is blocking or not.
I apologize in advance if this Java code is malformed - I use Akka with Scala, so this is all from looking at the Akka Java API and Java syntax reference.
ResponseEntity responseEntity = HttpEntities.create(
    ContentTypes.APPLICATION_OCTET_STREAM,
    preAuthData.getFileSize(),
    // Wait until there is downstream demand to initialize the source...
    Source.lazySource(() -> {
        // Pre-materialize the OutputStream before the source starts running
        Pair<OutputStream, Source<ByteString, NotUsed>> pair =
            StreamConverters.asOutputStream().preMaterialize(system);
        // Start writing into the download stream in a separate thread
        Futures.future(() -> {
            blobClient.download(pair.first());
            return pair.first();
        }, system.getDispatcher());
        // Return the source - it should start running since `lazySource` indicated demand
        return pair.second();
    })
);

sender().tell(new RequestResult(responseEntity, StatusCodes.OK), self());
The OutputStream in this case is the "materialized value" of the Source and it will only be created once the stream is run (or "materialized" into a running stream). Running it is out of your control since you hand the Source to Akka HTTP and that will later actually run your source.
.mapMaterializedValue(matval -> ...) is usually used to transform the materialized value, but since it is invoked as part of materialization you can use it for side effects such as sending the matval in a message, just like you have figured out; there isn't necessarily anything wrong with that even if it looks funky. It is important to understand that the stream will not complete its materialization and become running until that lambda completes. This means problems if download() is blocking rather than forking off work on a different thread and returning immediately.
There is, however, another solution: Source.preMaterialize(). It materializes the source and gives you a Pair of the materialized value and a new Source that can be used to consume the already started source:
Pair<OutputStream, Source<ByteString, NotUsed>> pair =
    StreamConverters.asOutputStream().preMaterialize(system);

OutputStream os = pair.first();
Source<ByteString, NotUsed> source = pair.second();
Note that there are a few additional things to think about in your code. Most importantly, if the blobClient.download(os) call blocks until it is done and you call it from the actor, you must make sure that your actor does not starve the dispatcher and stop other actors in your application from executing (see the Akka docs: https://doc.akka.io/docs/akka/current/typed/dispatchers.html#blocking-needs-careful-management).
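One common way to handle that, sketched under the assumption that you define a dedicated dispatcher named "blocking-io-dispatcher" in application.conf (the name and pool size are illustrative, not from the original answer), is to run the download there instead of on system.dispatcher:

// application.conf (assumed):
// blocking-io-dispatcher {
//   type = Dispatcher
//   executor = "thread-pool-executor"
//   thread-pool-executor { fixed-pool-size = 16 }
// }

// Look up the dedicated dispatcher and use it for the blocking download call,
// so the actor's default dispatcher is not starved by blocked threads.
ExecutionContext blockingDispatcher = system.dispatchers().lookup("blocking-io-dispatcher");

Futures.future(() -> {
    blobClient.download(pair.first());
    return NotUsed.getInstance();
}, blockingDispatcher);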

Spark Stream new Job after stream start

I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I can see any updates (if any) on Kafka via the stream.
The issue comes in when I have a new request to stream a new topic. Since there can only be one Spark streaming context per JVM, I cannot create a new stream for every new request.
Here is what I tried:
First, once a DStream is created and Spark streaming is already in progress, just attach a new stream to it. This does not seem to work: createDirectStream (for a new topic2) does not return a stream and further processing stops. Streaming keeps on continuing for the first request (say topic1).
Second, I thought to stop the stream, create a DStream and then start streaming again. I cannot reuse the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old topic's stream (topic1) is lost and it streams only the new one.
Here is the code, have a look
JavaStreamingContext javaStreamingContext;
if (null == javaStreamingContext) {
    javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
    StreamingContextState streamingContextState = javaStreamingContext.getState();
    if (streamingContextState == StreamingContextState.STOPPED) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    }
}
Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
    .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
    .foreachRDD(impl);

if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl; it is a custom class that implements VoidFunction.
The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; control just terminates without an error. The further steps are not executed.
I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming will be done on multiple different topics, and each of them is handled differently.
Please help
The thing is that the Spark master sends out code to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I can think of:
Spark Job Server: every time you want to subscribe/stream from a different topic, start a new job instead of touching the already running one. From your API body you can supply the parameters or topic name. If you want to stop streaming from a specific topic, just stop the respective job. This gives you a lot of flexibility and control over resources.
[Theoretical] Topic filter: subscribe to all the topics you think you will want; when records are pulled for a batch duration, filter them based on a list of topics. Manipulate this list of topics through your API to increase or decrease the scope of topics (it could be a broadcast variable as well). This is just an idea, I have not tried this option at all (a rough sketch follows below).
Another workaround is to relay your topic-2 data to topic-1 using a microservice whenever you need it, and stop it when you don't.
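A minimal sketch of the topic-filter idea (the topic names and the activeTopics set are assumptions for illustration; how that set is updated at runtime through your API is left open):

// Subscribe once to the superset of topics you might ever need,
// then filter each record against the currently active topics.
Collection<String> allTopics = Arrays.asList("topic1", "topic2");
Set<String> activeTopics = new HashSet<>(Arrays.asList("topic1")); // could also be a broadcast variable

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(allTopics, getKafkaParamMap()))
    .filter(rec -> activeTopics.contains(rec.topic()))   // drop records from inactive topics
    .map(rec -> rec.value())
    .foreachRDD(impl);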

Apache Flink: Correctly make async webservice calls within MapReduce()

I have a program with the following mapPartition function:
public void mapPartition(Iterable<Tuple> values, Collector<Tuple2<Integer, String>> out)
I collect batches of 100 from the input values and send them to a web service for conversion. The results I add back to the out collector.
In order to speed up the process, I made the web-service calls async through the use of Executors. This created issues: either I get the taskManager-released exception or an AskTimeoutException. I increased memory and timeouts, but it didn't help. There's quite a lot of input data; I believe this resulted in a lot of jobs being queued up in the ExecutorService and hence taking up lots of memory.
What would be the best approach for this?
I was also looking at the taskManager vs taskSlot configuration, but got a little confused about the differences between the two (I guess they're similar to processes vs threads?). I wasn't sure at what point I should increase taskManagers vs taskSlots. E.g. if I've got three machines with 4 CPUs per machine, should my taskManager be 3 while my taskSlot is 4?
I was also considering increasing the mapPartition's parallelism alone to, say, 10 to get more threads hitting the web service. Comments or suggestions?
You should check out Flink's AsyncIO, which enables you to query your web service asynchronously in your streaming application.
One thing to note is that the AsyncIO function is not called multithreaded; it is called once per record per partition, sequentially, so your web service needs to return deterministically, and potentially quickly, for the job not to be held up.
Also, a higher number of partitions would potentially help your case, but again your web service needs to fulfil those requests fast enough.
Sample code block from Flink's website:
// This example implements the asynchronous request and callback with Futures that have the
// interface of Java 8's futures (which is the same one followed by Flink's Future)

/**
 * An implementation of the 'AsyncFunction' that sends requests and sets the callback.
 */
class AsyncDatabaseRequest extends RichAsyncFunction<String, Tuple2<String, String>> {

    /** The database specific client that can issue concurrent requests with callbacks */
    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient(host, post, credentials);
    }

    @Override
    public void close() throws Exception {
        client.close();
    }

    @Override
    public void asyncInvoke(final String str, final AsyncCollector<Tuple2<String, String>> asyncCollector) throws Exception {
        // issue the asynchronous request, receive a future for the result
        Future<String> resultFuture = client.query(str);

        // set the callback to be executed once the request by the client is complete
        // the callback simply forwards the result to the collector
        resultFuture.thenAccept((String result) -> {
            asyncCollector.collect(Collections.singleton(new Tuple2<>(str, result)));
        });
    }
}

// create the original stream (in your case the stream you are mapPartitioning)
DataStream<String> stream = ...;

// apply the async I/O transformation
DataStream<Tuple2<String, String>> resultStream =
    AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);
Edit:
As the user wants to create batches of size 100, and AsyncIO is specific to the Streaming API for the moment, the best way would be to create count windows of size 100 (a rough sketch follows below).
Also, to purge the last window, which might not have 100 events, custom triggers could be used with a combination of count triggers and time-based triggers, such that the trigger fires after a count of elements or after every few minutes.
A good follow-up is available here on the Flink mailing list, where the user "Kostya" created a custom trigger, which is available here.
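Here is roughly what the count-window batching could look like, assuming the elements are plain Strings (the names are illustrative, and the custom trigger for purging the last window is omitted):

// Group the stream into non-keyed batches of 100 elements each.
// Note: countWindowAll runs with parallelism 1; use keyBy(...).countWindow(100) for parallel batching.
DataStream<List<String>> batched = stream
    .countWindowAll(100)
    .apply(new AllWindowFunction<String, List<String>, GlobalWindow>() {
        @Override
        public void apply(GlobalWindow window, Iterable<String> values, Collector<List<String>> out) {
            List<String> batch = new ArrayList<>();
            for (String value : values) {
                batch.add(value);
            }
            out.collect(batch);
        }
    });

// Each batch of 100 can then be sent to the web service via AsyncDataStream, as shown above.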

How do I make an async call to Hive in Java?

I would like to execute a Hive query on the server in an asynchronous manner. The Hive query will likely take a long time to complete, so I would prefer not to block on the call. I am currently using Thrift to make a blocking call (it blocks on client.execute()), but I have not seen an example of how to make a non-blocking call. Here is the blocking code:
TSocket transport = new TSocket("hive.example.com", 10000);
transport.setTimeout(999999999);

TBinaryProtocol protocol = new TBinaryProtocol(transport);
Client client = new ThriftHive.Client(protocol);
transport.open();

client.execute(hql); // Omitted HQL

List<String> rows;
while ((rows = client.fetchN(1000)) != null) {
    for (String row : rows) {
        // Do stuff with row
    }
}

transport.close();
The code above is missing try/catch blocks to keep it short.
Does anyone have any ideas how to do an async call? Can Hive/Thrift support it? Is there a better way?
Thanks!
AFAIK, at the time of writing, Thrift does not generate asynchronous clients. The reason, as explained in this link (search the text for "asynchronous"), is that Thrift was designed for the data centre, where latency is assumed to be low.
Unfortunately as you know the latency experienced between call and result is not always caused by the network, but by the logic being performed! We have this problem calling into the Cassandra database from a Java application server where we want to limit total threads.
Summary: for now all you can do is make sure you have sufficient resources to handle the required numbers of blocked concurrent threads and wait for a more efficient implementation.
It is now possible to make an asynchronous call in a Java thrift client after this patch was put in:
https://issues.apache.org/jira/browse/THRIFT-768
Generate the async Java client using the new Thrift and initialize your client as follows:
TNonblockingTransport transport = new TNonblockingSocket("127.0.0.1", 9160);
TAsyncClientManager clientManager = new TAsyncClientManager();
TProtocolFactory protocolFactory = new TBinaryProtocol.Factory();
Hive.AsyncClient client = new Hive.AsyncClient(protocolFactory, clientManager, transport);
Now you can execute methods on this client as you would on a synchronous interface. The only change is that all methods take an additional parameter of a callback.
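As a rough sketch of what such a call might look like (the exact generated method and callback types depend on the Thrift version and the Hive IDL, so treat the names below as illustrative):

client.execute(hql, new AsyncMethodCallback<ThriftHive.AsyncClient.execute_call>() {
    @Override
    public void onComplete(ThriftHive.AsyncClient.execute_call response) {
        // query submitted; fetch results or notify the caller here
    }

    @Override
    public void onError(Exception exception) {
        // handle the failure without blocking the calling thread
    }
});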
I know nothing about Hive, but as a last resort, you can use Java's concurrency library:
Callable<SomeResult> c = new Callable<SomeResult>() {
    public SomeResult call() {
        // your Hive code here
    }
};

Future<SomeResult> result = executorService.submit(c);

// when you need the result, this will block
result.get();
Or, if you do not need to wait for the result, use Runnable instead of Callable.
After talking to the Hive mailing list, I learned that Hive does not support async calls using Thrift.
I don't know about Hive in particular, but any blocking call can be turned into an async call by spawning a new thread and using a callback. You could look at java.util.concurrent.FutureTask, which has been designed to allow easy handling of such asynchronous operations.
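A minimal sketch of that idea (doQuery is a hypothetical wrapper around the blocking Thrift code above):

// Wrap the blocking Hive call in a FutureTask and run it on its own thread.
FutureTask<List<String>> task = new FutureTask<>(() -> doQuery(hql));
new Thread(task, "hive-query").start();

// ... do other work while the query runs ...

List<String> rows = task.get(); // blocks only when the result is actually needed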
We fire off asynchronous calls to AWS Elastic MapReduce. AWS MapReduce can run hadoop/hive jobs on Amazon's cloud with a call to the AWS MapReduce web services.
You can also monitor the status of your jobs and grab the results off S3 once the job is completed.
Since the calls to the web services are asynchronous in nature, we never block our other operations. We continue to monitor the status of our jobs in a separate thread and grab the results when the job is complete.
