How do I make an async call to Hive in Java?

I would like to execute a Hive query on the server in an asynchronous manner. The Hive query will likely take a long time to complete, so I would prefer not to block on the call. I am currently using Thrift to make a blocking call (it blocks on client.execute()), but I have not seen an example of how to make a non-blocking call. Here is the blocking code:
TSocket transport = new TSocket("hive.example.com", 10000);
transport.setTimeout(999999999);
TBinaryProtocol protocol = new TBinaryProtocol(transport);
Client client = new ThriftHive.Client(protocol);
transport.open();
client.execute(hql); // Omitted HQL
List<String> rows;
while ((rows = client.fetchN(1000)) != null) {
    for (String row : rows) {
        // Do stuff with row
    }
}
transport.close();
The code above omits try/catch blocks to keep it short.
Does anyone have any ideas how to do an async call? Can Hive/Thrift support it? Is there a better way?
Thanks!

AFAIK, at the time of writing, Thrift does not generate asynchronous clients. The reason, as explained in this link (search the text for "asynchronous"), is that Thrift was designed for the data centre, where latency is assumed to be low.
Unfortunately, as you know, the latency experienced between call and result is not always caused by the network but by the logic being performed! We have this problem calling into the Cassandra database from a Java application server, where we want to limit the total number of threads.
Summary: for now all you can do is make sure you have sufficient resources to handle the required number of blocked concurrent threads and wait for a more efficient implementation.

It is now possible to make an asynchronous call in a Java Thrift client after this patch was put in:
https://issues.apache.org/jira/browse/THRIFT-768
Generate the async Java client using the new Thrift compiler and initialize your client as follows:
TNonblockingTransport transport = new TNonblockingSocket("127.0.0.1", 9160);
TAsyncClientManager clientManager = new TAsyncClientManager();
TProtocolFactory protocolFactory = new TBinaryProtocol.Factory();
Hive.AsyncClient client = new Hive.AsyncClient(protocolFactory, clientManager, transport);
Now you can execute methods on this client as you would on a synchronous interface. The only change is that all methods take an additional parameter of a callback.
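For example, submitting the query with a callback looks roughly like this (a sketch only; the exact generated class names and callback signature depend on your Thrift version and the Hive IDL):
client.execute(hql, new AsyncMethodCallback<Hive.AsyncClient.execute_call>() {
    public void onComplete(Hive.AsyncClient.execute_call response) {
        // the execute call has finished; fetch rows here (e.g. via fetchN)
    }
    public void onError(Exception e) {
        // handle the failure
    }
});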

I know nothing about Hive, but as a last resort, you can use Java's concurrency library:
// requires java.util.concurrent: ExecutorService, Executors, Callable, Future
ExecutorService executorService = Executors.newSingleThreadExecutor();
Callable<SomeResult> c = new Callable<SomeResult>() {
    public SomeResult call() {
        // your Hive code here
        return someResult;
    }
};
Future<SomeResult> result = executorService.submit(c);
// when you need the result, this will block
result.get();
Or, if you do not need to wait for the result, use Runnable instead of Callable.
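A minimal fire-and-forget sketch of that variant (runHiveQuery is a hypothetical method wrapping the blocking Thrift code above):
executorService.submit(new Runnable() {
    public void run() {
        runHiveQuery(hql); // fire and forget; remember to log exceptions here
    }
});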

After checking with the Hive mailing list: Hive does not support async calls using Thrift.

I don't know about Hive in particular, but any blocking call can be turned into an async call by spawning a new thread and using a callback. You could look at java.util.concurrent.FutureTask, which has been designed to make handling such asynchronous operations easy.
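A minimal sketch of that idea, again assuming a hypothetical runHiveQuery method that wraps the blocking Thrift call:
// requires java.util.concurrent.FutureTask
FutureTask<List<String>> task = new FutureTask<>(() -> runHiveQuery(hql));
new Thread(task).start();
// ... do other work ...
List<String> rows = task.get(); // blocks only when the result is actually needed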

We fire off asynchronous calls to AWS Elastic MapReduce. It can run Hadoop/Hive jobs on Amazon's cloud with a call to the AWS Elastic MapReduce web services.
You can also monitor the status of your jobs and grab the results off S3 once the job is completed.
Since the calls to the web services are asynchronous in nature, we never block our other operations. We continue to monitor the status of our jobs in a separate thread and grab the results when the job is complete.
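A rough sketch of that monitoring pattern, using AWS SDK v1 client names (the cluster ID, step ID, and S3 locations are placeholders, and error handling is omitted):
// uses com.amazonaws.services.elasticmapreduce.* and com.amazonaws.services.s3.*
AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
ExecutorService monitor = Executors.newSingleThreadExecutor();
monitor.submit(() -> {
    // poll the step status in a background thread so the main thread is never blocked
    while (true) {
        String state = emr.describeStep(new DescribeStepRequest()
                        .withClusterId("j-XXXXXXXXXXXXX")
                        .withStepId("s-XXXXXXXXXXXXX"))
                .getStep().getStatus().getState();
        if ("COMPLETED".equals(state)) {
            S3Object results = s3.getObject("my-results-bucket", "output/part-00000");
            // process the results, then stop monitoring
            break;
        }
        Thread.sleep(30_000);
    }
    return null;
});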

Related

Sink in Flink blocks the task execution

I have a Sink in Flink which extends RichSinkFunction.
It slows down the whole Flink job (if I remove it, the runtime drops by about half, from 10 minutes to less than 5). This is its configuration:
OutputTag<List<SessionSinkModel>> inProgressSessionOutputTag =
        new OutputTag<>(ProcessorConstants.IN_PROGRESS_SESSIONS_SINK_NAME) {};

SingleOutputStreamOperator<SessionAccumulator> aggregatedSessionStream =
        collectionMessageDataStream
                .keyBy(CollectionMessage::getSessionId)
                .process(sessionKeyedProcessFunction)
                .uid("SessionWindow")
                .name("Session Window")
                .setParallelism(4);

DataStream<List<SessionSinkModel>> inProgressSessionStream = aggregatedSessionStream
        .getSideOutput(inProgressSessionOutputTag);

inProgressSessionStream
        .broadcast()
        .addSink(new SessionAPISink(config))
        .uid("Sessions side output")
        .name("Sessions side output");
This Sink sends a large amount of data by POST to an endpoint; as far as I know this POST call is asynchronous (like the Sink call). I use the standard call with the output from the KeyedBroadcastProcessFunction.ReadOnlyContext ctx:
ctx.output(outputTag, message);
How can I make this Sink not block the task execution?
There are two issues that I see with this workflow:
You shouldn't be doing an inProgressSessionStream.broadcast().
For efficient async I/O, you want to use Flink's Async I/O support and then follow it with a DiscardingSink, as in the sketch below.
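A rough sketch of that shape, assuming a hypothetical SessionApiAsyncFunction (a RichAsyncFunction that issues the POST and completes the ResultFuture from the HTTP client's callback):
// uses org.apache.flink.streaming.api.datastream.AsyncDataStream
// and org.apache.flink.streaming.api.functions.sink.DiscardingSink
DataStream<List<SessionSinkModel>> inProgressSessionStream = aggregatedSessionStream
        .getSideOutput(inProgressSessionOutputTag);

AsyncDataStream
        .unorderedWait(inProgressSessionStream,
                new SessionApiAsyncFunction(config), // hypothetical RichAsyncFunction doing the POST
                30, TimeUnit.SECONDS,                // timeout per async request
                100)                                 // max in-flight requests
        .addSink(new DiscardingSink<>())
        .uid("Sessions side output")
        .name("Sessions side output");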

Lettuce StatefulRedisConnection async command execution order

I'm a bit confused about the order of Redis command execution when using the Lettuce driver.
Examples use code like
private val cacheConnection: StatefulRedisConnection<String, String>
// (...)
cacheConnection.async().getset(keyStr, json)
cacheConnection.async().expire(keyStr, expireAfterWrite)
https://github.com/lettuce-io/lettuce-core/issues/1627
https://www.baeldung.com/java-redis-lettuce
However, the documentation states
good example is the async API. Every invocation on the async API returns a Future (response handle) after the command is written to the netty pipeline. A write to the pipeline does not mean, the command is written to the underlying transport. Multiple commands can be written without awaiting the response. Invocations to the API (sync, async and starting with 4.0 also reactive API) can be performed by multiple threads.
(https://github.com/lettuce-io/lettuce-core/wiki/Pipelining-and-command-flushing)
This does not specify when the commands are put in the pipeline. Shouldn't I use thenAccept instead?
cacheConnection.async().getset(keyStr, json)
.thenAccept { expire(keyStr, expireAfterWrite) }
That would mean that all these examples are wrong, which is... improbable?
Can you please explain how this works? Is the preservation of execution order just a systematic coincidence (i.e. an implementation detail)?

How to use CompletableFuture with AWS Glue job status?

I have a requirement where I need to get the status of an AWS Glue crawler, which is an async request, and fire certain events when the jobs complete. The catch here is that I do not want to use polling. On looking further, the AWS docs suggest using a CompletableFuture object to deal with async requests in AWS. But when I try to use it, I am not able to create the CompletableFuture object because I get a type mismatch. I have this code:
GetCrawlerMetricsRequest metricsRequest =
        new GetCrawlerMetricsRequest().withCrawlerNameList(Arrays.asList("myJavaCrawler"));
GetCrawlerMetricsResult jsonOb = awsglueClient.getCrawlerMetrics(metricsRequest);
CompletableFuture<GetCrawlerMetricsResult> futureResponse =
        (CompletableFuture<GetCrawlerMetricsResult>) awsglueClient.getCrawlerMetricsAsync(metricsRequest);
But the futureResponse assignment fails with an error stating that FutureTask cannot be cast to CompletableFuture.
I am following the approach given here
I am not sure how I can make this work. Based on this futureResponse object, I can then use the .whenApply function to trigger the job I want to execute, such as pushing the above response into a Kafka queue. Any ideas?
It seems like you are using AWS SDK v1, while the doc you mentioned shows how to do it using v2 (which has 'Developer Preview' status, so it is not recommended for production). Here is a doc showing how to make async calls in v1.
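For reference, a minimal sketch of bridging the v1 async Glue client into a CompletableFuture via the AsyncHandler callback (client construction and names are illustrative):
// uses com.amazonaws.services.glue.AWSGlueAsync / AWSGlueAsyncClientBuilder
// and com.amazonaws.handlers.AsyncHandler
AWSGlueAsync glueAsync = AWSGlueAsyncClientBuilder.defaultClient();
GetCrawlerMetricsRequest metricsRequest =
        new GetCrawlerMetricsRequest().withCrawlerNameList(Arrays.asList("myJavaCrawler"));

CompletableFuture<GetCrawlerMetricsResult> futureResponse = new CompletableFuture<>();
glueAsync.getCrawlerMetricsAsync(metricsRequest,
        new AsyncHandler<GetCrawlerMetricsRequest, GetCrawlerMetricsResult>() {
            @Override
            public void onSuccess(GetCrawlerMetricsRequest request, GetCrawlerMetricsResult result) {
                futureResponse.complete(result);
            }
            @Override
            public void onError(Exception e) {
                futureResponse.completeExceptionally(e);
            }
        });

futureResponse.whenComplete((result, error) -> {
    // e.g. push the result to a Kafka queue here
});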
For your use case I would recommend another approach. Glue posts a few types of events, and one of them is "Crawler Succeeded". So you can create a CloudWatch rule to catch these events and trigger a Lambda which will make a call to start the appropriate job.

Using netty with 3rd party blocking API

I am using a 3rd party blocking API. I am going to be using this API as follows:
while (true) {
    blockingAPI();
    sendResultSomewhere();
}
blockingAPI() polls a server for a specific property until it gets a response.
In order to make things asynchronous to some extent, I could spawn this API call in a separate thread and have a callback implemented in Java to handle the response. I was wondering if I can use the Netty framework in this scenario, and how I could do this. The examples I have seen involve a server that listens and communicates with a client, and I am not sure how my use case fits in.
If Netty cannot be used, would my best bet be spawning a new thread and implementing a callback in Java?
Not sure what you are really trying to do:
Spawn internally a new thread: you could use a LocalChannel with Netty to get intra-JVM communication, and therefore something like what you want, without any network involved (only within the JVM). The blockingAPI would be executed on the server side of the LocalChannel, while the result would be written back once the client side gets a response through the same LocalChannel.
Spawn but with a request from outside (network): then Netty could of course be used there too, perhaps still keeping the LocalChannel logic to separate the network from the computation.
Note that I would recommend using asynchronous operations with the LocalChannel (executing the blocking task), such that the "send somewhere else" step is done without blocking Netty's network I/O thread.
Network handler side:
localChannel = creationWithinNetworkHandler(networkChannelCtx);
localChannel.writeAndFlush(something);
while the LocalChannel handler on the server side could be:
void channelRead0(ChannelHandlerContext ctx, SomeData data) {
    blockingAPI();
    ctx.channel().writeAndFlush(answer).addListener(ChannelFutureListener.CLOSE);
}
and the LocalChannel handler on the client side could be:
void channelRead0(ChannelHandlerContext ctx, Answer answer) {
    // forward the answer through the network channel's context
    networkCtx.writeAndFlush(answer);
}

What is the most efficient way to create asynchronous requests using Axis in Java?

I'm looking for the best solution to this problem:
I have a client and a server.
The client sends requests to the server using the call.invoke method.
The call is currently synchronous and waits for the answer.
The time it takes to receive the reply from the server under load is around 1 second (this is a lot of time).
On the client side we generate around 50-100 requests per second, so the queue is exploding.
For now I have created a thread pool that works asynchronously and sends the requests to the server, one per thread, but each request itself is still synchronous.
That means the thread pool has to maintain ~100 threads if we want this to work well.
I'm not sure this is the best solution.
I was also thinking of somehow creating one thread that sends the requests and one thread that catches the replies, but then I'm afraid I would just pass the load on to the server side.
A few things that are important:
We cannot change the code on the server side and we cannot control the time it takes to receive a reply.
When we receive the reply we just use the data to create another data structure and pass it on, so the timestamp is not really important.
We are using the Axis API.
Any idea of the best way to solve this? Does a thread pool of ~100 threads seem fine, or are there other ways?
Thanks!
You can call an Axis2 service in a non-blocking way by registering a callback instance.
Client class:
ServiceClient sc = new ServiceClient();
Options opt = new Options();
// set the target endpoint
opt.setTo(new EndpointReference("http://localhost:8080/axis2/services/CountryService"));
opt.setAction("urn:getCountryDetails");
sc.setOptions(opt);

// inner class implementing AxisCallback; override all of its methods.
// onMessage gets called once the result is received from the backend.
AxisCallback callBack = new AxisCallback() {
    @Override
    public void onMessage(MessageContext msgContext) {
        // this method gets called when you receive the results from the backend
        System.out.println(msgContext.getEnvelope().getBody().getFirstElement());
    }
    ...
};

sc.sendReceiveNonBlocking(payload, callBack);
Reference for writing an Axis2 service: http://jayalalk.blogspot.com/2014/01/writing-axis2-services-and-deploying-in.html
