I have a service that transfers messages at quite a high rate.
Currently it is served by akka-tcp and it does 3.5M messages per minute. I decided to give gRPC a try.
Unfortunately it resulted in much lower throughput: ~500k messages per minute, and sometimes even less.
Could you please recommend how to optimize it?
My setup
Hardware: 32 cores, 24 GB heap.
gRPC version: 1.25.0
Message format and endpoint
The message is basically a binary blob.
The client streams 100K - 1M and more messages into the same request (asynchronously); the server doesn't respond with anything, and the client uses a no-op observer.
service MyService {
    rpc send (stream MyMessage) returns (stream DummyResponse);
}

message MyMessage {
    int64 someField = 1;
    bytes payload = 2; // not huge
}

message DummyResponse {
}
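For reference, the client side looks roughly like this (a simplified sketch, not my exact code; MyServiceGrpc, MyMessage and DummyResponse are the classes generated from the proto above, and host/port are placeholders):

import com.google.protobuf.ByteString;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.TimeUnit;

public class SendClient {
    public static void main(String[] args) throws InterruptedException {
        ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 8443) // placeholder host/port
                .usePlaintext()
                .build();
        MyServiceGrpc.MyServiceStub stub = MyServiceGrpc.newStub(channel);

        // No-op observer: the DummyResponse stream is simply ignored.
        StreamObserver<DummyResponse> noOp = new StreamObserver<DummyResponse>() {
            @Override public void onNext(DummyResponse value) { }
            @Override public void onError(Throwable t) { t.printStackTrace(); }
            @Override public void onCompleted() { }
        };

        StreamObserver<MyMessage> requestStream = stub.send(noOp);
        for (int i = 0; i < 1_000_000; i++) {
            MyMessage msg = MyMessage.newBuilder()
                    .setSomeField(i)
                    .setPayload(ByteString.copyFromUtf8("payload-" + i)) // not huge
                    .build();
            requestStream.onNext(msg); // asynchronous hand-off to the gRPC transport
        }
        requestStream.onCompleted();
        channel.shutdown().awaitTermination(1, TimeUnit.MINUTES);
    }
}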
Problems:
The message rate is low compared to the akka implementation.
I observe low CPU usage, so I suspect that the gRPC call is actually blocking internally despite claims to the contrary. Calling onNext() indeed doesn't return immediately, but there is also GC on the table.
I tried to spawn more senders to mitigate this issue but didn't get much of an improvement.
My findings
gRPC actually allocates an 8 KB byte buffer for each message when serializing it. See the stack trace:
java.lang.Thread.State: BLOCKED (on object monitor)
at com.google.common.io.ByteStreams.createBuffer(ByteStreams.java:58)
at com.google.common.io.ByteStreams.copy(ByteStreams.java:105)
at io.grpc.internal.MessageFramer.writeToOutputStream(MessageFramer.java:274)
at io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:230)
at io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:53)
at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
at io.grpc.internal.DelayedStream.writeMessage(DelayedStream.java:252)
at io.grpc.internal.ClientCallImpl.sendMessageInternal(ClientCallImpl.java:473)
at io.grpc.internal.ClientCallImpl.sendMessage(ClientCallImpl.java:457)
at io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
at io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
at io.grpc.stub.ClientCalls$CallToStreamObserverAdapter.onNext(ClientCalls.java:346)
Any help with best practices for building high-throughput gRPC clients is appreciated.
I solved the issue by creating several ManagedChannel instances per destination. Although articles say that a ManagedChannel can spawn enough connections by itself, so one instance is enough, that wasn't true in my case.
Performance is on par with the akka-tcp implementation.
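For the record, a rough sketch of what "several ManagedChannel instances" means here (channel count, host and port are placeholders; round-robin over the stubs is just one simple way to spread the senders):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.atomic.AtomicInteger;

public class ChannelPool {
    private final MyServiceGrpc.MyServiceStub[] stubs;
    private final AtomicInteger next = new AtomicInteger();

    public ChannelPool(String host, int port, int channels) {
        stubs = new MyServiceGrpc.MyServiceStub[channels];
        for (int i = 0; i < channels; i++) {
            // Each ManagedChannel gets its own TCP/HTTP2 connection(s).
            ManagedChannel channel = ManagedChannelBuilder.forAddress(host, port)
                    .usePlaintext()
                    .build();
            stubs[i] = MyServiceGrpc.newStub(channel);
        }
    }

    // Round-robin over the channels so the senders don't all share one connection.
    public MyServiceGrpc.MyServiceStub nextStub() {
        return stubs[Math.floorMod(next.getAndIncrement(), stubs.length)];
    }
}

Each sender thread then asks the pool for a stub instead of all sharing a single channel.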
Interesting question. Computer network packets are encoded using a stack of protocols, and each protocol is built on top of the specification of the one below it. Hence the performance (throughput) of a protocol is bounded by the performance of the protocol it is built on, since you are adding extra encoding/decoding steps on top of the underlying one.
For instance, gRPC is built on top of HTTP/2, which is a protocol at the application layer (L7), and as such its performance is bound by the performance of HTTP. HTTP itself is built on top of TCP, at the transport layer (L4), so we can deduce that gRPC throughput cannot be larger than that of equivalent code served at the raw TCP layer.
In other words: if your server is able to handle raw TCP packets, how would adding new layers of complexity (gRPC) improve performance?
I'm quite impressed with how well Akka TCP has performed here :D
Our experience was slightly different. We were working on much smaller instances using Akka Cluster. For Akka remoting, we switched from Akka TCP to UDP using Artery and achieved a much higher rate plus lower and more stable response times. There is even a config in Artery that helps balance between CPU consumption and response time from a cold start.
My suggestion is to use a UDP-based framework that also takes care of transmission reliability for you (e.g. that Artery UDP), and just serialize using Protobuf, instead of using full-fledged gRPC. The HTTP/2 transmission channel is not really meant for high-throughput, low-response-time purposes.
I'm trying to optimize an RTP server to be able to handle more RTP endpoints. For handling the RTP part I originally used librtp; however, I also came up with a librtp variant that uses Netty instead of the plain old DatagramSocket:
io.netty.channel.socket.DatagramPacket nettyDatagramPacket = new io.netty.channel.socket.DatagramPacket(data, socketAddress);
ChannelFuture future = channel.writeAndFlush(nettyDatagramPacket);
However, it seems my approach is actually slower:
Both graphs show the RTP packet intervals as measured by Wireshark, with a typical RTP interval diagram (over time) as drawn by Wireshark on the right.
Is it possible that Netty is not suited for timing-sensitive IO operations because the sending is offloaded to another thread?
I assume that because with Netty the sending is decoupled from the packet generation (which happens every 20 ms), it might happen that multiple packets (even for the same stream) are bunched together and sent pretty much without delay. Is there a Netty way to improve this? Right now I'm using the default thread pools offered by Netty.
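For completeness, this is roughly how the UDP channel is bootstrapped (a simplified sketch assuming Netty 4; the port and the handler are placeholders, my real handler does more). The one knob I could experiment with is the event loop group passed to the Bootstrap:

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.DatagramPacket;
import io.netty.channel.socket.nio.NioDatagramChannel;

public final class RtpChannelSetup {
    public static Channel createChannel() throws InterruptedException {
        // A dedicated single-threaded loop so RTP writes aren't queued behind other channels.
        EventLoopGroup group = new NioEventLoopGroup(1);
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioDatagramChannel.class)
                .handler(new SimpleChannelInboundHandler<DatagramPacket>() {
                    @Override
                    protected void channelRead0(ChannelHandlerContext ctx, DatagramPacket packet) {
                        // incoming traffic would be handled here (omitted)
                    }
                });
        return bootstrap.bind(0).sync().channel(); // ephemeral local port
    }
}

The outgoing packets then go through channel.writeAndFlush(...) exactly as in the snippet above.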
Given two applications, where application A uses a publisher client to continuously stream data to application B, which has a sub server socket to accept that data: how can we configure the pub client socket in application A so that when B is unavailable (e.g. it is being redeployed or restarted), A buffers all the pending messages, and when B becomes available again the buffered messages go through and the socket catches up with the real-time stream?
In a nutshell, how do we make the PUB client socket buffer messages, with some limit, while the SUB server is unavailable?
The default behaviour for the PUB client is to drop messages in the mute state, but it would be great if we could change that to a size-limited buffer. Is that possible with zmq, or do I need to do it at the application level?
I've tried setting HWM and LINGER on my sockets, but if I'm not wrong they only cover the slow-consumer case, where my publisher is connected to a subscriber but the subscriber is so slow that the publisher starts to buffer messages (HWM limits the number of those messages).
I'm using jeromq since I'm targeting the JVM platform.
First of all, welcome to the world of Zen-of-Zero, where latency matters most
PROLOGUE :
ZeroMQ was designed by Pieter HINTJENS' team of ultimately experienced masters - Martin SUSTRIK to be named first. The design was professionally crafted so as to avoid any unnecessary latency. So, asking about having a (limited) persistence? No, sir, not confirmed - the PUB/SUB Scalable Formal Communication Pattern Archetype will not have it built in, right because of the added problems and the decreased performance and scalability (add-on latency, add-on processing, add-on memory management).
If one needs a (limited) persistence (for absent remote SUB-side agents' connections), feel free to implement it on the app side, or one may design and implement a new ZMTP-compliant behaviour-pattern archetype, extending the ZeroMQ framework, if such work reaches a stable and publicly accepted state; but do not ask for the high-performance, latency-shaved standard PUB/SUB, having polished its almost linear scalability ad astra, to get modified in this direction. That is definitely not the way to go.
Solution ?
The app side may easily implement your added logic, using dual-pointer circular buffers, working as a sort of (app-side-managed) persistence proxy, sitting in front of the PUB sender.
Your design may also succeed in squeezing some additional sauce out of the ZeroMQ internals if it uses the recently made available built-in ZeroMQ socket-monitor component to set up an additional control layer and receive there a stream of events as seen from "inside" the PUB-side Context instance, where some additional network- and connection-management-related events may bring more light into your (app-side-managed) persistence proxy.
Yet, be warned that
The zmq_socket_monitor() method supports only connection-oriented
transports, that is, TCP, IPC, and TIPC,
so one may straight away forget about this in case any of the other interesting transport classes was planned to be used { inproc:// | norm:// | pgm:// | epgm:// | vmci:// }.
Heads up !
There are inaccurate, if not outright wrong, pieces of information in the answer from our Community's honorable member smac89, who tried his best to address your additional interest expressed in the comment:
"...zmq optimizes publishing on topics? like if you keep publishing on some 100char long topic rapidly, is it actually sending the topic every time or it maps to some int and sends the int subsequently...?"
telling you:
"It will always publish the topic. When I use the pub-sub pattern, I usually publish the topic first and then the actual message, so in the subscriber I just read the first frame and ignore it and then read the actual message"
ZeroMQ does not work this way. There is no such thing as a "separate" <topic> followed by a <message-body>; rather the opposite.
The TOPIC and the mechanics of topic filtering work in a very different way.
1) You never know who .connect()-s: i.e. one can be almost sure that version 2.x up to version 4.2+ will handle the topic filtering in a different manner (the ZMTP RFC defines an initial capability-version handshake, letting the Context instance decide which version of topic filtering will have to be used: ver 2.x used to move all messages to all peers, let all the SUB sides (of ver 2.x+) be delivered the message, and let the SUB-side Context instance do the local topic-list filtering, whereas ver 4.2+ is sure to perform the topic-list filtering on the PUB-side Context instance (CPU usage grows, network transport the opposite), so your SUB side will never be delivered a single byte of "useless", read "not-subscribed-to", messages).
2) (You may, but) there is no need to separate a "topic" into the first frame of a thus-implied multi-frame message. Perhaps just the opposite: it is rather an anti-pattern to do this in high-performance, low-latency distributed-system design.
The topic-filtering process is defined and works byte-wise, from left to right, pattern-matching each topic-list member value against the delivered message payload.
Adding extra data and extra frame-management processing only increases the end-to-end latency and the processing overhead. It is never a good idea to do this instead of proper distributed-system design work.
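To make the byte-wise prefix matching tangible, a tiny sketch (plain jeromq from Java; the endpoint and topic strings are made up): the subscription is just a byte prefix matched against the start of each message payload, with no separate topic frame involved:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class PrefixFilterDemo {
    public static void main(String[] args) throws InterruptedException {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket pub = ctx.createSocket(SocketType.PUB);
            pub.bind("tcp://127.0.0.1:5556");              // hypothetical endpoint

            ZMQ.Socket sub = ctx.createSocket(SocketType.SUB);
            sub.connect("tcp://127.0.0.1:5556");
            sub.subscribe("NASDAQ".getBytes(ZMQ.CHARSET)); // a byte prefix, not a "topic frame"

            Thread.sleep(200);                             // let the subscription propagate (slow joiner)

            pub.send("NASDAQ:AAPL 173.50");                // delivered: payload starts with the prefix
            pub.send("NYSE:IBM 142.10");                   // filtered out: prefix does not match

            System.out.println(sub.recvStr());             // prints only the NASDAQ message
        }
    }
}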
EPILOGUE :
There are no easy wins nor any low-hanging fruit in professional distributed-systems design, even less so if low latency or ultra-low latency is the design target.
On the other hand, be sure that the ZeroMQ framework was made with this in mind, and these efforts were crowned with a stable, ultimately performant, well-balanced set of tools for smart (by design), fast (in operation) and scalable (as hell may envy) signaling/messaging services that people love to use right because of this design wisdom.
I wish you a happy life with ZeroMQ as it is, and feel free to add any additional set of features "in front of" the ZeroMQ layer, inside your application suite of choice.
I'm posting a quick update since the other two answers (though very informative) were actually wrong, and I don't want others to be misinformed by my accepted answer. Not only can you do this with zmq, it is actually the default behaviour.
The trick is that if your publisher client has never connected to the subscriber server before, it keeps dropping messages (and that is why I was thinking it does not buffer messages); but if your publisher connects to the subscriber and you restart the subscriber, the publisher will buffer messages until the HWM is reached, which is exactly what I asked for. So, in short, the publisher wants to know there is someone on the other end accepting messages; only after that will it buffer messages.
Here is some sample code which demonstrates this (you might need to do some basic edits to compile it).
I used only this dependency: org.zeromq:jeromq:0.5.1.
zmq-publisher.kt
import org.zeromq.SocketType
import org.zeromq.ZContext

// Msg, telegramMessage(), log(), json(), notInterrupted() and sleepInterruptible()
// are small helpers from my own project; replace them with your own equivalents.
fun main() {
    val uri = "tcp://localhost:3006"
    val context = ZContext(1)
    val socket = context.createSocket(SocketType.PUB)

    socket.hwm = 10000
    socket.linger = 0

    "connecting to $uri".log()
    socket.connect(uri)

    fun publish(path: String, msg: Msg) {
        ">> $path | ${msg.json()}".log()
        socket.sendMore(path)
        socket.send(msg.toByteArray())
    }

    var count = 0
    while (notInterrupted()) {
        val msg = telegramMessage("message : ${++count}")
        publish("/some/feed", msg)
        println()
        sleepInterruptible(1.second)
    }
}
and of course zmq-subscriber.kt
import org.zeromq.SocketType
import org.zeromq.ZContext

// Same helpers (Msg, log(), json()) as in the publisher above.
fun main() {
    val uri = "tcp://localhost:3006"
    val context = ZContext(1)
    val socket = context.createSocket(SocketType.SUB)

    socket.hwm = 10000
    socket.receiveTimeOut = 250

    "connecting to $uri".log()
    socket.bind(uri)
    socket.subscribe("/some/feed")

    while (true) {
        val path = socket.recvStr() ?: continue
        val bytes = socket.recv()
        val msg = Msg.parseFrom(bytes)
        "<< $path | ${msg.json()}".log()
    }
}
Try running the publisher first without the subscriber; when you then launch the subscriber, you've missed all the messages so far. Now, without restarting the publisher, stop the subscriber, wait for some time and start it again.
Here is an example of one of my services actually benefiting from this.
This is the structure: [current service] sub:server <= pub:client [service being restarted] sub:server <=* pub:client [multiple publishers]
Because I restart the service in the middle, all the publishers start buffering their messages. The final service, which was observing ~200 messages per second, sees the rate drop to 0 (those 1 or 2 remaining are heartbeats), then a sudden burst of 1000+ messages comes in because all the publishers flushed their buffers (the restart took about 5 seconds). I am actually not losing a single message here.
Note that you must have a subscriber:server <= publisher:client pair; this way the publisher knows "there is only one place I need to deliver these messages to". You can try binding on the publisher and connecting on the subscriber, but you will not see the publisher buffering messages anymore, simply because it is questionable whether a subscriber that just disconnected did so because it no longer needs the data or because it failed.
As we've discussed in the comments, there is no way for the publisher to buffer messages while it has nothing connected to it; it will simply drop any new messages:
From the docs:
If a publisher has no connected subscribers, then it will simply drop all messages.
This means your buffer needs to live outside of zeromq's care. Your buffer could then be a list, a database, or any other method of storage you choose, but you cannot use your publisher for doing that.
Now the next problem is how to detect that a subscriber has connected or disconnected. This is needed to tell us when we need to start reading from the buffer / filling the buffer.
I suggest using Socket.monitor and listening for ZMQ_EVENT_CONNECTED and ZMQ_EVENT_DISCONNECTED, as these will tell you when a client has connected or disconnected and thus enable you to switch to filling your buffer of choice. Of course, there might be other ways of doing this that do not directly involve zeromq, but that's up to you to decide.
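A rough sketch of what that monitoring could look like with jeromq (the inproc endpoint name is arbitrary, and the exact Event API may differ slightly between versions and bindings):

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class PubConnectionMonitor {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket pub = ctx.createSocket(SocketType.PUB);
            // Publish connection events for this socket on an inproc endpoint.
            pub.monitor("inproc://pub-events", ZMQ.EVENT_CONNECTED | ZMQ.EVENT_DISCONNECTED);
            pub.connect("tcp://localhost:3006");

            // A PAIR socket reads the monitor events.
            ZMQ.Socket events = ctx.createSocket(SocketType.PAIR);
            events.connect("inproc://pub-events");

            while (!Thread.currentThread().isInterrupted()) {
                ZMQ.Event event = ZMQ.Event.recv(events);     // blocks until the next event
                if (event.getEvent() == ZMQ.EVENT_CONNECTED) {
                    // subscriber reachable again: start draining the external buffer
                } else if (event.getEvent() == ZMQ.EVENT_DISCONNECTED) {
                    // subscriber gone: start filling the external buffer instead of publishing
                }
            }
        }
    }
}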
I'm evaluating AsyncHttpClient for big loads (~1M HTTP requests).
For each request I would like to invoke a callback using an AsyncCompletionHandler which will just insert the result into a blocking queue.
My question is: if I'm sending asynchronous requests in a tight loop, how many threads will the AsyncHttpClient use? (I know you can set the max, but apparently you then take the risk of losing requests; I've seen it here.)
I'm currently using the Netty implementation with these versions:
async-http-client v1.9.33
netty v3.10.5.Final
I don't mind using other versions if there are any optimizations in later versions.
EDIT:
I read that Netty uses the reactor pattern for reacting to HTTP responses, which means it allocates a very small number of threads to act as selectors. It also means that the number of allocated threads doesn't increase with high request volume. However, that contradicts the need to set a max number of connections.
Can anyone explain what I'm missing?
Thanks in advance
The AsyncHttpClient client (and other non-blocking IO clients, for that matter) doesn't need to allocate a thread per request, and the client need not resize its thread pool even if you bombard it with requests. You do initiate many connections if you don't use HTTP keep-alive or if you call multiple hosts, but it can all be handled by a single-threaded client (there may be more than one IO thread, depending on the implementation).
However, it's always a good idea to limit the max requests per host and the max requests per domain, to avoid overloading a service on a specific host or site, and to avoid getting blocked. This is why HTTP clients add maxConnectionsPerXxx settings.
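With the 1.9.x line mentioned in the question, that looks roughly like this (a sketch only; the limits and URL are arbitrary, and the setter names differ slightly between AHC 1.8, 1.9 and 2.x):

import com.ning.http.client.AsyncCompletionHandler;
import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.AsyncHttpClientConfig;
import com.ning.http.client.Response;

public class AhcExample {
    public static void main(String[] args) {
        AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
                .setMaxConnections(2000)        // global cap on open connections
                .setMaxConnectionsPerHost(200)  // cap per host, avoids overloading a single service
                .setRequestTimeout(30_000)      // ms
                .build();

        AsyncHttpClient client = new AsyncHttpClient(config);
        client.prepareGet("http://example.com/") // placeholder URL
                .execute(new AsyncCompletionHandler<Response>() {
                    @Override
                    public Response onCompleted(Response response) {
                        // e.g. hand the result off to a blocking queue here
                        return response;
                    }

                    @Override
                    public void onThrowable(Throwable t) {
                        t.printStackTrace();
                    }
                });
        // fire more requests from the tight loop; call client.close() once everything has completed
    }
}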
AHC has two types of threads:
For I/O operations. On your screenshot, these are the AsyncHttpClient-x-x threads. AHC creates 2 * number_of_cores of those.
For timeouts. On your screenshot, it's the AsyncHttpClient-timer-1-1 thread. There should be only one.
And as you mentioned:
maxConnections just means number of open connections which does not
directly affect the number of threads
Source: issue on GitHub: https://github.com/AsyncHttpClient/async-http-client/issues/1658
From what I read about Java NIO and non-blocking [Server]SocketChannels, it should be possible to write a TCP server that sustains several connections using only one thread - I'd make a Selector that waits for all relevant channels in the server's loop.
Is that right, or am I missing some important detail? What problems can I encounter?
(Background: The TCP communication would be for a small multiplayer game, so max. 10-20 simultaneous connections. Messages will be sent about every few seconds.)
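To make it concrete, this is roughly the single-threaded selector loop I have in mind (a sketch; the port is arbitrary and the actual message handling is elided):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SingleThreadedServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(4000)); // arbitrary port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select(); // blocks until at least one channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int read = client.read(buffer);
                    if (read == -1) { // peer closed the connection
                        key.cancel();
                        client.close();
                    } else {
                        buffer.flip();
                        // decode the game message and react to it here
                    }
                }
            }
        }
    }
}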
Yes, you are right. The problem you can encounter is when the processing takes too long; in that case you'd have to move the processing into another thread, so that it does not interfere with the networking thread and cause noticeable delay.
Another detail: channels are all about "moving" data. If the data you wish to send is ready, you can move it to a networking channel; the copying/buffering/etc. is then done by the NIO implementation.
Your single-threaded "networking thread" only steers the connections, it does not throttle them (read: a weird analogy with a car).
The basic multithreaded approach is easier to design and implement than single-threaded NIO. The performance gain isn't noticeable in a small multiplayer game server/client, especially if a message is only sent every few seconds.
Brian Agnew said:
This all works well when the server-side processing
for each client is negligible. However a multi-threaded
approach will scale much better.
I beg to disagree. A one-client-one-thread approach will exhaust memory much faster than handling multiple clients per thread, since with the latter you don't need a full stack per client. See the C10K paper for more on the topic: http://www.kegel.com/c10k.html
Anyway, if there won't be more than 20 clients, just use whatever is easiest to code and debug.
Yes, you can. See this example for an illustration of how to do it.
The important section is this:
for (;;) { // Loop forever, processing client connections
    // Wait for a client to connect
    SocketChannel client = server.accept();
    // Build response string, wrap, and encode to bytes (elided)
    client.write(response);
    client.close();
}
This all works well when the server-side processing for each client is negligible. However a multi-threaded approach will scale much better.
I would like to design a simple application (without J2EE and JMS) that can process a massive amount of messages (like in trading systems).
I have created a service that can receive messages and place them in a queue so that the system won't get stuck when overloaded.
Then I created a service (QueueService) that wraps the queue and has a pop method which pops a message off the queue and returns null if there are no messages; this method is marked as "synchronized" for the next step.
I have created a class that knows how to process a message (MessageHandler) and another class that can "listen" for messages in a new thread (MessageListener). The thread has a "while(true)" loop and tries to pop a message all the time.
If a message was returned, the thread calls the MessageHandler class and, when it's done, asks for another message.
Now I have configured the application to open 10 MessageListeners to allow multi-message processing,
so I have 10 threads that are looping all the time.
Is that a good design?
Can anyone point me to some books or sites on how to handle such a scenario?
Thanks,
Ronny
It seems from your description that you are on the right path, with one little exception: you have implemented a busy-wait on the retrieval of messages from the queue.
A better way is to block your threads in the synchronized popMessage() method, doing a wait() on the queue resource when no more messages can be popped. When adding a message (or messages) to the queue, the waiting threads are woken up via notifyAll(); one or more threads will get a message and the rest re-enter the wait() state.
This way the distribution of CPU resources will be smoother.
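A minimal sketch of such a blocking wrapper (class and method names are illustrative only; java.util.concurrent's BlockingQueue implementations give you the same behaviour out of the box):

import java.util.ArrayDeque;
import java.util.Deque;

public class BlockingMessageQueue<T> {
    private final Deque<T> messages = new ArrayDeque<>();

    public synchronized void push(T message) {
        messages.addLast(message);
        notifyAll(); // wake up listeners blocked in popMessage()
    }

    public synchronized T popMessage() throws InterruptedException {
        while (messages.isEmpty()) {
            wait(); // releases the lock until push() notifies
        }
        return messages.removeFirst();
    }
}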
I understand that queuing providers like WebSphere and Sonic cost money, but there's always JBoss Messaging, FUSE with Apache ActiveMQ, and others. Don't try to make a better JMS than JMS. Most JMS providers have persistence capabilities that provide fault tolerance if the queue or app server dies. Don't reinvent the wheel.
Reading between the lines a little, it sounds like you're not using a JMS provider such as MQ. Your solution sounds OK for the most part; however, I would question your reasons for not using JMS.
You mention something about trading; I can confirm a lot of trading systems use JMS, with and without J2EE. If you really want high performance, reliability and peace of mind, don't reinvent the wheel by writing your own queuing system; take a look at some of the JMS providers and their client APIs.
karl
Event loop
How about using an event loop/message pump instead? I actually learned this technique from watching the excellent node.js video presentation by Ryan, which I think you should really watch if you haven't already.
You push at most 10 messages from thread A to thread B (blocking if full). Thread A has an unbounded LinkedBlockingQueue; thread B has a bounded ArrayBlockingQueue of size 10 (new ArrayBlockingQueue(10)). Both thread A and thread B run an endless "while loop"; thread B processes the messages available from the ArrayBlockingQueue. This way you will only have 2 endless "while loops". As a side note, it might even be better to use 2 ArrayBlockingQueues, because of the following sentence when reading the specification:
Linked queues typically have higher
throughput than array-based queues but
less predictable performance in most
concurrent applications.
Of course the array-backed queue has the disadvantage that it will use more memory, because you have to set its size beforehand (too small is bad, as it will block when full; too big could also be a problem if you are low on memory).
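A minimal sketch of the hand-off described above (the message type and queue sizes are arbitrary): thread A drains its unbounded LinkedBlockingQueue into thread B's bounded ArrayBlockingQueue and blocks whenever the small queue is full:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TwoQueuePump {
    public static void main(String[] args) {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();      // thread A: unbounded intake
        BlockingQueue<String> workQueue = new ArrayBlockingQueue<>(10); // thread B: bounded, size 10

        Thread threadA = new Thread(() -> {
            try {
                while (true) {
                    String msg = inbox.take();  // wait for the next incoming message
                    workQueue.put(msg);         // blocks when thread B is 10 messages behind
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread threadB = new Thread(() -> {
            try {
                while (true) {
                    String msg = workQueue.take();
                    System.out.println("processed " + msg); // real processing goes here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        threadA.start();
        threadB.start();

        // producers simply drop messages into the unbounded queue
        for (int i = 0; i < 100; i++) {
            inbox.add("message " + i);
        }
    }
}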
Accepted solution:
In my opinion you should prefer my solution over the accepted solution. The reason is that, if at all possible, you should only use the java.util.concurrent package. Writing properly threaded code is hard; when you make a mistake you will end up with deadlocks, starvation, etc.
Redis:
Like others already mentioned, you should use a JMS for this. My suggestion is something along those lines, but in my opinion simpler to use and install. First of all I assume your server is running Linux. I would advise you to install Redis. Redis is really awesome/fast and you should also use it as your datastore. It has blocking list operations which you can use. Redis will store your results to disk, but in a very efficient manner.
Good luck!
While it is now showing its age, Practical .NET for Financial Markets demonstrates some of the universal concepts you should consider when developing a financial trading system. Although it is geared toward .NET, you should be able to translate the general concepts to Java.
The separation between listening for a message and processing it seems sensible to me. Having a scalable number of processing threads is also good; you can tune the number as you find out how much parallel processing works on your platform.
The bit I'm less happy about is the way the threads poll for message arrival: here you're doing busy work, and if you add sleeps to reduce that then you don't react immediately to message arrival. The JMS APIs and MDBs take a more event-driven approach. I would take a look at how that's implemented in an open-source JMS so that you can see the alternatives. [I also endorse the opinion that reinventing JMS yourself is probably a bad idea.] The thing to bear in mind is that as your system gets more complex and you add more queues and more processing, the busy work has a greater impact.
The other concern that I have is that you will hit the limitations of using a single machine. First, you could allow greater scalability by allowing listeners to run on many machines. Second, you have a single point of failure. Clearly, solving this sort of stuff is where the messaging vendors make their money. This is another reason why Buy rather than Build tends to be a win for complex middleware.
You need a very light, super fast, scalable queuing system. Try the Hazelcast distributed queue!
It is a distributed implementation of java.util.concurrent.BlockingQueue. Check out the documentation for details.
Hazelcast is actually a little more than a distributed queue; it is a transactional, distributed implementation of queue, topic, map, multimap, lock, and executor service for Java.
It is released under the Apache license.
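A minimal sketch of what using it looks like (the queue name is arbitrary, and the exact package names depend on the Hazelcast version):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.concurrent.BlockingQueue;

public class HazelcastQueueExample {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // IQueue extends java.util.concurrent.BlockingQueue and is shared across the cluster.
        BlockingQueue<String> queue = hz.getQueue("messages");

        queue.put("some message");     // producer side, possibly on another node
        String message = queue.take(); // consumer side blocks until a message arrives
        System.out.println(message);
    }
}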