JRedisFuture stability - java

I'm using the synchronous implementation of JRedis, but I'm planning to switch to the asynchronous way to communicate with the redis server.
But before that I would like to ask the community whether the JRedisFuture implementation of alphazero's jredis is stable enough for production use or not?
Is there anybody out there who is using it or has experience with it?
Thanks!

Once JRedis gets support for transaction semantics (Redis 1.3.n, JRedis master branch), then certainly, it should be "stable" enough.
The Redis protocol for non-transactional commands (which are themselves atomic) allows a window of unrecoverable failure when a destructive command has been sent and the connection faults during the read phase. The client has NO WAY of knowing whether Redis in fact processed the last request and the response got dropped due to a network failure (for example). Even the basic request/reply client is susceptible to this (and I think this is not limited to Java, per se).
Since the Redis protocol does not require any metadata (at all) with DML- and DDL-type commands (e.g. no command sequence number), this window of failure is open.
With pipelining, there is no longer a sequential association between the command that is being written and the response that is being read. (The pipe is writing one command while, at the same time, reading the response to a command it sent N commands earlier. If anything goes kaput, there are a LOT of dishes in the air. :)
That said, every single future object in the pipe will be flagged as faulted and you will know precisely at which response the fault occurred.
Does that qualify as "unstable"? In my opinion, no. That is an issue with pipelining.
Again, Redis 1.3.n with transaction semantics completely addresses this issue.
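To make that concrete, here is roughly what locating the fault in a pipelined run looks like. The helper below is illustrative only; JRedisFuture's actual return types differ, but the pattern of walking the pending futures and stopping at the first ExecutionException is the same.

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical helper: scan the futures returned by a pipelined client
// and report the index of the response where the fault occurred.
final class PipelineFaultCheck {
    static int firstFault(List<Future<?>> pending) {
        for (int i = 0; i < pending.size(); i++) {
            try {
                pending.get(i).get();          // blocks until the i-th response arrives
            } catch (ExecutionException e) {
                // The connection faulted around command i; every later response
                // is suspect and may need to be verified or replayed.
                return i;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return i;
            }
        }
        return -1;                             // all responses arrived cleanly
    }
}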
Outside of that issue, with asynchronous use (pipelines), a great deal of responsibility falls on you to make sure you do not excessively overload the input to the connector. To a huge extent JRedis pipelines protect you from this (since the caller's thread is used to make the network write, thus naturally damping the input load on the pending response queue).
But you still need to run tests -- you did say "Production", right? )) -- and size your boxes and put a cap on the number of loading threads on the front end.
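Putting that cap in place can be as simple as a bounded pool for the loading threads; the sizes and names below are placeholders you would arrive at by testing, not recommendations.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FrontEndLoaders {
    // Placeholder: derive this number from your load tests and box sizing.
    private static final int MAX_LOADING_THREADS = 4;
    private static final ExecutorService LOADERS =
            Executors.newFixedThreadPool(MAX_LOADING_THREADS);

    static void submit(Runnable pipelinedCalls) {
        // Each task issues its pipelined requests on one of the capped threads,
        // so the input load on the connector stays bounded.
        LOADERS.submit(pipelinedCalls);
    }
}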
I would also potentially recommend not running more than one JRedis pipeline per multi-core machine. In the existing implementation (which does not chunk the write buffer) there is, in theory, room for efficiency gains (in the context of full bandwidth utilization and maximizing throughput) by running multiple pipelines to the same server: while one pipeline is busy creating buffers to write, the other is writing, etc. But these two pipelines will interfere with one another due to their inevitable synchronization (remember they are queues, and some form of synchronization must occur) and the periodic cache invalidation (on each dequeue/enqueue in the worst case -- but in Doug Lea we trust). So if pipeline A's average latency is d1 (in isolation), then so is pipe B's. Regrettably, running two of them on the same cores results in a system-wide cache invalidation period that is HALF of the original, so TWICE as many cache invalidations occur (on average). So it is self-defeating. But test your load conditions, and on your projected production deployment platform.

Related

Message latencies with CPU under-utilization

We've got a Java app where we basically use three dispatcher pools to handle processing tasks:
Convert incoming messages (from RabbitMQ queues) into another format
Serialize messages
Push serialized messages to another RabbitMQ server
The thing we don't know how to start fixing is that we see latencies in the first step. In other words, when we measure the time between the "tell" and the start of the conversion in an actor, there is (not always, but too often) a delay of up to 500ms. Especially strange is that the CPUs are heavily under-utilized (10-15%) and the mailboxes are pretty much empty all of the time; there is no huge amount of messages waiting to be processed. Our understanding was that Akka would typically utilize the CPUs much better than that?
The conversion is non-blocking and does not require I/O. There are approx. 200 actors running on that dispatcher, which is configured with throughput 2 and has 8 threads.
The system itself has 16 CPUs with around 400+ threads running, most of them passive, of course.
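For reference, the dispatcher is configured roughly along these lines (a sketch with placeholder names, not our exact config):

import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

Config config = ConfigFactory.parseString(
        "conversion-dispatcher {\n"
      + "  type = Dispatcher\n"
      + "  executor = \"fork-join-executor\"\n"
      + "  fork-join-executor {\n"
      + "    parallelism-min = 8\n"
      + "    parallelism-max = 8\n"
      + "  }\n"
      + "  throughput = 2\n"
      + "}\n").withFallback(ConfigFactory.load());
ActorSystem system = ActorSystem.create("converters", config);
// the ~200 converter actors are then created with
// Props.create(ConverterActor.class).withDispatcher("conversion-dispatcher")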
Interestingly enough, the other steps do not see such delays, but that can probably be explained by the fact that the first step already "spreads" the messages so that the other steps/pools can easily digest them.
Does anyone have an idea what could cause such latencies and CPU under-utilization, and how you would normally go about improving things here?

Scalability of Redis Cluster using Jedis 2.8.0 to benchmark throughput

I have an instance of JedisCluster shared between N threads that perform set operations.
When I run with 64 threads, the throughput of set operations is only slightly increased (compared to running using 8 threads).
How to configure the JedisCluster instance using the GenericObjectPoolConfig so that I can maximize throughput as I increase the thread count?
I have tried
GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxTotal(64); // maximum pooled connections (applied per cluster node pool)
jedisCluster = new JedisCluster(jedisClusterNodes, poolConfig);
believing this could increase the number of JedisCluster connections to the cluster and thus boost throughput.
However, I observed a minimal effect.
When talking about performance, we need to dig into details a bit before I can actually answer your question.
A naive approach suggests: the more threads (concurrency), the higher the throughput.
That statement is not wrong, but it is also not the whole truth. Concurrency and the resulting performance are not (always) linear, because there is so much involved behind the scenes. Turning something from sequential into concurrent processing might result in something that gets twice as much work done as the sequential execution. This example assumes that you run a multi-core machine that is not occupied by anything else and has enough bandwidth for the required work processing (I/O, network, memory). If you scale this example from two threads to eight, but your machine has only four physical cores, weird things might happen.
First of all, the processor needs to schedule two threads per core, so each of the threads probably behaves as if it were running sequentially, except that the process, the OS, and the processor have the increased overhead caused by twice as many threads as cores. Orchestrating these guys comes at a cost that needs to be paid, at least in memory allocation and CPU time. If the workload requires heavy I/O, then the work processing might be limited by your I/O bandwidth, and running things concurrently may increase throughput since the CPU is mostly waiting until the I/O comes back with the data to process. In that scenario, 4 threads might be blocked by I/O while the other 4 threads are doing some work. Similar considerations apply to memory and other resources utilized by your application. Actually, there's much more that digs into context switching, branch prediction, L1/L2/L3 caching, locking, and much more -- enough to write a 500-page book. Let's stay at a basic level.
Resource sharing and certain limitations lead to different scalability profiles. Some are linear up to a certain concurrency level, some hit a roof where adding more concurrency results in the same throughput, and some have a knee where adding concurrency makes things even slower, because of $reasons.
Now, we can analyze how Redis, Redis Cluster, and concurrency are related.
Redis is a network service and requires network I/O. Networking might be obvious, but we need to add this fact to our considerations: a Redis server shares its network connection with other things running on the same host and with whatever else uses the switches, routers, hubs, etc. The same applies to the client, even if you told everybody else not to run anything while you're testing.
The next thing is, Redis uses a single-threaded processing model for user tasks (I don't want to dig into background I/O, lazy memory freeing, and asynchronous replication). So you could assume that Redis uses one CPU core for its work, but, in fact, it can use more than that. If multiple clients send commands at a time, Redis processes the commands sequentially, in order of arrival (except for blocking operations, but let's leave those out of this post). If you run N Redis instances on one machine, where N is also the number of CPU cores, you can easily run into a sharing scenario again -- that is something you might want to avoid.
You have one or many clients that talk to your Redis server(s). Depending on the number of clients involved in your test, this has an effect. Running 64 threads on an 8-core machine might not be the best idea, since only 8 cores can execute work at a time (let's leave hyper-threading and all that out of here; I don't want to confuse you too much). Requesting more than 8 threads causes time-sharing effects. Running a few more threads than CPU cores for Redis and other networked services isn't too bad an idea, since there is always some overhead/lag coming from the I/O (network). You need to send packets from Java (through the JVM, the OS, the network adapter, routers) to Redis (routers, network, yadda yadda yadda), and Redis has to process the commands and send the response back. This usually takes some time.
The client itself (assuming concurrency on one JVM) locks certain resources for synchronization. In particular, requesting connections (reusing existing ones or creating new ones) is a scenario for locking. You already found the link to the pool config. While one thread locks a resource, no other thread can access it.
Knowing the basics, we can dig into how to measure throughput using jedis and Redis Cluster:
Congestion on the Redis Cluster can be an issue. If all client threads are talking to the same cluster node, the other cluster nodes sit idle and you have effectively measured how one node behaves, not the cluster. Solution: create an even workload (Level: Hard!)
Congestion on the client: Running 64 threads on an 8-core machine (that is just my assumption here, so please don't beat me up if I'm wrong) is not the best idea. Raising the number of client threads a bit above the number of cluster nodes (assuming an even workload for each cluster node) and a bit above the number of CPU cores can improve performance. Having 8x as many threads as CPU cores is overkill, because it adds scheduling overhead at all levels. In general, performance engineering is about finding the best ratio between work, overhead, bandwidth limitations, and concurrency. Finding the best number of threads is a field of its own in computer science.
Running a test from multiple systems, which together run the total number of threads, might be closer to a production environment than running the test from one system. Distributed performance testing is a master class (Level: Very hard!). The trick here is to monitor all resources that are used by your test, making sure nothing is overloaded, or to find the tipping point where you identify the limit of a particular resource. Monitoring the client and the server are just the easy parts.
Since I do not know your setup (number of Redis Cluster nodes, distribution of Cluster nodes amongst different servers, load on the Redis servers, the client, and the network during test caused by other things than your test), it is impossible to say what's the cause.
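If you want to find your own knee empirically, a rough thread-count sweep is a reasonable start. The sketch below assumes a cluster node reachable at 127.0.0.1:7000 and uses made-up key names and durations; measure on your projected production platform, not a laptop.

import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import java.util.Collections;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class ClusterSetSweep {
    public static void main(String[] args) throws Exception {
        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(64);                       // pooled connections per node
        JedisCluster cluster = new JedisCluster(
                Collections.singleton(new HostAndPort("127.0.0.1", 7000)), poolConfig);

        for (int threads : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            AtomicLong ops = new AtomicLong();
            long end = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    // hammer SET commands until the measurement window closes
                    while (System.nanoTime() < end) {
                        cluster.set("k" + ThreadLocalRandom.current().nextInt(1_000_000), "v");
                        ops.incrementAndGet();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.printf("%d threads -> %d ops in 10s%n", threads, ops.get());
        }
        cluster.close();
    }
}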

Exposing whether an application is undergoing GC via UDP

The motivation behind this question is to see whether we can make a theoretical load balancer more efficient for edge cases by first applying its regular strategy of nominating a particular node to route an HTTP request to (say, via a round-robin strategy) and then "peeking" into the internal state of the system to see whether it is undergoing garbage collection. If so, the load balancer avoids the node altogether and moves on to the next one.
In the ideal scenario, each node would "emit" its internal state every few seconds via UDP to some message queue letting the load balancer know which nodes are potentially "radio-active" if they're going through GC (I'm visualizing it as a simple boolean).
The question here is: can I tweak my application to tap into its JVM's internal state and (a) figure out whether we're in GC mode right this instant, and (b) emit this result via some protocol (UDP/HTTP) over the wire to some other entity (like an MQ or something)?
There are a whole bunch of ways to monitor and report on a VM remotely. A well-known protocol, for example, is SNMP. But this is a very complicated subject.
Implementation sort of depends on your requirements. If you need to be really sure a VM is in a good state, you might need to wrap your application in a wrapper VM that controls the actual VM. This is pretty involved.
Many implementations use the built-in monitoring and profiling interfaces that are exposed as beans to participating applications via JMX. Again, this requires a fair amount of tweaking.
I suppose you could create a worker thread that simply acts as a canary. It broadcasts a ping every X seconds, and if whoever listens for the pings misses two or three of them, it assumes the VM is not ready to serve.
The problem is deciding what to do when a VM never seems to come back. Is it the VM, the network, or something else? How do you keep track of the VMs? These are not intractable problems, but they combine in interesting ways to make your life equally interesting.
There are a lot of ways to approach this problem, and each has subtle implications.
Can you do it? Yes.
The GarbageCollectorMXBean can provide notifications of GC events to application code. (For instance, see this article which includes example code for configuring a notification listener and processing the events.)
Given this, you could easily code your application so that key GC events were sent out as UDP messages, and/or regular UDP messages were sent to report the current GC state.
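A minimal sketch of that approach is below. It uses the GarbageCollectorMXBean notification mechanism mentioned above; the UDP message format and the install() entry point are invented for illustration, and note that these notifications report collections after they have completed.

import com.sun.management.GarbageCollectionNotificationInfo;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class GcUdpReporter {
    public static void install(String host, int port) throws Exception {
        DatagramSocket socket = new DatagramSocket();
        InetAddress target = InetAddress.getByName(host);
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                // e.g. "G1 Young Generation,end of minor GC,12ms" (format is made up)
                String msg = info.getGcName() + "," + info.getGcAction() + ","
                        + info.getGcInfo().getDuration() + "ms";
                byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
                try {
                    socket.send(new DatagramPacket(payload, payload.length, target, port));
                } catch (Exception e) {
                    // best effort; don't let monitoring failures hurt the application
                }
            }, null, null);
        }
    }
}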
However, if the GC performs a "stop the world" collection, then your code to send out messages will also be stopped, and there is no way around that¹. If this is a problem, then you probably need to take the "canary" approach ... or switch to a low-pause collector. The "canary" or "heart-beat" approach also detects other kinds of unavailability, which will be relevant to a load balancer. The flip side, however, is that you can also get false positives; e.g. the "heart" is still "beating" but the "patient" is "comatose".
Whether this is actually going to be useful for load-balancing purposes is a different question entirely. There is certainly scope for additional failure modes. For instance, if the load balancer misses a UDP message saying that a JVM's GC has finished, then that JVM could effectively drop out of the load balancer's pool.
1 - At least, not within Java. You could conceivably build something on the outside of the JVM that (for example) reads the GC log file and checks OS-level process usage information.
You can write an external application that instruments the JVM, e.g. via dtrace probes, and sends the events to the load balancer or is queryable by the load balancer.

sun.rmi.transport.tcp.TCPTransport uses 100% CPU

I'm developing a communication library based on NIO's non-blocking SocketChannels so I can use select to keep my thread's CPU usage low (and get faster reaction time to other events).
SocketChannels are created externally to my thread and added to the list it handles, marked as non-blocking and registered with a Selector for READ operations (and WRITE when needed, but that does not happen in my problem).
I have a little Swing application for tests, running locally, that can be either a client or a server: the client connects to the server and they can send each other messages. Pretty simple and works fine, except for the CPU, which tops out at 100% (50% for each JVM) as soon as the connection is established between client and server.
Running jvisualvm shows me that sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() uses 98% of the application time, counting only 3 method calls!
A forced stack trace shows it's blocking on the read operation on a FilteredInputStream, on a Socket.
I'm a little puzzled as I don't use RMI (though I can understand NIO and RMI may share the "transport" code). I have seen a few similar questions, but each was specifically about RMI, which I'm not using. The answers I've seen say that this ConnectionHandler.run() method is responsible for marshalling/unmarshalling things, whereas here I get 100% CPU without any network traffic. I can only infer an active wait on the sockets, but that sounds odd, especially with non-blocking SocketChannels...
Any idea would be greatly appreciated!
I tracked the CPU use down to select(int timeout), which returns 0 immediately, regardless of the timeout value. My understanding of this function was that it would block until a selected operation is ready or the timeout is reached (as stated in the Javadoc).
However, I found this other Stack Overflow post showing the same problem: the OP_CONNECT interest has to be cancelled once the connection has been established.
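For anyone hitting the same spin, a minimal sketch of the fix (the class and method names are just for illustration):

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

final class ConnectFix {
    // Once the connection completes, finish it and drop OP_CONNECT from the
    // interest set; otherwise select() keeps returning immediately and the
    // loop spins at 100% CPU.
    static void handleConnectable(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        if (channel.finishConnect()) {
            // Replace OP_CONNECT with the operations we actually care about.
            key.interestOps(SelectionKey.OP_READ);
        }
    }
}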
Many thanks to @Alexander and @EJP for the clarification about the OP_WRITE/OP_CONNECT similarities.
Regarding the RMI part, it was probably due to Eclipse run configurations.

Java RMI random latency in a HPC application

I'm using Java and RMI in order to execute 100k Monte Carlo simulations on a cluster of hundreds of cores.
The approach I'm using is to have a client app that invokes RMI processes and divides the simulations across the number of available (RMI) processes on the grid.
Once the simulations have been run, I have to re-aggregate the results.
The only limit I have is that all this has to happen in less than 500ms.
The process is actually in place, BUT randomly, from time to time, one of the RMI calls takes 200ms longer to execute.
I've added loads of logs and timings all over the place, and as possible reasons I've already ruled out:
1) Simulations taking extra time
2) Data transfer (it works consistently; the slowdown only shows up sometimes, and only on a subset of RMI calls)
3) Transferring results back (I can clearly time how long it takes from the last RMI call's return to the end of the process)
The only thing I cannot measure is whether any of the RMI calls is taking extra time to be initialized (and honestly that is the only thing I can think of). The reason is that -- unfortunately -- the clocks are not synchronized :(
Is it possible that the RMI remote process got passivated/detached/collected even though I keep a (Remote) reference to it from the client?
Hope the question is clear enough (I'm pretty much sure it isn't).
Thanks a mil, and do not hesitate to ask more questions if it is not clear enough.
Regards,
Giovanni
Is it possible that the RMI remote process got passivated/detached/collected even though I keep a (Remote) reference to it from the client?
Unlikely, but possible. The RMI remote process should not be collected (as the RMI FAQ indicates for VM exit conditions). It could, however, be paged to disk if the OS desired.
Do you have a way to rule out GC calls (other than writing a monitor with JVM TI)?
Also, is your code structured in such a way that you send off all calls from your aggregator asynchronously, have the replies appended to a list, and aggregate the results when your critical time is up, even if some processors haven't returned results? I'm assuming that each processor is an independent, random event and that it's safe to ignore some results. If not, disregard.
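To illustrate that suggestion (the names and the budget handling are invented for this sketch, not taken from the poster's code): fire off all simulation calls asynchronously and aggregate whatever has come back when the time budget expires, ignoring stragglers.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

final class DeadlineAggregator {
    static List<Double> collectWithin(List<Callable<Double>> simulations,
                                      ExecutorService pool,
                                      long budgetMillis) throws InterruptedException {
        CompletionService<Double> cs = new ExecutorCompletionService<>(pool);
        for (Callable<Double> sim : simulations) {
            cs.submit(sim);                       // each task wraps one remote RMI call
        }
        List<Double> results = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        for (int i = 0; i < simulations.size(); i++) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break;            // time is up: aggregate what we have
            Future<Double> done = cs.poll(remaining, TimeUnit.NANOSECONDS);
            if (done == null) break;              // nothing else arrived in time
            try {
                results.add(done.get());
            } catch (ExecutionException e) {
                // a worker failed; with independent random draws it is safe to skip it
            }
        }
        return results;
    }
}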
I finally got to the bottom of the issue. Basically, after making sure that the stub wasn't getting deallocated and that the GC wasn't being triggered behind the scenes, I used Wireshark to understand whether there was any network issue.
What I found was that, randomly, one of the packets got lost, and TCP on our network needed 120ms (41 retransmissions) to correctly re-transfer the data.
When switching to JDK 7, SDP, and InfiniBand, we didn't experience the issue anymore.
So basically the answer to my question was... PACKET LOSS!
Thanks to everyone who replied to the post; it helped to focus on the right path!
Gio
