Netty-based application performance issues - Java

I have a producer-consumer application built on Netty. The basic requirement was to build a message-oriented middleware (MOM).
The MOM is based on the concept of queuing (queuing makes systems loosely coupled, which was the core requirement of the application).
The broker understands the MQTT protocol. We performed stress testing of the application on our local machine. These are the specs of the local machine.
We were getting great results. However, our production server is an AWS Ubuntu machine, so we stress tested the same application on the AWS Ubuntu server. The performance was 10x worse than on the local system. This is the configuration of the AWS server.
We have tried the following options to figure out where the issue is.
Initially, we checked for bugs in our business logic but did not find any.
Made the broker, client, and all other dependencies identical on the Mac and on AWS, i.e. we installed the same versions on AWS as on the Mac.
Increased the ulimit on AWS.
Played with sysctl settings.
We were using Netty 4.1 and suspected it might be a Netty issue, since there is no stable release of Netty 4.1 yet. So we rebuilt the entire application using Netty 3.9.8.Final (stable) and still faced the same issue.
Substantially increased the hardware configuration of the AWS machine.
Now we have literally run out of options. The Java version is the same on both machines.
So our last resort is to rebuild the entire application using NodeJS, but that would require a lot more effort than tweaking something in Netty itself. We are not looking for Java-based alternatives to Netty, as we think this might even be a bug in the JVM NIO native implementation on Mac and Ubuntu.
What further options can we try to solve this problem? Is this an inherent Netty issue, or is it something to do with internal implementations that differ between Mac and Ubuntu and lead to the performance differences we are seeing?
EDIT
The stress testing parameters are as follows.
We had 1000 clients sending 1000 messages per second (Global rate).
We ran the test for about 10 minutes to note the latency.
On the server side we have 10 consumer threads handling the messages.
We have a new instance of ChannelHandler per client.
For the boss and worker pools required by Netty, we used a cached thread pool (see the sketch after this list).
We have tried tuning the consumer threads but to no avail.
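For reference, below is a minimal sketch (not our exact code) of a Netty 4.1 server bootstrap with explicitly sized boss and worker event loop groups; MqttBrokerInitializer is a placeholder for our actual pipeline setup, and 1883 is the standard MQTT port.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class BrokerServer {
    public static void main(String[] args) throws InterruptedException {
        // One boss thread accepts connections; workers handle socket I/O.
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);
        EventLoopGroup workerGroup = new NioEventLoopGroup(); // defaults to 2 * cores
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(bossGroup, workerGroup)
             .channel(NioServerSocketChannel.class)
             .option(ChannelOption.SO_BACKLOG, 1024)
             .childOption(ChannelOption.TCP_NODELAY, true)
             .childHandler(new MqttBrokerInitializer()); // placeholder ChannelInitializer
            b.bind(1883).sync().channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
}

On Linux, one variable worth isolating is the transport itself: swapping NioEventLoopGroup/NioServerSocketChannel for EpollEventLoopGroup/EpollServerSocketChannel (netty-transport-native-epoll) removes the JDK NIO selector from the picture, since the underlying implementations differ between Mac (kqueue) and Ubuntu (epoll).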
Edit 2
These are the profiler results provided by jvmtop for one phase of load testing.

Related

Forcefully kill JMS connection threads

First time posting so hopefully I can give enough info.
We are currently using Apache ActiveMQ to set up a pretty standard producer/consumer queue app. Our application is hosted on various client servers, and at random times/loads we experience issues where the JMS connection permanently dies, so our producer can no longer reach the consumer and we have to restart the producer. We're fairly sure of the cause: we're running out of connections on the JMS cached connection factory, so we need to do a better job of cleaning up/recycling these connections. This is a relatively common issue, described here (our setup is pretty similar):
ActiveMQ Dead Connection issue
Our difficulty is that this problem only occurs when the application is deployed on the client servers. We don't have access to those servers, as they house confidential client info, so we can't do any debugging or reproduction where the issues occur, and so far we haven't been able to reproduce the issue in our local environment.
So, in short: is there any way we could forcefully kill/corrupt our JMS connection threads so that we can reproduce the issue and test various fixes and approaches? Unfortunately we don't have the option of adding fixes without testing/demoing the solutions, so replicating the issue in our local setup is our only option.
Thanks in advance
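One way to reproduce a dying connection on a local setup, independent of the client servers, is to point the producer's broker URL at a small TCP proxy and kill the proxy mid-test. Below is a rough sketch under assumed ports (producer connects to localhost:6616, real broker listens on localhost:61616); killing this process should surface the same permanently dead JMS connection and let you exercise cleanup/recycling fixes for the cached connection factory.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal TCP proxy for failure injection: the producer connects to
// localhost:6616, traffic is forwarded to the real broker on 61616,
// and killing this process simulates the JMS connection dying.
public class KillableBrokerProxy {

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(6616)) {
            while (true) {
                Socket client = server.accept();
                Socket broker = new Socket("localhost", 61616);
                pump(client.getInputStream(), broker.getOutputStream());
                pump(broker.getInputStream(), client.getOutputStream());
            }
        }
    }

    // Copies bytes in a daemon thread until either side closes the stream.
    private static void pump(InputStream in, OutputStream out) {
        Thread pumper = new Thread(() -> {
            byte[] buf = new byte[8192];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (Exception ignored) {
                // a broken pipe here is exactly the failure being provoked
            }
        });
        pumper.setDaemon(true);
        pumper.start();
    }
}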

Weird way servers are getting added in baseline topology

I have two development machines, both running Ignite in server mode on the same network. I started the server on the first machine and then started the second machine. When the second machine starts, it is automatically added to the first one's topology.
Note:
Before starting, I removed the work folder on both machines.
In the config, I never mentioned any IPs of other machines.
Can anyone tell me what's wrong here? My intention is that each machine should have a separate topology.
As described in the discovery documentation, Apache Ignite uses multicast to find all nodes on the local network and form a cluster. This is the default mode of operation.
Please note that we don't really recommend using this mode for either development or production deployments; use static IP-based discovery instead (see the same documentation).
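For example, restricting a node to static discovery with only its own loopback address keeps it from ever seeing nodes on other machines (a minimal sketch in Java; the same can be expressed in the XML config, and 47500..47509 is the default discovery port range):

import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class StandaloneNode {
    public static void main(String[] args) {
        // Static IP finder that only knows about this machine, so multicast
        // discovery is not used and remote nodes are never joined.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList("127.0.0.1:47500..47509"));

        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoverySpi);

        Ignite ignite = Ignition.start(cfg);
    }
}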

Tomcat 8 - POST and PUT requests slow when deployed on RHEL

I have developed a REST API using the Spring Framework. When I deploy it in Tomcat 8 on RHEL, the response times for POST and PUT requests are much higher than when it is deployed on my local machine (Windows 8.1). On the RHEL server a request takes 7-9 seconds, whereas on the local machine it is less than 200 milliseconds.
The RHEL server has four times the RAM and CPU of the local machine. Default Tomcat configurations are used on both Windows and RHEL. Network latency is ruled out because GET requests take more or less the same time as on the local machine, whereas the time to first byte is higher for POST and PUT requests.
I even tried profiling the remote JVM using VisualVM. There are no major hotspots in my custom code.
I was able to reproduce the same issue on other RHEL servers. Is there any Tomcat setting which could help fix this performance issue?
The profiling log you have posted shows, more or less, nothing unusual:
The blocking queue is blocking. That is normal, because blocking is its purpose; it means there is nothing to take from it.
The server is waiting for connections on the socket, which is also normal.
You do not specify the physical/hardware setup of the RHEL server. The operating system might not be the only factor, and you cannot yet eliminate network latency. If there is a SAN, the SAN may have latency of its own; if you are using an SSD drive locally and the RHEL machine is using a SAN with replication, you may experience network latency there.
I am more inclined to first check the I/O on the disk than to focus on the operating system. If the server is shared, there might be other processes occupying the disk.
You say that latency is ruled out because GET requests take the same time, but that is not enough to rule it out: it only covers the latency between the client and the application server, not the latency between your app server machine and your SAN, disk, or whatever storage is there.
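If running iostat or similar on the RHEL box is not convenient, even a crude Java check of write-plus-fsync latency, run on both machines against the volume the application writes to, can show whether storage is the bottleneck (a rough sketch; the temp-file location is an assumption):

import java.io.File;
import java.io.RandomAccessFile;

public class FsyncLatencyCheck {
    public static void main(String[] args) throws Exception {
        // Point this at the same volume the application writes to.
        File f = File.createTempFile("fsync-check", ".tmp");
        byte[] block = new byte[4096];
        int iterations = 100;
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                raf.write(block);
                raf.getFD().sync(); // force the write down to physical storage
            }
            long avgMicros = (System.nanoTime() - start) / iterations / 1000;
            System.out.println("avg write+fsync latency: " + avgMicros + " us");
        } finally {
            f.delete();
        }
    }
}

A large gap between the two machines would point at the SAN/disk rather than at Tomcat or the operating system.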

Running GAE Development Server on a Google Compute Engine Instance

I'm trying to run the local dev server (Java) for Google App Engine on a Google Compute Engine instance (we're using Compute Engine instances as test servers).
When trying to start the dev server using appcfg.sh, we notice that 90% of the time the server doesn't come up and hangs for 10 minutes before finally starting.
I know that the server hasn't started because this line is never printed to the console when it hangs:
Server default is running at http://localhost:8080/
Has anyone seen anything like this?
In a nutshell:
- The App Engine Java SDK uses Jetty as the servlet container for the development appserver
- Jetty relies on java.security.SecureRandom
- SecureRandom consumes entropy from /dev/random by default
- /dev/random will block when insufficient entropy is available for a read
A GCE instance that is lightly used (for example, solely as a test App Engine server) does not generate entropy quickly. Repeated startups of the Java App Engine server therefore consume entropy from /dev/random more rapidly than it is replenished, causing the blocking behavior you observed as hangs on startup.
You can confirm that the hang is due to the SecureRandom issue by increasing the logging levels of the dev appserver. You should see a message similar to "init SecureRandom" and then the blocking behavior.
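A quick way to test the hypothesis directly on the instance is to time SecureRandom seeding (a small sketch; generateSeed() reads from the blocking entropy source, unlike nextBytes(), so it will stall for roughly as long as the dev appserver does when entropy is low):

import java.security.SecureRandom;

public class EntropyCheck {
    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        long start = System.nanoTime();
        // With the default NativePRNG provider on Linux, generateSeed()
        // reads /dev/random and blocks until enough entropy is available.
        byte[] seed = random.generateSeed(32);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("generateSeed(" + seed.length + ") took " + millis + " ms");
    }
}

Checking /proc/sys/kernel/random/entropy_avail before and after the run also shows how little entropy the idle instance accumulates.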
Some possible ways to address this:
1) Adding the following to the dev_appserver.sh invocation will cause SecureRandom to consume the /dev/urandom entropy source rather than /dev/random:
--jvm_flag="-Djava.security.egd=file:/dev/./urandom"
2) Having a GCE instance that's more heavily utilized should cause entropy data to be collected more rapidly, which will in turn make /dev/random less susceptible to blocking on subsequent restarts of the development appserver.

Unreliable behaviour of Openfire server on EC2

We are using Openfire server 3.7.1 on an Amazon EC2 Linux instance for a chat application.
Currently we are in the initial development stage, where we are testing it with 4 or 5 concurrent users.
Every now and then we get issues with the Openfire server:
1) Java heap space exceptions.
2) java.net.BindException: Address already in use
3) Both of these lead to port 5222 not listening, while the Openfire admin console on port 9090 works fine.
Eventually, when I stop all Openfire processes and then restart, it goes back to normal.
I want to know whether this is a bug in Openfire 3.7.1 or whether EC2 has some issue with opening port 5222. I am also really apprehensive about the performance of the Openfire server when thousands of users are on it concurrently.
Solved by:
Disabling PEP.
Increasing the Openfire JVM parameters.
The Java heap space exception is common with Openfire; you can check your JVM arguments and increase the parameters. In my experience there were a couple of cases that caused it:
clients using Empathy.
some plugin that provided buddy lists / whitelists / blacklists etc. (it had something to do with the users' roster lists).
You need to make sure ports 5222 and 5223 are open (some clients may use the old SSL port) in the EC2 firewall settings.
If you plan to have thousands of users, I suggest you get a static IP address (you don't mention your current config). Also check out jabberd, which has proved to be more reliable than Openfire.
Thousands of concurrent users should not be a problem for Openfire at all; it has seen 250K in testing. It will always depend, though, on what the users are doing.
There is a known memory leak in Openfire that has been fixed but not yet released. It is related to PEP, which can be switched off to work around the issue if that is feasible for you.
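While tuning, a small external probe that checks whether port 5222 is still accepting connections (alongside the admin console on 9090) can flag the failure before users do. A minimal sketch, with host and timeout as assumptions:

import java.net.InetSocketAddress;
import java.net.Socket;

public class OpenfirePortProbe {
    public static void main(String[] args) {
        // The host is an assumption; point it at the EC2 instance.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("5222 (XMPP client):   " + (isListening(host, 5222) ? "UP" : "DOWN"));
        System.out.println("9090 (admin console): " + (isListening(host, 9090) ? "UP" : "DOWN"));
    }

    private static boolean isListening(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 2000); // 2-second timeout
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}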
