I have a production system that uses ActiveMQ (5.3.2) to send messages from server A to server B. A few weeks ago, the system inexplicably started taking 10+ seconds to send a message. After a reboot of the producer, the system worked fine.
After investigation, I'm pretty sure this is due to producer flow control. (I have a fairly standard ActiveMQ setup.) The day before this happened, my consumer software had been acting erratically (for other reasons) and had even stopped accepting connections for a while, so I'm guessing that is what triggered it. (It does puzzle me that the requests were still being throttled a day later.)
Question -- how can I confirm that the requests were being throttled? I took a heap dump of the server -- is there data in memory I can look for?
Edit: I've found the following:
WireFormatNegotiator.tcpNoDelayEnabled=false for one of the three WireFormatNegotiator instances in memory. I'm trying to figure out what sets this.
And second (and more important), is there a way I can use JMX to tell if the messages are being throttled? I'd like to set up a Nagios alert to let me know if this happens in the future. What property should I check for with JMX?
You can configure the broker so that the producer receives javax.jms.ResourceAllocationException exceptions, which can then be detected/logged, etc. Just set one of the following in the broker's systemUsage configuration...
<systemUsage>
<systemUsage sendFailIfNoSpaceAfterTimeout="3000">
...OR...
<systemUsage sendFailIfNoSpace="true">
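A minimal sketch of what detecting that on the producer side might look like, assuming a plain JMS producer against ActiveMQ (the broker URL and queue name below are placeholders, not from the original post):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.ResourceAllocationException;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class ThrottleAwareProducer {
    public static void main(String[] args) throws Exception {
        // Placeholder broker URL and queue name, for illustration only.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://serverB:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(session.createQueue("TEST.QUEUE"));
        try {
            producer.send(session.createTextMessage("payload"));
        } catch (ResourceAllocationException e) {
            // With sendFailIfNoSpace / sendFailIfNoSpaceAfterTimeout set, the broker
            // raises this when its usage limits are hit, i.e. when flow control kicks in.
            System.err.println("Broker is throttling producers: " + e.getMessage());
        } finally {
            connection.close();
        }
    }
}

A Nagios check could then watch the producer's log for that message instead of relying on send latency.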
I am using a managed RabbitMQ cluster through AWS Amazon MQ. If the consumers finish their work quickly, everything works fine. However, in a few scenarios some consumers take more than 30 minutes to complete their processing.
In those scenarios, RabbitMQ deletes the consumer and makes the same messages visible again in the queue. Because of this, another consumer picks them up and starts processing, and this keeps happening in a loop. The same transaction therefore gets executed repeatedly, and I lose the consumer as well.
I am not setting any AcknowledgeMode, so I believe it's AUTO by default, and that has a 30-minute limit.
Is there any way to increase the Delivery Acknowledgement Timeout for AUTO mode?
Or please let me know if anyone has any other solutions for this.
Reply From AWS Support:
Consumer timeout is now configurable, but the change can only be made by the service team. The change is permanent, irrespective of version.
So you can update RabbitMQ to the latest version; there is no need to stick with 3.8.11. Provide your broker details and desired timeout, and they should be able to set it for you.
This is the full response from AWS support:
From my understanding, I see that your workload is currently affected by the consumer_timeout parameter that was introduced in v3.8.15.
We have had a number of reach-outs due to this. Unfortunately, the service team has confirmed that while they can manually edit rabbitmq.conf, the change will be overwritten on the next reboot or failover and thus is not a recommended solution. It would also mean that all security patching on brokers where a manual change is applied would have to be paused. Currently, the service does not support custom user configurations for RabbitMQ through this configuration file. They have confirmed they are looking to address this in the future, but they are not able to give an ETA on when it will be available.
From the RabbitMQ GitHub, it seems this was added for quorum queues in v3.8.15 (https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.15 ), but it appears to apply to all consumers (https://github.com/rabbitmq/rabbitmq-server/pull/2990 ).
Unfortunately, RabbitMQ itself does not support downgrades (https://www.rabbitmq.com/upgrade.html )
Thus the recommended workaround, and the safest option from the service team as of now, is to create a new broker on an older version (3.8.11) and set auto minor version upgrade to false so that it won't be upgraded.
Then export the configuration from the existing RabbitMQ instance, import it into the new instance, and use that instance going forward.
UPDATE:
My goal is to learn what factors could overwhelm my little Tomcat server, and, when some exception happens, what I can do to resolve or remediate it without moving my server to a better machine. This is not a real app in a production environment, just my own experiment. (Besides changes on the server side, I may also do something on the client side.)
Both my client and my server are very simple: the server only checks the URL format and sends a 201 code if it is correct. Each request sent from my client only includes a simple JSON body. There is no database involved. The two machines (t2.micro) only run the client and the server respectively.
My client uses OkHttpClient. To avoid timeout exceptions, I already set the timeouts to 1,000,000 ms via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout="-1" (infinite).
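For reference, a minimal sketch of that client setup, assuming the OkHttp 2.x setter-style API implied by the method names above (the URL and JSON body are placeholders):

import com.squareup.okhttp.MediaType;
import com.squareup.okhttp.OkHttpClient;
import com.squareup.okhttp.Request;
import com.squareup.okhttp.RequestBody;
import com.squareup.okhttp.Response;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class StressClient {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient();
        // Very generous timeouts, as described above.
        client.setConnectTimeout(1_000_000, TimeUnit.MILLISECONDS);
        client.setReadTimeout(1_000_000, TimeUnit.MILLISECONDS);
        client.setWriteTimeout(1_000_000, TimeUnit.MILLISECONDS);

        RequestBody body = RequestBody.create(
                MediaType.parse("application/json"), "{\"key\":\"value\"}");
        Request request = new Request.Builder()
                .url("http://my_host_ip:8080/endpoint")   // placeholder endpoint
                .post(body)
                .build();
        Response response = client.newCall(request).execute();
        System.out.println(response.code());              // expecting 201 from the server
    }
}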
ORIGINAL POST:
I'm trying to stress my server by having a client launch 3000+ threads that send HTTP requests to it. My client and server reside on different EC2 instances.
Initially, I encountered some timeout issues, but after I set the connect, read, and write timeouts to larger values, that exception was resolved. However, with the same setup, I'm now getting a java.net.ConnectException: Failed to connect to my_host_ip:8080 exception, and I do not know its root cause. I'm new to multithreading and distributed systems; can anyone please give me some insight into this exception?
Below are some screenshots from my EC2 instances:
1. Client:
2. Server:
Having gone through a similar exercise in the past, I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests that tweak a few parameters each time and measuring the changes in CPU, logs, etc.
Please provide the value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly (because each thread can stay busy for longer). Question the need for increasing the timeout: is there any processing that slows your server down?
Check the application logs, JVM usage, and memory usage on both the client and the server. There will be some hints there.
Your client seems to be hitting 99%+ and then coming down. This suggests there could be a problem on the client side, in that it maxes out during the test. You might want to resize your client so it can do more.
Look at open file handles. The limit should be sufficiently high.
Tomcat has a limit on the thread count it uses to handle load. You can check this in server.xml and, if required, raise it to handle more. Although the CPU doesn't actually max out on the server side, so this is unlikely to be the problem.
If you use a database, check its performance. Also check the JDBC connection settings; there is thread and timeout configuration at the JDBC level as well.
Is response compression set up on Tomcat? It will give much better throughput on the server, especially if the data sent back by each request is more than a few KB.
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the approach to stressing the server should be to start low and increase the load in increments while monitoring various things (CPU, memory, JVM usage, file handle count, network I/O).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase the load and monitor, you will likely discover patterns that suggest tuning specific parameters. Each tuning attempt should then be tested against the same level of multithreading; any improvement will be obvious from the monitoring.
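A rough sketch of that ramp-up on the client side, assuming plain java.net.HttpURLConnection and a fixed-size thread pool per run (the endpoint URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LoadRamp {
    public static void main(String[] args) throws Exception {
        int[] increments = {100, 200, 500, 1000, 1500, 2000, 2500, 3000};
        for (int threads : increments) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            CountDownLatch done = new CountDownLatch(threads);
            long start = System.currentTimeMillis();
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    try {
                        // Placeholder endpoint; record the response code per request.
                        HttpURLConnection conn = (HttpURLConnection)
                                new URL("http://my_host_ip:8080/endpoint").openConnection();
                        conn.setRequestMethod("GET");
                        conn.getResponseCode();
                        conn.disconnect();
                    } catch (Exception e) {
                        System.err.println("Request failed: " + e);
                    } finally {
                        done.countDown();
                    }
                });
            }
            done.await();
            pool.shutdown();
            System.out.println(threads + " threads took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}

Pause between runs and capture CPU, memory, and file-handle numbers before moving to the next increment.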
I've been browsing the forums for the last few days and have tried almost everything I could find, but without any luck.
The situation is: inside our Java web application we have ActiveMQ 5.7 (I know it's very old; eventually we will upgrade to a newer version, but for various reasons that's not possible right now). We have only one broker and multiple consumers.
When I start the servers (I have tried with 2, 3, 4 and more servers), everything is OK. The servers are communicating with each other, and QUEUE messages are consumed instantly. But when I leave the servers idle (for example, to finally catch some sleep ;) ), that is no longer the case. Messages are stuck in the database and are not being consumed. The only way to have them delivered is to restart the server.
Part of my configuration (we keep it in a properties file; this is the current state, although I have tried many different combinations):
BrokerServiceURI=broker:(tcp://0.0.0.0:{0})/{1}?persistent=true&useJmx=false&populateJMSXUserID=false&useShutdownHook=false&deleteAllMessagesOnStartup=false&enableStatistics=true
ConnectionFactoryURI=failover://({0})?initialReconnectDelay=100&timeout=6000
ConnectionFactoryServerURI=tcp://{0}:{1}?keepAlive=true&soTimeout=100&wireFormat.cacheEnabled=false&wireFormat.tightEncodingEnabled=false&wireFormat.maxInactivityDuration=0
BrokerService.startAsync=true
BrokerService.networkConnectorStartAsync=true
BrokerService.keepDurableSubsActive=false
Do you have a clue?
I cannot tell you the reason from the description above, but I can list a few checks that are fresh in my mind. Please confirm whether the following apply to you.
Can you check the consumer connections?
Are the consumer sessions still active?
If all the consumer connections are up, then check a thread dump to see whether the active consumer threads (I'm assuming you created consumer threads; correct me if I'm wrong) are in the RUNNING or WAITING state because of some other thread in the server. (This happened to me: all the consumers were active, but another thread was holding a lock on the Logger while posting a message to Slack, and the consumers sat in the WAITING state.)
Check the dispatch queue size for each consumer, check each consumer's prefetch, and compare the dispatch queue size with the prefetch (see the sketch after this list).
Is there a JMSXGroupID you are allotting to each message?
Can you tell a little more about your consumer/producer/broker configurations?
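If JMX is an option (note the broker URI in the question has useJmx=false, so it would have to be enabled first), the dispatch-queue/prefetch comparison from the list above can be scripted roughly like this; the object-name pattern and attribute names are assumptions for ActiveMQ 5.x and may need adjusting for your exact version:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SubscriptionCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX URL; the port depends on how JMX is enabled on the broker JVM.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Object-name pattern is an assumption for ActiveMQ 5.x; adjust as needed.
            Set<ObjectName> subs = conn.queryNames(
                    new ObjectName("org.apache.activemq:Type=Subscription,*"), null);
            for (ObjectName sub : subs) {
                Object dispatched = conn.getAttribute(sub, "DispatchedQueueSize");
                Object prefetch = conn.getAttribute(sub, "PrefetchSize");
                System.out.println(sub + " dispatched=" + dispatched + " prefetch=" + prefetch);
            }
        } finally {
            jmxc.close();
        }
    }
}

A consumer whose dispatched count sits at its prefetch limit without draining is a strong hint that it has stopped acknowledging messages.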
I am trying to run down a problem with consumer connections to RabbitMQ being dropped in our production environment. The problem seems to occur after running for a few days, and restarting our application makes it connect and work fine for a few more days. My guess is that a period of inactivity is causing the issue. It seems the AMQP heartbeat was designed for exactly this problem. We are using spring-amqp 1.3.2.RELEASE and setting requestedHeartbeat on the ConnectionFactory to 10, yet we are still seeing connections drop.
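For reference, a rough sketch of how the heartbeat is being requested (the exact wiring in our app differs, and the host name is a placeholder):

import com.rabbitmq.client.ConnectionFactory;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

public class RabbitConfig {
    public static CachingConnectionFactory connectionFactory() {
        // Underlying RabbitMQ client factory with the AMQP heartbeat requested.
        ConnectionFactory rabbitFactory = new ConnectionFactory();
        rabbitFactory.setHost("haproxy-host");     // placeholder; traffic goes through HAProxy
        rabbitFactory.setRequestedHeartbeat(10);   // heartbeat interval in seconds

        // Spring AMQP caching factory wrapping the client factory.
        return new CachingConnectionFactory(rabbitFactory);
    }
}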
The spring-amqp client will reconnect if I completely disconnect from the internet and reconnect, or block the connection with a firewall, but it does not even seem to throw an exception in the log when this happens in production. Of course, that may be because we are using slf4j and logback for our logging while Spring uses commons-logging, so the message goes to System.out rather than to the log. I have added the jcl-over-slf4j bridge to fix that, but have not rolled it out yet, so I do not have a stack trace to contribute.
One more piece of info about our architecture: we have HAProxy in front of RabbitMQ.
I would like to somehow run the app in debug within Eclipse to see whether the heartbeats are actually going out. I tried to verify with Wireshark, but our traffic uses two-way SSL encryption and I haven't been able to decrypt it yet.
Does anyone have any suggestions? I have been trying to run this down for weeks and I'm running out of ideas. I would greatly appreciate your input.
Thanks!
On 11-Feb-2015, RabbitMQ released 3.4.4, which has support for automatic reconnection. You could roll your own solution like we did a while back, but it seems easier to just upgrade to the newest version.
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq_v3_4_4/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs
If you have RabbitMQ's autorecovery enabled, note that Spring AMQP prior to 1.4 is not compatible with it; the problem is that the Rabbit client restores the connections/channels, but Spring AMQP doesn't know about them. Generally, though, this just causes extra connections/channels, since Spring AMQP simply establishes new ones. I have not heard of it causing the problems you describe.
If you can't figure it out by fixing your logging configuration, another thing to try is to jstack your process to see what the threads are doing. But you should be able to figure it out from the logs.
I'm about to refactor a broker application written for WebSphere MQ. In the existing application, the following options are set when reading a message from the queue:
MQConstants.MQGMO_WAIT and
waitInterval = 1000 (milliseconds).
In our application, there is no guarantee that we receive a message every second. We may not receive a message for hours. I'm not sure why the creators of this application chose waitInterval = 1000 instead of setting the waitInterval to MQWI_UNLIMITED.
At the moment, there is a catch block in the code that does nothing when MQException.MQRC_NO_MSG_AVAILABLE occurs.
The creators of this application were really smart people, so I do not know why they opted for this approach. I'm new to MQ, so can anyone please explain the reasoning behind it?
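For context, a minimal sketch of the existing pattern, using the IBM MQ classes for Java (queue manager and queue names are placeholders):

import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.MQConstants;

public class PollingReader {
    public static void main(String[] args) throws MQException {
        MQQueueManager qmgr = new MQQueueManager("QMGR1");           // placeholder name
        MQQueue queue = qmgr.accessQueue("APP.IN.QUEUE",             // placeholder name
                MQConstants.MQOO_INPUT_AS_Q_DEF);

        MQGetMessageOptions gmo = new MQGetMessageOptions();
        gmo.options = MQConstants.MQGMO_WAIT;
        gmo.waitInterval = 1000;  // wait at most one second per get call

        while (true) {
            MQMessage msg = new MQMessage();
            try {
                queue.get(msg, gmo);
                // ... process the message ...
            } catch (MQException e) {
                if (e.reasonCode == MQConstants.MQRC_NO_MSG_AVAILABLE) {
                    // Nothing arrived within the wait interval; loop and try again.
                    // (This is the catch block that currently does nothing.)
                    continue;
                }
                throw e;
            }
        }
    }
}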
Well, it's just to check the queue every second for a message. You can be more intelligent by using features like asynchronous message delivery in a thread, or by using some of the newer features of MQ that do not involve a lot of polling on the queue.