Delays in java.rmi.Naming.bind() - java

A driver from one of our vendors suddenly started failing last week, on a machine where it had worked for almost a year. The last two descriptive messages output by the driver are now three minutes apart (whereas before they were seconds apart). Reverse-engineering the code, I've found that the three-minute delays happen during calls to java.rmi.Naming.bind().
What's my starting point for troubleshooting this and getting more information out of the call to bind()? It does not return a success or failure indication, and it does not appear to be throwing any Exceptions. I would assume that the consistent 3-minute delay means that some timeout is being hit at some point during the process, but then why is there no indication of failure?

This is almost certainly a DNS problem. Check the DNS lookup and reverse lookup times. Naming.bind() itself is trivial, at both ends.
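One quick way to check is to time both lookups from the affected machine; a minimal sketch (the host name is a placeholder for your RMI registry host):

import java.net.InetAddress;

public class DnsTimer {
    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();
        // Forward lookup: name -> address
        InetAddress addr = InetAddress.getByName("my-rmi-host");
        long t1 = System.nanoTime();
        // Reverse lookup: address -> canonical name
        String canonical = addr.getCanonicalHostName();
        long t2 = System.nanoTime();
        System.out.printf("forward: %d ms -> %s%n", (t1 - t0) / 1_000_000, addr.getHostAddress());
        System.out.printf("reverse: %d ms -> %s%n", (t2 - t1) / 1_000_000, canonical);
    }
}

If either number is anywhere near your three-minute delay, DNS is the culprit.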

Answering my own question:
Java RMI appears to log through java.util.logging, so you can configure java.util.logging to see what RMI is doing and debug the problem.
(In my case, we use log4j, so I found a library to route java.util.logging messages to log4j.)
With logging enabled, we discovered that RMI was spending three minutes trying to connect to a wrong IP address that it thought was localhost, thanks to a wrong entry somebody had added to our hosts file.
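A minimal sketch of such a logging configuration (the sun.rmi.* logger names are specific to the Sun/Oracle RMI implementation, and the main class name is a placeholder):

# logging.properties
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = ALL
sun.rmi.level = FINER
sun.rmi.transport.tcp.level = FINEST

Then start the JVM with:

java -Djava.util.logging.config.file=logging.properties -Djava.rmi.server.logCalls=true com.example.Driver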

Related

RabbitMQ connections dropping and not recovering despite heartbeat setting

I am trying to run down a problem with consumer connections to RabbitMQ being dropped in our production environment. The problem seems to occur after running for a few days, and after restarting our application it connects and works fine for a few more days. My guess is that a period of inactivity is causing the issue. The AMQP heartbeat seems designed for exactly this problem. We are using spring-amqp 1.3.2.RELEASE and setting the requestedHeartbeat on the ConnectionFactory to 10; however, we are still seeing connections drop.
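For reference, we configure the heartbeat roughly like this (the host name is a placeholder):

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

public class RabbitConfig {
    public static CachingConnectionFactory connectionFactory() {
        // Host is a placeholder; the heartbeat is in seconds and is negotiated with the broker
        CachingConnectionFactory cf = new CachingConnectionFactory("rabbit-host");
        cf.setRequestedHeartBeat(10);
        return cf;
    }
}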
The spring-amqp client will reconnect if I completely disconnect from the internet and reconnect, or block the connection with a firewall; however, it does not even seem to throw an Exception in the log when this happens in production. That may be because we are using slf4j and logback for our logging while Spring uses commons-logging, so the message goes to System.out instead of our log. I have added the jcl-over-slf4j bridge to fix that, but have not rolled it out yet, so I do not have a stack trace to contribute.
One more piece of info about our architecture: we have HAProxy in front of RabbitMQ.
I would like to run the app in debug within Eclipse to see if the heartbeats are actually going out. I tried to verify with Wireshark, but our traffic uses two-way SSL encryption and I haven't been able to decrypt it yet.
Does anyone have any suggestions? I have been trying to run this down for weeks and I'm running out of ideas. I would greatly appreciate your input.
Thanks!
On 11-Feb-2015, RabbitMQ released 3.4.4, which has support for automatic reconnections. You could roll your own solution like we did a while back, but it seems easier to just upgrade to the newest version of RabbitMQ.
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq_v3_4_4/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs
If you have RabbitMQ's autorecovery enabled, note that Spring AMQP versions prior to 1.4 are not compatible with it; the problem is that RabbitMQ restores the connections/channels, but Spring AMQP doesn't know about them. Generally, though, this just causes extra connections/channels, since Spring AMQP simply establishes new ones. I have not heard of it causing the problems you describe.
If you can't figure it out by fixing your logging configuration, another thing to try is running jstack against your process to see what the threads are doing. But you should be able to figure it out from the logs.
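A typical invocation (the PID is a placeholder): jstack 12345 > threads.txt, then look for consumer threads blocked in socket reads or stuck waiting on locks.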

Red5 crashes after couple seconds when using RTMPT

We've been having this problem for a long time and still cannot find out where the problem is. Our application uses RTMP for video streaming, and if the web client cannot connect it falls back to RTMPT (RTMP over HTTP). This causes the video to freeze after a couple of seconds of playback.
I have already found some forums where people seem to be having the same issue, but none of the proposed solutions worked. One suggestion was to turn off video recording, but that didn't help. I have also read that it seems to be a threading problem in Red5, but before hacking into Red5 I would like to know whether somebody has a patch or anything else that fixes this.
One more thing: we've been testing this on Macs, if that matters. Thank you very much in advance.
The very first thing you should look at is the Red5 error log.
Red5 also occasionally produces output that does not go to the log but only to plain stdout.
There is a red5-debug.sh (or red5-highperf.sh) that outputs/logs everything to a file called std.out.
Use those logs to start your analysis; you may already see something in them, for example exceptions like:
broken pipe
connection closed due to too long xxx
handshake error
encoding issue in packet xyz
unexpected connection closed
call xyz cannot be handled
too many connections
heap space error
too many open files
Some of these are operating-system specific, for example the limit on the number of open files; some are not.
It is also very important that you are using the latest revision of Red5 and not an old version; you did not tell us which version you are using.
However, from symptoms like video freezes, occasional disconnects, or similar alone, you won't be able to start a real analysis of the problem.
Sebastian
Were you still connected to the server when the video froze, or only afterwards? I am not sure, but I think the connection closed, which caused the stream to freeze. Just check in the Red5 access logs whether there are any entries for 'idle' packets (possibly after one or more 'send' packets, and more than one in number).
Another thing you could have a look at is your web server log files, because RTMPT runs over HTTP. I once had a problem with an anti-DDoS program on the server: RTMPT makes many connections in quick succession, and these TCP connections remain alive for about 4 minutes by default. You can easily get hundreds of connections at the same time, which can be seen as a DDoS attack, with the result that the client's IP address gets banned.

java.net.UnknownHostException occurs after some time

I have a project in Eclipse that retrieves data from a certain website. As there is too much data to retrieve, I have to keep the code running overnight. I get a java.net.UnknownHostException after some time. The code runs without any problem for a long time, and only later does the UnknownHostException occur. Any idea why this is happening?
You can only get the MAC address of the server where the WAR is being deployed; check how to get the MAC address.
I have seen this error in one of my projects before. Up to Java 1.5, the JVM cached DNS entries indefinitely and did not honor TTL values. If for some reason the DNS entry was modified (usually the case with Akamai or other CDN networks) and the IP you were connecting to before is no longer available, you may hit this error.
Some info on this behavior is available at http://www.rgagnon.com/javadetails/java-0445.html and http://blog.andrewbeacock.com/2006/12/warning-java-caches-dns-to-ip-address.html.
What you can try is running an iptrace both when it works fine and when it starts failing, from the same machine; if the IP has changed, you are hitting this scenario.
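If the JVM's DNS caching turns out to be the issue, you can shorten the cache TTL; a minimal sketch (the 60- and 10-second values are arbitrary choices):

import java.security.Security;

public class DnsCacheConfig {
    public static void main(String[] args) {
        // Must run before the first lookup; these can also be set in
        // $JAVA_HOME/jre/lib/security/java.security
        Security.setProperty("networkaddress.cache.ttl", "60");          // successful lookups
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // failed lookups
    }
}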
My guess is that your internet connection is probably dropping. Do you have any other logs to verify this?

Detecting ActiveMQ flow control

I have a production system that uses ActiveMQ (5.3.2) to send messages from server A to server B. A few weeks ago, the system inexplicably started taking 10+ seconds to send a message. After a reboot of the producer, the system worked fine.
After investigation, I'm pretty sure this is due to producer flow control. (I have a fairly standard ActiveMQ setup.) The day before this happened, my consumer software had (for other reasons) been acting erratically and had even stopped accepting connections for a while, so I'm guessing that triggered it. (It does puzzle me that the requests were still being throttled a day later.)
Question: how can I confirm that the requests were being throttled? I took a heap dump of the server; is there data in memory I can look for?
Edit: I've found the following:
WireFormatNegotiator.tcpNoDelayEnabled=false for one of the three WireFormatNegotiator instances in memory. I'm trying to figure out what sets this.
And second (and more important), is there a way I can use JMX to tell if the messages are being throttled? I'd like to set up a Nagios alert to let me know if this happens in the future. What property should I check for with JMX?
You can configure your producer client to throw javax.jms.ResourceAllocationException, which can then be detected/logged, etc. Just set one of the following in the broker's systemUsage configuration:
<systemUsage>
  <systemUsage sendFailIfNoSpaceAfterTimeout="3000"> ... </systemUsage>
</systemUsage>
...OR...
<systemUsage>
  <systemUsage sendFailIfNoSpace="true"> ... </systemUsage>
</systemUsage>
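On the JMX part of the question: one indicator you can poll (and alert on from Nagios) is the broker's MemoryPercentUsage attribute, since producer flow control in a default configuration engages when memory usage hits its limit. A hedged sketch against the 5.3-era MBean naming (host, port, and broker name are placeholders):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FlowControlCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:1099/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Pre-5.8 ActiveMQ object-name layout
            ObjectName broker = new ObjectName(
                    "org.apache.activemq:BrokerName=localhost,Type=Broker");
            int memPct = (Integer) conn.getAttribute(broker, "MemoryPercentUsage");
            // A value pinned at or near 100 suggests producer flow control is engaging
            System.out.println("MemoryPercentUsage = " + memPct);
        } finally {
            jmxc.close();
        }
    }
}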

Any reason why I would not be permitted to confirm a message using Tibco Rendezvous?

I have a setup in which some applications communicate with each other via Tibco Rendezvous, using certified messaging. My problem is that two of my receivers have recently started getting an Error 27, Not Permitted when they want to confirm a message (the first message in a certified message exchange isn't certified; we've accounted for that).
I've been looking around the internet for people with the same error, and I have found many, but they all get the error when trying to create the Tibco transport. I can create the transport just fine, but I can't confirm any messages received over it.
Our environment uses both Tibco 7.x and 8.x, sometimes intermingled. This problem appears both when the peers use the same Tibco version and when they use different versions. It doesn't show up for all applications, but when it does show up for an application, it stays "broken". Discarding the ledger files for both sender and receiver does nothing; we still get the error. Both sender and receiver have the proper permissions to write to (and create) the ledger files. We are connecting to permanently running rvds. The sender and receiver are on different machines. Communication worked flawlessly in the past, but at some point it stopped doing so. The application is in Java, and we're using the tibrvj.jar auto-native libraries.
The error is
...
Caused by: TibrvException[error=27,message=Not permitted]
at com.tibco.tibrv.TibrvImplCmTPortC.natConfirmMsg(Native Method)
at com.tibco.tibrv.TibrvImplCmTPortC.confirmMsg(TibrvImplCmTPortC.java:304)
at com.tibco.tibrv.TibrvCmListener.confirmMsg(TibrvCmListener.java:88)
....
I know you're going to ask me "what did you do to make it start happening", and my response is "I don't know".
Any input would be appreciated.
Thanks.
It may be that TCP connections between the two rvd servers are not possible. Can you check whether you can connect from one to the other (connect from the subscriber host back to the publisher)? In my experience, CM acknowledgments are handled over TCP (take this with a grain of salt, as I'm more an end user than a middleware support guy).
As it turns out, it was a screw-up on the application level.
Due to some old code lying around after we updated a dependency (our messaging layer), we had moved from application-level confirmation to container-level confirmation, but we had forgotten to remove an explicit message confirmation in the application code.
To summarize: We tried to confirm the message twice, and the second time it threw this exception.
I recently encountered the same exception; the application had been working for months and suddenly started throwing it. In my case, some maintenance had been done on the Windows server the application ran on, and directories had been marked read-only. Once that was cleared, the exception went away.
I discovered this after troubleshooting hours' worth of other potential causes.
Just my two cents: this exception also occurs when you try to explicitly confirm a message on a non-CM transport.
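A hedged sketch tying the two causes above together (cmListener is a TibrvCmListener created elsewhere; this is illustrative, not the original application's code):

import com.tibco.tibrv.TibrvCmListener;
import com.tibco.tibrv.TibrvException;
import com.tibco.tibrv.TibrvListener;
import com.tibco.tibrv.TibrvMsg;
import com.tibco.tibrv.TibrvMsgCallback;

public class ConfirmOnce implements TibrvMsgCallback {
    private final TibrvCmListener cmListener;

    public ConfirmOnce(TibrvCmListener cmListener) {
        this.cmListener = cmListener;
    }

    public void onMsg(TibrvListener listener, TibrvMsg msg) {
        try {
            // Confirm exactly once; a second confirmMsg() on the same message
            // (e.g. container-level plus application-level confirmation), or a
            // confirmMsg() against a plain non-CM transport, raises
            // TibrvException error 27, "Not permitted"
            cmListener.confirmMsg(msg);
        } catch (TibrvException e) {
            e.printStackTrace();
        }
    }
}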
