sporadic DNS resolution issues in Java http client - java

We've got a batch job that runs every day on an EC2 instance within AWS. The EC2 instance exists in a VPC. The batch job uses java to make a series of REST API calls on a public server. Most days the batch job runs without issue. However, some days, something breaks down in DNS resolution. The job will be happily running and then suddenly DNS resolution fails and the remaining API calls error out with an exception like the following:
java.net.UnknownHostException: some.publicserver.com: Name or service not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[na:1.8.0_191]
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[na:1.8.0_191]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[na:1.8.0_191]
at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[na:1.8.0_191]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[na:1.8.0_191]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[na:1.8.0_191]
at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:44) ~[batchjob.jar:na]
at org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:102) ~[batchjob.jar:na]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:319) ~[batchjob.jar:na]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:363) ~[batchjob.jar:na]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:219) ~[batchjob.jar:na]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195) ~[batchjob.jar:na]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86) ~[batchjob.jar:na]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108) ~[batchjob.jar:na]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[batchjob.jar:na]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[batchjob.jar:na]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) ~[batchjob.jar:na]
...
Some days every API call fails with this error; other days there is a string of successful calls and then everything starts failing. On the days when the job is failing, I can connect to the server at the same time and verify that DNS seems to be working. For example, if I use the following command
nslookup some.publicserver.com
It returns a successful response. At the same time, the batch job will be spewing a bunch of UnknownHostExceptions.
I am perplexed as to where to look for the source of the problem. Has anyone out there experienced anything similar to this?

I think this is not a Java-specific problem per se, but rather a DNS resolution issue on the EC2 instance. Java resolves names by delegating to the underlying OS resolver, which typically checks the hosts file first and then queries the configured DNS servers.
With this in mind, and given that the underlying EC2 instance is running a Linux distro, these steps result in a call to an OS resolver function such as gethostbyname2 or getaddrinfo. This in turn performs all the under-the-hood magic to resolve the name in question.
Now, two things are very important in troubleshooting your problem. The first is whether the IP address of the server you're calling changes often. The second is that the nslookup program you're using queries the DNS server directly, bypassing the hosts file and the OS resolver path that Java goes through. This means there could very well be discrepancies between what Java does to resolve the domain name and what nslookup does. It may also mean that the OS has cached an IP address that no longer corresponds to the server's latest one. Thus, I would suggest checking the IP address of the hostname using some other utility that goes through the system resolver (e.g. ping).
My advice on troubleshooting this would be the following:
Add some kind of log trace when attempting the hostname resolution and compare it with the value resolved by nslookup.
Check whether the EC2 instance has a proper DNS setup (which DNS server you're using, etc.).
Add an entry to the hosts file mapping the domain name to the IP address (provided that the latter does not change).
Hope the above helps.
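The log trace suggested above could look something like the following minimal sketch. The class name DnsTrace and the method describeResolution are made up for illustration; the only real API used is InetAddress.getAllByName, the same call that appears in the stack trace. Pass the real hostname as the first argument (the sketch falls back to localhost so it can run anywhere).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;
import java.util.Arrays;

// Hypothetical helper for illustration, not part of the original batch job.
public class DnsTrace {

    // Resolves the host through the JVM (and therefore through the OS resolver)
    // and returns a one-line, timestamped description of the outcome.
    static String describeResolution(String host) {
        try {
            InetAddress[] addresses = InetAddress.getAllByName(host);
            return Instant.now() + " " + host + " -> " + Arrays.toString(addresses);
        } catch (UnknownHostException e) {
            return Instant.now() + " " + host + " FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: the hostname is passed on the command line.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(describeResolution(host));
    }
}
```

Running this (for example as java DnsTrace some.publicserver.com) next to nslookup at the moment the job starts failing should show whether the two really disagree.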

For what it's worth, in my particular situation, here's what I have been able to figure out. Hopefully this will help someone else down the road.
The authoritative DNS servers for the target in my example (some.publicserver.com) are returning SERVFAIL for some requests. This seems likely to be a load issue as it appears to happen sporadically throughout the day. With my AWS setup, I am using the default DNS servers for my VPC, which are provided by AWS. Those servers apparently do not do any caching. I have learned that Java does some caching for DNS resolutions through InetAddress, but by default it is a short window (30 seconds in most implementations I believe).
So in the end, the real cause of the problem is that the authoritative DNS servers for some.publicserver.com are not completely reliable. Since I have no control over those servers, I think the best workaround is to use DNS caching. Option #1 is to use local DNS caching on my EC2 Ubuntu instance (something like dnsmasq). Option #2 is to increase the caching duration used by Java, by doing something like this:
java.security.Security.setProperty("networkaddress.cache.ttl" , "900");
I chose option #2 as it required less effort and minimizes the potential side effects. So far, it has resolved the issue for me.
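For anyone trying option #2, here is a slightly fuller sketch. The class and method names are made up for illustration. The negative TTL setting is an addition of mine and not something I verified in this setup: networkaddress.cache.negative.ttl is a real security property that controls how long failed lookups are cached, and lowering it may help the JVM retry sooner after a transient failure. As far as I know, the properties need to be set before the JVM does its first lookup, so this belongs at the very start of main.

```java
import java.security.Security;

// Hypothetical helper for illustration only.
public class DnsCacheConfig {

    // Sets how long successful and failed lookups are cached by the JVM.
    // Call this before any code triggers a DNS lookup.
    static void configureDnsCache(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // Assumption: not caching failures is desirable here, so that a single
        // SERVFAIL is not remembered for subsequent calls.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        configureDnsCache(900, 0);
        System.out.println("positive TTL = "
                + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("negative TTL = "
                + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

Note that a long positive TTL only helps once at least one lookup has succeeded; it does nothing for a job whose very first lookup fails.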

Related

Springboot app settings to perform StressTest

I want to ask what the proper settings are in the Spring Boot application properties for performing a stress test. I'm constantly getting this error from JMeter when it reaches 16k samples against my Spring Boot REST server. I am running 100 threads/s in the JMeter thread group for 5 minutes. Thank you.
java.net.BindException: Address already in use: connect
at java.base/sun.nio.ch.Net.connect0(Native Method)
at java.base/sun.nio.ch.Net.connect(Net.java:579)
at java.base/sun.nio.ch.Net.connect(Net.java:568)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.base/java.net.Socket.connect(Socket.java:633)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl$JMeterDefaultHttpClientConnectionOperator.connect(HTTPHC4Impl.java:404)
Most probably you've run out of outbound ports.
Check your operating system documentation for how to increase the number of outbound ports and/or reduce their recycle time.
Example suggestions can be found, e.g., in Solved “java.net.BindException: Address already in use: connect” issue on Windows or Handling "exhausting available ports" in Jmeter.
Alternatively, you can allocate another machine and consider switching to JMeter Distributed Testing.

JVM (Java Transport client) caches the IP address of the DNS entry AWS

A NoNodeAvailableException could potentially be related to the fact that the IP address of an ELB changes frequently while the JVM (Java Transport client) caches the IP address of the ELB's DNS entry.
The problem is that the IP of the ELB on AWS might change once in a while, which we don't have control over. When it changes, our system fails with an error saying no node was found, because the IP is cached and the client is still trying to connect to the old IP.
For this I got a resolution, which is to set "networkaddress.cache.ttl" = "0", which tells the JVM not to cache the IP.
My problem is how to simulate this scenario, because changing the IP is not in my control. Can anyone suggest a smart way of doing it from code (not the fix part but the testing part)?
I would recommend using a value other than 0 because otherwise you'll be flooding the network with DNS requests every time you make a connection, and in some cases, will cause additional delays. Using a number between 5 and 60 would make sense.
If you are verifying the fix, you can use a tool like wireshark to monitor the network and verify that continued connections to the server are making DNS requests and responses for the name. If you want to write an automated test then you'll have to stand up some kind of DNS infrastructure and have that co-ordinate with your tests to cycle between different values.
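As a lighter-weight complement to watching the wire, a small program can poll the name from inside the JVM and report whenever the set of addresses it sees changes. This is only a sketch: the class name DnsChangeWatcher and its arguments are made up for illustration, and it shows what the JVM sees (including its own cache), not what is sent over the network. With a low TTL, a change in the DNS record should show up here after roughly that many seconds.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only.
public class DnsChangeWatcher {

    // Returns the IP addresses the JVM currently sees for the host,
    // or an empty set if the lookup fails.
    static Set<String> resolve(String host) {
        Set<String> ips = new TreeSet<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                ips.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // A failed lookup is reported as "no addresses".
        }
        return ips;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumptions: host name and number of polling rounds come from the command line.
        String host = args.length > 0 ? args[0] : "localhost";
        int rounds = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        Set<String> previous = null;
        for (int i = 0; i < rounds; i++) {
            Set<String> current = resolve(host);
            if (!current.equals(previous)) {
                System.out.println(host + " now resolves to " + current);
                previous = current;
            }
            Thread.sleep(1000);
        }
    }
}
```

Pointing this at a test hostname whose record you can edit, and changing the record while it runs, would exercise the scenario without touching the ELB.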

Simulating slow/lossy communication in java

I need to test functionality internal to my company's server whose benefit is evident only when clients run slowly (in terms of latency and packet loss). To that end, I need to simulate clients on a slow and/or lossy connection (TCP/HTTP). I'm using a Mac, Mountain Lion, and ideally I'd need to run both server and client locally.
One approach I tried to pursue -- unsuccessfully -- was to get hold of some java APIs that allow me to build clients with slow connections. I know JMeter has got something called SlowSockets (or something similar), but I was looking for APIs more focused on slow-performing clients. Any ideas of useful APIs?
Another approach I tried was to use a proxy as a middleman between client and server. In that case, the proxy should provide functionality for simulating slow links. I've tried Charles proxy (Mac) and Apache TCPMon, but I seem to be missing something when I try to get them to work. With TCPMon, for instance, when I start it in 'Proxy' mode (which is the mode that offers the 'simulate slow link' functionality) I define the port for the local proxy, but I can't see how to define the remote host and port. Something similar happens with Charles Proxy; I can set the local port in the Proxy settings, but I can't understand how to define the remote end of the proxy (in fact connections fail saying the remote server is not responding). Does anyone have an idea what I'm doing wrong?
One further approach I have tried is using lower-level (e.g. OS-based) means; in this case, I tried Apple's Network Link Conditioner. I switched it on and defined my slowness parameters, but when I ping I don't see the expected RTT etc. I've got a feeling NLC has a tight relationship with Xcode and iOS testing. Has anyone managed to get it to work for testing other (e.g. Java) applications? I've also tried ipfw on Mac, but the manual says ipfw is now deprecated and I don't want to spend time getting to know a tool that won't be available soon.
Any idea/help will be highly appreciated.
Thanks in advance.

java.net.UnknownHostException occurs after some time

I have a project in Eclipse to retrieve data from a certain website. As there is too much data to be retrieved, I have to keep the code running overnight. I get a java.net.UnknownHostException after some time. The code runs without any problem for a long time, and only later does the UnknownHostException occur. Any idea why this is happening?
You can only have the mac address of your server where the war is being deployed, Check it here how to get the MAC address
I have seen this error in one of my projects before. Till Java 1.5, JVM used to cache the DNS entry and did not honor the TTL values. If for some reason, the DNS entry was modified (usually the case with Akamai or other CDN networks), and the IP you were going to before is no longer available, you may hit upon this error.
Some info on this behavior is available at http://www.rgagnon.com/javadetails/java-0445.html and http://blog.andrewbeacock.com/2006/12/warning-java-caches-dns-to-ip-address.html.
What you may try is to run an iptrace from the same machine both when it works fine and when it starts failing. If the IP has changed, you are hitting this scenario.
My guess is that your internet connection is probably breaking. Do you have any other logs to verify this?
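If the failures turn out to be transient (a connection that drops and comes back, or a DNS record that changed), one possible workaround for an overnight job is to log each failed lookup and retry with a growing delay. This is a technique I am adding here, not something from the answers above, and the class and method names are made up for illustration. Keep in mind that the JVM also caches failed lookups for a short time (networkaddress.cache.negative.ttl, 10 seconds by default as far as I know), so very short delays may just return the cached failure.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;

// Hypothetical helper for illustration only.
public class ResolveWithRetry {

    // Tries to resolve the host up to maxAttempts times, doubling the delay
    // after every failure and logging each failure with a timestamp.
    static InetAddress resolveWithRetry(String host, int maxAttempts, long initialDelayMillis)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return InetAddress.getByName(host);
            } catch (UnknownHostException e) {
                last = e;
                System.err.println(Instant.now() + " lookup attempt " + attempt
                        + " failed for " + host);
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the hostname is passed on the command line.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(resolveWithRetry(host, 5, 15_000L));
    }
}
```

The timestamps in the log also give you something to compare against other logs when checking whether the connection itself was down.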

unreliable behaviour of Openfire server at EC2

We are using Openfire server 3.7.1 on an Amazon EC2 Linux instance for a chat application.
Currently we are in the initial development stage, where we are testing it with 4 or 5 concurrent users.
Now and then we are getting issues with the Openfire server:
1) Java heap space exceptions.
2) java.net.BindException: Address already in use
3) Both lead to port 5222 not listening, while the Openfire admin console on 9090 works fine.
Eventually, when I stop all Openfire processes and restart it, it goes back to normal.
I want to know whether this is a bug in Openfire version 3.7.1 or whether EC2 has some issue with opening port 5222. I am really apprehensive about the performance of the Openfire server when thousands of users are using it concurrently.
Solved by:
Disabling PEP.
Increasing Openfire JVM parameters
The Java heap space exception is common with Openfire; you can check your JVM arguments and increase the parameters. In my experience there were a couple of cases that caused it:
clients using Empathy.
some plugin that provided buddy lists/ white/black lists etc (had to do something with the user's roster lists).
You need to make sure ports 5222 and 5223 are open (some clients may use the old SSL port) in the EC2 firewall settings.
If you plan to have thousands of users, I suggest you get a static IP address (you don't mention what your current config is). Also check out jabberd, which proved to be more reliable than Openfire.
Thousands of concurrent users should not be a problem for Openfire at all. It has seen 250K in testing. It will always depend, though, on what the users are doing.
There is a known memory leak in Openfire that has been fixed but not yet released. It is related to PEP, which can be shut off to circumvent this issue if that is feasible for you.
