URLConnection getInputStream blocks thread - java

I encountered an issue with the getInputStream method of the URLConnection class. I'm aware that similar issues are discussed in other threads, but no single solution seemed to work in my case.
The funny thing is that the first execution goes well, but further ones fail (block). Before describing the issue, I'd like to give some background.
Basically I have a simple client-server configuration. As I don't want to hardcode the server address and port in the client app, I employ an HTTP server (nginx) from which the actual connection parameters can be retrieved.
On the client side, there's a 'network thread' that is controlled by a service. The service starts the thread and can interrupt it when needed. At the very beginning of the run() method there's an invocation of the following function:
private ConnectionParameters obtainConnectionParameters(String url)
        throws MalformedURLException, IOException {
    URLConnection connection = new URL(url).openConnection();
    InputStream in = connection.getInputStream(); // here the problem occurs
    ... // do some processing
    in.close();
    return connectionParameters;
}
When the connection parameters have been obtained, another socket connection is opened. After some time the thread may be stopped or simply reach the end of its run() method. I double-checked that it exits cleanly.
Returning to the problem, I have no idea what may be causing it. Do you have any clues as to what could possibly cause this behavior?
I'd also like to mention that the service and the network thread run in a separate (background) process from the activities. There's no other place in this process where URLConnection is used. It's worth noting that all variables used in obtainConnectionParameters are local.
I suppose nothing crucial is missing from the description; otherwise please let me know, so I can edit my post.
EDIT (1):
I have just tried the Apache HTTP client, as in the thread Make an HTTP request with Android,
and it worked well. I'd still love to find out what is wrong with URLConnection, though.

If I understand you correctly, the code snippet above is called multiple times; the first time it works fine, but the second time it blocks on the getInputStream() call?
The problem could be on the server side. Maybe the server only accepts one connection at a time, and the first connection you made is still open? Is it possible to open the URL in a browser multiple times, to verify that the server works as expected?
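On the client side, it is also worth making sure the first connection is fully released before the next attempt. A minimal sketch, assuming the URL uses HTTP so the connection can be cast to HttpURLConnection (the disconnect() call is the part the original snippet lacks):
HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
try (InputStream in = connection.getInputStream()) {
    // ... do some processing ...
} finally {
    connection.disconnect(); // release the underlying socket so it cannot linger
}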

Related

How do I make a Java function that retries a URL connection every half second if the connection takes too long?

So I have a problem with a Java program of mine. The program's basic functionality involves connecting to a web API for data. The function that does that is something like this:
public static Object getData(String sURL) throws IOException {
    URL url = new URL(sURL);
    URLConnection request = url.openConnection();
    request.connect();
    return request.getContent();
}
The code works fine as it is, but recently, after my house changed ISPs, I have found that connections sometimes take an unreasonably long time: 10 seconds or more in about 10% of attempts, while the other 90% take only around 200 ms. I have found it faster to have my program call the function again in a different thread than to wait for some of these connections to finally complete.
Therefore, I want to change the function so that if the connection has not been established after 500 ms, it disconnects and a new connection is attempted. How could I do this?
Somewhere online I read that HttpURLConnection might help, but I am not sure how.
URLConnection allows you to specify the connect and read timeouts prior to calling connect():
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URLConnection.html#setConnectTimeout(int)
Sets a specified timeout value, in milliseconds, to be used when opening a communications link to the resource referenced by this URLConnection. If the timeout expires before the connection can be established, a java.net.SocketTimeoutException is raised. A timeout of zero is interpreted as an infinite timeout.
With a 500 ms timeout:
try {
    URLConnection request = url.openConnection();
    request.setConnectTimeout(500); // 500 ms
    request.connect();
    // on successful connection
} catch (SocketTimeoutException ex) {
    // on request timeout
}
You can pack this into a loop, but I recommend limiting the number of attempts made, as in the sketch below.
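A minimal retry sketch along those lines (the 500 ms timeout and the three-attempt cap are illustrative values, not requirements):
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.net.URLConnection;

public static Object getDataWithRetry(String sURL) throws IOException {
    final int maxAttempts = 3; // illustrative cap; tune to taste
    SocketTimeoutException lastTimeout = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            URLConnection request = new URL(sURL).openConnection();
            request.setConnectTimeout(500); // give up on connecting after 500 ms
            request.connect();
            return request.getContent();
        } catch (SocketTimeoutException ex) {
            lastTimeout = ex; // this attempt timed out; try again
        }
    }
    throw lastTimeout; // all attempts timed out
}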
Java's URLConnection has no retry capability in Java 8, so the best way to achieve this is to use an appropriate standalone third-party library such as Apache HttpClient.
It is by far the best standalone third-party HTTP client with advanced capabilities as of 2020, and it is still maintained.
By default, Apache HttpClient 4.x uses the default implementation of org.apache.http.client.HttpRequestRetryHandler, which retries up to 3 times, but you can supply a custom implementation instead.
The configuration might look like this (full package names are for example's sake):
org.apache.http.client.HttpClient httpClient = org.apache.http.impl.client.HttpClients.custom()
        .setRetryHandler(new YourCustomRetryHandlerImpl())
        // other config
        .build();
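Since HttpRequestRetryHandler has a single method, a custom policy can be written as a lambda in HttpClient 4.5.x. A sketch (retrying up to 5 times and only on SocketTimeoutException is an assumed policy, not a library default):
import java.net.SocketTimeoutException;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Retry up to 5 times, and only when the failure was a socket timeout.
HttpRequestRetryHandler retryHandler = (exception, executionCount, context) ->
        executionCount <= 5 && exception instanceof SocketTimeoutException;

CloseableHttpClient httpClient = HttpClients.custom()
        .setRetryHandler(retryHandler)
        .build();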
There is no way I can reproduce that problem using my ISP.
I suggest you dig deeper into the problem and find a better solution; simply sending another request doesn't seem good enough to me. Maybe try a different way to get the data and see if that works for you. I can't say for sure, as I can't reproduce the problem.

httpclient Connection reset [duplicate]

I'm creating a (well behaved) web spider and I notice that some servers are causing Apache HttpClient to give me a SocketException -- specifically:
java.net.SocketException: Connection reset
The code that causes this is:
// Execute the request
HttpResponse response;
try {
    response = httpclient.execute(httpget); // httpclient is of type HttpClient
} catch (NullPointerException e) {
    return; // deep down, Apache HTTP sometimes throws a null pointer...
}
For most servers it's just fine. But for others, it immediately throws a SocketException.
Example of site that causes immediate SocketException: http://www.bhphotovideo.com/
Works great (as do most websites): http://www.google.com/
Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use Apache's HTTP Client. (Code like this:)
HttpURLConnection c = (HttpURLConnection) url.openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
Reader r = new InputStreamReader(in);
int i;
while ((i = r.read()) != -1) {
    source.append((char) i);
}
So, why don't I just use this code instead? Well there are some key features in Apache's HTTP Client that I need to use.
Does anyone know what makes some servers trigger this exception?
Research so far:
Problem occurs on my local Mac dev machines AND an AWS EC2 Instance, so it's not a local firewall.
It seems the error isn't caused by the remote machine because the exception doesn't say "by peer"
This Stack Overflow question seems relevant: java.net.SocketException: Connection reset, but the answers don't show why this would happen only with Apache HttpClient and not with other approaches.
Bonus question: I'm doing a fair amount of crawling with this system. Is there generally a better Java class for this other than Apache HTTP Client? I've found a number of issues (such as the NullPointerException I have to catch in the code above). It seems that HTTPClient is very picky about server communications -- more picky than I'd like for a crawler that can't just break when a server doesn't behave.
Thanks all!
Solution
Honestly, I don't have a perfect solution, but it works, so that's good enough for me.
As pointed out by oleg below, Bixo has created a crawler that customizes HttpClient to be more forgiving to servers. To "get around" the issue more than fix it, I just used SimpleHttpFetcher provided by Bixo here:
(link removed - SO thinks I'm a spammer, so you'll have to google it yourself)
SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname", "contact@yourcompany.com", "ENTER URL"));
try {
    FetchedResult result = fetch.fetch("ENTER URL");
    System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
    e.printStackTrace();
}
The downside to this solution is that Bixo has a lot of dependencies, so it may not be a good workaround for everyone. However, you can always work through their use of DefaultHttpClient and see how they instantiated it to get it to work. I decided to use the whole class because it handles some things for me, like automatic redirect following (and reporting the final destination URL), that are helpful.
Thanks for the help all.
Edit: TinyBixo
Hi all. So, I loved how Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a vastly simplified Bixo, without all the dependencies. If you're running into the problems above, I would recommend using it (and feel free to make pull requests if you'd like to update it!)
It's available here: https://github.com/juliuss/TinyBixo
First, to answer your question:
The connection reset was caused by a problem on the server side. Most likely the server failed to parse the request or was unable to process it and dropped the connection as a result without returning a valid response. There is likely something in the HTTP requests generated by HttpClient that causes server side logic to fail, probably due to a server side bug. Just because the error message does not say 'by peer' does not mean the connection reset took place on the client side.
A few remarks:
(1) Several popular web crawlers such as bixo http://openbixo.org/ use HttpClient without major issues, but pretty much all of them had to tweak HttpClient's behavior to make it more lenient about common HTTP protocol violations; by default HttpClient is rather strict about HTTP protocol compliance (see the sketch below).
(2) Why didn't you report the NPE problem, or any other problem you have been experiencing, to the HttpClient project?
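For illustration, a more lenient configuration in HttpClient 4.x terms might look like this (the particular relaxations chosen, lax redirect handling and ignored cookies, are assumptions about what "lenient" means here, not Bixo's actual settings):
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.LaxRedirectStrategy;

// Relax redirect handling and skip strict cookie parsing while crawling.
CloseableHttpClient client = HttpClients.custom()
        .setRedirectStrategy(new LaxRedirectStrategy())
        .setDefaultRequestConfig(RequestConfig.custom()
                .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
                .build())
        .build();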
These two settings will sometimes help:
client.getParams().setParameter("http.socket.timeout", new Integer(0));
client.getParams().setParameter("http.connection.stalecheck", new Boolean(true));
The first sets the socket timeout to be infinite; the second makes HttpClient check pooled connections for staleness before reusing them.
Try getting a network trace using Wireshark, and augment that with log4j logging of HttpClient. That should show why the connection is being reset.
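For the logging half, HttpClient 4.x wire logging can be enabled with a couple of log4j categories; a sketch of the relevant log4j.properties lines (category names as documented for HttpClient 4.x logging):
# Log headers and wire-level traffic from HttpClient 4.x
log4j.logger.org.apache.http=DEBUG
log4j.logger.org.apache.http.wire=DEBUG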

Rather mysterious SocketException with Java 1.6 on CentOS 4

I have a JUnit test of a JAX-RS web service. The test launches an embedded Tomcat and then talks to it via the Apache CXF JAX-RS client.
Consider this backtrace:
Caused by: java.net.SocketException: Socket Closed
at java.net.PlainSocketImpl.getOption(PlainSocketImpl.java:286)
at java.net.Socket.getSoTimeout(Socket.java:1032)
at sun.net.www.http.HttpClient.available(HttpClient.java:356)
at sun.net.www.http.HttpClient.New(HttpClient.java:273)
at sun.net.www.http.HttpClient.New(HttpClient.java:310)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:987)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:923)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:841)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1031)
This fails only on CentOS 4.8. The same unit test (which launches an embedded Tomcat and then talks to a web service in it) works just fine on a wide variety of other systems. Note the extreme oddity of this backtrace: HttpURLConnection has called HttpClient to get a new connection, and that latter class has apparently closed its own socket before the connection has been returned anywhere my code could get to it.
Further, the test has siblings that do the same server setup of the same service and talk to it without issues.
Even further, the following incantation (slightly abbreviated) is a workaround:
@Before
public void pingServiceToWorkAroundCentos() {
    try {
        /* ... code to make a connection to the service and close it ... */
    } catch (Throwable t) {
        // do nothing
    }
}
In other words, if I arrange for an extra throwaway connection before running each of the test cases, that uses up whatever this problem is.
What could this be?
Since there is only a backtrace and no code here, I am assuming there is some sort of race condition or bug where the socket is being closed by another thread while the current thread is attempting to get the OutputStream.
Looking at the JDK source, I see this:
public Object getOption(int opt) throws SocketException {
    if (isClosedOrPending()) {
        throw new SocketException("Socket Closed");
    }
    // ... snip ...
The isClosedOrPending method checks whether the internal file descriptor is null or whether a close is pending, i.e. close has been called on the socket.
Good luck tracking it down.
Nothing mysterious about it. You have closed the socket and then continued to use it.
Closing either the input or the output stream of the socket closes the other stream and the socket.
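A quick demonstration of that behavior (the host and port are placeholders):
import java.io.IOException;
import java.net.Socket;

public class CloseDemo {
    public static void main(String[] args) throws IOException {
        Socket s = new Socket("example.com", 80); // placeholder host and port
        s.getOutputStream().close();      // closing one stream...
        System.out.println(s.isClosed()); // ...closes the socket too: prints "true"
    }
}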
I am pretty sure this is a JDK bug.
HttpClient was modified in a recent commit:
http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/diff/e6dc1d9bc70b/src/share/classes/sun/net/www/http/HttpClient.java
The getSoTimeout() call needs to be in a try/catch block; for now, unfortunately, the only real option is to downgrade the JDK.
This looks similar to an issue we ran into, where httpclient's pooled connections were kept alive longer than the corresponding server-side connections in Tomcat. Basically this results in stale connections in the httpclient connection pool; when httpclient tries to use them, they fail. I believe httpclient actually recovers from this using the standard retry handler.
The solution is to double-check your timeout settings on the client and server side, and your retry policy; one way to handle the client side is sketched below.
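A sketch using HttpClient 4.4+: cap the pooled connections' time-to-live below the server's keep-alive window (the 30-second value is an assumption; match it against your own Tomcat keepAliveTimeout):
import java.util.concurrent.TimeUnit;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Discard pooled connections after 30 s, before the (assumed) longer
// server-side keep-alive window can silently drop them.
CloseableHttpClient client = HttpClients.custom()
        .setConnectionTimeToLive(30, TimeUnit.SECONDS)
        .build();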

java.net.SocketTimeoutException: Read timed out

I have an application with a client-server architecture. The client uses Java Web Start with Java Swing / AWT, and the server uses an HTTP server / servlets with Tomcat.
Communication is based on object serialization: the client creates an ObjectOutputStream, serializes the objects to a byte array, and sends it to the server, which reads it with an ObjectInputStream and deserializes it.
The application communicates correctly up to a certain level of concurrency, at which point it starts to show "SocketException read timeout" errors. The error happens when the server invokes ObjectInputStream.readObject() in my servlet's doPost method.
Tomcat becomes slow and the errors multiply as server response time degrades, until the server crashes and I must restart it; after that, everything works again.
Has anyone run into this problem?
Client Code
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStream os = conn.getOutputStream();
ObjectOutputStream oss = new ObjectOutputStream(os);
oss.writeUTF("protocol header sample");
oss.writeObject(_parameters);
oss.flush();
oss.close();
Server Code
ObjectInputStream input = new ObjectInputStream(_request.getInputStream());
String method = input.readUTF();
parameters = input.readObject();
input.readObject() is where the error occurs.
You haven't given us much information to go on, especially about the client side. But my suspicion is that the client side is:
failing to set the Content-Length header (or setting it to the wrong value),
failing to flush the output stream, and/or
not closing the output side of the socket.
Mysterious.
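For comparison, here is a sketch of a client that avoids all three pitfalls by serializing into a buffer first, so the exact Content-Length is known before anything is sent (variable names mirror the client code above):
// Serialize to a buffer first so the exact Content-Length can be set.
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(buffer);
oos.writeUTF("protocol header sample");
oos.writeObject(_parameters);
oos.close();

HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setDoOutput(true);
conn.setFixedLengthStreamingMode(buffer.size()); // sets the Content-Length header
try (OutputStream os = conn.getOutputStream()) {
    buffer.writeTo(os); // write the serialized bytes
} // try-with-resources flushes and closes the output side
int status = conn.getResponseCode(); // completes the HTTP exchange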
Based on your updated question, it looks like none of the above. Here are a couple of other possibilities:
The client side is, for some reason, either locking up entirely during serialization or taking a very long time.
There is a proxy between the client and server that is causing problems.
You are experiencing load-related network problems, or network hardware problems.
Another possible explanation is that you have a memory leak, and the slowdown is caused by the GC taking more and more time as you run out of memory. This will show up in the GC logs, if you have them enabled.
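If GC logging isn't enabled yet, it can be switched on with JVM flags; a sketch for a Java 8 JVM (assumed version; the flag names changed in Java 9, and for Tomcat these would typically go in CATALINA_OPTS rather than on a command line):
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar yourserver.jar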
I think that during high concurrency, the socket timeout set in Tomcat expires and the connection is closed. Tomcat's next read on that connection then happens later than the socket timeout configured on the server.
If you want to avoid this problem, you would have to increase the server-side timeout that is expiring in your case, but that is not advisable.
By the way, you did not give enough information. Did you increase the number of connection threads in Tomcat? If you did, this would surely happen.