SimpleHttpConnectionManager being used incorrectly - java

SimpleHttpConnectionManager being used incorrectly. Be sure that HttpMethod.releaseConnection() is always called and that only one thread and/or method is using this connection manager at a time.
Does anyone know why this error shows up? It causes the files I want to download to either fail and retry, or to download incompletely.
Thank you!

Make sure that you don't use SimpleHttpConnectionManager to create and use connections from multiple threads. The simple connection manager is not designed for that: it always returns the same connection, which is not thread safe.
In a multi-threaded environment, use a different manager that uses a pool of connections. See MultiThreadedHttpConnectionManager.

I'd prefer to take no credit for this, but as per Eyal Schneider's answer, you can find more info on using MultiThreadedHttpConnectionManager in Vincent de Villers' excellent blog.
Code snippet copied in case the link ever disappears:
HttpClient httpclient = new HttpClient(new MultiThreadedHttpConnectionManager());
GetMethod httpget = new GetMethod("http://www.myhost.com/");
try {
httpclient.executeMethod(httpget);
Reader reader = new InputStreamReader(
httpget.getResponseBodyAsStream(), httpget.getResponseCharSet());
// consume the response entity
} finally {
httpget.releaseConnection();
}

Related

httpclient Connection reset [duplicate]

I'm creating a (well behaved) web spider and I notice that some servers are causing Apache HttpClient to give me a SocketException -- specifically:
java.net.SocketException: Connection reset
The code that causes this is:
// Execute the request
HttpResponse response;
try {
response = httpclient.execute(httpget); //httpclient is of type HttpClient
} catch (NullPointerException e) {
return;//deep down in apache http sometimes throws a null pointer...
}
For most servers it's just fine. But for others, it immediately throws a SocketException.
Example of site that causes immediate SocketException: http://www.bhphotovideo.com/
Works great (as do most websites): http://www.google.com/
Now, as you can see, www.bhphotovideo.com loads fine in a web browser. It also loads fine when I don't use Apache's HTTP Client. (Code like this:)
HttpURLConnection c = (HttpURLConnection)url.openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
Reader r = new InputStreamReader(in);
int i;
while ((i = r.read()) != -1) {
source.append((char) i);
}
So, why don't I just use this code instead? Well there are some key features in Apache's HTTP Client that I need to use.
Does anyone know what causes some servers to cause this exception?
Research so far:
Problem occurs on my local Mac dev machines AND an AWS EC2 Instance, so it's not a local firewall.
It seems the error isn't caused by the remote machine because the exception doesn't say "by peer"
This Stack Overflow question seems relevant: "java.net.SocketException: Connection reset", but the answers don't show why this would happen only with Apache HTTP Client and not with other approaches.
Bonus question: I'm doing a fair amount of crawling with this system. Is there generally a better Java class for this other than Apache HTTP Client? I've found a number of issues (such as the NullPointerException I have to catch in the code above). It seems that HTTPClient is very picky about server communications -- more picky than I'd like for a crawler that can't just break when a server doesn't behave.
Thanks all!
Solution
Honestly, I don't have a perfect solution, but it works, so that's good enough for me.
As pointed out by oleg below, Bixo has created a crawler that customizes HttpClient to be more forgiving to servers. To "get around" the issue more than fix it, I just used SimpleHttpFetcher provided by Bixo here:
(link removed - SO thinks I'm a spammer, so you'll have to google it yourself)
SimpleHttpFetcher fetch = new SimpleHttpFetcher(new UserAgent("botname","contact#yourcompany.com","ENTER URL"));
try {
FetchedResult result = fetch.fetch("ENTER URL");
System.out.println(new String(result.getContent()));
} catch (BaseFetchException e) {
e.printStackTrace();
}
The downside to this solution is that Bixo has a lot of dependencies, so this may not be a good workaround for everyone. However, you can always work through their use of DefaultHttpClient and see how they instantiated it to get it to work. I decided to use the whole class because it handles some helpful things for me, like automatically following redirects (and reporting the final destination URL).
Thanks for the help all.
Edit: TinyBixo
Hi all. So, I loved how Bixo worked, but didn't like that it had so many dependencies (including all of Hadoop). So, I created a vastly simplified Bixo, without all the dependencies. If you're running into the problems above, I would recommend using it (and feel free to make pull requests if you'd like to update it!)
It's available here: https://github.com/juliuss/TinyBixo
First, to answer your question:
The connection reset was caused by a problem on the server side. Most likely the server failed to parse the request or was unable to process it and dropped the connection as a result without returning a valid response. There is likely something in the HTTP requests generated by HttpClient that causes server side logic to fail, probably due to a server side bug. Just because the error message does not say 'by peer' does not mean the connection reset took place on the client side.
A few remarks:
(1) Several popular web crawlers such as bixo http://openbixo.org/ use HttpClient without major issues, but pretty much all of them had to tweak HttpClient behavior to make it more lenient about common HTTP protocol violations. By default HttpClient is rather strict about HTTP protocol compliance.
(2) Why did you not report the NPE problem, or any other problem you have been experiencing, to the HttpClient project?
These two settings will sometimes help:
client.getParams().setParameter("http.socket.timeout", Integer.valueOf(0));
client.getParams().setParameter("http.connection.stalecheck", Boolean.TRUE);
The first sets the socket timeout to be infinite.
Try getting a network trace using Wireshark, and augment that with log4j logging of HttpClient. That should show why the connection is being reset.

Releasing connections when using PoolingClientConnectionManager?

I'm using an Apache DefaultHttpClient with a PoolingClientConnectionManager and BasicResponseHandler. These are shared between different threads, and each thread creates its own HttpRequestBase extension.
Do I need to manually tell the manager that I'm done using the connection when using BasicResponseHandlers? Do I need to wrap it in a finally so exceptions don't cause a connection leak?
In other words, do I need to do this
HttpGet get = new HttpGet(address);
try {
httpclient.execute(get, new BasicResponseHandler());
} finally {
get.reset();
}
or is this enough ?
HttpGet get = new HttpGet(address);
httpclient.execute(get, new BasicResponseHandler());
I didn't see a clear answer in the Apache documentation.
This is enough and is recommended.
HttpClient#execute methods are guaranteed to automatically release all resources associated with the request in case of an exception (either I/O or runtime). When an HTTP response is processed using a ResponseHandler, the underlying connection gets automatically released back to the connection manager in all cases.

java.net.SocketException: Software caused connection abort: recv failed

I haven't been able to find an adequate answer to what exactly the following error means:
java.net.SocketException: Software caused connection abort: recv failed
Notes:
This error is infrequent and unpredictable; however, once it occurs, all future requests for URIs also fail.
The only solution that works (also, only occasionally) is to reboot Tomcat and/or the actual machine (Windows in this case).
The URI is definitely available (as confirmed by asking the browser to do the fetch).
Relevant code:
BufferedReader reader;
try {
URL url = new URL(URI);
reader = new BufferedReader(new InputStreamReader(url.openStream()));
} catch( MalformedURLException e ) {
throw new IOException("Expecting a well-formed URL: " + e);
}//end try: Have a stream
String buffer;
StringBuilder result = new StringBuilder();
while( null != (buffer = reader.readLine()) ) {
result.append(buffer);
}//end while: Got the contents.
reader.close();
This also happens if your TLS client cannot be authenticated by a server that is configured to require client authentication.
This usually means that there was a network error, such as a TCP timeout. I would start by placing a sniffer (Wireshark) on the connection to see if you can see any problems. If there is a TCP error, you should be able to see it. Also, you can check your router logs, if this is applicable. If wireless is involved anywhere, that is another source of this kind of error.
This error occurs when a connection is closed abruptly (when a TCP connection is reset while there is still data in the send buffer). The condition is very similar to the much more common 'Connection reset by peer'. It can happen sporadically when connecting over the Internet, but also systematically if the timing is right (e.g. with keep-alive connections on localhost).
An HTTP client should just re-open the connection and retry the request. It is important to understand that when a connection is in this state, there is no way out of it other than to close it. Any attempt to send or receive will produce the same error.
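As a rough illustration of the "re-open and retry" advice, here is a minimal JDK-only sketch. The class name RetryingFetch and the method withRetries are made up for this example and are not part of any library. It assumes the request is safe to repeat (e.g. an idempotent GET) and that every attempt opens a fresh connection rather than reusing the broken one.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of any library.
final class RetryingFetch {

    /** Runs the attempt up to maxAttempts times, retrying only on IOException. */
    static <T> T withRetries(int maxAttempts, Callable<T> attempt) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                // Each call is expected to open its own connection.
                return attempt.call();
            } catch (IOException e) {
                // Covers SocketException: the old connection is unusable, so try again.
                last = e;
            }
        }
        // Every attempt failed; surface the last I/O error to the caller.
        throw last;
    }
}
```

It could be used as RetryingFetch.withRetries(3, () -> fetch(uri)), where fetch stands for whatever code opens the URL and reads it. A real client would probably also wait between attempts.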
Don't use URL.openStream(); use Apache HttpClient (the sample below uses the HttpComponents 4.x API), which has a retry mechanism, connection pooling, keep-alive and many other features.
Sample usage:
HttpClient httpClient = HttpClients.custom()
.setConnectionTimeToLive(20, TimeUnit.SECONDS)
.setMaxConnTotal(400).setMaxConnPerRoute(400)
.setDefaultRequestConfig(RequestConfig.custom()
.setSocketTimeout(30000).setConnectTimeout(5000).build())
.setRetryHandler(new DefaultHttpRequestRetryHandler(5, true))
.build();
// the httpClient should be re-used because it is pooled and thread-safe.
HttpGet request = new HttpGet(uri);
HttpResponse response = httpClient.execute(request);
reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
// handle response ...
Are you accessing http data? Can you use the HttpClient library instead of the standard library? The library has more options and will provide better error messages.
http://hc.apache.org/httpclient-3.x/
The only time I've seen something like this happen is when I have a bad connection, or when somebody is closing the socket that I am using from a different thread context.
Try adding 'autoReconnect=true' to the jdbc connection string
This will happen from time to time either when a connection times out or when a remote host terminates their connection (closed application, computer shutdown, etc). You can avoid this by managing sockets yourself and handling disconnections in your application via its communications protocol and then calling shutdownInput and shutdownOutput to clear up the session.
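For what it's worth, here is one way the "manage the socket yourself" idea might look with plain java.net.Socket. The class and method names (GracefulClose, sendAndDrain) are invented for the example, and UTF-8 is an assumption. The sketch sends a message, half-closes the socket with shutdownOutput so the peer sees end-of-stream, reads the reply until EOF, and only then closes the socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class GracefulClose {

    /** Sends a message, half-closes the socket, then drains the reply until EOF. */
    static String sendAndDrain(Socket socket, String message) throws IOException {
        try {
            OutputStream out = socket.getOutputStream();
            out.write(message.getBytes(StandardCharsets.UTF_8));
            out.flush();
            // Tell the peer we are done sending, without closing the socket yet.
            socket.shutdownOutput();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            // read() returns -1 once the peer has closed its side.
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            // Both directions are finished (or failed), so release the socket.
            socket.close();
        }
    }
}
```

Whether the peer cooperates with a half-close depends on the protocol, so treat this as a sketch rather than a fix for every case.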
Check whether you have another service or program running on the HTTP port. It happened to me when I tried to use the port and it was already taken by another program.
If you are using Netbeans to manage Tomcat, try to disable HTTP monitor in Tools - Servers
I too had this problem. My solution was:
sc.setSoLinger(true, 10);
(Copied from a website:) By using the setSoLinger() method, you can explicitly set a delay before a reset is sent, giving more time for data to be read or sent.
Maybe it is not the answer to everybody but to some people.

How can I set a timeout against a BufferedReader based upon a URLConnection in Java?

I want to read the contents of a URL but don't want to "hang" if the URL is unresponsive. I've created a BufferedReader using the URL...
URL theURL = new URL(url);
URLConnection urlConn = theURL.openConnection();
urlConn.setDoOutput(true);
BufferedReader urlReader = new BufferedReader(new InputStreamReader(urlConn.getInputStream()));
...and then begun the loop to read the contents...
do
{
buf = urlReader.readLine();
if (buf != null)
{
resultBuffer.append(buf);
resultBuffer.append("\n");
}
}
while (buf != null);
...but if the read hangs then the application hangs.
Is there a way, without grinding the code down to the socket level, to "time out" the read if necessary?
I think URLConnection.setReadTimeout is what you are looking for.
If you have java 1.4:
I assume the connection timeout (URLConnection.setConnectTimeout(int timeout) ) is of no use because you are doing some kind of streaming.
---Do not kill the thread--- It may cause unknown problems, open descriptors, etc.
Spawn a java.util.TimerTask where you will check if you have finished the process, otherwise, close the BufferedReader and the OutputStream of the URLConnection
Insert a boolean flag isFinished and set it to true at the end of your loop and to false before the loop
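TimerTask ft = new TimerTask() {
    public void run() {
        if (!isFinished) {
            try {
                // closing the streams is meant to make the blocked read fail
                urlConn.getInputStream().close();
                urlConn.getOutputStream().close();
            } catch (IOException e) {
                // close() itself can fail; nothing more to do from the timer thread
            }
        }
    }
};
(new Timer()).schedule(ft, timeout);
This will probably cause an IOException in the reading loop, so you have to catch it there. The exception is not a bad thing in itself.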
I'm omitting some declarations (i.e. finals) so the anonymous class can access your variables. If not, then create a POJO that maintains a reference and pass that to the timertask
Since Java 1.5, it is possible to set the read timeout in milliseconds on the underlying socket via the 'setReadTimeout(int timeout)' method on the URLConnection class.
Note that there is also the 'setConnectTimeout(int timeout)' which will do the same thing for the initial connection to the remote server, so it is important to set that as well.
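To make that concrete, here is a small sketch of the question's read loop with both timeouts set. The class name TimedUrlReader and the choice of UTF-8 are assumptions for the example. If the server goes silent, getInputStream() or readLine() should throw java.net.SocketTimeoutException instead of blocking forever.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class TimedUrlReader {

    /** Reads the whole body at the given URL, failing instead of hanging. */
    static String read(String url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit for establishing the connection
        conn.setReadTimeout(readTimeoutMs);       // limit for each blocking read

        StringBuilder result = new StringBuilder();
        // try-with-resources closes the reader even when a timeout is thrown
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line).append('\n');
            }
        }
        return result.toString();
    }
}
```

Note that the read timeout applies to each individual read, not to the whole download, so a server that trickles data slowly can still keep the loop busy for a long time.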
I have been working on this issue in a JVM 1.4 environment just recently. The stock answer is to use the system properties sun.net.client.defaultReadTimeout (read timeout) and/or sun.net.client.defaultConnectTimeout. These are documented at Networking Properties and can be set via the -D argument on the Java command line or via a System.setProperty method call.
Supposedly these are cached by the implementation, so you can't change them from one value to another: once they have been used, the values are retained.
Also, they don't really work for SSL connections à la HttpsURLConnection. There are other ways to deal with that using a custom SSLSocketFactory.
Again, all this applies to JVM 1.4.x. At 1.5 and above you have more methods available to you in the API (as noted by the other responders above).
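A minimal sketch of the System.setProperty route, using the two property names mentioned above (the class name LegacyTimeoutDefaults is made up for the example). Since the values are reportedly cached on first use, the assumption here is that this runs early, before any URL connection is opened.

```java
// Hypothetical helper for illustration only.
final class LegacyTimeoutDefaults {

    /** Sets JVM-wide default timeouts (in milliseconds) for the built-in URL handlers. */
    static void apply(int connectTimeoutMs, int readTimeoutMs) {
        // Same effect as -Dsun.net.client.defaultConnectTimeout=... on the command line
        System.setProperty("sun.net.client.defaultConnectTimeout",
                Integer.toString(connectTimeoutMs));
        // Same effect as -Dsun.net.client.defaultReadTimeout=... on the command line
        System.setProperty("sun.net.client.defaultReadTimeout",
                Integer.toString(readTimeoutMs));
    }
}
```

For example, LegacyTimeoutDefaults.apply(5000, 30000) at the top of main would ask for a 5 second connect timeout and a 30 second read timeout.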
For Java 1.4, you may use SimpleHttpConnectionManager.getConnectionWithTimeout(hostConf, CONNECTION_TIMEOUT) from Apache Commons HttpClient.
