Really what I'm wondering: is Python's urllib2 more like Java's HttpURLConnection, or more like Apache's HttpClient? Ultimately, I'm wondering whether urllib2 scales when used in an HTTP server, or whether there is some alternate library that is used when performance is an issue (as is the case in the Java world).
To expand on my question a bit:
Java's HttpURLConnection internally holds one connection open per host and does pipelining. So if you do the following concurrently across threads, it won't perform well:
HttpURLConnection cxn = (HttpURLConnection) new URL("http://www.google.com/").openConnection();
InputStream is = cxn.getInputStream();
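For completeness: the size of HttpURLConnection's keep-alive cache is only tunable globally, via the http.maxConnections system property (default 5), which caps idle cached connections per destination host; there is no per-client API. A minimal sketch:

```java
public class KeepAliveConfig {
    public static void main(String[] args) {
        // http.maxConnections caps the number of idle keep-alive
        // connections the JVM caches per destination host (JVM-wide).
        System.setProperty("http.maxConnections", "20");
        System.out.println(System.getProperty("http.maxConnections")); // prints 20
    }
}
```

In practice this is usually passed as -Dhttp.maxConnections=20 on the command line, since it must be set before the HTTP protocol handler is first used.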
By comparison, Apache's HttpClient can be initialized with a connection pool, like this:
MultiThreadedHttpConnectionManager cm = new MultiThreadedHttpConnectionManager();
HttpConnectionManagerParams p = new HttpConnectionManagerParams();
p.setMaxConnectionsPerHost(HostConfiguration.ANY_HOST_CONFIGURATION, 20);
p.setMaxTotalConnections(100);
p.setConnectionTimeout(100);
p.setSoTimeout(250);
cm.setParams(p);

// this instance can be a singleton and shared across threads safely:
HttpClient client = new HttpClient(cm);
The important part of the example above is that the total number of connections and the per-host connections are configurable.
In a comment, urllib3 was mentioned, but I can't tell from reading the docs whether it allows a per-host max to be set.
As of Python 2.7.14rc1, no.
For urllib, urlopen() eventually calls httplib.HTTP, which creates a new instance of HTTPConnection. HTTPConnection is tied to a socket and has methods for opening and closing it.
For urllib2, HTTPHandler does something similar and creates a new instance of HTTPConnection.
Related
I am trying to implement something like a circuit breaker for my Sesame connections to the back-end database. When the database is absent, I want to know this after 2 seconds rather than relying on the client's default timeouts. I could possibly overcome this with my own FutureTasks, in which I would execute the repository initialization and obtain the connection. However, in the logs I can see that the Sesame client uses o.a.h.i.c.PoolingClientConnectionManager, which I assume is passed an ExecutorService and some default timeouts. This would make my FutureTask solution pretty messy. Is there an easier way to set timeouts for the Sesame client?
You can set the query and update timeout, specifically, on the query/update object itself:
RepositoryConnection conn = ....;
...
TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, "SELECT ...");
query.setMaxExecutionTime(2);
However, if you want to set a general timeout for all API calls over HTTP, the only way to do that currently is by obtaining a reference to the HttpClient object and reconfiguring it:
HTTPRepository repo = ....;
AbstractHttpClient httpClient = (AbstractHttpClient) ((SesameClientImpl) repo.getSesameClient()).getHttpClient();
HttpParams params = httpClient.getParams();
params.setIntParameter(CoreConnectionPNames.SO_TIMEOUT, 2000);
httpClient.setParams(params);
As you can see, this is rather brittle (lots of explicit casts), and uses an approach that is deprecated in Apache HttpClient 4.4. So I don't exactly recommend this as a stable solution, but it should provide a workaround in the short term.
In the longer term, the Sesame dev team is working on more convenient access to the configuration of the HttpClient.
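If fail-fast behavior is still needed at connection-acquisition time, the FutureTask idea from the question can be kept reasonably tidy with a plain ExecutorService. A minimal stdlib sketch, with a hypothetical openConnection() standing in for the actual Sesame repository call:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFastConnect {
    // hypothetical stand-in for repository initialization / getConnection()
    static String openConnection() throws InterruptedException {
        Thread.sleep(100); // simulate connection setup work
        return "connected";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(FailFastConnect::openConnection);
        try {
            // give up after 2 seconds instead of waiting on client defaults
            String conn = future.get(2, TimeUnit.SECONDS);
            System.out.println(conn);
        } catch (TimeoutException e) {
            future.cancel(true); // database is absent: trip the breaker here
        } finally {
            pool.shutdown();
        }
    }
}
```

Note this only bounds the wait in the calling thread; the underlying HTTP connection attempt may still linger in the background until its own socket timeout fires.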
Background:
I am using HttpClient (SolrJ) to connect to a Solr service. The question is not directly related to Solr though.
I bumped into the following issue when doing load testing:
Caused by: java.lang.IllegalStateException: Invalid use of BasicClientConnManager: connection still allocated.
SO answer - use a pooled connection manager:
Invalid use of BasicClientConnManager: connection still allocated
Question:
I am using the PoolingClientConnectionManager, as in the following code. Instead of manually throttling the connection size, I would like it to be managed by the AIMDBackoffManager. However, I see that AIMDBackoffManager needs the connection pool as its parameter.
public static final PoolingClientConnectionManager poolingConnectionManager = new PoolingClientConnectionManager();

public static DefaultHttpClient getHttpClient() {
    DefaultHttpClient httpClient = new DefaultHttpClient(poolingConnectionManager);
    httpClient.setBackoffManager(new AIMDBackoffManager(poolingConnectionManager));
    ...
    ...
}
I googled a fair bit, but I am unable to find any examples of BackoffManager usage. So this is what I did, but I am not excited about passing the connection manager twice to the DefaultHttpClient. Or should I not be worried, considering that the first time I am passing it to the HttpClient and the second time to the BackoffManager?
I am using httpclient-4.2.3
I ventured into this deep water as well. I have been investigating how to use ServiceUnavailableRetryStrategy, which seems to fail due to the BackoffManager in my case. I have the impression that this is not finished functionality, as I can't find examples of its usage and there is not much in the HttpClient source code either.
The AIMDBackoffManager constructor takes a ConnPoolControl (which the connection manager implements). Looking at this interface, you'll see it only returns route-specific statistics of the pool, which is what the BackoffManager uses to perform its tasks.
So you should not be worried about passing the connection manager twice while building the client; just be aware that AIMDBackoffManager acquires a lock on the connection manager in its backOff and probe implementations, as you can see in the source.
I'm using an Apache DefaultHttpClient with a PoolingClientConnectionManager and BasicResponseHandler. These are shared between different threads, and each thread creates its own HttpRequestBase extension.
Do I need to manually tell the manager that I'm done using the connection when using BasicResponseHandlers? Do I need to wrap it in a finally so exceptions don't cause a connection leak?
In other words, do I need to do this
HttpGet get = new HttpGet(address);
try {
    httpclient.execute(get, new BasicResponseHandler());
} finally {
    get.reset();
}
or is this enough?
HttpGet get = new HttpGet(address);
httpclient.execute(get, new BasicResponseHandler());
I didn't see a clear answer in the Apache documentation.
This is enough, and it is the recommended approach.
HttpClient#execute methods are guaranteed to automatically release all resources associated with the request in case of an exception (either I/O or runtime). When an HTTP response is processed using a ResponseHandler, the underlying connection is automatically released back to the connection manager in all cases.
SimpleHttpConnectionManager being used incorrectly. Be sure that HttpMethod.releaseConnection() is always called and that only one thread and/or method is using this connection manager at a time.
Does anyone know why this error shows up, and whether it causes the files I want to download to fail and retry, or to download incomplete?
Thank you!
Make sure that you don't use SimpleHttpConnectionManager to create and use connections from multiple threads. The simple connection manager is not designed for that: it always returns the same connection, which is not thread safe.
In a multi-threaded environment, use a different manager that uses a pool of connections. See MultiThreadedHttpConnectionManager.
I'd prefer to take no credit for this, but as per Eyal Schneider's answer, you can find more info on using MultiThreadedHttpConnectionManager in Vincent de Villers' excellent blog.
Code snippet copied in case the link ever disappears:
HttpClient httpclient = new HttpClient(new MultiThreadedHttpConnectionManager());
GetMethod httpget = new GetMethod("http://www.myhost.com/");
try {
    httpclient.executeMethod(httpget);
    Reader reader = new InputStreamReader(
            httpget.getResponseBodyAsStream(), httpget.getResponseCharSet());
    // consume the response entity
} finally {
    httpget.releaseConnection();
}
I want to read the contents of a URL but don't want to "hang" if the URL is unresponsive. I've created a BufferedReader using the URL...
URL theURL = new URL(url);
URLConnection urlConn = theURL.openConnection();
urlConn.setDoOutput(true);
BufferedReader urlReader = new BufferedReader(new InputStreamReader(urlConn.getInputStream()));
...and then began the loop to read the contents...
do {
    buf = urlReader.readLine();
    if (buf != null) {
        resultBuffer.append(buf);
        resultBuffer.append("\n");
    }
} while (buf != null);
...but if the read hangs then the application hangs.
Is there a way, without grinding the code down to the socket level, to "time out" the read if necessary?
I think URLConnection.setReadTimeout is what you are looking for.
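A short sketch of that approach (the URL here is just a placeholder; openConnection() performs no network I/O, so the timeouts can be demonstrated offline):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://www.example.com/").openConnection();
        conn.setConnectTimeout(2000); // max ms to establish the TCP connection
        conn.setReadTimeout(2000);    // max ms any single read() may block
        // a blocked read now throws java.net.SocketTimeoutException
        // instead of hanging the application
        System.out.println(conn.getReadTimeout()); // prints 2000
    }
}
```

The readLine() loop from the question then needs a catch for SocketTimeoutException, which is where you decide whether to retry or give up.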
If you have Java 1.4:
I assume the connection timeout (URLConnection.setConnectTimeout(int timeout)) is of no use because you are doing some kind of streaming.
Do not kill the thread: it may cause unknown problems, open descriptors, etc.
Spawn a java.util.TimerTask in which you check whether you have finished the process; otherwise, close the BufferedReader and the OutputStream of the URLConnection.
Insert a boolean flag isFinished; set it to false before the loop and to true at the end of the loop.
TimerTask ft = new TimerTask() {
    public void run() {
        if (!isFinished) {
            try {
                urlConn.getInputStream().close();
                urlConn.getOutputStream().close();
            } catch (IOException e) {
                // streams already closed or never opened; ignore
            }
        }
    }
};
(new Timer()).schedule(ft, timeout);
This will probably cause an IOException in the reading thread, so you have to catch it. The exception is not a bad thing in itself.
I'm omitting some declarations (e.g. finals) so the anonymous class can access your variables. If not, create a POJO that maintains a reference and pass that to the TimerTask.
Since Java 1.5, it is possible to set the read timeout in milliseconds on the underlying socket via the setReadTimeout(int timeout) method on the URLConnection class.
Note that there is also setConnectTimeout(int timeout), which does the same thing for the initial connection to the remote server, so it is important to set that as well.
I have been working on this issue in a JVM 1.4 environment just recently. The stock answer is to use the system properties sun.net.client.defaultReadTimeout (read timeout) and/or sun.net.client.defaultConnectTimeout (connect timeout). These are documented at Networking Properties and can be set via the -D argument on the Java command line or via a System.setProperty call.
Supposedly these are cached by the implementation, so you can't change them from one value to another; once they are used, the values are retained.
Also, they don't really work for SSL connections (HttpsURLConnection). There are other ways to deal with that using a custom SSLSocketFactory.
Again, all this applies to JVM 1.4.x. At 1.5 and above you have more methods available in the API (as noted by the other responders above).
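For reference, the property approach looks like this (values are strings of milliseconds; per the caching caveat, they should be set once, before any HTTP connection is made):

```java
public class LegacyTimeouts {
    public static void main(String[] args) {
        // equivalent to passing these on the command line:
        //   -Dsun.net.client.defaultConnectTimeout=2000
        //   -Dsun.net.client.defaultReadTimeout=2000
        // note these are JVM-wide defaults, not per-connection settings
        System.setProperty("sun.net.client.defaultConnectTimeout", "2000");
        System.setProperty("sun.net.client.defaultReadTimeout", "2000");
        System.out.println(System.getProperty("sun.net.client.defaultReadTimeout")); // prints 2000
    }
}
```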
For Java 1.4, you may use Apache's SimpleHttpConnectionManager.getConnectionWithTimeout(hostConf, CONNECTION_TIMEOUT).