Computing file size from URL - Java

I have a lot of URLs and I want to calculate the total file size of all of them, without downloading the files themselves. The following approach works, but it takes a long time. Can anyone suggest a better approach?
Computing the file size:
int getFileSize(URL url) {
    HttpURLConnection conn = null;
    try {
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        conn.getInputStream();           // forces the request to be sent
        return conn.getContentLength();  // value of the Content-Length header, or -1
    } catch (IOException e) {
        return -1;
    } finally {
        if (conn != null) {
            conn.disconnect();
        }
    }
}

As HttpURLConnection.getContentLength() returns the value of the Content-Length header field, it may return -1 if that header has not been set.
So unless you can write something on the server that reports the file size, you may have to read the entire input stream.
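If you do end up having to read the stream, a minimal sketch of counting the bytes (without keeping them) could look like the following; the buffer size is an arbitrary choice:
// Fallback when the server sends no Content-Length: read and discard the body,
// counting bytes as we go. This does download the data, so use it only when necessary.
long sizeByReading(URL url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (InputStream in = conn.getInputStream()) {
        long total = 0;
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    } finally {
        conn.disconnect();
    }
}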

A HEAD request is your only option short of actually downloading everything, unless you have access to the file system on the server (which I doubt).
You can't get the size of a document without asking for it.

Your approach seems OK to me. You can't find out the length without asking the server. You could launch all of the HEAD requests in parallel to reduce the elapsed time - doing them sequentially means your program spends most of its time waiting for responses.
If there is no Content-Length, the HTTP spec explains how to determine the message body length. So do a HEAD, and if you get -1, do a GET and work out the length from that (see the sketch below).
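A rough sketch of the parallel approach, assuming java.util.concurrent imports and the getFileSize method from the question (the pool size is an arbitrary choice):
// Sketch: issue HEAD requests for all URLs in parallel and sum the sizes.
// Any URL that reports no Content-Length (or fails) contributes 0 here;
// you could fall back to a GET for those cases as described above.
long totalSize(List<URL> urls) throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    try {
        List<Future<Long>> futures = new ArrayList<Future<Long>>();
        for (final URL url : urls) {
            futures.add(pool.submit(new Callable<Long>() {
                public Long call() {
                    return (long) Math.max(0, getFileSize(url));
                }
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        return total;
    } finally {
        pool.shutdown();
    }
}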

Related

Java heap space error when uploading files through http basic authentication (JAVA) [duplicate]

I am trying to publish a large video/image file from the local file system to an HTTP path, but I run into an out-of-memory error after some time. Here is the code:
public boolean publishFile(URI publishTo, String localPath) throws Exception {
    InputStream istream = null;
    OutputStream ostream = null;
    boolean isPublishSuccess = false;
    URL url = makeURL(publishTo.getHost(), this.port, publishTo.getPath());
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    if (conn != null) {
        try {
            conn.setDoOutput(true);
            conn.setDoInput(true);
            conn.setRequestMethod("PUT");
            istream = new FileInputStream(localPath);
            ostream = conn.getOutputStream();
            int n;
            byte[] buf = new byte[4096];
            while ((n = istream.read(buf, 0, buf.length)) > 0) {
                ostream.write(buf, 0, n); // <--- ERROR happens on this line.......???
            }
            int rc = conn.getResponseCode();
            if (rc == 201) {
                isPublishSuccess = true;
            }
        } catch (Exception ex) {
            log.error(ex);
        } finally {
            if (ostream != null) {
                ostream.close();
            }
            if (istream != null) {
                istream.close();
            }
        }
    }
    return isPublishSuccess;
}
Here is the error I am getting:
Exception in thread "Thread-8773" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java:61)
at com.test.HTTPClient.publishFile(HTTPClient.java:110)
at com.test.HttpFileTransport.put(HttpFileTransport.java:97)
The HttpUrlConnection is buffering the data so that it can set the Content-Length header (per HTTP spec).
One alternative, if your destination server supports it, is to use "chunked" transfers. This will buffer only a small portion of data at a time. However, not all services support it (Amazon S3, for example, doesn't).
Another alternative (and imo a better one) is to use Jakarta HttpClient. You can set the "entity" in a request from a file, and the connection code will set request headers appropriately.
Edit: nos commented that the OP could call HttpURLConnection.setFixedLengthStreamingMode(int length). I was unaware of this method; it was added in 1.5 (a long overload arrived later, in Java 7), and I haven't used this class since then.
However, I still suggest using Jakarta HttpClient, for the simple reason that it reduces the amount of code that the OP has to maintain. Code that is boilerplate, yet still has the potential for errors:
The OP correctly handles the loop to copy between input and output. Usually when I see an example of this, the poster either doesn't properly check the returned buffer size, or keeps re-allocating the buffers. Congratulations, but you now have to ensure that your successors take as much care.
The exception handling isn't quite so good. Yes, the OP remembers to close the connections in a finally block, and again, congratulations on that. Except that either of the close() calls could throw IOException, keeping the other from executing. And the method as a whole throws Exception, so that the compiler isn't going to help catch similar errors.
I count 31 lines of code to set up and execute the request (excluding the response-code check and the URL computation, but including the try/catch/finally). With HttpClient, this would be somewhere in the range of half a dozen LOC.
Even if the OP had written this code perfectly, and refactored it into methods similar to those in Jakarta Commons IO, s/he shouldn't do that. This code has been written and tested by others. I know that it's a waste of my time to rewrite it, and suspect that it's a waste of the OP's time as well.
conn.setFixedLengthStreamingMode((int) new File(localPath).length());
And for buffering, you could wrap your streams in a BufferedOutputStream and a BufferedInputStream.
A good example of chunked uploading can be found in the gdata-java-client.
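As a rough sketch of how those two suggestions slot into the OP's method (only the streaming-mode call and the buffered wrappers are new; everything else mirrors the original):
// Sketch: tell HttpURLConnection the exact request length up front (before
// getOutputStream()) so it streams the body instead of buffering it in memory,
// and wrap both streams for buffering.
File file = new File(localPath);
conn.setFixedLengthStreamingMode((int) file.length()); // pre-Java 7: int overload only

InputStream istream = new BufferedInputStream(new FileInputStream(file));
OutputStream ostream = new BufferedOutputStream(conn.getOutputStream());

byte[] buf = new byte[4096];
int n;
while ((n = istream.read(buf)) > 0) {
    ostream.write(buf, 0, n); // no longer accumulates the whole file in memory
}
ostream.flush();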
The problem is that the HttpURLConnection class is using a byte array to store your data. Presumably this video you are pushing is taking more memory than available. You have a few options here:
Increase the memory available to your application. You can use the -Xmx1024m option to give 1GB of memory to your application. This will increase the amount of data you can store in memory.
If you still run out of memory, you might want to consider trying another library to push the video up that does not store the data all in memory at once. The Apache Commons HttpClient has such a feature. See this site for more information: http://hc.apache.org/httpclient-3.x/features.html. See this section for multi-part form upload of large files: http://hc.apache.org/httpclient-3.x/methods/multipartpost.html
For anything other than basic GET operations, the built-in java.net HTTP stuff isn't very good. Using Apache Commons HttpClient is recommended for this. It lets you do much more intuitive stuff like this:
HttpClient client = new HttpClient();
PutMethod put = new PutMethod(url);
put.setRequestEntity(new FileRequestEntity(localFile, contentType));
int responseCode = client.executeMethod(put);
which replaces a lot of your boiler-plate code.
HttpsURLConnection#setChunkedStreamingMode(1024 * 1024 * 10); //10MB chunk
This ensures that a file of any size is streamed over an HTTPS connection without internal buffering. It should be used when the file size (content length) is not known in advance.
Your problem is that you're trying to fit X bytes of video into X/N bytes of RAM, where N > 1.
You either need to read the video into a smaller buffer and write it out as you go, make the file smaller, or increase the memory available to your process.
Check your heap size. You can use -Xmx to increase it if you've taken the default.

HttpURLConnection slow to disconnect - Java / Android

I want to get the file size of a file on a remote connection without actually downloading the (large) file. I am using the "Content-Length" header of the file. The relevant code is:
URL obj = new URL(FILES_URL + fileName);
String contentLength = "";
HttpURLConnection conn = null;
try {
    conn = (HttpURLConnection) obj.openConnection();
    conn.setConnectTimeout(3000);
    conn.setReadTimeout(3000);
    contentLength = conn.getHeaderField("Content-Length");
    int responseCode = conn.getResponseCode();
    Log.d(TAG, "responseCode: " + responseCode);
} finally {
    Log.d(TAG, "pre-disconnect");
    if (conn != null) conn.disconnect();
    Log.d(TAG, "post-disconnect");
}
return contentLength;
The command "conn.disconnect();" sometimes seems to take forever. I have seen 23 seconds! Admittedly, this is connecting to a secondary local device which is running a web server, but the WiFi signal is strong, relatively fast, and I have never had any such problems using "curl" from my laptop. I do not have control over the web server I am connecting too.
The problem possibly is enhanced when making multiple similar connections to different files one after another, not sure. This is, however, creating entirely new HttpURLConnection's and not reusing the old one. Could reusing the connection help?
I never actually download the file or access the inputstream.
I could just not call disconnect, but I understand it is not recommended because resources would not be released. Is this not correct? I notice URLConnection doesn't have a disconnect. It is just suggested to close any streams you open.
This code is in an AsyncTask. I guess I could try moving the disconnect call itself into a further AsyncTask, because I don't do anything afterwards. Not sure if that is even possible.
Do you have any suggestions? Should I try something other than HttpURLConnection to get the file size without downloading the file?
Thanks to EJP in the comments. Changing the request method to "HEAD" made the disconnect almost instantaneous:
conn.setRequestMethod("HEAD");
From what I have read, HttpURLConnection.disconnect() will skip through the entire response body if it hasn't been read. Therefore, for very large files, it will take a long time. Using the request method "HEAD" forces the response body to be empty and solves the issue.
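For reference, a minimal sketch of the fixed lookup; it reuses the FILES_URL, fileName, TAG, and timeout values from the original code:
// Sketch: same lookup as the original, but issuing a HEAD request so there is
// no response body for disconnect() to skip over.
HttpURLConnection conn = null;
try {
    conn = (HttpURLConnection) new URL(FILES_URL + fileName).openConnection();
    conn.setRequestMethod("HEAD");   // empty body -> near-instant disconnect()
    conn.setConnectTimeout(3000);
    conn.setReadTimeout(3000);
    String contentLength = conn.getHeaderField("Content-Length");
    Log.d(TAG, "Content-Length: " + contentLength);
    return contentLength;
} finally {
    if (conn != null) conn.disconnect();
}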
I suggest you use either Volley or OkHttp for faster networking, depending on your requirements. Go through a comparison of Volley, OkHttp, and Retrofit and decide which library to use; a HEAD request with OkHttp is sketched below.
As a side note, if you are putting this code inside an AsyncTask, read up on the dark side of AsyncTask.
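A rough illustration of the OkHttp suggestion (assuming okhttp3 is on the classpath; the URL handling mirrors the original code):
// Sketch: fetch only the headers with OkHttp and read Content-Length.
OkHttpClient client = new OkHttpClient();
Request request = new Request.Builder()
        .url(FILES_URL + fileName)
        .head()                       // HEAD request: headers only, no body
        .build();
try (Response response = client.newCall(request).execute()) {
    String contentLength = response.header("Content-Length");
    Log.d(TAG, "Content-Length: " + contentLength);
}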

Who is tampering with my data stream?

The piece of code below downloads a file from some URL and saves it to a local file. Piece of cake. What could possibly be wrong here?
protected long download(ProgressMonitor monitor) throws Exception {
    long size = 0;
    DataInputStream dis = new DataInputStream(is);
    int read = 0;
    byte[] chunk = new byte[chunkSize];
    while ((read = dis.read(chunk)) != -1) {
        os.write(chunk, 0, read);
        size += read;
        if (monitor != null)
            monitor.worked(read);
    }
    chunk = null;
    dis.close();
    os.flush();
    os.close();
    return size;
}
The reason I am posting a question here is that it works 99.999% of the time, but doesn't work as expected whenever there is an antivirus or some other protection software installed on the computer running this code. I am blindly pointing a finger that way because whenever I stop (or disable) it, the code works perfectly again. The end result of such interference is that the MD5 of the downloaded file doesn't match the expected one, and a whole new saga begins.
So, the question is - is it really possible that some smart "protection" software would alter the actual stream coming from the URL without me knowing about it? And if yes - how do you deal with this? (Verified with Kaspersky and Norton products.)
EDIT-1:
Apparently I've got a handle on the problem, and it has nothing to do with antivirus software. The download takes place from an FTP server (FileZilla in particular) and we use the Apache Commons Net FTP client on our side. What I did was go to the FTP server and terminate the connection (kick the client out) in the middle of the download. I expected that is.read(..) would throw an IOException on the client side, but this never happened. Instead, is.read(..) returned -1, meaning that there is no more data coming from the stream. This is definitely unexpected and explains why I sometimes get partial files. It doesn't explain, however, why the data sometimes gets altered as well.
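One way to catch that kind of truncation is to compare the number of bytes actually copied against the size the server reports before the transfer. A rough sketch with Commons Net; the ftpClient field and remotePath variable are assumptions, and download() is the method above:
// Sketch: list the remote file first, then verify the byte count after the
// download; a premature -1 from read() then shows up as a size mismatch.
FTPFile[] files = ftpClient.listFiles(remotePath);
if (files.length != 1) {
    throw new IOException("Cannot determine remote size of " + remotePath);
}
long expected = files[0].getSize();

long copied = download(monitor);
if (copied != expected) {
    throw new IOException("Truncated download: got " + copied
            + " bytes, expected " + expected);
}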
Yeah, this happens to me all the time. In my case it's caused by transparent HTTP proxying by Websense on my corporate network. The worst problems are caused by the block page being returned with 200 OK.
Do you get the same or similar corruption every time? E.g., do you get some HTML explaining why the request was blocked? The best you can probably do is compare the first few bytes of the downloaded data to some text in the block page, and throw an exception in that case (see the sketch below).
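A rough sketch of that check; rawInputStream stands for whatever stream you would otherwise read from, and the "<html" marker is obviously an assumption that depends on what the block page looks like:
// Sketch: peek at the first bytes of the response and bail out if they look
// like an HTML block page rather than the expected binary payload.
PushbackInputStream in = new PushbackInputStream(rawInputStream, 512);
byte[] head = new byte[512];
int n = in.read(head);
String start = new String(head, 0, Math.max(n, 0), StandardCharsets.US_ASCII);
if (start.toLowerCase().contains("<html")) {
    throw new IOException("Download intercepted: got an HTML page instead of file data");
}
if (n > 0) {
    in.unread(head, 0, n);   // push the bytes back so the normal download can proceed
}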
Edit: based on your update, have you got the FTP client set to image/binary mode?
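For reference, with Commons Net that would be a one-liner (assuming an FTPClient instance named ftpClient):
// Binary (image) mode stops the FTP client from rewriting line endings, which
// would otherwise corrupt binary files and change their MD5.
ftpClient.setFileType(FTP.BINARY_FILE_TYPE);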

Reasonable to hold an HttpUrlConnection open indefinitely to a remote REST endpoint?

I am looking to optimize a process that runs continually and makes frequent calls (> 1 per second on average) to an external API via a simple REST-style HTTP post. One thing I've noticed is that currently the HttpURLConnection is created and closed for every API call, as per the following structure (non-essential code and error handling removed for readability).
// every API call
try {
    URL url = new URL("..remote_site..");
    conn = (HttpURLConnection) url.openConnection();
    setupConnectionOptions(conn); // sets things like timeout and usecaches false
    outputWriter = new OutputStreamWriter(new BufferedOutputStream(conn.getOutputStream()));
    // send request
} finally {
    conn.disconnect();
    outputWriter.close();
}
I don't have extensive experience dealing with the http protocol directly, but based on common sense / knowledge of sockets in general it seems that it would be much more efficient to only create the connection once and re-use it, and only reinitialize it on a problem, to avoid the connection negotiation each time, like this:
// on startup, or error
private void initializeConnection() throws IOException {
    URL url = new URL("..remote_site..");
    conn = (HttpURLConnection) url.openConnection();
    setupConnectionOptions(conn); // sets things like timeout and usecaches false
}

// per request
try {
    outputWriter = new OutputStreamWriter(new BufferedOutputStream(conn.getOutputStream()));
    // send request
} catch (IOException e) {
    conn.disconnect();
    initializeConnection();
} finally {
    outputWriter.close();
}

// on graceful exit
conn.disconnect();
My questions are:
is this a reasonable optimization in general (will the speed increase be noticeable)?
Assuming yes:
should I reuse the output stream as well the connection?
is it reasonable to only reinitialize connection on error, or should I do it after a certain number of requests / time?
Basically, yes, and it saves a lot of time - setting up a socket takes significant effort, even more so with SSL. That's why "keepalive" was implemented back in the Old Days. It's a little bit counter to the REST philosophy, but it's a performance optimization.
The one caveat is that sockets are a limited resource; in a really heavy-use environment, you could end up with no sockets left for new connections. This is a Bad Thing.
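It is also worth noting that HttpURLConnection already pools and reuses sockets behind the scenes (HTTP keep-alive) as long as each response is fully read and its stream closed. A rough sketch of a per-request pattern that lets that happen; the endpoint and requestBody byte array are placeholders:
// Sketch: create a new HttpURLConnection per request, but drain and close the
// response so the underlying socket can return to the keep-alive pool.
HttpURLConnection conn = (HttpURLConnection) new URL("..remote_site..").openConnection();
conn.setDoOutput(true);
try (OutputStream out = new BufferedOutputStream(conn.getOutputStream())) {
    out.write(requestBody);                  // send the request
}
try (InputStream in = conn.getInputStream()) {
    while (in.read() != -1) { /* drain */ }  // fully consume the response
}
// no disconnect(): disconnect() signals that the socket is unlikely to be reused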
