Download xml.gz file with HttpsURLConnection - java

I am trying to download an xml.gz file from a remote server with HttpsURLConnection in java, but I am getting an empty response. Here is a sample of my code:
URL server = new URL("https://www.myurl.com/path/sample_file.xml.gz");
HttpsURLConnection connection = (HttpsURLConnection)server.openConnection();
connection.connect();
When I try to get an InputStream from the connection, it is empty. (If I try connection.getInputStream().read() I get -1) The file I am expecting is approximately 50MB.
To test my sanity, I also tried entering the exact same URL in my browser, and it did return the file I needed. Am I missing something? Do I have to set some sort of parameter in the connection? Any help/direction is much appreciated.

Is any exception being logged? Is the website presenting a self-signed SSL certificate, or one that is not signed by a CA? There are several reasons why it might work fine in your browser (the browser might have been told to accept self-signed certs from that domain) and not in your code.
What are the results of using curl or wget to fetch the URL?
The fact that the InputStream is empty (InputStream.read() == -1) means the end of the stream was reached immediately, so the response had no body. That suggests the server answered with something other than the file itself, so the response code is worth checking.
Update: See this page for some info on how you can deal with invalid/self-signed certificates in your connection code. Or, if the site is presenting a certificate but it is invalid, you can import it into the keystore of the server to tell Java to trust the certificate. See this page for more info.

Verify the response code is 200
Check connection.getContentType() to verify that the content type is recognized
You may need to add a ContentHandler for the gzip MIME type, which I can't recall off the top of my head.
After the comment describing the response code as 3xx,
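Set 'connection.setFollowRedirects(true)' (a static setting on HttpURLConnection; the per-connection equivalent is setInstanceFollowRedirects(true))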
Should fix it.
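The checks listed in this answer can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the class name GzipDownload and the looksGzipped heuristic are hypothetical, and decompressing with GZIPInputStream is an assumption about what the asker wants to do with the .xml.gz body (saving the raw bytes would not need it). The gzip helper exists only to build sample input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipDownload {
    /** Heuristic (an assumption): gzip if the content type or file name says so. */
    static boolean looksGzipped(String contentType, String path) {
        String type = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        return type.contains("gzip") || (path != null && path.endsWith(".gz"));
    }

    /** Verifies the status is 200, then returns a stream of (decompressed) bytes. */
    static InputStream openBody(HttpURLConnection conn) throws IOException {
        int code = conn.getResponseCode();
        if (code != HttpURLConnection.HTTP_OK) {
            // A 3xx code here means the file lives at the Location URL instead.
            throw new IOException("Unexpected HTTP " + code
                    + ", Location=" + conn.getHeaderField("Location"));
        }
        InputStream raw = conn.getInputStream();
        return looksGzipped(conn.getContentType(), conn.getURL().getPath())
                ? new GZIPInputStream(raw) : raw;
    }

    /** Decompresses a complete gzip payload held in memory. */
    static byte[] gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses bytes with gzip; used here only to create sample input. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

With the connection from the question, GzipDownload.openBody(connection) would either return a readable stream or fail with a message that shows the status code and any Location header.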

Turns out the download wasn't working because the remote server was redirecting me to a new url to download the file. Even though connection.setFollowRedirects(true) was set, I still had to manually set up a new connection for the redirected URL as follows:
// The server answered with a 302 redirect: open the target URL by hand.
if (connection.getResponseCode() == 302 && connection.getHeaderField("location") != null) {
    URL server2 = new URL(connection.getHeaderField("location"));
    HttpURLConnection connection2 = (HttpURLConnection) server2.openConnection();
    connection2.connect();
    InputStream in = connection2.getInputStream();
    // read the file from 'in' here
}
After that, I was able to retrieve the file from the input stream. Thanks for all your help guys!
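The manual step described here can be generalized to a short chain of redirects. The following is a sketch under assumptions, not the asker's final code: the class RedirectFollower, its method names, and the maxHops limit are all hypothetical, and it assumes every hop is an http or https URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RedirectFollower {
    /** True for the status codes that carry a Location header to follow. */
    static boolean isRedirect(int code) {
        return code == 301 || code == 302 || code == 303 || code == 307 || code == 308;
    }

    /** Resolves a Location value, which may be relative, against the current URL. */
    static String resolveLocation(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    /**
     * Opens startUrl and follows redirects by hand, including ones that switch
     * between https and http. Returns the first connection that is not a
     * redirect; the caller reads getInputStream() from it.
     */
    static HttpURLConnection openFollowingRedirects(String startUrl, int maxHops)
            throws IOException {
        String current = startUrl;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we handle redirects ourselves
            int code = conn.getResponseCode();
            if (!isRedirect(code)) {
                return conn;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolveLocation(current, location);
        }
        throw new IOException("More than " + maxHops + " redirects starting at " + startUrl);
    }
}
```

A call such as RedirectFollower.openFollowingRedirects("https://www.myurl.com/path/sample_file.xml.gz", 5) would then stand in for the two hand-written connections.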

Related

Downloading binary file from url

I am using this code to download files from a url:
FileUtils.copyURLToFile(url, new File("C:/Songs/newsong.mp3"));
When I create the URL from, for instance,
"https://mjcdn.cc/2/282676442/MjUgU2FhbCAtIFZlZXQgQmFsaml0Lm1wMw==",
this works just fine and the mp3 is downloaded.
However, if I use another URL:
"https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3",
the file is 0 KB.
I am able to download files from both of these URLs in a web browser.
What's wrong here, and how can I fix it?
I noticed a difference between your 2 URLs:
The first one just gives back the file without redirection.
But the second one responds with a redirect (HTTP/1.1 302 Moved Temporarily). It's also a special case, because it's a redirect from HTTPS to HTTP protocol.
Browsers can follow redirects, but your program - for some reason (see below) - can't.
I suggest using an HTTP client library (e.g. Apache HttpClient or Jsoup) and configuring it to follow redirects (if it doesn't do so by default).
For example, with Jsoup, you would need code like this:
import java.io.FileOutputStream;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

String url = "https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3";
String filename = "C:/Songs/newsong.mp3";
Response r = Jsoup.connect(url)
        //.followRedirects(true) // follow redirects (it's the default)
        .ignoreContentType(true) // accept not just HTML
        .maxBodySize(10*1000*1000) // accept 10M bytes (default is 1M), or set to 0 for unlimited
        .execute(); // send GET request
// try-with-resources closes the file even if the write fails
try (FileOutputStream out = new FileOutputStream(filename)) {
    out.write(r.bodyAsBytes());
}
Update on @EJP's comment:
I looked up Apache Commons IO's FileUtils class on GitHub. It calls openStream() of the received URL object.
openStream() is shorthand for openConnection().getInputStream().
openConnection() returns a URLConnection object. If there is an appropriate subclass for the protocol used by the URL, it will return an instance of that subclass. In this case that's an HttpsURLConnection, which is a subclass of HttpURLConnection.
The followRedirects option is defined in HttpURLConnection and it's indeed true by default:
Sets whether HTTP redirects (requests with response code 3xx) should be automatically followed by this class. True by default.
So OP's approach would normally work with redirects too, but a redirect from HTTPS to HTTP is not followed by HttpsURLConnection: the JDK's automatic redirect handling does not follow redirects that switch protocol. This is the case that @VGR mentioned in the comments below.
It's possible to handle redirects manually by reading the Location header with HttpsURLConnection and then opening a new HttpURLConnection for that URL. (Example) (I wouldn't be surprised if Jsoup did the same.)
I suggested Jsoup because it already implements a way to handle HTTPS to HTTP redirections correctly and also provides tons of useful features.
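For readers on Java 11 or later, the JDK's own java.net.http.HttpClient is another option that was not part of this answer. Its Redirect.ALWAYS policy follows redirects even from HTTPS to HTTP (Redirect.NORMAL does not). The sketch below is only an illustration; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

class HttpClientDownload {
    /** Builds a client that follows every redirect, including https to http. */
    static HttpClient newRedirectingClient() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.ALWAYS)
                .build();
    }

    /**
     * Downloads url into target and returns the final HTTP status code.
     * The body is written to the file whatever the status is, so the caller
     * should check for 200 before trusting the file contents.
     */
    static int download(String url, Path target) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<Path> response =
                newRedirectingClient().send(request, HttpResponse.BodyHandlers.ofFile(target));
        return response.statusCode();
    }
}
```

Following an HTTPS to HTTP redirect means the file itself travels unencrypted, which is presumably why the default policies refuse to do it silently.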

HTTP response code of text file is 460 when file contains 1.0.2?

I'm using an HttpsURLConnection to grab the only line in a text file hosted on Dropbox, as an update checker (for a Minecraft mod). The relevant code is below:
URL url = new URL(linkToVersionFile);
HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();
connection.setConnectTimeout(999);
int responseCode = connection.getResponseCode();
Normally, this works fine, except when the text in the file is "1.0.2". When it's 1.0.2, the request returns a 460 response code, which I can't seem to find in any list of response codes. The accompanying response message is "Restricted", though.
If the file contains "1.0.1", "1.0.3", "1.1.2", "1.2.2" or even "2.0.2" it works just fine. Nothing changes but the 5 characters located in the file. The same thing happens if different files are used, and given the text "1.0.2", so it's not a corrupt file.
While I can get around it by either avoiding 1.0.2 and moving straight to 1.0.3, or writing it as "102" instead, it's just such an unusual problem that I was wondering if anyone had an explanation :P
If more information or test results are required, let me know.
Thanks in advance :)

Get the redirected URL of a very specific URL (in Java)

How can I get the redirected URL of http://at.atwola.com/?adlink/5113/1649059/0/2018/AdId=4041444;BnId=872;itime=15692006;impref=13880156912668385284; in Java?
My code (given below) is constructed according to answers to similar questions on Stack Overflow (https://stackoverflow.com/a/5270162/1382251 in particular).
But it just yields the original URL. I suspect that there are other similar cases, so I would like to resolve this one specifically and then apply the solution in general.
String ref = "http://at.atwola.com/?adlink/5113/1649059/0/2018/AdId=4041444;BnId=872;itime=15692006;impref=13880156912668385284;";
try
{
    URLConnection con1 = new URL(ref).openConnection();
    con1.connect();
    InputStream is = con1.getInputStream();
    URL url = con1.getURL();
    is.close();
    String finalPage = url.toString();
    if (finalPage.equals(ref))
    {
        HttpURLConnection con2 = (HttpURLConnection)con1;
        con2.setInstanceFollowRedirects(false);
        con2.connect();
        if (con2.getResponseCode()/100 == 3)
            finalPage = con2.getHeaderField("Location");
    }
    System.out.println(finalPage);
}
catch (Exception error)
{
    System.out.println("error");
}
I played a bit with your URL with telnet, wget, and curl and I noticed that in some cases the server returns response 200 OK, and sometimes 302 Moved Temporarily. The main difference seems to be the request User-agent header. Your code works if you add the following before con1.connect():
con1.setRequestProperty("User-Agent","");
That is, with empty User-Agent (or if the header is not present at all), the server issues a redirect. With the Java User-Agent (in my case User-Agent: Java/1.7.0_45) and with the default curl User-Agent (User-Agent: curl/7.32.0) the server responds with 200 OK.
In some cases you might need to also set:
System.setProperty("http.agent", "");
See Setting user agent of a java URLConnection
The server running the site is the Adtech Adserver and apparently it is doing user agent sniffing. There is a long history of user agent sniffing. So it seems that the safest thing to do would be to set the user agent to Mozilla:
con1.setRequestProperty("User-Agent","Mozilla"); //works with your code for your URL
Maybe the safest option would be to use a user agent used by some of the popular web browsers.
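Putting the pieces of this answer together, a minimal sketch might look like the following. The class name UserAgentProbe and its methods are hypothetical, and the User-Agent value is whatever the caller chooses to pass; the sketch disables automatic redirects so that the Location header can be read directly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class UserAgentProbe {
    /**
     * Creates (but does not yet connect) a connection that sends the given
     * User-Agent and does not follow redirects automatically.
     */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request; returns the Location header for a 3xx reply, else null. */
    static String redirectTarget(HttpURLConnection conn) throws IOException {
        int code = conn.getResponseCode();
        return (code / 100 == 3) ? conn.getHeaderField("Location") : null;
    }
}
```

For the URL in the question, redirectTarget(prepare(ref, "Mozilla")) would be the call to try; whether the server redirects still depends on its user agent sniffing, as described in this answer.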

Guaranteed way to correctly get the contents of www.bing.com/

I have been working on a program that gets the contents of www.bing.com and saves it to a file. Of the two approaches I have tried, one using sockets and the other using HtmlUnit, neither shows the contents 100% correctly when I open the file. I know there are other options out there, but I am looking for one that is guaranteed to get the contents of www.bing.com/ correctly. I would therefore appreciate it if someone could point me to a means of accomplishing this.
The differences you see are likely due to the web server providing different content to different browsers based on the user agent string and other request headers.
Try setting the User-Agent header in your socket and HtmlUnit strategies to the one you are comparing against and see if the result is as expected. Moreover, you will likely have to replicate the request headers exactly as they are sent by your target browser.
What is "incorrect" about what is returned? Keep in mind, Bing is probably generating some of the content via JavaScript; your client will need to make additional requests to retrieve the JavaScript files, run the JavaScript, etc.
You can use a URL.openConnection() to create a URLConnection and call URLConnection.getInputStream(). You can read the InputStream contents and write it to a file.
If you need to override the User-Agent because the server is using it to serve different content, you can do so by first setting the http.agent system property to an empty string.
/* Somewhere in your code before you make requests */
System.setProperty("http.agent", "");
or using -Dhttp.agent= on your java command line
and then setting the User-Agent to something useful on the connection before you get the InputStream.
URLConnection conn = ... //Create your URL connection as described above.
String userAgent = ... //Some user-agent string here.
conn.setRequestProperty("User-Agent", userAgent);

java.io.IOException: Server returned HTTP response code: 500

I'm facing this problem with Java. I want to get some HTML information from a URL. This code had been working for a long time, but it suddenly stopped working.
When I access this URL using the browser, it opens with no problem.
The code:
URL site = new URL(this.url);
java.net.URLConnection yc = site.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
String objetivo = "<td height=\"28\" colspan=\"2\"";
while ((inputLine = in.readLine()) != null && !inputLine.contains(objetivo)) {
}
inputLine = in.readLine();
The Exception:
java.io.IOException: Server returned HTTP response code: 500 for URL: http://www.myurl.com
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at Sites.websites.Site1.getData(Site1.java:53)
at util.Util.lerArquivo(Util.java:278)
at util.Util.main(Util.java:983)
What's wrong? Did the host block me?
HTTP status code 500 usually means that the web server code has crashed. You need to determine the status code beforehand using HttpURLConnection#getResponseCode() and, in case of errors, read HttpURLConnection#getErrorStream() instead, since it may contain information about the problem.
If the host has blocked you, you would rather have gotten a 4nn status code like 401 or 403.
See also:
How to use URLConnection to fire and handle HTTP requests?
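The advice to check the status code first and read the error stream on failure could be sketched like this. The class ErrorBodyReader and its method names are hypothetical, and decoding the body as UTF-8 is an assumption; a real page may declare a different charset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.nio.charset.StandardCharsets;

class ErrorBodyReader {
    /** Reads a whole stream as UTF-8 text; a null stream yields an empty string. */
    static String readAll(InputStream in) {
        if (in == null) {
            return ""; // getErrorStream() returns null when the server sent no error body
        }
        try (InputStream body = in) {
            return new String(body.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns the response body for a successful request. For a 4xx or 5xx
     * status, throws an exception that carries the server's error page, which
     * getInputStream() alone would have hidden.
     */
    static String fetch(HttpURLConnection conn) throws IOException {
        int code = conn.getResponseCode();
        if (code >= 400) {
            throw new IOException("HTTP " + code + ": " + readAll(conn.getErrorStream()));
        }
        return readAll(conn.getInputStream());
    }
}
```

In the question's code, the connection would first need a cast to HttpURLConnection, since plain URLConnection has neither getResponseCode() nor getErrorStream().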
This Status Code 500 is an Internal Server Error. This code indicates that a part of the server (for example, a CGI program) has crashed or encountered a configuration error.
I think the problem doesn't lie on your side, but rather on the side of the HTTP server.
The resources you used to access may have been moved or corrupted, or the server's configuration may have been changed or broken.
I had this problem i.e. works fine when pasted into browser but 505s when done through java. It was simply the spaces that needed to be escaped/encoded.
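One way to escape such spaces, offered as a sketch rather than as what this answerer actually did, is to let java.net.URI quote the path and URLEncoder encode individual query values. The class name UrlEscaping and the sample host and path are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class UrlEscaping {
    /** Builds a URL string whose path has illegal characters percent-encoded. */
    static String encodePath(String scheme, String host, String path) {
        try {
            // The multi-argument URI constructors quote characters such as spaces.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Encodes a single query parameter value (form encoding: space becomes '+'). */
    static String encodeQueryValue(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, encodePath("http", "www.myurl.com", "/my file.html") yields http://www.myurl.com/my%20file.html, which a server can parse where the raw space would have broken the request line.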
The problem must be with the parameters you are passing (you must be passing blank parameters). For example: http://www.myurl.com?id=5&name=
Check if you are handling this at the server you are calling.
Changing the Content-Type to "application/x-www-form-urlencoded" solved the problem for me.
You may look within the first server response and see if the server sent you a cookie.
To check if the server sent you a cookie, you can use HttpURLConnection#getHeaderFields() and look for headers named "Set-Cookie".
If existing, here's the solution for your problem. 100% Working for this case!
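The solution this answer points to is not reproduced here. One common approach, assumed rather than taken from the answer, is to send the cookies from the first response back on the next request. The sketch below turns the header map from HttpURLConnection#getHeaderFields() into a Cookie header value; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CookieEcho {
    /**
     * Collects the name=value part of every Set-Cookie header and joins them
     * into a value suitable for a Cookie request header. Attributes such as
     * Path or HttpOnly are dropped. Returns "" if no cookie was set.
     */
    static String cookieHeader(Map<String, List<String>> headers) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String name = entry.getKey(); // null for the HTTP status line
            if (name == null || !name.equalsIgnoreCase("Set-Cookie")) {
                continue;
            }
            for (String setCookie : entry.getValue()) {
                pairs.add(setCookie.split(";", 2)[0].trim());
            }
        }
        return String.join("; ", pairs);
    }
}
```

The result would be passed to the second connection with setRequestProperty("Cookie", ...) before connecting. Installing a java.net.CookieManager via CookieHandler.setDefault is an alternative that handles this automatically.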
In my case, I had changed the Content-Type to Accept and it resolved the issue.
URL url = new URL(GET_URL);
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("Accept", "application/json; charset=utf-8");
