I am using this code to download files from a url:
FileUtils.copyURLToFile(url, new File("C:/Songs/newsong.mp3"));
When I create the url using for instance,
"https://mjcdn.cc/2/282676442/MjUgU2FhbCAtIFZlZXQgQmFsaml0Lm1wMw==",
this works just fine and the mp3 is downloaded.
However,
if I use another url:
"https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3",
the file is 0kb.
I am able to download files from both these urls from within a web browser.
What's wrong here, and how can I fix it?
I noticed a difference between your 2 URLs:
The first one just gives back the file without redirection.
But the second one responds with a redirect (HTTP/1.1 302 Moved Temporarily). It's also a special case, because it's a redirect from HTTPS to HTTP protocol.
Browsers can follow redirects, but your program - for some reason (see below) - can't.
I suggest using an HTTP client library (e.g. Apache HttpClient or Jsoup) and configuring it to follow redirects (if it doesn't do so by default).
For example, with Jsoup, you would need code like this:
import java.io.FileOutputStream;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

String url = "https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3";
String filename = "C:/Songs/newsong.mp3";
Response r = Jsoup.connect(url)
        //.followRedirects(true)   // follow redirects (it's the default)
        .ignoreContentType(true)   // accept content types other than HTML
        .maxBodySize(10*1000*1000) // accept 10M bytes (default is 1M), or set to 0 for unlimited
        .execute();                // send the GET request
// try-with-resources closes the file even if the write fails
try (FileOutputStream out = new FileOutputStream(filename)) {
    out.write(r.bodyAsBytes());
}
Update on @EJP's comment:
I looked up Apache Commons IO's FileUtils class on GitHub. It calls openStream() of the received URL object.
openStream() is shorthand for openConnection().getInputStream().
openConnection() returns a URLConnection object. If there is an appropriate subclass for the protocol used by the URL, it returns an instance of that subclass. In this case that's an HttpsURLConnection, which is a subclass of HttpURLConnection.
The followRedirects option is defined in HttpURLConnection and it's indeed true by default:
Sets whether HTTP redirects (requests with response code 3xx) should be automatically followed by this class. True by default.
So OP's approach would normally work with redirects too, but it seems that redirection from HTTPS to HTTP is not handled (properly) by HttpsURLConnection. This is the case that @VGR mentioned in the comments below.
It's possible to handle redirects manually by reading the Location header with HttpsURLConnection and then using it in a new HttpURLConnection. (I wouldn't be surprised if Jsoup did the same.)
I suggested Jsoup because it already implements a way to handle HTTPS to HTTP redirections correctly and also provides tons of useful features.
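As a rough sketch of the manual approach described above (read the Location header, then open a new connection to that address), something like the following could work with only the JDK. The class name RedirectingDownloader, its helper methods and the limit of five hops are illustrative assumptions, not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class RedirectingDownloader {
    // Arbitrary safety limit so a redirect loop cannot run forever (assumption).
    static final int MAX_REDIRECTS = 5;

    /** True for the 3xx status codes that point to another address via Location. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a Location value (absolute or relative) against the current address. */
    static String resolve(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    static void download(String address, Path target) throws IOException {
        String current = address;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false); // redirects are followed by this loop instead
            int status = conn.getResponseCode();
            if (isRedirect(status)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without a Location header from " + current);
                }
                current = resolve(current, location); // may switch from https to http
                continue;
            }
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " from " + current);
            }
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return;
        }
        throw new IOException("More than " + MAX_REDIRECTS + " redirects, giving up");
    }
}
```

It would be called as RedirectingDownloader.download(url, Paths.get("C:/Songs/newsong.mp3")). Because each hop opens a fresh connection, the change of protocol between hops no longer matters.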
I am loading a model in Apache Jena using FileManager.get().loadModel(url). I also know that there may be some URLs in the HTTP response's Link header, and I want to load models from those links (URLs) as well. How can I do that? Is there any built-in functionality to access the response headers and process the Link header?
FileManager.get().loadModel(url) packages up reading a URL and parsing the results into a model. It is packing up a common thing to do; it is not claiming to be comprehensive. It is quite an old interface.
If you want detailed control over the HTTP handling, see whether HttpOp (a lower-level mechanism) helps; otherwise do the handling in the application and hand the input stream for the response directly to the parser.
You may also find it useful to look at the code in RDFDataMgr.process for help with content negotiation.
I don't think that this is supported by Jena. I don't see any reason in doing so. The HTTP request is done to get the data and maybe also to get the response type. If you want to get the URLs in some header fields, why not simply use plain old Java:
URL url = new URL("http://your_ontology.owl");
URLConnection conn = url.openConnection();
Map<String, List<String>> map = conn.getHeaderFields();
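To get from that header map to the linked URLs, the Link values still have to be parsed. The sketch below pulls out the targets written between angle brackets, assuming the usual `<target>; rel="..."` syntax; LinkHeaders and its method names are made up for illustration, and the parameters after each target (such as rel) are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkHeaders {
    // A link target is whatever sits between '<' and '>'.
    private static final Pattern TARGET = Pattern.compile("<([^>]*)>");

    /** Extracts the targets from a single Link header value. */
    static List<String> parseTargets(String headerValue) {
        List<String> targets = new ArrayList<>();
        Matcher m = TARGET.matcher(headerValue);
        while (m.find()) {
            targets.add(m.group(1).trim());
        }
        return targets;
    }

    /** Collects the targets of every Link header in a map from getHeaderFields(). */
    static List<String> fromHeaderFields(Map<String, List<String>> headerFields) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : headerFields.entrySet()) {
            // The status line is stored under a null key; header names are case-insensitive.
            if (entry.getKey() != null && entry.getKey().equalsIgnoreCase("Link")) {
                for (String value : entry.getValue()) {
                    targets.addAll(parseTargets(value));
                }
            }
        }
        return targets;
    }
}
```

Each string returned by LinkHeaders.fromHeaderFields(conn.getHeaderFields()) could then be passed to FileManager.get().loadModel(...). Relative targets would first need to be resolved against the request URL.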
I want to readLines from a URL, which resolves to an HTTP service. I can use
Resources.readLines(url, Charsets.SOMETHING)
from com.google.common.io.
This works, but the class javadoc for Resources states the following, without further explanation:
Note that even though these methods use URL parameters, they are usually not appropriate for HTTP or other non-classpath resources.
Why is this method inappropriate for reading from an HTTP service, and what is the recommended approach?
When using URL to send an HTTP request, the typical process is
URL url = new URL(someStringUrl);
HttpURLConnection con = (HttpURLConnection) url.openConnection();
// do some stuff with con: add headers, add a request body, etc.
con.getInputStream(); // get the body of the response
The URL given to Resources skips all that. The methods in Resources depend on URL#openStream(), which skips any modification of the URLConnection, i.e. it is equivalent to url.openConnection().getInputStream(). If the service expects headers or a request body that were never sent, you may get any of a number of 4xx error codes in the HTTP response because the request wasn't correct.
This won't happen with class path resources because the protocol is simple. You just copy the bytes.
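For an HTTP service, one possible alternative is to go through HttpURLConnection explicitly, so that headers can be set and the status code checked before the body is read. This is only a sketch: HttpLines and its method names are hypothetical, and the Accept header and ten-second timeouts are example values.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class HttpLines {
    /** Reads every line from the stream with the given charset, then closes it. */
    static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Sends a GET request and returns the body as lines, failing on any non-200 status. */
    static List<String> fetchLines(String address, Charset charset) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        con.setRequestProperty("Accept", "text/plain"); // example header; adjust to the service
        con.setConnectTimeout(10_000);
        con.setReadTimeout(10_000);
        int status = con.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("HTTP " + status + " from " + address);
        }
        return readLines(con.getInputStream(), charset);
    }
}
```

Unlike openStream(), this makes a failed request visible as an exception carrying the status code, and it leaves room for whatever headers the service requires.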
I am using Apache HttpComponents with 150 threads to download HTML source code for roughly 5000 different URLs.
The URLs are contained in a LinkedBlockingQueue, and the SourceGetterThreads take from the queue when possible. A thread then attempts to download the source code using EntityUtils.toString(HttpClient.execute().getEntity()). The string representation of the HTML source code is then put on another LinkedBlockingQueue, where a further 10 threads are ready to perform useful work on the source code they take from the second queue.
My problem is that I have noticed errors in the work being performed on the source code. I am using Matcher to match specific patterns and record the patterns found. However, sometimes the source code is incorrect and does not match the URL (i.e. the source code held in memory by my Java program is not the same as the source code viewed in Chrome or Firefox). This is seemingly random: sometimes the source code is correct and sometimes it is not.
Does anybody know why this is?
Most likely the sites you're trying to fetch pages from perform some kind of request filtering based on request headers such as User-Agent, so they can return different content depending on the result of that analysis. There are many reasons to do so:
Provide search robots with appropriate info
Deny web-crawlers from fetching site's content
Recognize mobile devices and supply their user with different HTML/CSS/JS
If you're querying the same site intensively, some kind of DoS-attack protection may be triggered, causing a stub error page to be returned instead of the regular content
If you want to get the same content as in browser, then the most basic recommendation for your client is to behave like a browser:
Always provide User-Agent header
Follow HTTP redirects (HTTP codes 302, 303)
Maintain cookies and authentication routines if required
Be ready to support HTTPS scheme
Without the source code you're trying to run it's hard to say more, but the Apache HTTP Client you're using is definitely capable of doing all these things. Here, for example, is how to set the User-Agent value:
String url = "http://www.google.com/search?q=httpClient";
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(url);
request.addHeader("User-Agent","Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
HttpResponse response = client.execute(request);
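The other recommendations (following redirects, keeping cookies) can be sketched as well. The version below deliberately uses the JDK's own java.net.http client (Java 11 or later) rather than Apache HttpClient, simply to stay dependency-free; BrowserLikeFetcher and its methods are illustrative names, and the User-Agent string is the one from the example above.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class BrowserLikeFetcher {
    // Example value; use the User-Agent of the browser you are comparing against.
    static final String USER_AGENT =
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2";

    /** Builds a client that follows redirects and remembers cookies between requests. */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // all redirects except https -> http
                .cookieHandler(new CookieManager())
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    /** Builds a GET request that identifies itself with the browser-like User-Agent. */
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .GET()
                .build();
    }

    /** Fetches the page body as a string, failing on any non-200 final status. */
    static String fetch(HttpClient client, String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(newRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

A single client from newClient() can be shared by all worker threads, since java.net.http.HttpClient instances are immutable once built. Checking the status code before queueing the body also keeps error pages out of the pattern-matching stage.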
I am serving an authenticated image using Django. The image is behind a view that requires login, and in the end I have to check more things than just the authentication.
For a reason too complicated to explain here, I cannot use the real URL of the image, so I am serving it through a custom URL that leads to the authenticated view.
The image must be reachable from Java, to save or display it. For this part I use Apache HttpClient.
With Apache HttpClient I tried a lot of things (every example and combination of examples...) but can't seem to get it working.
For other parts of the web app I use django-rest-framework, which I successfully connected to from Java (and C and curl).
I use the login_required decorator in Django, which makes any attempt to reach the URL redirect to a login page first.
Trying the link and the login in a web viewer, I see the 200 code (OK) in the server console.
Trying the link with HttpClient, I get a 302 Found in the console (looking up 302, it means a redirect).
this is what I do in django:
in urls.py:
url(r'^photolink/(?P<filename>.*)$', 'myapp.views.photolink',name='photolink'),
in views.py:
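import logging
import mimetypes
import os

from django.conf import settings
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse

logger = logging.getLogger(__name__)

@login_required
def photolink(request, filename):
    # From the filename I get the image object (some_image_object); not interesting for this question.
    # There is a good reason for this complicated way to reach a photo, but that is not the point here.
    filename_photo = some_image_object.url
    base_filename = os.path.basename(filename_photo)
    # This is the real path and filename of the photo (mac is set elsewhere):
    path_filename = os.path.join(settings.MEDIA_ROOT, 'photos', mac, base_filename)
    mime = mimetypes.guess_type(filename_photo)[0]
    logger.debug("mimetype response = %s" % mime)
    with open(path_filename, 'rb') as f:
        image_data = f.read()
    # content_type= replaces the older mimetype= argument, which Django 1.7 removed
    return HttpResponse(image_data, content_type=mime)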
By the way, once I get this working I need another decorator to pass some other tests, but I first need to get this part working.
For now it's not a secured URL, just plain HTTP.
In Java I tried a lot of things using Apache's HttpClient 4.2.1: proxy, cookies, authentication negotiation, following redirects, and so on.
Am I overlooking something basic here?
It seems the login of the website client is not suitable for automated login, so the problem could be in my Django code or in the Java code.
In the end, the problem was that the Java client uses HTTP authorization,
which the login_required decorator does not check by default.
Adding a custom decorator that checks for HTTP authorization did the trick;
see this example: http://djangosnippets.org/snippets/243/
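On the Java side, a request carrying HTTP Basic credentials could look roughly like the sketch below. It uses the JDK's HttpURLConnection instead of Apache HttpClient to stay self-contained, it assumes the custom decorator accepts the Basic scheme, and BasicAuthImageFetcher with its parameters is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Base64;

final class BasicAuthImageFetcher {
    /** Builds the value of the Authorization header for the Basic scheme. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Requests the image with credentials attached and saves it to the target path. */
    static void fetch(String address, String user, String password, Path target)
            throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        con.setRequestProperty("Authorization", basicAuthHeader(user, password));
        // Do not follow redirects: a 302 here means the view sent us to the login page.
        con.setInstanceFollowRedirects(false);
        int status = con.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("HTTP " + status + " from " + address);
        }
        try (InputStream in = con.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With redirects switched off, a rejected login shows up as an exception mentioning 302 rather than as a silently downloaded login page.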
I have been working on a program that gets the contents of www.bing.com and saves it to a file. Of the two approaches I have tried, one using sockets and the other using HtmlUnit, neither shows the contents 100% correctly when I open the file. I know there are other options out there, but I am looking for one that is guaranteed to get the contents of www.bing.com/ correctly. I would therefore appreciate it if someone could point me to a means of accomplishing this.
The differences you see are likely due to the web server providing different content to different browsers based on the user agent string and other request headers.
Try setting the User-Agent header in your socket and HtmlUnit strategies to the one you are comparing against and see if the result is as expected. Moreover, you will likely have to replicate the request headers exactly as they are sent by your target browser.
What is "incorrect" about what is returned? Keep in mind, Bing is probably generating some of the content via JavaScript; your client will need to make additional requests to retrieve the JavaScript files, run the JavaScript, etc.
You can use a URL.openConnection() to create a URLConnection and call URLConnection.getInputStream(). You can read the InputStream contents and write it to a file.
If you need to override the User-Agent because the server uses it to serve different content, you can do so by first setting the http.agent system property to an empty string.
/* Somewhere in your code before you make requests */
System.setProperty("http.agent", "");
or using -Dhttp.agent= on your java command line
and then setting the User-Agent to something useful on the connection before you get the InputStream.
URLConnection conn = ... //Create your URL connection as described above.
String userAgent = ... //Some user-agent string here.
conn.setRequestProperty("User-Agent", userAgent);
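Putting the pieces above together, a minimal sketch of the whole download might look like this. PageSaver and its method names are invented for the example, and the caller supplies the user-agent string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PageSaver {
    /** Copies the stream to the target file, closes it, and returns the byte count. */
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream body = in) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Opens the URL with the given User-Agent and writes the response body to a file. */
    static long download(String address, String userAgent, Path target) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return save(conn.getInputStream(), target);
    }
}
```

This saves only the HTML that the server returns for that one request; anything the page builds later with JavaScript will not be in the file.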