HTTP request returning different response in browser - Java

I'm getting a JSON object from a server. When I enter the following generated URL into my browser, I get a response with "num_matches": 18; however, when running in my app I get a JSON object with "num_matches": 2.
The URL object is created like this:
URL request;
request = new URL(url);
and connection like this:
HttpURLConnection connection = (HttpURLConnection) request.openConnection();
connection.setConnectTimeout(MAX_TIME);
connection.setReadTimeout(MAX_TIME);
url is a String and I am copying the string contents into my browser to test.
The string is:
http://search.3taps.com/?auth_token=xxxxxxxxxxxxxxxxxx&retvals=heading,body,timestamp,external_url,images,price&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ&category=PWSM&radius=200mi&lat=26.244&long=-80.2&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
The URL object has the following fields:
query:
auth_token=xxxxxxxxxxxxxxxxxx&retvals=heading,body,timestamp,external_url,images,price&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ&category=PWSM&radius=200mi&lat=26.244&long=-80.2&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
file:
/?auth_token=xxxxxxxxxxxxxxxxxx&retvals=heading,body,timestamp,external_url,images,price&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ&category=PWSM&radius=200mi&lat=26.244&long=-80.2&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
host:
search.3taps.com
The response comes back with "success":true in both cases, but with a discrepancy in the object returned. I don't know much about HTTP; what could be causing this?
UPDATE: On further testing, the problem only appears when the annotations entry is present; the parameter
annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
seems to be causing the problem.

Make sure you are encoding the URL request correctly when you are setting the URL for the server. The spaces, braces, and colons all need to be appropriately escaped. Spaces should be %20, etc. This may help: HTTP URL Address Encoding in Java
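For illustration, a sketch of percent-encoding just the annotations value before building the request (the buildSearchUrl helper is hypothetical, not part of your code):
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

static String buildSearchUrl(String authToken) throws UnsupportedEncodingException {
    String annotations = "{age:18 OR age:19 OR age:20 OR age:21 OR age:22}";
    // URLEncoder produces form encoding ("+" for spaces), so swap "+" for "%20"
    // to get standard percent-encoding in the query string.
    String encoded = URLEncoder.encode(annotations, "UTF-8").replace("+", "%20");
    // Note: the '|' characters in source= may also need encoding, depending on the server.
    return "http://search.3taps.com/?auth_token=" + authToken
            + "&retvals=heading,body,timestamp,external_url,images,price"
            + "&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ"
            + "&category=PWSM&radius=200mi&lat=26.244&long=-80.2"
            + "&annotations=" + encoded;
}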
Old Answer.... Comments indicate this does not affect the result... so moving down.
It is quite possible that the server is changing its behaviour based on the type of 'browser' you report yourself to be. When connecting to an HTTP server, you tell the server what your User-Agent is (for a browser it is typically something like "Internet Explorer ...", "Mozilla ...", or "Google Chrome ..."). The server will often tailor the results of a request to suit the User-Agent (different JavaScript files and HTML go to IE, etc.). This is also how servers redirect mobile devices to a mobile-friendly version of a site.
It is quite possible that the server is changing its response to match the User-Agent exposed by your Java code (which by default is something like "Java/1.7.0"). You can change this value in a few ways. Have a look at this question: Setting user agent of a java URLConnection, then try to run your program with the Mozilla agent and see if you get different results.
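If you want to test that theory, a minimal sketch against the question's own connection setup (the User-Agent string below is just an illustrative example):
HttpURLConnection connection = (HttpURLConnection) request.openConnection();
// Override the default "Java/1.x" agent with a browser-like string.
connection.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0");
connection.setConnectTimeout(MAX_TIME);
connection.setReadTimeout(MAX_TIME);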

Related

Downloading a binary file from a URL

I am using this code to download files from a URL:
FileUtils.copyURLToFile(url, new File("C:/Songs/newsong.mp3"));
When I create the url using for instance,
"https://mjcdn.cc/2/282676442/MjUgU2FhbCAtIFZlZXQgQmFsaml0Lm1wMw==",
this works just fine and the mp3 is downloaded.
However, if I use another URL:
"https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3",
the file is 0 KB.
I am able to download files from both these urls from within a web browser.
What's wrong here, and how can I fix it?
I noticed a difference between your 2 URLs:
The first one just gives back the file without redirection.
But the second one responds with a redirect (HTTP/1.1 302 Moved Temporarily). It's also a special case, because it's a redirect from HTTPS to HTTP protocol.
Browsers can follow redirects, but your program - for some reason (see below) - can't.
I suggest using an HTTP client library (e.g. Apache HttpClient or Jsoup) and configuring it to follow redirects (if it doesn't do so by default).
For example, with Jsoup, you would need code like this:
import java.io.FileOutputStream;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

String url = "https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3";
String filename = "C:/Songs/newsong.mp3";
Response r = Jsoup.connect(url)
        //.followRedirects(true)       // follow redirects (it's the default)
        .ignoreContentType(true)       // accept content other than HTML
        .maxBodySize(10 * 1000 * 1000) // accept 10 MB (default is 1 MB); set to 0 for unlimited
        .execute();                    // send the GET request
try (FileOutputStream out = new FileOutputStream(filename)) {
    out.write(r.bodyAsBytes());
}
Update on #EJP's comment:
I looked up Apache Commons IO's FileUtils class on GitHub. It calls openStream() of the received URL object.
openStream() is a shorthand for openConnection().getInputStream().
openConnection() returns a URLConnection object. If there is an appropriate subclass for the protocol used by the URL, it will return an instance of that subclass. In this case that's an HttpsURLConnection, which is a subclass of HttpURLConnection.
The followRedirects option is defined in HttpURLConnection and it's indeed true by default:
Sets whether HTTP redirects (requests with response code 3xx) should be automatically followed by this class. True by default.
So the OP's approach would normally work with redirects too, but it seems that redirection from HTTPS to HTTP is not handled (properly) by HttpsURLConnection. This is the case that #VGR mentioned in the comments below.
It's possible to handle redirects manually by reading the Location header with HttpsURLConnection, then using it in a new HttpURLConnection. (Example) (I wouldn't be surprised if Jsoup did the same.)
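For illustration, a rough sketch of that manual approach using the question's URL (error handling omitted; this is not the linked example verbatim):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

URL url = new URL("https://dl.jatt.link/hd.jatt.link/a0339e7c772ed44a770a3fe29e3921a8/uttzv/Hummer-(Mr-Jatt.com).mp3");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setInstanceFollowRedirects(false);          // we handle the redirect ourselves
int status = conn.getResponseCode();
if (status / 100 == 3) {                         // any 3xx: read the target and reconnect
    String location = conn.getHeaderField("Location");
    conn.disconnect();
    conn = (HttpURLConnection) new URL(location).openConnection();
}
try (InputStream in = conn.getInputStream()) {
    Files.copy(in, Paths.get("C:/Songs/newsong.mp3"), StandardCopyOption.REPLACE_EXISTING);
}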
I suggested Jsoup because it already implements a way to handle HTTPS to HTTP redirections correctly and also provides tons of useful features.

Get the redirected URL of a very specific URL (in Java)

How can I get the redirected URL of http://at.atwola.com/?adlink/5113/1649059/0/2018/AdId=4041444;BnId=872;itime=15692006;impref=13880156912668385284; in Java?
My code (given below) is constructed according to answers to similar questions on Stack Overflow (https://stackoverflow.com/a/5270162/1382251 in particular).
But it just yields the original URL. I suspect that there are other similar cases, so I would like to resolve this one in specific and use the solution in general.
String ref = "http://at.atwola.com/?adlink/5113/1649059/0/2018/AdId=4041444;BnId=872;itime=15692006;impref=13880156912668385284;";
try
{
    URLConnection con1 = new URL(ref).openConnection();
    con1.connect();
    InputStream is = con1.getInputStream();
    URL url = con1.getURL();
    is.close();
    String finalPage = url.toString();
    if (finalPage.equals(ref))
    {
        HttpURLConnection con2 = (HttpURLConnection) con1;
        con2.setInstanceFollowRedirects(false);
        con2.connect();
        if (con2.getResponseCode() / 100 == 3)
            finalPage = con2.getHeaderField("Location");
    }
    System.out.println(finalPage);
}
catch (Exception error)
{
    System.out.println("error");
}
I played a bit with your URL with telnet, wget, and curl and I noticed that in some cases the server returns response 200 OK, and sometimes 302 Moved Temporarily. The main difference seems to be the request User-agent header. Your code works if you add the following before con1.connect():
con1.setRequestProperty("User-Agent","");
That is, with empty User-Agent (or if the header is not present at all), the server issues a redirect. With the Java User-Agent (in my case User-Agent: Java/1.7.0_45) and with the default curl User-Agent (User-Agent: curl/7.32.0) the server responds with 200 OK.
In some cases you might need to also set:
System.setProperty("http.agent", "");
See Setting user agent of a java URLConnection
The server running the site is the Adtech Adserver, and apparently it is doing user-agent sniffing. User-agent sniffing has a long history, so it seems that the safest thing to do would be to set the user agent to Mozilla:
con1.setRequestProperty("User-Agent","Mozilla"); //works with your code for your URL
Maybe the safest option would be to use a User-Agent string sent by one of the popular web browsers.
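Putting the two pieces together, a condensed sketch based on the question's code (ref is the original URL; the "Mozilla" string is just an example):
HttpURLConnection con = (HttpURLConnection) new URL(ref).openConnection();
con.setRequestProperty("User-Agent", "Mozilla"); // or "" - both triggered the redirect in my tests
con.setInstanceFollowRedirects(false);           // don't follow; we only want the Location header
con.connect();
String finalPage = (con.getResponseCode() / 100 == 3)
        ? con.getHeaderField("Location")         // the redirected URL
        : ref;                                   // no redirect was issued
System.out.println(finalPage);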

Apache HttpComponents request.get.execute is not returning correct HTML source code

I am using Apache HttpComponents with 150 threads to download HTML source code for roughly 5000 different URLs.
The URLs are contained in a LinkedBlockingQueue, and the SourceGetterThreads take from the queue when possible. A thread then attempts to download the source code using EntityUtils.toString(HttpClient.execute().getEntity()). The string representation of the HTML source code is then put on another LinkedBlockingQueue, where a further 10 threads are ready to perform useful work on the source code they take from the second queue.
My problem is that I have noticed errors in the work being performed on the source code. I am using Matcher to match specific patterns and record the patterns found. However, sometimes the source code is incorrect and does not match the URL (i.e., the source code saved in my Java memory is not the same as the source code when viewed in Chrome or Firefox). This is seemingly random, so sometimes the source code is correct and sometimes it is not.
Does anybody know why this is?
Most likely the sites you're trying to fetch pages from perform some kind of request filtering based on request headers, like User-Agent, so they can simply return different content based on the analysis result. There are many reasons to do so:
Provide search robots with appropriate info
Deny web-crawlers from fetching site's content
Recognize mobile devices and supply their user with different HTML/CSS/JS
If you're querying the same site intensively, some kind of DoS-attack protection may be triggered, causing a stub error page to be returned instead of the regular content
If you want to get the same content as in browser, then the most basic recommendation for your client is to behave like a browser:
Always provide User-Agent header
Follow HTTP redirects (HTTP codes 302, 303)
Maintain cookies and authentication routines if required
Be ready to support HTTPS scheme
Without the source code you're trying to run it's hard to say more, but the Apache HttpClient you're using is definitely capable of doing all these things. Here's, for example, how to set the User-Agent value:
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

String url = "http://www.google.com/search?q=httpClient";
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(url);
request.addHeader("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
HttpResponse response = client.execute(request);

Open an authenticated image served by Django from Java using Apache HttpClient

I am serving an authenticated image using Django. The image is behind a view which requires login, and in the end I have to check more things than just the authentication.
For a reason too complicated to explain here, I cannot use the real URL of the image; instead I am serving it through a custom URL that leads to the authenticated view.
From Java the image must be reachable, to save or display. For this part I use Apache HttpClient.
In Apache HttpClient I tried a lot of things (every example and combination of examples...) but can't seem to get it working.
For other parts of the webapp I use django-rest-framework, which I successfully connected to from Java (and C and curl).
I use the login_required decorator in Django, which makes an attempt to reach the URL redirect to a login page first.
Trying the link and the login in a web browser, I see the 200 (OK) code in the server console.
Trying the link with HttpClient, I get a 302 Found in the console (looking up 302, it means a redirect).
This is what I do in Django:
in urls.py:
url(r'^photolink/(?P<filename>.*)$', 'myapp.views.photolink',name='photolink'),
in views.py:
import mimetypes
import os

@login_required
def photolink(request, filename):
    # from the filename I get the image object; for this question that part is not interesting
    # (there is a good reason for this complicated way to reach a photo, but it's not the point here)
    filename_photo = some_image_object.url
    base_filename = os.path.basename(filename_photo)
    # this is the real path and filename of the photo:
    path_filename = os.path.join(settings.MEDIA_ROOT, 'photos', mac, base_filename)
    mime = mimetypes.guess_type(filename_photo)[0]
    logger.debug("mimetype response = %s" % mime)
    image_data = open(path_filename, 'rb').read()
    return HttpResponse(image_data, mimetype=mime)
By the way, if I get this working I will need another decorator to pass some other tests,
but first I need to get this thing working.
For now it's not a secured URL: plain HTTP.
In Java I tried a lot of things using Apache HttpClient 4.2.1:
proxy, cookies, authentication negotiation, follow redirects, and so on.
Am I overlooking some basic thing here?
It seems the login of the website client is not suitable for automated login,
so the problem could be in my Django code or in the Java code.
In the end the problem was HTTP authorization,
which the login_required decorator does not use by default.
Adding a custom decorator that checks for HTTP authorization did the trick;
see this example: http://djangosnippets.org/snippets/243/
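On the Java side, a minimal sketch of sending HTTP Basic credentials with HttpClient 4.2.x, assuming the custom decorator challenges with a 401 as in that snippet (host, path, and credentials below are placeholders):
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient client = new DefaultHttpClient();
// Credentials are sent once the server answers 401 with a WWW-Authenticate header.
client.getCredentialsProvider().setCredentials(
        new AuthScope("myserver.example.com", 80),
        new UsernamePasswordCredentials("someuser", "somepassword"));
HttpGet get = new HttpGet("http://myserver.example.com/photolink/some-image.jpg");
HttpResponse response = client.execute(get);
byte[] imageBytes = EntityUtils.toByteArray(response.getEntity());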

Guaranteed way to correctly get the contents of www.bing.com/

I have been working on a program that gets the contents of www.bing.com and saves it to a file. I have tried two approaches, one using sockets and the other using HtmlUnit, but neither shows the contents 100% correctly when I open the file. I know there are other options out there, but I am looking for one that is guaranteed to get the contents of www.bing.com/ correctly. I would therefore appreciate it if someone could point me to a means of accomplishing this.
The differences you see are likely due to the web server providing different content to different browsers based on the user agent string and other request headers.
Try setting the User-Agent header in your socket and HtmlUnit strategies to the one you are comparing against and see if the result is as expected. Moreover, you will likely have to replicate the request headers exactly as they are sent by your target browser.
What is "incorrect" about what is returned? Keep in mind, Bing is probably generating some of the content via JavaScript; your client will need to make additional requests to retrieve the JavaScript files, run the JavaScript, etc.
You can use URL.openConnection() to create a URLConnection and call URLConnection.getInputStream(). You can then read the InputStream contents and write them to a file.
If you need to override the User-Agent because the server is using it to serve different content you can do so by first setting the http.agent system property to empty string.
/* Somewhere in your code before you make requests */
System.setProperty("http.agent", "");
or using -Dhttp.agent= on your java command line
and then setting the User-Agent to something useful on the connection before you get the InputStream.
URLConnection conn = ... //Create your URL connection as described above.
String userAgent = ... //Some user-agent string here.
conn.setRequestProperty("User-Agent", userAgent);
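For completeness, a minimal end-to-end sketch of the steps above (the output file name and User-Agent string are placeholders, not part of the original answer):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

System.setProperty("http.agent", "");            // clear the default "Java/1.x" agent first
URLConnection conn = new URL("https://www.bing.com/").openConnection();
conn.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/115.0");
try (InputStream in = conn.getInputStream()) {
    Files.copy(in, Paths.get("bing.html"), StandardCopyOption.REPLACE_EXISTING);
}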
