Problem reading website content with a GET request in Java

I'm trying to read the best price from the Skyscanner website using a plain GET request, but the code below does not return the content I want.
private void getRequest() throws Exception {
    StringBuilder result = new StringBuilder();
    URL url = new URL(URL);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0");
    System.out.println(conn.getURL());
    conn.setInstanceFollowRedirects(true);
    HttpURLConnection.setFollowRedirects(true);
    conn.setRequestMethod("GET");
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null) {
        result.append(line);
    }
    System.out.println(conn.getURL());
    rd.close();
    response = result.toString();
}
The requested URL is the following:
https://www.skyscanner.com/transport/flights/fra/txl/181220/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=0&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&currency=EUR&market=DE&locale=en-US
Response from the code above looks like this:
https://pastebin.com/YKh17RKE
By going to the Skyscanner link above in Chrome, I can click Inspect Element and, voilà, under
fqs-opts-container -> <span class="fqs-price">42 €</span>
I can see the cheapest price.
How can I get this information using Java? What am I doing wrong here?
Thanks in advance.

Inspect shows the current HTML DOM (Document Object Model), which results from:
the static HTML page (see right-click + View page source), plus
dynamic modifications made by JavaScript.
If you open Inspect, switch to the Network tab and reload the page, you can see all the files (and their contents) the browser requests in order to display the page.
In this particular case, it seems that you could get the data as JSON:
In the Network tab, filter for conductor/v1/fps3/search/. The query is an HTTP POST request with the URL https://www.skyscanner.de/g/conductor/v1/fps3/search/?geo_schema=skyscanner&carrier_schema=skyscanner&response_include=query%3Bdeeplink%3Bsegment%3Bstats%3Bfqs%3Bpqs%3B_flights_availability. The response is JSON and includes a session_id, which is required as part of the URL for subsequent detail requests.
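For illustration, here is a minimal sketch of calling that endpoint with HttpURLConnection. Only the URL and the general flow come from the observed request; the POST payload below is a placeholder you would have to copy from your own browser's Network tab:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ConductorSearch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.skyscanner.de/g/conductor/v1/fps3/search/"
                + "?geo_schema=skyscanner&carrier_schema=skyscanner"
                + "&response_include=query%3Bdeeplink%3Bsegment%3Bstats%3Bfqs%3Bpqs%3B_flights_availability");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setDoOutput(true);

        // Placeholder payload: copy the real body of the search request from the Network tab.
        byte[] body = "<payload copied from the Network tab>".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }

        // The JSON response contains the session_id needed in the URL of follow-up requests.
        StringBuilder json = new StringBuilder();
        try (BufferedReader rd = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = rd.readLine()) != null) {
                json.append(line);
            }
        }
        System.out.println(json);
    }
}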
Please note that even if it is technically possible to retrieve the data, using it commercially is in most cases forbidden.

Related

Eclipse not respecting system proxy

I am trying to make a POST request to an https address, with Fiddler set up to return a canned response. I have two rules set up in Fiddler, and the process works from both Internet Explorer and Postman (but not Chrome). However, I cannot get it to work from the Java application I am writing, even when I build an executable jar and run it from the command line. I have been using this example as the base for this work. I have sendGet() working (ish), but sendPost() fails with a java.net.UnknownHostException.
I think the problem may be that I am not hitting Fiddler as the proxy from Eclipse. For sendGet() from the browser and Postman I get the contents of 200_SimpleHTML.dat as required, but from Eclipse the same rule has no effect and I get the content from the actual URL (our TeamForge, in this case).
My organisation uses a proxy, which is set in IE, and I have set the Java configuration to "Use browser settings" and also tried "Use automatic proxy configuration script" (pointing to the proxy.pac file); neither seems to have any effect. I have matching settings in Window -> Preferences -> Network Connections, but I have no idea how, or even if, I can point to Fiddler as the proxy there. I am not setting up any authentication on the working routes.
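From what I have read, the JVM can also be pointed at a proxy explicitly in code, independent of the IDE and OS settings. A minimal sketch of that idea (assuming Fiddler's default listener on 127.0.0.1:8888, and noting that for https the JVM would also need to trust Fiddler's root certificate):
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

public class FiddlerProxyCheck {
    public static void main(String[] args) throws Exception {
        // Option 1: route all HttpURLConnection traffic through Fiddler (same effect
        // as -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=8888 on the command line).
        System.setProperty("http.proxyHost", "127.0.0.1");
        System.setProperty("http.proxyPort", "8888");
        System.setProperty("https.proxyHost", "127.0.0.1");
        System.setProperty("https.proxyPort", "8888");

        // Option 2: attach the proxy to a single connection instead.
        Proxy fiddler = new Proxy(Proxy.Type.HTTP,
                new InetSocketAddress("127.0.0.1", 8888));
        HttpURLConnection http = (HttpURLConnection)
                new URL("http://www.example.com/").openConnection(fiddler);
        System.out.println(http.getResponseCode());
    }
}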
The current state of my sendPost is below:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
which I copied from Fiddler after one of the successful requests.
private void sendPost() throws Exception
{
    String url = "<Actual URL removed>";
    URL obj = new URL(url);
    HttpURLConnection http = (HttpURLConnection) obj.openConnection();

    // add request header
    http.setRequestMethod("POST");
    http.setDoOutput(true);
    http.setRequestProperty("User-Agent", USER_AGENT);
    http.setRequestProperty("Accept-Language", "en-GB,en;q=0.5");

    OutputStream out = http.getOutputStream();

    int responseCode = http.getResponseCode();
    System.out.println("\nSending 'POST' request to URL : " + url);
    System.out.println("Response Code : " + responseCode);

    BufferedReader in = new BufferedReader(new InputStreamReader(http.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null)
    {
        response.append(inputLine);
    }
    in.close();

    // print result
    System.out.println(response.toString());
}
Does anyone have any ideas as to how I can get this to work from my Java app?

Incredibly long wait times when trying to read content of secure https website in Java using basic approaches

After reading about and trying various approaches to reading website content, I realised that I am unable to get the input stream of secure web pages (or have to wait minutes for a single response). These are pages that I can easily access via a browser (no proxies involved).
The different fixes that I tried are the following:
Setting user agent
Following redirects
Using JSoup
Catering for encoding
Using Scanner to parse the stream
Using cookie manager
Here are the two basic approaches that seem most popular. The first uses Jsoup:
Document doc = Jsoup.connect(url)
        .userAgent(userAgent)
        .timeout(5000)
        .followRedirects(true)
        .execute()
        .parse();
Elements body = doc.select("body");
System.out.println(body.html());
The other with vanilla Java:
URL obj = new URL(url);
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", userAgent);
con.setConnectTimeout(5000);
con.setReadTimeout(5000);
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuilder response = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}
in.close();
For https addresses such as https://en.wikipedia.org/wiki/Java, execution stalls painfully in the execute phase (so no status code can be retrieved from the response either), or, with the second approach, while the input stream is being fetched. Both approaches work perfectly fine for plain-http addresses (e.g. http://www.codingpedia.org/) with a switch to HttpURLConnection.
Weirdly, there is also very high disk usage while waiting (> 20 MB/sec).
All help greatly appreciated!

POST HTTP request with JSON object

I'm working on a pet project to scrape fantasy football stats from my own fantasy league on ESPN. The problem I'm running into, and can't seem to get past, is the login that is needed before I can make requests for my league's page.
The URL I hit is
http://games.espn.com/ffl/leaguesetup/ownerinfo?leagueId=123456&seasonId=2016
and by looking at the GET requests it looks like I get redirected to
http://games.espn.com/ffl/signin?redir=http://games.espn.com/ffl/leaguesetup/ownerinfo?leagueId=123456&seasonId=2016
which immediately takes me to a login prompt window. When I log in, I inspect the POST request and note down all the request headers. It looks like the requested URL of the POST is
https://registerdisney.go.com/jgc/v5/client/ESPN-FANTASYLM-PROD/guest/login?langPref=en-US
Additionally, I noted that the following JSON object is passed along:
{"loginValue":"myusername","password":"mypassword"}
Using the request headers and JSON object, I did the following:
String url = "http://games.espn.com/ffl/leaguesetup/ownerinfo?leagueId=123456&seasonId=2016";
String rawData = "{\"loginValue\":\"myusername\",\"password\":\"mypassword\"}";
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
con.setRequestMethod("POST");
con.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
con.setRequestProperty("Accept-Encoding", "gzip, deflate");
con.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
con.setRequestProperty("Authorization", "APIKEY 8IYGqTgmpFTX51iF1ldp6MBtWrdQ0BxNUf8bg5/empOdV4u16KUSrnkJqy1DXy+QxV8RaxKq45o2sM8Omos/DlHYhQ==");
con.setRequestProperty("Cache-Control", "no-cache");
con.setRequestProperty("Content-Length", "52");
con.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
con.setRequestProperty("Expires", "-1");
con.setRequestProperty("Host", "registerdisney.go.com");
con.setRequestProperty("Origin", "https://cdn.registerdisney.go.com");
con.setRequestProperty("Pragma", "no-cache");
con.setRequestProperty("Referer", "https://cdn.registerdisney.go.com/v2/ESPN-ESPNCOM-PROD/en-US?include=config,l10n,js,html&scheme=http&postMessageOrigin=http%3A%2F%2Fwww.espn.com%2F&cookieDomain=www.espn.com&config=PROD&logLevel=INFO&topHost=www.espn.com&ageBand=ADULT&countryCode=US&cssOverride=https%3A%2F%2Fsecure.espncdn.com%2Fcombiner%2Fc%3Fcss%3Ddisneyid%2Fcore.css&responderPage=https%3A%2F%2Fwww.espn.com%2Flogin%2Fresponder%2Findex.html&buildId=157599bfa88");
con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0");
con.setRequestProperty("conversation-id", "5a4572f4-c940-454c-8f86-9af27345c894, adffddd3-8c31-41a0-84d7-7a0401cd2ad0");
con.setRequestProperty("correlation-id", "4d9ddc78-b00e-4c5a-8eec-87622961fd34");
con.setDoOutput(true);

OutputStreamWriter w = new OutputStreamWriter(con.getOutputStream(), "UTF-8");
w.write(rawData);
w.close();

int responseCode = con.getResponseCode();

BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}
in.close();
Assuming I'm on the right track, what I'm currently getting back from the server is:
returned HTTP response code: 400 for URL: https://registerdisney.go.com/jgc/v5/client/ESPN-FANTASYLM-PROD/guest/login?langPref=en-US
Any ideas what is happening, or whether I'm taking the completely wrong approach here? I tried to use JSoup but had no luck either, and I believe JSoup uses HttpURLConnection underneath as well.
Do I need to do some sort of GET request first, save something, and then do the POST request? How should this work?
You are trying to emulate the behaviour of a web browser with JSoup. As you have experienced, this is quite complicated, and JSoup is not made to impersonate a browser. When you start crafting HTTP headers by hand, it's better to go another way.
The solution to your problem is to use a browser that can be manipulated programmatically. Selenium is more or less the de facto standard in Java.
Selenium starts your favourite browser (Firefox, Chrome, ...) and lets you control it from your Java program. You can also retrieve the content of the web pages in order to scrape them with JSoup. Selenium is well documented; you will have no difficulty finding the required documentation/tutorials.
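A minimal sketch of that combination (assuming the Selenium and Jsoup dependencies are on the classpath and a matching driver binary, e.g. geckodriver, is installed; the login steps themselves are elided):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class LeagueScrape {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // starts a real Firefox instance
        try {
            driver.get("http://games.espn.com/ffl/leaguesetup/ownerinfo?leagueId=123456&seasonId=2016");
            // ... drive the real login form here with driver.findElement(...) ...
            // Hand the rendered page over to JSoup for parsing.
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.title());
        } finally {
            driver.quit();
        }
    }
}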
Another answer to your problem. While it is impossible for me to reproduce your issue (I don't have a fantasy football account, and no intention of creating one), I can still offer some methodology help.
I would tackle the problem by using the network inspector in my browser, copying all the exchanges between the browser and the server into a file, and trying to reproduce them in my code.
The API key value in the Authorization header can only be reused for a limited time. If it has expired, the registration response body will contain an "API_KEY_INVALID" error.

Why is the content type of a PDF file returned as HTML?

I am trying to see the content type of a web URL using the following code.
Interestingly, the content type of the given URL (http://www.jbssinc.com/inv_pr_pdf/2007-05-08.pdf) is returned as text/html; charset=iso-8859-1 even though it is a PDF document. I would like to understand why.
Here is my code:
public static void main(String[] args) throws MalformedURLException {
    URLConnection urlConnection = null;
    URL url = new URL("http://www.jbssinc.com/inv_pr_pdf/2007-05-08.pdf");
    try {
        urlConnection = url.openConnection();
        urlConnection.setConnectTimeout(10 * 1000);
        urlConnection.setReadTimeout(10 * 1000);
        urlConnection.connect();
    } catch (IOException e) {
        System.out.println("Error in establishing connection.\n");
    }
    String contentType = "";
    /* If we were able to get a connection ---> */
    if (urlConnection != null) {
        contentType = urlConnection.getContentType();
    }
    System.out.println(contentType);
}
When I access this page in Java and attempt to actually load it, I get a 403 - Forbidden error. These error pages are HTML pages, not PDF files, so that's why you're getting the content type you're seeing.
This site is probably detecting your browser, or using some other mechanism to prevent automatic downloads; that's why it works in Chrome, Firefox and IE but not with Java.
Your code works fine with a different URL, such as https://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf.
In the case of this web server, if you set the User-Agent to a typical browser value, it will allow you to make the connection normally.
Try adding this line immediately before urlConnection.connect():
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
See this answer for more information about setting the User-Agent. You should make sure you are not violating the website's Terms of Service in some way before doing this, though.
Typically, the way to check whether a website explicitly forbids apps from downloading its content is its http://example.com/robots.txt file, which here would be http://www.jbssinc.com/robots.txt. That file doesn't forbid robots (your program) from downloading this particular file, so I think you are okay to spoof your User-Agent; the fact that Java is blocked is more likely to be user error.
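If you want to check programmatically, robots.txt is plain text and can be read like any other resource (a minimal sketch):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        URL robots = new URL("http://www.jbssinc.com/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // "Disallow:" lines list the paths crawlers must avoid
            }
        }
    }
}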
Further reading: Is using a faked user agent allowed?

Java URL Without Protocol

I'm trying to open an InputStream to a certain URL as given by the service's API. However, it does not have a set protocol (it's not http or https), and without one I am getting the following error.
Is there any way to make a request without a protocol?
Exception:
Exception in thread "main" java.net.MalformedURLException: no protocol.
Code:
String url = "maple.fm/api/2/search?server=1";
InputStream is = new URL(url).openStream();
UPDATE: I have now updated the code to:
Code:
String url = "http://maple.fm/api/2/search?server=1";
InputStream is = new URL(url).openStream();
and now I'm getting the following error:
Exception:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://maple.fm/api/2/search?server=1
A URL without a protocol is not a valid URL. It is actually a relative URI, and you can only use a relative URI if you have an absolute URI (or equivalent) to provide the context for resolving it.
Is there any way to [make] a request without a protocol?
Basically .... No. The protocol tells the client-side libraries how to perform the request. If there is no protocol, the libraries would not know what to do.
The reason that "urls without protocols" work in a web browser's URL bar is that the browser is being helpful, and filling in the missing protocol with "http:" ... on the assumption that that is what the user probably means. (Plus a whole bunch of other stuff, like adding "www.", adding ".com", escaping spaces and other illegal characters, ... or trying a search instead of a normal HTTP GET request.)
Now you could try to do the same stuff in your code before passing the URL string to the URL class. But IMO, the correct solution if you are writing code to talk to a service is to just fix the URL. Put the correct protocol on the front ...
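For example (a minimal sketch; both forms yield an absolute URL the URL class will accept):
import java.net.URI;
import java.net.URL;

public class ResolveDemo {
    public static void main(String[] args) throws Exception {
        // Simplest fix: put the protocol on the front yourself.
        URL fixed = new URL("http://" + "maple.fm/api/2/search?server=1");

        // Or resolve the relative reference against an absolute base URI.
        URI base = URI.create("http://maple.fm/");
        URL resolved = base.resolve("api/2/search?server=1").toURL();

        System.out.println(fixed);
        System.out.println(resolved);
    }
}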
The 403 error you are now getting means Forbidden. The server is saying "you are not permitted to do this".
Check the documentation for the service you are trying to use. (Perhaps you need to go through some kind of login procedure. Perhaps what you are trying to do is only permitted for certain users, or something.)
Try the example URL on this page ... which incidentally works for me from my web browser.
When you say it does not have a set protocol, I am a little bit suspicious of what that means. If it can use multiple protocols, I would hope the API documentation mentions some way of determining what the protocol should be.
I hit the URL http://maple.fm/api/2/search?server=1 and it simply returns JSON over HTTP. I think your actual problem is that you are trying to open an InputStream to talk to a web server. I believe the solution to your problem, of handling JSON over HTTP, can be found here.
I decided to dig into this because I was curious. Combining this answer and this answer, we have the following code which will print out the JSON output from your URL. Of course, you still need a JSON library to parse it, but that's a separate problem.
import java.net.*;
import java.io.*;

public class Main {
    public static String getHTML(String urlToRead) {
        URL url;
        HttpURLConnection conn;
        BufferedReader rd;
        String line;
        String result = "";
        try {
            url = new URL(urlToRead);
            conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
            rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            while ((line = rd.readLine()) != null) {
                result += line;
            }
            rd.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

    public static void main(String[] args) {
        String url = "http://maple.fm/api/2/search?server=1";
        System.out.println(getHTML(url));
    }
}
You need to surround it with a try/catch block:
try {
    String url = "maple.fm/api/2/search?world=1";
    InputStream is = new URL(url).openStream();
} catch (MalformedURLException e) {
    e.printStackTrace();
}
