HTML contents different from Google "View page source"

I've read that this has something to do with the user agent being used, but I can't find a way to get the one Google Chrome uses.
I'm trying to fetch the HTML contents of, say, https://www.kayak.fr/flights/TLS-ATH/2019-10-04/2019-10-07?sort=price_a. When I click "View page source" in Google Chrome, I get the prices etc. (which is what I need), but I can't access them from my Java code.
Do I have to find the user agent of my Google Chrome? I found one online, but I get exactly the same result as before from Java.
Any ideas?
Here's my code:
try {
    URL url = new URL("https://www.kayak.fr/flights/TLS-ATH/2019-10-04/2019-10-07?sort=price_a");
    URLConnection con = url.openConnection();
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36");
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        System.out.println(line);
    }
    bufferedReader.close();
} catch (IOException e) {
    e.printStackTrace();
}
The User-Agent value passed to setRequestProperty is fairly arbitrary because I'm still testing.

Related

Java 403 Exception When My Bot Tries To Send An Embedded Link

I have a Discord bot written in Java, and one of its purposes is to send an embedded link (I don't own the site) every time someone leaves the server. It worked the first 2-3 times, and every time after that I get the following exception:
java.io.IOException: Server returned HTTP response code: 403 for URL: ...
Example link:
https://signature.hzgaming.net/sig.php?name=Juntao_Lubu&style=1
I tried numerous solutions I found online (User-Agent headers and the like), but none of them seem to work for me.
Is there any workaround for this?
Code:
String link = "https://signature.hzgaming.net/sig.php?name=" + allMembers.get(mEvent.getUser().getDiscriminatedName()).replace(" ", "_") + "&style=1";
URLConnection urlCon = new URL(link).openConnection();
urlCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.29 Safari/537.36");
InputStream is = urlCon.getInputStream();
StringBuilder textBuilder = new StringBuilder();
Reader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
int c = 0;
while ((c = reader.read()) != -1) {
    textBuilder.append((char) c);
}
String result = textBuilder.toString().replaceAll("<[^>]*>", "");
if (!result.equalsIgnoreCase("Non-Existant Player") && !result.equalsIgnoreCase("Non-ExistantPlayer")) {
    new MessageBuilder().append(link).send((TextChannel) server.getChannelById(973242211623895080L).get());
}
Thanks in advance.

Java - Read page source from url returns unknown characters

I am using the code below to read a page's source from a URL (https://www.amazon.com) with the "UTF-8" charset in NetBeans, but it returns unknown characters. I have no idea what the problem is and would be grateful for help modifying the code so that it works properly. Thanks.
public static String getURLSource(String url) throws IOException
{
    URL urlObject = new URL(url);
    URLConnection urlConnection = urlObject.openConnection();
    urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    return toString(urlConnection.getInputStream());
}
private static String toString(InputStream inputStream) throws IOException
{
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
    {
        String inputLine;
        StringBuilder stringBuilder = new StringBuilder();
        while ((inputLine = bufferedReader.readLine()) != null)
        {
            stringBuilder.append(inputLine);
        }
        return stringBuilder.toString();
    }
}

Use HttpsURLConnection instead of URLConnection.

You just need to unzip your content. Here is the code that worked for me:
HttpClient httpClient = new HttpClient();
try {
    httpClient.setConnectionUrl("https://www.amazon.com");
    ByteBuffer buff = httpClient.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11")
            .sendHttpRequestForBinaryResponse(HttpClient.HttpMethod.GET);
    try (
        ByteArrayInputStream bais = new ByteArrayInputStream(buff.array());
        GZIPInputStream gzis = new GZIPInputStream(bais);
        InputStreamReader isr = new InputStreamReader(gzis);
        BufferedReader br = new BufferedReader(isr)
    ) {
        br.lines().forEach(line -> System.out.println(line));
    }
} catch (Exception e) {
    System.out.println(httpClient.getLastResponseCode() + " "
            + httpClient.getLastResponseMessage() + TextUtils.getStacktrace(e, false));
}
Just a few clarifications: in this example I use a third-party HTTP client class, HttpClient (and also the class TextUtils). Both come from the open-source MgntUtils library, written and maintained by me. But you don't have to use it. The main point is to read the data from the InputStream as binary (as a byte array or ByteBuffer) and then unzip it with GZIPInputStream, as in my example.
If you do want to use the MgntUtils library, you can get it as a Maven artifact or from GitHub (including source code and Javadoc). The Javadoc is also available online.
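
For readers who prefer to stay with the JDK, the same idea can be sketched without any third-party library. This is only an illustration under stated assumptions: the class and method names (GzipAwareFetch, readBody, fetch) are made up for this example, the page is assumed to be UTF-8, and Java 9 or later is assumed for readAllBytes. The body is unzipped only when the response actually declares Content-Encoding: gzip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipAwareFetch {

    /** Reads the whole body as text, unzipping it first if it is declared as gzip. */
    public static String readBody(InputStream raw, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(raw) : raw;
        try (in) {
            // Assumption: the page is UTF-8; adjust if the server declares another charset.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical helper: fetches a URL and decodes the body according to Content-Encoding. */
    public static String fetch(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) URI.create(url).toURL().openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0");
        con.setRequestProperty("Accept-Encoding", "gzip");
        try {
            return readBody(con.getInputStream(), con.getContentEncoding());
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: gzip a small document in memory, then decode it again.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(readBody(new ByteArrayInputStream(buf.toByteArray()), "gzip"));
    }
}
```

Checking getContentEncoding() rather than always wrapping the stream avoids a "Not in GZIP format" failure when the server sends an uncompressed body.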

403 forbidden for url in java but not in browser

I'm behind a corporate firewall, but I can paste the URL into my browser, with or without my proxy settings enabled in the browser, and retrieve the data fine. I just can't do it from Java.
Any ideas?
Code:
private static String getURLToString(String strUrl) throws IOException {
    // LOG.debug("Calling URL: [" + strUrl + "]");
    String content = "";
    URLConnection connection = new URL(strUrl).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.connect();
    BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
        content += inputLine;
    }
    br.close();
    return content;
}
Error:
java.io.FileNotFoundException: Response: '403: Forbidden' for url: '<url here>'
at weblogic.net.http.HttpURLConnection.getInputStream(HttpURLConnection.java:778)
at weblogic.net.http.SOAPHttpURLConnection.getInputStream(SOAPHttpURLConnection.java:37)
Note: The '<url here>' portion is for anonymizing.

As you are receiving a "403: Forbidden" error, it means that your Java code can reach the URL, but it lacks something that is required to access it.
In the browser, press F12 (developer/debug mode) and request the URL again. Check the headers and cookies that are being sent. Most likely you will need to add one of these for you to be able to receive the content you need.

Adding a "User-Agent" header fixed it for me:
connection.setRequestProperty("User-Agent", "Mozilla/5.0");
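
Both answers come down to sending what the browser sends. A minimal sketch of that approach follows. The class name BrowserLikeRequest, its methods, and the header values are placeholders chosen for this example, not values known to satisfy any particular server; copy the real ones from the browser's network tab. The describeFailure helper reads the error stream so that the body of a 403 response can be inspected instead of being lost in the exception. Java 9 or later is assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserLikeRequest {

    /** Placeholder headers; replace the values with those seen in the browser's developer tools. */
    public static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.8");
        return headers;
    }

    /** Copies every header onto the connection; must be called before the request is sent. */
    public static void apply(URLConnection con, Map<String, String> headers) {
        headers.forEach(con::setRequestProperty);
    }

    /** Returns status line and error body, which often explains why the server refused. */
    public static String describeFailure(HttpURLConnection con) throws IOException {
        int status = con.getResponseCode();
        try (InputStream err = con.getErrorStream()) {
            String body = err == null ? "" : new String(err.readAllBytes(), StandardCharsets.UTF_8);
            return status + " " + con.getResponseMessage() + "\n" + body;
        }
    }

    public static void main(String[] args) throws IOException {
        // openConnection() only creates the object; nothing is sent until the response is read.
        URLConnection con = URI.create("http://example.com/").toURL().openConnection();
        apply(con, browserHeaders());
        System.out.println(con.getRequestProperty("User-Agent"));
    }
}
```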

Java HttpURLConnection trying to login with cookie

I am trying to log in to this website using Java, but for some reason it is not working as expected. I set the Host header and so on, but the request for the account page with the cookie still returns the login page. And yes, my account info is correct. Any help would be great.
public static void main(String[] args) {
    try {
        String params = "loginEmail=private#hotmail.com&loginPassword=privatepassword&Submit=Sign+In";
        String urls = "http://www.filefactory.com/member/signin.php";
        URL url = new URL(urls);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded");
        connection.setRequestProperty("Host", "www.filefactory.com");
        connection.setRequestProperty("Referer", "http://www.filefactory.com/member/signin.php");
        connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36 OPR/25.0.1614.50");
        connection.setRequestProperty("Content-Length", "" +
                Integer.toString(params.getBytes().length));
        connection.setRequestProperty("Content-Language", "en-US");
        connection.setDoInput(true);
        connection.setDoOutput(true);
        // Send request
        DataOutputStream wr = new DataOutputStream(
                connection.getOutputStream());
        wr.writeBytes(params);
        wr.flush();
        wr.close();
        // Get response
        InputStream is = connection.getInputStream();
        BufferedReader rd = new BufferedReader(new InputStreamReader(is));
        String line;
        StringBuilder response = new StringBuilder();
        while ((line = rd.readLine()) != null) {
            response.append(line);
            response.append('\r');
        }
        rd.close();
        System.out.println(response.toString());
        // get the cookie if needed, for login
        String cookies = connection.getHeaderField("Set-Cookie");
        // open the new connection again
        connection = (HttpURLConnection) new URL("http://www.filefactory.com/account/").openConnection();
        connection.setRequestProperty("Cookie", cookies);
        connection.addRequestProperty("Accept-Language", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        connection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36 OPR/25.0.1614.50");
        connection.addRequestProperty("Host", "www.filefactory.com");
        System.out.println("Redirect to URL : " + "http://www.filefactory.com/account/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        String inputLine;
        StringBuilder html = new StringBuilder();
        while ((inputLine = in.readLine()) != null) {
            html.append(inputLine);
        }
        in.close();
        System.out.println("URL Content... \n" + html.toString());
        System.out.println("Done");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
}

You are using: String cookies = connection.getHeaderField("Set-Cookie");
Are you sure there is only one entry for that header? There could be more.
http://en.wikipedia.org/wiki/HTTP_cookie
Try logging in manually with Chrome or Firefox and capture the request and response. That may give you some hints about what is going wrong.
Additionally, you could use a tool to inspect the traffic between your client and the server (unless you are already doing so).
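
To illustrate the point about multiple entries: getHeaderField("Set-Cookie") returns only one value even when the server sets several cookies, whereas getHeaderFields() exposes all of them. The sketch below collects every Set-Cookie value into a single Cookie request header. The class and method names (CookieHeaderBuilder, toCookieHeader) are invented for this example, and attributes such as Path, Domain, and expiry are deliberately ignored, which a real client should not do.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CookieHeaderBuilder {

    /** Turns every Set-Cookie value of a response into one Cookie request header value. */
    public static String toCookieHeader(Map<String, List<String>> responseHeaders) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : responseHeaders.entrySet()) {
            // Header names are case-insensitive; the status line is stored under a null key.
            if (entry.getKey() == null || !entry.getKey().equalsIgnoreCase("Set-Cookie")) {
                continue;
            }
            for (String value : entry.getValue()) {
                // HttpCookie.parse strips attributes such as Path and HttpOnly for us.
                for (HttpCookie cookie : HttpCookie.parse(value)) {
                    pairs.add(cookie.getName() + "=" + cookie.getValue());
                }
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers =
                Map.of("Set-Cookie", List.of("sid=abc123; Path=/; HttpOnly", "lang=en; Path=/"));
        System.out.println(toCookieHeader(headers));
    }
}
```

With a live connection, the result would be passed along as connection.setRequestProperty("Cookie", CookieHeaderBuilder.toCookieHeader(previous.getHeaderFields())), where previous is the connection whose response set the cookies.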

It's hard to tell without knowing exactly how that website works, but note that it sends you cookies first, when it presents the login page. When you submit your credentials, you have to send those cookies along with them, so that the server can associate the credentials with that cookie.
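
One way to get this behavior without copying cookies by hand is the JDK's java.net.CookieManager: once installed as the default CookieHandler, HttpURLConnection stores cookies from each response and sends them back on later requests to the same site. The sketch below is only an illustration; the class name SessionCookies and the example.com URLs are made up, and whether the real site needs anything beyond cookies (hidden form fields, for instance) is unknown. The main method simulates a response offline instead of contacting a server.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class SessionCookies {

    /** Installs a JVM-wide cookie jar so HttpURLConnection stores and replays cookies itself. */
    public static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) throws IOException {
        CookieManager jar = install();

        // Simulate the login page's response (hypothetical host and cookie).
        URI login = URI.create("http://www.example.com/member/signin.php");
        jar.put(login, Map.of("Set-Cookie", List.of("SESSION=abc; Path=/")));

        // Ask which headers the jar would add to a later request on the same host.
        URI account = URI.create("http://www.example.com/account/");
        System.out.println(jar.get(account, Map.of()));
    }
}
```

With the jar installed, the flow described above becomes three ordinary requests in order: GET the login page, POST the credentials, then GET the account page, with no manual Cookie header in between.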

Java - handling foreign characters

So, I have some Java code that fetches the contents of an HTML page as follows:
BufferedReader bf;
String response = "";
HttpURLConnection connection;
try
{
    connection = (HttpURLConnection) url.openConnection();
    connection.setInstanceFollowRedirects(false);
    connection.setUseCaches(false);
    connection.setRequestMethod("GET");
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24");
    connection.connect();
    bf = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String line;
    while ((line = bf.readLine()) != null) {
        response += line;
    }
    connection.disconnect();
}
catch (Throwable ex)
{
    response = "";
}
This works perfectly fine and returns the content as required. I then drill down to the part of the page that I want to extract, which is as follows:
10€ de réduction chez Asos be!
Java seems to handle the € fine since it is an HTML entity. The word "réduction" is problematic, though. It is rendered as:
10€ de r�duction chez Asos be!
As you can see, it is struggling to handle the "é" character.
How do I go about solving this? I've been searching the internet and playing around with the code for the past few hours, but with no luck whatsoever. I'm very new to Java, so it's all quite difficult to get my head around.
Thanks in advance.

That code is OK, but you might need to detect the character encoding of the response and pass it to the InputStreamReader that wraps the input stream; without an explicit charset, InputStreamReader uses the JVM's default charset.
Otherwise, the problem is not in reading the response but in what you do with the response string afterwards.
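
A minimal sketch of the first suggestion, assuming the server declares its encoding in the Content-Type header (pages that declare it only in a meta tag would need extra handling). The class and method names (ResponseCharset, fromContentType) are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResponseCharset {

    // Matches charset=ISO-8859-1 as well as charset="utf-8", ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts the charset from a Content-Type value, or returns the fallback if none is usable. */
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Offline demonstration: bytes as an ISO-8859-1 server would send them.
        byte[] body = "10 de r\u00e9duction".getBytes(StandardCharsets.ISO_8859_1);
        Charset cs = fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(new String(body, cs));
    }
}
```

In the question's code, the reader would then be created as new InputStreamReader(connection.getInputStream(), ResponseCharset.fromContentType(connection.getContentType(), StandardCharsets.UTF_8)).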
