Extracting HTML from a JavaScript-protected website - Java

The website www.kissanime.to has an "Is JavaScript enabled in your browser?" protection, so to read the site's HTML you need a browser with JavaScript enabled. That means code like this won't work:
URL kissanime = new URL("http://www.kissanime.to/");
URLConnection ks = kissanime.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(ks.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);
in.close();
After a while of researching I found Selenium, a library that drives a browser (here via its HtmlUnit driver):
HtmlUnitDriver html = new HtmlUnitDriver();
String url = "https://www.kissanime.to/";
html.get(url);
String pageSource = html.getPageSource();
System.out.println(pageSource);
That works, but isn't there a better way to do this, for example with the Jsoup and Rhino libraries, where you make the initial connection with Jsoup and then use Rhino so it appears that JavaScript is enabled? Or, better yet, with Jsoup alone plus some cookies to bypass the protection?

Selenium is a pretty heavyweight solution for such a simple use case. If you need a basic engine that emulates a real browser with JavaScript enabled, then HtmlUnit (http://htmlunit.sourceforge.net/) is what you are looking for.
Here is a code fragment which scrapes data from google.com:
WebClient webClient = new WebClient();
HtmlPage googlePage = webClient.getPage("http://www.google.com/");
// possibly wait for the JavaScript to execute
String source = googlePage.asXml();
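To make that "wait for the JavaScript" step concrete, here is a minimal sketch, assuming HtmlUnit 2.x on the classpath and using the URL from the question; the options and the five-second wait are assumptions you would tune for the target site:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Keep going when the site's own scripts throw errors; many real pages do.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setJavaScriptEnabled(true);

        HtmlPage page = webClient.getPage("http://www.kissanime.to/");
        // Give background scripts up to 5 seconds to finish (arbitrary value).
        webClient.waitForBackgroundJavaScript(5000);

        System.out.println(page.asXml());
        webClient.close();
    }
}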

I do it the same way, and not only for that page. You can fetch the HTML by its URL like this, setting a User-Agent header so the request looks like it comes from a browser:
private String conexion(String link) {
    String content = null;
    URLConnection connection = null;
    try {
        connection = new URL(link).openConnection();
        // Pretend to be a regular browser.
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
        Scanner scanner = new Scanner(connection.getInputStream());
        scanner.useDelimiter("\\Z"); // read the whole stream as a single token
        content = scanner.next();
        scanner.close();
    } catch (IOException ex) {
        // "That chapter does not exist!"
        JOptionPane.showMessageDialog(null, "¡¡Ese capitulo no Existe!!", "Error", JOptionPane.ERROR_MESSAGE);
    }
    return content;
}
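For example, a hypothetical call from the same class, using the URL from the question:

String html = conexion("http://www.kissanime.to/");
if (html != null) {
    System.out.println(html);
}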

Related

How to scrape the price from dynamically updated webpages?

I have a problem when trying to scrape a price from dynamically updated web pages. The lion's share of the HTML is not received using approaches like URLConnection, Jsoup, or HtmlUnit.
I don't really know much about web scraping, but I guess the problem is that online shops like these:
Auchan,
Silpo
use JavaScript and AJAX to load the main product info. In my opinion the problem is a redirect or a delay that prevents getting the fully loaded HTML with all the needed data.
So, the question is: how do I scrape the price from the links above?
I have already tried several approaches:
1. URLConnection
URL url;
try {
    url = new URL("https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");
    URLConnection con = url.openConnection();
    InputStream is = con.getInputStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    try (FileWriter fileWriter = new FileWriter("output.html")) {
        while ((line = br.readLine()) != null) {
            fileWriter.write(line + "\n");
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
Runs fine, but returns HTML without the price data.
2. Jsoup
Document document = null;
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
try {
    document = Jsoup.connect(link).get();
} catch (IOException e) {
    e.printStackTrace();
}
if (document != null) {
    try (FileWriter fileWriter = new FileWriter("output.html")) {
        fileWriter.write(document.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Returns the same.
3. HtmlUnit
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
HtmlPage htmlPage = null;
try {
    htmlPage = webClient.getPage(link);
    webClient.waitForBackgroundJavaScript(5000);
} catch (IOException e) {
    e.printStackTrace();
}
if (htmlPage != null) {
    try (FileWriter fileWriter = new FileWriter("output.html")) {
        fileWriter.write(Jsoup.parse(htmlPage.asXml()).toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Returns a little bit more, including some JavaScript tags, but still nothing useful. Also, the code above throws so many exceptions that they don't even fit in the console.
I also tried setting the user agent like this:
java.net.URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
and this:
System.setProperty("http.agent", "");
You need to use Chrome's Dev tools to view the HTTP requests/responses
The page loads up tons of javascript. This in turn churns out a whole load of HTTP requests and waits for the responses: the first that looks interesting is:
https://auchan.ua/graphql, which is a POST request with an important HTTP header, referer: https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/. The response body for that request is: {"data":{"urlResolver":{"type":"PRODUCT","id":297668}}}
Taking that product ID value and searching for it in the subsequent responses, I found the ones that contain it. The responses are all escaped Unicode characters, but if you open the URLs in a browser the content is rendered.
This particular URL that starts with auchan.ua/graphql/?query=query%20getProductDetail... looked promising, and sure enough the special_price matches what's displayed on the page. So you'd need to find a way of generating/extracting these URLs from the initial page source.
link to product details
You may also find this response I gave useful for processing JSON data.
The second shop you linked to requires a username/password, but the process for getting the data will likely be very similar; use dev tools to view the HTTP requests, work out where the price info is coming from (find the value in one of the responses), then try to recreate the same request from the initial URL and the response returned.
Good luck!
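As an illustration of recreating such a request in Java, here is a minimal sketch: it issues a GET against a graphql product-detail URL with the referer header set and prints the JSON response. The full query string is omitted above, so productDetailUrl is a placeholder you would have to extract from the page yourself:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GraphqlFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder: the real URL starts with https://auchan.ua/graphql/?query=query%20getProductDetail...
        String productDetailUrl = "https://auchan.ua/graphql/?query=...";
        HttpURLConnection con = (HttpURLConnection) new URL(productDetailUrl).openConnection();
        // The referer header appeared to matter when replaying the request.
        con.setRequestProperty("Referer",
                "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");
        con.setRequestProperty("User-Agent", "Mozilla/5.0");

        StringBuilder json = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                json.append(line);
            }
        }
        // Parse with any JSON library (e.g. Jackson or org.json) and read the special_price field.
        System.out.println(json);
    }
}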

An issue with a URLConnection in Java

I'm trying to read out the source code of a website.
But there is an issue when I want to fetch this page, for example: "https://www.amazon.de/gp/bestsellers/pet-supplies/#2"
I have tried a lot, but I still just receive the code of https://www.amazon.de/gp/bestsellers/pet-supplies. Something is not working right, as I want to receive places 21-40 and not 1-20.
I'm using a URLConnection and a BufferedReader:
public String fetchPage(String urlS) {
    String s;
    String qc = ""; // start with an empty string, not null, so the result isn't prefixed with "null"
    try {
        URL url = new URL(urlS);
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0");
        BufferedReader reader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
        while ((s = reader.readLine()) != null) {
            qc += s;
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();
        qc = "receiving qc failed";
    }
    return qc;
}
Thank you in advance for your effort :)
The URL you're fetching contains an anchor (the #2 at the end). An anchor is a client-side concept, originally used to jump to a certain part of the page. Some web apps (mostly single-page apps) use the anchor to keep track of some sort of state (e.g. which page of products you're viewing).
Since the anchor is a client-side concept, the responding web server (or your browser/HTTP client library) just drops the anchor, as if you had actually requested https://www.amazon.de/gp/bestsellers/pet-supplies.
Bottom line is that you'll never get the second page this way... Good luck scraping Amazon though ;)
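A small sketch to illustrate, using only java.net.URL (nothing here is specific to Amazon): the fragment is available locally via getRef(), but it is not part of the request sent to the server:

import java.net.URL;

public class AnchorDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.amazon.de/gp/bestsellers/pet-supplies/#2");
        System.out.println(url.getPath()); // /gp/bestsellers/pet-supplies/  -> what the server sees
        System.out.println(url.getRef());  // 2                              -> client-side only
    }
}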

Read XML from valid URL not returned. Header formatting issue?

I am trying to use the code below to read from a valid URL. I can copy and paste the URL into my browser and it works perfectly (displays the XML), but when I try to access it programmatically it returns nothing (no data and no error). I have already tried to set the user agent as described in this post: Can't read in HTML content from valid URL, but it didn't fix my problem. If it matters, I am trying to do a single Eve API call. I believe the problem is that I do not have my headers formatted correctly and the Eve site is rejecting the query. I can access the data fine using PHP, but I recently had to change languages.
public static void readFileToXML(String urlString, String fName) {
    try {
        java.net.URL url = new java.net.URL(urlString);
        System.out.println(url);
        URLConnection cnx = url.openConnection();
        cnx.setAllowUserInteraction(false);
        cnx.setDoOutput(true); // note: on an HttpURLConnection this switches the request from GET to POST
        cnx.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/3.0.183.1 Safari/531.0");
        System.out.println(cnx.getContentLengthLong()); // a change suggested in the comments; returns -1

        InputStream is = cnx.getInputStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        File file = new File("C:\\Users\\xxx\\Desktop\\" + fName);
        BufferedWriter bw = new BufferedWriter(new FileWriter(file, false));
        String inputLine;
        while ((inputLine = br.readLine()) != null) {
            bw.write(inputLine);
            System.out.println(inputLine);
        }
        System.out.println("Finished read");
        bw.close();
        br.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }
}

Different behavior between my Java app and my browser when querying a URL

I would like to query a portlet from a weather service. While my browser gives me the correct page, my Java application does not.
Here is my code without try/catch:
public static void main(String[] args) throws IOException {
    // The portlet URL
    String cookieUrl = "http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionspluie/290190";
    HttpURLConnection connection = (HttpURLConnection) new URL(cookieUrl).openConnection();
    // Some proxy configuration
    // Basic properties set to perform my task, but they might be useless as it does not work as I would like
    connection.setRequestProperty("Accept-Charset", "UTF-8");
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
    BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    System.out.println(getHtml(reader));
}

private static String getHtml(BufferedReader reader) throws IOException {
    String html = "";
    for (String line; (line = reader.readLine()) != null;)
        html += line + "\n";
    return html;
}
You can test it with the portlet URL given as an example.
In the browser the response contains the correct "prévision de pluie pour Brest" (rain forecast for Brest).
In the Java app the response contains this page: http://france.meteofrance.com/NoCookie.htm
It seems to be a cookie issue. But how can I handle it? My first attempts to get the cookies and send them back were unsuccessful.
Any help please?
It looks like the site you are trying to access relies on cookies, which HttpURLConnection does not handle for you by default. A way around this issue is to use a library like HtmlUnit, which simulates a browser (cookies, JavaScript, etc.).
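A minimal sketch of that approach, assuming HtmlUnit 2.x on the classpath and using the URL from the question; the WebClient keeps cookies across requests automatically, which should avoid the NoCookie.htm page:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class MeteoFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Don't abort on the site's own script errors.
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        // Cookies set by the server are stored in the WebClient's cookie manager
        // and sent back on subsequent requests.
        HtmlPage page = webClient.getPage(
                "http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionspluie/290190");
        System.out.println(page.asXml());

        webClient.close();
    }
}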

HttpsURLConnection and POST

Some time ago I wrote this program in Python that logged into a website over HTTPS, took some info, and logged out.
The program was quite simple:
class Richiesta(object):
    def __init__(self, url, data):
        self.url = url
        self.data = ""
        self.content = ""
        for k, v in data.iteritems():
            self.data += str(k) + "=" + str(v) + "&"
        if self.data == "":
            self.req = urllib2.Request(self.url)
        else:
            self.req = urllib2.Request(self.url, self.data)
        self.req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6')
        self.req.add_header('Referer', baseurl + '/www/')
        self.req.add_header('Cookie', cookie)

    def leggi(self):
        while self.content == "":
            try:
                r = urllib2.urlopen(self.req)
            except urllib2.HTTPError, e:
                print("Errore del server, nuovo tentativo tra 15 secondi")  # "Server error, retrying in 15 seconds"
                time.sleep(15)
            except urllib2.URLError, e:
                print("Problema di rete, proverò a riconnettermi tra 20 secondi")  # "Network problem, retrying in 20 seconds"
                time.sleep(20)
            else:
                self.content = r.read().decode('utf-8')


def login(username, password):
    global cookie
    print("Inizio la procedura di login")  # "Starting the login procedure"
    url = "https://example.com/auth/Authenticate"
    data = {"login": "1", "username": username, "password": password}
    f = Richiesta(url, data)
    f.leggi()
Now, for some reason, I have to translate it into Java. Until now, this is what I've written:
import java.net.*;
import java.security.Security.*;
import java.io.*;
import javax.net.ssl.*;

public class SafeReq {
    String baseurl = "http://www.example.com";
    String useragent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6";
    String content = "";

    public SafeReq(String s, String sid, String data) throws MalformedURLException {
        try {
            URL url = new URL(s);
            HttpsURLConnection request = (HttpsURLConnection) url.openConnection();
            request.setUseCaches(false);
            request.setDoOutput(true);
            request.setDoInput(true);
            request.setFollowRedirects(true);
            request.setInstanceFollowRedirects(true);
            request.setRequestProperty("User-Agent", useragent);
            request.setRequestProperty("Referer", "http://www.example.com/www/");
            request.setRequestProperty("Cookie", "sid=" + sid);
            request.setRequestProperty("Origin", "http://www.example.com");
            request.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            request.setRequestProperty("Content-length", String.valueOf(data.length()));
            request.setRequestMethod("POST");

            OutputStreamWriter post = new OutputStreamWriter(request.getOutputStream());
            post.write(data);
            post.flush();

            BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                content += inputLine;
            }
            post.close();
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String leggi() {
        return content;
    }
}
The problem is the login doesn't work, and when I try to get a page that requires me to be logged in, I get the "Login Again" message.
The two classes seem to do the same thing, and I can't understand why I can't make the second one work... any ideas?
Where do you get your sid from? From the symptoms, I would guess that your session cookie is not passed correctly to the server.
See this question for possible solution: Cookies turned off with Java URLConnection.
In general, I recommend using Apache HttpClient for implementing HTTP conversations in Java (anything more complicated than a simple one-time GET or POST). See its code examples (I guess the "Form based logon" example is appropriate in your case).
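For reference, a minimal sketch of that form-based login with Apache HttpClient 4.x; the endpoint and form fields mirror the Python version above, and the credentials are placeholders:

import java.util.Arrays;
import java.util.List;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginExample {
    public static void main(String[] args) throws Exception {
        // Cookie store shared across requests, so the session cookie set by the login response is reused.
        BasicCookieStore cookieStore = new BasicCookieStore();
        try (CloseableHttpClient client = HttpClients.custom().setDefaultCookieStore(cookieStore).build()) {
            HttpPost login = new HttpPost("https://example.com/auth/Authenticate");
            List<NameValuePair> form = Arrays.<NameValuePair>asList(
                    new BasicNameValuePair("login", "1"),
                    new BasicNameValuePair("username", "myUser"),      // placeholder credentials
                    new BasicNameValuePair("password", "myPassword"));
            login.setEntity(new UrlEncodedFormEntity(form));
            try (CloseableHttpResponse response = client.execute(login)) {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
            // Later requests made with the same client automatically send the session cookie.
        }
    }
}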
For anyone looking for this in the future, take a look at HtmlUnit.
This answer has a nice example.
