Reading HTML using jsoup

I am trying to get an HTML element from a website using Jsoup, but the HTML returned by Jsoup.connect(url) is incomplete compared to what I see in the browser's inspector on the same page.
EDIT: this is the link I'm working with: https://www.facebook.com/livemap##35.831640894,24.82275312499999,2z
The numbers at the end are the coordinates of the map, and you don't have to sign in to access the page, so this is not an authentication problem.
UPDATE:
I have found that the element I want is not expanded in the Jsoup output. Is this a problem related to slow page loading? If so, how can I make sure that Jsoup.connect(url) fully loads the web page before fetching the HTML?
In the inspector, the <div id="u_0_e"> is expanded.
In the Jsoup.connect output, the <div id="u_0_e"> is not expanded.

Jsoup doesn't execute JavaScript or jQuery events, so you get the initial page as the server sends it, before any JavaScript has run.

Related

Going to next page on an aspx form with JSoup

I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate these types of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.

This is not a job for Jsoup! Jsoup is a parser with a nice DOM API that lets you deal with wild HTML as if it were well-formed and not riddled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
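private void scrape(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Analyze current document content here...
    // Then continue
    // "#" selects by id, which is how the question locates the button ("." would select by class)
    for (Element link : doc.select("#ctl00_MainContent_btnNext")) {
        // absUrl resolves a relative href against the page URL; it is empty if there is no href
        String next = link.absUrl("href");
        if (!next.isEmpty()) {
            scrape(next);
        }
    }
}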
But in the general case, what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example, Selenium:
WebDriver driver = new FirefoxDriver();
driver.get(url); // load the page first so the browser can run its scripts
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.

Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare it with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get. Again, make your code behave exactly the same as your browser.
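As a rough illustration of that step, the sketch below assumes the hidden fields have already been read from the form into a map, overrides one of them, and URL-encodes the result as a POST body. The class and method names are made up for this example, and the __EVENTTARGET value is only a guess at what a "next" control would post, so check the real value in your browser's network tab. It uses only the JDK (URLEncoder.encode with a Charset needs Java 10 or later).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PostbackBody {

    /** Copies the fields read from the page and overrides (or adds) one of them. */
    static Map<String, String> withField(Map<String, String> pageFields, String name, String value) {
        Map<String, String> fields = new LinkedHashMap<>(pageFields);
        fields.put(name, value);
        return fields;
    }

    /** Encodes the fields as an application/x-www-form-urlencoded request body. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Values as they might be read from the form's hidden inputs (shortened here).
        Map<String, String> pageFields = new LinkedHashMap<>();
        pageFields.put("__EVENTTARGET", "");
        pageFields.put("__EVENTARGUMENT", "");
        pageFields.put("__VIEWSTATEGENERATOR", "32423F7A");

        // Assumption: the site expects the name of the "next" control in __EVENTTARGET.
        Map<String, String> next = withField(pageFields, "__EVENTTARGET", "ctl00$MainContent$btnNext");
        System.out.println(encode(next));
    }
}
```

The encoded string is what you would send as the body of the POST, with the Content-Type header set to application/x-www-form-urlencoded and the cookies from the earlier responses.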
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There will be more data parameters to get from each response, but a lot depends on the site you are targeting.
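Searching for the next "|" works for values such as __VIEWSTATE that never contain a pipe, but the HTML fragments in the response can contain one. Judging from the sample above, each record seems to have the shape length|type|id|content|, so a more careful reader can use the length to skip over the content. The sketch below assumes exactly that layout, with the length counted in characters; the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeltaResponse {

    /** Collects every hiddenField record (id -> content) from a length|type|id|content| response. */
    static Map<String, String> hiddenFields(String response) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        while (pos < response.length()) {
            // Locate the three pipes that end the length, type and id parts.
            int afterLength = response.indexOf('|', pos);
            if (afterLength < 0) break;
            int afterType = response.indexOf('|', afterLength + 1);
            if (afterType < 0) break;
            int afterId = response.indexOf('|', afterType + 1);
            if (afterId < 0) break;

            int length;
            try {
                length = Integer.parseInt(response.substring(pos, afterLength).trim());
            } catch (NumberFormatException e) {
                break; // not the layout this sketch assumes
            }
            String type = response.substring(afterLength + 1, afterType);
            String id = response.substring(afterType + 1, afterId);

            // The content is exactly 'length' characters long and may itself contain pipes.
            int start = afterId + 1;
            int end = start + length;
            if (length < 0 || end > response.length()) break;
            if ("hiddenField".equals(type)) {
                fields.put(id, response.substring(start, end));
            }
            pos = end + 1; // skip the pipe that closes the record
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "3|updatePanel|pnl|a|b|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A|";
        System.out.println(hiddenFields(sample));
    }
}
```

The returned map can then feed the next request, for example request.setDataParameter("__VIEWSTATE", fields.get("__VIEWSTATE")).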
Hope this helps a little.

Parsing modern web pages (javascript/html5/json) using java

I used to have a tool that parsed Yahoo Finance web pages using jsoup.
Recently Yahoo changed the layout of their pages, and now the page is full of JavaScript and what looks like JSON data.
Please see example here:
http://finance.yahoo.com/quote/AAPL/financials?ltr=1
Inspecting the page in Chrome shows a different view (after the JavaScript has executed and the DOM has been built) than what the document looks like in jsoup:
Document d = Jsoup.connect(link).get();// link same as above
Element body = d.body();
This returns an Element (body) containing a huge document that looks like:
<div class="footer Py(10px) Ta(c) Bgc(#fff) Py(0) BdT Bdc($lightGray)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer">
<div class="Fz(s) Py(20px) " data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0">
<div class="Pb(10px) D(b)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0">
<a class="Mend(10px)" href="http://help.yahoo.com/kb/index?page=content&y=PROD_FIN&locale=en-US&id=SLN2310&pir=Zm7qO7BibUkC.4dK5GxJ95B3DCru2iA5odBNM0pj" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0.0">
Any idea how I can parse this type of document in Java? I suspect I need to run it through a JavaScript engine first and then parse the outcome, or maybe there is another way.

Jsoup href with function jscript

I'm using the JSoup library to extract some data from an HTML page, and now I need to jump to the next page. The link to it is on the following line:
<a class="jsEnabled nextBtn cursorPointer" href="javascript:setSelectedLink('NextPageButton');" title="Next page" alt="Next page"></a>
That is, the link is a JavaScript function call. How do I get the actual link dynamically?

Unfortunately, jsoup can't execute JavaScript. But you can use other libraries, e.g. HtmlUnit, to do so.
Did you check whether the website offers a plain HTML alternative that lets you get to the next page?
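If you want to see what the link would do before reaching for HtmlUnit, a small first step is to pull the function name and its argument out of the href. The sketch below does that with a regular expression from the JDK; the class name is made up, and it only handles the simple one-string-argument form shown in the question. What setSelectedLink actually does with 'NextPageButton' still has to be looked up in the site's scripts or in the browser's network tab.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsHref {

    // Matches hrefs of the form javascript:someFunction('someArgument');
    private static final Pattern CALL =
            Pattern.compile("^javascript:\\s*(\\w+)\\(\\s*'([^']*)'\\s*\\)\\s*;?\\s*$");

    /** Returns {functionName, argument}, or empty if the href is not a simple one-argument call. */
    static Optional<String[]> parse(String href) {
        Matcher m = CALL.matcher(href.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2)});
    }

    public static void main(String[] args) {
        // The href value from the question, e.g. as returned by jsoup's element.attr("href").
        String href = "javascript:setSelectedLink('NextPageButton');";
        parse(href).ifPresent(call ->
                System.out.println("function=" + call[0] + ", argument=" + call[1]));
    }
}
```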

Fetching the website with Jsoup - page view source and Jsoup shows different content

I use Jsoup to scrape the website:
doc = Jsoup.connect(String.valueOf(urls[0])).userAgent("Mozilla").get();
Here is the link:
http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40
I have added the rpp=40 parameter to the link in the command line to display 40 results per page. I'm able to see all the results in page view source.
I know that Jsoup handles static content only and cannot fetch content that a site generates with AJAX/JS libraries. However, why can't Jsoup retrieve the same content that I see in the browser via page view source? Page view source shows 40 results, whereas Jsoup retrieves elements from only 10 results. How can I obtain every element visible via page view source?

Short answer: Jsoup can't execute JavaScript.
Long answer:
http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40
The web page you are looking at accepts an HTTP GET with parameters. A normal browser sends the parameters and loads the page, but not with Willowbrook checked (as in your example). The page loads its JavaScript after the initial load, and that JavaScript ticks the filter check box and filters the search results. Jsoup therefore gets the results for 'state=IL' without the 'willowbrook' filter applied.
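One thing worth checking with this particular URL: everything after the "#" is a fragment, which HTTP clients (browsers and Jsoup alike) keep on the client side and do not send to the server. The sketch below splits the URL with java.net.URI to show which parameters end up in the query and which in the fragment; the class and method names are made up for the example.

```java
import java.net.URI;

public class FragmentDemo {

    static final String YELP_URL =
            "http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1"
            + "#l=p:IL:Willowbrook::&sortby=rating&rpp=40";

    /** The query string: this part is sent to the server. */
    static String query(String url) {
        return URI.create(url).getRawQuery();
    }

    /** The fragment: this part stays in the client and is only visible to scripts in the page. */
    static String fragment(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        System.out.println("query:    " + query(YELP_URL));
        System.out.println("fragment: " + fragment(YELP_URL));
    }
}
```

Here rpp=40 and sortby=rating sit in the fragment, so only JavaScript running in the browser can act on them.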

How to get HTML tag After rendering the html on webpage using java or javascript or xslt

How do I get the HTML source code that was rendered by JavaScript in a web page? How should I proceed: using XSL, JavaScript, or Java?

Get entire HTML in current page:
function getHTML() {
    var D = document, h = D.getElementsByTagName('html')[0], e;
    if (h.outerHTML) return h.outerHTML;
    e = D.createElement('div');
    e.appendChild(h.cloneNode(true));
    return e.innerHTML;
}
outerHTML was historically a non-standard property and may be unsupported in some browsers (e.g., older versions of Firefox). In that case this function mimics outerHTML by cloning the html node into an unattached element and reading its innerHTML property.

JavaScript provides
document.getElementsByTagName('')
which returns every element with the given tag name. If you want to operate on one specific tag, assign an id to it; then you can use document.getElementById('') to work with it.
Reading these elements (for example, their innerHTML) gives you the source code.
