Reading dynamic web page content in java - java

I need help reading the contents of a webpage. Currently i am using the following method to read the contents
BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
{Content = Content + inputLine;}
However with this method there is a problem. . some jsp pages have ajax in them which randomly updates a css class of a webpage like so
Javascript code just to give an idea:
if (request.readyState === 4 && request.status === 200)
{
var type = request.getResponseHeader("Content-Type");
$('.update').empty();
$('.update').append(request.responseText); //update the css class
}
So as a result when this page reader is read through my java method as mentioned above i just get
<div class="update"></div>
although on the screen this class has a value.
Now however if i save the page first (by clicking save as in Firefox) then the values appended in the CSS class by jquery are also visible.
Is there a method or a way on how i could read the values or obtain the values like firefox does by saving the pages.. I want to read the contents of the entire webpage with the Ajax values present in the string.
On one side i read that this is difficult since the JAvascript in rendered and executed by the browser so i wanted to know does firefox have any apis that might help ? Any suggestions would be appreciated.

You may find the following project useful:
HTMLUnit
Here is also a very informative blog post from Data Big Bang.

Also check out PhantomJS. In the same way that Crowbar is a headless Mozilla browser, PhantomJS is a headless WebKit browser - the engine that Safari and Google Chrome use.

Related

Going to next page on an aspx form with JSoup

I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate these type of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not Jsoup work! Jsoup is a parser with a nice DOM API that allows you to deal with wild HTML as if it were well-formed and not crippled with errors and nonsenses.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
private void scrape(String url) {
Document doc = Jsoup.connect(url).get();
// Analyze current document content here...
// Then continue
for (Element link : doc.select(".ctl00_MainContent_btnNext")) {
scrape(link.attr("href"));
}
}
But in the general case what you want to do requires far more functionality that Jsoup provides: a user agent capable of interpreting HTML, CSS and Javascript with a scriptable API that you can call from your app to simulate a click. For example Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get - again, make your code behave exactly the same as your browser
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There are will be more data parameters to get from each response. But a lot depends on the site you are targeting.
Hope this helps a little.

Jsoup - read from an html url where code is hidden

I'm trying to using the jsoup library to get 'li' from a website. The problem is this:
If I open the source of website with CTRL+U(which is the same read by jsoup), the 'ul' tag is hidden.
if I open the code with the fuction "inspect code" of google chrome,'li' are shown.
Posting the code is not necessary; I only want to know how can access to this 'li' with jsoup or other java free libraries, Whereas in the source code(and through jsoup) these informations are hidden.
The site is https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco and try to search something(i.e. Tachi)
The problem with Jsoup is that it won't handle scripts. It is just getting html as it is before the AJAX code is executed.
You can use something like HtmlUnit, which is basically a GUI-less browser. So, it can handle scripts.
You can try something like this after getting the HtmlUnit library:
String url = "https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco?search=Tachi";
try(final WebClient webClient = new WebClient()) {
final HtmlPage page = webClient.getPage(url);
final HtmlUnorderedList list = page.getHtmlElementById("ul_farm_results");
System.out.println(list.asText());
}
I couldn't check the code as the website's certificate is improperly configured and I didn't want to import it's certificate. You may want to take a look at this to resolve the certificate errors.
JSoup does not execute all the scripts, it just gets the HTML returned by the server. What you are looking for is call rendered HTML, that is the HTML produced by the browser after executing all the scripts.
The best solution in Java is to use Selenium with your preferred browser. Selenium was developed for UI testing, it is however very popular as a scraping tool.
A good getting started page is to be found here.
Some code example with Firefox:
WebDriver driver = new FirefoxDriver();
driver.get("https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco");
// Find the element
String id = "ul_farm_results";
WebElement element = driver.findElement(By.id(id));

Java HTML Parsing a Page with Infinite Scroll

How can I grab a page's HTML in java if the page has infinite scroll? I'm currently grabbing a page this way:
URL url = new URL(stringUrl);
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();
String encoding = con.getContentEncoding();
encoding = encoding == null ? "UTF-8" : encoding;
String html = IOUtils.toString(in, encoding);
Document document = Jsoup.parse(html);
But it doesn't return any of the content associated with the infinite scroll section of the page. How can I trigger this scrolling on the HTML page so that my Jsoup document contains this section?
Infinite scroll describes a technique where the page does not contain the content. Some JavaScript code runs in the browser, sends a request to the server for addiional content and adds it to the page. When you scroll towards the end of the available content, the JavaScript code repeats the process: it sends another request and adds the additional content.
Therefore, you need a web browser with a JavaScript engine that can run the JavaScript code and produce the events that cause the code to load content.
#dsh is right, the content is most likely loaded dynmically via AJAX. As an alternative to using a real browser, i.e. selenium webdriver, you may look into the network traffic and idetify the API call that the page triggers. You can maybe call that Api directly with Jsoup. Often the content is however not HTML but JSON, XML or some other format. It still may be very worth while doing this, since using webdriver is often pretty slow and resource-heavy.

getByXpath() not working inside frame

i am new to Htmlunit and trying to extract data from a website http://capitaline.com/new/index.asp. I have logged into the website successfully. When we log into website there are three frames.
One on the top to search for the company(like ACC ltd.) for which we are extracting data.
2nd frame has a tree which provide links to various data we want to look at.
3rd frame has the resulted data outcome on the basis of link you clicked in frame.
I managed to get the frame i need below:
HtmlPage companyAtGlanceTopWindow =(HtmlPage)companyAtGlanceLink.click().getEnclosingWindow().getTopWindow().getEnclosedPage();
HtmlPage companyAtGlanceFrame = (HtmlPage)companyAtGlanceTopWindow.getFrameByName("mid2").getEnclosedPage();
System.out.println(companyAtGlanceFrame.toString()); // This line returns the frame URL as i can see in my browser.
Output of print statement is
HtmlPage(http://capitaline.com/user/companyatglance.asp?id=CGO&cocode=6)#1194282974
Now i want my code to navigate down to the table inside this frame and for that i am using getByXPath() but it gives me nullPointerException. Here is the code for that.
HtmlTable companyGlanceTable1 = companyAtGlanceFrame.getFirstByXPath("/html/body/table[4]/tbody/tr/td/table/tbody/tr/td[1]/table");
My XPath for the current webpage(after i clicked the link)from which i am trying to extract table is seems correct, as it is copied from chrome element inspect. Please suggest some way to extract the table. I have done this type of extraction before but there i had id of table so, i used it.
Here is the HTML code for the table in the webpage.
<table width="100%" class = "tablelines" border = "0" >
I want to know that can you see the inner contents of each iframes in console (print asXml()), are they nested iframes?
well try this
List<WebWindow> windows = webClient.getWebWindows();
for(WebWindow w : windows){
HtmlPage hpage = (HtmlPage) w.getEnclosedPage();
System.out.println(hpage.asXml());
}
once you can see the contents,
HtmlPage hpage = (HtmlPage)webClient.getWebWindowByName(some_name).getEnclosedPage();
then using xpath grab your table contents(make sure your xpath is correct). It will work.(worked for me)
Thank you RDD for your feedback.
I solved the problem. Actually issue was not with the frame but with the XPath provided by chrome.
XPath Provided by chrome is:
/html/body/**table[4]**/tbody/tr/td/table/tbody/tr/td[1]/table
But the XPath worked for me is:
/html/body/**table[3]**/tbody/tr/td/table/tbody/tr/td[1]/table
It seems as, XPath provided by chrome has some glitch when there is a table within the path(Or may be some bug in htmlunit itself). I did many experiments and found that chrome always gives ../../table[row+1]/.. as XPath, while working XPath for htmlunit is ../../table[row]/..
SO, this code is working fine for me
HtmlTable companyGlanceTable1 = companyAtGlanceFrame.getFirstByXPath("/html/body/table[3]/tbody/tr/td/table/tbody/tr/td[1]/table");

The Web Browser show the correct values but when I use Jsoup the HTML doesn't have the values

I'm trying to get some values from a site but these values only appears when I use a Browser, like Mozilla. When I use the Jsoup I can get the HTML from the site but without values, only with the tags.
This is the site I'm trying to parse:
http://www.submarinoviagens.com.br/Passagens/selecionarvoo?Origem=nat&Destino=mia&Data=05/11/2012&Hora=&Origem=mia&Destino=nat&Data=09/11/2012&Hora=&NumADT=1&NumCHD=0&NumINF=0&SomenteDireto=0&Cia=&SelCabin=&utm_source=&utm_medium=&utm_campaign=&CPId=
I'm trying to get the values that appears inside these span tags:
If I access the previous URL from a web browser I can see the following values: '', 'R$ 2634,22' and 'R$ 2634,22', but when I use the following code the values disapears.
URL url = new URL("http://www.submarinoviagens.com.br/Passagens/selecionarvoo?Origem=nat&Destino=mia&Data=05/11/2012&Hora=&Origem=mia&Destino=nat"+
"&Data=09/11/2012&Hora=&NumADT=1&NumCHD=0&NumINF=0&SomenteDireto=0&Cia=&SelCabin=&utm_source=&utm_medium=&utm_campaign=&CPId=");
Document doc = Jsoup.parse(url, 100000);
String title = doc.title();
System.out.println(doc.toString());
If I try to see the source code via Mozilla Firefox the values disapears too.
But If I use the firebug plugin I can see them.
Thank's for the help!
The website uses JavaScript to populate all of the values you are trying to parse. You will have to use a library that can compute the javascript within the page. Not sure if there is one though.
anyone else?
Htmlunit is a headless browser that renders Javascript and should be able to present this page correctly.

Categories