Parsing modern web pages (JavaScript/HTML5/JSON) using Java

I used to have a tool that parsed the Yahoo Finance web pages using Jsoup.
Recently Yahoo changed the layout of their pages, and now each page is full of JavaScript and what looks like JSON data.
Please see example here:
http://finance.yahoo.com/quote/AAPL/financials?ltr=1
Inspecting the page in Chrome shows a different view (after the JavaScript has executed and the DOM has been created) than what the document looks like in Jsoup:
Document d = Jsoup.connect(link).get(); // link same as above
Element body = d.body();
This returns an Element (body) containing a huge document that looks like:
<div class="footer Py(10px) Ta(c) Bgc(#fff) Py(0) BdT Bdc($lightGray)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer">
<div class="Fz(s) Py(20px) " data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0">
<div class="Pb(10px) D(b)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0">
<a class="Mend(10px)" href="http://help.yahoo.com/kb/index?page=content&y=PROD_FIN&locale=en-US&id=SLN2310&pir=Zm7qO7BibUkC.4dK5GxJ95B3DCru2iA5odBNM0pj" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0.0">
Any idea how I can parse this type of document in Java? I suspect I need to run it through a JavaScript engine first and then parse the outcome, or maybe there is another way.
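One way to attempt this is to render the page in a headless browser such as HtmlUnit, which has its own JavaScript engine, and then hand the rendered markup back to Jsoup. Below is a minimal, untested sketch (assuming the classic com.gargoylesoftware.htmlunit packages of HtmlUnit 2.x; the 10-second wait is an arbitrary guess, and whether HtmlUnit can execute Yahoo's React scripts cleanly is an assumption, not a guarantee):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class YahooFinanceFetch {
    public static void main(String[] args) throws Exception {
        String link = "http://finance.yahoo.com/quote/AAPL/financials?ltr=1";
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Let HtmlUnit run the page's JavaScript; skip CSS and don't abort on script errors.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(link);
            // Give background scripts (the React rendering) some time to finish.
            webClient.waitForBackgroundJavaScript(10_000);
            // Hand the rendered DOM to Jsoup and query it as before.
            Document d = Jsoup.parse(page.asXml());
            Element body = d.body();
            System.out.println(body.text());
        }
    }
}

If HtmlUnit cannot run the scripts to completion, the usual fallback is a real browser driven by Selenium, as discussed in the related questions below.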

Related

Reading HTML using jsoup

So I am trying to get an HTML element from a website using Jsoup, but the HTML that I get from Jsoup.connect(url) is not complete compared to the one that I get using the inspector on the website.
EDIT: this is the link I'm working with: https://www.facebook.com/livemap##35.831640894,24.82275312499999,2z
The numbers at the end designate the coordinates of the map, and you don't have to sign in to access the page, so there is no authentication problem.
UPDATE:
I have found that the element that I want does not get expanded using Jsoup. Is this a problem related to slow page loading? If so, how can I make sure that Jsoup.connect(url) fully loads the webpage before fetching the HTML?
From the inspector: the <div id="u_0_e"> is expanded.
From Jsoup.connect: the <div id="u_0_e"> is not expanded.
Jsoup doesn't execute JavaScript or jQuery events, so you get the initial page as it is before any JavaScript has run.
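If you need the expanded element, one option is to let a real browser build the DOM and only then hand the markup to Jsoup. A rough sketch with Selenium 4 and headless Chrome follows; the wait condition, the 30-second timeout, and the assumption that div#u_0_e appears once the map scripts finish are guesses based on the question, not something tested against the live page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;

public class LiveMapFetch {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://www.facebook.com/livemap##35.831640894,24.82275312499999,2z");
            // Wait until the element the question is after has been added to the DOM.
            new WebDriverWait(driver, Duration.ofSeconds(30))
                    .until(ExpectedConditions.presenceOfElementLocated(By.id("u_0_e")));
            // The page source now contains the JavaScript-generated markup,
            // so Jsoup can parse it as usual.
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.getElementById("u_0_e"));
        } finally {
            driver.quit();
        }
    }
}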

Parsing of a dynamic web page using HtmlUnit in Java is not working

Image explaining the data to be extracted
I'm trying to extract data from a web page (marked red in the image) using the HtmlUnit library in Java, but I can't get that particular value.
WebClient webClient = new WebClient(BrowserVersion.CHROME);
Thread.sleep(5000);
HtmlPage page = webClient.getPage("https://earth.nullschool.net/#current/wind/isobaric/500hPa/orthographic=-283.71,14.19,2183/loc=76.850,11.440");
Thread.sleep(5000);
System.out.println(page.asXml());
I checked the html which I got on console window. It doesn't contain the value.
<p>
<span id="location-wind" class="location">
</span>
<span id="location-wind-units" class="location text-button">
</span>
</p>
It's because these are filled in via JavaScript. When you load the page, these fields are initially empty. You can check this by looking at the source code and searching for id="location.
The page makes two additional HTTP requests to fetch dynamic data:
https://earth.nullschool.net/data/earth-topo.json?v3
https://gaia.nullschool.net/data/gfs/current/current-wind-isobaric-500hPa-gfs-0.5.epak
Somewhere in this data (and combined they are around 1.2 MB) is the data that you're looking for. Your best bet is to use a tool (perhaps an online one) to convert the JSON to a Java object, or to study the JSON and write code to get the specific data that you're after.
That is, if that data is in the JSON, which I'm not convinced about. The EPAK file appears to be some sort of binary data with embedded JSON, but I couldn't figure out if the data is perhaps in there.
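For the JSON part, a library such as Jackson can read the response into a tree you can explore. The sketch below is only a starting point: the structure of the file is not known from the question, so it just fetches https://earth.nullschool.net/data/earth-topo.json?v3 and dumps the top-level field names for you to inspect:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URL;

public class EarthTopoDump {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Fetch and parse the JSON file that the page requests in the background.
        JsonNode root = mapper.readTree(
                new URL("https://earth.nullschool.net/data/earth-topo.json?v3"));
        // Print the top-level field names so you can see what the file contains
        // and decide where (or whether) the value you want is stored.
        root.fieldNames().forEachRemaining(System.out::println);
    }
}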
Another approach is to use Selenium, have it parse the page for you, and retrieve the data from there.
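A rough outline of that Selenium approach, assuming the value eventually appears as the text of the span with id="location-wind" shown above (Selenium 4 API; the 60-second timeout is arbitrary and this has not been run against the site):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;

public class NullschoolWind {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://earth.nullschool.net/#current/wind/isobaric/500hPa/"
                    + "orthographic=-283.71,14.19,2183/loc=76.850,11.440");
            // Poll until the page's JavaScript has written something into the span.
            String wind = new WebDriverWait(driver, Duration.ofSeconds(60))
                    .until(d -> {
                        String text = d.findElement(By.id("location-wind")).getText();
                        return text.isEmpty() ? null : text;
                    });
            System.out.println("Wind: " + wind);
        } finally {
            driver.quit();
        }
    }
}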

Webdriver Automation - Unable to find element using xpath (in Octane 2.0 Benchmark page)

I am trying to parse the Octane benchmark page http://octane-benchmark.googlecode.com/svn/latest/index.html, which contains these elements:
<div class="hero-unit" id="inside-anchor">
<h1 id="main-banner" align="center">Start Octane 2.0</h1>
<div id="bar-appendix"></div>
</div>
I've started Selenium WebDriver on my tablet device (using Java, Eclipse, Selendroid)
SelendroidConfiguration config = new SelendroidConfiguration();
selendroidServer = new SelendroidLauncher(config);
selendroidServer.launchSelendroid();
DesiredCapabilities caps = SelendroidCapabilities.android();
driver = new SelendroidDriver(caps);
and I've initialized driver with Octane page:
driver.get("http://octane-benchmark.googlecode.com/svn/latest/index.html");
I am trying to parse it with xpath:
String xpathString = "//div[@class='hero-unit']//h1";
String line = driver.findElement(By.xpath(xpathString)).getText();
System.out.println(line);
but Java throws a NullPointerException (on that line): findElement() cannot find anything on this .html page.
The driver starts fine and returns the appropriate value for getCurrentUrl(), but it cannot return getPageSource() and cannot return any value for findElement(By.something...).
It looks like this Octane page has something that blocks every search request during parsing. I have parsed 7 other benchmark pages in the same way and they worked well, but this Octane page... acts just like it is "empty" to WebDriver...
I don't know if it is because of the
<script type="text/javascript">
part, or something else.
Is there something special about this Octane benchmark page?
Thanks...
xPath() works with sites that conform to the XML standard. HTML is more forgiving; you can have missing end tags and other errors, but in XML this is forbidden. So chances were that the HTML did not conform to the XML standard, and I double-checked by validating your link at this site:
http://www.w3schools.com/xml/xml_validator.asp
And guess what? It had some errors. You can save yourself the trouble next time by validating on this site first. Of course, that doesn't mean that all XML-conforming sites are suitable for xPath() web scraping (hidden elements, JavaScript, etc.). However, from the nature of the reported error you might be able to tell which are not.
By.xpath() only works if the HTML page conforms to XML standards. Probably the Octane 2.0 page does not comply, and hence the method returns null.
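As an alternative not taken from the answers above: if the markup cannot be made XML-clean, a CSS selector sidesteps the XPath parser entirely. Based on the hero-unit markup quoted in the question, a helper along these lines might work (hypothetical class and method names; whether the element exists at lookup time still depends on the page's JavaScript):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

public class OctaneBanner {
    // Locate the banner heading via CSS instead of XPath.
    // "div.hero-unit h1#main-banner" matches the markup shown in the question.
    static String readBanner(WebDriver driver) {
        return driver.findElement(By.cssSelector("div.hero-unit h1#main-banner")).getText();
    }
}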

Retrieving a webpage that requires loading time

I'm using Jsoup to parse the content from a website. The problem is that some data on the page takes a couple of seconds to load. For this reason, my program only gets the loading graphic rather than the loaded data. Here is what I got:
<div class="sidebar_section">
<h3>Counsel</h3>
<ul style="display:none;" id="counsel">
<li>Loading <img src="/members/images/ajax-loader3.gif" /></li>
</ul>
</div>
If I open this URL in a browser, I can actually see the contents of this block rather than the word "Loading".
I was wondering if there is any way to get the content after the page has fully loaded. Here is my simple code:
Document doc = Jsoup.connect(url).get();
Any help is really really appreciated.
HttpURLConnection may be a better method for grabbing a web page, as it gives more control and error handling; plus you can get the MIME type and character encoding.
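A minimal sketch of that idea (the URL is a placeholder and the UTF-8 charset is an assumption; note that, like Jsoup, a plain HttpURLConnection only downloads the initial HTML, so by itself it will not replace the "Loading" placeholder with the final data):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RawPageFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Status code and Content-Type header (MIME type plus charset) are
        // available before reading the body - the extra control the answer mentions.
        System.out.println("Status: " + conn.getResponseCode());
        System.out.println("Content-Type: " + conn.getContentType());
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        System.out.println(html);
    }
}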

How to get an HTML tag after rendering the HTML on a webpage using Java, JavaScript, or XSLT

How do I get the HTML source code that was rendered by JavaScript in a webpage? How can I proceed, using XSL, JavaScript, or Java?
Get the entire HTML of the current page:
function getHTML() {
  var D = document,
      h = D.getElementsByTagName('html')[0],
      e;
  // Most browsers can serialize the root element directly.
  if (h.outerHTML) return h.outerHTML;
  // Fallback: clone the <html> node into a detached <div>
  // and read that div's innerHTML instead.
  e = D.createElement('div');
  e.appendChild(h.cloneNode(true));
  return e.innerHTML;
}
outerHTML is a non-standard property and thus might not be supported in some browsers (e.g., Firefox); in that case this function mimics the outerHTML feature by cloning the html node into an unattached element and reading its innerHTML property.
JavaScript provides
document.getElementsByTagName('')
You can get any tag from this call. Moreover, if you want to perform an operation on a particular tag, assign an id to that tag; then you can use document.getElementById('') to operate on it.
These will give you the source code.
