Retrieving a webpage that requires loading time - java

I'm using Jsoup to parse the content of a website. The problem is that some of the data on the page takes a couple of seconds to load, so my program only gets the loading graphic rather than the loaded data. Here is what I got:
<div class="sidebar_section">
<h3>Counsel</h3>
<ul style="display:none;" id="counsel">
<li>Loading <img src="/members/images/ajax-loader3.gif" /></li>
</ul>
</div>
If I open this URL in a browser, I can see the actual contents of this block rather than the word "Loading".
I was wondering if there is any way to get the content after the page is fully loaded. Here is my simple code:
Document doc = Jsoup.connect(url).get();
Any help is really appreciated.

HttpURLConnection may be a better method for grabbing a web page as it gives more control and error handling, plus you can get the MIME type and character encoding.
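A minimal sketch of that suggestion, using only the JDK. The class name PageFetcher, the 10-second timeouts, and the UTF-8 fallback are assumptions made for illustration. Note that, like Jsoup, HttpURLConnection only returns the HTML as the server sends it; it does not run the page's scripts, so content filled in later by AJAX will still be missing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class PageFetcher {

    // Extracts the charset from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption).
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Fetches the raw HTML exactly as served (no JavaScript is executed).
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10_000); // hypothetical timeout values
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            // The MIME type and encoding both come from the Content-Type header.
            Charset charset = charsetOf(conn.getContentType());
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), charset);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned string can then be handed to Jsoup.parse(html) if you still want Jsoup's selectors.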

Related

Reading HTML using jsoup

I am trying to get an HTML element from a website using Jsoup, but the HTML that I get from Jsoup.connect(url) is incomplete compared to what I see in the browser's inspector on the website.
EDIT: this is the link I'm working with: https://www.facebook.com/livemap##35.831640894,24.82275312499999,2z
The numbers at the end designate the coordinates of the map, and you don't have to sign in to access the page, so there is no authentication problem.
UPDATE:
I have found that the element I want does not get expanded when using Jsoup. Is this a problem related to slow page loading? If so, how can I make sure that Jsoup.connect(url) fully loads the webpage before fetching the HTML?
from inspector (the <div id="u_0_e"> is expanded)
from jsoup.connect (the <div id="u_0_e"> is not expanded)
Jsoup doesn't execute JavaScript or jQuery events, so you get the initial page as the server sent it, before any JavaScript has run.

Parsing modern web pages (javascript/html5/json) using java

I used to have a tool that parsed Yahoo Finance web pages using Jsoup.
Recently Yahoo changed the layout of their pages, and now the page is full of JavaScript and what looks like JSON data.
Please see example here:
http://finance.yahoo.com/quote/AAPL/financials?ltr=1
Inspecting the page in Chrome shows a different view (after the JavaScript has executed and the DOM has been built) than the document I get from Jsoup in Java:
Document d = Jsoup.connect(link).get();// link same as above
Element body = d.body();
This returns an Element (body) containing a huge document that looks like:
<div class="footer Py(10px) Ta(c) Bgc(#fff) Py(0) BdT Bdc($lightGray)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer">
<div class="Fz(s) Py(20px) " data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0">
<div class="Pb(10px) D(b)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0">
<a class="Mend(10px)" href="http://help.yahoo.com/kb/index?page=content&y=PROD_FIN&locale=en-US&id=SLN2310&pir=Zm7qO7BibUkC.4dK5GxJ95B3DCru2iA5odBNM0pj" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0.0">
Any idea how I can parse this type of document in Java? I suspect I need to run it through a JavaScript engine first and then parse the outcome, or maybe there is another way.
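One possible "other way", sketched under assumptions: when the data is embedded in the page as a JSON object literal inside an inline script, it can sometimes be cut out of the raw HTML and handed to a JSON parser, with no JavaScript engine involved. The class name EmbeddedJson and the marker string used in the usage note are hypothetical; the real page has to be inspected to find where (and whether) such a literal appears.

```java
final class EmbeddedJson {

    // Returns the JSON object literal that follows `marker` in `html`,
    // or null if the marker or a balanced object is not found.
    static String extractObject(String html, String marker) {
        int at = html.indexOf(marker);
        if (at < 0) {
            return null;
        }
        int start = html.indexOf('{', at + marker.length());
        if (start < 0) {
            return null;
        }
        int depth = 0;
        boolean inString = false; // inside a double-quoted JSON string
        boolean escaped = false;  // previous character was a backslash
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return html.substring(start, i + 1);
                }
            }
        }
        return null; // braces never balanced
    }
}
```

Usage would look like EmbeddedJson.extractObject(doc.html(), "window.__DATA__ ="), where the marker is whatever assignment precedes the literal on the actual page. Braces inside string values are skipped, so the scan stops at the end of the outermost object. If the data is instead fetched by a later AJAX request, this will not help, and a headless browser or the underlying data endpoint is needed.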

js to determine if iframe contains images

I am working on an application that allows users to enroll in my program. At the end of enrollment I generate a PDF for them to look over, accept the terms, and e-sign. Sometimes the PDF server fails to stream, and when that happens the iFrame just contains the alt text for the images. Is there a way to look into the iFrame and see whether the images of the PDF are there or only the alt text? That way I can keep users from proceeding and display an error message.
One JSP looks like this:
<c:forEach items="${images}" var="src">
<img src="${src}" alt="Image" />
</c:forEach>
This JSP calls a generate function, which creates the PDF, converts its pages into images, and saves them to a remote server. The controller then returns the first JSP as the view, which should populate the iFrame.
<div id="image">
<img id="loading" src="/blah/resources/images/loading.gif" />
<iframe style="width: 775px; height: 600px; display: none"
src="blah/blah/pdf/generateImages?product=<c:out value="${fn:toLowerCase(enrollmentConversation.product.textKey)}" />&state=<c:out value="${stateCodeAbbreviation}" />&pdfGuid=<c:out value="${pdfGUIDForLookup}" />&sizeType=775/p2"
id="pdfIframe"
onLoad="jQuery('#pdfIframe').show();
jQuery('#loading').hide();
jQuery('.hideWhileWaiting').show();">
</iframe>
</div>
So is there a way to look at the iFrame and tell whether it contains the images or only the alt text ("Image")?
Your iFrame may point to another application on a different domain (or to the same application on the same domain).
When you create the PDF and convert it into images, I suggest writing a SUCCESS/FAILURE entry to the database.
Then, from the calling application, an AJAX call that queries the database can easily tell you whether the PDF-to-image conversion succeeded.
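A rough server-side sketch of that status record, assuming the pdfGuid from the iFrame URL is used as the key. The class PdfStatusRegistry and its method names are hypothetical, and the in-memory map only stands in for the database table the answer suggests.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PdfStatusRegistry {

    enum Status { UNKNOWN, PENDING, SUCCESS, FAILURE }

    // Stand-in for a database table keyed by the PDF GUID.
    private final Map<String, Status> byGuid = new ConcurrentHashMap<>();

    // Called when image generation starts for a given PDF.
    void markStarted(String pdfGuid) {
        byGuid.put(pdfGuid, Status.PENDING);
    }

    // Called when generation ends, whether it worked or not.
    void markFinished(String pdfGuid, boolean succeeded) {
        byGuid.put(pdfGuid, succeeded ? Status.SUCCESS : Status.FAILURE);
    }

    // What the AJAX status endpoint would return for this GUID.
    Status statusOf(String pdfGuid) {
        return byGuid.getOrDefault(pdfGuid, Status.UNKNOWN);
    }
}
```

The page would poll a small endpoint backed by statusOf and only enable the "proceed" step once it reports SUCCESS, showing the error message on FAILURE.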

How can I use JavaScript to swap out an h2 URL destination with limited access to the HTML?

I don't have access to my HTML code, but I do have access to JavaScript in the footer of my document. With that being said, I would like to switch out the URL "/visitor_signup" with a new URL of my choosing, let's say "http://www.example.com/account_signup".
I would also like to do the same for "/user_signup", let's say swapping it to "http://www.example.com/master_signup".
I have to use JavaScript to do so and I don't have any understanding of JS.
How do I make this work with JS code?
My code
<div class="grid_12">
<div id="login">
<div class="panel" id="login-form">
<div id="login-promo">
<div class="clear"></div>
<h2><a href="/visitor_signup">Visitor Sign-Up ></a></h2>
<h2><a href="/user_signup">User Sign-Up ></a></h2>
</div>
</div>
</div>
</div>
</div>
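You mean something like this?
// Collect every <a> element on the page and rewrite the matching hrefs.
var anchors = document.body.getElementsByTagName("a");
for (var i = 0; i < anchors.length; i++) {
    var anc = anchors[i];
    // getAttribute returns the href exactly as it is written in the HTML.
    if (anc.getAttribute("href") == "/visitor_signup") {
        anc.setAttribute("href", "http://www.example.com/account_signup");
    } else if (anc.getAttribute("href") == "/user_signup") {
        anc.setAttribute("href", "http://www.example.com/master_signup");
    }
}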
WARNING: due to the way browsers render HTML (parsing the page, semi-sequentially fetching referenced resources, evaluating JavaScript along the way), someone might see the HTML before your script gets executed, and even click the '/visitor_signup' link.
Under your limitations, especially:
No access to the HTML code
No id attribute on the elements
your best bet is to:
use document.body.getElementsByTagName("a") to find all <a> tags
check the href property on each of them
change it accordingly
EDIT: This is exactly what @milan's answer does, so please disregard this one.
Since you can't edit the HTML and the <h2>s aren't differentiated, using jQuery might be easier than using plain JS in order to reach the elements.
This jQuery could be:
$('#login-promo h2:first a').attr("href", "/account_signup").parent().next().find('a').attr("href", "/master_signup");
Here we select the <a> inside the first <h2> and change its href. Then we go back to the <a>'s parent, find the <a> in the next <h2>, and change its href too.
You can check an example in this jsfiddle.

GWT - easiest way to do a simple loading screen until file is loaded

When clicking a button, my GWT application returns a PDF file embedded in an HTML page which looks something like:
<html><head></head>
<body marginwidth="0" marginheight="0" bgcolor="rgb(38,38,38)">
<embed width="100%" height="100%" name="plugin"
src="http://myserver/?cmd=getMyPdf" type="application/pdf">
</body>
</html>
Problem is it can take a while for the server to create this PDF file, so what I want is a waiting screen with a loading animation which can have the PDF file download in the background, and then when the file is done, display the page as described above.
One obvious way would be to display a loading page, send an asynchronous command to the server, and then, once the onSuccess method is called, load the page as normal. The downside is that I'd have to add some server-side logic to make the PDF creation work in the background...
Is there any way to do this client-side with the GWT API?
Did you see the Stack Overflow question "Detect when browser receives file download"? Basically, the answer given there is that you set a cookie in the response and wait on the client side for that cookie to appear. This can be done easily with GWT, as it has a Scheduler (for the repeated timer check) and easy access to Cookies. You still need to make some server changes, but you don't have to create a background process.
I don't have the full answer, but the following code works for me in Safari, and maybe you can modify it, to make it work with other browsers, too (?):
<html><head>
<script type="text/javascript">
function showPdf() {
document.getElementById("loading").style.visibility = "hidden";
document.getElementById("pdf").style.visibility = "visible";
}
</script>
</head>
<body marginwidth="0" marginheight="0" bgcolor="rgb(38,38,38)">
<div id="loading"
style="position: absolute; background-color: white;">Loading...</div>
<iframe id="pdf" width="100%" height="100%" name="plugin"
src="http://myserver/?cmd=getMyPdf" onload="javascript:showPdf();"
style="visibility: hidden;"></iframe>
</body>
</html>
This is pure JavaScript, but it could certainly be done with GWT, too. Note that I'm using an iframe instead of embed, because embed doesn't really support the onload event (and embed was not a standard HTML element before HTML5, as far as I remember).
The reason why this may not be the full answer is that Chrome fires the onload event as soon as the PDF starts downloading (but after the PDF generation on the server side has finished). I'm not sure if this is what you want.
