I need to get the contents of a URL as a String, but the page this URL points to has JavaScript that runs on page load and manipulates the DOM. How can I retrieve the HTML with the JavaScript DOM manipulation included? Is something like Selenium the right option? If so, how would I do that?
Try this:
Use the pause command to give the page's scripts time to run.
Then call driver.getPageSource().
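A fixed pause is fragile, since the script may take longer than expected. The same idea can be written as polling until a condition holds, which is also the pattern behind Selenium's own WebDriverWait. Below is a minimal, library-independent sketch of that pattern. The class name PollingWait and the method waitFor are hypothetical, and the supplier stands in for a call such as driver.getPageSource() (an assumption, since no driver is set up here).

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class PollingWait {

    /**
     * Calls source repeatedly until ready accepts its value or the timeout elapses.
     * Returns the last value fetched, whether or not it passed the check.
     */
    static String waitFor(Supplier<String> source, Predicate<String> ready,
                          Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        String value = source.get();
        while (!ready.test(value) && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                break;
            }
            value = source.get();
        }
        return value;
    }

    public static void main(String[] args) {
        // Stand-in for a page whose script fills in the DOM by the third read.
        int[] reads = {0};
        Supplier<String> fakePage = () ->
                ++reads[0] < 3 ? "<div id=\"out\"></div>" : "<div id=\"out\">done</div>";

        String html = waitFor(fakePage, s -> s.contains("done"),
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(html);
    }
}
```

With a real driver, the predicate would check for whatever marker the page's script is expected to add, and the caller should still verify the returned HTML in case the timeout was hit first.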
Related
I need to pause my script before parsing (I want to wait for some information to load), but how can I do this in Jsoup?
I tried this:
link = Jsoup.connect("link").wait(100).get();
But this doesn't work for me.
Usually the need for waiting arises when content is loaded via AJAX. Jsoup can't deal with such stuff, because it is not a browser. Jsoup simply interprets HTML. The connection stuff is more or less only a wrapper around Java connections.
I guess you need to either identify the AJAX calls and request their contents directly, or use a real browser, i.e. Selenium with PhantomJS or similar.
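The first option, calling the AJAX endpoint directly, might look roughly like the sketch below. It uses only the JDK's java.net.http client (Java 11+). The endpoint URL is a placeholder, the class name AjaxFetch is hypothetical, and the two headers are assumptions about what the server expects; the real URL and headers would come from the browser's network tab.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class AjaxFetch {

    /** Builds a GET request resembling what the page's script would send. */
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                // Assumption: some servers use this header to recognise XHR-style requests.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body. */
    static String fetch(String endpoint) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; replace with the URL seen in the browser's network tab.
        System.out.println(fetch("https://example.com/api/data.json"));
    }
}
```

If the endpoint returns an HTML fragment rather than JSON, the returned string can then be handed to Jsoup for parsing.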
I am making a new browser and I need to get the DOM of a webpage to do it. How do I do this?
You will have to write an HTML parser or use an existing one such as http://jsoup.org/
I am working with Liferay and I need to show a preview of the HTML output of a URL as an embedded window in a JSP view. I am assessing different possibilities.
Somehow store the interface to preview as a screenshot and show it as an embedded image. The advantage is that the formatting would be exactly the same.
Parse the URL's output stream with a BufferedReader and strip all html, scripting, and body tags with indexOf. Embed images as cid:
Some kind of include, jsp:include or liferay-util:include, from the direct URL of the downloaded temporary HTML output
Any jQuery AJAX $().html() kind of solution
Any HTML-level solution in iframe, applet, frame, appletwindow or whatever, if it exists
What do you think is the best or recommended way: the simplest, most reliable, and most exact-looking? Any code or reference?
And what if I had to send it as the content of a JavaMail Message to an email address?
Thank you!!
This should probably be done client-side, in JavaScript, or even via iframes. Either put the iframes in the page directly, or have JavaScript code that generates the iframes, and point the iframes at the URL to be previewed. Keep it simple.
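If the iframe markup is emitted from the JSP side, a small helper along these lines could build it. This is only a sketch: the class PreviewFrame and its methods are hypothetical names, and the escaping covers just the characters that matter inside a double-quoted attribute. Note that a target site can refuse to be framed (for example via an X-Frame-Options header), in which case the preview stays blank.

```java
final class PreviewFrame {

    /** Escapes the characters that are significant inside a double-quoted HTML attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Returns an iframe element pointing at the URL to be previewed. */
    static String iframeFor(String url, int width, int height) {
        return "<iframe src=\"" + escapeAttribute(url) + "\" width=\"" + width
                + "\" height=\"" + height + "\"></iframe>";
    }

    public static void main(String[] args) {
        // Placeholder URL for illustration.
        System.out.println(iframeFor("https://example.com/page?a=1&b=2", 800, 600));
    }
}
```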
I am using Jsoup to parse a web page, this one: https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true
I can see content on that page in the browser, but when I try to parse it with Jsoup
Document doc = Jsoup.parse("https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true");
System.out.println(doc);
It will return
<html>
<head></head>
<body>
https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true
</body>
</html>
which is not the full HTML.
Any suggestions on how I can solve this, or why it is happening?
One possibility is that they're detecting a crawler, usually via your user agent, and sending different content. Try setting your user agent string to a standard browser's string, and see if that resolves the issue you're having. Note, though, that Jsoup.parse(String) treats its argument as HTML markup rather than as an address to fetch, which would explain the URL showing up as the body text; Jsoup.connect(url).get() is the call that actually downloads the page.
One other potential problem, though I don't think it's the issue here, is that data loaded in via AJAX will not be downloaded by Jsoup. It parses the HTML that gets served up, but it doesn't execute the JavaScript, so it can't get any extra content that comes in later. You might be able to resolve that issue using something like PhantomJS, which can process and render HTML, CSS, and JavaScript, and would (in theory) give you the actual HTML you end up seeing in your browser.
I need a solution for getting HTML content as a browser would produce it. When a page is rendered in a browser, its JavaScript runs; otherwise it doesn't. So HTML libraries like lxml, BeautifulSoup, and others won't work.
I found a project named pywebkitgtk, but its purpose is to create a browser with a front end.
Is there any way to put a URL into a "fake browser", render it, run all of its JavaScript, and save the result to an HTML file? I don't need any front end; a back end alone is fine.
I need to use Python or Java to do that.
selenium-rc lets you drive an actual browser for your purpose, under the control of any of several languages of your choice, including both Python and Java. Check it out!
For a detailed example of use with Python, see here.