I need to crawl a site with Java, but part of its content is loaded via AJAX. Does anyone have experience retrieving that content?
Thank you!
You can use HtmlUnit. It is a headless browser.
For example, with HtmlUnit you can click a button on the page, wait for the content to be loaded via AJAX, and then grab it.
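Whichever headless browser you use, the "wait for content" step usually comes down to polling with a timeout. A minimal sketch using only the JDK; the class name `AjaxWait` and the `probe` supplier are made up for illustration, and the probe stands in for whatever query you run against the loaded page (for example, looking up an element by id):

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

public class AjaxWait {
    /**
     * Calls probe repeatedly until it returns a value or the timeout elapses.
     * Returns Optional.empty() on timeout or if the thread is interrupted.
     */
    public static <T> Optional<T> waitFor(Supplier<Optional<T>> probe,
                                          Duration timeout,
                                          Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = probe.get();
            if (result.isPresent()) {
                return result;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // gave up: content never appeared
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }
}
```

HtmlUnit also has its own `WebClient.waitForBackgroundJavaScript(timeoutMillis)`, which may be enough on its own; the helper above is only useful when you need to wait for one specific piece of content.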
The website I am using jsoup to parse loads incrementally. The data I am trying to access loads into the page after a couple of seconds but jsoup only gets everything that was loaded initially.
Is there a way to force jsoup to wait for the page to load completely before attempting to parse it or to build in a delay to allow the page to load completely?
jsoup can't handle this. It only parses the HTML it receives and does not execute JavaScript; the delay you see is most likely an AJAX call made after the initial page load, since you mention that the page loads incrementally.
An alternative is Selenium, which drives a real browser and handles such pages well.
I'm working on server-side HTML rendering.
The case: the user has a simple page with 3 cells. In each cell he can enter HTML, CSS and JS code. That code is then sent to the server, which renders the HTML and CSS, taking the JavaScript into account.
My idea was to "simulate" a headless browser. So far I have only found PhantomJS, but I don't find it very comfortable to use.
The result should be just the rendered HTML DOM.
Thank you.
Try headless Chrome, which works on all operating systems:
https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md
On Linux you have one more option: you can run any normal browser against a virtual screen buffer such as Xvfb.
Thank you for your response. As far as I can see, I have to use Node.js. Is there a way to stay in the Java environment without Node.js?
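One way to stay in Java is to start the headless Chrome binary as a subprocess and read its output. A minimal sketch, assuming a Chrome or Chromium executable is installed and that its `--headless` and `--dump-dom` flags print the serialized DOM to stdout; the class name and the binary path passed in are placeholders:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class HeadlessChromeDump {
    /** Builds the command line; chromeBinary is whatever the executable is called on your system. */
    public static List<String> command(String chromeBinary, String url) {
        return List.of(chromeBinary, "--headless", "--disable-gpu", "--dump-dom", url);
    }

    /** Runs Chrome headlessly and returns the DOM it printed after scripts have run. */
    public static String renderedDom(String chromeBinary, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(chromeBinary, url))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore browser log noise
                .start();
        // Read stdout fully before waiting, so the process cannot block on a full pipe.
        String dom = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Chrome exited with code " + exitCode);
        }
        return dom;
    }
}
```

For the three-cell case you could write the submitted HTML, CSS and JS into a temporary file and pass its `file://` URL. Since that code comes from users, it would be wise to run the browser in a restricted environment.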
I'm writing an Android app that parses a web page, filters the image links from it and loads them in a WebView.
It works fine for static pages, but I have no idea how to handle pages that dynamically add content as I scroll down, such as 9gag, imgur, Facebook etc.
Is there a solution for this? I guess the dynamic content is handled by JavaScript. Maybe there's a way to call this JavaScript code before parsing the page?
I'd appreciate any advice.
Thanks in advance.
You should try looking at the requests that dynamic pages make.
They typically use some form of dynamic pagination or a cursor.
Imgur, for example, issues requests with a URL like this:
https://imgur.com/gallery/hot/viral/page/4/hit?set=0
Here you specify the page, and the set is the portion of that page (normally they go up to 3).
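Once you know the pattern, you can generate the request URLs yourself instead of scrolling. A minimal sketch based on the URL above; the class and method names are made up, and treating the set as running from 0 up to an inclusive maximum is an assumption about how the site numbers them:

```java
import java.util.ArrayList;
import java.util.List;

public class GalleryPages {
    /** Builds the request URL for one page and one set, following the pattern shown above. */
    public static String pageUrl(int page, int set) {
        return "https://imgur.com/gallery/hot/viral/page/" + page + "/hit?set=" + set;
    }

    /** Lists the URLs for pages firstPage..lastPage, each with sets 0..maxSet (both inclusive). */
    public static List<String> urlsFor(int firstPage, int lastPage, int maxSet) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            for (int set = 0; set <= maxSet; set++) {
                urls.add(pageUrl(page, set));
            }
        }
        return urls;
    }
}
```

Each URL can then be fetched and parsed the same way you already handle static pages. The endpoint may have changed since this was written, so check the current requests in your browser's network tab first.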
I use Java. I want to get the source code of a web page, but JavaScript runs on the page and I want the code generated by that JavaScript (the code we see in Firebug in Firefox).
Does anyone know what I should do?
To inspect the page after modification by JavaScript, you need a client-side JavaScript engine that can run the scripts and then let you inspect the DOM.
HtmlUnit can do this - it is a "GUI-Less browser for Java programs".
See also this question
However, this won't give you the exact original page source, because that has already been parsed into a DOM by this point.
I think you want to see the source code of DOM Elements created after the page load via AJAX.
If that's what you want, the only way to see it is through a DOM inspector, like Firebug in Firefox or Developer Tools in Chrome.
Going to "View source code" only shows the source at load-time.
If I understand your question, yes, your JavaScript objects can be passed back to your Java backend, either by creating an HTML <form> element with input elements, filling them with your values and submitting the form, or asynchronously via AJAX/JSON (which doesn't require reloading your web page). For both methods you need to configure an endpoint on your Java side to receive the submitted data and return some kind of confirmation to the client, i.e. your JavaScript. I would recommend googling "jQuery.post" for the JavaScript side and finding some examples for your Java backend.
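For the Java side, here is a minimal sketch of such an endpoint using the JDK's built-in `com.sun.net.httpserver.HttpServer`. The `/submit` path, the class name and the confirmation text are made up for illustration; in a real project a servlet or a framework controller would more likely play this role:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class SubmitEndpoint {
    /** Builds the confirmation sent back to the client for a submitted body. */
    public static String confirmationFor(String body) {
        return "received " + body.length() + " chars";
    }

    /** Starts a server on the given port (0 picks a free one) that accepts POSTs at /submit. */
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/submit", exchange -> {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                exchange.close();
                return;
            }
            // Read the form data or JSON the JavaScript sent.
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            byte[] reply = confirmationFor(body).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(reply);
            }
        });
        server.start();
        return server;
    }
}
```

Calling `SubmitEndpoint.start(8080)` would let the page post to `http://localhost:8080/submit`; the body arrives as a raw string, so parsing form fields or JSON is still up to you.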
I am working with Liferay and I need to show a preview of the HTML output of a URL as an embedded window in a JSP view. I am assessing different possibilities:
Store the interface to preview as a screenshot image and show it as an embedded image. The good thing is that the formatting would be exactly the same.
Parse the URL output stream with a BufferedReader and strip all html, scripting and body tags with indexOf. Embed images as cid:
Some kind of include, jsp:include or liferay-util:include, from the direct URL of the downloaded temporary HTML output
Any jQuery AJAX $().html() kind of solution
Any HTML-level solution using iframe, applet, frame, appletwindow or whatever, if one exists
Which do you think is the best or recommended way, i.e. the simplest, most reliable and most exact-looking? Any code or reference?
And what if I had to send it as the content of a JavaMail Message to an email address?
Thank you!!
This should probably be done client-side, in JavaScript, or even with plain iframes. Either put the iframes in the page directly, or have JavaScript code that generates the iframes, and point the iframes at the URL to be previewed. Keep it simple.