I'm writing an Android app that parses a web page, filters the image links from it, and loads them in a WebView.
It works fine for static pages, but I have no idea how to handle pages that dynamically add content as I scroll down, such as 9GAG, Imgur, Facebook, etc.
Is there a solution for this? I guess the dynamic content is handled by JavaScript. Maybe there's a way to call this JavaScript code before parsing the page?
I'd appreciate any advice.
Thanks in advance.
You should try looking at the requests that dynamic pages make.
All of them use a pattern of dynamic pagination, or a cursor.
Imgur, for example, issues requests with a URL like this:
https://imgur.com/gallery/hot/viral/page/4/hit?set=0
where you specify the page number, and the set is the portion of the page (normally they go up to 3).
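As a minimal sketch of that idea in Java with Jsoup: the loop below walks the paginated URL pattern above and collects the image links from each chunk. The page/set bounds and the img[src] selector are assumptions for illustration, not Imgur's documented API.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class PagedScraper {
    public static void main(String[] args) throws Exception {
        List<String> imageLinks = new ArrayList<>();
        for (int page = 0; page < 5; page++) {       // crawl the first few pages
            for (int set = 0; set <= 3; set++) {     // sets normally go up to 3
                String url = "https://imgur.com/gallery/hot/viral/page/" + page + "/hit?set=" + set;
                Document doc = Jsoup.connect(url).get();
                for (Element img : doc.select("img[src]")) {
                    imageLinks.add(img.absUrl("src")); // resolve relative URLs to absolute ones
                }
            }
        }
        imageLinks.forEach(System.out::println);
    }
}

Each request then behaves like one "scroll step" of the infinite page, which you can parse exactly as you do for static pages.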
I'm trying to use Jsoup to gather wave height information from Surfline.com. I have the element I want in the screenshot, and it's showing in the dev tools. When I scrape the site with Jsoup, the returned string includes everything seen in the dev tools except the "1-2ft", which is what I need. The site is JavaScript-heavy, and I'm assuming that Jsoup is grabbing the HTML before the JavaScript actually runs (I have no clue, really). Do I need to specifically tell Jsoup to wait for the page load, or am I missing some other critical component?
This is the code I'm using:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
and this is the output I'm seeing in my IDE:
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
It seems really odd that the contents of the div wouldn't be returned with it. This is my first time using Jsoup, and I tried to read through the docs as best I could, but nothing seemed to touch on this particular issue. Any insight would be awesome and greatly appreciated.
What you see in the browser is not necessarily what you would get when downloading the page by URL with your HTTP library of choice. In fact, you should never expect them to be the same. In the modern web, pages are quite dynamic and are loaded asynchronously, involving multiple API calls to different resource providers and JavaScript being executed in the browser (which has a JavaScript engine).
What you get with Jsoup in this case is the initial HTML that the browser starts to form the page with. Then there is a set of XHR calls to the Surfline API that bring the data into the browser, which then dynamically fills in different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to a browser automation tool called Selenium, which fires up a real browser. You can then wait for the current-surf-range element to have a value and, if you wish to continue with Jsoup, get the page source and feed it to Jsoup for further parsing.
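A minimal Selenium sketch of that approach, assuming a Chrome driver is on your path; the div id comes from the question, while the timeout and wait condition are illustrative:

import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SurflineScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // fires up a real browser
        try {
            driver.get("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/");
            // Wait until the XHR calls have filled in the surf range element.
            new WebDriverWait(driver, Duration.ofSeconds(15)).until(
                d -> !d.findElement(By.id("current-surf-range")).getText().isEmpty());
            // Hand the rendered HTML back to Jsoup for the rest of the parsing.
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.select("div[id=current-surf-range]").text());
        } finally {
            driver.quit();
        }
    }
}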
Another approach would be to look into the requests that the page makes in the browser developer tools and then simulate those requests in your code, parsing the JSON responses and extracting the surf forecast data.
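A sketch of that second approach, using Jsoup purely as an HTTP client. The endpoint below is a placeholder, since the real API URL has to be copied from the Network tab of the developer tools:

import org.jsoup.Jsoup;

public class ApiScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: substitute the actual XHR endpoint observed in the Network tab.
        String apiUrl = "https://api.example.com/forecast?spotId=5294";
        String json = Jsoup.connect(apiUrl)
                .ignoreContentType(true) // Jsoup rejects non-HTML responses unless told otherwise
                .execute()
                .body();
        // Parse the JSON with your library of choice (e.g. Jackson or Gson)
        // and extract the surf range fields from it.
        System.out.println(json);
    }
}

This avoids running a browser at all, but it breaks whenever the site changes its internal API.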
I need to crawl a site in Java, but a section of its content is loaded by AJAX. Does anyone have experience with how to get that content?
Thank you!
You can use HtmlUnit. It is a headless browser.
For example, with HtmlUnit you can press a button on the page, wait for the content to get loaded by AJAX, and grab it.
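A hedged sketch of that flow; the page URL, the button id, and the timeout are made up for illustration:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxCrawler {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false); // tolerate noisy site JS
            HtmlPage page = webClient.getPage("https://example.com/page-with-ajax");
            // Hypothetical button that triggers the AJAX load.
            HtmlButton loadMore = page.getFirstByXPath("//button[@id='load-more']");
            if (loadMore != null) {
                page = loadMore.click();
            }
            // Give the background AJAX calls up to 10 seconds to finish.
            webClient.waitForBackgroundJavaScript(10_000);
            System.out.println(page.asXml()); // the DOM after the AJAX content arrived
        }
    }
}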
I fetch the website using Jsoup. Here is the link to the page:
http://www.yelp.com/search?find_desc=restaurants&find_loc=westmont%2C+il&ns=1&ls=43131f934bb3adf3#find_loc=Hinsdale,+IL&l=p:IL:Hinsdale::&sortby=rating&unfold=1
Now I'm trying to extract the number of sub-pages on the site, for example the numbers next to "Go to Page" as shown in the picture below:
Unfortunately, neither 'view source' in the browser nor Jsoup is able to see these elements. I guess this content is embedded dynamically into the page. If so, what is the best way to access dynamically generated content? Thanks.
For websites that use AJAX/JS library techniques to generate content, you may want to use HtmlUnit instead (HtmlUnit can simulate JavaScript events). Jsoup is only for static HTML, or things that you could see via 'view source'.
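If you want to keep your Jsoup selectors, one common pattern is to let HtmlUnit render the page and then hand the resulting DOM to Jsoup. A sketch, where the pagination selector is a guess that must be adjusted to the markup you actually see in the dev tools:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class YelpPagination {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://www.yelp.com/search?find_desc=restaurants&find_loc=westmont%2C+il");
            webClient.waitForBackgroundJavaScript(10_000); // let the scripts finish
            // Parse the JavaScript-rendered DOM with Jsoup.
            Document doc = Jsoup.parse(page.asXml());
            // Guessed selector for the "Go to Page" links.
            doc.select("div.pagination a").forEach(a -> System.out.println(a.text()));
        }
    }
}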
I am working with Liferay, and I need to show a preview of the HTML output of a URL as an embedded window in a JSP view. I am assessing different possibilities:
1. Store the interface to preview as a screenshot image somehow and show it as an embedded image. The good thing is that the formatting would be exactly the same.
2. Parse the URL output stream with a BufferedReader and clean all html, scripting, and body tags with indexOf. Embed images as cid: references.
3. Some kind of include, jsp:include or liferay-util:include, from the direct URL of the downloaded temporary HTML output.
4. Some jQuery AJAX $().html() kind of solution.
5. Any HTML-level solution with an iframe, applet, frame, applet window, or whatever, if it exists.
What do you think is the best or recommended way: the simplest, most reliable, and most exact-looking? Any code or references?
And what about the case where I have to send it as JavaMail Message content to an email address?
Thank you!!
This should probably be done client-side, in JavaScript, or even via iframes. Either put the iframes in the page directly, or have JavaScript code that generates the iframes, and point the iframes at the URL to be previewed. Keep it simple.
I want to display an external webpage (exactly as it's rendered on that site) inside a page in my application, in a way that's fast and good for SEO crawlers. I was wondering if there's a way to do that with Java EE?
If not, which is better for performance and SEO: the XMLHttpRequest way or the iframes way?
Please advise, with sample code or a link if possible. Thanks.
Update: example website is: http://www.akhbarak.net/
If you need to display content from different pages inline, use an iframe (iframe stands for inline frame; it has nothing to do with Apple).
If you'd like to use AJAX to display pages, I would recommend colorbox.
Note that accessing pages on a different domain via AJAX is next to impossible: browsers block it under the same-origin policy, because it would otherwise be a very big security hole. I would not recommend trying it. You would have to use a proxy on your own server to fetch the page and return its HTML.
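A minimal sketch of such a proxy as a servlet, so the browser fetches the remote page same-origin; the servlet mapping and the fixed target URL (taken from the question) are placeholders:

import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.jsoup.Jsoup;

// Fetches the remote page server-side and relays it to the browser.
@WebServlet("/preview-proxy")
public class PreviewProxyServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Placeholder target: in real code, validate or whitelist the URL to avoid an open proxy.
        String target = "http://www.akhbarak.net/";
        String html = Jsoup.connect(target).get().outerHtml();
        resp.setContentType("text/html;charset=UTF-8");
        resp.getWriter().write(html);
    }
}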
That said, using the iframe in your source code, so it is loaded with the rest of the page, seems like your best bet. Sites like Facebook and Twitter use this in embeddable "like" and "tweet" widgets so that those widgets can make requests on their own domain, that is, twitter.com or facebook.com. While managing lots of iframes isn't very fun, it is a very accepted way of doing what you want to do.
In theory, you could:
1. load the whole page into a PHP variable,
2. replace the <body> tags with <div>s,
3. take out the <html> tags,
4. pull out the entire <head> section and put it in the encompassing page's <head>,
5. and replace all relative links with absolute ones (i.e. '/images' changes to 'http://example.com/images').
Would it be easy to do? Probably not. It's the only way I can think of to accomplish it so that the site appears as part of yours, though. A sketch of the same idea in Java terms follows below.
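Since the question is about Java EE, here is the same idea sketched with Jsoup; the steps mirror the list above (keep only the body, wrapped in a div, with links rewritten to absolute), and the class wrapper is illustrative only:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InlinePreview {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.akhbarak.net/").get();
        // Rewrite relative links to absolute ones so they keep working
        // once the markup is embedded in your own page.
        for (Element link : doc.select("a[href]")) {
            link.attr("href", link.absUrl("href"));
        }
        for (Element img : doc.select("img[src]")) {
            img.attr("src", img.absUrl("src"));
        }
        // Keep only the body content, wrapped in a div instead of <html>/<body> tags.
        String embeddable = "<div class=\"preview\">" + doc.body().html() + "</div>";
        System.out.println(embeddable);
    }
}

The remote page's stylesheets and scripts are dropped by this approach, which is one more reason the plain iframe usually ends up being the simpler and more faithful option.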