Display an external webpage inside a page in my application - Java

I want to display an external webpage (exactly as it's rendered on that site) inside a page in my application, in a way that's fast and good for SEO crawlers. Is there a way to do that with Java EE?
If not, which is better for performance and SEO: the XMLHttpRequest way or the iframe way?
Please advise with sample code or a link if possible. Thanks.
Update: example website is: http://www.akhbarak.net/

If you need to display content from different pages inline, use an iframe (iframe stands for inline frame; it has nothing to do with Apple).
If you'd like to use AJAX to display pages, I would recommend colorbox.
Note that accessing pages on a different domain via AJAX is next to impossible; browsers block it because allowing it would be a very, very big security hole. I would not recommend trying to work around that. Instead, you would have to use a proxy on your own server to fetch the page and return its HTML.
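A minimal sketch of such a proxy in Java EE (the servlet name and URL pattern are placeholders; a real one would need URL whitelisting, caching, and error handling):
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Fetches the external page server-side so the browser only ever talks to your domain.
@WebServlet("/proxy")
public class ProxyServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/html; charset=UTF-8");
        try (InputStream in = new URL("http://www.akhbarak.net/").openStream()) {
            in.transferTo(resp.getOutputStream()); // stream the remote HTML through
        }
    }
}
Your page can then load /proxy via XMLHttpRequest without hitting the same-origin policy.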
That said, putting the iframe in your source code, so it is loaded with the rest of the page, seems like your best bet. Sites like Facebook and Twitter use this in their embeddable "like" and "tweet" widgets so that those widgets can make requests against their own domain, that is, twitter.com or facebook.com. While managing lots of iframes isn't much fun, it is a widely accepted way of doing what you want to do.

In theory, you could:
1. load the whole page into a PHP variable,
2. replace the body tags with <div> tags,
3. take out the <html> tags,
4. pull out the entire <head> section and put it in the encompassing page's <head>,
5. and replace all relative links with absolute ones (i.e. '/images' changes to 'http://example.com/images').
Would it be easy to do? Probably not. It's the only way I can think of to accomplish it so that the site appears as part of yours, though.
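If you go down that road in Java rather than PHP, Jsoup makes the link-rewriting step fairly painless. A rough sketch (assuming Jsoup is on the classpath; the <head>-merging step is left out):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InlinePage {
    // Fetch a page and return its body content with all links made absolute,
    // ready to be dropped into a <div> of the surrounding page.
    public static String fetchAsFragment(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        for (Element e : doc.select("[href]")) e.attr("href", e.absUrl("href"));
        for (Element e : doc.select("[src]"))  e.attr("src", e.absUrl("src"));
        return doc.body().html();
    }
}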

Related

Jsoup Issue scraping non-hardcoded data

I'm trying to use Jsoup to gather wave-height information from Surfline.com. The element I want shows up in the browser dev tools, but when I scrape the site with Jsoup, the returned string includes everything seen in the dev tools except the "1-2ft" value, which is what I need. The site is JavaScript-heavy, and I'm assuming Jsoup is grabbing the HTML before the JavaScript actually runs (I have no clue, really). Do I need to specifically tell Jsoup to wait for the page load, or am I missing some other critical component?
This is the code I'm using:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
and this is the output I see in my IDE:
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
It seems really odd that the contents of the div aren't returned with it. This is my first time using Jsoup, and I tried to read through the docs as best I could, but nothing seemed to touch on this particular issue. Any insight would be awesome and greatly appreciated.
What you see in the browser is not necessarily what you would get when downloading the page by URL with the HTTP library of your choice. In fact, you should never expect them to be the same. On the modern web, pages are quite dynamic and are loaded asynchronously, involving multiple API calls to different resource providers and JavaScript executed in the browser (which has a JavaScript engine).
What you get with Jsoup in this case is the initial HTML that the browser starts to form the page with. Then there is a set of XHR calls to the Surfline API that brings the data into the browser, which then dynamically fills up different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to a browser automation tool called Selenium, which fires up a real browser. You can then wait for the current-surf-range element to have a value and, if you wish to continue with Jsoup, get the page source and feed it to Jsoup for further parsing.
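A sketch of that Selenium-plus-Jsoup route (assuming Selenium 4 and a local Chrome/chromedriver setup):
import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SurflineScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/");
            // Wait until the XHR calls have filled in the surf range.
            new WebDriverWait(driver, Duration.ofSeconds(15))
                .until(d -> !d.findElement(By.id("current-surf-range")).getText().isEmpty());
            // Hand the rendered HTML over to Jsoup for the actual parsing.
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.select("div[id=current-surf-range]").text());
        } finally {
            driver.quit();
        }
    }
}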
Another approach would involve looking into the requests the page makes in the browser developer tools and then trying to simulate those requests in your code, parsing the JSON responses and extracting the surf forecast data.

Parsing dynamically growing web pages with jsoup

I'm writing an Android app that parses a web page, filters the image links from it, and loads them in a WebView.
It works fine for static pages, but I have no idea how to handle pages that dynamically add content as I scroll down, such as 9gag, imgur, Facebook, etc.
Is there a solution for this? I guess the dynamic content is handled by JavaScript. Maybe there's a way to call this JavaScript code before parsing the page?
I'd appreciate any advice.
Thanks in advance.
You should try looking at the requests those dynamic pages make.
All of them use a pattern of dynamic pagination, or a cursor.
Imgur, for example, issues requests with a URL like this:
https://imgur.com/gallery/hot/viral/page/4/hit?set=0
where you specify the page, and the set is the portion of the page (normally they go up to 3).
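In Java you could walk those pages with Jsoup and collect the image links as you go. A sketch (assuming the URL pattern above still works; Imgur may have changed or removed it since):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImgurPager {
    public static void main(String[] args) throws Exception {
        for (int page = 0; page < 5; page++) {
            for (int set = 0; set <= 3; set++) {
                String url = String.format(
                    "https://imgur.com/gallery/hot/viral/page/%d/hit?set=%d", page, set);
                Document doc = Jsoup.connect(url).get();
                // Collect the image links, just as you would for a static page.
                for (Element img : doc.select("img[src]"))
                    System.out.println(img.absUrl("src"));
            }
        }
    }
}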

Convert Web Page to PDF or Image

I need to convert a web page [which does not have public access] to PDF or an image [preferably PNG].
The web page contains a set of charts and images. Most of the charts are populated through Ajax calls, so there is a delay between page load and chart load.
I am looking for an answer to either of these questions:
1- I found a set of snapshot APIs, but none of them support accessing my internal page. Since the page I am trying to export is not public, I need to be authenticated. The biggest problem is that I cannot send request headers [such as a session id, cookies, or other variables] along with these APIs. It seems they don't support this kind of functionality.
2- I am not sure if I can do the following: log in to my web page with an HTTP client, add HTTP headers, send a GET request, and get the HTML as a string. Then use one of the converters to convert it to PDF. What I am not sure about is whether I can get a proper PDF from the HTML string returned by the HTTP client, since resources [CSS, JS, etc.] will be missing. I want my PDF/image to look exactly as it does on the web site.
I'd really appreciate any help.
Thanks in advance,
ED
You're probably best off using wkhtmltopdf, which is a server-side tool and is easily installed.
There are two parameters you can use to wait for your Ajax to finish:
--javascript-delay to influence how long the program waits for the JavaScript to finish
--window-status to wait until window.status is set to a certain value
See the program's extensive manual for the details.
wkhtmltopdf generates a PDF, and wkhtmltoimage generates an image, which is PNG (as you requested) by default.
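Since you're presumably in Java, you can shell out to wkhtmltopdf with ProcessBuilder and pass your session cookie along, which also covers the authentication part. A sketch (the URL, cookie name, and value are placeholders):
import java.io.IOException;

public class PdfExport {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "wkhtmltopdf",
                "--javascript-delay", "5000",        // give the Ajax charts 5s to load
                "--cookie", "JSESSIONID", "abc123",  // authenticate with your session cookie
                "http://intranet.example.com/charts",
                "charts.pdf")
            .inheritIO()
            .start();
        System.out.println("wkhtmltopdf exited with " + p.waitFor());
    }
}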
Authentication is difficult because it involves security. Because the operation you are describing is unusual, it is likely to set off all kinds of alarm bells. It is entirely possible to do, but it is fraught, easy to get wrong, and fragile in the face of security updates and code changes.
As such, I'm going to suggest an alternative method, one we often recommend for ABCpdf (on which I work). Yes, we support standard authentication methods, but the beauty of this approach is that it is robust and applicable to other solutions (e.g. Java-based ones) and to novel authentication methods.
Typically you just want a PDF of the current page. The easiest way to get one is to grab the HTML. How you do this depends on your environment. For example, under ASP.NET you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page; the equivalent depends on what you're coding in.
Then you save this HTML to a file and present it to your solution via a 'file://' protocol URL. Obviously, at this point any relative links will be broken, but this is easily fixed by dropping in a BASE tag that references the place where they are actually located.
Generally the types of resources referenced by a server-side page are static. So if your BASE tag references the actual files rather than the web site, you will bypass any authentication for access to those resources.
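In Java, the save-with-a-BASE-tag step might look like this with Jsoup (a hypothetical helper; the rendered HTML is whatever your framework gives you):
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SnapshotWriter {
    // Save already-rendered HTML to a temp file with a BASE tag
    // so that relative links keep resolving when loaded via file://.
    public static Path save(String renderedHtml, String baseUrl) throws Exception {
        Document doc = Jsoup.parse(renderedHtml);
        doc.head().prependElement("base").attr("href", baseUrl);
        Path file = Files.createTempFile("snapshot", ".html");
        Files.writeString(file, doc.outerHtml());
        return file; // hand file.toUri() to the converter
    }
}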
That still leaves the AJAX-based problems, which are another can of worms. The render-delay method is something we have supported for many years (from before AJAX was around), but it is not terribly reliable because you just don't know how long to wait.
Much better is a tighter link into the JavaScript via a callback you can use to determine whether the page has finished loading. I don't think ABCpdf is going to be appropriate for you, since it is .NET, but I would certainly encourage you to look for a Java-based solution that uses this more sophisticated approach.

html5: Write content to a page without refreshing

I want to create a page where the top half contains start/stop buttons and the bottom half displays content written from the server. The buttons invoke functions on the server; the server does some computing and generates timely messages, which need to be written to the bottom half of the page.
Possible ways of doing it:
1. AJAX
2. DWR
3. HTML5
Let me know which method is better and how I can do it.
AJAX
Means "Making an HTTP request without leaving the page". This will let you get content from the server. To write it you need to do DOM manipulation. There are no shortage of Ajax and DOM tutorials on the web
DWR
Is a library to help with Ajax (although not one I've heard of before).
HTML5
Is a now meaningless buzzword. Of all the meanings it has, only the ones which include "Ajax and DOM manipulation" will help, and that's already covered above.
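For the server half of the Ajax option above, a minimal Java servlet the page could poll with XMLHttpRequest might look like this (the class name and URL are placeholders):
import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/status")
public class StatusServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        // In a real app this would return the latest message from your computation.
        resp.getWriter().write("Computation progress: 42%");
    }
}
The page then polls /status every few seconds and writes the responseText into the bottom half via the DOM.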
Look up a tutorial on XMLHttpRequest; there is a good explanation of it at Quirksmode. I am not sure where your current skills with HTML, server-side programming, XML, and JavaScript stand, but you will need a minimum understanding of all of those to successfully implement AJAX.
There are lots of resources on the web; Google is your friend.

Scraping websites in Java

What I am trying to do is take a list of URLs and download each URL's content (for indexing). The biggest problem is that if I encounter a link such as a Facebook event that simply redirects to the login page, I need to be able to detect and skip that URL. It seems as though the robots.txt file is there for this purpose. I looked into Heritrix, but it seems like way more than I need. Is there a simpler tool that will provide information about robots.txt and scrape sites accordingly?
(Also, I don't need to follow additional links and build up a deep index; I just need to index the individual pages in the list.)
You could just take the class you are interested in, i.e. http://crawler.archive.org/xref/org/archive/crawler/datamodel/Robotstxt.html
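If pulling in Heritrix just for that class feels heavy, a hand-rolled check is not much code. A deliberately simplified sketch that only honours the "User-agent: *" group (a real parser should also handle wildcards, Allow lines, and per-agent groups):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsCheck {
    // Returns false if the page's path is disallowed for "User-agent: *".
    public static boolean isAllowed(URL page) throws Exception {
        URL robots = new URL(page.getProtocol(), page.getHost(), "/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) robots.openConnection();
        if (conn.getResponseCode() != 200) return true; // no robots.txt: allowed
        boolean forAllAgents = false;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:"))
                    forAllAgents = line.substring(11).trim().equals("*");
                else if (forAllAgents && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty() && page.getPath().startsWith(path)) return false;
                }
            }
        }
        return true;
    }
}
To catch the login redirects, you can also open each URL with HttpURLConnection, call setInstanceFollowRedirects(false), and skip anything that answers with a 3xx status.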
