How can I get the size of a web page, in bytes, from a URL? This should include all images.
How can we do that? Any help is appreciated.
Thanks in advance.
To find the number of bytes used to represent a web page, you would need to:
fetch the HTML page and all images, scripts, CSS files, etc. that it references, transitively,
evaluate any embedded scripts (as per the HTML spec) to see if they pull in further resources, and
sum the byte counts for all resources loaded to give the "web page size".
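A rough sketch of the first and last steps, assuming Jsoup is used for the fetching and parsing (Jsoup does not execute JavaScript, so the second step is not covered, and resources pulled in only by CSS or scripts are missed):

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class PageSize {
        public static void main(String[] args) throws Exception {
            String url = "http://example.com/";                 // placeholder page
            Connection.Response page = Jsoup.connect(url).execute();
            long total = page.bodyAsBytes().length;             // the HTML itself

            Document doc = page.parse();
            // img and script use "src", stylesheet links use "href"
            for (Element e : doc.select("img[src], script[src], link[href]")) {
                String resUrl = e.hasAttr("src") ? e.absUrl("src") : e.absUrl("href");
                if (resUrl.isEmpty()) continue;
                try {
                    total += Jsoup.connect(resUrl)
                            .ignoreContentType(true)            // images/CSS/JS are not HTML
                            .maxBodySize(0)                     // do not truncate large resources
                            .execute()
                            .bodyAsBytes().length;
                } catch (Exception ignored) {
                    // broken or unreachable reference; skip it
                }
            }
            System.out.println("Approximate page size: " + total + " bytes");
        }
    }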
But I don't see what you would learn by doing this. For instance, the web page size (as above) is not a good predictor of network usage.
You say:
I am doing this to analyze the performance of a web page.
A better way would be to use something like the YSlow plugin for Firefox.
I can connect to most sites and get the HTML just fine, but when I try to connect to a website where most of the content is generated with JavaScript after the initial page load, Jsoup does not get any of that data. Is there any way to do this with Jsoup, or does it not support it?
Jsoup has some basic connection handling included, but it is not a web browser. It excels at parsing static HTML content, but it does not run any JavaScript, so you are out of luck there. However, there are a couple of options you might follow:
You can analyze the page that you want to retrieve and find out how the content you are interested in gets loaded. Often it is not very hard to tap the original source of the loaded content and work with this. This approach has the benefit that you get what you want with no need of extra libraries and the retrieval will be fast.
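For example, if the browser's network tab shows the content coming from a JSON endpoint, you can often call that endpoint directly with Jsoup (the URL below is purely a placeholder):

    import org.jsoup.Jsoup;

    public class FetchAjaxEndpoint {
        public static void main(String[] args) throws Exception {
            // The data the page loads via JavaScript usually comes from a plain
            // HTTP endpoint that you can call yourself.
            String json = Jsoup.connect("http://example.com/api/items?page=1")
                    .ignoreContentType(true)   // the response is JSON, not HTML
                    .execute()
                    .body();
            System.out.println(json);          // feed this to a JSON parser of your choice
        }
    }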
You can use a (full) browser and automate the loading of the page. A very good tool for this is Selenium WebDriver in combination with the headless WebKit browser PhantomJS. This however requires extra software and extra libraries in your project, and will run much, much slower than the first solution.
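A minimal sketch of this second option, assuming the Selenium Java bindings and the PhantomJS driver (GhostDriver) are on the classpath; any other headless browser driver would work the same way:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;

    public class RenderedHtml {
        public static void main(String[] args) {
            // Let a real (headless) browser execute the JavaScript, then hand
            // the rendered HTML over to Jsoup for parsing.
            System.setProperty("phantomjs.binary.path", "/path/to/phantomjs"); // adjust
            WebDriver driver = new PhantomJSDriver();
            try {
                driver.get("http://example.com/dynamic-page");   // placeholder URL
                String renderedHtml = driver.getPageSource();    // HTML after the JS has run
                Document doc = Jsoup.parse(renderedHtml);
                System.out.println(doc.title());
            } finally {
                driver.quit();
            }
        }
    }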
I need to convert a web page [which is not publicly accessible] to a PDF or image [preferably PNG].
The web page contains a set of charts and images. Most of the charts are populated through Ajax calls, so there is a delay between page load and chart load.
I am looking for an answer to either of these questions:
1- I found a set of snapshot APIs, but none of them support accessing my internal page. Since the web page I am trying to export is not public, I need to be authenticated. The biggest problem is that I cannot send request headers [such as a session ID, cookie or other variables] along with these APIs; it seems they don't support this kind of functionality.
2- I am not sure if I can do the following: log in to my web page with an HTTP client, add HTTP headers, send a GET call and get the HTML string, then use one of the converters to turn it into a PDF. What I am not sure about is whether it's possible to get a proper PDF from the HTML string I got from the HTTP client, since resources [CSS, JS, etc.] will be missing. I want my PDF/image to look exactly as it does on the web site.
I would really appreciate it if you could help.
Thanks in advance,
ED
You're probably best off using wkhtmltopdf, which is a server-side tool and is easily installed.
There are two parameters you can use to wait for your Ajax calls to finish; try:
--javascript-delay to influence how long the program waits for the JavaScript to finish
--window-status to wait until window.status has been set to a certain value
See the extensive manual for this program here
wkhtmltopdf generates a PDF and wkhtmltoimage generates an image, which is PNG (as you requested) by default.
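For example, invoked from Java with ProcessBuilder (the flag names are from the wkhtmltopdf manual; the URL, cookie and window-status value are placeholders):

    import java.io.IOException;

    public class ReportToPdf {
        public static void main(String[] args) {
            // --javascript-delay and --window-status are the two waiting options
            // mentioned above; --cookie passes a session cookie for pages that
            // require authentication.
            ProcessBuilder pb = new ProcessBuilder(
                    "wkhtmltopdf",
                    "--javascript-delay", "5000",          // wait up to 5 s for the Ajax calls
                    "--window-status", "charts-ready",     // or wait until the page sets window.status
                    "--cookie", "JSESSIONID", "abc123",    // placeholder session cookie
                    "http://intranet.example.com/report",  // placeholder URL
                    "report.pdf");
            pb.inheritIO();
            try {
                int exit = pb.start().waitFor();
                System.out.println("wkhtmltopdf exited with " + exit);
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        }
    }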
Authentication is difficult because it involves security. Because the operation you are describing is unusual it is likely to result in all kinds of alarm bells going off. It is entirely possible to do but it is fraught, easy to get wrong and fragile in the face of security updates and code changes.
As such I'm going to suggest an alternate method which is one we often recommend for ABCpdf (on which I work). Yes we support standard authentication methods but the beauty of this approach is that it is robust and is applicable to other solutions (eg Java based) and novel authentication methods.
Typically you just want a PDF of the current page. The easiest way to do this is to snaffle the HTML. The way you do this rather depends on your environment. For example, under ASP.NET you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page. The way you do it will depend on what you're coding in.
Then you need to save this HTML to a file and present it to your solution via a 'file://' protocol URL. Now obviously at this point any relative links will be broken but this is easily fixed by dropping in a BASE tag that references the place they are located.
Generally the types of resources referenced by a server-side page are static. So if you can create a BASE tag that references the actual files rather than a web site, you will bypass any authentication for access to these resources.
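A rough sketch of those two steps in Java, assuming you have already captured the HTML server-side (the base URL is a placeholder; as noted above it could just as well be a file:// path to the directory holding the static resources):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class SnapshotFile {
        public static void main(String[] args) throws Exception {
            // In reality this would be the HTML grabbed from the current response.
            String capturedHtml =
                    "<html><head><title>Report</title></head>"
                    + "<body><img src=\"img/chart.png\"></body></html>";

            // Drop a BASE tag in right after <head> so relative links resolve.
            String withBase = capturedHtml.replaceFirst("(?i)<head([^>]*)>",
                    "<head$1><base href=\"http://www.example.com/app/\">");

            // Write it out so the converter can be pointed at a file:// URL.
            Path out = Files.createTempFile("snapshot", ".html");
            Files.write(out, withBase.getBytes(StandardCharsets.UTF_8));
            System.out.println(out.toUri());   // e.g. file:///tmp/snapshot123.html
        }
    }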
That still leaves the AJAX based problems which are another can of worms. The render delay method is something we have supported for many years (from before AJAX was around) however it is not terribly reliable because you just don't know how long to wait.
Much better is a tighter link into the JavaScript via a callback you can use to determine if the page is loaded. I don't think ABCpdf is going to be appropriate for you since it is .NET but I would certainly encourage you to look for a Java based solution that uses this type of more sophisticated approach.
I want to display an external webpage (exactly as it's rendered on that site) inside a webpage in my application, in a way that's fast and good for SEO crawlers. I was wondering if there's a way to do that with Java EE?
If not, then which is better in performance and for SEO: the XMLHttpRequest way or the iframe way?
Please advise with sample code or a link if possible. Thanks.
Update: example website is: http://www.akhbarak.net/
If you need to display content from different pages inline, use an iframe (iframe stands for "inline frame"; it has nothing to do with Apple).
If you'd like to use AJAX to display pages, I would recommend colorbox.
Note that accessing pages on a different domain via AJAX is next to impossible: the browser's same-origin policy blocks it, precisely because allowing it would be a very big security hole. I would not recommend trying to work around it. You would have to use a proxy on your own server to fetch the page and return its HTML.
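If you do go the proxy route, a deliberately naive servlet sketch could look like the following (a real proxy would also need caching, error handling and a whitelist of allowed targets):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Fetches the external page on the server side and streams it back, so the
    // browser only ever talks to your own domain.
    public class ProxyServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            URL target = new URL("http://www.akhbarak.net/");   // page to mirror
            resp.setContentType("text/html; charset=UTF-8");
            try (InputStream in = target.openStream();
                 OutputStream out = resp.getOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
    }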
That said, using the iframe in your source code, so it is loaded with the rest of the page, seems like your best bet. Sites like Facebook and Twitter use this in embeddable "like" and "tweet" widgets so that those widgets can make requests on their own domain - that is, Twitter or Facebook. While managing lots of iframes isn't very fun, it is a widely accepted way of doing what you want to do.
In theory, you could
load the whole page into a PHP variable,
replace the <body> tags with <div> tags,
take out the <html> tags,
pull out the entire <head> section and put it in the encompassing page's <head>,
and replace all links with absolute ones (i.e. '/images' changes to 'http://example.com/images')
Would it be easy to do? Probably not. It's the only way I can think of to accomplish it so that the site appears as part of yours though.
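The steps above are phrased in PHP terms; since the question mentions Java EE, a rough equivalent with Jsoup (using the example site from the question) might look like this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class EmbedExternalPage {
        public static void main(String[] args) throws Exception {
            // Fetch the external page, rewrite relative links to absolute ones,
            // and keep only the body so it can be dropped into your own page.
            Document doc = Jsoup.connect("http://www.akhbarak.net/").get();
            for (Element e : doc.select("[href]")) {
                e.attr("href", e.absUrl("href"));   // absolutise links and stylesheets
            }
            for (Element e : doc.select("[src]")) {
                e.attr("src", e.absUrl("src"));     // absolutise images and scripts
            }
            String embeddable = doc.body().html();  // inner HTML of <body>, ready to embed
            System.out.println(embeddable);
        }
    }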
We have a Java desktop app with an embedded browser, currently using XULRunner (the Firefox engine) on SWT. This browser's API allows us to load pages by specifying a URI or their HTML content.
What we need is to load HTML webpages, including their resources, with everything kept in memory. The best solution would be a listener that is invoked when the engine tries to load a resource, so we can supply the appropriate content ourselves.
Any ideas? Thank you!
It sounds like you need a small HTTP / web server. There is Jetty, and there are also a few smaller ones; just search for "small Java web server" or similar.
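A minimal sketch with embedded Jetty (assuming the Jetty 9.x server JARs are on the classpath); the in-memory map of resources is purely illustrative:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.eclipse.jetty.server.Request;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.AbstractHandler;

    // Serves pages and resources straight from memory; the embedded browser is
    // then simply pointed at http://localhost:8080/index.html.
    public class InMemoryServer {
        public static void main(String[] args) throws Exception {
            Map<String, byte[]> resources = new HashMap<>();
            resources.put("/index.html",
                    "<html><body><img src='logo.png'></body></html>".getBytes());
            resources.put("/logo.png", new byte[0]);   // real image bytes would go here

            Server server = new Server(8080);
            server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request, HttpServletResponse response)
                        throws IOException {
                    byte[] body = resources.get(target);
                    if (body == null) {
                        response.setStatus(HttpServletResponse.SC_NOT_FOUND);
                    } else {
                        response.getOutputStream().write(body);
                    }
                    baseRequest.setHandled(true);
                }
            });
            server.start();
            server.join();
        }
    }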
In HTML5 you can put your resources inside the HTML itself (for example as data: URIs).
So you can use SWT with a browser that supports HTML5 and prepare your webpages so that their resources are embedded in the HTML.
With the SWT Browser you can simply do browser.setText(html) to load the page from memory.
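A small self-contained sketch of that approach with the SWT Browser widget; the base64 image data is only a placeholder:

    import org.eclipse.swt.SWT;
    import org.eclipse.swt.browser.Browser;
    import org.eclipse.swt.layout.FillLayout;
    import org.eclipse.swt.widgets.Display;
    import org.eclipse.swt.widgets.Shell;

    public class InlineResourcePage {
        public static void main(String[] args) {
            Display display = new Display();
            Shell shell = new Shell(display);
            shell.setLayout(new FillLayout());

            // The image is embedded as a data: URI, so nothing external is fetched.
            String html = "<html><body>"
                    + "<h1>In-memory page</h1>"
                    + "<img src='data:image/png;base64,iVBORw0KGgo...'>"  // placeholder base64 data
                    + "</body></html>";

            Browser browser = new Browser(shell, SWT.NONE);
            browser.setText(html);   // load the page entirely from memory

            shell.open();
            while (!shell.isDisposed()) {
                if (!display.readAndDispatch()) display.sleep();
            }
            display.dispose();
        }
    }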
Let me outline the problem space. I want to create a SEO-friendly page that contains dynamic information, but also has areas of information that are easily editable by HTML content editors (NOT programmers) outside of the normal development lifecycle (I'll call this content 'static' content). For example, think of a product page that has fluffy content about a product and some pictures on top (the static content), and then at the bottom real-time dynamic search results from our site for that product (the dynamic content).
Some constraints:
AJAX is not an option for the dynamic portion (spiders will not get the dynamic content)
an IFrame is not an option for the dynamic portion (dilutes benefit of SEO)
the static content should be easily editable at any time by someone outside of development, and the changes should take effect in a timely manner (real time is not necessary, but they should not need to wait until we restart the webapp servers, for example).
these pages will be hit hard, so performance and system impact are a factor (for example, going to the database or file system for content on every page hit is not reasonable).
What I am thinking is that the entire page needs to be a standard dynamic servlet, with customized areas of HTML that the content editors can somehow edit. It is this editing aspect that I am looking for suggestions on. I could solve the problem with text files on our NAS, in a location shared by both the content editors and the webapp server cluster, which the webapp servers read in, cache on page access, and push to portions of the view layer. But I'm hoping there is something out there that makes this a bit less hackish, or at least does some of the plumbing for me, and that could plug into our view or controller layer.
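For what it's worth, a minimal sketch of that file-based approach: the fragment is re-read only when the file's modification time changes, so editors could update it without a redeploy (the path and class name are illustrative):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Caches an editor-maintained HTML fragment and reloads it only when the
    // file on the shared NAS has been modified.
    public class StaticFragmentCache {
        private final Path file;
        private volatile String cached = "";
        private volatile long lastModified = -1;

        public StaticFragmentCache(String path) {
            this.file = Paths.get(path);          // e.g. a file on the shared NAS
        }

        public String get() throws IOException {
            long mtime = Files.getLastModifiedTime(file).toMillis();
            if (mtime != lastModified) {          // an editor saved a new version
                cached = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                lastModified = mtime;
            }
            return cached;                        // hand this to the Velocity context
        }
    }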
Of course if there is a way to keep the entire page static, but pull in some dynamic data in a way that spiders will see it as part of the same page, that would be ideal.
Notes on technologies:
- we use an open-source Java stack with Velocity as the view layer to serve our dynamic servlet content
- Apache serves all static HTML pages
You may want to look at Clickframes, or at Drupal. The first is a way to write up pages in XML that then get generated into a site;
the second is a portal toolkit.