I don't want just the source code. I want the rendered page. This is an important distinction that I apparently cannot make by simply searching Google.
Does anyone know how I can get the rendered page from a URL?
This needs to be done in Java, hopefully without an extra library.
Another solution would be to use HtmlUnit, which is a "GUI-less browser for Java". Google recommends it for generating snapshots of Ajax-based web pages so that they can be crawled.
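For example, a minimal sketch with HtmlUnit's 2.x API (the URL and the 5-second JavaScript wait are placeholders):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedPageFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Let the embedded engine execute the page's JavaScript
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("https://www.example.com");
            // Give any asynchronous (Ajax) scripts a moment to finish
            webClient.waitForBackgroundJavaScript(5_000);

            // asXml() returns the DOM as it stands after script execution,
            // i.e. the rendered markup rather than the raw source
            System.out.println(page.asXml());
        }
    }
}
```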
You can try using a library that wraps a web browser, for example Berkelium. If you need it in Java, a Google search produced this Java wrapper API for Berkelium (I haven't tried it personally).
The project's page on sites.google.com has an example of its use.
I am developing a JSF web application and would like to include a lot of documentation that is visible directly in the web application. Technically I would like to use the Markdown language, and I have already made my first experiments with it.
I am currently playing around with the flexmark Java library to render HTML strings from a markdown document, and this also seems to work fine. But what should I do with links to other md files?
Say my markdown contains: See also [here](Background.md)
Then this is rendered correctly to HTML with a link like: See also here.
But how should I tell my web server to react on this link and update the document part of the page with the rendered md file?
I would need to manually find such links in the generated HTML and change them into some kind of JavaScript call that tells my server to render the panel using the other md file.
Or should I create an IFrame so that, within this frame, I could follow the link to e.g. a web servlet that renders the md files to new HTML?
But this all feels a bit clumsy to me. Am I missing an easier solution?
OK, no other answers, so I'll answer my own question.
The comment about the PrimeFaces extension with "localized" is interesting, but it is too far from my focus and some of its features did not really match my requirements.
Therefore I stayed with a pure Markdown library and built the rest myself.
Handling the links was much easier than expected! Within JavaScript you can very easily detect all links on the page (document.links), iterate over them, and simply set an onclick function (see here).
This question was asked during my interview, and I was asked to implement it. The question is as follows:
Your application:
Will take the username and password for a LinkedIn profile,
Will use them to log in on the page www.linkedin.com,
Will simulate clicking the Profile -> Edit profile menu,
Will scrape the resulting page of that user in the format below and dump it in a text file. (Hint: you can use the Beautiful Soup library.)
On fetching this URL, you need to read the following information and put it in a CSV/Excel file.
Can somebody give me an idea of how to do it? It should be done using Java only.
I'd use web browser automation software like Selenium (http://www.seleniumhq.org/), which seems like it will solve this problem. You can choose any of its bindings (Java, C#, Ruby, Python, JavaScript) to implement the solution.
Take a look at the tutorials https://www.airpair.com/selenium/posts/selenium-tutorial-with-java
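As a rough sketch with the Java binding (the login-field locators and the edit-profile URL are assumptions that would need to be checked against LinkedIn's real markup, and a ChromeDriver binary is assumed to be on the PATH):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import java.nio.file.Files;
import java.nio.file.Paths;

public class LinkedInScraper {
    public static void main(String[] args) throws Exception {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.linkedin.com/login");

            // NOTE: the element names below are assumptions; inspect the
            // live page and adjust the locators accordingly.
            driver.findElement(By.name("session_key")).sendKeys("user@example.com");
            driver.findElement(By.name("session_password")).sendKeys("password");
            driver.findElement(By.cssSelector("button[type=submit]")).click();

            // Simulate navigating to the profile edit page
            driver.get("https://www.linkedin.com/profile/edit");

            // Dump the rendered page source to a text file for later parsing
            Files.write(Paths.get("profile.txt"), driver.getPageSource().getBytes());
        } finally {
            driver.quit();
        }
    }
}
```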
This is essentially a web-crawling task, and we can do it very well with the Jsoup library.
You have to look at how it is implemented with the Jsoup library; with it we can filter out the links that look something like
https://www.linkedin.com/profile/edit?trk=nav_responsive_sub_nav_edit_profile
Here, as you can see, the keyword edit_profile can be used to filter out the links we require.
A link you should follow to explore more about Jsoup:
Webcrawler using JSOUP
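A minimal Jsoup sketch along these lines (Jsoup does not execute JavaScript or handle the login itself, so the URL here is just a placeholder for a page you can already reach):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditProfileLinkFinder {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (cookies from a prior login would
        // normally be passed along via .cookies(...)).
        Document doc = Jsoup.connect("https://www.linkedin.com/")
                .userAgent("Mozilla/5.0")
                .get();

        // Select only anchors whose href contains the edit_profile keyword
        for (Element link : doc.select("a[href*=edit_profile]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}
```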
In the project I am working on, I need to access information from the website explainxkcd.com, which gives explanations of specific xkcd comics. The information I am looking for is the explanation of a comic as a string. Explainxkcd is a site that runs on MediaWiki, software that provides the template for the "wiki" feel. MediaWiki has an API that allows you to extract information from sites that use it, and I have gone to http://www.mediawiki.org/wiki/API:Main_page trying to figure out how to use the API for this particular wiki, but to no avail. It seems that you can replace "index.php" in a URL with api.php to use the API, but when I try this for http://explainxkcd.com/9/api.php it doesn't seem to work. I guess my URL is wrong, but I don't see any information on how to find the specific URL to use for explainxkcd.com.
QUESTION:
How can I access information from a third-party MediaWiki site in a Java program? This can be through the MediaWiki API or some other solution. If you know a good way to find the URL that can be used with the MediaWiki API, that would be preferred. Just looking for a nudge in the right direction here.
Thanks
Using the same method, s/index.php/api.php/, I get a different result: http://www.explainxkcd.com/wiki/api.php which seems to work. If a wiki is using pretty URLs (e.g. example.com/wiki/Main_Page), just click on edit, view source or history.
Yes, please use the API instead of screen-scraping. You can see a few existing Java libraries for that here.
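For instance, a rough sketch of calling the MediaWiki API from plain Java using the standard action=parse endpoint; the page title is a placeholder you would replace with the comic's actual explainxkcd title:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class ExplainXkcdFetcher {
    public static void main(String[] args) throws Exception {
        // action=parse returns one wiki page; prop=wikitext asks for the raw
        // wiki markup and format=json wraps it in a JSON envelope.
        String page = URLEncoder.encode("1000: 1000 Comics", "UTF-8"); // placeholder title
        URL api = new URL("https://www.explainxkcd.com/wiki/api.php"
                + "?action=parse&prop=wikitext&format=json&page=" + page);

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(api.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        // Feed the JSON to any JSON parser and pull out parse.wikitext
        System.out.println(body);
    }
}
```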
What I need to write is a code snippet that would go to a website, e.g. www.google.com, find the search box, put in the phrase, and retrieve the HTML code of the results page(s). Is it possible to achieve this in Java?
For Google, use the JSON/Atom Custom Search API. It is the only (legal) way to access Google search.
Yes, use something like HttpClient, although there are other similar options.
Most probably you should be able to pass a parameter in the URL (have a look at the Google URL after issuing a search; there are plenty of parameters) or use a POST request (if the site supports it; check for an API description).
If you read the URL directly from Java (e.g. using the URL class) you'll get the returned HTML as-is.
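A minimal sketch of that approach (the query URL is only an example, and a User-Agent header is set because Google often rejects the default Java one):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawHtmlFetcher {
    public static void main(String[] args) throws Exception {
        // Example query URL with the search phrase passed as the q parameter
        URL url = new URL("https://www.google.com/search?q=java+http+client");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // many sites reject the default agent

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the returned HTML, as-is
            }
        }
    }
}
```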
The first tool I thought of was Selenium. It is primarily a web testing framework, but can be used to automate a browser for the kind of operation you're suggesting.
http://seleniumhq.org/docs/03_webdriver.html#getting-started-with-selenium-webdriver
HttpUnit can also be used. It's a well-documented, open-source, and easy-to-use testing framework.
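A rough sketch with HttpUnit's WebConversation API; treating the first form on the page as the search form and naming its text field q are assumptions to verify against the live page:

```java
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;

public class SearchBoxExample {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse page = wc.getResponse("https://www.google.com/");

        // Assumption: the first form on the page is the search form and its
        // text field is named "q"; adjust after inspecting the real page.
        WebForm searchForm = page.getForms()[0];
        searchForm.setParameter("q", "java http client");
        WebResponse results = searchForm.submit();

        System.out.println(results.getText()); // HTML of the results page
    }
}
```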
Is it possible to convert an html page with charts generated by javascript to an Image or PDF in Java?
I am familiar with the iText framework and it seems to be suitable, but I am not sure how it handles JS-generated content.
A quick search turned up this as a possible answer.
Use a library to convert the HTML to XSL-FO, then another one to convert that to PDF.
Edit: This might interest you as well. There's a bit on a JBrowser class that seems to let you print web pages.
It depends on how they were generated. I see three possibilities:
Canvas tag.
You need to add a bit of JS code to get the image using the canvas toDataURL method.
SVG.
You can add some code to get the full markup of the generated SVG document via the innerHTML property.
Flash.
The worst case. I think it's hardly possible to achieve what you want.
Solution 1
If you have access to the plain HTML (taken after the JavaScript has executed and built the page), you can easily pass it to iText and convert it to PDF. I would recommend using Flying Saucer (which in turn uses iText), which has a very good and convenient API for this (see http://code.google.com/p/flying-saucer/ ).
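A minimal sketch with Flying Saucer's ITextRenderer, assuming the post-JavaScript HTML has already been saved as well-formed XHTML in page.xhtml (both file names are placeholders):

```java
import org.xhtmlrenderer.pdf.ITextRenderer;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class HtmlToPdf {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("page.pdf")) {
            ITextRenderer renderer = new ITextRenderer();
            // Flying Saucer expects well-formed XHTML, so tidy the markup first if needed
            renderer.setDocument(new File("page.xhtml").toURI().toURL().toString());
            renderer.layout();
            renderer.createPDF(out);
        }
    }
}
```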
Solution 2
On the other hand, if you do not have access to the final HTML output, you could use the Swing libraries to render the page and then take a screenshot of it. This would even let you handle Flash, but I'm not sure whether this approach is suitable for your problem.
However, if it is, you can load the web page into a Swing application (you will need to rely on a third-party browser component for JS support, but there are quite a few out there) and then use the Robot class to take a screenshot of it.
Take a look at http://download.oracle.com/javase/6/docs/api/java/awt/Robot.html
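A minimal sketch of the screenshot step with java.awt.Robot, assuming the page is already rendered on screen by whatever browser component you embed:

```java
import javax.imageio.ImageIO;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.File;

public class ScreenCapture {
    public static void main(String[] args) throws Exception {
        // Capture the whole screen; restrict the Rectangle to the browser
        // component's bounds if you only want the rendered page.
        Rectangle screen = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        BufferedImage capture = new Robot().createScreenCapture(screen);
        ImageIO.write(capture, "png", new File("page.png"));
    }
}
```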