Creating an image from a webpage - java

I'm working on a way to detect defacement on my website. The idea is to crawl the whole website and, for each page, take a screenshot or render the page as an image and compare it with the screenshot from the last time the page was checked.
I'm looking for a way to convert a whole webpage (HTML, CSS, JS) into an image, like a screenshot, regardless of the language (though I would prefer Java, Python or C#).
I need it to be fast and usable on a server.
I already tried the following in Java:
CSSBox, but the rendering isn't good enough (no JS support)
Selenium WebDriver, but it's way too slow (time to open Firefox, display the page, etc.) and not usable without a GUI
I think a solution would be some kind of wrapper around a web engine, but I didn't find anything like that (at least in Java). I've been told PhantomJS would fit this need; is that right?
The perfect result would be to create something like that: http://www.page2images.com/home

Use a browser which you can control via a script or command-line options, like PhantomJS. The documentation contains examples of how to make screenshots from URLs.
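If you need to drive PhantomJS from Java, a minimal sketch could look like this (it assumes phantomjs is on the PATH and uses the rasterize.js example script that ships with PhantomJS; the URL and output name are placeholders):

import java.io.IOException;

public class PhantomScreenshot {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Invokes: phantomjs rasterize.js <url> <output.png>
        Process p = new ProcessBuilder(
                "phantomjs", "rasterize.js",
                "http://www.example.com", "page.png")
                .inheritIO()   // pass PhantomJS's output through to this console
                .start();
        int exitCode = p.waitFor();
        System.out.println("phantomjs exited with code " + exitCode);
    }
}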

The website you linked offers a good REST API that performs this task: is that not a viable option for you?

Selenium is your best bet. Depending on your page content (i.e. JS libraries, etc.) it might take some time, but you could automate this with a script that runs nightly via cron, or using screen.
It has a rich language of assertions and simulated mouse events, and ways to regression-test and/or monitor the state of a set of pages.
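For reference, a minimal screenshot sketch with the Selenium Java bindings (the URL and output file are placeholders):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
driver.get("http://www.example.com");
// Most drivers implement TakesScreenshot
File shot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
Files.copy(shot.toPath(), new File("page.png").toPath(),
        StandardCopyOption.REPLACE_EXISTING);
driver.quit();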
Good luck.

With no GUI, it's probably not possible to do something like this.
If you're not too tight on the GUI and related things, you can use the JavaFX WebView and take a screenshot of the node using the following code (note that snapshot must be called on the JavaFX Application Thread):
WritableImage image = webView.snapshot(null, null);
BufferedImage bufferedImage = SwingFXUtils.fromFXImage(image, null);
ImageIO.write(bufferedImage, "png", new File("screenshot.png"));
References:
WebView#snapshot
SwingFXUtils#fromFXImage
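A slightly fuller sketch that waits for the page to finish loading before snapshotting (it assumes webView is already attached to a Scene; the URL and output file are placeholders):

import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javafx.concurrent.Worker;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.image.WritableImage;

webView.getEngine().getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
    if (newState == Worker.State.SUCCEEDED) {
        // The page (including JS-rendered content) is ready; snapshot it
        WritableImage image = webView.snapshot(null, null);
        try {
            ImageIO.write(SwingFXUtils.fromFXImage(image, null), "png", new File("page.png"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
});
webView.getEngine().load("http://www.example.com");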

Related

Jsoup: Getting a link that doesn't show in the HTML

I am working on a little app for myself. I am trying to get a list of links from a site. The site is for example: http://kinox.to/Stream/Prison_Break.html
If you hover over the big window in the middle that says kinox.to best online, it shows the link that I want in the bottom left. The problem is that if I look at the HTML file, I can't find the link anywhere. I guess it has something to do with the site using JavaScript or AJAX.
Is it possible to somehow get the link using Jsoup, or are there any other Java libraries that could help me?
I did not look closely into the page you are trying to load, but here is what I think the problem may be: the link is loaded/generated dynamically via JavaScript. Jsoup does not run JavaScript, so you can't find the link in the HTML.
Two possible solutions:
1) Use something like Selenium WebDriver to access the content. The Java bindings allow you to remote-control a real browser, which should have no problems loading the page and running all the scripts within. Solution 1 is simple to program, but runs slowly and depends on an external browser program which must be installed on the machine. An alternative to WebDriver is the JavaFX WebKit engine, in case you are on Java 8.
2) Analyse the traffic and the JavaScript on the page and find out where the link comes from. This may take a bit of time to figure out, but once you succeed you can use Jsoup to get all the data you need (see the sketch below). This solution should run much faster than solution 1.
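For solution 2, once you have found the real URL that the JavaScript requests (the endpoint below is purely hypothetical; you would discover the actual one in the browser's network tab), the Jsoup side is short:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Placeholder endpoint; replace with the URL found in the network traffic
Document doc = Jsoup.connect("http://kinox.to/some/ajax/endpoint")
        .ignoreContentType(true)
        .get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println(link.attr("abs:href"));
}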
One solution and probably the easiest would be to use Selenium:
WebDriver driver = new FirefoxDriver();
driver.get("http://kinox.to/Stream/Prison_Break.html");
String mylink = driver.findElement(By.cssSelector("#AjaxStream > a")).getText();

Reduce HTML using Applets

My supervisor has tasked me with programmatically reducing a website's content by looking at the HTML tags to reveal only the core content. Importantly, this particular piece of the project must be written in Java.
Now having learnt about the differences between Plugins, Extensions, Applets, and Widgets, I think I want to use an Extension that calls a client-side Applet. My approach was going to be this:
1) Using the Google Chrome API, I was going to display a button that the user can click.
2) If clicked, the action is to launch a new browser tab that has the Applet embedded within it.
3) The Applet automatically sources the called tab's HTML code and filters it.
4) Once filtered, the reduced copy of the original site appears.
So I have a few questions. To start, is it even possible to use an Extension with an Applet? Moreover, is it possible for an Applet to look at another tab's HTML code? If not, is it possible to just reload the original tab with the Applet now embedded within it and complete the function? Thanks.
JavaScript is already on most mobile web platforms. Java is not, and there is no reasonable way mobile customers will be able to install Java. Android, which runs many, but not all, mobile devices, has a Java runtime environment and is basically a loader for Java apps. But an Apple iPhone is not an Android device... nor is a Windows Phone.
If you want to summarize content on the client, and in JavaScript, as I see it you have two choices:
1) Succeed with some inner burst of genius where dozens of the best expert PhDs in Natural Language Computing have just begun exploring how to extract "true meaning" from text; OR
2) look at document.title and be done with it.
The 2nd approach assumes that the authors of web pages set a title appropriate for summarizing their website. This isn't a perfect assumption, but it is OK most of the time. It is also a lot less expensive than #1.
With the 1st approach you can get a head start with a "natural language toolkit" that can do things like scan text for unusual words and phrases. To get a rough idea of the kinds of software that have been built in this area, review wikipedia: Outline of natural language processing :: toolkits. A popular toolkit for Python is called NLTK.
Whether you use a toolkit from Java or Python, it means working on the server, because the client will not have the storage, network speed, or CPU. For Python there are server-side app frameworks like django or web2py that can make building out a server app faster, and on Java there are servlet frameworks.
Ultimately you'll need a lot of help, training, or luck, and as I have hinted above, it can easily be beyond the capabilities of a small team of fresh hires, and certainly way beyond what a single new developer eager to prove his/her capabilities can do in a few weeks on their own with limited help.
Most web pages have titles set like this near the beginning of the downloaded HTML:
<head><title>My Furry Kittens!</title></head>
You don't need to write a parser. If you are running in the browser, the title has been parsed into the DOM or Document Object Model already. The string "My Furry Kittens!" in this example would be available in the global variable document.title.
If you like, you could put a button into a plugin and let people push it to summarize the website. Or they could just look up at the title; it is already on the page. Of course, if the goal is to scrape titles, one can avoid writing a parser and use a "fake" headless scriptable browser like PhantomJS or similar.
You can read more about document.title on the Mozilla Developer Network. MDN is a great reference for learning how web browsers work. They are the maintainers of the Mozilla Firefox browser. Most of what you can learn there will also work on Chrome, Internet Explorer, and various mobile platforms.
Good Luck!
How about implementing a local proxy server on the mobile device? The browser would just need to be configured to use the proxy, while the custom proxy implementation can transform the requested HTML however it likes.
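Very roughly, and only as an illustration of the idea (GET-only, no header handling, no HTTPS, single-threaded; the port and the transform are placeholders), such a proxy could look like this:

import java.io.*;
import java.net.*;

public class TransformingProxy {
    public static void main(String[] args) throws IOException {
        // Point the browser's proxy settings at localhost:8080
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                try (Socket client = server.accept()) {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(client.getInputStream()));
                    // Proxy requests carry an absolute URL: "GET http://host/path HTTP/1.1"
                    String requestLine = in.readLine();
                    if (requestLine == null || !requestLine.startsWith("GET ")) continue;
                    String url = requestLine.split(" ")[1];
                    String html = fetch(url);
                    // Placeholder transform; a real one would reduce/rewrite the HTML
                    String transformed = html.replace("</body>", "<!-- transformed --></body>");
                    OutputStream out = client.getOutputStream();
                    out.write("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n".getBytes());
                    out.write(transformed.getBytes());
                    out.flush();
                }
            }
        }
    }

    private static String fetch(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = r.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }
}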

how to extract HTML data from a webpage which scrolls down for a fixed number of times?

I want to extract HTML data from a website using Java. The problem is that the webpage keeps scrolling down once the user reaches the bottom of the page. The number of times it scrolls down is fixed. My Java code can extract only the 1st part. How do I extract the remaining scrolls? Is there a way to load the whole page at once with Java? Any help would be appreciated :)
This might be the type of thing that PhantomJS (http://phantomjs.org/) was designed for. It will crawl entire web pages and even execute JavaScript, using a "real" browser in headless mode. I suggest stopping what you're doing with Java and taking a look at PhantomJS instead. It could save you a LOT of time. :)
This type of behavior is implemented in the browser, interpreting the user's scrolling actions to load more content via AJAX and dynamically modifying the in-memory DOM in the browser. Consider that your Java runs in a web container on the server, and that web container (e.g. Tomcat, JBoss, etc.) provides a huge amount of underlying code so your app doesn't have to worry about the plumbing.
Conceptually, a similar thing occurs at the client, with the DHTML web page running in its own "container" (the browser), which provides a wealth of functionality, from UI to networking, to DOM, etc. If you remove the browser from the equation and replace it with a Java program, you will need to provide the equivalent of the browser in which the DHTML/Javascript can execute.
I believe that HtmlUnit may fill the bill, but I have not worked with it personally.
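If you try it, a rough, untested sketch with HtmlUnit might look something like this (the URL, the scroll count and the wait time are all placeholders):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = webClient.getPage("http://www.example.com");
    for (int i = 0; i < 5; i++) {   // 5 stands in for the fixed number of scrolls
        page.executeJavaScript("window.scrollTo(0, document.body.scrollHeight);");
        webClient.waitForBackgroundJavaScript(2000);   // give the AJAX calls time to finish
    }
    System.out.println(page.asXml());   // the fully expanded HTML
}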

Displaying a webpage in Java Applet

Can I display a website like Wikipedia, Google, etc. in my Java applet? I am looking for something like the WebBrowser component in C#. Does anyone know how to achieve this?
Take a look at those two answers:
answer 1
answer 2
There are several browser components, and you can see Best Java/Swing browser component? here on SO for some discussion on the "best".
A thought aside - since an applet is already in a web browser, it might be better to get the browser to display the website you want in, say, an <iframe>, rather than load a browser into a browser.
Swing supports basic HTML (version 3.2 or so, I think). So you can try to use it.
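A minimal sketch of that approach (keep expectations low: JEditorPane only copes with very old, very basic HTML, so modern sites will render poorly):

import java.io.IOException;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;

JFrame frame = new JFrame("Mini browser");
JEditorPane editor = new JEditorPane();
editor.setEditable(false);   // read-only, so hyperlink events can fire
try {
    editor.setPage("http://www.example.com");   // placeholder URL
} catch (IOException e) {
    e.printStackTrace();
}
frame.add(new JScrollPane(editor));
frame.setSize(800, 600);
frame.setVisible(true);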
Also, there are several good pure-Java, fully functional HTML browsers.
The question is: why? A Java applet runs inside a browser that knows how to show HTML pages. You can easily have the applet show an HTML page in the native browser where it is running.

Java: "Control" External Application

Is it possible to programmatically start an application from Java and then send commands to it and receive the program's output?
I'm trying to realize this scenario:
I want to access a website that uses lots of JavaScript and special HTML + CSS features -> the website isn't properly displayed in swt.browser or any of the other available Browser widgets. But the website can be displayed without any problems in Firefox. So I want to run a hidden instance of Firefox, load the website and get the data. (It would be nice if FF could be embedded in a JFrame or so.)
Has anybody got an idea how to realize this?
Any help would really be appreciated!
EDIT: The website loads some JavaScript that does some HTML magic and loads some pictures. When I only read the HTML from the website, I see nothing more than some JavaScript calls. But when the website is loaded in a browser, it displays some images overlaid with text. That's what I'm trying to show the user of my app.
To start Firefox from within the application, you could use:
Runtime runtime = Runtime.getRuntime();
try {
    String path = "/path/to/firefox";
    // Note: exec(String) splits the command on whitespace, so this breaks
    // if the path or URL contains spaces; ProcessBuilder is safer.
    Process process = runtime.exec(path + " " + url);
} catch (IOException e) {
    // handle the failure to launch the browser
}
To manipulate processes once they have started, one can often use process.getInputStream() and process.getOutputStream(), but that would not help you in the case of Firefox.
You should probably look into ways of solving your specific problem other than trying to interact directly between your application and a browser instance. Consider either moving the whole interface into a Java gui, or doing a web app from the ground up -- not half and half.
See this article - it will teach you how to start a process, read its output and write to its input stream.
However, this solution may not be the best for your problem. What kind of data do you need to get from the web page? Would it be better to read the HTML with an HTTP GET and then parse it with an HTML parser?
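For completeness, reading a page with a plain HTTP GET is only a few lines (the URL is a placeholder; you would hand the result to an HTML parser rather than print it):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

HttpURLConnection conn = (HttpURLConnection) new URL("http://www.example.com").openConnection();
conn.setRequestMethod("GET");
try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
    StringBuilder html = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
        html.append(line).append('\n');
    }
    System.out.println(html);   // raw HTML; JS-generated content will be missing
}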
If you have a text-mode browser available (like links2 on linux) you might want to see how well that can render the page. For example, the command "links -dump http://someurl.com" will format the page as text and exit immediately, resulting in output that might be easily parseable using the methods that Ray Myers and kgiannakakis suggest.
If the website is static, you could use a web scraper like Jericho to load the URL, parse the HTML and wander your way through the DOM to the info you need.
Although a feature similar to what you describe is planned for Firefox in the future, it is not available yet. The feature is dubbed TaskFox, and from the linked wiki, "its aim is to allow users to quickly access information and perform tasks that would normally take several steps to complete."
News of the upcoming TaskFox feature just broke today, in fact. Perhaps you should consider a career as a psychic instead of a programmer.
