Jsoup: Getting a link that doesn't show in the HTML - java

I am working on a little app for myself. I am trying to get a list of links from a site. The site is for example: http://kinox.to/Stream/Prison_Break.html
If you hover over the big window in the middle that says kinox.to best online, it shows the link that I want in the bottom left. The problem is that if I look at the HTML file I can't find the link anywhere. I guess it has something to do with the site using JavaScript or Ajax.
Is it possible to somehow get the link using Jsoup, or are there any other Java libraries that could help me?

I did not look closely into the page you are trying to load, but here is what I think the problem may be: the link is loaded/generated dynamically via JavaScript. Jsoup does not run JavaScript, so you can't find the link in the HTML.
Two possible solutions:
1) Use something like Selenium WebDriver to access the content. The Java bindings allow you to remote-control a real browser, which should have no problem loading the page and running all the scripts within it. Solution 1 is simple to program, but runs slowly and may depend on an external browser program that must be installed on the machine. An alternative to WebDriver is the JavaFX WebKit engine, in case you are on Java 8.
2) Analyse the traffic and the JavaScript on the page and find out where the link comes from. This may take a bit of time, but once you succeed you can use Jsoup to get all the data you need. This solution should run much faster than solution 1.
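For example, if the traffic analysis reveals that the link comes from a simple AJAX endpoint, a minimal Jsoup sketch along these lines could fetch it directly (the endpoint URL and the selector here are made-up placeholders, not the site's real API):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LinkFetcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical AJAX endpoint discovered by watching the network traffic
        Document doc = Jsoup.connect("http://kinox.to/some/ajax/endpoint")
                .userAgent("Mozilla/5.0")     // some sites block Jsoup's default agent
                .ignoreContentType(true)      // the response may be JSON rather than HTML
                .get();
        // Print the href of the first anchor in the response
        System.out.println(doc.select("a").attr("href"));
    }
}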

One solution, and probably the easiest, would be to use Selenium:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
WebDriver driver = new FirefoxDriver();
driver.get("http://kinox.to/Stream/Prison_Break.html");
String mylink = driver.findElement(By.cssSelector("#AjaxStream > a")).getText();

Related

Creating an image from a webpage

I'm working on a way to detect defacement on my website. The idea is to crawl the whole website and, for each page, take a screenshot or render the page as an image and compare it with the last time the page was checked.
I'm looking for a way to convert a whole web page (HTML, CSS, JS) into an image, like a screenshot, no matter the language (though I would prefer Java, Python or C#).
I need it to be fast and usable on a server.
I already tried the following in Java:
CSSBox, but the rendering isn't good enough (no JS)
Selenium WebDriver, but it's way too slow (time to open Firefox, display the page, etc.) and not usable without a GUI
I think a solution would be a kind of wrapper for a web engine but I didn't find anything about that (at least in Java). I've been told PhantomJS would fit for this need, is it right?
The perfect result would be to create something like that: http://www.page2images.com/home
Use a browser which you can control via a script or command-line options, like PhantomJS. The documentation contains examples of how to take screenshots of URLs.
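If you need to drive PhantomJS from Java, one option is simply to shell out to the binary with its bundled rasterize.js example script. A minimal sketch, where the install paths are assumptions for illustration:
import java.io.IOException;

public class Screenshotter {
    public static void screenshot(String url, String outputPng)
            throws IOException, InterruptedException {
        // rasterize.js ships in PhantomJS's examples/ directory
        Process p = new ProcessBuilder(
                "/usr/local/bin/phantomjs",            // assumed install location
                "/opt/phantomjs/examples/rasterize.js",
                url, outputPng)
                .inheritIO()
                .start();
        p.waitFor();
    }
}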
The website you linked offers a good REST API that performs the task: is that not a viable option for you?
Selenium is your best bet. Depending on your page content (i.e. JS libraries, etc.) it might take some time, but you could automate this with a script to run nightly via cron, or by using screen.
It has a rich language of assertions and simulated mouse events, and ways to regression-test and/or monitor the state of a set of pages.
Good luck.
With no GUI, it's probably not possible to do something like this.
If you're not too tight on the GUI and related constraints, you can use the JavaFX WebView and take a screenshot of the node using the following code:
// Requires javafx.scene.image.WritableImage and javafx.embed.swing.SwingFXUtils
WritableImage image = webView.snapshot(null, null);
BufferedImage bufferedImage = SwingFXUtils.fromFXImage(image, null);
// ... e.g. save it with ImageIO.write(bufferedImage, "png", file)
References:
WebView#snapshot
SwingFXUtils#fromFXImage

How to code an automated bot that can browse and do operations on a webpage. JAVA

I need to code a bot that does the following:
Go to a JSP page and search for something by:
1. writing something in a search box
2. clicking the search button (a submit button)
3. clicking one of the resulting buttons/links (same JSP page with different output)
4. getting the entire HTML of the new page (same JSP page with different output)
The 4th one can be done with screen scraping, and I do not think I need help with it. But I need some guidance on steps 1 to 3. Any links, or just some keywords that will help me Google and learn about it, will be appreciated. I plan to do this with Java.
My suggestion is to use Selenium (http://docs.seleniumhq.org/download/).
Install Selenium IDE in your Firefox; it can record what you do on a website, store it as a script and replay it.
This video (http://www.youtube.com/watch?v=gsHyDIyA3dg) is gonna be helpful if you are a beginner.
And if you want to do it in Java, it's easy: just export the scripts in Selenium IDE to JUnit WebDriver code.
Of course, you can also use the Selenium Java WebDriver directly to write a program that operates on the website, as in the sketch below.
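A minimal WebDriver sketch of steps 1 to 4 might look like this (the URL and the element locators are made-up placeholders; substitute the real ones from your page):
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SearchBot {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://example.com/search.jsp");                 // open the JSP page
        driver.findElement(By.name("query")).sendKeys("something");  // 1) type into the search box
        driver.findElement(By.name("search")).click();               // 2) click the search button
        driver.findElement(By.linkText("First result")).click();     // 3) click a result link
        String html = driver.getPageSource();                        // 4) grab the new page's HTML
        System.out.println(html);
        driver.quit();
    }
}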
Selenium automates browsers. That's it. What you do with that power is entirely up to you.
The above steps can be done by using Selenium (a browser automation tool, commonly used for testing, that has Java bindings).
Even points 1 to 3 are screen scraping - you're figuring out (using either manual or automated means) what's on the page and performing actions on it. You could try exploring the Apache HttpClient for an easy way to run HTTP commands and get responses, as sketched below.
I hope you're doing this for legitimate purposes - screen scraping is almost always frowned upon if done without permission.
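To illustrate the HttpClient route: a search form submit is often just an HTTP GET or POST whose parameters you can replay directly. A minimal sketch with HttpClient 4.x (the URL and parameter names are placeholders):
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpBot {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        // Replay the search form's request with its query parameter filled in
        HttpGet get = new HttpGet("http://example.com/search.jsp?query=something");
        String html = EntityUtils.toString(client.execute(get).getEntity());
        System.out.println(html);
        client.close();
    }
}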

Displaying a webpage in Java Applet

Can I display a website like Wikipedia, Google, etc. in my Java applet? I am looking for something like the WebBrowser component in C#. Does anyone know how to achieve this?
Take a look at those two answers:
answer 1
answer 2
There are several browser components; see Best Java/Swing browser component? here on SO for some discussion of the "best".
A thought aside - since an applet already runs in a web browser, it might be better to bridge out and have the browser display the website you want in, say, an <iframe>, rather than load a browser into a browser.
Swing supports basic HTML (roughly HTML 3.2) through components like JEditorPane, so you can try to use it.
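A minimal sketch with JEditorPane (note that complex modern pages will not render well):
import java.io.IOException;
import javax.swing.JApplet;
import javax.swing.JEditorPane;
import javax.swing.JScrollPane;

public class BrowserApplet extends JApplet {
    @Override
    public void init() {
        try {
            JEditorPane pane = new JEditorPane("http://www.example.com/");
            pane.setEditable(false); // read-only, so hyperlinks can fire events
            add(new JScrollPane(pane));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}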
There are also several good pure-Java, fully functional HTML browsers.
The question is: why? A Java applet runs inside a browser that already knows how to show HTML pages. You can easily have the applet display an HTML page in the native browser it is running in.
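AppletContext.showDocument does exactly that; a minimal sketch:
import java.applet.Applet;
import java.net.MalformedURLException;
import java.net.URL;

public class OpenPageApplet extends Applet {
    public void openPage() {
        try {
            // Ask the hosting browser to open the page in a new window/tab
            getAppletContext().showDocument(new URL("http://www.wikipedia.org/"), "_blank");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }
}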

Java(or any lang) library for offline rendering of web pages?

I am developing a Java application. I have a scenario where I need to take a screenshot of each URL that comes in to the server.
Is there any Java (or any language) browser library to load web pages and take screenshots of the loaded page? It would be nice if the lib allowed DOM traversal.
Update:
java (or any lang): Any other language is not a problem, but the library should cooperate with Java.
I have tried to set up Qt Jambi and spent a lot of time on it, but the result was nothing.
If you can provide any concrete material on setting up Jambi, it would be appreciated.
I also gave spynner.py a try. My native language is Java, and I thought I could use spynner.py with Jython. But PyQt cannot be used with Jython, so I am not expecting any answers related to Python.
Basically, I need a library to do:
screenshot capture,
some DOM traversal,
some JavaScript execution,
and retrieval of the results of the executed JS code.
Thanks.
I appreciate all the responses. I ended up with PhantomJS; it fits my needs well. It's a command-line tool.
Selenium/WebDriver provides all this functionality.
WebDriver provides a simple API allowing you to "drive" a browser instance. Many browsers are supported.
See here for a simple example:
http://seleniumhq.org/docs/03_webdriver.html#getting-started-with-selenium-webdriver
Traversal of the DOM using the "By" locators:
Good examples here: http://www.qaautomation.net/?p=388
driver.findElement(By.name("q"));
Execution of JavaScript:
http://code.google.com/p/selenium/wiki/FrequentlyAskedQuestions#Q:_How_do_I_execute_Javascript_directly?
WebDriver driver; // Assigned elsewhere
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("return document.title");
Screenshot capture:
http://seleniumhq.org/docs/04_webdriver_advanced.html#taking-a-screenshot
File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
In Java, you should read the following Stack Overflow posts:
Programmatic web browser Java library
Take a screenshot of a webpage with JavaScript?
Embed a web browser within a java application
Because you say "or any lang":
In Python, you have Spynner :
Spynner is a stateful programmatic web browser module for Python with Javascript/AJAX support based upon the QtWebKit framework.
According to the documentation, here's a small snippet :
import spynner
browser = spynner.Browser()
browser.load("http://www.wordreference.com")
browser.runjs("console.log('I can run Javascript!')")
browser.runjs("_jQuery('div').css('border', 'solid red')") # and jQuery!
browser.select("#esen")
browser.fill("input[name=enit]", "hola")
browser.click("input[name=b]")
browser.wait_page_load()
print browser.url, len(browser.html)
browser.close()
This tool does the screenshot job:
Tutorial: http://www.paulhammond.org/webkit2png/
The program: http://www.paulhammond.org/2009/03/webkit2png-0.5/webkit2png-0.5.txt
Could it be any easier? :)
There are some other tools mentioned on that page:
"If you use a mac, but don't like the command line then you may want to try Paparazzi or Little Snapper. If you use linux you may be more interested in khtml2png, Matt Biddulph's Mozilla screenshot script or Roland Tapken's QT Webkit script."
You could use Rhino or Gecko for the JavaScript execution.
For DOM traversal there are many options, but if you are using Rhino you could use jQuery to make it even easier!
Hope that works out for you!
If you need a screenshot, I guess the quality of rendering is important for you.
We had a similar scenario. What we ended up doing was running Firefox in headless mode, actually browsing the web page, and taking a screenshot in memory. It is not trivial, but I can give you more details if you want to go for it.

Java: "Control" External Application

Is it possible to programmatically start an application from Java and then send commands to it and receive the program's output?
I'm trying to realize this scenario:
I want to access a website that uses lots of JavaScript and special HTML + CSS features, so the website isn't properly displayed in swt.browser or any of the other available browser widgets. But the website can be displayed without any problems in Firefox. So I want to run a hidden instance of Firefox, load the website and get the data. (It would be nice if FF could be embedded in a JFrame or so.)
Has anybody got an idea how to realize this?
Any help would really be appreciated!
EDIT: The website loads some JavaScript that does some HTML magic and loads some pictures. When I only read the HTML from the website, I see nothing more than some JavaScript calls. But when the website is loaded in a browser, it displays some images overlaid with text. That's what I'm trying to show the user of my app.
To start Firefox from within the application, you could use:
Runtime runtime = Runtime.getRuntime();
try {
    String path = "/path/to/firefox";
    Process process = runtime.exec(path + " " + url);
} catch (IOException e) {
    // ...
}
To manipulate processes once they have started, one can often use process.getInputStream() and process.getOutputStream(), but that would not help you in the case of Firefox.
You should probably look into ways of solving your specific problem other than trying to interact directly between your application and a browser instance. Consider either moving the whole interface into a Java gui, or doing a web app from the ground up -- not half and half.
See this article - it will teach you how to start a process, read its output and write to its input stream.
However, this solution may not be the best for your problem. What kind of data do you need to get from the web page? Would it be better to read the HTML with an HTTP GET and then parse it with an HTML parser?
If you have a text-mode browser available (like links2 on Linux) you might want to see how well it can render the page. For example, the command "links -dump http://someurl.com" will format the page as text and exit immediately, resulting in output that might be easily parseable using the methods that Ray Myers and kgiannakakis suggest.
If the website is static, you could use a web scraper like Jericho to load the URL, parse the HTML and wander your way through the DOM to the info you need.
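For reference, a minimal Jericho sketch that loads a URL and walks the anchors (the URL is a placeholder; this only works because the content is in the static HTML):
import java.net.URL;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class JerichoScraper {
    public static void main(String[] args) throws Exception {
        Source source = new Source(new URL("http://example.com/"));
        // Walk all anchor elements and print their link targets
        for (Element link : source.getAllElements(HTMLElementName.A)) {
            System.out.println(link.getAttributeValue("href"));
        }
    }
}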
Although a feature similar to what you describe is planned for Firefox in the future, it is not available yet. The feature is dubbed TaskFox, and from the linked wiki, "its aim is to allow users to quickly access information and perform tasks that would normally take several steps to complete."
News of the upcoming TaskFox feature just broke today, in fact. Perhaps you should consider a career being a psychic instead of a programmer.
