Website Scraping in Java once the whole site is loaded

On some websites, certain scripts take a while to run, so scraping works poorly: the HTML returned by the scraper is incomplete. How can I scrape a website only after its scripts have finished running?
I am using URLConnection in Java, and when I read the text from it I get HTML that is premature: the page has a fairly long script that takes some time to run and changes the color of some text, and that change is not reflected in the text I read through URLConnection.

You can use PhantomJS. It's a browser, but headless, and it will render all the JS on the page. You might find this thread useful: Any Java equivalent to PhantomJS?

I have used Selenium in Java (and in Kotlin, using the Java library) for website automation and testing.
It can be set up to wait a specified time before looking for elements, or to wait until the page is loaded. Since it really just remote-controls a web browser, you can run JavaScript on pages and act just like any user would.
https://www.seleniumhq.org/download/
https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.remote.RemoteWebDriver;

RemoteWebDriver driver = new ChromeDriver();
driver.get(url);
driver.findElement(By.name("search")).sendKeys("some query");
driver.findElement(By.id("submit")).click();
You can wait for everything to load as described here:
https://stackoverflow.com/a/33349203/9006779
(or at least in a similar way; the API might have changed).
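As a minimal sketch of such an explicit wait (assuming a recent selenium-java on the classpath; the URL and the element ID are placeholders):
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

WebDriver driver = new ChromeDriver();
driver.get("https://example.com"); // placeholder URL

// Block until the element the scripts produce is visible (or time out after 10 s),
// so scraping only starts once the page has finished rendering.
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("content"))); // placeholder ID

String renderedHtml = driver.getPageSource(); // HTML after the scripts have run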

Related

Jsoup: Getting a link that doesn't show in the HTML

I am working on a little app for myself. I am trying to get a list of links from a site, for example: http://kinox.to/Stream/Prison_Break.html
If you hover over the big window in the middle that says kinox.to best online, it shows the link that I want in the bottom left. The problem is that if I look at the HTML file, I can't find the link anywhere. I guess it has something to do with the site using JavaScript or Ajax.
Is it possible to somehow get the link using Jsoup, or are there any other Java libraries that could help me?
I did not look closely at the page you are trying to load, but here is what I think the problem may be: the link is loaded/generated dynamically via JavaScript. Jsoup does not run JavaScript, so you can't find the link in the HTML.
Two possible solutions:
1) Use something like Selenium WebDriver to access the content. The Java bindings let you remote-control a real browser, which should have no problems loading the page and running all the scripts within. This solution is simple to program but runs slowly, and it may depend on an external browser being installed on the machine. An alternative to WebDriver is the JavaFX WebKit engine, in case you are on Java 8.
2) Analyse the traffic and the JavaScript on the page and find out where the link comes from. This may take a bit of time, but once you succeed you can use Jsoup to get all the data you need; see the sketch below. This solution should run much faster than solution 1.
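A minimal sketch of solution 2 with Jsoup, assuming you have already identified the AJAX endpoint the page calls (the URL and the selector here are hypothetical):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch the endpoint the page's JavaScript calls, not the page itself.
Document doc = Jsoup.connect("http://example.com/ajax/endpoint") // hypothetical endpoint
        .userAgent("Mozilla/5.0")
        .get();
String link = doc.select("a").first().attr("href"); // adjust the selector to the actual response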
One solution, and probably the easiest, would be to use Selenium:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
WebDriver driver = new FirefoxDriver();
driver.get("http://kinox.to/Stream/Prison_Break.html");
String mylink = driver.findElement(By.cssSelector("#AjaxStream > a")).getText();

Download web page using Java after javascript executes

I have a problem that doesn't seem to be answered clearly on Stack Overflow. I want to download a page using Java and retrieve some data from it in order to feed some values to an application I am developing. The page is a betting site, so it contains JavaScript methods that change the betting values.
To do some tests, I downloaded the page manually using Ctrl-S and then wrote a program (with FileReader, BufferedReader, etc.) that retrieves the data. This worked perfectly. So I could write a bash script to download the page every time the user opens my application.
After that, I looked for ways to download the page programmatically (I used Jsoup, URL, ...). What I noticed is that the JavaScript variable values couldn't be printed, because the JavaScript code wasn't executed.
What I want to know is whether there is some way to programmatically download the executed website (with the JavaScript values in place) without having to run a bash script every time before someone opens my app.
Try HtmlUnit. It is used for automated testing, but it should fit your purpose well too!
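A minimal sketch of the HtmlUnit approach (assuming the htmlunit jar is on the classpath; the classic 2.x releases use the com.gargoylesoftware package prefix shown here, newer releases use org.htmlunit, and the URL and timeout are placeholders):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Inside a method declared to throw IOException.
try (WebClient client = new WebClient()) {
    client.getOptions().setJavaScriptEnabled(true);             // run the page's scripts
    client.getOptions().setThrowExceptionOnScriptError(false);  // tolerate messy site JS
    HtmlPage page = client.getPage("https://example.com");      // placeholder URL
    client.waitForBackgroundJavaScript(10_000);                 // let async JS finish (10 s)
    String renderedHtml = page.asXml();                         // the DOM after execution
}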

How to extract HTML data from a webpage which scrolls down a fixed number of times?

I want to extract HTML data from a website using Java. The problem is that the webpage keeps loading more content once the user reaches the bottom of the page. The number of times it scrolls down is fixed. My Java code can only extract the first part. How do I extract the content from the remaining scrolls? Is there a way to load the whole page at once with Java? Any help would be appreciated :)
This might be the type of thing that PhantomJS (http://phantomjs.org/) was designed for. It will crawl entire web pages and even execute JavaScript, using a "real" browser in headless mode. I suggest stopping what you're doing with Java and take a look at PhantomJS instead. It could save you a LOT of time. :)
This type of behavior is implemented in the browser, which interprets the user's scrolling actions to load more content via AJAX and dynamically modifies the in-memory DOM. Consider that your Java code runs in a web container on the server, and that web container (i.e. Tomcat, JBoss, etc.) provides a huge amount of underlying code so your app doesn't have to worry about the plumbing.
Conceptually, a similar thing occurs at the client, with the DHTML web page running in its own "container" (the browser), which provides a wealth of functionality, from UI to networking to the DOM. If you remove the browser from the equation and replace it with a Java program, you will need to provide the equivalent of the browser in which the DHTML/JavaScript can execute.
I believe that HtmlUnit may fit the bill, but I have not worked with it personally.
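If you stay with Java, a minimal Selenium sketch of the scrolling idea (the URL, the scroll count, and the fixed sleep are placeholder assumptions):
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

// Inside a method declared to throw InterruptedException.
WebDriver driver = new ChromeDriver();
driver.get("https://example.com"); // placeholder URL
JavascriptExecutor js = (JavascriptExecutor) driver;

int scrolls = 5; // the fixed number of times the page loads more content
for (int i = 0; i < scrolls; i++) {
    js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
    Thread.sleep(2000); // crude wait for the AJAX content to arrive
}

String fullHtml = driver.getPageSource(); // now includes the appended content
driver.quit();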

Java(or any lang) library for offline rendering of web pages?

I am developing a Java application. I have a scenario where I need to take a screenshot of the URL that comes in to the server.
Is there any Java (or any language) browser library that can load webpages and take screenshots of the loaded page? It would be nice if the lib allowed DOM traversal.
Update:
java (or any lang): any other language is not a problem, but the library should cooperate with Java.
I have tried to set up Qt Jambi and spent a lot of time on it, but got nowhere. If you can provide any concrete material on setting up Jambi, it would be appreciated.
I also gave spynner.py a try. My native language is Java, and I thought I could use spynner.py with Jython. But PyQt cannot be used with Jython, so I am not expecting any answers related to Python.
Basically, I need a library to do:
Take Screen shot.
Some DOM traversing.
Some Javascript Execution.
and to get the result of the Executed JS code.
Thanks.
I appreciate all the responses. I ended up with PhantomJS. It fits my needs well. It's a command-line tool.
Selenium/Webdriver provides all this functionality.
WebDriver provides a simple API allowing you to "drive" a browser instance. Many browsers are supported.
See here for a simple example:
http://seleniumhq.org/docs/03_webdriver.html#getting-started-with-selenium-webdriver
Traversal of the DOM using the "By" locators:
Good examples here: http://www.qaautomation.net/?p=388
driver.findElement(By.name("q"));
Execution of JavaScript:
http://code.google.com/p/selenium/wiki/FrequentlyAskedQuestions#Q:_How_do_I_execute_Javascript_directly?
WebDriver driver; // Assigned elsewhere
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("return document.title");
Screenshot capture:
http://seleniumhq.org/docs/04_webdriver_advanced.html#taking-a-screenshot
File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
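Putting those fragments together, a minimal sketch covering all three requirements (DOM traversal, JavaScript execution, and screenshot capture); the URL is a placeholder:
import java.io.File;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
driver.get("https://example.com"); // placeholder URL

// DOM traversal via the By locators
String heading = driver.findElement(By.tagName("h1")).getText();

// JavaScript execution, with the result returned to Java
JavascriptExecutor js = (JavascriptExecutor) driver;
String title = (String) js.executeScript("return document.title");

// Screenshot capture to a temporary file
File shot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);

driver.quit();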
In Java, you should read the following Stack Overflow posts:
Programmatic web browser Java library
Take a screenshot of a webpage with JavaScript?
Embed a web browser within a java application
Because you say "or any lang":
In Python, you have Spynner:
Spynner is a stateful programmatic web browser module for Python with Javascript/AJAX support based upon the QtWebKit framework.
According to the documentation, here's a small snippet:
import spynner
browser = spynner.Browser()
browser.load("http://www.wordreference.com")
browser.runjs("console.log('I can run Javascript!')")
browser.runjs("_jQuery('div').css('border', 'solid red')") # and jQuery!
browser.select("#esen")
browser.fill("input[name=enit]", "hola")
browser.click("input[name=b]")
browser.wait_page_load()
print browser.url, len(browser.html)
browser.close()
This site does the screenshot job:
Tutorial: http://www.paulhammond.org/webkit2png/
The program: http://www.paulhammond.org/2009/03/webkit2png-0.5/webkit2png-0.5.txt
Could it be any easier? :)
There are some other tools mentioned on that page:
"If you use a mac, but don't like the command line then you may want to try Paparazzi or Little Snapper. If you use linux you may be more interested in khtml2png, Matt Biddulph's Mozilla screenshot script or Roland Tapken's QT Webkit script."
You could use Rhino or Gecko for the JavaScript execution.
For DOM traversal there are many options, but if you are using Rhino you could use jQuery to make it even easier!
Hope that works out for you!
If you need a screenshot, I guess the quality of the rendering is important to you.
We had a similar scenario. What we ended up doing was running Firefox in headless mode, actually browsing the webpage, and capturing a screenshot in memory. It is not trivial, but I can give you more details if you want to go down that route; see the sketch below.
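As a rough sketch of that setup with today's tooling (Selenium plus Firefox's built-in headless mode, which postdates this answer; the URL is a placeholder):
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

FirefoxOptions options = new FirefoxOptions();
options.addArguments("-headless");       // render without a visible window
WebDriver driver = new FirefoxDriver(options);
driver.get("https://example.com");       // placeholder URL

// The screenshot arrives as an in-memory PNG byte array.
byte[] png = ((TakesScreenshot) driver).getScreenshotAs(OutputType.BYTES);
driver.quit();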

Java: "Control" External Application

Is it possible to programmatically start an application from Java and then send commands to it and receive the program's output?
I'm trying to realize this scenario:
I want to access a website that uses lots of JavaScript and special HTML + CSS features -> the website isn't displayed properly in SWT's Browser widget or any of the other available browser widgets. But the website can be displayed without any problems in Firefox. So I want to run a hidden instance of Firefox, load the website, and get the data. (It would be nice if FF could be embedded in a JFrame or so.)
Has anybody got an idea how to realize this?
Any help would really be appreciated!
EDIT: The website loads some JavaScript that does some HTML magic and loads some pictures. When I only read the HTML from the website, I see nothing more than some JavaScript calls. But when the website is loaded in a browser, it displays some images overlaid with text. That's what I'm trying to show the user of my app.
To start Firefox from within the application, you could use:
import java.io.IOException;

Runtime runtime = Runtime.getRuntime();
try {
    String path = "/path/to/firefox";
    String url = "https://example.com"; // the page to open (placeholder)
    Process process = runtime.exec(new String[] { path, url });
} catch (IOException e) {
    // handle the failure to launch the browser
}
To manipulate processes once they have started, one can often use process.getInputStream() and process.getOutputStream(), but that would not help you in the case of Firefox.
You should probably look into ways of solving your specific problem other than trying to interact directly between your application and a browser instance. Consider either moving the whole interface into a Java gui, or doing a web app from the ground up -- not half and half.
See this article - it will teach you how to start a process, read its output and write to its input stream.
However, this solution may not be the best for your problem. What kind of data do you need to get from the web page? Would it be better to read the HTML with an HTTP GET and then parse it with an HTML parser?
If you have a text-mode browser available (like links2 on Linux), you might want to see how well it can render the page. For example, the command "links -dump http://someurl.com" will format the page as text and exit immediately, resulting in output that might be easily parseable using the methods that Ray Myers and kgiannakakis suggest.
If the website is static, you could use a web scraper like Jericho to load the URL, parse the HTML, and wander your way through the DOM to the info you need; a rough sketch follows.
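A minimal sketch of that Jericho approach (assuming the jericho-html jar is on the classpath; the URL is a placeholder, and this only works if the data is present in the static HTML):
import java.net.URL;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

// Inside a method declared to throw IOException.
Source source = new Source(new URL("https://example.com")); // placeholder URL
for (Element link : source.getAllElements(HTMLElementName.A)) {
    System.out.println(link.getAttributeValue("href")); // each anchor's target
}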
Although a similar feature to what you describe is planned for FireFox in the future, it is not available yet. The feature is dubbed TaskFox, and from the linked wiki, "its aim is to allow users to quickly access information and perform tasks that would normally take several steps to complete."
News of the upcoming TaskFox feature just broke today, in fact. Perhaps you should consider a career as a psychic instead of a programmer.
