I am currently working on a project that involves a lot of HTML parsing. I have come across an issue that I cannot find the solution to. Basically, what I have is an application that downloads HTML from a website and parses it (I am using HTMLCleaner as my parser). However, this website contains some JavaScript elements which, once executed, change the HTML. What I need to do is execute the JavaScript function from my application and then download the HTML.
I have been looking for the solution for days, but all I managed to find was how to do this using WebView, which in my case I do not want.
I do have an idea for solving the problem, which is making an unrendered WebView and using that. However, I am sure there is a better one than that.
Thank you in advance.
I have been looking for the solution for days, but all I managed to find was how to do this using WebView, which in my case I do not want.
You need something that knows how to render Web pages. Nothing else is going to do what you want (create a DOM, then run JavaScript that modifies that DOM).
I do have an idea for solving the problem, which is making an unrendered WebView and using that
That is one solution. Or, you could play around with Firefox's GeckoView, which will do similar stuff, just with their own rendering engine.
However, I am sure there is a better one than that.
You can build your own Web browser from scratch.
I am working on a little app for myself. I am trying to get a list of links from a site. The site is for example: http://kinox.to/Stream/Prison_Break.html
If you hover over the big window in the middle that says kinox.to best online, it shows the link that I want in the bottom left. The problem is that if I look at the HTML file I can't find the link anywhere. I guess it has something to do with the site using JavaScript or AJAX.
Is it possible to somehow get the link using JSoup or are there any other Java libraries that could help me?
I did not look closely into the page you are trying to load, but here is what I think the problem may be: the link is loaded/generated dynamically via JavaScript. Jsoup does not run JavaScript, so you can't find the link in the HTML.
Two possible solutions:
1) Use something like Selenium WebDriver to access the content. The Java bindings let you remote-control a real browser, which should have no problems loading the page and running all the scripts within. Solution 1 is simple to program, but runs slowly and depends on an external browser program that must be installed on the machine. An alternative to WebDriver is the JavaFX WebKit engine if you are on Java 8 (see the sketch after this list).
2) Analyse the traffic and the JavaScript on the page and find out where the link comes from. This may take a bit of time, but once you succeed you can use Jsoup to get all the data you need. This solution should run much faster than solution 1.
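To illustrate the JavaFX alternative mentioned in solution 1, here is a minimal, untested sketch. The JFXPanel trick for bootstrapping the JavaFX runtime from a plain main() and the printed outerHTML are my assumptions, not a recommendation from the original answer:

import javafx.application.Platform;
import javafx.concurrent.Worker;
import javafx.embed.swing.JFXPanel;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;

public class WebEngineScrape {
    public static void main(String[] args) {
        new JFXPanel(); // initializes the JavaFX runtime from a plain main()
        Platform.runLater(() -> {
            WebEngine engine = new WebView().getEngine();
            engine.getLoadWorker().stateProperty().addListener((obs, old, state) -> {
                if (state == Worker.State.SUCCEEDED) {
                    // At this point the DOM includes whatever JavaScript generated.
                    String html = (String) engine.executeScript(
                            "document.documentElement.outerHTML");
                    System.out.println(html);
                    Platform.exit();
                }
            });
            engine.load("http://kinox.to/Stream/Prison_Break.html");
        });
    }
}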
One solution, and probably the easiest, would be to use Selenium:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
driver.get("http://kinox.to/Stream/Prison_Break.html");
String mylink = driver.findElement(By.cssSelector("#AjaxStream > a")).getText();
My JSP project is the back-end of a fairly simple site whose purpose is to show many submissions on the website. They are organized in categories, basically like a typical forum.
The content is loaded entirely from a database since making separate files for everything would be extremely redundant.
However, I want to give the users the possibility to navigate properly on my site and also give unique links to each submission.
So for example a link can be: site.com/category1/subcategory2/submission3.jsp
I know how to generate those links, but is there a way to automatically redirect all the theoretically possible links to the main site.com/index.jsp?
The Java code of the JSP needs access to the original link of course.
Hope someone has an idea.
Big thanks in advance! :)
Alright, in case someone stumbles across this one day...
The way I've been able to solve this was by using a Servlet. Eclipse lets you create one directly in the project, and the wizard even lets you set the url-mapping (for example /main/*) so you don't have to edit web.xml yourself.
The doGet function simply contains the redirection as follows:
request.getRequestDispatcher("/index.jsp").forward(request,response);
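For reference, here is a minimal sketch of such a servlet, assuming the Servlet 3.0 @WebServlet annotation rather than the wizard-generated web.xml entry (the class name is a placeholder):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/main/*")
public class ForwardServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // index.jsp can still read the original link via request.getRequestURI()
        request.getRequestDispatcher("/index.jsp").forward(request, response);
    }
}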
This kind of redirection unfortunately causes all relative links in the webpage to fail. This can be worked around by linking relative to the root directory, for example. See the neat responses here for alternatives: Browser can't access/find relative resources like CSS, images and links when calling a Servlet which forwards to a JSP
I want to extract HTML data from a website using Java. The problem is that the webpage keeps scrolling down once the user reaches the bottom of the page. The number of times it scrolls down is fixed. My Java code can extract only the first part. How do I extract the content from the remaining scrolls? Is there a way to load the whole page at once with Java? Any help would be appreciated :)
This might be the type of thing that PhantomJS (http://phantomjs.org/) was designed for. It will crawl entire web pages and even execute JavaScript, using a "real" browser in headless mode. I suggest stopping what you're doing with Java and taking a look at PhantomJS instead. It could save you a LOT of time. :)
This type of behavior is implemented in the browser, which interprets the user's scrolling actions to load more content via AJAX and dynamically modifies the in-memory DOM. Consider that your Java runs in a web container on the server, and that web container (e.g. Tomcat, JBoss, etc.) provides a huge amount of underlying code so your app doesn't have to worry about the plumbing.
Conceptually, a similar thing occurs at the client, with the DHTML web page running in its own "container" (the browser), which provides a wealth of functionality, from UI to networking, to DOM, etc. If you remove the browser from the equation and replace it with a Java program, you will need to provide the equivalent of the browser in which the DHTML/Javascript can execute.
I believe that HTMLUnit may fill the bill, but have not worked with it personally.
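For what it's worth, here is a minimal, untested HtmlUnit sketch; the URL, the scroll count, the wait times, and the idea of simulating the scroll from script are all placeholder assumptions on my part:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient client = new WebClient()) {
    client.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = client.getPage("http://example.com/endless-scroll");
    for (int i = 0; i < 5; i++) { // the scroll count is fixed per the question
        page.executeJavaScript("window.scrollTo(0, document.body.scrollHeight);");
        client.waitForBackgroundJavaScript(2000); // give the AJAX calls time to finish
    }
    System.out.println(page.asXml()); // the DOM now contains the loaded content
}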
I had to take a surveymonkey survey today, and the format was as follows: a question was asked, then after hitting the next button, the answer was displayed as "Answer: _" along with an explanation. For kicks, I'd like to make a program that could take this survey, answering any letter, then going to the next page and reading the answer, then going back and changing the answer to the correct one, then going 2 pages ahead and repeating.
I am familiar with Java and Python, but I'm not sure how to make them be able to "know" where the button is, and how to "read" text without unnecessary image recognition.
This is just a fun project, nothing serious, but I would appreciate any ideas to get me started.
Assuming that the text was just that (text rather than images), there are a few useful tools for you:
.Net WebControl - I've scripted this before from .Net. It has the advantage of making all of the JS on the page still work. I know this isn't Java, but it is surprisingly easy to work with for this kind of task.
Selenium - It is primarily a web testing framework, but it would be easy to script it from Java to auto-submit forms.
TagSoup for Java - If the pages do not have significant javascript code that needs to run, there are many HTML parsers for Java that could potentially be used to develop a scraper.
Would it be unrealistic to make it post to the SurveyMonkey pages? You could then do some regexes to pull "answer:__" out and look for that pattern in the original page. It would definitely be easier than trying to click things in a browser, etc. Basically, write a Java app (or Python, for that matter) that does HTTP POSTs to the survey pages in order, uses regexes to find the next page, and then uses a stack to keep track of the history.
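A rough sketch of that idea, assuming the pages can be fetched with plain HTTP and that the revealed text literally looks like "Answer: A" (both are my assumptions; the URL is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Fetch a page and pull out the revealed answer with a regex.
StringBuilder sb = new StringBuilder();
try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new URL("http://example.com/survey/page2").openStream()))) {
    String line;
    while ((line = in.readLine()) != null) {
        sb.append(line).append('\n');
    }
}
Matcher m = Pattern.compile("Answer:\\s*([A-D])").matcher(sb.toString());
if (m.find()) {
    String correctAnswer = m.group(1); // use this when re-submitting the form
}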
Edit: if this isn't clear, let me know and I'll clarify.
Edit 2: I completely forgot about HtmlUnit, my bad. It is a testing framework like the one suggested by jsight, but specifically for Java, and it functions very similarly to JUnit. However, because it is designed for testing web applications, it can be used to automate interactions with other sites.
You can do it using a simple image search. First, screenshot a unique part of the button and save it; this will be used as the reference for where to click the mouse. Then, while the application is actually running, take a screenshot of the entire screen, find the part matching the previously saved image, and click the mouse at the appropriate location based on where the button image was found.
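A bare-bones sketch of that approach with java.awt.Robot; the "button.png" filename is a placeholder, and the exact pixel comparison is the naive version (real screen captures often need a tolerance):

import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.event.InputEvent;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

Robot robot = new Robot();
BufferedImage target = ImageIO.read(new File("button.png")); // saved button image
BufferedImage screen = robot.createScreenCapture(
        new Rectangle(Toolkit.getDefaultToolkit().getScreenSize()));

// Scan the screenshot for an exact match of the saved image, then click its center.
outer:
for (int x = 0; x <= screen.getWidth() - target.getWidth(); x++) {
    for (int y = 0; y <= screen.getHeight() - target.getHeight(); y++) {
        boolean match = true;
        for (int dx = 0; dx < target.getWidth() && match; dx++) {
            for (int dy = 0; dy < target.getHeight() && match; dy++) {
                if (screen.getRGB(x + dx, y + dy) != target.getRGB(dx, dy)) {
                    match = false;
                }
            }
        }
        if (match) {
            robot.mouseMove(x + target.getWidth() / 2, y + target.getHeight() / 2);
            robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
            robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
            break outer;
        }
    }
}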
Is it possible to programmatically start an application from Java and then send commands to it and receive the program's output?
I'm trying to realize this scenario:
I want to access a website that uses lots of JavaScript and special HTML + CSS features, so the website isn't properly displayed in swt.browser or any of the other available browser widgets. But the website can be displayed without any problems in Firefox. So I want to run a hidden instance of Firefox, load the website and get the data. (It would be nice if FF could be embedded in a JFrame or so.)
Has anybody got an idea how to realize this?
Any help would really be appreciated!
EDIT: The website loads some Javascript that does some html magic and loads some pictures. When I only read the html from the website I see nothing more than some JavaScript calls. But when the website is loaded in a Browser, it displays some images overlayed with text. That's what I'm trying to show the user of my app.
To start Firefox from within the application, you could use:
Runtime runtime = Runtime.getRuntime();
try {
    String path = "/path/to/firefox";
    // Pass the command as an array so a path containing spaces is not split up.
    Process process = runtime.exec(new String[] { path, url });
} catch (IOException e) {
    // ...
}
To manipulate processes once they have started, one can often use process.getInputStream() and process.getOutputStream(), but that would not help you in the case of Firefox.
You should probably look into ways of solving your specific problem other than trying to interact directly between your application and a browser instance. Consider either moving the whole interface into a Java gui, or doing a web app from the ground up -- not half and half.
See this article - it will teach you how to start a process, read its output and write to its input stream.
However, this solution may not be the best for your problem. What kind of data do you need to get from the web page? Would it be better to read the HTML with an HTTP GET and then parse it with an HTML parser?
If you have a text-mode browser available (like links2 on linux) you might want to see how well that can render the page. For example, the command "links -dump http://someurl.com" will format the page as text and exit immediately, resulting in output that might be easily parseable using the methods that Ray Myers and kgiannakakis suggest.
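For instance, something along these lines, assuming links is installed and on the PATH:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Dump the page as formatted text and read the browser's output.
Process p = new ProcessBuilder("links", "-dump", "http://someurl.com").start();
try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = r.readLine()) != null) {
        System.out.println(line); // plain-text rendering of the page
    }
}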
If the website is static, you could use a web scraper like Jericho to load the URL, parse the HTML and wander your way through the DOM to the info you need.
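For example, a minimal Jericho sketch that lists all link targets on a page (the URL is a placeholder, and this is an untested starting point rather than a full solution):

import java.net.URL;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

// Parse the page and print the href of every anchor element.
Source source = new Source(new URL("http://example.com"));
for (Element link : source.getAllElements(HTMLElementName.A)) {
    System.out.println(link.getAttributeValue("href"));
}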
Although a feature similar to what you describe is planned for Firefox in the future, it is not available yet. The feature is dubbed TaskFox, and from the linked wiki, "its aim is to allow users to quickly access information and perform tasks that would normally take several steps to complete."
News of the upcoming TaskFox feature just broke today, in fact. Perhaps you should consider a career as a psychic instead of a programmer.