Java: "Control" External Application - java

Is it possible to programmatically start an application from Java and then send commands to it and receive the program's output?
I'm trying to realize this scenario:
I want to access a website that uses lots of JavaScript and special HTML + CSS features -> the website isn't displayed properly in swt.browser or any of the other available browser widgets. But the website displays without any problems in Firefox. So I want to run a hidden instance of Firefox, load the website and get the data. (It would be nice if Firefox could be embedded in a JFrame or similar.)
Has anybody got an idea how to realize this?
Any help would really be appreciated!
EDIT: The website loads some Javascript that does some html magic and loads some pictures. When I only read the html from the website I see nothing more than some JavaScript calls. But when the website is loaded in a Browser, it displays some images overlayed with text. That's what I'm trying to show the user of my app.

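To start Firefox from within the application, you could use:
// requires: import java.io.IOException;
Runtime runtime = Runtime.getRuntime();
try {
    String path = "/path/to/firefox";
    // Pass the executable and the URL as separate arguments so that
    // spaces in either one are not split into extra arguments.
    Process process = runtime.exec(new String[] {path, url});
} catch (IOException e) {
    // ...
}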
To manipulate processes once they have started, one can often use process.getInputStream() and process.getOutputStream(), but that would not help you in the case of Firefox.
You should probably look into ways of solving your specific problem other than trying to interact directly between your application and a browser instance. Consider either moving the whole interface into a Java GUI, or building a web app from the ground up -- not half and half.

See this article - it will teach you how to start a process, read its output and write to its input stream.
However, this solution may not be the best for your problem. What kind of data do you need to get from the web page? Would it be better to read the HTML with an HTTP GET and then parse it with an HTML parser?
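The start/write/read pattern this answer refers to might look roughly like the following. This is a minimal sketch, not the linked article's code: the class name ProcessIo and the method run are made up for illustration, and the example command `cat` assumes a Unix-like system. It also assumes the input is small; for large inputs the writing and reading would need to happen on separate threads to avoid blocking on a full pipe.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ProcessIo {

    /**
     * Starts the given command, writes `input` to its standard input,
     * and returns everything it prints (stdout and stderr merged).
     */
    public static String run(List<String> command, String input)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();

        // The process's stdin is an OutputStream from our point of view.
        // Closing it signals end-of-input to the child process.
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(input.getBytes(StandardCharsets.UTF_8));
        }

        // The process's stdout is an InputStream from our point of view.
        byte[] output = process.getInputStream().readAllBytes();
        process.waitFor();
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // `cat` simply echoes its input back; assumes a Unix-like system.
        System.out.print(run(List.of("cat"), "hello from Java\n"));
    }
}
```

As the previous answer notes, this only helps with programs that communicate over standard input and output; a graphical browser such as Firefox does not.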

If you have a text-mode browser available (like links2 on Linux), you might want to see how well it can render the page. For example, the command "links -dump http://someurl.com" will format the page as text and exit immediately, resulting in output that might be easy to parse using the methods that Ray Myers and kgiannakakis suggest.

If the website is static, you could use a web scraper like Jericho to load the URL, parse the HTML and wander your way through the DOM to the info you need.
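For a static page, fetching the raw HTML before handing it to a parser such as Jericho could be sketched with the JDK's own HTTP client (Java 11 or later). The class and method names here are hypothetical. Note that this returns only the HTML as sent by the server; it does not run any JavaScript, so it would not help with the page described in the question's EDIT.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageFetcher {

    /** Builds a plain HTTP GET request for the given URL. */
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    /** Downloads the page and returns the response body as text. */
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The returned string can then be passed to whichever HTML parser you choose.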

Although a feature similar to what you describe is planned for Firefox in the future, it is not available yet. The feature is dubbed TaskFox, and from the linked wiki, "its aim is to allow users to quickly access information and perform tasks that would normally take several steps to complete."
News of the upcoming TaskFox feature just broke today, in fact. Perhaps you should consider a career being a psychic instead of a programmer.

Related

Creating an image from a webpage

I'm working on a way to detect defacement on my website. The idea is to crawl the whole website and for each page, take a screenshot or render the website as an image and compare it with the last time the page has been checked.
I'm looking for a way to convert a whole webpage (HTML, CSS, JS) into an image, like a screenshot, no matter what the language is (but I would prefer Java, Python or C#).
I need it to be fast and usable on a server.
I already tried the following in Java:
CssBox, but the rendering isn't good enough (no JS)
Selenium Web Driver, but it's way too slow (Time to open firefox, display the page etc...) and not usable without GUI
I think a solution would be a kind of wrapper for a web engine but I didn't find anything about that (at least in Java). I've been told PhantomJS would fit for this need, is it right?
The perfect result would be to create something like that: http://www.page2images.com/home
Use a browser that you can control via a script or command-line options, like PhantomJS. The documentation contains examples of how to make screenshots from URLs.
The website you linked to offers a REST API that performs this task. Is that not a viable option for you?
Selenium is your best bet. Depending on your page content (ie. JS libraries, etc) it might take some time, but you could automate this with a script to run nightly via cron. Or using screen.
It has a rich language of assertions and simulated mouse events, and ways to regression-test and/or monitor the state of a set of pages.
Good luck.
With no GUI, it's probably not possible to do something like this.
If you can live with a GUI toolkit and its related dependencies, you can use the JavaFX WebView and take a snapshot of the node using the following code:
WritableImage image = webView.snapshot(null, null);
BufferedImage bufferedImage = SwingFXUtils.fromFXImage(image, null);
....
References:
WebView#snapshot
SwingFXUtils#fromFXImage
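Once two renderings of a page are available as BufferedImage objects, the comparison step from the question could be sketched as a simple pixel-by-pixel check. This is only an illustration under stated assumptions: the class name ImageDiff is made up, and treating images of different sizes as fully different is an arbitrary choice. A real defacement detector would probably need some tolerance for ads, timestamps and other content that changes legitimately.

```java
import java.awt.image.BufferedImage;

public class ImageDiff {

    /**
     * Returns the fraction of pixels (0.0 to 1.0) whose ARGB values differ.
     * Images with different dimensions are treated as completely different.
     */
    public static double differingFraction(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return 1.0;
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        long total = (long) a.getWidth() * a.getHeight();
        return (double) differing / total;
    }
}
```

A page could then be flagged for review when the returned fraction exceeds a threshold you choose.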

Reduce HTML using Applets

My supervisor has tasked me with programmatically reducing a website's content by looking at the HTML tags to reveal only the core content. Importantly, this particular piece of the project must be written in Java.
Now, having learnt about the differences between Plugins, Extensions, Applets, and Widgets, I think I want to use an Extension that calls a client-side Applet. My approach was going to be this:
Using the Google Chrome API, I was going to display a button that the user can click.
If clicked, the action is to launch a new browser tab that has the Applet embedded within it.
The applet automatically sources the called tab's HTML code and filters it.
Once filtered, the reduced copy of the original site appears.
So I have a few questions. To start, is it even possible to use an Extension with an Applet? Moreover, is it possible for an applet to look at another tab's HTML code? If not, is it possible to just reload the original tab with the Applet now embedded within it and complete the function? Thanks.
Javascript is already on most mobile web platforms. Java is not, and there is no reasonable way mobile customers will be able to install Java. Android, which runs many, but not all, mobile devices has a Java run time environment, and is basically a loader for Java apps. But an Apple iPhone is not an Android device... nor is a Windows Phone.
If you want to summarize content on the client, and in Javascript, as I see it you have two choices:
Succeed with some inner burst of genius where dozens of the best expert PhDs in Natural Language Computing have just begun exploring how to extract "true meaning" from text; OR
look at document.title and be done with it.
The 2nd approach assumes that the authors of web pages set titles and set a title appropriate for summarizing their website. This isn't a perfect assumption, but it is OK most of the time. It is also a lot less expensive than #1.
With the 1st approach you can get a head start with a "natural language toolkit" that can do things like scan text for unusual words and phrases. To get a rough idea of the kinds of software that have been built in this area, review Wikipedia: Outline of natural language processing:: toolkits. A popular toolkit for Python is called NLTK. Whether you use a toolkit from Java or Python, it means working on the server, because the client will not have the storage, network speed, or CPU. For Python there are server-side app frameworks like Django or web2py that can make building out a server app faster, and on Java there are servlet frameworks. Ultimately you'll need a lot of help, training, or luck, and as I have hinted above, it can easily be beyond the capabilities of a small team of fresh hires, and certainly way beyond what a single new developer eager to prove his/her capabilities can do in a few weeks on their own with limited help.
Most web pages have titles set like this near the beginning of the downloaded HTML:
<head><title>My Furry Kittens!</title></head>
You don't need to write a parser. If you are running in the browser, the title has been parsed into the DOM or Document Object Model already. The string "My Furry Kittens!" in this example would be available in the global variable document.title.
If you like, you could put a button into a plugin and let people push it to summarize the website. Or, they could just look up at the title. It is already on the page. Of course, if the goal is to scrape titles one can avoid writing a parser and use a "fake" headless scriptable browser like phantomJS or similar.
You can read more about document.title on the Mozilla Developer Network. MDN is a great reference for learning how web browsers work. They are the maintainers of the Mozilla Firefox browser. Most of what you can learn there will also work on Chrome, Internet Explorer, and various mobile platforms.
Good Luck!
How about implementing a local proxy server on the mobile device? The browser would just need to be configured to use the proxy, while the custom proxy implementation can transform the requested HTML however it likes.

Confused about integration of Java file, JSPs, servlets?

This is my first time working with Java and tomcat and I'm a little confused about how everything fits together - I've googled endlessly but can't seem to wrap my head around a few concepts.
I have completed a Java program that outputs bufferedImages. My goal is to eventually get these images to display on a webpage.
I'm having trouble understanding how my java file (.java) which is currently running in NetBeans interacts with a servlet and/or JSP.
Ideally, a servlet or JSP (I'm not 100% clear on how either of those works, although I mostly understand the syntax from looking at various examples) could get my output (the bufferedImages) when the program runs, and the HTML file could somehow interact with whatever they are doing so that the images could be displayed on the webpage. I'm not sure if this is possible. If anyone could suggest a general order of going about things, that would be awesome.
In every example/tutorial I find, no one uses .java files - there are .classes in the WEB-INF folder -- it doesn't seem like people are using full-on Java programs. However, I need my .java program to run so that I can retrieve the output and use it in the webapp.
Any general guidance would be greatly appreciated!
I think this kind of documentation is sadly lacking; too many think that an example is an explanation, and for all the wonderful things you can get out of an example, sometimes an explanation is not one of them. I'm going to attempt to explain some of the overall concepts you mentioned; they aren't going to help you solve your buffered image display problem directly, unfortunately.
Tomcat and other programs like it are "web servers"; these are programs that accept internet connections from other computers and return information in a particular format. When you enter a "www" address in a browser, the string in that address eventually ends up (as a "request") at a web server, which then returns you a web page (also called a "response"). Tomcat, Apache, Jetty, JBoss, and WebSphere are all similar programs that do this sort of thing. In the original form of the world-wide-web, the request string represented a file on the server machine, and the web server's job was to return that (html) file for display in the browser.
A Servlet is a kind of java program that runs on some web servers. The servlet itself is a java class with methods defined by the javax.servlet.Servlet interface. In webservers that handle servlets, someone familiar with the configuration files can instruct the web server program to accept certain requests and, instead of returning an HTML file (or whatever) from the server, to instead execute the servlet code. A servlet, by its nature, returns content itself - think of a program that outputs HTML and you're on the right track.
But it turns out to be a pain to output complete HTML from a program -- there's a tedious amount of HTML that doesn't have much to do with the "heavy lifting" for which you need a programming language of some sort. You have to have Java (or some language) to make database inquiries, filter results, etc., but you don't really need Java to put in the hundreds of structural tags that a modern web page needs.
So a JavaServerPage (JSP) is a special kind of hybrid, a combination of HTML and things related to servlets. You CAN put java code directly in a JSP file, but it is usually considered better to use html-like 'tags' which are then interpreted by a "JSP compiler" and turned into a servlet. So the creator of the JSP page learns how to use these tags, which are (if correctly constructed) more logical for web page creators than the java programming language is, and in fact doesn't have to be a programmer at all. So a programmer, working with this content-oriented person, creates tags for the page to use to describe how it wants its page to look, then the programmer does the programming and the content-person creates the web pages with it.
For your specific problem, we'll need more detail to help you. Do you envision this program running and using some information provided by the user as part of his request to generate the images? Or are the images generated once and now you just need to display them? I think that's a topic for another question, actually.
This ought to be enough to get you started. I would now suggest the wikipedia articles on these things to get more details, and good luck getting your head around the concepts. I hope this has helped.
This addendum provided after a comment you made about wanting to do a slideshow.
An important web programming concept is the client-server and request-response nature of it. In the traditional, non-Javascript web environment, the client (read browser) sends a request to the server, and the server sends back bytes. There is no ongoing connection between the two computers after the stream of bytes finishes, and there are restrictions on how long that stream of bytes can continue. Additionally, outside of this request and response, the server usually has no capability to send anything to the client unless the client requests it; the client 'drives' the exchange of data.
So a 'slideshow', for instance, where the server periodically sends bytes representing an additional image, is not the way HTML works (or was meant to work). You could do one under the user's control: the user presses a button for each next picture, the browser sends a request for the next picture and it appears in the place where the previous one was. That fits the request-response paradigm.
Now, the effect of an automatic slideshow is possible using Javascript. Javascript, named after Java but otherwise unrelated, is a scripting language; it is part of an HTML page, is downloaded with the page to the browser, and it runs in the browser's environment (as opposed to a JSP/servlet, which executes on the server). You can write a timer in Javascript, and it can wait N seconds and send another request to the server (for another picture or whatever). Javascript has its own rules, etc., but even so I think it a good idea to keep in mind that you aren't just doing HTML any more.
If a slideshow is what you are after, then you don't need JSP at all. You can create an HTML page with places for the picture being displayed, labels, text, buttons for stopping the slideshow and so forth, in HTML, and use Javascript for requesting additional pictures.
You COULD use JSP to create the page, and it might help you depending on how complex the page is, but it isn't going to help you with an essential function: getting the next picture for the slideshow. When the browser requests a JSP page:
the request goes to the server,
the server determines the page you want and that it is a JSP page,
the server compiles that page to a servlet if it hasn't already,
the servlet runs, producing HTML output according to the tags now compiled into Java,
the server returns HTML to the browser.
Then the server is done, and more bytes won't go to the browser until another request is made.
Again, I hope this has helped. Your example of a slideshow has revealed some basic concepts that need to be understood about web programming, servers, HTML, JSPs, and Javascript, and I wish you luck on your journey through them all. And if you come to think of it all as a bit more convoluted than it seems it needed to be, well, you won't be the first.
You can create a JSP that invokes a method in your Java class to retrieve the BufferedImage. Then you must set the content type to the appropriate image type (for example "image/png"):
response.setContentType()
The tricky part is that you must print the image from the JSP, so you have to call:
response.getOutputStream()
from your JSP, and with that OutputStream you must pass the bytes of your BufferedImage.
Note that in that JSP you'll not be able to print out HTML, only the image.
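The step of passing the image bytes to the stream could be sketched as below. To keep the example free of servlet dependencies, it writes to a plain java.io.OutputStream; in the JSP or servlet you would pass the stream returned by response.getOutputStream() after setting the content type to "image/png". The class name PngResponse and the choice of PNG are assumptions made for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import javax.imageio.ImageIO;

public class PngResponse {

    /**
     * Encodes the image as PNG and writes the bytes to the given stream.
     * In a servlet or JSP, `out` would be response.getOutputStream().
     */
    public static void writePng(BufferedImage image, OutputStream out) throws IOException {
        // ImageIO.write returns false if no writer for the format is available.
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        out.flush();
    }
}
```

The HTML page would then reference the URL of that JSP or servlet in an img tag, so the browser requests the image separately from the page itself.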
I'm not sure where you need more clarification, as it seems you're a bit confused about the concepts.
BTW.: A JSP is just a servlet that has an easier syntax to write HTML and Java code together.

how to extract HTML data from a webpage which scrolls down for a fixed number of times?

I want to extract HTML data from a website using Java. The problem is that the webpage keeps scrolling down once the user reaches the bottom of the page. The number of times it scrolls down is fixed. My Java code can extract only the 1st part. How do I extract the remaining scrolls? Is there a way to load the whole page at once with Java? Any help would be appreciated :)
This might be the type of thing that PhantomJS (http://phantomjs.org/) was designed for. It will crawl entire web pages and even execute JavaScript, using a "real" browser in headless mode. I suggest stopping what you're doing with Java and taking a look at PhantomJS instead. It could save you a LOT of time. :)
This type of behavior is implemented in the browser, which interprets the user's scrolling actions to load more content via AJAX and dynamically modifies the in-memory DOM. Consider that your Java runs in a web container on the server, and that web container (e.g. Tomcat, JBoss, etc.) provides a huge amount of underlying code so your app doesn't have to worry about the plumbing.
Conceptually, a similar thing occurs at the client, with the DHTML web page running in its own "container" (the browser), which provides a wealth of functionality, from UI to networking, to DOM, etc. If you remove the browser from the equation and replace it with a Java program, you will need to provide the equivalent of the browser in which the DHTML/Javascript can execute.
I believe that HtmlUnit may fit the bill, but I have not worked with it personally.

Launching a website from within a program, and inputting data to specific fields

Although I've been programming for a few years, I've only really dabbled in the web side of things; it's been more application-based for computers up until now. I was wondering, in Java for example, what library-defined function or self-defined function I would use to have a program launch a web browser to a certain site? Also, as an extension to this, how could I have it find a certain field in the website, like a search box for instance (if it wasn't the current target of the cursor), and then populate it with a string and submit it to the server? (Maybe this is a kind of find-by-ID scenario?!)
Also, is there a way to control whether this is visible or not to the user? What I mean is, if I want to do something as a background task whilst the user carries on using the program, I will want the program to be submitting data to a webpage without the whole visual side of things that would interrupt the user.
This may be basic but like I say, I've never tried my hand at it so perhaps if someone could just provide some rough code outlines I'd really appreciate it.
Many thanks
I think Selenium might be what you are looking for.
Selenium allows you to start a web browser, point it at a certain website and interact with it. There is also a Java API (and APIs for a lot of other languages, by the way) that lets you control the launched browser from a Java application.
There is some tweaking to do, but you can also run Selenium in the background, using a headless web browser.
As I understand it, you want to submit data to a server via the existing web interface?
In that case you need to find out how the URL for the request is built and then make an HTTP call using the corresponding URL.
I advise reading this if it involves a POST submit.
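Submitting a form without a visible browser, as described in this answer, could be sketched with the JDK's HTTP client (Java 11 or later). The class name, the field names and the URL are hypothetical; you would need to inspect the real form to find its action URL and field names. This sketch also assumes the site accepts a plain form post without cookies, hidden tokens or JavaScript.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

public class FormPost {

    /** Encodes fields as application/x-www-form-urlencoded, e.g. "q=hello+world&lang=en". */
    public static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds a POST request that submits the fields the way an HTML form would. */
    public static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), on a background thread if the user interface must stay responsive.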
