Monitoring chat rooms with HtmlUnit, is it possible? - java

I started with HtmlUnit recently, had some success scraping some pages and interacting with it, really powerful tool...
But, as far as my knowledge goes, I have only retrieved a page in a certain state. My next step is to make HtmlUnit read the messages from a chat room continuously, and store or act on them when a certain string/regexp matches. I was even thinking about interacting with the chat room.
I'm not sure if HtmlUnit goes that far, I did some research and found something about webDriver, webWindow, etc, maybe I will need to work with Threads to do this....
Can you guys point me in the right direction?
Thank you very much

HtmlUnit tries to simulate real browser behavior as closely as possible.
If the target website is simple, HtmlUnit will work. In some cases, though, the website is too complex for the current HtmlUnit, and you need to isolate the root cause so it can be fixed.
You can start with WebDriver, and you can easily change the implementation from e.g. ChromeDriver/FirefoxDriver to HtmlUnitDriver with a single line change.
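For the "read constantly and react to a regexp" part of the question, here is a minimal sketch of the matching logic only, using plain JDK classes. It assumes you already have some way to pull the currently visible messages out of the page as strings (via HtmlUnit or WebDriver); the class name ChatMonitor and its method are made up for illustration and are not part of either library.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Filters successive snapshots of a chat log for new lines that match a pattern. */
class ChatMonitor {
    private final Pattern pattern;
    // Messages already processed. Assumption: identical text counts as the same message.
    private final Set<String> seen = new HashSet<>();

    ChatMonitor(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /**
     * Takes the messages currently visible in the room and returns those that
     * are new since the last call and contain a match for the pattern.
     */
    List<String> newMatches(List<String> visibleMessages) {
        List<String> hits = new ArrayList<>();
        for (String message : visibleMessages) {
            // Set.add returns false when the message was already seen.
            if (seen.add(message) && pattern.matcher(message).find()) {
                hits.add(message);
            }
        }
        return hits;
    }
}
```

You would then call newMatches from a background thread every few seconds (a ScheduledExecutorService is one way to do that), passing in whatever the page currently shows. Deduplicating by text is a simplification; if the room exposes message ids or timestamps, those would be a better key.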

Related

HtmlUnit with Knockout JS

I am trying to automate a login page which appears to be using Knockout.js.
HtmlUnit doesn't seem to load the full page; it is missing all the input fields, which makes it impossible to actually log in.
I have tried ensuring that the JavaScript timeouts are set and have also enabled NicelyResynchronizingAjaxController. After the page has loaded, I wait using:
waitForBackgroundJavaScript
waitForBackgroundJavaScriptStartingBefore
Thread.sleep (just for good measure)
I have even checked for additional windows (WebClient.getWebWindows), but there just seems to be the one.
It appears Knockout (assuming it is actually Knockout) is creating the inputs. Is this just too much for HtmlUnit, or have I missed something?
This is a known problem (see https://github.com/HtmlUnit/htmlunit/issues/37).
Hopefully I will find some time to figure out what is going wrong here.

Reduce HTML using Applets

My supervisor has tasked me with programmatically reducing a website's content by looking at the HTML tags to reveal only the core content. Importantly, this particular piece of the project must be written in Java.
Now, having learnt about the differences between Plugins, Extensions, Applets, and Widgets, I think I want to use an Extension that calls a client-side Applet. My approach was going to be this:
Using the Google Chrome API, I was going to display a button that the user can click.
If clicked, the action is to launch a new browser tab that has the Applet embedded within it.
The Applet automatically sources the calling tab's HTML code and filters it.
Once filtered, the reduced copy of the original site appears.
So I have a few questions. To start, is it even possible to use an Extension with an Applet? Moreover, is it possible for an Applet to look at another tab's HTML code? If not, is it possible to just reload the original tab with the Applet now embedded within it and complete the function? Thanks.
Javascript is already on most mobile web platforms. Java is not, and there is no reasonable way mobile customers will be able to install Java. Android, which runs many, but not all, mobile devices has a Java run time environment, and is basically a loader for Java apps. But an Apple iPhone is not an Android device... nor is a Windows Phone.
If you want to summarize content on the client, and in Javascript, as I see it you have two choices:
Succeed with some inner burst of genius where dozens of the best expert PhDs in Natural Language Computing have just begun exploring how to extract "true meaning" from text; OR
look at document.title and be done with it.
The 2nd approach assumes that the authors of web pages set titles, and set a title appropriate for summarizing their website. This isn't a perfect assumption, but it is OK most of the time. It is also a lot less expensive than #1.
With the 1st approach you can get a head start with a "natural language toolkit" that can do things like scan text for unusual words and phrases. To get a rough idea of the kinds of software that have been built in this area, see the Wikipedia article "Outline of natural language processing", section on toolkits. A popular toolkit for Python is called NLTK. Whether you use a toolkit from Java or Python, it means working on the server, because the client will not have the storage, network speed, or CPU. For Python there are server-side app frameworks like Django or web2py that can make building out a server app faster, and on Java there are servlet frameworks. Ultimately you'll need a lot of help, training, or luck. As I have hinted above, it can easily be beyond the capabilities of a small team of fresh hires, and certainly way beyond what a single new developer eager to prove his/her capabilities can do in a few weeks on their own with limited help.
Most web pages have titles set like this near the beginning of the downloaded HTML:
<head><title>My Furry Kittens!</title></head>
You don't need to write a parser. If you are running in the browser, the title has been parsed into the DOM or Document Object Model already. The string "My Furry Kittens!" in this example would be available in the global variable document.title.
If you like, you could put a button into a plugin and let people push it to summarize the website. Or, they could just look up at the title. It is already on the page. Of course, if the goal is to scrape titles one can avoid writing a parser and use a "fake" headless scriptable browser like phantomJS or similar.
You can read more about document.title on the Mozilla Developer Network. MDN is a great reference for learning how web browsers work. They are the maintainers of the Mozilla Firefox browser. Most of what you can learn there will also work on Chrome, Internet Explorer, and various mobile platforms.
Good Luck!
How about implementing a local proxy server on the mobile device. The browser would just need to be configured to use the proxy, while the custom proxy implementation can transform the requested html however it likes.
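Since the question asks for the reduction itself to be written in Java, here is a rough sketch of what the "transform the requested html" step inside such a proxy might look like. The class name HtmlReducer and the list of elements treated as noise are assumptions for illustration. It uses regular expressions only to stay dependency-free; that is fragile (nested elements of the same name will confuse it, for example), so a real HTML parser would be the safer choice.

```java
import java.util.regex.Pattern;

class HtmlReducer {
    // Elements dropped together with their content. This list is an illustrative assumption.
    private static final Pattern NOISE = Pattern.compile(
        "(?is)<(script|style|nav|header|footer|aside)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");

    /** Returns the HTML with comments and the "noise" elements removed. */
    static String reduce(String html) {
        String withoutComments = COMMENTS.matcher(html).replaceAll("");
        return NOISE.matcher(withoutComments).replaceAll("");
    }
}
```

Which tags count as "core content" depends on the sites you target, so expect to tune that list.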

Getting html after javascript execution WITHOUT WebView (Android)

I am currently working on a project that involves a lot of HTML parsing. I have come across an issue that I cannot find the solution to. Basically, what I have is an application that downloads HTML from a website and parses it (I am using HTMLCleaner as my parser). However, this website contains some JavaScript elements, which after execution change the HTML. What I need to do is execute the JavaScript function from my application and then download the HTML.
I have been looking for the solution for days, but all I managed to find was how to do this using WebView, which in my case I do not want.
I do have an idea for solving the problem, which is making an unrendered WebView and using that. However, I am sure there is one better than that.
Thank you in advance.
"I have been looking for the solution for days, but all I managed to find was how to do this using WebView, which in my case I do not want."
You need something that knows how to render Web pages. Nothing else is going to do what you want (create a DOM, then run JavaScript that modifies that DOM).
"I do have an idea for solving the problem, which is making an unrendered WebView and using that"
That is one solution. Or, you could play around with Firefox's GeckoView, which will do similar stuff, just with their own rendering engine.
"However, I am sure there is one better than that."
You can build your own Web browser from scratch.

Launching a website from within a program, and inputting data to specific fields

Although I've been programming for a few years, I've only really dabbled in the web side of things; it's been more application-based work for computers up until now. I was wondering, in Java for example, what library-defined or self-defined function I would use to have a program launch a web browser to a certain site? As an extension to this, how could I have it find a certain field on the website, like a search box for instance (if it wasn't the current target of the cursor), and then populate it with a string and submit it to the server? (Maybe this is a kind of find-by-ID scenario?!)
Also, is there a way to control whether this is visible to the user or not? What I mean is, if I want to do something as a background task while the user carries on using the program, I will want the program to submit data to a webpage without the whole visual side of things that would interrupt the user.
This may be basic but like I say, I've never tried my hand at it so perhaps if someone could just provide some rough code outlines I'd really appreciate it.
Many thanks
I think Selenium might be what you are looking for.
Selenium allows you to start a Web browser, launch it to a certain website and interact with it. Also, there is a Java API (and a lot of other languages, by the way) allowing you to control the launched browser from a Java application.
There is some tweaking to do, but you can also run Selenium in the background, using a headless web browser.
As I understand it, you want to submit data to a server via the existing web interface?
In that case you need to find out how the URL for the request is built and then make an HTTP call using the corresponding URL.
I advise reading this if it involves a POST submit
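As a rough outline of the HTTP-call approach, the sketch below builds a form POST with the JDK's own java.net.http API (Java 11 or later). The class name FormPost, the URL, and the field names are placeholders; you would find the real ones by inspecting the form on the target page or watching the request in the browser's network tab.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

class FormPost {
    /** Encodes the fields as application/x-www-form-urlencoded, in map iteration order. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) a POST request carrying the encoded form. */
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
            .build();
    }
}
```

To actually submit it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Nothing is shown to the user, which covers the "background task" part of the question, although sites that rely on JavaScript or hidden tokens in the form may need more than this.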

Script to take web survey for me

I had to take a surveymonkey survey today, and the format was as follows: a question was asked, then after hitting the next button, the answer was displayed as "Answer: _" along with an explanation. For kicks, I'd like to make a program that could take this survey, answering any letter, then going to the next page and reading the answer, then going back and changing the answer to the correct one, then going 2 pages ahead and repeating.
I am familiar with Java and Python, but I'm not sure how to make them be able to "know" where the button is, and how to "read" text without unnecessary image recognition.
This is just a fun project, nothing serious, but I would appreciate any ideas to get me started.
Assuming that the text was just that (text rather than images), there are a few useful tools for you:
.Net WebControl - I've scripted this before from .Net. It has the advantage of making all of the JS on the page still work. I know this isn't Java, but it is surprisingly easy to work with for this kind of task.
Selenium - It is primarily a web testing framework, but it would be easy to script it from Java to auto-submit forms.
TagSoup for Java - If the pages do not have significant javascript code that needs to run, there are many HTML parsers for Java that could potentially be used to develop a scraper.
Would it be unrealistic to make it post to the SurveyMonkey pages? You could then use some regexes to pull "Answer: _" out and look for that pattern in the original page. It would definitely be easier than trying to click things in a browser, etc. Basically, write a Java app (or Python, for that matter) that does HTTP posts to the survey pages in order, uses regexes to find the next page, etc., and then uses a stack to keep track of the history.
Edit: if this isn't clear, let me know and I'll clarify.
Edit 2: I completely forgot about HtmlUnit, my bad. It is a testing framework like the one suggested by jsight, but specifically for Java, and it functions very similarly to JUnit. Because it is designed for testing web applications, it can also be used to automate interactions with other sites.
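The regex part of this idea could look something like the sketch below. It assumes the page shows the solution as the literal text "Answer: " followed by a single letter, as described in the question; the class name AnswerExtractor is made up, and the pattern would need adjusting to whatever the real markup looks like.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnswerExtractor {
    // Assumption: the solution appears as "Answer: X" where X is a single letter.
    private static final Pattern ANSWER = Pattern.compile("Answer:\\s*([A-Za-z])\\b");

    /** Returns the answer letter in upper case, or empty if the page has none. */
    static Optional<String> extract(String pageText) {
        Matcher m = ANSWER.matcher(pageText);
        return m.find() ? Optional.of(m.group(1).toUpperCase()) : Optional.empty();
    }
}
```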
You can do it using a simple image search. First, screenshot a unique part of the button and save it. This will be used as the reference for where to click the mouse. Then, while the application is actually running, take a screenshot of the entire screen, find the part matching the previously saved image, and let the mouse click on the appropriate location based on where the button image was found.
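A bare-bones version of that search in Java might look like this. It is only a sketch: it does an exact pixel-by-pixel comparison, which breaks as soon as the button is scaled or anti-aliased differently, and the class name ImageFinder is made up for illustration.

```java
import java.awt.image.BufferedImage;

class ImageFinder {
    /** Returns {x, y} of the first exact match of template inside screen, or null if none. */
    static int[] find(BufferedImage screen, BufferedImage template) {
        int maxX = screen.getWidth() - template.getWidth();
        int maxY = screen.getHeight() - template.getHeight();
        // Try every position where the template still fits inside the screen.
        for (int y = 0; y <= maxY; y++) {
            for (int x = 0; x <= maxX; x++) {
                if (matchesAt(screen, template, x, y)) {
                    return new int[] {x, y};
                }
            }
        }
        return null;
    }

    private static boolean matchesAt(BufferedImage screen, BufferedImage template, int ox, int oy) {
        for (int ty = 0; ty < template.getHeight(); ty++) {
            for (int tx = 0; tx < template.getWidth(); tx++) {
                if (screen.getRGB(ox + tx, oy + ty) != template.getRGB(tx, ty)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

java.awt.Robot can supply the screenshot (createScreenCapture) and perform the click (mouseMove, mousePress, mouseRelease) once you have the coordinates.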
