Why doesn't HtmlUnit work on this HTTPS webpage? - java

I'm trying to learn more about HtmlUnit and am running some tests at the moment. I am trying to get basic information such as the page title and text from this site:
https://....com (removed the full url, important part is that it is https)
The code I use is this, which is working fine on other websites:
final WebClient webClient = new WebClient();
final HtmlPage page;
page = (HtmlPage)webClient.getPage("https://medeczane.sgk.gov.tr/eczane/login.jsp");
System.out.println(page.getTitleText());
System.out.println(page.asText());
Why can't I get this basic information? If it is because of security measures, what are the specifics, and can I bypass them? Thanks.
Edit: Hmm, the code stops working after webClient.getPage(); "test2" is never printed, so I cannot check whether page is null or not.
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_2);
final HtmlPage page;
System.out.println("test1");
try {
page = (HtmlPage)webClient.getPage("https://medeczane.sgk.gov.tr/eczane/login.jsp");
System.out.println("test2");
} catch (Exception e) {
e.printStackTrace();
}

I solved this by adding this line of code:
webClient.setUseInsecureSSL(true);
which is the deprecated way of disabling SSL certificate checks. In current HtmlUnit versions you have to use:
webClient.getOptions().setUseInsecureSSL(true);
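For context, setUseInsecureSSL(true) amounts to accepting any server certificate. A rough stdlib sketch of the idea follows (a trust-all X509TrustManager; HtmlUnit's actual implementation may differ, and this is exactly why the option is meant for testing only, never production):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

public class InsecureSsl {
    // Build an SSLContext that accepts any certificate chain.
    // This is roughly what "insecure SSL" amounts to; never use in production.
    static SSLContext trustAllContext() throws Exception {
        TrustManager trustAll = new X509TrustManager() {
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { trustAll }, new SecureRandom());
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        SSLContext ctx = trustAllContext();
        System.out.println(ctx.getProtocol()); // prints "TLS"
    }
}
```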

I think that this is an authentication problem: if I go to that page in Firefox I get a login box.
Try
webClient.setAuthentication(realm, username, password);
before the call to getPage().
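If HtmlUnit ends up not being involved at all (say you fetch the page with the plain JDK HTTP stack instead), HTTP Basic credentials can also be supplied through the JDK's own java.net.Authenticator. A minimal sketch, with placeholder credentials:

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class BasicAuthSetup {
    // Register a process-wide authenticator; the JDK's HTTP stack consults
    // it whenever a server answers 401 with a Basic challenge.
    public static void install(final String username, final char[] password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(username, password);
            }
        });
    }

    public static void main(String[] args) {
        install("user", "secret".toCharArray()); // placeholder credentials
    }
}
```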

Related

Is there any way to access and retrieve data from the web in the background while running a java project?

I am trying to retrieve data from a website; however, I want this to run in the background.
I have already managed to use a Chrome extension, but it always opens up a Chrome tab and displays the underlying actions.
Is it possible to retrieve data from the web without having to see the open Chrome browser?
This is what I have so far:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
HtmlPage page = webClient.getPage("https://en.wikipedia.org/wiki/Main_Page");
String pageContent = page.asText();
System.out.println(pageContent);
Sure you can. You can spawn a thread and run it in the background.
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
new Thread(() -> {
    try {
        HtmlPage page = webClient.getPage("https://en.wikipedia.org/wiki/Main_Page");
        String pageContent = page.asText();
        System.out.println(pageContent);
    } catch (IOException e) {
        e.printStackTrace();
    }
}).start();
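The same background pattern can also be written with an ExecutorService, which additionally hands back a Future so the caller can collect the page text later. A sketch with a placeholder task standing in for the real webClient.getPage(...) call:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundFetch {
    // Run a fetch task off the main thread and hand back a Future.
    // The Callable body is a placeholder; in the real code it would call
    // webClient.getPage(url) and return page.asText().
    static Future<String> fetchInBackground(ExecutorService pool, Callable<String> fetchTask) {
        return pool.submit(fetchTask);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = fetchInBackground(pool, () -> "page text placeholder");
        System.out.println(result.get()); // blocks until the background task finishes
        pool.shutdown();
    }
}
```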

HtmlUnit can't find forms on website

At the following website I try to access the login and password forms with HtmlUnit: https://zof.interreport.com/diveport#
However this very simple javascript returns an empty list [].
void homePage() throws Exception{
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
final HtmlPage page = webClient.getPage("https://zof.interreport.com/diveport#");
System.out.println(page.getForms());
}
So somehow HtmlUnit doesn't recognize the forms on the page. How can I fix this?
At first: you only show some Java code but you talk about javascript - is there anything missing?
Regarding the form: the page you are trying to test is one of those pages that do their work on the client side. This implies that after the page is loaded, the real page/DOM is created inside your browser by invoking JavaScript. When using HtmlUnit you have to take care of that. In simple cases it is sufficient to wait for the JavaScript to be processed.
This code works for me:
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
final HtmlPage page = webClient.getPage("https://zof.interreport.com/diveport#");
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
System.out.println(page.getForms());
Take care to use the latest SNAPSHOT build of HtmlUnit.
I have not worked on that API, but here is the trick:
Open the same page in your browser with JavaScript disabled. It does not work.
This means the page loads its content using some JavaScript DOM operations.
If you cannot get the HTML here, there must be some way out in the API you are using.
Check the HtmlUnit API documentation (the class Javadoc). There is the method
public ScriptResult executeJavaScript(String sourceCode)
The key here is that the API you are using will not execute the JavaScript on its own, and you have to code for it.

Java Logging in to a website that uses complex javascript

I'd first like to start by saying that I've managed this using PhantomJS and Selenium: I load PhantomJS, load the URL (sports.coral.co.uk), and then check my balance. I am, however, trying to find a more lightweight option.
I have tried manually sending HTTP GET/POST requests using Apache's HttpClient. Monitoring the login process, using Postman for Chrome, shows 4 requests sent once the login button has been pressed. I have tried editing and re-sending them using Postman. However, from what I can tell there's a requestID that gets sent along with the requests. This is generated using the JavaScript on the page:
var requestId = (new Date().getTime()) + Math.round(Math.random() * 1000000);
var failedTimer = setTimeout('iapiRequestFailed(' + requestId + ')', iapiConf['loginDomainRetryInterval'] * 1000);
iapiRegisterRequestId(requestId, iapiCALLOUT_MESSAGES, failedTimer, request[3], request[4], request[5]);
return;
It looks like the original ID is a randomly generated number that then gets registered using another JavaScript function. I'm guessing the login is partly failing due to me not being able to provide an acceptable requestID. When I re-send the old requests the user is partly logged in. Once I click on my account it says an error occurred. The only explanation would be the requestID.
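The JavaScript formula above is easy to reproduce in Java if you want to generate a compatible-looking requestId yourself; whether the server accepts a client-forged ID (rather than one registered via iapiRegisterRequestId) is a separate question. A sketch:

```java
public class RequestId {
    // Java mirror of: (new Date().getTime()) + Math.round(Math.random() * 1000000)
    // i.e. current epoch millis plus a random offset in [0, 1000000].
    static long newRequestId() {
        return System.currentTimeMillis() + Math.round(Math.random() * 1_000_000);
    }

    public static void main(String[] args) {
        System.out.println(newRequestId());
    }
}
```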
I then decided to give HtmlUnit a go. This seems like the type of thing I require. I did some research on using HttpClient with a javascript engine, such as Rhino and it seems HtmlUnit is the tool for that.
Before I even try to log in to the page, I get errors caused by the javascript on the page.
Here's the simple bit of code I use to connect to the page:
@Test
public void htmlunit() throws Exception {
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
HtmlPage page = client.getPage("http://sports.coral.co.uk");
System.out.println(page.asText());
client.close();
}
When I comment out the LogFactory bit I can see that there are loads of Warnings thrown,
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Feb 09, 2016 4:33:34 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded
WARNING: Script is not JavaScript (type: application/ld+json, language: ). Skipping execution. etc...
I'm guessing this means that HtmlUnit isn't compatible with the JavaScript that's being executed on the page?
I'm not very good with JavaScript, and the scripts on the page are obfuscated, which makes them even harder to read. What I don't understand is: why does the JS get executed without error when using PhantomJS or ChromeDriver, but not HtmlUnit? Is it because the Rhino engine isn't good enough to execute it? Am I missing something obvious?
This code will turn off all the JavaScript warnings caused by the HtmlUnit library itself and not by your code.
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setThrowExceptionOnScriptError(false);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
HtmlPage page = client.getPage("http://sports.coral.co.uk");

How to enable Flash in HtmlUnit?

I'm trying to grab HTML contents with HtmlUnit. Everything went well, but I couldn't get the Flash content that is visible as <img> where it's actually an <object>. I have:
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setActiveXNative(true);
webClient.getOptions().setAppletEnabled(true);
webClient.getOptions().setCssEnabled(true);
In some places on SO I found people saying HtmlUnit doesn't support Flash, but those answers seem old, so I am raising this question. Someone please help.
Thanks.
I found it with some help.
For this I had to downgrade my HtmlUnit version from 2.15 to 2.13, as in 2.15 BrowserVersionFeatures.JS_FRAME_RESOLVE_URL_WITH_PARENT_WINDOW seems deprecated and I don't know what feature replaced it.
private static BrowserVersion firefox17WithUptoDateFlash = new BrowserVersion(
        BrowserVersion.FIREFOX_17.getApplicationName(),
        BrowserVersion.FIREFOX_17.getApplicationVersion(),
        BrowserVersion.FIREFOX_17.getUserAgent(),
        BrowserVersion.FIREFOX_17.getBrowserVersionNumeric(),
        new BrowserVersionFeatures[] {
                BrowserVersionFeatures.JS_FRAME_RESOLVE_URL_WITH_PARENT_WINDOW,
                BrowserVersionFeatures.STYLESHEET_HREF_EXPANDURL,
                BrowserVersionFeatures.STYLESHEET_HREF_STYLE_NULL
        });
static {
PluginConfiguration plugin1 = new PluginConfiguration(
"Shockwave Flash",
"Shockwave Flash 11.4 r402",
"NPSWF32_11_4_402_287.dll");
plugin1.getMimeTypes().add(new PluginConfiguration.MimeType(
"application/x-shockwave-flash",
"Adobe Flash movie",
"swf"));
firefox17WithUptoDateFlash.getPlugins().add(plugin1);
}
final WebClient webClient = new WebClient(firefox17WithUptoDateFlash);
This newly defined browser version allows HtmlUnit to act as a Flash-enabled GUI-less browser.

How to open IE from Java and perform operations like click() etc. through Java?

I want to log in to a website through Java and perform operations, like clicking, adding text to a text field, etc.
I suggest using a testing framework like HtmlUnit. Even though it's designed for testing, it's a perfectly good programmatic "navigator" of remote websites.
Here's some sample code from the site, showing how to navigate to a page and fill in a form:
public void submittingForm() throws Exception {
WebClient webClient = new WebClient();
HtmlPage page1 = webClient.getPage("http://some_url");
HtmlForm form = page1.getFormByName("myform");
HtmlSubmitInput button = form.getInputByName("submitbutton");
HtmlTextInput textField = form.getInputByName("userid");
textField.setValueAttribute("root");
HtmlPage page2 = button.click();
}
You could launch it with
Runtime.getRuntime().exec("command-line command to launch IE");
and then use Java's Robot class to send mouse clicks and fill in text. This seems rather crude, though, and you can probably do better by communicating directly with the web server (bypassing the browser entirely).
This question's answers may be helpful.
But you should consider direct HTTP as a better way to interact with websites.
You could also use WebTest from Canoo, which actually uses HtmlUnit but with an extra layer on top of it. It should be easier to get started with due to the scripting layer, and it comes with additional abstractions for sending mails, verifying output, etc.
http://webtest.canoo.com/webtest/manual/WebTestHome.html
You might as well try Selenium. It's free and has a fairly nice wrapper for IE.
If you really need a 'real' IE you could try Watij; if you just need browser features in Java, I recommend HttpClient.
Update: as the OP indicated using a real browser was not needed/wanted. An example of a form login using HttpClient can be found here: https://github.com/apache/httpcomponents-client/blob/master/httpclient5/src/test/java/org/apache/hc/client5/http/examples/ClientFormLogin.java
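A form login over plain HTTP ultimately boils down to POSTing a URL-encoded body. A minimal sketch of building such a body with the JDK alone (the field names here are hypothetical and must match the site's actual form):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {
    // Encode form fields as application/x-www-form-urlencoded,
    // the body format a browser sends for a plain HTML form POST.
    static String encode(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "root");      // hypothetical field names
        fields.put("password", "p&ss word");
        System.out.println(encode(fields)); // prints "username=root&password=p%26ss+word"
    }
}
```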
