HtmlUnit in Java: "browser is too old" error

I'm trying to write an application that automatically updates page content (inside my account). I use HtmlUnit because it supports JavaScript.
But I ran into a "your browser is too old" problem.
My code:
public static void main(String[] args) {
    Locale.setDefault(Locale.ENGLISH);
    try (final WebClient client = new WebClient(BrowserVersion.FIREFOX_45)) {
        client.getOptions().setUseInsecureSSL(true);
        client.setAjaxController(new NicelyResynchronizingAjaxController());
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.getOptions().setThrowExceptionOnScriptError(false);
        client.waitForBackgroundJavaScript(30000);
        client.waitForBackgroundJavaScriptStartingBefore(30000);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setRedirectEnabled(true);
        HtmlPage page = client.getPage("https://passport.yandex.ru/passport?mode=auth&retpath=https://toloka.yandex.ru/?ncrnd=5645");
        HtmlForm form = page.getForms().get(0);
        HtmlInput inputLogin = form.getInputByName("login");
        inputLogin.setValueAttribute(userName);
        HtmlInput inputPassw = form.getInputByName("passwd");
        inputPassw.setValueAttribute(password);
        DomElement button = page.getElementsByTagName("button").get(0);
        HtmlPage page2 = button.click();
        System.out.println(page2.asXml());
    }
    catch (IOException e) {
    }
}
The login succeeds, but I can't load the second page (it should redirect to the content page).
The response I get:
<h1 style="padding-top: 20px;">Browser is too old</h1>
<p>
Unfortunately you are using an old browser.
Please, upgrade to at least IE10 or use one of the modern browsers, e.g.
Yandex.Browser,
Google Chrome or
Mozilla Firefox
</p>
How can I solve it? Thanks.

There is no simple solution for your problem, but there are some things you can do:
- Use the latest snapshot build of HtmlUnit (http://htmlunit.sourceforge.net/gettingLatestCode.html).
- Try different simulated browsers (e.g. Chrome).
- Clean up your client settings and set only the options you need (in your case maybe only setUseInsecureSSL(true)).
- waitForBackgroundJavaScript and waitForBackgroundJavaScriptStartingBefore are not options; calling them during client setup is useless.
- Check your log; maybe there are some hints about unsupported JavaScript methods.
- Place the call to waitForBackgroundJavaScript after the button click; maybe the redirect is done by some JavaScript with a small delay:
HtmlPage page2 = button.click();
client.waitForBackgroundJavaScript(30000);
And because the JavaScript might have changed the page content, you have to get the page content again:
page2 = page2.getEnclosingWindow().getEnclosedPage();
Usually the checks for the browser version are done by some JavaScript magic. Maybe the trick used by your web site is not (correctly) supported or emulated by HtmlUnit. If you are able to find the root cause, you can file a bug (see http://htmlunit.sourceforge.net/submittingJSBugs.html for some hints on how to track it down).

Related

Jsoup - read from an html url where code is hidden

I'm trying to use the jsoup library to get the 'li' elements from a website. The problem is this:
If I open the page source with CTRL+U (which is the same as what jsoup reads), the 'ul' tag is hidden.
If I open the code with Google Chrome's "Inspect" function, the 'li' elements are shown.
Posting the code is not necessary; I only want to know how I can access these 'li' elements with jsoup or other free Java libraries, given that in the source code (and through jsoup) this information is hidden.
The site is https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco; try searching for something (e.g. Tachi).
The problem with Jsoup is that it won't handle scripts. It is just getting html as it is before the AJAX code is executed.
You can use something like HtmlUnit, which is basically a GUI-less browser. So, it can handle scripts.
You can try something like this after getting the HtmlUnit library:
String url = "https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco?search=Tachi";
try (final WebClient webClient = new WebClient()) {
    final HtmlPage page = webClient.getPage(url);
    final HtmlUnorderedList list = page.getHtmlElementById("ul_farm_results");
    System.out.println(list.asText());
}
I couldn't check the code because the website's certificate is improperly configured and I didn't want to import its certificate. You may want to take a look at this to resolve the certificate errors.
Jsoup does not execute scripts; it just gets the HTML returned by the server. What you are looking for is called rendered HTML, that is, the HTML produced by the browser after executing all the scripts.
The best solution in Java is to use Selenium with your preferred browser. Selenium was developed for UI testing; it is, however, very popular as a scraping tool.
A good getting started page is to be found here.
Some code example with Firefox:
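WebDriver driver = new FirefoxDriver();
try {
    driver.get("https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco");
    // Find the element
    String id = "ul_farm_results";
    WebElement element = driver.findElement(By.id(id));
    // Print the text of the rendered list
    System.out.println(element.getText());
} finally {
    driver.quit(); // always close the browser
}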
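To illustrate why jsoup comes back empty-handed, here is a minimal sketch that uses only the JDK. The RAW_HTML string is a made-up stand-in for what the server might send (an empty list plus a script that fills it later); it is not the real page. The class and method names are likewise only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RawVersusRenderedHtml {

    // Hypothetical server response: the list is empty and a script fills it later.
    static final String RAW_HTML = String.join("\n",
        "<html><body>",
        "<ul id=\"ul_farm_results\"></ul>",
        "<script>",
        "  // a real page would request the results via AJAX and append list items here",
        "</script>",
        "</body></html>");

    // Counts opening li tags in a piece of markup (good enough for a demo, not a parser).
    static int countListItems(String html) {
        Matcher m = Pattern.compile("<li[\\s>]").matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // prints "li elements in raw HTML: 0"
        System.out.println("li elements in raw HTML: " + countListItems(RAW_HTML));
    }
}
```

A library that does not run the script only ever sees this raw markup, so the items exist only after a browser (or a tool that executes JavaScript) has rendered the page.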

HtmlUnit does not find the element

I'm trying to get the text box with id u_0_1e from the page wall, but HtmlUnit does not find anything. The last line prints null.
Here's the code:
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
JavaScriptEngine engine = new JavaScriptEngine(client);
client.setJavaScriptEngine(engine);
HtmlPage home = client.getPage("https://www.facebook.com/login.php");
HtmlSubmitInput login = (HtmlSubmitInput) home.getElementById("u_0_1");
HtmlTextInput name = (HtmlTextInput) home.getElementById("email");
HtmlPasswordInput pass = (HtmlPasswordInput) home.getElementById("pass");
name.setValueAttribute("myname");
pass.setValueAttribute("mypass");
HtmlPage page = login.click();
HtmlPage wall = client.getPage("https://www.facebook.com/");
System.out.println(wall.getElementById("u_0_1e"));
I have some comments about your issue.
First of all, you have disabled HtmlUnit's logging. So if you have any JavaScript issue then you are not going to see it. If you are actually getting a JavaScript error then the JavaScript code won't be fully executed. If the element you're trying to fetch was dynamically fetched from the server (probably using AJAX) then the JavaScript errors, if any, might result in that element not being fetched.
If you are web scraping, which is clearly the case, then you don't have any control over the JS, so you can only accept that it does not work, or disable JS and manually process the AJAX requests.
Of course, you will see the page perfectly working on a real browser but take into consideration that the JavaScript engine HtmlUnit uses is different from the real browsers.
Secondly, the two lines containing the word engine are absolutely unneeded.
Thirdly, as I mentioned in a previous question of yours, this will be more suitable to be handled by means of the Facebook API.
Finally, you might find this other answer useful:
JavaScript not being properly executed in HtmlUnit
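The first point above can be sketched as follows: instead of switching the logger off, lower it to WARNING so that JavaScript problems remain visible. This assumes HtmlUnit's output ends up in java.util.logging, as the question's own code assumes; the class name is only for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class HtmlUnitLogging {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost.
    static final Logger HTMLUNIT_LOGGER = Logger.getLogger("com.gargoylesoftware");

    // Reduce the noise but keep warnings and errors visible.
    static void configure() {
        HTMLUNIT_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        configure();
        System.out.println(HTMLUNIT_LOGGER.getLevel()); // prints "WARNING"
    }
}
```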

Prevent HtmlUnit 2.13 from executing JavaScript

Here is my code to get the page:
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url);
The problem is that the webClient always executes JavaScript automatically and throws a list of errors at me. I just want to get the raw source. How can I prevent it from executing scripts? I've found there is a way in version 2.9:
webClient.setJavaScriptEnabled(false);
But the setJavaScriptEnabled() method was deprecated. Does anyone know how to solve this problem? Please help me. Thank you so much.
Although setJavaScriptEnabled(boolean) was deprecated on WebClient, it was added to the WebClientOptions member of the WebClient. Here is the doc.
In order to disable JavaScript you should do this:
webClient.getOptions().setJavaScriptEnabled(false);
Additionally, if you want to get the raw HTML code from the webpage, you should take a look at this question:
How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
Take into account that even the asXml() method changes the formatting as well as the content of the web page you fetch (even if JavaScript is disabled).

Java HtmlUnit form.getInputByValue("Login Now!").click();

I am trying to make an application that connects to a site with the login provided by the user. I don't have any experience interacting with websites in Java, so I googled a bit and found that HtmlUnit fits my needs.
But I ran into an error when trying to click the submit button for the login form:
public static boolean attempt_login(String username, String password) throws ElementNotFoundException, IOException {
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
    webClient.setJavaScriptEnabled(false);
    webClient.setThrowExceptionOnScriptError(false);
    webClient.setRefreshHandler(new RefreshHandler() {
        public void handleRefresh(Page page, URL url, int arg) throws IOException {
            System.out.println("handleRefresh");
        }
    });
    HtmlPage page = (HtmlPage) webClient.getPage(Config.LOGIN_PAGE);
    List<HtmlForm> forms = page.getForms();
    HtmlForm form = null;
    for (HtmlForm f : forms) {
        if (f.getId().equals("login_form")) {
            form = f;
        }
    }
    if (form == null) {
        throw new NullPointerException("Could not find form!");
    }
    form.getInputByName("username").setValueAttribute(username);
    form.getInputByName("password").setValueAttribute(password);
    page = (HtmlPage) form.getInputByValue("Login Now!").click();
    System.out.println(page.asText());
    return false;
}
Somehow it fails to find the submit button used to log in:
com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[input] attributeName=[value] attributeValue=[Login Now!]
at com.gargoylesoftware.htmlunit.html.HtmlForm.getInputByValue(HtmlForm.java:737)
at domain.Helper.attempt_login(Helper.java:41)
at TesterStartUp.main(TesterStartUp.java:15)
The html source code:
<button type="submit" value="Login Now!" onmouseover="this.style.backgroundPosition='bottom';" onmouseout="this.style.backgroundPosition='top';" onclick="return SetFocus();">Login Now!</button>
When I googled for a solution, I found a suggestion that disabling JavaScript would help. So I told the web client to disable it (webClient.setJavaScriptEnabled(false);) but still got the exception.
At first I had the same issue when trying to select the form ("login_form"), but there was a method that let me get the list of all forms and then check whether one of them matched. I couldn't find a similar workaround for the submit button, so I hope someone else knows a solution to this problem.
Thanks in advance,
Sir Troll
The HtmlUnit getInputByValue() method operating on a <form> will only return types of HtmlInput, and the only Input Button type -- HtmlButtonInput -- represents <input type="button"> and NOT <button>. You will need to change your HTML or use a different HtmlUnit method call.
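The difference can be reproduced with the JDK's own XPath support. This is only a sketch: FORM is a simplified, well-formed stand-in for the login form in the question, and the class and method names are made up. In HtmlUnit itself, a comparable expression could probably be passed to page.getFirstByXPath(...) to locate the button.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class ButtonVersusInput {

    // Simplified stand-in for the form in the question (assumed structure).
    static final String FORM =
        "<form id=\"login_form\">"
        + "<input type=\"text\" name=\"username\"/>"
        + "<input type=\"password\" name=\"password\"/>"
        + "<button type=\"submit\" value=\"Login Now!\">Login Now!</button>"
        + "</form>";

    // Returns how many nodes in FORM match the given XPath expression.
    static int count(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(FORM.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//input[@value='Login Now!']"));  // prints 0: no input has that value
        System.out.println(count("//button[@value='Login Now!']")); // prints 1: the button element does
    }
}
```

Searching by element type "input" can never match the submit control here, because the page uses a button element; a lookup that names the button element finds it.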
I had same kind of issue that I resolved myself:
When you try to log in using HtmlUnit, first check whether the login box only appears after a click handled by jQuery. If the login box/div appears after clicking a button, then you need to refresh the page (web client) to access the new input elements:
HtmlPage page = webClient.getPage("https://www.yourwebsite.com/#");
HtmlAnchor link=page.getElementByName("link");
link.click();
page.refresh();
If your site has a button instead of an anchor link, use an HTML button element there instead. After refreshing, you may be able to find the element.
I hope this helps and solves your issue.

How to fill HTTP forms through java?

I want to fill in a text field of an HTTP form through Java and then click the submit button through Java, so as to get the page source of the document returned after submitting the form.
I can do this by sending an HTTP request directly, but I don't want to do it that way.
I usually do it using HtmlUnit. They have an example on their page :
@Test
public void submittingForm() throws Exception {
    final WebClient webClient = new WebClient();
    // Get the first page
    final HtmlPage page1 = webClient.getPage("http://some_url");
    // Get the form that we are dealing with and within that form,
    // find the submit button and the field that we want to change.
    final HtmlForm form = page1.getFormByName("myform");
    final HtmlSubmitInput button = form.getInputByName("submitbutton");
    final HtmlTextInput textField = form.getInputByName("userid");
    // Change the value of the text field
    textField.setValueAttribute("root");
    // Now submit the form by clicking the button and get back the second page.
    final HtmlPage page2 = button.click();
}
And you can read more here.
If you don't want to talk HTTP directly (why?), then take a look at Watij.
It allows you to invoke a browser (IE) as a COM control within your Java process, navigate through page elements by using their document ids etc., fill in forms and press buttons. Because it's running a browser, JavaScript will run as normal (as if you were doing this manually).
You would probably need to write a Java Applet, as the only other way than sending a direct request would be to have it interface with the browser.
Of course, for this to work, you would have to embed the applet in the page. If you don't control the page, this can't be done. If you do control the page, you might as well be using Javascript, instead of trying to get a Java Applet to do it, which would be much more cumbersome and difficult.
Just to clarify, what is the problem you are having creating an HTTP Request and why do you want to use a different method?
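For comparison, the direct HTTP route mentioned in the question mostly comes down to encoding the form fields into a request body. Here is a minimal sketch using only the JDK (Java 10 or later for this URLEncoder overload); the field names and values are hypothetical and the class name is only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormBody {

    // Encodes name/value pairs as an application/x-www-form-urlencoded body.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("userid", "root");
        fields.put("comment", "a b&c");
        System.out.println(encode(fields)); // prints "userid=root&comment=a+b%26c"
    }
}
```

The resulting string would be sent as the body of a POST request with the Content-Type header set to application/x-www-form-urlencoded. Unlike HtmlUnit or a real browser, this approach runs no JavaScript on the page.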
