I'm having a weird problem with HtmlUnit in Java. I'm using it to download some data from a website; the process is roughly this:
1 - Login
2 - For each car:
----- 3 - Search for the car
----- 4 - Download a zip file from a link
The code:
Creation of the webclient:
webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
webClient.setJavaScriptEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
DefaultCredentialsProvider provider = new DefaultCredentialsProvider();
provider.addCredentials(USERNAME, PASSWORD);
webClient.setCredentialsProvider(provider);
webClient.setRefreshHandler(new ImmediateRefreshHandler());
Login:
public void login() throws IOException
{
page = (HtmlPage) webClient.getPage(URL);
HtmlForm form = page.getFormByName("formLogin");
String user = USERNAME;
String password = PASSWORD;
// Enter login and password
form.getInputByName("LoginSteps$UserName").setValueAttribute(user);
form.getInputByName("LoginSteps$Password").setValueAttribute(password);
// Click Login Button
page = (HtmlPage) form.getInputByName("LoginSteps$LoginButton").click();
webClient.waitForBackgroundJavaScript(3000);
// Click on Campa area
HtmlAnchor link = (HtmlAnchor) page.getElementById("ctl00_linkCampaNoiH");
page = (HtmlPage) link.click();
webClient.waitForBackgroundJavaScript(3000);
System.out.println(page.asText());
}
Search for car in website:
private void searchCar(String _regNumber) throws IOException
{
// Open search window
page = page.getElementById("search_gridCampaNoi").click();
webClient.waitForBackgroundJavaScript(3000);
// Write plate number
HtmlInput element = (HtmlInput) page.getElementById("jqg1");
element.setValueAttribute(_regNumber);
webClient.waitForBackgroundJavaScript(3000);
// Click on search
HtmlAnchor anchor = (HtmlAnchor) page.getByXPath("//*[@id=\"fbox_gridCampaNoi_search\"]").get(0);
page = anchor.click();
webClient.waitForBackgroundJavaScript(3000);
System.out.println(page.asText());
}
Download the zip of PDFs:
try
{
InputStream is = _link.click().getWebResponse().getContentAsStream();
File path = new File(new File(DOWNLOAD_PATH), _regNumber);
if (!path.exists())
{
path.mkdir();
}
writeToFile(is, new File(path, _regNumber + "_pdfs.zip"));
}
catch (Exception e)
{
e.printStackTrace();
}
}
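The snippet above calls a writeToFile helper that isn't shown. A minimal sketch of what it might look like (the name and signature are inferred from the call above; the body is my assumption, not the asker's code):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DownloadUtil {
    // Hypothetical implementation: copy the response stream to the target
    // file in chunks, closing both streams when done.
    public static void writeToFile(InputStream in, File target) throws IOException {
        try (InputStream is = in;
             OutputStream out = new FileOutputStream(target)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = is.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```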
The problem:
The first car works okay and its zip is downloaded, but as soon as I search for a new car, when I get to this line:
page = page.getElementById("search_gridCampaNoi").click();
I get this exception:
Exception in thread "main" java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
After debugging, I've realized that the moment I make this call:
InputStream is = _link.click().getWebResponse().getContentAsStream();
the return type of page.getElementById("search_gridCampaNoi").click() changes from HtmlPage to UnexpectedPage, so instead of receiving a new page, I'm receiving the file that I already downloaded again.
(Debugger screenshots, not reproduced here, show the first call returning an HtmlPage and the second call returning an UnexpectedPage instead.)
Thanks in advance!
Just in case someone encounters the same problem, I found a workaround. Changing the line:
InputStream is = _link.click().getWebResponse().getContentAsStream();
to
InputStream is = _link.openLinkInNewWindow().getWebResponse().getContentAsStream();
seems to do the trick: opening the link in a new window leaves the original window's HtmlPage in place instead of replacing it with the downloaded file. I'm still having problems when doing several iterations (sometimes it works, sometimes it doesn't), but at least I have something now.
Related
I have written a simple program which should log in via a form on a website.
Unfortunately, the form in the HTML has no name or id.
I use the latest version of HtmlUnit and Java 11.
I tried to find the form with the .getForms() method, but without success.
(The HTML snippet from the website I'm trying to log in to is not reproduced here.)
Here is my code to find the form:
//Get the form
HtmlForm form = LoginPage.getFormByName("I tried several options here");
//Get the Submit button
final HtmlButton loginButton = form.getButtonByName("Anmelden");
//Get the text fields for password and username
final HtmlTextInput username = form.getInputByName("text");
final HtmlTextInput password = form.getInputByName("password");
Whatever I tried, I didn't find any form.
This is my connection class if it helps:
public HtmlPage CslPlasmaConnection(){
//Create Webclient to connect to CslPlasma
WebClient CslPlasmaConnection = new WebClient(BrowserVersion.BEST_SUPPORTED);
//helper variable ini with null
HtmlPage CslPlasmaLoginPage = null;
//Get the content from CslPlasma
try {
CslPlasmaLoginPage = CslPlasmaConnection.getPage(URL);
} catch (IOException e) {
e.printStackTrace();
}
//Return CslPlasma Login Page
return CslPlasmaLoginPage;
}
Without knowing the page I can only guess...
Have a look at this answer https://stackoverflow.com/a/54188201/4804091
And try to use the latest page (maybe there is some js that creates the form).
webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10000);
HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
If you're sure this is the only form on the page or you know which form number it is, you can use page.getForms() to get all forms of the page and get yours from the resulting list.
Like so:
HtmlForm form = LoginPage.getForms().get(0); // if it's the only form, its index is 0
I am trying to get table data from the following URL:
Get Data from this URL
and I wrote this code with the help of the Jaunt API:
package org.open.browser;
import com.jaunt.Element;
import com.jaunt.Elements;
import com.jaunt.JauntException;
import com.jaunt.UserAgent;
public class ICICIScraperDemo {
public static void main(String ar[]) throws JauntException{
UserAgent userAgent = new UserAgent(); //create new userAgent (headless browser)
userAgent.visit("https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec");
Elements links = userAgent.doc.findEvery("<div class=expander>").findEvery("<a>"); //find search result links
String url = null;
for(Element link : links) {
if(link.innerHTML().equalsIgnoreCase("Company Details")){
url = link.getAt("href");
}
}
/*userAgent = new UserAgent(); */ //create new userAgent (headless browser)
userAgent.visit(url);
System.out.println(userAgent.getSource());
Elements results = userAgent.doc.findEvery("<tr>").findEvery("<td>");
System.out.println(results);
}
}
But it didn't work.
Then I tried another API called HtmlUnit and wrote the code below:
public void htmlUnitEx(){
String START_URL = "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec";
try {
WebClient webClient = new WebClient(BrowserVersion.CHROME);
HtmlPage page = webClient.getPage(START_URL);
WebResponse webres = page.getWebResponse();
//List<HtmlAnchor> companyInfo = (List) page.getByXPath("//input[@id='txtStockCode']");
HtmlTable companyInfo = (HtmlTable) page.getFirstByXPath("//table");
for(HtmlTableRow item : companyInfo.getBodies().get(0).getRows()){
String label = item.getCell(1).asText();
System.out.println(label);
if(!label.contains("Registered Office")){
continue ;
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
But this is also not giving the result.
Can someone please help me get the data from the above URL, and from other anchor URLs, in a single session?
Using HtmlUnit you can do this
String url = "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(1000);
final DomNodeList<DomNode> divs = page.querySelectorAll("div.bigcoll");
System.out.println(divs.get(1).asText());
}
Two things to mention:
you have to wait a bit after the getPage call, because some parts of the page are created by JavaScript/AJAX
there are many ways to find elements on a page (see "Finding a specific element" in the docs); I have only done a quick hack to show that the code is working
I have made a post about this before but have since gained more details on how to do it, yet I am still unable to do it properly. This is the main part of the code. When I run it I get a whole bunch of CSS-related warnings in the console, and it won't work. I'm trying to get the user's name, as mentioned in the code. If someone could help, it would mean a lot to me. The website is my school website: https://lionel2.kgv.edu.hk/login/index.php . I have included the logged-in website (I removed most elements except for my user part) if that helps. Thanks in advance,
Vijay.
website:
https://drive.google.com/a/kgv.hk/file/d/0B-O_Xw0mAw7tajJhVlRxTkFhOE0/view?usp=sharing
//most of this is from https://gist.github.com/harisupriyanto/6805988
String loginUrl = "http://lionel2.kgv.edu.hk";
int loginFormNum = 1;
String usernameInputName = "nameinput";
String passwordInputName = "passinput";
String submitLoginButtonValue = "Sign In";
// create the HTMLUnit WebClient instance
WebClient wclient = new WebClient();
// configure WebClient based on your desired
wclient.getOptions().setPrintContentOnFailingStatusCode(false);
wclient.getOptions().setCssEnabled(true);
wclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
wclient.getOptions().setThrowExceptionOnScriptError(false);
try {
final HtmlPage loginPage = (HtmlPage)wclient.getPage(loginUrl);
final HtmlForm loginForm = loginPage.getForms().get(loginFormNum);
final HtmlTextInput txtUser = loginForm.getInputByName(usernameInputName);
txtUser.setValueAttribute(username);
final HtmlPasswordInput txtpass = loginForm.getInputByName(passwordInputName);
txtpass.setValueAttribute(password);
final HtmlSubmitInput submitLogin = loginForm.getInputByValue(submitLoginButtonValue);
final HtmlPage returnPage = submitLogin.click();
final HtmlElement returnBody = returnPage.getBody();
// if there is an element with class "Login info", print out its nodeValue here
} catch(FailingHttpStatusCodeException e) {
e.printStackTrace();
} catch(Exception e) {
e.printStackTrace();
}
}
You most likely do not need CSS, so you can disable it.
To improve performance and reduce warnings and errors, I disable/limit as much as possible:
webClient.setJavaScriptTimeout(30 * 1000); // 30s
webClient.getOptions().setTimeout(300 * 1000); // 300s
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false); // no Exceptions because of javascript
webClient.getOptions().setPopupBlockerEnabled(true);
I'm trying to send a message on www.meetme.com but can't figure out how to do it. I can type the message into the comment area, but clicking the Send button doesn't do anything. What am I doing wrong? When I log in and press the Login button, the page changes and everything is fine. Does anyone have any ideas or clues?
HtmlPage htmlPage = null;
HtmlElement htmlElement;
WebClient webClient = null;
HtmlButton htmlButton;
HtmlForm htmlForm;
try{
// Create and initialize WebClient object
webClient = new WebClient(BrowserVersion.FIREFOX_17 );
webClient.setCssEnabled(false);
webClient.setJavaScriptEnabled(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getCookieManager().setCookiesEnabled(true);
/*webClient.setRefreshHandler(new RefreshHandler() {
public void handleRefresh(Page page, URL url, int arg) throws IOException {
System.out.println("handleRefresh");
}
});*/
htmlPage = webClient.getPage("http://www.meetme.com");
htmlForm = htmlPage.getFirstByXPath("//form[@action='https://ssl.meetme.com/login']");
htmlForm.getInputByName("username").setValueAttribute("blah@gmail.com");
htmlForm.getInputByName("password").setValueAttribute("blah");
//Signing in
htmlButton = htmlForm.getElementById("login_form_submit");
htmlPage = (HtmlPage) htmlButton.click();
htmlPage = webClient.getPage("http://www.meetme.com/member/1234567890");
System.out.println("BEFORE CLICK");
System.out.println(htmlPage.asText());
//type message in text area
HtmlTextArea commentArea = (HtmlTextArea)htmlPage.getFirstByXPath("//textarea[@id='profileQMBody']");
commentArea.setText("Testing");
htmlButton = (HtmlButton) htmlPage.getHtmlElementById("profileQMSend");
htmlPage = (HtmlPage)htmlButton.click();
webClient.waitForBackgroundJavaScript(7000);
//The print is exactly the same as the BEFORE CLICK print
System.out.println("AFTER CLICK");
System.out.println(htmlPage.asText());
}catch(ElementNotFoundException e){
e.printStackTrace();
}catch(Exception e){
e.printStackTrace();
}
Without knowing much about the webpage you're accessing: you simply can't perform an AJAX request with JavaScript disabled. If enabling JavaScript doesn't result in success, then you will have to keep debugging, but make sure it stays enabled.
Additionally, make sure you're using HtmlUnit 2.12 (the latest at the time of writing) and update all the deprecated methods in your code.
BTW, I'd also recommend turning the many JavaScript warnings off. Check this answer to see how you can do that.
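One common way to silence those warnings, assuming HtmlUnit's default logging setup (Apache Commons Logging falling back to java.util.logging), is to raise the level on HtmlUnit's logger namespace. A sketch; the package names are the usual defaults, not guaranteed for every HtmlUnit version:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SilenceHtmlUnit {
    // Keep strong references so the JUL loggers (held weakly by the
    // LogManager) don't get garbage-collected and lose their level.
    private static final Logger HTMLUNIT = Logger.getLogger("com.gargoylesoftware");
    private static final Logger HTTPCLIENT = Logger.getLogger("org.apache.http");

    // Call once before creating the WebClient.
    public static void silence() {
        HTMLUNIT.setLevel(Level.OFF);
        HTTPCLIENT.setLevel(Level.OFF);
    }
}
```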
My question is very similar to the one at this page, except that I don't have access to the remote server, nor know how it does its authentication.
I'm trying to maintain logged in status across webpages that I can request using webclient.getPage(). The website I'm accessing uses a standard login form with username, password pair. What I've done before is to create a little function to do that for me:
public static HtmlPage logIn(HtmlPage page) {
HtmlPage nextpage = null;
final HtmlForm form = page.getFormByName("login_form");
final HtmlSubmitInput button = form.getInputByValue("Login");
final HtmlTextInput username = form.getInputByName("username");
final HtmlPasswordInput password = form.getInputByName("password");
username.setValueAttribute("user_foo");
password.setValueAttribute("pwd_bar");
// hit submit button and return the requested page
try {
nextpage = button.click();
} catch (IOException e) {
e.printStackTrace();
}
return nextpage;
}
The problem with this is that I have to manually search the page returned by this function to find the link to the page I want. More troubling, this only works for the page right after login, but not for other pages.
Instead, I would like to hold the login information within the browser simulator, "webClient", so that I can seamlessly access any protected page within the site. In addition to attempting the solution in the previous question (linked above), I have tried the following, without success:
private static void setCredentials(WebClient webClient) {
String username = "user_foo";
String password = "pwd_bar";
DefaultCredentialsProvider creds = (DefaultCredentialsProvider) webClient.getCredentialsProvider();//new DefaultCredentialsProvider();
try {
creds.addCredentials(username, password);
webClient.setCredentialsProvider(creds);
}
catch (Exception e){
System.out.println("!!! Problem logging in");
e.printStackTrace();
}
}
Edited: here is the main function showing how I use webClient:
public static void main(String[] args) throws Exception {
// Create and initialize WebClient object
WebClient webClient = new WebClient(/*BrowserVersion.CHROME_16*/);
webClient.setThrowExceptionOnScriptError(false);
webClient.setJavaScriptEnabled(false);
webClient.setCssEnabled(false);
webClient.getCookieManager().setCookiesEnabled(true);
setCredentials(webClient);
HtmlPage subj_page = null;
//visit login page and get it
String url = "http://www.website.com/index.php";
HtmlPage page = (HtmlPage) webClient.getPage(url);
HtmlAnchor anchor = null;
page = logIn(page);
// search for content
page = searchPage(page, "recent articles");
// click on the paper link
anchor = (HtmlAnchor) page.getAnchorByText("recent articles");
page = (HtmlPage) anchor.click();
// loop through found articles
//{{{page
int curr_pg = 1;
int last_pg = 5;
page = webClient.getPage(<starting URL of the first article>); // such URLs look like: "www.website.com/view_articles.php?publication_id=17&page=1"
do {
// find sections on this page;
List <HtmlDivision> sections = new ArrayList<HtmlDivision>();
List <HtmlDivision> artdivs = new ArrayList<HtmlDivision>();
List <HtmlDivision> tagdivs = new ArrayList<HtmlDivision>();
sections = (List<HtmlDivision>) page.getByXPath("//div[#class='article_section']");
artdivs = (List<HtmlDivision>) page.getByXPath("//div[#class='article_head']");
tagdivs = (List<HtmlDivision>) page.getByXPath("//div[#class='article_tag']");
int num_ques = sections.size();
HtmlDivision section, artdiv, tagdiv;
// for every section, get its sub-articles
for (int i = 0; i < num_ques; i++) {
section = sections.get(i);
artdiv = artdivs.get(i);
tagdiv = tagdivs.get(i);
// find the sub-article details and print to xml file
String xml = getXMLArticle(artdiv, section.asText(), tagdiv);
System.out.println(xml);
System.out.println("-----------------------------");
}
// synchronized block to avoid IllegalMonitorStateException when calling wait()
synchronized (webClient) {
webClient.wait(2000); // wait for 2 seconds
}
String href = "?publication_id=17&page=" + curr_pg;
anchor = page.getAnchorByHref(href);
page = anchor.click();
System.out.println("anchor val: " + anchor.getHrefAttribute());
curr_pg++;
} while (curr_pg < last_pg);
//}}}page
webClient.closeAllWindows();
}
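The paging href assembled inside the loop could be factored into a small helper mirroring the "?publication_id=17&page=N" pattern above (the helper name is made up for illustration):

```java
public class PagingHref {
    // Builds the relative href for a given publication and results page,
    // matching the anchors the loop above clicks.
    public static String pageHref(int publicationId, int page) {
        return "?publication_id=" + publicationId + "&page=" + page;
    }
}
```

With this, the loop's lookup would read page.getAnchorByHref(pageHref(17, curr_pg)).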
Other info: I have no information about the remote server's authentication mechanism, since I have no access to it, but your help would be great. Thank you!