Basic use of getByXpath in HtmlUnit - java

This has been my best attempt at it:
HtmlUnorderedList unorderedList = (HtmlUnorderedList) page.getFirstByXPath("//ul[@id='inbox-message-list-messages']");
However, getFirstByXPath returns null. I just learned about XPath today, so I'm sure I'm missing something basic.

Once we have a reference to an HtmlPage, we can search for a specific HtmlElement using one of the get methods or XPath. Check the following example of finding a div by an ID and getting an anchor by name:
@Test
public void getElements() throws Exception {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("http://some_url");
        final HtmlDivision div = page.getHtmlElementById("some_div_id");
        final HtmlAnchor anchor = page.getAnchorByName("anchor_name");
    }
}
And XPath is the suggested way for more complex searches (tutorial):
@Test
public void xpath() throws Exception {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
        // get the list of all divs
        final List<?> divs = page.getByXPath("//div");
        // get the div which has a 'name' attribute of 'John'
        final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@name='John']").get(0);
    }
}

I would add that you should compare the real Chrome result with what HtmlUnit sees; they may differ.
First, make sure you construct the client with Chrome simulation:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
}
Then check what HtmlUnit actually sees by printing:
System.out.println(page.asXml());
Then inspect the elements in that output and adjust the XPath accordingly, as hinted by akhil.
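Independently of what the page contains, it is worth verifying the XPath expression itself: attribute tests use @ (not CSS's #), and the JDK's built-in XPath engine lets you try an expression against a small markup snippet without HtmlUnit. A minimal sketch; the markup here is a made-up stand-in for the real page:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XPathIdDemo {
    // Evaluate an XPath expression against a small markup string and
    // return the matched element's tag name, or null if nothing matched.
    static String firstMatch(String markup, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
        return node == null ? null : node.getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the real page markup
        String html = "<html><body><ul id=\"inbox-message-list-messages\"><li>msg</li></ul></body></html>";
        // Attribute tests use '@': this matches and prints "ul"
        System.out.println(firstMatch(html, "//ul[@id='inbox-message-list-messages']"));
    }
}
```

If the expression is valid but still matches nothing in HtmlUnit, the element is most likely created by JavaScript, which is exactly what the page.asXml() check above reveals.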

Related

Find a form with Java and htmlUnit

I have written a simple program which should log in via a form on a website.
Unfortunately, the form in the HTML has no name or id.
I use the latest version of HtmlUnit and Java 11.
I tried to find the form with the .getForms() method, but without success.
HTML snippet from the website I try to log in to:
Here is my code to find the form:
//Get the form
HtmlForm form = LoginPage.getFormByName("I tried several options here");
//Get the Submit button
final HtmlButton loginButton = form.getButtonByName("Anmelden");
//Get the text fields for password and username
final HtmlTextInput username = form.getInputByName("text");
final HtmlTextInput password = form.getInputByName("password");
Whatever I tried, I didn't find any form.
This is my connection class if it helps:
public HtmlPage CslPlasmaConnection(){
    // Create a WebClient to connect to CslPlasma
    WebClient CslPlasmaConnection = new WebClient(BrowserVersion.BEST_SUPPORTED);
    // Helper variable initialized with null
    HtmlPage CslPlasmaLoginPage = null;
    // Get the content from CslPlasma
    try {
        CslPlasmaLoginPage = CslPlasmaConnection.getPage(URL);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Return the CslPlasma login page
    return CslPlasmaLoginPage;
}
Without knowing the page, I can only guess...
Have a look at this answer: https://stackoverflow.com/a/54188201/4804091
And try to use the latest page (maybe there is some JS that creates the form):
webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10000);
HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
If you're sure this is the only form on the page or you know which form number it is, you can use page.getForms() to get all forms of the page and get yours from the resulting list.
Like so:
HtmlForm form = LoginPage.getForms().get(0); // if it's the only form, its index is 0

How to get data from the Java web scraping API?

I am trying to get table data from the following URL:
Get Data from this URL
and I wrote this code with the help of the Jaunt API:
package org.open.browser;

import com.jaunt.Element;
import com.jaunt.Elements;
import com.jaunt.JauntException;
import com.jaunt.UserAgent;

public class ICICIScraperDemo {
    public static void main(String ar[]) throws JauntException {
        UserAgent userAgent = new UserAgent(); // create a new userAgent (headless browser)
        userAgent.visit("https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec");
        Elements links = userAgent.doc.findEvery("<div class=expander>").findEvery("<a>"); // find search result links
        String url = null;
        for (Element link : links) {
            if (link.innerHTML().equalsIgnoreCase("Company Details")) {
                url = link.getAt("href");
            }
        }
        /* userAgent = new UserAgent(); */ // create a new userAgent (headless browser)
        userAgent.visit(url);
        System.out.println(userAgent.getSource());
        Elements results = userAgent.doc.findEvery("<tr>").findEvery("<td>");
        System.out.println(results);
    }
}
But it didn't work.
Then I tried another API, HtmlUnit, and wrote the code below:
public void htmlUnitEx() {
    String START_URL = "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec";
    try {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        HtmlPage page = webClient.getPage(START_URL);
        WebResponse webres = page.getWebResponse();
        //List<HtmlAnchor> companyInfo = (List) page.getByXPath("//input[@id='txtStockCode']");
        HtmlTable companyInfo = (HtmlTable) page.getFirstByXPath("//table");
        for (HtmlTableRow item : companyInfo.getBodies().get(0).getRows()) {
            String label = item.getCell(1).asText();
            System.out.println(label);
            if (!label.contains("Registered Office")) {
                continue;
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
But this is also not giving the result.
Can someone please help with how to get the data from the above URL, and from the other anchor URL, in a single session?
Using HtmlUnit you can do this
String url = "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(1000);
    final DomNodeList<DomNode> divs = page.querySelectorAll("div.bigcoll");
    System.out.println(divs.get(1).asText());
}
Two things to mention:
you have to wait a bit after the getPage call because some parts are created by JavaScript/AJAX
there are many ways to find elements on a page (see Finding a specific element). I have done only a quick hack to show that the code is working.

Java web parser with cookies?

There are some HTML parsing libraries available,
but what do you do if you need to authenticate and carry the cookies with each request?
And generally, what if you need to press some button in order to get the content that you want to parse,
for example a button that calculates something, or gets some data through WebSockets, etc.?
Is there a technology to simulate behaviour in a browser (so that all JS is actually working) and parse from there?
UPDATE
Maybe for this purpose I need to embed Chromium and use traditional parsers? Though I don't understand how I would trigger a click...
HtmlUnit: http://htmlunit.sourceforge.net/
public static void main(String... args) throws Exception {
    final WebClient webClient = new WebClient();
    final HtmlPage page1 = webClient.getPage("http://some_url");
    final HtmlForm form = page1.getFormByName("myform");
    final HtmlSubmitInput button = form.getInputByName("submitbutton");
    final HtmlTextInput textField = form.getInputByName("userid");
    textField.setValueAttribute("root");
    final HtmlPage page2 = button.click();
    webClient.closeAllWindows();
}
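HtmlUnit's WebClient keeps cookies across requests automatically via its CookieManager. If you end up making plain HTTP calls instead, the JDK's java.net.CookieManager does the same bookkeeping. A minimal sketch, assuming nothing about any particular site (the cookie name and value are made up):

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;

public class CookieJarDemo {
    // Store a cookie for a site and read it back, the way an HTTP client
    // would replay it on a follow-up request to the same host.
    static String storeAndRead(String name, String value) throws Exception {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        URI site = new URI("http://example.com/");
        jar.getCookieStore().add(site, new HttpCookie(name, value));
        HttpCookie c = jar.getCookieStore().get(site).get(0);
        return c.getName() + "=" + c.getValue();
    }

    public static void main(String[] args) throws Exception {
        // Installing the manager process-wide makes HttpURLConnection use it:
        // CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        System.out.println(storeAndRead("SESSIONID", "abc123"));
    }
}
```

With the manager installed as the default CookieHandler, any Set-Cookie header received by HttpURLConnection is stored and replayed on later requests, which covers the "carry the cookies" part of the question without a browser simulator.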

Maintaining login credentials across pages in HTMLunit WebClient

My question is very similar to the one at this page, except that I don't have access to the remote server, nor know how it does its authentication.
I'm trying to maintain logged in status across webpages that I can request using webclient.getPage(). The website I'm accessing uses a standard login form with username, password pair. What I've done before is to create a little function to do that for me:
public static HtmlPage logIn(HtmlPage page) {
    HtmlPage nextpage = null;
    final HtmlForm form = page.getFormByName("login_form");
    final HtmlSubmitInput button = form.getInputByValue("Login");
    final HtmlTextInput username = form.getInputByName("username");
    final HtmlPasswordInput password = form.getInputByName("password");
    username.setValueAttribute("user_foo");
    password.setValueAttribute("pwd_bar");
    // hit the submit button and return the requested page
    try {
        nextpage = button.click();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return nextpage;
}
The problem with this is that I have to manually search the page returned by this function in order to find the link to the page I want. More troubling is that this only works for the page right after login, but not for other pages.
Instead, I would like to hold the login information within the browser simulator, "webclient", so that I can seamlessly access any protected page within the site. In addition to attempting the solution in the previous question (linked above), I have attempted the following solution without success:
private static void setCredentials(WebClient webClient) {
    String username = "user_foo";
    String password = "pwd_bar";
    DefaultCredentialsProvider creds = (DefaultCredentialsProvider) webClient.getCredentialsProvider(); // new DefaultCredentialsProvider();
    try {
        creds.addCredentials(username, password);
        webClient.setCredentialsProvider(creds);
    } catch (Exception e) {
        System.out.println("!!! Problem logging in");
        e.printStackTrace();
    }
}
Edited: here is the main function showing how I use webClient:
public static void main(String[] args) throws Exception {
    // Create and initialize the WebClient object
    WebClient webClient = new WebClient(/*BrowserVersion.CHROME_16*/);
    webClient.setThrowExceptionOnScriptError(false);
    webClient.setJavaScriptEnabled(false);
    webClient.setCssEnabled(false);
    webClient.getCookieManager().setCookiesEnabled(true);
    setCredentials(webClient);
    HtmlPage subj_page = null;
    // visit the login page and get it
    String url = "http://www.website.com/index.php";
    HtmlPage page = (HtmlPage) webClient.getPage(url);
    HtmlAnchor anchor = null;
    page = logIn(page);
    // search for content
    page = searchPage(page, "recent articles");
    // click on the paper link
    anchor = (HtmlAnchor) page.getAnchorByText("recent articles");
    page = (HtmlPage) anchor.click();
    // loop through found articles
    //{{{page
    int curr_pg = 1;
    int last_pg = 5;
    page = webClient.getPage(<starting URL of the first article>); // such URLs look like: "www.website.com/view_articles.php?publication_id=17&page=1"
    do {
        // find sections on this page
        List<HtmlDivision> sections = new ArrayList<HtmlDivision>();
        List<HtmlDivision> artdivs = new ArrayList<HtmlDivision>();
        List<HtmlDivision> tagdivs = new ArrayList<HtmlDivision>();
        sections = (List<HtmlDivision>) page.getByXPath("//div[@class='article_section']");
        artdivs = (List<HtmlDivision>) page.getByXPath("//div[@class='article_head']");
        tagdivs = (List<HtmlDivision>) page.getByXPath("//div[@class='article_tag']");
        int num_ques = sections.size();
        HtmlDivision section, artdiv, tagdiv;
        // for every section, get its sub-articles
        for (int i = 0; i < num_ques; i++) {
            section = sections.get(i);
            artdiv = artdivs.get(i);
            tagdiv = tagdivs.get(i);
            // find the sub-article details and print to an XML file
            String xml = getXMLArticle(artdiv, section.asText(), tagdiv);
            System.out.println(xml);
            System.out.println("-----------------------------");
        }
        // synchronized block to remove IllegalMonitorStateException
        synchronized (webClient) {
            webClient.wait(2000); // wait for 2 seconds
        }
        String href = "?publication_id=17&page=" + curr_pg;
        anchor = page.getAnchorByHref(href);
        page = anchor.click();
        System.out.println("anchor val: " + anchor.getHrefAttribute());
        curr_pg++;
    } while (curr_pg < last_pg);
    //}}}page
    webClient.closeAllWindows();
}
Other info: I do not have info about the remote site server's authentication mechanism since I have no access to it, but your help would be great. Thank you!

Password hacking

I have two files, a list of usernames and a list of passwords. I need to write a program to check each user name with the list of passwords. Then I need to go to a website and see if it logs in. I am not very sure how to go about the comparing and how to simulate the program to log in the website enter the information. Could you please help me out with this? It's a homework problem.
Regardless of the language you choose to implement this in, the basic idea is to simulate log-ins programmatically. This can be done by logging in manually and looking at the HTTP headers, then sending "forged" headers programmatically, changing the user/password fields.
Most log-ins will use POST, and making a POST is not entirely straightforward. If you are allowed to use external libraries, you can try cURL. Simply set the appropriate headers and look at the response to check whether your attempt was successful or not. If not, try again with a new combination.
In pseudocode:
bool simulate_login(user, password):
    request = new request(url)
    request.set_method('POST')
    request.set_header('name', user)
    request.set_header('pass', password)
    response = request.fetch_response()
    return response.contains("Login successful")

success = []
foreach user:
    foreach password:
        if (simulate_login(user, password)):
            success.append((user, password))
            break
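The pseudocode above can be sketched in plain Java with the JDK's HttpURLConnection, no external library needed. This is a sketch under assumptions: the form field names (name, pass) and the "Login successful" marker are placeholders that must be read off the real login form and response:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class LoginProbe {
    // Build an application/x-www-form-urlencoded body for the POST.
    // Field names "name" and "pass" are assumptions about the target form.
    static String formBody(String user, String password) throws IOException {
        return "name=" + URLEncoder.encode(user, "UTF-8")
             + "&pass=" + URLEncoder.encode(password, "UTF-8");
    }

    // POST one user/password pair and report whether the response body
    // contains the (assumed) success marker.
    static boolean simulateLogin(String url, String user, String password) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formBody(user, password).getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line);
        }
        return sb.toString().contains("Login successful"); // assumed success marker
    }
}
```

The outer loop over the username and password files is then a straightforward nested loop calling simulateLogin, breaking on the first hit for each user, exactly as in the pseudocode.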
If you would like to use java you can try with HtmlUnit (see: http://htmlunit.sourceforge.net/gettingStarted.html) or if you are allowed Groovy you can go with http://www.gebish.org/
Here is the example from getting started guide that is relevant to your case:
public void login() throws Exception {
    final WebClient webClient = new WebClient();
    // Get the first page
    final HtmlPage page1 = webClient.getPage("http://some_url");
    // Get the form that we are dealing with and, within that form,
    // find the submit button and the field that we want to change.
    final HtmlForm form = page1.getFormByName("myform");
    final HtmlSubmitInput button = form.getInputByName("submitbutton");
    final HtmlTextInput textField = form.getInputByName("userid");
    // Change the value of the text field
    textField.setValueAttribute("username");
    // Do similar for the password and that's all
    // Now submit the form by clicking the button and get back the second page.
    final HtmlPage page2 = button.click();
    webClient.closeAllWindows();
}