I am trying to get table data from the following URL:
https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec
I wrote this code with the help of the Jaunt API:
package org.open.browser;

import com.jaunt.Element;
import com.jaunt.Elements;
import com.jaunt.JauntException;
import com.jaunt.UserAgent;

public class ICICIScraperDemo {
    public static void main(String ar[]) throws JauntException {
        UserAgent userAgent = new UserAgent(); // create new userAgent (headless browser)
        userAgent.visit("https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec");

        Elements links = userAgent.doc.findEvery("<div class=expander>").findEvery("<a>"); // find search result links
        String url = null;
        for (Element link : links) {
            if (link.innerHTML().equalsIgnoreCase("Company Details")) {
                url = link.getAt("href");
            }
        }

        /*userAgent = new UserAgent();*/ // create new userAgent (headless browser)
        userAgent.visit(url);
        System.out.println(userAgent.getSource());

        Elements results = userAgent.doc.findEvery("<tr>").findEvery("<td>");
        System.out.println(results);
    }
}
But it didn't work.
Then I tried another API, HtmlUnit, and wrote the code below:
public void htmlUnitEx() {
    String START_URL = "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec";
    try {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        HtmlPage page = webClient.getPage(START_URL);
        WebResponse webres = page.getWebResponse();
        //List<HtmlAnchor> companyInfo = (List) page.getByXPath("//input[@id='txtStockCode']");
        HtmlTable companyInfo = (HtmlTable) page.getFirstByXPath("//table");
        for (HtmlTableRow item : companyInfo.getBodies().get(0).getRows()) {
            String label = item.getCell(1).asText();
            System.out.println(label);
            if (!label.contains("Registered Office")) {
                continue;
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
But this also did not give the result.
Can someone please help me get the data from the above URL, and from the other anchor URL, in a single session?
Using HtmlUnit you can do this:
String url = "https://www.icicidirect.com/idirectcontent/Research/TechnicalAnalysis.aspx/companyprofile/inftec";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(1000);

    final DomNodeList<DomNode> divs = page.querySelectorAll("div.bigcoll");
    System.out.println(divs.get(1).asText());
}
Two things to mention:
- you have to wait a bit after the getPage call because some parts of the page are created by JavaScript/AJAX
- there are many ways to find elements on a page (see Finding a specific element); I have only done a quick hack here to show that the code works
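If you need the individual table cells rather than the whole text block, a more targeted XPath lookup is one option. A minimal sketch (reusing the url variable from above, and assuming the company data ends up in an ordinary table inside the div.bigcoll block once the background JavaScript has run):

try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(1000);

    // assumption: the company details are rendered as plain table rows inside div.bigcoll
    List<?> rows = page.getByXPath("//div[contains(@class, 'bigcoll')]//table//tr");
    for (Object r : rows) {
        HtmlTableRow row = (HtmlTableRow) r;
        for (HtmlTableCell cell : row.getCells()) {
            System.out.print(cell.asText() + " | ");
        }
        System.out.println();
    }
}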
Related
I have code which uses Tor to get a new IP address every time, and then opens a blog page, but the views counter of the blog still does not increase. Why?
import java.io.InputStream;
import java.net.*;

public class test {

    public static void main(String args[]) throws Exception {
        System.out.println(test.getData("http://checkip.amazonaws.com"));
        System.out.println(test.getData("***BLOG URL***"));
    }

    public static String getData(String ur) throws Exception {
        String TOR_IP = "127.0.0.1", TOR_PORT = "9050";
        System.setProperty("java.net.preferIPv4Stack", "true");
        System.setProperty("socksProxyHost", TOR_IP);
        System.setProperty("socksProxyPort", TOR_PORT);

        URL url = new URL(ur);
        String s = "";
        URLConnection c = url.openConnection();
        c.connect();
        InputStream i = c.getInputStream();
        int j;
        while ((j = i.read()) != -1) {
            s += (char) j;
        }
        return s;
    }
}
I made this little auto script just to understand what it would take to get past their checks.
This is an evolving field; blog sites try to detect and thwart cheating. WordPress in particular excludes (https://en.support.wordpress.com/stats/):
"visits from browsers that do not execute javascript or load images"
In other words, just hitting the page doesn't count. You need to fetch all the resources and possibly execute the JavaScript as well.
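A raw URLConnection only pulls down the HTML, so it will always fall under that exclusion. A minimal sketch of the same fetch done with HtmlUnit, which runs the JavaScript and loads the page resources, routed through the local Tor SOCKS proxy (this assumes Tor is still listening on 127.0.0.1:9050 and uses the HtmlUnit 2.x ProxyConfig constructor; whether this is enough to be counted as a view is still up to the stats service):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TorPageView {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
            // send all traffic through the local Tor SOCKS proxy
            webClient.getOptions().setProxyConfig(new ProxyConfig("127.0.0.1", 9050, true)); // true = SOCKS

            // behave more like a real browser: run JavaScript and download images/CSS
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(true);
            webClient.getOptions().setDownloadImages(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("***BLOG URL***");
            webClient.waitForBackgroundJavaScript(10_000); // give analytics scripts time to run
            System.out.println(page.getTitleText());
        }
    }
}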
I have made a post about this before, but I have since gained some more details on how to do it, yet I am still unable to do it properly. This is the main part of the code. When I run it I get a whole bunch of CSS-related warnings in the console, and it won't work. I'm trying to get the user's name, as mentioned in the code. If someone could help it would mean a lot to me. The website is my school website: https://lionel2.kgv.edu.hk/login/index.php . I have included the logged-in website (I removed most elements except for my user part) if that helps. Thanks in advance,
Vijay.
website:
https://drive.google.com/a/kgv.hk/file/d/0B-O_Xw0mAw7tajJhVlRxTkFhOE0/view?usp=sharing
// most of this is from https://gist.github.com/harisupriyanto/6805988
String loginUrl = "http://lionel2.kgv.edu.hk";
int loginFormNum = 1;
String usernameInputName = "nameinput";
String passwordInputName = "passinput";
String submitLoginButtonValue = "Sign In";

// create the HtmlUnit WebClient instance
WebClient wclient = new WebClient();

// configure the WebClient as desired
wclient.getOptions().setPrintContentOnFailingStatusCode(false);
wclient.getOptions().setCssEnabled(true);
wclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
wclient.getOptions().setThrowExceptionOnScriptError(false);

try {
    final HtmlPage loginPage = (HtmlPage) wclient.getPage(loginUrl);
    final HtmlForm loginForm = loginPage.getForms().get(loginFormNum);

    final HtmlTextInput txtUser = loginForm.getInputByName(usernameInputName);
    txtUser.setValueAttribute(username);
    final HtmlPasswordInput txtpass = loginForm.getInputByName(passwordInputName);
    txtpass.setValueAttribute(password);

    final HtmlSubmitInput submitLogin = loginForm.getInputByValue(submitLoginButtonValue);
    final HtmlPage returnPage = submitLogin.click();
    final HtmlElement returnBody = returnPage.getBody();

    // if there is a class called "Login info", print out its node value
    // (this is the part I don't know how to do)

} catch (FailingHttpStatusCodeException e) {
    e.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
}
You most likely do not need the CSS, so you can disable it.
To improve performance and reduce warnings and errors, I disable/limit as much as possible:
webClient.setJavaScriptTimeout(30 * 1000); // 30s
webClient.getOptions().setTimeout(300 * 1000); // 300s
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false); // no Exceptions because of javascript
webClient.getOptions().setPopupBlockerEnabled(true);
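To actually read the user's name once the login has gone through, one option is to look for the element that carries it and print its text. A minimal sketch, continuing right after the question's HtmlPage returnPage = submitLogin.click(); line (the class name logininfo is only an assumption based on the question; check returnPage.asXml() for the real one):

// assumption: the logged-in page shows the user's name in an element whose class contains "logininfo"
DomNode loginInfo = returnPage.getFirstByXPath("//*[contains(@class, 'logininfo')]");
if (loginInfo != null) {
    System.out.println(loginInfo.asText()); // should print the logged-in user's name
} else {
    System.out.println("No login info element found - login may have failed");
}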
This has been my best attempt at it:
HtmlUnorderedList unorderedList = (HtmlUnorderedList) page.getFirstByXPath("//ul[@id='inbox-message-list-messages']");
However, that getFirstByXPath returns null. I just learned about using XPath today, so I'm sure I'm missing something basic.
Once we have a reference to an HtmlPage, we can search for a specific HtmlElement using one of the get methods or XPath. Check the following example of finding a div by ID and getting an anchor by name:
@Test
public void getElements() throws Exception {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("http://some_url");
        final HtmlDivision div = page.getHtmlElementById("some_div_id");
        final HtmlAnchor anchor = page.getAnchorByName("anchor_name");
    }
}
And XPath is the suggested way for more complex searches (tutorial):
@Test
public void xpath() throws Exception {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");

        // get a list of all divs
        final List<?> divs = page.getByXPath("//div");

        // get the div which has a 'name' attribute of 'John'
        final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@name='John']").get(0);
    }
}
I would add that you are comparing the result in real Chrome with HtmlUnit, and they may differ.
First, make sure you construct the WebClient with Chrome simulation:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
}
Then you should see what HtmlUnit sees, by printing:
System.out.println(page.asXml());
Then find the elements in that output and adjust the XPath accordingly, as hinted by akhil.
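Putting those pieces together, a minimal sketch (the ul id is taken from the question; the wait matters if the list is built by JavaScript/AJAX after the initial load):

try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);

    final HtmlPage page = webClient.getPage("http://some_url");
    webClient.waitForBackgroundJavaScript(5_000); // the list may be created after page load

    System.out.println(page.asXml()); // confirm the element is really in the DOM HtmlUnit sees

    HtmlUnorderedList list = page.getFirstByXPath("//ul[@id='inbox-message-list-messages']");
    if (list != null) {
        System.out.println(list.asText());
    }
}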
My question is very similar to the one at this page, except that I don't have access to the remote server, nor do I know how it does its authentication.
I'm trying to maintain logged-in status across webpages that I request using webClient.getPage(). The website I'm accessing uses a standard login form with a username/password pair. What I've done before is create a little function to do that for me:
public static HtmlPage logIn(HtmlPage page) {
    HtmlPage nextpage = null;

    final HtmlForm form = page.getFormByName("login_form");
    final HtmlSubmitInput button = form.getInputByValue("Login");
    final HtmlTextInput username = form.getInputByName("username");
    final HtmlPasswordInput password = form.getInputByName("password");

    username.setValueAttribute("user_foo");
    password.setValueAttribute("pwd_bar");

    // hit the submit button and return the requested page
    try {
        nextpage = button.click();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return nextpage;
}
The problem with this is that I have to manually search the page returned by this function in order to find the link to the page I want. More troubling is that this only works for the page right after login, but not for other pages.
Instead, I would like to hold the login information within the browser simulator, "webClient", so that I can seamlessly access any protected page within the site. In addition to attempting the solution in the previous question (linked above), I have attempted the following solution, without success:
private static void setCredentials(WebClient webClient) {
    String username = "user_foo";
    String password = "pwd_bar";
    DefaultCredentialsProvider creds =
            (DefaultCredentialsProvider) webClient.getCredentialsProvider(); // new DefaultCredentialsProvider();
    try {
        creds.addCredentials(username, password);
        webClient.setCredentialsProvider(creds);
    } catch (Exception e) {
        System.out.println("!!! Problem logging in");
        e.printStackTrace();
    }
}
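For reference, a minimal sketch of what I understand should happen (with the placeholder URLs from my code): as far as I can tell, DefaultCredentialsProvider is meant for HTTP Basic/NTLM-style authentication, while a plain login form is normally tracked by a session cookie that the cookie manager keeps as long as the same WebClient instance is reused:

WebClient webClient = new WebClient();
webClient.getCookieManager().setCookiesEnabled(true); // enabled by default anyway

// log in once through the form; the session cookie ends up in the cookie manager
HtmlPage loginPage = webClient.getPage("http://www.website.com/index.php");
logIn(loginPage);

// later requests made with the SAME webClient send that cookie automatically,
// so protected pages should be reachable directly by URL
HtmlPage articles = webClient.getPage("http://www.website.com/view_articles.php?publication_id=17&page=1");
System.out.println(articles.getTitleText());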
Edited: here is the main function showing how I use webClient:
public static void main(String[] args) throws Exception {
    // Create and initialize the WebClient object
    WebClient webClient = new WebClient(/*BrowserVersion.CHROME_16*/);
    webClient.setThrowExceptionOnScriptError(false);
    webClient.setJavaScriptEnabled(false);
    webClient.setCssEnabled(false);
    webClient.getCookieManager().setCookiesEnabled(true);
    setCredentials(webClient);

    HtmlPage subj_page = null;

    // visit the login page and get it
    String url = "http://www.website.com/index.php";
    HtmlPage page = (HtmlPage) webClient.getPage(url);
    HtmlAnchor anchor = null;

    page = logIn(page);

    // search for content
    page = searchPage(page, "recent articles");

    // click on the paper link
    anchor = (HtmlAnchor) page.getAnchorByText("recent articles");
    page = (HtmlPage) anchor.click();

    // loop through found articles
    //{{{page
    int curr_pg = 1;
    int last_pg = 5;
    page = webClient.getPage(<starting URL of the first article>); // such URLs look like: "www.website.com/view_articles.php?publication_id=17&page=1"
    do {
        // find sections on this page
        List<HtmlDivision> sections = new ArrayList<HtmlDivision>();
        List<HtmlDivision> artdivs = new ArrayList<HtmlDivision>();
        List<HtmlDivision> tagdivs = new ArrayList<HtmlDivision>();
        sections = (List<HtmlDivision>) page.getByXPath("//div[@class='article_section']");
        artdivs = (List<HtmlDivision>) page.getByXPath("//div[@class='article_head']");
        tagdivs = (List<HtmlDivision>) page.getByXPath("//div[@class='article_tag']");

        int num_ques = sections.size();
        HtmlDivision section, artdiv, tagdiv;

        // for every section, get its sub-articles
        for (int i = 0; i < num_ques; i++) {
            section = sections.get(i);
            artdiv = artdivs.get(i);
            tagdiv = tagdivs.get(i);

            // find the sub-article details and print to xml file
            String xml = getXMLArticle(artdiv, section.asText(), tagdiv);
            System.out.println(xml);
            System.out.println("-----------------------------");
        }

        // synchronized to avoid IllegalMonitorStateException
        synchronized (webClient) {
            webClient.wait(2000); // wait for 2 seconds
        }

        String href = "?publication_id=17&page=" + curr_pg;
        anchor = page.getAnchorByHref(href);
        page = anchor.click();
        System.out.println("anchor val: " + anchor.getHrefAttribute());
        curr_pg++;
    } while (curr_pg < last_pg);
    //}}}page

    webClient.closeAllWindows();
}
Other info: I have no details about the remote server's authentication mechanism since I have no access to it, but your help would be great. Thank you!
I'm having a weird problem with HtmlUnit in Java. I am using it to download some data from a website; the process is something like this:
1 - Login
2 - For each element (car):
----- 3 - Search for the car
----- 4 - Download a zip file from a link
The code:
Creation of the webclient:
webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
webClient.setJavaScriptEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
DefaultCredentialsProvider provider = new DefaultCredentialsProvider();
provider.addCredentials(USERNAME, PASSWORD);
webClient.setCredentialsProvider(provider);
webClient.setRefreshHandler(new ImmediateRefreshHandler());
Login:
public void login() throws IOException
{
    page = (HtmlPage) webClient.getPage(URL);
    HtmlForm form = page.getFormByName("formLogin");

    String user = USERNAME;
    String password = PASSWORD;

    // Enter login and password
    form.getInputByName("LoginSteps$UserName").setValueAttribute(user);
    form.getInputByName("LoginSteps$Password").setValueAttribute(password);

    // Click the Login button
    page = (HtmlPage) form.getInputByName("LoginSteps$LoginButton").click();
    webClient.waitForBackgroundJavaScript(3000);

    // Click on the Campa area
    HtmlAnchor link = (HtmlAnchor) page.getElementById("ctl00_linkCampaNoiH");
    page = (HtmlPage) link.click();
    webClient.waitForBackgroundJavaScript(3000);

    System.out.println(page.asText());
}
Search for car in website:
private void searchCar(String _regNumber) throws IOException
{
    // Open the search window
    page = page.getElementById("search_gridCampaNoi").click();
    webClient.waitForBackgroundJavaScript(3000);

    // Write the plate number
    HtmlInput element = (HtmlInput) page.getElementById("jqg1");
    element.setValueAttribute(_regNumber);
    webClient.waitForBackgroundJavaScript(3000);

    // Click on search
    HtmlAnchor anchor = (HtmlAnchor) page.getByXPath("//*[@id=\"fbox_gridCampaNoi_search\"]").get(0);
    page = anchor.click();
    webClient.waitForBackgroundJavaScript(3000);

    System.out.println(page.asText());
}
Download pdf:
try
{
    InputStream is = _link.click().getWebResponse().getContentAsStream();
    File path = new File(new File(DOWNLOAD_PATH), _regNumber);
    if (!path.exists())
    {
        path.mkdir();
    }
    writeToFile(is, new File(path, _regNumber + "_pdfs.zip"));
}
catch (Exception e)
{
    e.printStackTrace();
}
The problem:
The first car works okay and the pdf is downloaded, but as soon as I search for a new car, when I get to this line:
page = page.getElementById("search_gridCampaNoi").click();
I get this exception:
Exception in thread "main" java.lang.ClassCastException: com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlPage
After debugging, I've realized that the moment I make this call:
InputStream is = _link.click().getWebResponse().getContentAsStream();
the return value of page.getElementById("search_gridCampaNoi").click() changes from an HtmlPage to an UnexpectedPage, so instead of receiving a new page, I'm receiving the file that I already downloaded again.
Debugger screenshots (not reproduced here) show this: on the first call the return type is fine, but on the second call the return type has changed and I no longer receive an HtmlPage.
Thanks in advance!
Just in case someone encounters the same problem, I found a workaround. Changing the line:
InputStream is = _link.click().getWebResponse().getContentAsStream();
to
InputStream is = _link.openLinkInNewWindow().getWebResponse().getContentAsStream();
seems to do the trick. I'm still having problems when doing several iterations (sometimes it works, sometimes it doesn't), but at least I have something now.
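A related defensive option: click() is declared to return a Page, and it is only an HtmlPage when the response is actually HTML, so checking the type before assigning avoids the ClassCastException. A minimal sketch (variable and element names follow the question's code):

// click() returns a Page; only treat it as an HtmlPage when the response really is HTML
Page result = page.getElementById("search_gridCampaNoi").click();
if (result instanceof HtmlPage) {
    page = (HtmlPage) result;
} else {
    // e.g. an UnexpectedPage still holding the previously downloaded zip; handle or retry
    System.out.println("Non-HTML response: " + result.getWebResponse().getContentType());
}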