How to find all URLs recursively on a website - java

I have a method which gets all URLs from a page (and, optionally, checks whether each one is valid).
But it only works for one page; I want to check the whole website, so I need to make it recursive.
private static FirefoxDriver driver;

public static void main(String[] args) throws Exception {
    driver = new FirefoxDriver();
    driver.get("https://example.com/");
    List<WebElement> allURLs = findAllLinks(driver);
    report(allURLs);
    // here are my trials for recursion
    for (WebElement element : allURLs) {
        driver.get(element.getAttribute("href"));
        List<WebElement> allUrls = findAllLinks(driver);
        report(allUrls);
    }
}

public static List<WebElement> findAllLinks(WebDriver driver) {
    List<WebElement> elementList = driver.findElements(By.tagName("a"));
    elementList.addAll(driver.findElements(By.tagName("img")));
    List<WebElement> finalList = new ArrayList<>();
    for (WebElement element : elementList) {
        if (element.getAttribute("href") != null) {
            finalList.add(element);
        }
    }
    return finalList;
}

public static void report(List<WebElement> allURLs) throws Exception {
    for (WebElement element : allURLs) {
        System.out.println("URL: " + element.getAttribute("href")
                + " returned " + isLinkBroken(new URL(element.getAttribute("href"))));
    }
}
See the comment "here are my trials for recursion". But it goes through the first page, then through the first page again, and that's all.

You're trying to write a web crawler. I am a big fan of code reuse. Which is to say, I always look around to see if my project has already been written before I spend the time writing it myself. And there are many versions of web crawlers out there. One written by Marilena Panagiotidou pops up early in a Google search. Leaving out the imports, her basic version looks like this:
public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }
                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");
                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }
}
Probably the most important thing to note here is how the recursion works. A recursive method is one that calls itself. Your example above is not recursion. You have a method findAllLinks that you call once on a page, and then once for every link found in the page. Notice how Marilena's getPageLinks method calls itself once for every link it finds in a page at a given URL. And in calling itself it creates a new stack frame, generates a new set of links from a page, and calls itself again once for every link, etc. etc.
Another important thing to note about a recursive function is when it stops calling itself. In this case Marilena's recursive function keeps calling itself until it can't find any new links. If the page you are crawling links to pages outside its domain, this program could run for a very long time. And, incidentally, what probably happens in that case is where this website got its name: a StackOverflowError.
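If the crawl needs to stay on one site (and to avoid that very long run), one option is to restrict the recursion to the starting page's host. This is only a sketch of that idea, not part of Marilena's code; the SameDomainCrawler class name and structure are my own illustration, assuming Jsoup is on the classpath.
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SameDomainCrawler {

    private final Set<String> visited = new HashSet<>();
    private final String allowedHost;

    public SameDomainCrawler(String startUrl) {
        // Remember the host of the starting page so external links can be ignored.
        this.allowedHost = URI.create(startUrl).getHost();
    }

    public void crawl(String url) {
        // Stop condition 1: this URL has been crawled already.
        if (visited.contains(url)) {
            return;
        }
        // Stop condition 2: the link points outside the starting domain
        // (or is not something Java can parse as a URI at all).
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return;
        }
        if (host == null || !host.equalsIgnoreCase(allowedHost)) {
            return;
        }
        visited.add(url);
        System.out.println(url);
        try {
            Document document = Jsoup.connect(url).get();
            for (Element link : document.select("a[href]")) {
                crawl(link.attr("abs:href"));
            }
        } catch (IOException e) {
            System.err.println("For '" + url + "': " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        String start = "https://example.com/";
        new SameDomainCrawler(start).crawl(start);
    }
}
Replacing the recursion with an explicit queue (a breadth-first traversal) would also remove the StackOverflowError risk on very large sites, at the cost of a little more bookkeeping.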

Make sure you are not visiting the same URL twice. Add some table where you store already-visited URLs. Since every page will probably start with a header that links back to the home page, you might otherwise end up visiting it over and over again, for example.
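Applied to the Selenium code in the question, that could look roughly like the sketch below: collect the hrefs as plain strings first (so they do not go stale once the driver navigates away), keep a set of visited URLs, and work through a queue instead of recursing. The SeleniumCrawler class and the collectLinks helper are my own placeholder names, not from the original code.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumCrawler {

    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("https://example.com/");

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this URL
            }
            driver.get(url);
            for (String href : collectLinks(driver)) {
                System.out.println("Found on " + url + ": " + href);
                if (!visited.contains(href)) {
                    frontier.add(href);
                }
            }
        }
        driver.quit();
    }

    // Return hrefs as plain strings so they survive the next driver.get() call.
    private static List<String> collectLinks(WebDriver driver) {
        List<String> hrefs = new ArrayList<>();
        for (WebElement anchor : driver.findElements(By.tagName("a"))) {
            String href = anchor.getAttribute("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
        return hrefs;
    }
}
In practice you would also want the same-domain check from the previous sketch, otherwise the frontier keeps growing with external links.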

Related

Receiving selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element message

I am new to the site. I am trying to select/click on elements on http://autotask.net. I have tried to find the element using the following methods:
driver.find_element_by_css_selector('#HREF_btnPrint > img:nth-child(1)')
driver.find_element_by_xpath('/html/body/form[1]/div[2]/div[1]/ul/li[2]/a/img')
...but it gives me the exception in the title.
I can find elements fine on the logon page (i.e. the 'username' and 'password' fields), but once I log in, Selenium can't find any elements on the page.
I've only been learning/coding Python for a few weeks now and I know nothing about Java. If you need more information, please let me know - thanks.
I'm not sure about the Python syntax, so I will use Java instead. I think there are two possible cases: first, your driver looks into the DOM structure while the page is still loading; the other is that the web driver doesn't point at the root of the DOM. I usually have a wait method that waits until the page finishes loading, and I switch to the default content before looking for elements.
First, I pass the WebDriver, the element id, and the element name through the constructor. Then I have waitUntilReady(), which implements the logic to check for the existence of the element within a specific timeout.
public void waitUntilReady() {
    WebDriverWait waiter = new WebDriverWait(getWebDriver(), 30);
    waiter.withMessage("\"" + getElementName() + "\" is not ready in 30 seconds")
          .until(new Predicate<WebDriver>() {
              @Override
              public boolean apply(WebDriver input) {
                  /* isReady() returns true if all elements
                   * that you need to test are available.
                   * Note that if you have javascript, you
                   * should have some hidden field to tell
                   * you that the whole page has loaded. */
                  return isReady();
              }
          });
}
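On newer Selenium releases, where the Guava Predicate overload of until() is no longer available, roughly the same wait can be written with ExpectedConditions. A minimal sketch, reusing the same getWebDriver(), getElementName() and elementId members assumed above:
public void waitUntilReady() {
    // Wait up to 30 seconds for the element to be present in the DOM.
    new WebDriverWait(getWebDriver(), 30)
            .withMessage("\"" + getElementName() + "\" is not ready in 30 seconds")
            .until(ExpectedConditions.presenceOfElementLocated(By.id(elementId)));
}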
I also switch to the default content before finding the element:
WebElement getWebElement() {
    WebElement webElement;
    try {
        getWebDriver().switchTo().defaultContent();
        webElement = getWebDriver().findElement(By.id(elementId));
    } catch (NoSuchElementException ex) {
        getLogger().log(Level.SEVERE, "Element id '" + elementId + "' does not exist");
        throw ex;
    }
    return webElement;
}
If it doesn't work, you can have <body id="container"> and then use driver.findElement(By.id("container")).findElement(By.id("your target element")).

Get Hit Count of any url / resource using web crawler

I have made a web crawler in Java. It traverses the links present on each page recursively. Now I want to get the count of hits a particular page got. Is that possible via the web crawler? Since we don't have any access to the server code, we can't add any counter there to count the hits. Please suggest a solution. Thanks.
The basic structure of the code is:
-> Get the HTML source code of the URL.
-> Find the reachable links in the HTML code and put them in a list.
-> Take the next link from the list and continue the same until the list becomes empty.
I just want to show the hit count for each link.
One thing I can suggest is to wrap your link in a class and let it have a variable called count to keep a record of the hits. So basically you will have a list of Link objects. Example below:
public class Link {

    private String url;
    private int count = 0;

    public Link(String url) {
        this.url = url; // initialise your link class with a url
    }

    public String getUrl() {
        increment();
        return url;
    }

    public void increment() {
        count++;
    }

    public int getCount() {
        return count;
    }
}
Then count it like this:
List<Link> links.... // initialise your links
Document doc = Jsoup.connect(links.get(i).getUrl()).get();
This way, every time your URL is accessed, the count is incremented to keep a record of the total hits.
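Tying that into the crawl structure described in the question (frontier list, fetch with Jsoup, collect links), a rough sketch could look like the following; the HitCountCrawler class, the lookupOrAdd helper and the Map/Set bookkeeping are my own illustration around the Link class above:
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HitCountCrawler {

    // One Link object per distinct URL, so hits accumulate in one place.
    private final Map<String, Link> linksByUrl = new HashMap<>();
    private final Set<String> visited = new HashSet<>();

    private Link lookupOrAdd(String url) {
        return linksByUrl.computeIfAbsent(url, Link::new);
    }

    public void crawl(String startUrl) {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // fetch each page only once
            }
            try {
                // getUrl() increments the hit count for this visit.
                Document doc = Jsoup.connect(lookupOrAdd(url).getUrl()).get();
                for (Element a : doc.select("a[href]")) {
                    String href = a.attr("abs:href");
                    // Every page that links to href counts as one more hit.
                    lookupOrAdd(href).increment();
                    frontier.add(href);
                }
            } catch (IOException e) {
                System.err.println("For '" + url + "': " + e.getMessage());
            }
        }
        // Report the accumulated hit counts.
        linksByUrl.forEach((u, l) -> System.out.println(u + " -> " + l.getCount() + " hits"));
    }
}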

Is it possible to refresh Select objects in Selenium when testing ajax

When I was working on my selenium tests, I ran into an issue when I was testing some ajax functionality on a website. I was getting an error Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: Element is no longer attached to the DOM.
After looking up a bunch of stuff, I know the reason that I am getting this error is because the element that I am accessing through my first Select object is considered stale, since the ajax reloaded that section of the site.
In order to get around this exception, I just created a new select object each time. The xpath does not change when the page is reloaded.
Is it possible to just refresh the Select with the new xpath to the object, instead of creating a new one each time?
Thanks for the help.
public static boolean ajaxFunctionalityFF() throws InterruptedException {
    int rowCount = driver.findElements(By.xpath("//table[@class='classname']/tbody/tr")).size();
    rowSizes.add(rowCount);
    Select ajaxSelector = new Select(driver.findElement(By.id("edit-term")));
    ajaxSelector.selectByVisibleText("-Beef");
    Thread.sleep(4000);
    rowCount = driver.findElements(By.xpath("//table[@class='classname']/tbody/tr")).size();
    rowSizes.add(rowCount);
    totalElements = totalElements + rowCount;
    Select ajaxSelector2 = new Select(driver.findElement(By.id("edit-term"))); // create a new one to fix the stale element exception
    ajaxSelector2.selectByVisibleText("-Cattle");
    Thread.sleep(4000);
    rowCount = driver.findElements(By.xpath("//table[@class='classname']/tbody/tr")).size();
You will need to fetch it each time that section of the HTML is refreshed. I would do something like
private static By selectLocator = By.id("edit-term");

public static boolean ajaxFunctionalityFF() throws InterruptedException
{
    ...
    Select ajaxSelector = getSelect();
    ...
    ajaxSelector = getSelect();
    ajaxSelector.selectByVisibleText("-Cattle");
    ...
}

public static Select getSelect()
{
    return new Select(driver.findElement(selectLocator));
}
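To get rid of the Thread.sleep(4000) calls as well, one option (just a sketch, assuming the ajax call replaces the table rows) is to grab a reference to one of the old rows, make the selection, and then wait for that row to go stale before re-counting:
// Hypothetical helper: select an option and wait for the ajax refresh to finish.
public static int selectAndCountRows(String optionText) {
    By rowLocator = By.xpath("//table[@class='classname']/tbody/tr");
    WebElement oldRow = driver.findElement(rowLocator);

    getSelect().selectByVisibleText(optionText);

    // The old row detaches from the DOM when ajax replaces the table,
    // so waiting for staleness tells us the refresh has happened.
    new WebDriverWait(driver, 10).until(ExpectedConditions.stalenessOf(oldRow));

    return driver.findElements(rowLocator).size();
}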
One workaround that I usually use for such cases is a retry loop like the one below:
do {
    try {
        WebElement element = findThatElement();   // however you locate the element
        element.click();                          // perform some action on it
        break;                                    // success - leave the loop
    } catch (StaleElementReferenceException e) {  // or any other exception you expect here
        // continue the do-while loop and retry
    }
} while (true);

Selenium WebDriver: findElement() in each WebElement from List<WebElement> always returns contents of first element

Page page = new Page();
page.populateProductList(driver.findElement(By.xpath("//div[@id='my_25_products']")));

class Page
{
    public final static String ALL_PRODUCTS_PATTERN = "//div[@id='all_products']";

    private List<Product> productList = new ArrayList<>();

    public void populateProductList(final WebElement pProductSectionElement)
    {
        final List<WebElement> productElements = pProductSectionElement.findElements(By.xpath(ALL_PRODUCTS_PATTERN));
        for (WebElement productElement : productElements) {
            // test 1 - works
            // System.out.println( "Product block: " + productElement.getText() );
            final Product product = new Product(productElement);
            // test 2 - wrong
            // System.out.println( "Title: " + product.getUrl() );
            // System.out.println( "Url: " + product.getTitle() );
            productList.add(product);
        }
        // test 3 - works
        //System.out.println(productElements.get(0).findElement(Product.URL_PATTERN).getAttribute("href"));
        //System.out.println(productElements.get(1).findElement(Product.URL_PATTERN).getAttribute("href"));
    }
}

class Product
{
    public final static String URL_PATTERN = "//div[@class='url']/a";
    public final static String TITLE_PATTERN = "//div[@class='title']";

    private String url;
    private String title;

    public Product(final WebElement productElement)
    {
        url = productElement.findElement(By.xpath(URL_PATTERN)).getAttribute("href");
        title = productElement.findElement(By.xpath(TITLE_PATTERN)).getText();
    }
    /* ... */
}
The webpage I am trying to 'parse' with Selenium has a lot of code. I need to deal with just a smaller portion of it that contains the products grid.
To the populateProductList() call I pass the resulting portion of the DOM that contains all the products.
(Running that XPath in Chrome returns the expected all_products node.)
In that method, I split the products into 25 individual WebElements, i.e., the product blocks.
(Here I also confirm that works in Chrome and returns the list of nodes, each containing the product data)
Next I want to iterate through the resulting list and pass each WebElement into the Product() constructor that initializes the Product for me.
Before I do that I run a small test and print out the product block (see test 1); individual blocks are printed out in each iteration.
After performing the product assignments (again, xpath confirmed in Chrome) I run another test (see test 2).
Problem: this time the test returns only the url/title pair from the FIRST product for EACH iteration.
Among other things, I tried moving the Product's findElement() calls into the loop and still had the same problem. Next, I tried running findElements() and doing a get(i).getAttribute("href") on the result; this time it correctly returned individual product URLs (see test 3).
Then, when I do a findElements(URL_PATTERN) on a single productElement inside the loop, it magically returns ALL product URLs... This means that findElement() always returns the first product from the set of 25 products, whereas I would expect the WebElement to contain only one product.
I think this looks like a problem with references, but I have not been able to come up with anything or find a solution online.
Any help with this? Thanks!
java 1.7.0_15, Selenium 2.45.0 and FF 37
The problem is in the XPATH of the Product locators.
The XPath expression below, in Selenium, means you are looking for a matching element that can be ANYWHERE in the document, not relative to the parent as you are thinking:
//div[@class='url']/a
This is why it always returns the same first element.
So, in order to make it relative to the parent element, it should be as given below (just a . before the //):
public final static String URL_PATTERN = ".//div[@class='url']/a";
public final static String TITLE_PATTERN = ".//div[@class='title']";
Now it searches for a matching child element relative to the parent.
XPath in Selenium works as given below:
/a/b/c --> Absolute - from the root
//a/b --> Matching element which can be anywhere in the document (even outside the parent).
.//a/b --> Matching element inside the given parent
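As a quick illustration of the difference in Java (the product_1 id is purely hypothetical):
WebElement productBlock = driver.findElement(By.id("product_1"));

// "//..." ignores productBlock and searches the whole document,
// so this can return a link from a completely different product.
WebElement anywhere = productBlock.findElement(By.xpath("//div[@class='url']/a"));

// ".//..." starts the search from productBlock itself,
// so this only matches the link inside this product block.
WebElement insideBlock = productBlock.findElement(By.xpath(".//div[@class='url']/a"));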

Selenium 2.0 WebDriver: Element is no longer attached to the DOM error using Java

I'm using the PageObject/PageFactory design pattern for my UI automation. Using Selenium 2.0 WebDriver with Java, I randomly get the error org.openqa.selenium.StaleElementReferenceException: Element is no longer attached to the DOM when I attempt logic like this:
@FindBy(how = How.ID, using = "item")
private List<WebElement> items;

private void getItemThroughName(String name) {
    wait(items);
    for (int i = 0; i < items.size(); i++) {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) { }
        this.wait(items);
        if (items.get(i).getText().contains(name)) {
            System.out.println("Found");
            break;
        }
    }
}
The error randomly happens at the if-statement line. As you can see, I've tried a couple of things to avoid this, like sleeping for a small amount of time or waiting for the element again; neither works 100% of the time.
First, if you really have multiple elements on the page with the ID of "item", you should log a bug or talk to the developers of the site to fix that. An ID is meant to be unique.
As comments on the question already implied, you should use an explicit wait in this case:
private void getItemThroughName(String name) {
    new WebDriverWait(driver, 30)
            .until(ExpectedConditions.presenceOfElementLocated(
                    By.xpath("id('item')[.='" + name + "']")
            ));
    // A timeout exception will be thrown otherwise
    System.out.println("Found");
}
