Get Hit Count of any url / resource using web crawler - java

I have written a web crawler in Java. It traverses the links present on each page recursively. Now I want to get the number of hits a particular page got. Is that possible via the web crawler? Since we don't have any access to the server code, we can't add a counter there to count the hits. Please suggest a solution. Thanks.
Basic structure of the code is:
-> get the HTML source code of the URL.
-> find the reachable links in the HTML code and put them in a list.
-> take the next link from the list and continue the same until the list becomes empty.
I just want to show the hit count for each link.

One thing I can suggest is to wrap your link in a class and give it a count field to keep a record of its hits. So basically you will have a list of Link objects. Example below:
public class Link {
    private String url;
    private int count = 0;

    public Link(String url) {
        this.url = url; // initialise your link class with a url
    }

    public String getUrl() {
        increment();
        return url;
    }

    public void increment() {
        count++;
    }

    public int getCount() {
        return count;
    }
}
Then count it like this:
List<Link> links.... // initialise your links
Document doc = Jsoup.connect(links.get(i).getUrl()).get();
This way, every time your URL is accessed, the count is incremented to keep a record of total hits.
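For context, a minimal sketch of how a crawl loop might use this Link wrapper and report the counts at the end; the HitCountCrawler class, the seen map, the crawlQueue, the example.com seed, and the Jsoup-based fetching are illustrative assumptions, not part of the original answer:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

public class HitCountCrawler {
    public static void main(String[] args) {
        Map<String, Link> seen = new LinkedHashMap<>(); // URL -> Link wrapper
        Deque<Link> crawlQueue = new ArrayDeque<>();
        Link seedLink = new Link("https://example.com/"); // hypothetical seed URL
        seen.put("https://example.com/", seedLink);
        crawlQueue.add(seedLink);

        while (!crawlQueue.isEmpty()) {
            Link current = crawlQueue.poll();
            Document doc;
            try {
                doc = Jsoup.connect(current.getUrl()).get(); // getUrl() also bumps the counter
            } catch (Exception e) {
                continue; // skip pages that cannot be fetched
            }
            for (Element a : doc.select("a[href]")) {
                String href = a.attr("abs:href");
                if (!href.startsWith("http")) {
                    continue;
                }
                Link link = seen.get(href);
                if (link == null) {
                    link = new Link(href);
                    seen.put(href, link);
                    if (seen.size() < 50) {   // small cap so the sketch terminates
                        crawlQueue.add(link);
                    }
                }
                link.increment();             // count every reference to this URL
            }
        }
        // report the accumulated hit counts (use the map key so getUrl() is not bumped again)
        seen.forEach((url, link) -> System.out.println(url + " -> " + link.getCount()));
    }
}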

Related

How to find all URLs recursively on a website -- java

I have a method which allows me to get all URLs from a page (and, optionally, to check whether each one is valid).
But it works only for one page; I want to check the whole website, so I need to make it recursive.
private static FirefoxDriver driver;

public static void main(String[] args) throws Exception {
    driver = new FirefoxDriver();
    driver.get("https://example.com/");
    List<WebElement> allURLs = findAllLinks(driver);
    report(allURLs);

    // here are my trials for recursion
    for (WebElement element : allURLs) {
        driver.get(element.getAttribute("href"));
        List<WebElement> allUrls = findAllLinks(driver);
        report(allUrls);
    }
}
public static List<WebElement> findAllLinks(WebDriver driver)
{
    List<WebElement> elementList = driver.findElements(By.tagName("a"));
    elementList.addAll(driver.findElements(By.tagName("img")));
    List<WebElement> finalList = new ArrayList<>();
    for (WebElement element : elementList)
    {
        if (element.getAttribute("href") != null)
        {
            finalList.add(element);
        }
    }
    return finalList;
}
public static void report(List<WebElement> allURLs) throws Exception {
    for (WebElement element : allURLs) {
        System.out.println("URL: " + element.getAttribute("href") + " returned "
                + isLinkBroken(new URL(element.getAttribute("href"))));
    }
}
See the comment "here are my trials for recursion". But it goes through the first page, then through the first page again, and that's all.
You're trying to write a web crawler. I am a big fan of code reuse, which is to say I always look around to see if my project has already been written before I spend the time writing it myself. And there are many versions of web crawlers out there. One written by Marilena Panagiotidou pops up early in a Google search. Leaving out the imports, her basic version looks like this:
public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
        //4. Check if you have already crawled the URLs
        //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
                //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }
                //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
                //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");
                //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("http://www.mkyong.com/");
    }
}
Probably the most important thing to note here is how the recursion works. A recursive method is one that calls itself. Your example above is not recursion. You have a method findAllLinks that you call once on a page, and then once for every link found on that page. Notice how Marilena's getPageLinks method calls itself once for every link it finds on a page at a given URL. In calling itself it creates a new stack frame, generates a new set of links from that page, and calls itself again once for every one of those links, and so on.
Another important thing to note about a recursive function is when it stops calling itself. In this case, Marilena's recursive function keeps calling itself until it can't find any new links. If the page you are crawling links to pages outside its domain, this program could run for a very long time. And, incidentally, what probably happens in that case is where this website got its name: a StackOverflowError.
Make sure you are not visiting the same URL twice. Add a table where you store already-visited URLs. Since every page might start with a header that links back to the home page, you might otherwise end up visiting it over and over again, for example.
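To make those two suggestions concrete, here is a minimal sketch of how Marilena's crawler could be bounded; the maxDepth limit and the allowedHost check are assumptions added for illustration, not part of her original code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URI;
import java.util.HashSet;

public class BoundedWebCrawler {

    private final HashSet<String> visited = new HashSet<>(); // the "table" of already-visited URLs
    private final String allowedHost;                        // stay inside the starting domain
    private final int maxDepth;                              // hard cap so the recursion cannot overflow the stack

    public BoundedWebCrawler(String allowedHost, int maxDepth) {
        this.allowedHost = allowedHost;
        this.maxDepth = maxDepth;
    }

    public void getPageLinks(String url, int depth) {
        if (depth > maxDepth || !visited.add(url)) {
            return; // already visited, or too deep
        }
        try {
            String host = new URI(url).getHost();
            if (host == null || !host.endsWith(allowedHost)) {
                return; // skip links that leave the domain
            }
            System.out.println(url);
            Document document = Jsoup.connect(url).get();
            for (Element page : document.select("a[href]")) {
                getPageLinks(page.attr("abs:href"), depth + 1);
            }
        } catch (Exception e) {
            System.err.println("For '" + url + "': " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        new BoundedWebCrawler("mkyong.com", 3).getPageLinks("http://www.mkyong.com/", 0);
    }
}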

Selenium WebDriver: findElement() in each WebElement from List<WebElement> always returns contents of first element

Page page = new Page();
page.populateProductList( driver.findElement( By.xpath("//div[@id='my_25_products']") ) );

class Page
{
    public final static String ALL_PRODUCTS_PATTERN = "//div[@id='all_products']";

    private List<Product> productList = new ArrayList<>();

    public void populateProductList(final WebElement pProductSectionElement)
    {
        final List<WebElement> productElements = pProductSectionElement.findElements(By.xpath(ALL_PRODUCTS_PATTERN));
        for ( WebElement productElement : productElements ) {
            // test 1 - works
            // System.out.println( "Product block: " + productElement.getText() );
            final Product product = new Product(productElement);
            // test 2 - wrong
            // System.out.println( "Title: " + product.getUrl() );
            // System.out.println( "Url: " + product.getTitle() );
            productList.add(product);
        }
        // test 3 - works
        //System.out.println(productElements.get(0).findElement(Product.URL_PATTERN).getAttribute("href"));
        //System.out.println(productElements.get(1).findElement(Product.URL_PATTERN).getAttribute("href"));
    }
}

class Product
{
    public final static String URL_PATTERN = "//div[@class='url']/a";
    public final static String TITLE_PATTERN = "//div[@class='title']";

    private String url;
    private String title;

    public Product(final WebElement productElement)
    {
        url = productElement.findElement(By.xpath(URL_PATTERN)).getAttribute("href");
        title = productElement.findElement(By.xpath(TITLE_PATTERN)).getText();
    }

    /* ... */
}
The webpage I am trying to 'parse' with Selenium has a lot of code. I need to deal with just a smaller portion of it that contains the products grid.
To the populateProductList() call I pass the resulting portion of the DOM that contains all the products.
(Running that XPath in Chrome returns the expected all_products node.)
In that method, I split the products into 25 individual WebElements, i.e., the product blocks.
(Here I also confirm that works in Chrome and returns the list of nodes, each containing the product data)
Next I want to iterate through the resulting list and pass each WebElement into the Product() constructor that initializes the Product for me.
Before I do that I run a small test and print out the product block (see test 1); individual blocks are printed out in each iteration.
After performing the product assignments (again, xpath confirmed in Chrome) I run another test (see test 2).
Problem: this time the test returns only the url/title pair from the FIRST product for EACH iteration.
Among other things, I tried moving the Product's findElement() calls into the loop and still had the same problem. Next, I tried running findElements() and doing a get(i).getAttribute("href") on the result; this time it correctly returned the individual product URLs (see test 3).
Then, when I do a findElements(URL_PATTERN) on a single productElement inside the loop, it magically returns ALL the product URLs... This means that findElement() always returns the first product from the set of 25 products, whereas I would expect the WebElement to contain only one product.
I think this looks like a problem with references, but I have not been able to come up with anything or find a solution online.
Any help with this? Thanks!
java 1.7.0_15, Selenium 2.45.0 and FF 37
The problem is in the XPATH of the Product locators.
The xpath expression below in Selenium means you are looking for a matching element that CAN BE ANYWHERE in the document, not relative to the parent as you are thinking!!
//div[@class='url']/a
This is why it always returns the same first element.
So, in order to make it relative to the parent element, it should be as given below (just a . before the //):
public final static String URL_PATTERN = ".//div[@class='url']/a";
public final static String TITLE_PATTERN = ".//div[@class='title']";
Now it searches for the matching child element relative to the parent.
XPath in Selenium works as given below:
/a/b/c --> absolute - from the root
//a/b --> matching element which can be anywhere in the document (even outside the parent)
.//a/b --> matching element inside the given parent
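To make the difference concrete, here is a small illustrative sketch (the products-grid XPath, the class name, and the variable names are hypothetical, and driver is assumed to be an active WebDriver) showing how the same findElement call behaves with and without the leading dot:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

class XpathScopeDemo {
    static void demonstrate(WebDriver driver) {
        // one product block from the (hypothetical) products grid
        WebElement productElement = driver.findElements(By.xpath("//div[@id='all_products']/div")).get(3);

        // No leading dot: even though this is called on productElement, Selenium searches the
        // WHOLE document, so it always returns the first matching <a> on the page.
        WebElement firstOnPage = productElement.findElement(By.xpath("//div[@class='url']/a"));

        // Leading dot: the search is restricted to productElement, so this returns the link
        // belonging to this particular product block.
        WebElement thisProductsLink = productElement.findElement(By.xpath(".//div[@class='url']/a"));

        System.out.println(firstOnPage.getAttribute("href"));
        System.out.println(thisProductsLink.getAttribute("href"));
    }
}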

Jsoup selectors returning all values instead of searched values

I'm learning how to use jsoup and I've created a method called search which uses jsoup's selectors contains and containsOwn to search for a given item and return its price. (For now the item name is hardcoded for testing purposes, but the method will later take a parameter to accept any item name.)
The problem I'm having is that the selector isn't working: all the prices on the page are being returned instead of the price for the one item being searched for, in this case "blinds". So in this example, if you follow the link, only one item on that page says blinds and its price is listed as "$30 - $110 original $18 - $66 sale", but every item on that page gets returned instead.
I am aware that with jsoup I can explicitly call the name of the div and just extract the information from it that way. But I want to turn this into a bigger project and also extract prices for the same item from other chains such as Walmart, Sears, Macy's etc., not just the particular website I used in my code. So I can't explicitly call the div name, because that would only solve the problem for one site and not the others; I want an approach that covers the majority of those sites at once.
How do I extract the price associated with its rightful item? Is there any way of doing it so that the item and price extraction will apply to most websites?
I would appreciate any help.
private static String search() {
    Document doc;
    String priceText = null;
    try {
        doc = Jsoup.connect("http://www.jcpenney.com/for-the-home/sale/cat.jump?id=cat100590341&deptId=dept20000011").get();
        Elements divs = doc.select("div");
        HashMap items = new HashMap();
        for (Element element : doc.select("div:contains(blinds)")) {
            //For those items that say "buy 1 get 1 free"
            String buyOneText = divs.select(":containsOwn(buy 1)").text();
            Element all = divs.select(":containsOwn($)").first();
            priceText = element.select(":containsOwn($)").text();
            items.put(element, priceText);
        }
        System.out.println(priceText);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return priceText;
}
If you had at least tried to debug your app, you would surely have spotted your mistake.
Put a breakpoint on this line, for example:
String buyOneText = divs.select(":containsOwn(buy 1)").text();
and you will see that this element in the loop really does contain the "blinds" text (as do all the others that were selected).
I don't see the point of making a super-universal tool that works everywhere - IMO that is not possible, and for every page you have to write your own crawler. In that case your code should probably look something like this (I added a timeout; also, this code does not work fully on my side, as my default currency is PLN):
private static String search() {
    Document doc;
    String priceText = null;
    try {
        doc = Jsoup.connect("http://www.jcpenney.com/for-the-home/sale/cat.jump?id=cat100590341&deptId=dept20000011").timeout(10000).get();
        Elements divs = doc.select("div[class=price_description]");
        HashMap items = new HashMap();
        for (Element element : divs.select("div:contains(blinds)")) {
            //For those items that say "buy 1 get 1 free"
            String buyOneText = divs.select(":containsOwn(buy 1)").text();
            Element all = divs.select(":containsOwn($)").first();
            priceText = element.select(":containsOwn($)").text();
            items.put(element, priceText);
        }
        System.out.println(priceText);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return priceText;
}

Variable Declaration in multi thread usage, Java [Memory Leak Issue]

I'm building a crawler using the Jsoup library in Java.
The code structure is as follows:
public static BoneCP connectionPool = null;
public static Document doc = null;
public static Elements questions = null;

static
{
    // Connection pool created here
}
In the main method, I call the getSeed() method from 10 different threads.
The getSeed() method selects one random URL from the database and forwards it to the processPage() method.
The processPage() method connects to the URL passed from getSeed() using the Jsoup library, extracts all the URLs from it, and adds them all to the database.
This process goes on 24x7.
The problem is:
The processPage() method first connects to the URL sent from getSeed() using:
doc = Jsoup.connect(URL).get();
And then, for each URL that is found on that page, a new connection is made again by Jsoup:
questions = doc.select("a[href]");
for (Element link : questions)
{
    doc_child = Jsoup.connect(link.attr("abs:href")).get();
}
Now, if I declare the doc and questions variables as global variables and null them out after all the processing in processPage(), it solves the memory-leak problem, but the other threads stop because doc and questions get nulled in between. What should I do next?
It's crying "wrong design" if you are using static fields, particularly for that kind of state, and based on your description it seems like it's behaving very thread-unsafe. I don't know why you think you have a memory-leak at hand but whatever it is it's easier to diagnose if stuff is in order.
What I would say is, try getting something working based on something like this:
class YieldLinks implements Callable<Set<URI>> {
    final URI seed;

    YieldLinks(URI seed) {
        this.seed = seed;
    }

    @Override
    public Set<URI> call() throws Exception {
        Set<URI> links = new HashSet<>();
        // fetch the seed page and collect its links here (e.g. with Jsoup)
        return links;
    }
}

public static void main(String[] args) throws Exception {
    Set<URI> seeds = new HashSet<>(); // the starting URIs, however you obtain them
    Set<URI> links = new HashSet<>();
    for (URI uri : seeds) {
        YieldLinks yieldLinks = new YieldLinks(uri);
        links.addAll(yieldLinks.call());
    }
}
Once this single-threaded version works OK, you could look at adding threads.
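As a rough illustration of that last step, here is one way the threading could be added with an ExecutorService; this is a sketch under the assumption that YieldLinks.call() has been fleshed out as above, not part of the original answer:

import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        Set<URI> seeds = new HashSet<>();                         // starting URIs, however you obtain them
        ExecutorService pool = Executors.newFixedThreadPool(10);  // roughly matches the 10 threads in the question
        try {
            // Submit one YieldLinks task per seed; each task keeps its own local state,
            // so there are no shared static Document/Elements fields to corrupt.
            List<Future<Set<URI>>> futures = new ArrayList<>();
            for (URI seed : seeds) {
                futures.add(pool.submit(new YieldLinks(seed)));
            }
            Set<URI> links = new HashSet<>();
            for (Future<Set<URI>> future : futures) {
                links.addAll(future.get());                       // collect results on the main thread
            }
            System.out.println("Discovered " + links.size() + " links");
        } finally {
            pool.shutdown();
        }
    }
}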

problems with RequestForFile() class that is using User class in Java

Okay, I'll try to be direct.
I am working on a file-sharing application that is based on a common client/server architecture. I also have a HandleClient class, but that is not particularly important here.
What I want to do is allow users to search for a particular file that may be stored in the shared folders of other users. For example, three users are connected to the server and they all have their respective shared folders. One of them wants to search for a file named "Madonna", and the application should list all files containing that name, and next to each file name there should be information about the user(s) that have the wanted file. That information can be either a username or an IP address. Here is the User class, the way it needs to be written (that's how my superiors wanted it):
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;

public class User {

    public static String username;
    public static String ipAddress;

    public User(String username, String ipAddress) {
        username = username.toLowerCase();
        System.out.println(username + " " + ipAddress);
    }

    public static void fileList() {
        Scanner userTyping = new Scanner(System.in);
        String fileLocation = userTyping.nextLine();
        File folder = new File(fileLocation);
        File[] files = folder.listFiles();
        ArrayList<String> list = new ArrayList<String>();
        for (int i = 0; i < files.length; i++) {
            list.add(i, files[i].toString().substring(fileLocation.length()));
            System.out.println(list.get(i));
        }
    }

    public static void main(String args[]) {
        System.out.println("Insert the URL of your shared folder");
        User.fileList();
    }
}
This class stores the attributes of a particular user (username, IP address) and also creates the list of files from that user's shared folder. The list type is ArrayList; that's how it has to be - again, my superiors told me so.
On the other hand, I need another class called RequestForFile(String fileName) whose purpose is to look for a certain file within the file ArrayLists of all users that are logged in at the moment of the search.
This is how it should look, and this is where I especially need your help, because I get an error and can't complete the class.
import java.util.ArrayList;

public class RequestForFile {

    public RequestForFile(String fileName) {
        User user = new User("Slavisha", "84.82.0.1");
        ArrayList<User> listOfUsers = new ArrayList<>();
        listOfUsers.add(user);

        for (User someUser : listOfUsers) {
            for (String request : User.fileList()) {
                if (request.equals(fileName))
                    System.out.println(someUser + " has that file");
            }
        }
    }
}
The idea is for a user to look through the lists of other users and get back the user(s) that have the wanted file, along with its location.
GUI aside for now, I will get to it when I fix this problem.
Any help is appreciated.
Thanks.
I'm here to answer anything regarding this matter.
There are lots of problems here, such as:
I don't think that this code can compile:
for (String request : User.fileList())
because fileList() does not return anything. Then there's the question of why fileList() is static. That means that all User objects share the same list. I guess you have this because you are trying to test your User object from main().
I think instead you should have coded:
myUser = new User(...);
myUser.fileList();
and then fileList() would not need to be static.
You have now explained your overall problem more clearly, but that reveals some deeper problems.
Let's start at the very top. Your request object: I think it represents one request, for one user, for one file definition. But it needs to go looking in the folders of many users. You add the requesting user to a list, but what about the others? I think this means that you need another class responsible for holding all the users.
Anyway, let's have a class called UserManager.
class UserManager {

    ArrayList<User> allTheUsers;

    public UserManager() {
    }

    // methods here for adding and removing users from the list
    // plus a method for doing the search
    public ArrayList<FileDefinitions> findFile(RequestForFile request) {
        // build the result
    }
}
in the "line 14: for (String request : User.fileList()) {" I get this error: "void type not allowed here" and also "foreach not applicable to expression type"
You need to let User.fileList() return a List<String> and not void.
Thus, replace
public static void fileList() {
    // ...
}
by
public static List<String> fileList() {
    // ...
    return list;
}
To learn more about basic Java programming, I can strongly recommend the Sun tutorials, available in the Trails Covering the Basics chapter here.
It looks like you're getting that error because the fileList() method needs to return something that can be iterated through - which does not include void, which is what that method currently returns. As written, fileList prints information to the console, which is great for your own debugging purposes, but it means that other methods can't get any of the information fileList sends to the console.
On a broader note, why is RequestForFile a separate class? If it just contains one method, you can write it as a static method, or as a method in the class that's going to call it. Also, how will it get the lists of other users? As written there's no way to do so, since you've hard-coded one user.
And looking at the other answers, I'd strongly second djna's suggestion of having some class that acts as the controller/observer of all the Users.
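Pulling those suggestions together, here is a minimal sketch of how the search could work once fileList() returns a List<String> and is no longer static (as recommended above); the UserManager class and the findUsersWithFile method name are illustrative assumptions following djna's suggestion, not code the original poster was required to write:

import java.util.ArrayList;
import java.util.List;

class UserManager {

    private final List<User> allTheUsers = new ArrayList<>();

    public void addUser(User user) {
        allTheUsers.add(user);
    }

    // Search every logged-in user's shared-folder listing for the requested name.
    public List<User> findUsersWithFile(String fileName) {
        List<User> result = new ArrayList<>();
        for (User someUser : allTheUsers) {
            // assumes fileList() has been changed to an instance method returning List<String>
            for (String file : someUser.fileList()) {
                if (file.contains(fileName)) {
                    result.add(someUser);
                    break; // one match is enough to report this user
                }
            }
        }
        return result;
    }
}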
