Remove Duplicate URLs from Elements list in jSoup? - java

When using jSoup to scrape a page, all of the links on the page can be gathered using:
Elements allLinksOnPage = doc.select("a");
Which is great. Now how would one go about removing duplicate URLs from this list? E.g. imagine a /contact-us.html that is linked from both the main navigation and elsewhere on the page.
Once all the duplicate URLs have been removed, the next step is to crawl those unique URLs and continue the loop.
A question on the practicalities of this, though. For the code:
for (Element e : allLinksOnPage) {
    String absUrl = e.absUrl("href");
    // Absolute URL starts with http or https, to make sure it isn't a mailto: link
    if (absUrl.startsWith("http") || absUrl.startsWith("https")) {
        // Check that the URL starts with the original domain name
        if (absUrl.startsWith(getURL)) {
            // Remove duplicate URLs
            // Not sure how to do this bit yet?
            // Add new URLs found on this page to 'allLinksOnPage' to allow this
            // for loop to continue until the entire website has been scraped
        }
    }
}
So, regarding the final part of the loop: imagine that when page-2.html is crawled, more URLs are identified on it and added to the allLinksOnPage variable.
Will the for loop continue for the length of the full list (i.e. 10 links found on page-1.html plus 10 links on page-2.html, so 20 pages crawled in total), or will it only continue for the length of the first 10 links, i.e. the links that existed before 'for (Element e : allLinksOnPage)' was entered?
This will all inevitably end up in a database once the logic is finalised, but I'm looking to keep the logic purely Java-based initially, to avoid lots of reads/writes to the DB, which would slow everything down.

allLinksOnPage is iterated only once. You never retrieve any information about pages you found links to.
You can use a Set and a List for this, however. Furthermore, you can use the URL class to extract the protocol for you.
URL startUrl = ...;
Set<String> addedPages = new HashSet<>();
List<URL> urls = new ArrayList<>();

addedPages.add(startUrl.toExternalForm());
urls.add(startUrl);

while (!urls.isEmpty()) {
    // retrieve a URL not yet crawled
    URL url = urls.remove(urls.size() - 1);
    Document doc = Jsoup.parse(url, TIMEOUT);
    Elements allLinksOnPage = doc.select("a");
    for (Element e : allLinksOnPage) {
        // resolve each href to an absolute URL
        URL absUrl = new URL(e.absUrl("href"));
        switch (absUrl.getProtocol()) {
            case "https":
            case "http":
                // stay on the original domain and only add URLs not already added
                if (absUrl.toExternalForm().startsWith(getURL) && addedPages.add(absUrl.toExternalForm())) {
                    urls.add(absUrl);
                }
        }
    }
}
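One practical note on the sketch above: Jsoup.parse(url, TIMEOUT) throws IOException and new URL(...) throws MalformedURLException (a subclass of IOException), both checked exceptions, so the loop body needs to handle them in real code. A minimal way to keep the crawl alive when a single page or href fails (sketch only, the exception-handling strategy is up to you):
try {
    Document doc = Jsoup.parse(url, TIMEOUT);
    // ... select links and add new URLs as shown above ...
} catch (IOException e) {
    // skip this page (or a malformed href on it) and keep crawling
}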

Related

How can I redirect WebDriver to new page and back for each found link avoiding StaleElementReferenceException?

So, the simplest example would be fetching data from Google search engine results.
Example code:
void test() {
    WebDriver driver = new FirefoxDriver();
    driver.get("https://www.google.com");
    driver.findElement(By.id("lst-ib")).sendKeys("pizza"); // type pizza in search field
    driver.findElement(By.name("btnK")).click(); // perform search
    // for every header link in the search results page
    for (WebElement link : driver.findElements(By.xpath("//h3[@class = 'r']/a"))) {
        link.click(); // click the link
        System.out.println(driver.getCurrentUrl()); // fetch page url or something else
        driver.navigate().back(); // go back to search results
    }
}
But on the second iteration, an exception is thrown:
Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: The element reference is stale: either the element is no longer attached to the DOM or the page has been refreshed
That's because the remaining links are invalidated the moment I click one of them. How can I somehow iterate over them all for this task?
You can use a for loop with an index for this and re-locate the list on each iteration:
int count = 1;
for (int i = 0; i < count; i++) {
    List<WebElement> links = driver.findElements(By.xpath("//h3[@class = 'r']/a"));
    count = links.size();
    links.get(i).click(); // click the link
    System.out.println(driver.getCurrentUrl()); // fetch page url or something else
    driver.navigate().back(); // go back to search results
}
As per your code block, to avoid StaleElementReferenceException I would suggest a change in the approach as follows:
Create a List of your search results, extract the href attribute of each result and store the values in a new List.
Open each href in a new TAB, switch to the newly opened TAB and perform the required operations.
Once you complete the operations on the new TAB, close the TAB and switch back to the main TAB.
Here is a sample code block to perform the above-mentioned steps:
List<WebElement> my_list = driver.findElements(By.xpath("//div[@id='rso']//div[@class='srg']/div[@class='g']//h3/a"));
ArrayList<String> href_list = new ArrayList<String>();
for (WebElement element : my_list)
    href_list.add(element.getAttribute("href"));
for (String myhref : href_list) {
    ((JavascriptExecutor) driver).executeScript("window.open(arguments[0])", myhref);
    // switch to the required TAB, perform the operations, close and switch back to main TAB
    // For demonstration I didn't switch/close any of the TABs
}
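For completeness, one possible way to fill in the switch/close part that the comment above leaves out (a sketch only; it assumes each new tab is closed before the next href is opened, so at most two window handles exist at a time):
String mainTab = driver.getWindowHandle(); // remember the main TAB before opening new ones
for (String myhref : href_list) {
    ((JavascriptExecutor) driver).executeScript("window.open(arguments[0])", myhref);
    for (String handle : driver.getWindowHandles()) {
        if (!handle.equals(mainTab)) {
            driver.switchTo().window(handle); // switch to the newly opened TAB
        }
    }
    System.out.println(driver.getCurrentUrl()); // perform the required operations here
    driver.close();                    // close the TAB
    driver.switchTo().window(mainTab); // switch back to the main TAB
}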

Lazily recurse Java 8 stream

I'm using the Google Cloud Java API to get objects out of Google Cloud Storage (GCS). The code for this reads something like this:
Storage storage = ...
List<StorageObject> storageObjects = storage.objects().list(bucket).execute().getItems();
But this will not return all items (storage objects) in the GCS bucket; it'll only return the first 1000 items in the first "page". So in order to get the next 1000 items one should do:
Objects objects = storage.objects().list(bucket).execute();
String nextPageToken = objects.getNextPageToken();
List<StorageObject> itemsInFirstPage = objects.getItems();
if (nextPageToken != null) {
// recurse
}
What I want to do is to find an item that matches a Predicate while traversing all items in the GCS bucket until the predicate is matched. To make this efficient I'd like to only load the items in the next page when the item wasn't found in the current page. For a single page this works:
Predicate<StorageObject> matchesItem = ...
takeWhile(storage.objects().list(bucket).execute().getItems().stream(), not(matchesItem));
Where takeWhile is copied from here.
And this will load the storage objects from all pages recursively:
private Stream<StorageObject> listGcsPageItems(String bucket, String pageToken) {
    if (pageToken == null) {
        return Stream.empty();
    }
    Storage.Objects.List list = storage.objects().list(bucket);
    if (!pageToken.equals(FIRST_PAGE)) {
        list.setPageToken(pageToken);
    }
    Objects objects = list.execute();
    String nextPageToken = objects.getNextPageToken();
    List<StorageObject> items = objects.getItems();
    return Stream.concat(items.stream(), listGcsPageItems(bucket, nextPageToken));
}
where FIRST_PAGE is just a "magic" String that instructs the method not to set a specific page token (which results in the first page of items).
The problem with this approach is that it's eager, i.e. all items from all pages are loaded before the "matching predicate" is applied. I'd like this to be lazy (one page at a time). How can I achieve this?
I would implement a custom Iterator<StorageObject> or Supplier<StorageObject> which keeps the current page list and the next page token in its internal state, producing StorageObjects one by one.
Then I would use the following code to find the first match:
Optional<StorageObject> result =
        Stream.generate(new StorageObjectSupplier(...))
              .filter(predicate)
              .findFirst();
The supplier will only be invoked until the match is found, i.e. lazily.
Another way is to implement the supplier by page, i.e. class StorageObjectPageSupplier implements Supplier<List<StorageObject>>, and use the stream API to flatten it:
Optional<StorageObject> result =
        Stream.generate(new StorageObjectPageSupplier(...))
              .flatMap(List::stream)
              .filter(predicate)
              .findFirst();
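For illustration, here is a rough sketch of the Iterator variant mentioned above. The Storage, Storage.Objects.List, Objects and StorageObject types and the storage/bucket names are taken from the question; the sketch is untested against the real API and error handling is minimal:
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

class StorageObjectIterator implements Iterator<StorageObject> {
    private final Storage storage;
    private final String bucket;
    private final Deque<StorageObject> buffer = new ArrayDeque<>();
    private String nextPageToken; // null once the last page has been fetched
    private boolean firstPage = true;

    StorageObjectIterator(Storage storage, String bucket) {
        this.storage = storage;
        this.bucket = bucket;
    }

    @Override
    public boolean hasNext() {
        // fetch pages one at a time, only when the buffered items run out
        while (buffer.isEmpty() && (firstPage || nextPageToken != null)) {
            fetchNextPage();
        }
        return !buffer.isEmpty();
    }

    @Override
    public StorageObject next() {
        if (!hasNext()) throw new NoSuchElementException();
        return buffer.removeFirst();
    }

    private void fetchNextPage() {
        try {
            Storage.Objects.List list = storage.objects().list(bucket);
            if (!firstPage) {
                list.setPageToken(nextPageToken);
            }
            Objects objects = list.execute();
            if (objects.getItems() != null) {
                buffer.addAll(objects.getItems());
            }
            nextPageToken = objects.getNextPageToken(); // null on the last page
            firstPage = false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
Usage (requires java.util.Optional, java.util.Spliterators and java.util.stream.StreamSupport):
Optional<StorageObject> result = StreamSupport.stream(
        Spliterators.spliteratorUnknownSize(new StorageObjectIterator(storage, bucket), 0),
        false)
    .filter(predicate)
    .findFirst();
Wrapping the iterator in a stream this way keeps the traversal lazy: a new page is only requested once the already-buffered items have been consumed.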

Get data from table using jSoup

I am looking to get data from the table on http://www.sportinglife.com/greyhounds/abc-guide using jSoup. I would like to put this data into some kind of table within my Java program that I can then use in my code.
I'm not too sure how to do this. I have been playing around with jSoup and am currently able to print each cell of the table using a while loop - but obviously I can't rely on this approach, as the number of cells in the table will change.
Document doc = Jsoup.connect("http://www.sportinglife.com/greyhounds/abc-guide").get();
int n = 0;
while (n < 100) {
    Element tableHeader = doc.select("td").get(n);
    for (Element element : tableHeader.children()) {
        // Here you can do something with each element
        System.out.println(element.text());
    }
    n++;
}
Any idea of how I could do this?
There are just a few things you have to implement to achieve your goal. Take a look at this Groovy script - https://gist.github.com/wololock/568b9cc402ea661de546. Now let's explain what we have here.
List<Element> rows = document.select('table[id=ABC Guide] > tbody > tr')
Here we're specifying that we are interested in every row tr that is an immediate child of tbody, which in turn is an immediate child of the table with id ABC Guide. In return you receive a list of Element objects describing those tr rows.
Map<String, String> data = new HashMap<>()
We will store our result in a simple hash map for further processing, e.g. putting the scraped data into a database.
for (Element row : rows) {
    String dog = row.select('td:eq(0)').text()
    String race = row.select('td:eq(1)').text()
    data.put(dog, race)
}
Now we iterate over every Element and select the content of the first cell as text: String dog = row.select('td:eq(0)').text(). We repeat this step to retrieve the content of the second cell as text: String race = row.select('td:eq(1)').text(). Then we simply put that data into the hash map. That's all.
I hope this example and the provided description help you with developing your application.
EDIT:
Java code sample - https://gist.github.com/wololock/8ccbc6bbec56ef57fc9e
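For reference, a plain-Java sketch along the same lines as the Groovy snippet above (same URL and selectors; the class name AbcGuideScraper is just for illustration and the code has not been run against the live page):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class AbcGuideScraper {
    public static Map<String, String> scrape() throws IOException {
        Document document = Jsoup.connect("http://www.sportinglife.com/greyhounds/abc-guide").get();
        Elements rows = document.select("table[id=ABC Guide] > tbody > tr");
        Map<String, String> data = new HashMap<>();
        for (Element row : rows) {
            String dog = row.select("td:eq(0)").text();  // first cell
            String race = row.select("td:eq(1)").text(); // second cell
            data.put(dog, race);
        }
        return data;
    }
}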

Load a URL only one time

I have an array of URLs I want to load; the array can contain the same URL more than once.
For example, an array with three values: url1, url2, url1.
Then, for each URL, I load the page and extract one of its HTML pieces (for example a particular element).
I have this:
HtmlPage page = null;
for (int i = 0; i < tabUrlSource.length; i++) {
    try {
        page = webClient.getPage(tabUrlSource[i]);
        List<HtmlElement> nbElements = (List<HtmlElement>) page.getByXPath(tabXpathSource[i]);
        if (null != nbElements && !nbElements.isEmpty()) {
            htmlResult = nbElements.get(0).asText();
        }
        ...
But this is not the most efficient approach, because it will load url1 twice and url2 once.
So it behaves as if there were three URLs to load, which makes the processing take longer.
How can I load each URL only once and keep the same final result?
I hope my English is clear, and my question as well.
Regards.
Thank you.
What Keppil answered is correct, but you would have to use the Set for the URLs (i.e. in place of tabUrlSource[i]) rather than a Set<HtmlElement>.
EDIT:
Okay, what is the content of tabUrlSource[i]? Is it of type URL or something custom?
This is how it would look if it is URL:
Set<URL> uniqueURLs = new HashSet<URL>();
for (int i = 0; i < tabUrlSource.length; i++) {
    uniqueURLs.add(tabUrlSource[i]);
}
And then iterate over this Set instead of the tabUrlSource array, like this:
for (Iterator<URL> itr = uniqueURLs.iterator(); itr.hasNext(); ) {
    page = webClient.getPage(itr.next());
    // ...
    // continue with the rest of the code
}
Also, you said you are using index 'i' to associate a URL with an XPath. Will that XPath be the same for the same URL? If so, you can use a HashMap instead, with the URL as key and the XPath as value, so that duplicate keys are simply overridden. Then you can iterate over the map's entries to get the page and use the value to fetch the HtmlElement, as sketched below.
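A rough sketch of that HashMap variant, assuming tabUrlSource is a URL[] and tabXpathSource is a String[] as in the question (other names are illustrative):
Map<URL, String> urlToXpath = new HashMap<URL, String>();
for (int i = 0; i < tabUrlSource.length; i++) {
    // a duplicate URL simply overwrites the previous entry
    urlToXpath.put(tabUrlSource[i], tabXpathSource[i]);
}

for (Map.Entry<URL, String> entry : urlToXpath.entrySet()) {
    // getPage throws checked exceptions; wrap in try/catch as in your original loop
    HtmlPage page = webClient.getPage(entry.getKey());
    List<HtmlElement> nbElements = (List<HtmlElement>) page.getByXPath(entry.getValue());
    if (nbElements != null && !nbElements.isEmpty()) {
        htmlResult = nbElements.get(0).asText();
    }
}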
If the XPaths are not the same for the same URL, you can still use a HashSet, like this:
Set<URL> uniqueURLs = new HashSet<URL>();
HtmlPage page = null;
for (int i = 0; i < tabUrlSource.length; i++) {
    try {
        if (uniqueURLs.contains(tabUrlSource[i])) {
            continue;
        } else {
            uniqueURLs.add(tabUrlSource[i]);
        }
        page = webClient.getPage(tabUrlSource[i]);
        List<HtmlElement> nbElements = (List<HtmlElement>) page.getByXPath(tabXpathSource[i]);
        if (null != nbElements && !nbElements.isEmpty()) {
            htmlResult = nbElements.get(0).asText();
        }
    } catch (Exception e) {
        // handle exceptions as in your original code
    }
}
Hope this helps :)
You could use a Set<HtmlElement> instead of a List. This will remove duplicates automatically.
This of course is dependent on HtmlElement instances being comparable for equality. If they aren't, you could instead add all the URLs to a Set<String> and then iterate over that.
Update
To clarify the second part:
The Set interface is described like this in the Javadocs:
A collection that contains no duplicate elements. More formally, sets
contain no pair of elements e1 and e2 such that e1.equals(e2), and at
most one null element. As implied by its name, this interface models
the mathematical set abstraction.
In other words, to ensure that there are no duplicates, it relies on the elements being comparable via the equals() method. If HtmlElement hasn't overridden this method, the Set will just use Object.equals(), which compares object references instead of the actual data in the HtmlElements.
However, String has overridden the equals() method, and you can therefore be certain that duplicate Strings will be removed from a Set<String>.
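As a tiny illustration of the String route (the URLs here are made up):
Set<String> uniqueUrls = new HashSet<String>();
uniqueUrls.add("http://example.com/page1");
uniqueUrls.add("http://example.com/page2");
uniqueUrls.add("http://example.com/page1"); // equals() matches the first entry, so this add is a no-op
System.out.println(uniqueUrls.size()); // prints 2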

Retrieving Reviews from Amazon using JSoup

I'm using JSoup to retrieve reviews from a particular webpage on Amazon, and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of HTML which contains the reviews, but I want only the text, without all the div tags etc. I then want to write all this information to a file. How can I do this? Thanks!
Use the text() method:
System.out.println(reviews.text());
While text() will get you a bunch of text, you'll first want to use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
    List<Element> revList = new ArrayList<Element>();
    Elements eles = reviews.select("div[style=margin-left:0.5em;]");
    for (Element element : eles) {
        revList.add(element);
    }
    return revList;
}
If you analyze each element, you should see how Amazon further subdivides the information it holds, including the title of the review, the date of the review and the body text.
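For the file-writing part of the question, a minimal sketch (the file name and the reuse of getReviewList(...) and the reviews element from the snippets above are illustrative; requires java.io.PrintWriter and java.io.IOException):
try (PrintWriter out = new PrintWriter("reviews.txt", "UTF-8")) {
    for (Element review : getReviewList(reviews)) {
        out.println(review.text()); // plain text of one review block per line
    }
} catch (IOException e) {
    e.printStackTrace();
}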
