Selecting all YouTube videos - java

I'm trying to get the watch URLs of all uploaded videos in the Video Manager on YouTube. How do I select all videos on the page? I figured that the class name vm-video-title-content yt-uix-sessionlink exists on the a tag of every video, but how can it be used to retrieve all of the href attributes? I'm struggling with the selector.
This is the html snippet I'm basically dealing with:
<a href="..." class="vm-video-title-content yt-uix-sessionlink">THE_VIDEO_TITLE</a>
In Selenium I tried doing
By videoSelector = By.className("vm-video-title-content.yt-uix-sessionlink");
List<WebElement> webElements = webDriver.findElements(videoSelector);
System.out.println(webElements.size());
but it prints a size of 0.
Note that I placed a dot in the class name since compound class names are not supported.
Is my approach promising or is there a better way of doing it?

I think you misunderstood "classes" and how they work in HTML.
Your HTML sample specifies two classes: vm-video-title-content and yt-uix-sessionlink. Note that this is a space-separated list!
Your Java code is asking for a single, completely different class: vm-video-title-content.yt-uix-sessionlink. Note that class matching is done by exact string comparison!
If you try something like:
By videoSelector = By.className("vm-video-title-content");
you should be a little closer to what you want.
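Combining the two classes is still possible, just not with By.className; a CSS selector can require both at once. A minimal sketch (it assumes a webDriver that is already signed in and sitting on the Video Manager page):

```java
import java.util.ArrayList;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class VideoLinks {
    // Collects the href of every video title link on the current page.
    static List<String> watchUrls(WebDriver webDriver) {
        // Unlike By.className, a CSS selector may demand several classes at once.
        By videoSelector = By.cssSelector("a.vm-video-title-content.yt-uix-sessionlink");
        List<String> urls = new ArrayList<>();
        for (WebElement link : webDriver.findElements(videoSelector)) {
            urls.add(link.getAttribute("href"));
        }
        return urls;
    }
}
```

If the list of videos is paginated or lazily loaded, you still have to scroll or page through first so that the links actually exist in the DOM.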

Related

Selenium (Java): ordered list of webelements but with nested divs

I am working with JavaSE and Selenium WebDriver within Chrome. My task is to find a set of input fields and do stuff with them. The issue is that I have to do stuff in the presented order they are available on the web page.
So I would find them via XPATH, because that's what works in the given web page. Let's say that I have a set of inputs on the following path: .../form/div/div/div
However, for reasons I cannot share, certain types of input fields (such as text and numbers) are on the following path: .../form/div/div
The problem is that one set of the inputs is one div 'deeper' than the others, so when I save them to a List<WebElement> with driver.findElements, I can't really preserve their order.
I thought of finding the inputs with id, but the id names have a space in it which Selenium apparently dislikes. I am not sure if relative XPATH could be of help in this case.
Your comments are appreciated.
I made the mistake of not reading enough about XPath. What I was looking for was the 'and' operator within an XPath expression. If you are a beginner like me, please read about it on w3schools.
Basically the following code solved my issue, as a workaround:
driver.findElements(By.xpath("//input[@required=''] | //select[@required='']"));
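For reference, 'and' combines conditions inside one predicate, while | unions two node sets; findElements then returns the union, which should preserve the on-page document order. A sketch with hypothetical attribute values:

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class OrderedInputs {
    static List<WebElement> requiredVisibleInputs(WebDriver driver) {
        // 'and' joins two conditions on the same element;
        // '|' unions the input and select matches into one node set.
        return driver.findElements(By.xpath(
                "//input[@required='' and not(@type='hidden')] | //select[@required='']"));
    }
}
```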

Outputting Search results using Jsoup in java

I'm trying to create a Java Program, where I can insert a String into a search bar and then record/print out the results.
This site is: http://maple.fm/khroa
I'm fairly new to JSoup and I've spent several hours just reading the HTML code of that page. I have come across variables that could be used to insert the String that I need and get results, although I'm not sure how exactly to do that. Would someone be able to point me in the right direction?
I think you missed the point of JSOUP.
JSOUP can parse a page that is already loaded - it is not used to interact with a page (as you want). You could use Selenium to interact with the page (http://www.seleniumhq.org/) and then use JSOUP to parse the loaded page's source code.
In this case, the search results seem to be all loaded when the page loads, and the Item Search function only filters the (already present) results with JavaScript.
There are no absolute links you could use to get results to a particular search.
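Since the filtering is done client-side anyway, you can reproduce it in Java: parse the loaded page with JSOUP, select every result row, and keep the rows whose text contains the search string. The markup and the "tr.item" selector below are hypothetical; the real row selector has to be read from the page's actual source:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ItemSearch {
    // Returns the text of every result row containing the query (case-insensitive).
    static List<String> matches(String html, String query) {
        Document doc = Jsoup.parse(html);
        List<String> hits = new ArrayList<>();
        // "tr.item" is a placeholder selector, not the real page's markup
        for (Element row : doc.select("tr.item")) {
            if (row.text().toLowerCase().contains(query.toLowerCase())) {
                hits.add(row.text());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr class='item'><td>Red Potion</td></tr>"
                + "<tr class='item'><td>Blue Elixir</td></tr>"
                + "</table>";
        System.out.println(matches(html, "potion"));  // [Red Potion]
    }
}
```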

JSoup get all elements based on class

I am writing a web scraper using JSoup to take prices from the first page of search results on Amazon. For example, you search "hammer" on amazon, the first page of search results comes up, my scraper takes all the prices for each search result and shows them. However, I can't figure out why nothing is printed when I run my program. The HTML for the price figure of an item on Amazon.ca is:
<a class="a-link-normal a-text-normal" href="http://www.amazon.ca/Stanley-51-624-Fiberglass-Hammer-20-Ounce/dp/B000VSSG2K/ref=sr_1_1?ie=UTF8&qid=1436274467&sr=8-1&keywords=hammer"><span class="a-size-base a-color-price s-price a-text-bold">CDN$ 17.52</span></a>
I run my code as follows:
Elements prices = doc.getElementsByClass("a-size-base a-color-price s-price a-text-bold");
System.out.println("Prices: " + prices);
What is returned:
Prices:
How do I get the price value "CDN$ 17.52" in this case?
One way would be doc.select("span.s-price"), another would be doc.getElementsByClass("s-price").
Your code doesn't work because getElementsByClass expects a single class name, and returns all elements which have that class. You've supplied several class names, the function can't cope with that and finds nothing.
The span element you're looking for has several classes applied to it: a-size-base, a-color-price, s-price and a-text-bold. You can look for any one of these classes, and it's also possible to match elements which have all four classes by building a CSS selector like doc.select(".a-size-base.a-color-price.s-price.a-text-bold").
However, you probably want as simple a selector as possible, because Amazon are free to change their CSS styles at any time and can easily break your scraper.
The simpler the scraper is, the more resilient it is to breakage. You might want to look for prices through semantics rather than rendered style, e.g. doc.getElementsContainingOwnText("CDN$") would select elements containing the literal text "CDN$".
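Wiring that together with the snippet above, parsed from a string here so the example does not depend on Amazon's live markup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class PriceScraper {
    public static void main(String[] args) {
        String html = "<a class=\"a-link-normal a-text-normal\" href=\"http://www.amazon.ca/dp/B000VSSG2K\">"
                + "<span class=\"a-size-base a-color-price s-price a-text-bold\">CDN$ 17.52</span></a>";
        Document doc = Jsoup.parse(html);

        // getElementsByClass takes one class name and matches any element
        // that carries "s-price" among its classes.
        Elements prices = doc.getElementsByClass("s-price");
        System.out.println("Prices: " + prices.text());  // Prices: CDN$ 17.52
    }
}
```

In the real scraper, doc would come from Jsoup.connect(searchUrl).get() instead of a literal string.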

Get only text from multiple pages with JSoup

I have a set of 1000 pages(links) that I get by putting a query to Google. I am using JSoup. I want to get rid of images, links, menus, videos, etc. and take only the main article from every page.
My problem is that every page has a different DOM tree, so I cannot use the same command for every page! Do you know any way to do this for 1000 pages at once? I guess I have to use wildcard attribute selectors. Something like this, perhaps:
textdoc.body().select("[id*=main]").text();//get id that contains the word main
textdoc.body().select("[class*=main]").text();//get class that contains the word main
textdoc.body().select("[id*=content]").text();//get id that contains the word content
But I feel that always I will miss something with this. Any better ideas?
Element main = doc.select("div.main").first();
Elements links = main.select("a[href]");
Do all the different pages have a main class for the main article?
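Since no single selector fits all 1000 pages, one pragmatic approach is to try a list of likely selectors in order and keep the first one that yields non-empty text. The candidate list below is a guess to be tuned against your pages, not a guarantee:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MainText {
    // Candidate selectors, roughly from most to least specific;
    // "body" is the catch-all fallback.
    private static final String[] CANDIDATES = {
            "[id*=main]", "[class*=main]", "[id*=content]", "[class*=content]", "body"
    };

    // Returns the text under the first selector that matches something non-empty.
    static String extract(Document doc) {
        for (String selector : CANDIDATES) {
            Elements hits = doc.body().select(selector);
            if (!hits.text().trim().isEmpty()) {
                return hits.text();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<html><body><nav>Menu</nav><div id='main-article'>The actual story.</div></body></html>");
        System.out.println(extract(doc));  // The actual story.
    }
}
```

You will still miss pages whose authors used neither "main" nor "content" anywhere; that is the gap extractors like Boilerpipe try to close.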

How can I extract only the main textual content from an HTML page?

Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard those short texts.
So if an API can return the different textual parts/blocks separately, splitting them in some manner rather than lumping everything into one text (all in only one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, publicity, banners, etc.
I want to try to exclude all that is not related with the content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: All pages are HTML and come from various different sites. I need suggestions on how to exclude these contents.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not an external application (if this can be possible).
I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract HTML.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
// block.isContent() tells you if it's likely to be content or not
// block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have any article in it, but has some content in respect to links, but may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well - it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use some libs like goose. It works best on articles/news.
You can also check JavaScript code that does extraction similar to goose, namely the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the number of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. not retrieve div elements)?
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design xml files read by the roboserver api to parse webpages. The xml files are built by you analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can setup the scheduling, running, and db integration using the provided Java API.
If you're against using software and doing it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site rules.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the opportunity to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.
