Get only text from multiple pages with JSoup - java

I have a set of 1000 pages (links) that I get by submitting a query to Google. I am using Jsoup. I want to get rid of images, links, menus, videos, etc. and take only the main article from every page.
My problem is that every page has a different DOM tree, so I cannot use the same command for every page! Is there any way to do this across all 1000 pages at once? I guess I have to match ids and classes by substring. Something like this, perhaps:
textdoc.body().select("[id*=main]").text();    // get the text of elements whose id contains "main"
textdoc.body().select("[class*=main]").text(); // get the text of elements whose class contains "main"
textdoc.body().select("[id*=content]").text(); // get the text of elements whose id contains "content"
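Expanded into a fuller sketch, the idea I have in mind would look something like this (the selector list, its ordering, and the fallback are just my guesses, not a proven recipe):

import org.jsoup.nodes.Document;

public class MainTextExtractor {
    // Candidate selectors for "the main article", tried in order; purely heuristic.
    private static final String[] CANDIDATES = {
            "[id*=main]", "[class*=main]", "[id*=content]", "[class*=content]", "article"
    };

    public static String extractMainText(Document doc) {
        for (String selector : CANDIDATES) {
            String text = doc.body().select(selector).text();
            if (!text.isEmpty()) {
                return text; // first selector that matches anything wins
            }
        }
        return doc.body().text(); // fall back to the whole page text
    }
}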
But I feel that I will always miss something with this approach. Any better ideas?

Element main = doc.select("div.main").first();
Elements links = main.select("a[href]");
Do all the different pages have a main class for the main article?

Related

JSOUP Deleting closing and/or opening divs only

Hello, I have been googling for hours now and can't find an answer (or anything close to it).
What I am trying to do is, let's say I have this code (very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
What I want to do is delete a specific number of these closing elements, let's say 2 of them, so the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or I want to delete the opening elements (again a specific number of them, let's say 2 again), but without knowing their full name (so we can assume that if the real name is id="one_54486464", I know it is one_...).
So after deleting I get this result:
<div id="three"></div></div></div>
Can anyone suggest a way to achieve these results? It does not have to involve Jsoup; any better, simpler, or more efficient way is welcome. (But I am using Jsoup to parse the document to get to the point I am left with.)
I hope I have explained myself clearly; if you have any questions, please do ask. Thanks :)
EDIT: The closing elements that I want to delete are at the very end of the HTML document (nothing at all comes after them, no body tag, no html tag, nothing).
Please keep in mind that the HTML document has many of these elements across the whole code, and I want to delete only a specific number at the end of the document.
The opening divs are at the very beginning of my HTML document, and nothing comes before them. So I need to remove a specific number of them from the beginning without knowing their full ID, only the start of it. Each of these divs also has a closing tag somewhere in the document, and I want to keep that closing tag there.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
e.html().replaceAll("(\\s*</div>){2}$", "");
This will remove the last 2 closing div tags; to change the number of tags to be removed, just change the number between the curly brackets, {n}.
(This is just an example and is probably unreliable; you should use some other String methods to decide which parts to discard.)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element whose ID starts with some String, you can use e.select("div[id^=two]").
You can find more details on how to select elements in the Jsoup Selector documentation.
After Titus suggested regular expressions, I decided to write a regex for deleting the opening divs too.
So I converted the Jsoup Document to a String, did the parsing on the String, and then converted it back to a Jsoup Document so I could use Jsoup functions.
ADD: What I was doing was parsing and joining two pages into one seamless page, so there were no missing opening or closing divs; the HTML stayed free of errors, and therefore I was able to convert it back to a Jsoup Document without complications.
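For reference, a rough sketch of that flow (the regexes, the count of 2, and the docA/docB names are illustrative only):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Two pages already parsed by Jsoup, to be joined into one seamless page.
String a = docA.body().html();
String b = docB.body().html();
// Drop the last 2 closing </div> tags of page A...
a = a.replaceAll("(\\s*</div>){2}\\s*$", "");
// ...and the first 2 opening <div ...> tags of page B.
b = b.replaceFirst("^(\\s*<div[^>]*>){2}", "");
// The concatenation is balanced again, so it converts back to a Document cleanly.
Document joined = Jsoup.parseBodyFragment(a + b);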

JSoup get all elements based on class

I am writing a web scraper using Jsoup to take prices from the first page of search results on Amazon. For example, you search "hammer" on Amazon, the first page of search results comes up, and my scraper takes the price of each search result and shows it. However, I can't figure out why nothing is printed when I run my program. The HTML for the price figure of an item on Amazon.ca is:
<a class="a-link-normal a-text-normal" href="http://www.amazon.ca/Stanley-51-624-Fiberglass-Hammer-20-Ounce/dp/B000VSSG2K/ref=sr_1_1?ie=UTF8&qid=1436274467&sr=8-1&keywords=hammer"><span class="a-size-base a-color-price s-price a-text-bold">CDN$ 17.52</span></a>
I run my code as follows:
Elements prices = doc.getElementsByClass("a-size-base a-color-price s-price a-text-bold");
System.out.println("Prices: " + prices);
What is returned:
Prices:
How do I get the price value "CDN$ 17.52" in this case?
One way would be doc.select("span.s-price"), another would be doc.getElementsByClass("s-price").
Your code doesn't work because getElementsByClass expects a single class name and returns all elements which have that class. You've supplied several class names; the function can't cope with that and finds nothing.
The span element you're looking for has several classes applied to it: a-size-base, a-color-price, s-price and a-text-bold. You can look for any one of these classes, and it's also possible to match elements which have all four classes by building a CSS selector like doc.select(".a-size-base.a-color-price.s-price.a-text-bold").
However, you probably want as simple a selector as possible, because Amazon are free to change their CSS styles at any time and can easily break your scraper.
The simpler the scraper is, the more resilient it is to breakage. You might want to look for prices through semantics rather than rendered style, e.g. doc.getElementsContainingOwnText("CDN$") would select elements containing the literal text "CDN$".
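Putting that together, a minimal sketch (the search URL and the s-price class reflect Amazon's markup at the time and may well have changed; the user agent is an assumption to avoid being served stripped-down HTML):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://www.amazon.ca/s/?field-keywords=hammer") // hypothetical search URL
        .userAgent("Mozilla/5.0")
        .get();
for (Element price : doc.select("span.s-price")) {
    System.out.println("Price: " + price.text()); // e.g. "CDN$ 17.52"
}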

Selecting all YouTube videos

I'm trying to get the watch URLs of all uploaded videos in the video manager on YouTube. How do I select all videos on the page? I figured out that the class name vm-video-title-content yt-uix-sessionlink exists on the a tag of every video, but how can it be used to retrieve all of the href attributes? I'm struggling with the selector.
This is the HTML snippet I'm basically dealing with:
<a class="vm-video-title-content yt-uix-sessionlink" href="...">THE_VIDEO_TITLE</a>
In Selenium I tried doing
By videoSelector = By.className("vm-video-title-content.yt-uix-sessionlink");
List<WebElement> webElements = webDriver.findElements(videoSelector);
System.out.println(webElements.size());
but it prints a size of 0.
Note that I placed a dot in the class name since compound class names are not supported.
Is my approach promising or is there a better way of doing it?
I think you misunderstood "classes", and how they work in HTML.
Your HTML sample specifies two classes: vm-video-title-content and yt-uix-sessionlink. Note that this is a space-separated list!
Your Java code is asking for one completely different class: vm-video-title-content.yt-uix-sessionlink. Note that the match is done by exact string comparison!
If you try something like:
By videoSelector = By.className("vm-video-title-content");
You should be a little closer to what you want.
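If you do need to match both classes at once, a CSS selector handles compound class names; a sketch (assuming webDriver is already set up and signed in to the video manager):

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

// Match <a> elements carrying BOTH classes, then pull the watch URLs.
By videoSelector = By.cssSelector("a.vm-video-title-content.yt-uix-sessionlink");
List<WebElement> links = webDriver.findElements(videoSelector);
for (WebElement link : links) {
    System.out.println(link.getAttribute("href"));
}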

Does JSoup achieve this?

I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now, I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.", just domainname.topleveldomain, or possibly domainname.subdomain.topleveldomain, or whatever number of subdomains (then the collected links need to be extracted the same way, and I collect the links inside them until I reach a certain number of links, say 100).
I have asked about this in a previous post https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames , and have searched. Jsoup seems like a good solution, but I have not worked with Jsoup before. So before going deeply into it, I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods
So yes, you can connect to a website, extract its HTML, and parse it with jsoup.
The logic of extracting the top-level domain is "your part"; you will need to write that code yourself.
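A minimal sketch of that part (the starting URL is a placeholder; java.net.URI does the host parsing):

import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://example.com").get();
Set<String> hosts = new LinkedHashSet<>(); // a Set drops duplicates for free
for (Element link : doc.select("a[href]")) {
    try {
        String host = new URI(link.attr("abs:href")).getHost(); // "abs:" resolves relative links
        if (host != null) {
            hosts.add(host.replaceFirst("^www\\.", "")); // strip a leading "www."
        }
    } catch (URISyntaxException ignored) {
        // skip malformed URLs
    }
}
System.out.println(hosts);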
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document

How can I extract only the main textual content from an HTML page?

Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions pointing to the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API does this (gets the different textual parts/blocks, splitting each one in some manner, rather than returning a single text, since everything in only one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, publicity, banners, etc.
I want to try to exclude everything that is not related to the content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried an approach to parsing the HTML content, described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
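For instance, with a local copy of the page (a sketch; the file name is made up, and getText(Reader) comes from the extractor base class):

import java.io.FileReader;
import java.io.Reader;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

Reader reader = new FileReader("page.html"); // hypothetical saved page
String text = ArticleExtractor.INSTANCE.getText(reader);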
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of The New York Times
a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use some libraries like goose. It works best on articles/news.
You can also check the JavaScript code that does a similar extraction to goose in the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the MIN_WORDS_SEQUENCE number or be more selective with your selectors (i.e. not retrieve div elements)?
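For example, a quick sketch of that selector-based route (the 50-character cutoff is an arbitrary stand-in for the kind of tweaking mentioned above):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(html); // html = the downloaded page
StringBuilder article = new StringBuilder();
for (Element p : doc.select("p")) {
    if (p.text().length() > 50) { // crude filter against short boilerplate paragraphs
        article.append(p.text()).append('\n');
    }
}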
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design XML files, which are read by the RoboServer API to parse webpages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags, and then build rules per site.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.
