I need to convert HTML to plain text so I can send it via email.
Currently I'm using:
Jsoup.parse(html).wholeText();
This preserves line breaks, but not lists. Something like
- List item
- List item 2
- Nested list item
gets converted to List itemList item2Nested list item
How can I keep most of the text formatting, but remove all HTML tags, images, links, etc.?
What you're asking for is to render HTML (not parse it; though parsing it is, naturally, part of any HTML rendering engine). Not render it the way e.g. Chromium would render it (as an image to a screen), but to render it into a string.
This is highly complicated, and involves CSS support as well. In essence, what you are asking for is multiple person-years of effort, and as far as I know no library exists that does it. You can have a look at text-based HTML renderers such as Lynx or w3m: you can probably install them and execute them with ProcessBuilder. This does, of course, make your app entirely arch+OS dependent, and you'll have to ship a w3m or lynx binary for each and every platform you want to support, or ask whoever installs your app to also install lynx and/or w3m and tell your app where it is. Note that lynx/w3m tend to assume full terminal support, meaning: bold, colours, etc.
Imagine an HTML page that doesn't use <ul> and <li> to create a bulleted list, but instead uses some CSS to make something that looks a lot like a bulleted list. Or what if inline CSS is used to align something to the right? Presumably then you would expect the string to also do this right alignment, except that is completely impossible unless either [A] you know the size of the 'window' the string will be rendered into, or [B] the output is not basic text strings but some sort of markup language that supports right aligning (be it HTML or RTF or similar), or [C] terminal command sequences are available to move the cursor around.
This should highlight how your question is in essence 'weird' - it's either incredibly complicated, or a seemingly arbitrary tiny subselection of what HTML can do.
If the latter piques your interest, it isn't too difficult to write a simplistic tree walker that inserts "\n * " any time a <li> element inside a <ul> is visited, and something like String.format("\n%2d. ", counter) any time a <li> is visited inside an <ol>.
In other words, given that what you ask for is either impossible or is an arbitrary choice of HTML and CSS stylings that you do and don't want to support, write it yourself. If truly you are only interested specifically in <ol>/<ul> based lists and nothing else, this will be about a page full of code and no more.
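A minimal sketch of such a tree walker, using Jsoup (assuming jsoup is on the classpath; the class name and the exact bullet/number formatting are illustrative choices, not a standard API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ListAwareText {

    // Renders HTML to plain text, keeping <ul>/<ol> lists as " * " / "%2d. " lines.
    public static String render(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        walk(doc.body(), sb);
        return sb.toString();
    }

    private static void walk(Element el, StringBuilder sb) {
        for (Node node : el.childNodes()) {
            if (node instanceof TextNode) {
                sb.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element child = (Element) node;
                String tag = child.tagName();
                if (tag.equals("ul") || tag.equals("ol")) {
                    boolean ordered = tag.equals("ol");
                    int i = 1;
                    for (Element li : child.children()) {
                        if (!li.tagName().equals("li")) continue;
                        sb.append(ordered ? String.format("\n%2d. ", i++) : "\n * ");
                        walk(li, sb); // recurse so nested lists are also rendered
                    }
                    sb.append("\n");
                } else {
                    walk(child, sb); // flatten everything else to its text
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(render("<p>Todo:</p><ul><li>List item</li><li>List item 2</li></ul>"));
    }
}
```

Anything outside a list is simply flattened to its text, which is exactly the "arbitrary tiny subselection" trade-off: every other styling is ignored by design.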
Hello, I've been googling for hours now and can't find an answer (or anything close to it).
What I am trying to do: let's say I have this code (very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
And what I want to do is delete a specific number of these closing elements, let's say 2 of them, so the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or I want to delete the opening elements (again a specific number of them, let's say 2) but without knowing their full name (so we can assume that if the real name is id="one_54486464", I know it's one_ ...).
So after deleting I get this result:
<div id="three"></div></div></div>
Can anyone suggest a way to achieve these results? It does not have to involve Jsoup; any better, simpler or more efficient way is welcome :) (But I am using Jsoup to parse the document to get to the point where I am left with the snippet above.)
I hope I explained myself clearly; if you have any questions, please do ask. Thanks :)
EDIT: The closing elements that I want to delete are at the very end of the HTML document (so nothing, nothing is behind them: no body tag, no html tag, nothing).
Please keep in mind that the HTML document has many divs across the whole code, and I want to delete only a specific number at the end of the document.
As for the opening divs, those are at the very beginning of my HTML document and nothing is before them. So I need to remove a specific number from the beginning without knowing their specific ID, only the start of it. Also, each of these divs has its closing tag somewhere in the document, and that closing tag I want to keep.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
String s = e.html().replaceAll("(\\s*</div>){2}$", "");
This will remove the last 2 closing div tags; to change the number of tags to be removed, just change the number between the curly brackets {n}.
(this is just an example and is probably unreliable; you should use some other String methods to decide which parts to discard)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element whose ID starts with some String, you can use e.select("div[id^=two]")
You can find more details on how to select elements in the Jsoup Selector documentation.
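Putting both cases together in one rough sketch (assuming Jsoup; the id prefix one, the count of 2, and the helper name dropTrailingClosers are taken from or invented for the question, and note that Jsoup pretty-prints the HTML it emits):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivTrimmer {

    // Case 1: drop the last n closing </div> tags from an element's outer HTML.
    static String dropTrailingClosers(Element e, int n) {
        return e.outerHtml().replaceAll("(\\s*</div>){" + n + "}$", "");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<div id=\"one\"><div id=\"two\"><div id=\"three\"></div></div></div>");

        // select by id prefix, since only the start of the id is known
        Element one = doc.select("div[id^=one]").first();
        System.out.println(dropTrailingClosers(one, 2));

        // Case 2: keep only the innermost element, then re-add the closing tags.
        String s = doc.select("#three").first().outerHtml() + "</div></div>";
        System.out.println(s);
    }
}
```

As the answer notes, this String-level surgery is fragile; it only holds up because the divs sit at the very start/end of the document.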
After Titus suggested regular expressions, I decided to write a regex for deleting the opening divs too.
So I converted the Jsoup Document to a String, did the parsing on the String, and then converted it back to a Jsoup Document so I could use the Jsoup functions.
ADD: What I was doing was parsing and connecting two pages into one seamless page. So there was no missing opening or closing div; my HTML code stayed free of errors, and therefore I was able to convert it back to a Jsoup Document without complications.
I'm trying to get the watch urls of all uploaded videos in the video manager on YouTube. How do I select all videos on the page? I figured that the class name vm-video-title-content yt-uix-sessionlink exists in the a tag on every video but how can it be used to retrieve all of the href attributes? I'm struggling with the selector.
This is the html snippet I'm basically dealing with:
<a href="..." class="vm-video-title-content yt-uix-sessionlink">THE_VIDEO_TITLE</a>
In Selenium I tried doing
By videoSelector = By.className("vm-video-title-content.yt-uix-sessionlink");
List<WebElement> webElements = webDriver.findElements(videoSelector);
System.out.println(webElements.size());
but it prints a size of 0.
Note that I placed a dot in the class name since compound class names are not supported.
Is my approach promising or is there a better way of doing it?
I think you misunderstood "classes" and how they work in HTML.
Your HTML sample specifies two classes: vm-video-title-content and yt-uix-sessionlink. Note that this is a space-separated list!
Your Java code is asking for a completely different, single class: vm-video-title-content.yt-uix-sessionlink. Note that this lookup is done by exact string comparison!
If you try something like:
By videoSelector = By.className("vm-video-title-content");
you should be a little closer to what you want.
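If you do want to require both classes at once, Selenium's By.cssSelector("a.vm-video-title-content.yt-uix-sessionlink") does exactly that. The same compound class selector can be demonstrated with Jsoup (a sketch; the class name VideoLinks and the sample href are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class VideoLinks {

    // Compound class selector: matches <a> elements carrying BOTH classes,
    // regardless of their order inside the class attribute.
    static Elements select(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a.vm-video-title-content.yt-uix-sessionlink");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/watch?v=abc\" "
            + "class=\"vm-video-title-content yt-uix-sessionlink\">Title</a>";
        for (Element a : select(html)) {
            System.out.println(a.attr("href")); // the watch URL
        }
    }
}
```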
I have a set of 1000 pages (links) that I get by submitting a query to Google. I am using Jsoup. I want to get rid of images, links, menus, videos, etc. and take only the main article from every page.
My problem is that every page has a different DOM tree, so I cannot use the same command for every page! Do you know any way to do this for 1000 pages at once? I guess that I have to use regular expressions. Something like this, perhaps:
textdoc.body().select("[id*=main]").text();//get id that contains the word main
textdoc.body().select("[class*=main]").text();//get class that contains the word main
textdoc.body().select("[id*=content]").text();//get id that contains the word content
But I feel that I will always miss something with this approach. Any better ideas?
Element main = doc.select("div.main").first();
Elements links = main.select("a[href]");
Do all the different pages have a main class for the main article?
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with a short description of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API does this, i.e. gets the different textual parts/blocks, splitting each out in some manner rather than returning a single text (everything in one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the main content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I'm thinking of excluding content inside "menu" and "banner" classes from the HTML, plus consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried parsing the HTML content as described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...;
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
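For instance, to collect only the blocks classified as content into a single string (a sketch under the same assumptions; the class name BlockDump and the use of a StringReader are illustrative, and boilerpipe must be on the classpath):

```java
import java.io.StringReader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

public class BlockDump {

    // Returns only the text of blocks boilerpipe classified as main content.
    static String contentOnly(String html) throws Exception {
        TextDocument doc = new BoilerpipeSAXInput(
                new InputSource(new StringReader(html))).getTextDocument();
        ArticleExtractor.INSTANCE.process(doc);
        StringBuilder sb = new StringBuilder();
        for (TextBlock block : doc.getTextBlocks()) {
            if (block.isContent()) {
                sb.append(block.getText()).append("\n\n");
            }
        }
        return sb.toString();
    }
}
```

Inverting the isContent() check gathers the non-content (boilerplate) blocks the same way.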
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of The New York Times
a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied to case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use some libraries like Goose. It works best on articles/news.
You can also check the JavaScript code that does similar extraction to Goose in the readability bookmarklet.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. do not retrieve div elements)?
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract data from web pages and integrates well with Java.
You use a provided application to design XML files that are read by the RoboServer API to parse web pages. The XML files are built by analyzing, inside the provided application, the pages you wish to parse (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build rules per site.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse out the required details, or use the APIs of the existing site.
Refer to the link below to filter the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.