extract text from HTML segment using standard java - java

I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text.
For example: hello world ----> hello world
Is there a way to extract the text using the Java standard library?
Maybe something more efficient than replacing every open/close tag matched by a regex with an empty string?
thanks,

Don't use regex to parse HTML; use a dedicated parser like HtmlCleaner.
Using a regex will usually work on the first test, and then become more and more complex until it ends up being impossible to adapt.

I will also say it: don't use regex with HTML. ;-)
You can give JTidy a shot.

Don't use regular expressions to parse HTML; use, for instance, jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the news section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which could be applied like this:
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
parserDelegator.parse(in, harvester, true);
Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you define the appropriate callback function:
@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {
    // called for every start tag; react only to <a> and <area>
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {
        // read the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);
        /* ... */
    }
}
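For the original question (getting the inner text with the standard library only), the same callback mechanism can collect text instead of attributes. This is a minimal sketch, not a definitive implementation: the class name HtmlTextExtractor and the method extractText are made up for illustration, and joining text runs with a single space is an assumption that will also put a space inside a word split by inline tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not part of the JDK.
class HtmlTextExtractor {

    // Returns the text of an HTML fragment with runs of whitespace collapsed.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per text run; separate runs with a space (assumption).
                text.append(data).append(' ');
            }
        };
        try {
            // true = ignore any charset declaration found in the markup
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            // A StringReader should not fail, but parse() declares IOException.
            throw new UncheckedIOException(e);
        }
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```

The Swing parser is old and only knows HTML 3.2-era markup, so for anything beyond simple fragments a library such as jsoup is probably the safer choice.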

You can use HTMLParser; it is open source.

Related

How to validate that at least one element in a html string has content?

I have a wysiwyg editor that I can't modify that sometimes returns <p></p> which obviously looks like an empty field to the person using the wysiwyg.
So I need to add some validation on my back-end which uses java.
should be rejected
<p></p>
<p> </p>
<div><p> </p></div>
should be accepted
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
basically as long as any element contains some content we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach it. Thanks.
You may want to look at the jsoup library. It's pretty fast.
It takes HTML and can return the text from it (see the example from their website below).
Extract attributes, text, and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."
I would advise you to do it on the client side, because it is natural for the browser to do this. You need to hook into the send or "save" step of your WYSIWYG editor; a lot of them have this ability.
The JavaScript would be:
function stripIfEmpty(html)
{
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    var contentText = tmp.textContent || tmp.innerText || "";
    if (contentText.trim().length === 0) {
        return "";
    } else {
        return html;
    }
}
If you need to do this on the backend, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed.
If you want to do it on the backend but can accept a fuzzy, non-strict check, then you can use a regex like replaceAll("\\<[^>]*>",""), then trim, then check whether the string is empty.
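The fuzzy regex check described above could look roughly like this in Java. The class and method names (HtmlContentCheck, hasContent) are made up for illustration. Note that entities are not decoded in this sketch, so a paragraph containing only an escaped non-breaking space would still count as content.

```java
// Hypothetical helper; a sketch of the "strip tags, trim, check empty" idea.
class HtmlContentCheck {

    // Returns true if anything other than whitespace is left after removing tags.
    static boolean hasContent(String html) {
        if (html == null) {
            return false;
        }
        // Remove everything that looks like a tag; not a real HTML parse.
        String text = html.replaceAll("<[^>]*>", "");
        return !text.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasContent("<p></p>"));               // false
        System.out.println(hasContent("<div><p> </p></div>"));   // false
        System.out.println(hasContent("<div><p>a</p></div>"));   // true
    }
}
```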
You can use regular expressions (built-in to Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> element whose content is a single run of word characters (letters, digits or underscores), optionally surrounded by whitespace.

Extract text between html tags parsed from xml

Can anyone help me in extracting text from within the html tags to plain text?
I have parsed an XML document and get some output as a body that contains HTML tags; now I want to remove the tags and use the text.
thanks in advance!!!!
You can use an HTML parser like jsoup.
For example
HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
You can get the html using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
div.html();
Try an HTML parser.
If the HTML is escaped, i.e. &lt; instead of <, you might have to decode it first.
Considering your requirements you might try Jericho HTML Parser
Take a look at TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");

Regex to strip HTML tags

I have this HTML input:
<font size="5"><p>some text</p>
<p> another text</p></font>
I'd like to use regex to remove the HTML tags so that the output is:
some text
another text
Can anyone suggest how to do this with regex?
Since you asked, here's a quick and dirty solution:
String stripped = input.replaceAll("<[^>]*>", "");
(Ideone.com demo)
Using regexps to deal with HTML is a pretty bad idea though. The above hack won't deal with stuff like
<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>
etc.
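To make the two failure cases concrete, here is a small sketch that runs the same replaceAll call on them. The class name NaiveStripDemo is made up for illustration.

```java
// Hypothetical demo class showing where the quick and dirty regex goes wrong.
class NaiveStripDemo {

    // The same expression as in the one-liner above.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup works as expected.
        System.out.println(strip("<p>some text</p>"));
        // A '>' inside an attribute value ends the match too early.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));
        // A '<' inside script code is treated as the start of a tag.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>"));
    }
}
```

The second call leaves a stray `">` in front of Hello, and the third keeps fragments of the script while swallowing part of the comparison.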
A better approach would be to use for instance Jsoup. To remove all tags from a string, you can for instance do Jsoup.parse(html).text().
Use a HTML parser. Here's a Jsoup example.
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);
Result:
some text another text
Or if you want to preserve newlines:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
Result:
some text
another text
Jsoup offers more advantages as well. You could easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a very good sign, but if you know the HTML structure in detail beforehand, it'll still be doable.
See also:
Pros and cons of leading HTML parsers in Java
You can go with an HTML parser called Jericho HTML Parser.
You can download it from here - http://jericho.htmlparser.net/docs/index.html
Jericho HTML Parser is a java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.
The presence of badly formatted HTML does not interfere with the parsing
Starting from aioobe's code, I tried something more daring:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);
The code to strip every HTML tag would look like this:
public class HtmlSanitizer {
private static String pattern;
private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};
static {
StringBuffer tags = new StringBuffer();
for (int i=0;i<tagsTab.length;i++) {
tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
if (i<tagsTab.length-1) {
tags.append('|');
}
}
pattern = "</?("+tags.toString()+"){1}.*?/?>";
}
public static String sanitize(String input) {
return input.replaceAll(pattern, "");
}
public final static void main(String[] args) {
System.out.println(HtmlSanitizer.pattern);
System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
}
}
I wrote this in order to be Java 1.4 compliant, for some sad reasons, so feel free to use for each and StringBuilder...
Advantages:
You can generate lists of tags you want to strip, which means you can keep those you want
You avoid stripping stuff that isn't an HTML tag
You keep the whitespaces
Drawbacks:
You have to list all HTML tags you want to strip from your string. Which can be a lot, for example if you want to strip everything.
If you see any other drawbacks, I would really be glad to know them.
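One further drawback may be worth pointing out. Because the pattern has no boundary after the tag name, "p" also matches the beginning of longer names, so something like <people> gets stripped even though it is not in the list. In addition, `.` does not match line breaks by default, so a tag spread over two lines survives. The sketch below compares the two-tag pattern from above with a stricter variant; the class name TagStripper and the constant names are made up for illustration, and the stricter pattern is only one possible fix.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the pattern above with a stricter variant.
class TagStripper {

    // The two-tag pattern as posted above.
    static final Pattern ORIGINAL = Pattern.compile("</?(font|p){1}.*?/?>");

    // Variant: \b ends the tag name, [^>]* also spans line breaks,
    // and CASE_INSENSITIVE replaces the lower/upper case duplication.
    static final Pattern STRICTER =
            Pattern.compile("</?(?:font|p)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "a <people> b";
        System.out.println(strip(ORIGINAL, input)); // <people> is removed
        System.out.println(strip(STRICTER, input)); // <people> is kept
    }
}
```

The stricter variant still shares the general weaknesses of regex-based stripping, for example a `>` inside an attribute value.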
If you use Jericho, then you just have to use something like this:
public String extractAllText(String htmlText){
Source source = new Source(htmlText);
return source.getTextExtractor().toString();
}
Of course you can do the same even with an Element:
for (Element link : links) {
System.out.println(link.getTextExtractor().toString());
}

Read href inside anchor tag using Java

I have an HTML snippet like this :
View or apply to job
I want to read href value XXXXXXXXXX using Java.
Point to note: I am reading the HTML file from a URL using inputstreamreader(url.openStream()).
I am getting a complete HTML file, and above snippet is a part of that file.
How can I do this?
Thanks
Karunjay Anand
Use an HTML parser like Jsoup. The API is easy to learn, and for your case the following code snippet will do:
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3*1000);
Elements links = doc.select("a[href]"); // a with href
for (Element link : links) {
System.out.println("Href = "+link.attr("abs:href"));
}
Use an HTML parser like TagSoup or something similar.
You can use Java's own HTMLEditorKit for parsing HTML. This way you won't need to depend on any third-party HTML parser. Here is an example of how to use it.
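As a rough sketch of that approach (not necessarily what the original link showed), an HTMLEditorKit.ParserCallback can be fed by the JDK's ParserDelegator and asked for the href of every anchor. The class name HrefReader, the method extractHrefs and the example URL are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; collects the href of every <a> start tag.
class HrefReader {

    static List<String> extractHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declaration found in the markup
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // The URL is a placeholder, not the one from the question.
        String html = "<html><body><a href=\"http://example.com/job\">"
                + "View or apply to job</a></body></html>";
        System.out.println(extractHrefs(html));
    }
}
```

When reading from a URL as in the question, the StringReader could be swapped for the InputStreamReader that is already being used.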

How to use regular expressions to parse HTML in Java?

Please can someone tell me a simple way to find href and src tags in an html file using regular expressions in Java?
And then, how do I get the URL associated with the tag?
Thanks for any suggestion.
Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.
Use an HTML Parser instead. See also What are the pros and cons of the leading Java HTML parsers?
The other answers are true. Java Regex API is not a proper tool to achieve your goal. Use efficient, secure and well tested high-level tools mentioned in the other answers.
If your question concerns rather Regex API than a real-life problem (learning purposes for example) - you can do it with the following code:
String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while(m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
And the output is:
<a href='link1'>
link1
<a href='link2'>
link2
Please note that the lazy/reluctant quantifier *? must be used in order to limit each match to a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
Don't use regular expressions; use NekoHTML or TagSoup, which are bridges that provide a SAX or DOM approach, as in XML, to visiting an HTML document.
If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String data for anchor tags and print their href.
Since you're just using anchor tags you should be OK with just regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory :
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin."+ EnviromentController.getOperatingSystemName());
MozillaParser.init(parserLibrary,mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}
I searched the Regular Expression Library (http://regexlib.com/Search.aspx?k=href and http://regexlib.com/Search.aspx?k=src)
The best I found was
((?<html>(href|src)\s*=\s*")|(?<css>url\())(?<url>.*?)(?(html)"|\))
Check out these links for more expressions:
http://regexlib.com/REDetails.aspx?regexp_id=2261
http://regexlib.com/REDetails.aspx?regexp_id=758
http://regexlib.com/REDetails.aspx?regexp_id=774
http://regexlib.com/REDetails.aspx?regexp_id=1437
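One caveat for Java users: the expression above relies on a conditional construct, (?(html)"|\)), which java.util.regex does not support. A rough Java equivalent can be written with plain alternation instead. This is only a sketch under that assumption; the class name UrlFinder is made up, and the negated character classes replace the lazy .*? of the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a Java-friendly variant of the expression above.
class UrlFinder {

    // First alternative: href="..." or src="...". Second alternative: url(...).
    static final Pattern URLS = Pattern.compile(
            "(?:href|src)\\s*=\\s*\"(?<html>[^\"]*)\"|url\\((?<css>[^)]*)\\)");

    static List<String> find(String input) {
        List<String> urls = new ArrayList<>();
        Matcher m = URLS.matcher(input);
        while (m.find()) {
            // Exactly one of the two named groups takes part in each match.
            String html = m.group("html");
            urls.add(html != null ? html : m.group("css"));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(find("<img src=\"a.png\"> <a href = \"b.html\">x</a>"));
    }
}
```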
Regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.
HTML parsers, on the other hand, can parse HTML, that's why they are called HTML parsers.
You should use your favorite HTML parser instead.
Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).
If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.
Try something like this:
/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
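The expression above is written in /.../i notation; in Java the delimiters go away and the i flag becomes Pattern.CASE_INSENSITIVE. A minimal sketch of how it might be used, with made-up class and method names (AnchorHrefRegex, hrefs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper wrapping the expression above for use from Java.
class AnchorHrefRegex {

    static final Pattern A_HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the captured href value of every match, in document order.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"x\" href=\"http://example.com/a\">A</a> "
                + "<A HREF='/b'>B</A> <a href=/c>C</a>";
        System.out.println(hrefs(html));
    }
}
```

It copes with double-quoted, single-quoted and unquoted values, but it stops at the first space, so URLs containing spaces are cut short.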
