How to use regular expressions to parse HTML in Java?

Please can someone tell me a simple way to find href and src tags in an html file using regular expressions in Java?
And then, how do I get the URL associated with the tag?
Thanks for any suggestion.

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.
Use an HTML parser instead. See also: What are the pros and cons of the leading Java HTML parsers?

The other answers are right: Java's regex API is not the proper tool for this goal. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.
If your question is about the regex API itself rather than a real-life problem (for learning purposes, for example), you can do it with the following code:
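import java.util.regex.Matcher;
import java.util.regex.Pattern;
String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0)); // the whole matched tag
    System.out.println(m.group(1)); // only the captured href value
}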
And the output is:
<a href='link1'>
link1
<a href='link2'>
link2
Please note that the lazy (reluctant) quantifier *? must be used so that each match stops at the end of a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
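Since the question asks about both href and src, the same idea can be stretched a little. The following is only a sketch for learning purposes, under the same caveats as above: the class name AttrUrlSketch and the method extractUrls are made up for illustration, and the pattern assumes the attribute value is quoted. A back-reference (\2) makes the closing quote match the opening one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttrUrlSketch {
    // Group 1: attribute name, group 2: the opening quote, group 3: the URL.
    // \2 requires the closing quote to be the same character as the opening one.
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\b(href|src)\\s*=\\s*(['\"])(.*?)\\2",
            Pattern.CASE_INSENSITIVE);

    // Returns every quoted href/src value found in the input, in document order.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_ATTR.matcher(html);
        while (m.find()) {
            urls.add(m.group(3));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> <img src=\"pic.png\"> <a HREF=\"link2\">qux</a>";
        System.out.println(extractUrls(html)); // [link1, pic.png, link2]
    }
}
```

This will still be fooled by unquoted values, by attributes such as data-href, and by text inside comments or scripts, which is exactly why the parser-based answers are the safer choice.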

Don't use regular expressions; use NekoHTML or TagSoup, which are bridges that provide a SAX or DOM approach, as with XML, to visiting an HTML document.

If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String data for anchor tags and print their href.
Since you're only dealing with anchor tags you should be OK with just a regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory:
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());
MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}

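I searched the Regular Expression Library (http://regexlib.com/Search.aspx?k=href and http://regexlib.com/Search.aspx?k=src).
The best I found was the following. Note that it relies on the conditional construct (?(html)...), which .NET supports but java.util.regex does not, so it needs adapting before it can be used from Java: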
((?<html>(href|src)\s*=\s*")|(?<css>url\())(?<url>.*?)(?(html)"|\))
Check out these links for more expressions:
http://regexlib.com/REDetails.aspx?regexp_id=2261
http://regexlib.com/REDetails.aspx?regexp_id=758
http://regexlib.com/REDetails.aspx?regexp_id=774
http://regexlib.com/REDetails.aspx?regexp_id=1437

Regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.
HTML parsers, on the other hand, can parse HTML, that's why they are called HTML parsers.
You should use your favorite HTML parser instead.

Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).
If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.
Try something like this:
/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
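The expression above is written with Perl-style /.../i delimiters, which Java does not have. A rough Java translation, assuming Pattern.CASE_INSENSITIVE stands in for the trailing i flag, might look like the sketch below; the class name AnchorHrefSketch and the method hrefs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorHrefSketch {
    // Same expression as in the answer, with the quotes escaped for a Java string
    // literal and the /i flag replaced by CASE_INSENSITIVE.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Collects the first capturing group (the URL) of every match.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"x\" href=\"http://example.com/a\">A</a> <A HREF=/b>B</A>";
        System.out.println(hrefs(html)); // [http://example.com/a, /b]
    }
}
```

Because the pattern only requires the tag name to start with "a", it would also match other tags beginning with that letter if they happened to carry an href attribute.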

Related

Remove html which have class x from string in java

Is there a good way to remove HTML elements that have the class "abc" from a Java string? A simple regex like
replaceAll("\\<.*?>","")
will remove all tags, but I want to remove only the tags that have the class "abc".
<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>
Remove only the h1 with class abc.
Note: I have to do it with a regex rather than a parser, because this is the only place in my code where I modify HTML and I don't want an additional JAR.

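This should work (the (?i) flag makes the match case-insensitive, so it also covers the upper-case <H1> in your input):
replaceAll("(?i)<h1[^>]*?class=[\"']?abc[\"']?[^>]*>.*?</h1>", "")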

Try
replaceAll("<[Hh]1 class=['\"]abc['\"]>.*?</[Hh]1>", "")
But note that since regex is not well-suited for this task, there might be unwanted results when it comes to complex HTML input.
For the input
<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>
the output is
<H1 class="xyz">Hello</H1>

It's never a good idea to parse HTML using regex; see RegEx match open tags except XHTML self-contained tags.
See Which HTML Parser is the best? for alternatives.
For example, using Jsoup you could write something like this (untested):
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".abc"); // every element with class "abc"
elements.remove();                      // detach them from the document
String result = doc.body().html();      // the remaining markup as a string

How to validate that at least one element in a html string has content?

I have a WYSIWYG editor that I can't modify, and it sometimes returns <p></p>, which looks like an empty field to the person using the editor.
So I need to add some validation on my back end, which uses Java.
Should be rejected:
<p></p>
<p> </p>
<div><p> </p></div>
Should be accepted:
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
Basically, as long as any element contains some content, we will accept it and save it.
I am looking for libraries to consider and ideas for how to approach it. Thanks.

You could look at the jsoup library. It's pretty fast.
It takes HTML and can return the text from it (see the example from their website below).
Extract attributes, text, and HTML from elements:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."

I would advise you to do it on the client side, because this is a natural job for the browser. You need to hook into your WYSIWYG editor's send or "save" step; a lot of editors allow this.
The JavaScript would be:
function stripIfEmpty(html) {
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    var contentText = tmp.textContent || tmp.innerText || "";
    if (contentText.trim().length === 0) {
        return "";
    } else {
        return html;
    }
}
If you need to do this on the backend, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed.
If you want to do it on the backend but can accept a fuzzy, non-strict check, then you can use a regex such as replaceAll("\\<[^>]*>",""), then trim the result and check whether the string is empty.
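A minimal sketch of that fuzzy backend check, using only the standard library, could look like this. The class name EmptyHtmlCheckSketch and the method hasVisibleText are hypothetical, and the handling of non-breaking spaces is an assumption on my part (WYSIWYG editors often emit them), not something stated in the question.

```java
final class EmptyHtmlCheckSketch {
    // Fuzzy check: strips anything that looks like a tag and tests what is left.
    // It is not a real HTML parse, so unusual markup can still fool it.
    static boolean hasVisibleText(String html) {
        if (html == null) {
            return false;
        }
        String text = html.replaceAll("<[^>]*>", "") // drop tag-like spans
                .replace("&nbsp;", " ")              // assumption: treat the nbsp entity as blank
                .replace('\u00A0', ' ')              // assumption: same for the literal character
                .trim();
        return !text.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("<div><p> </p></div>"));  // false
        System.out.println(hasVisibleText("<div><p>a</p></div>"));  // true
    }
}
```

A value such as an image-only document would be reported as empty by this check, so whether that is acceptable depends on what the editor can produce.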

You can use regular expressions (built in to Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> element whose content is a single run of word characters (letters, digits or underscores), optionally surrounded by whitespace.

Java Regular Expression: href without hash

I'm trying to build a sitemap by parsing the HTML bodies for hrefs that don't contain # (those with hashes are just sub-chapter links within a content page).
My regexp now: <a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>
I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!

Solved it. I just added # to the [^\"] character class as well. :D
<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
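For completeness, here is a small sketch of how that pattern might be applied. The class name HashlessHrefSketch and the sample markup are made up for illustration; the pattern itself is the one above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashlessHrefSketch {
    // The pattern from the answer: the href value may not contain '"' or '#',
    // so any link with a fragment fails to match and is skipped.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>");

    // Returns the href of every anchor whose URL has no '#' in it.
    static List<String> hrefsWithoutHash(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page1.html\">One</a> "
                + "<a href=\"page1.html#sec2\">Sec</a> "
                + "<a class=\"x\" href=\"page2.html\">Two</a>";
        System.out.println(hrefsWithoutHash(html)); // [page1.html, page2.html]
    }
}
```

Note that the pattern only handles double-quoted href values, as in the original expression.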

You should not use regex to parse HTML.
It's best to use an HTML parser, e.g. http://jsoup.org, and then:
Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]");
for (Element each : links) {
    if (each.attr("href").startsWith("#")) continue; // skip in-page anchors
    ...
}
So much more painless than using regex, eh!

extract text from HTML segment using standard java

I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text.
For example: hello world ----> hello world
Is there a way to extract the text using the Java standard library?
Maybe something more efficient than replacing open/close tags matched by a regex with the empty string?
Thanks,

Don't use regex to parse HTML; use a dedicated parser like HtmlCleaner.
A regex will usually work on the first test, and then grow more and more complex until it ends up impossible to adapt.

I will also say it - don't use regex with HTML. ;-)
You can give JTidy a shot.

Don't use regular expressions to parse HTML; use, for instance, jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
There is also an HTML parser in the JDK, javax.swing.text.html.parser.Parser, which could be applied like this:
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
parserDelegator.parse(in, harvester, true);
Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you define the appropriate callback method on harvester, which is an HTMLEditorKit.ParserCallback:
@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {
    // called for every start tag; only <a> and <area> are of interest here
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {
        // read the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);
        /* ... */
    }
}
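Put together, a self-contained version of the JDK-only approach might look like the sketch below. The class names SwingParserSketch and Harvester are hypothetical, the input is read from a String rather than a URL, and a handleText override is added as one possible way to collect the inner text the question asks about. The exact whitespace of the collected text depends on how the parser splits the text chunks, so no output is claimed for it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingParserSketch {
    // Callback that records href values of <a> tags and every text chunk it is given.
    static final class Harvester extends HTMLEditorKit.ParserCallback {
        final List<String> hrefs = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    hrefs.add(href.toString());
                }
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (text.length() > 0) {
                text.append(' '); // separate chunks with a single space
            }
            text.append(data);
        }
    }

    // Parses the given HTML string and returns the filled-in callback.
    static Harvester harvest(String html) {
        Harvester harvester = new Harvester();
        try {
            new ParserDelegator().parse(new StringReader(html), harvester, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return harvester;
    }

    public static void main(String[] args) {
        Harvester h = harvest("<p>foo <a href=\"link1\">bar</a> baz <a href=\"link2\">qux</a></p>");
        System.out.println(h.hrefs);
        System.out.println(h.text);
    }
}
```

The Swing parser targets an old HTML dialect, so for modern pages a dedicated library such as jsoup is likely to give better results.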

You can use HTMLParser, which is open source.

Regex to strip HTML tags

I have this HTML input:
<font size="5"><p>some text</p>
<p> another text</p></font>
I'd like to use regex to remove the HTML tags so that the output is:
some text
another text
Can anyone suggest how to do this with regex?

Since you asked, here's a quick and dirty solution:
String stripped = input.replaceAll("<[^>]*>", "");
(Ideone.com demo)
Using regexps to deal with HTML is a pretty bad idea though. The above hack won't deal with stuff like
<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>
etc.
A better approach would be to use for instance Jsoup. To remove all tags from a string, you can for instance do Jsoup.parse(html).text().
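To make the failure cases concrete, here is a small sketch that runs the quick-and-dirty replaceAll on the two problem inputs mentioned above. The class name NaiveStripDemo is made up for illustration.

```java
final class NaiveStripDemo {
    // The quick-and-dirty approach: delete everything between '<' and the next '>'.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Works on simple markup.
        System.out.println(strip("<p>some text</p>"));
        // prints: some text

        // A '>' inside an attribute value ends the "tag" too early.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));
        // prints: ">Hello

        // A '<' inside a script is mistaken for the start of a tag.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>"));
        // prints: if (a ');
    }
}
```

In both failing cases the result is neither the original text nor a clean stripped version, which is the kind of silent corruption a parser avoids.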

Use an HTML parser. Here's a Jsoup example.
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);
Result:
some text another text
Or if you want to preserve newlines:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
Result:
some text
another text
Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a very good sign, but if you know the HTML structure in detail beforehand, it'll still be doable.
See also:
Pros and cons of leading HTML parsers in Java

You can use an HTML parser called Jericho HTML Parser.
You can download it from here: http://jericho.htmlparser.net/docs/index.html
Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.
Badly formatted HTML does not interfere with the parsing.

Starting from aioobe's code, I tried something more daring:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);
The code to strip every HTML tag would look like this:
public class HtmlSanitizer {
    private static String pattern;
    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};
    static {
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
        pattern = "</?("+tags.toString()+"){1}.*?/?>";
    }
    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }
    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);
        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }
}
I wrote this to be Java 1.4 compliant, for some sad reasons, so feel free to use a for-each loop and StringBuilder...
Advantages:
You can generate lists of tags you want to strip, which means you can keep those you want
You avoid stripping stuff that isn't an HTML tag
You keep the whitespaces
Drawbacks:
You have to list all HTML tags you want to strip from your string. Which can be a lot, for example if you want to strip everything.
If you see any other drawbacks, I would really be glad to know them.
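One possible further drawback, offered as a sketch rather than a verdict: because the tag name is followed directly by .*?, a short name also matches any longer tag that starts with it. That hardly matters when every tag is stripped, but it does if you only list the tags you want removed. The class TagPrefixSketch and both method names below are hypothetical; stripExact shows one possible tightening with a word boundary and [^>]*.

```java
final class TagPrefixSketch {
    // Pattern built the same way as in the answer above.
    static String stripLoose(String input, String tagAlternation) {
        return input.replaceAll("</?(" + tagAlternation + "){1}.*?/?>", "");
    }

    // Variant: \b stops "b" from matching "br" or "blockquote", and [^>]*
    // keeps the match inside one tag. (?i) replaces the upper/lower-case list.
    static String stripExact(String input, String tagAlternation) {
        return input.replaceAll("(?i)</?(" + tagAlternation + ")\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b><br/>line<blockquote>q</blockquote>";
        System.out.println(stripLoose(html, "b"));
        // prints: boldlineq
        System.out.println(stripExact(html, "b"));
        // prints: bold<br/>line<blockquote>q</blockquote>
    }
}
```

Even the tightened variant shares the general weaknesses of regex-based stripping, such as a '>' inside an attribute value.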

If you use Jericho, then you just have to use something like this:
public String extractAllText(String htmlText) {
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}
Of course you can do the same even with an Element:
for (Element link : links) {
    System.out.println(link.getTextExtractor().toString());
}
