Is there a good way to remove HTML elements that have class "abc" from a Java string? A simple regex like
replaceAll("\\<.*?>","")
will remove all tags, but I want to remove only those tags that have class "abc".
<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>
Remove h1 with class abc only.
Note: I have to do it through regex, not through a parser, because this is the only place where I modify HTML in my code. I don't want an additional JAR in my code.
This should work (the (?i) flag makes the match case-insensitive, so it also matches <H1>):
replaceAll("(?i)<h1[^>]*?class=\"*\'*abc\"*\'*>.*?h1>","")
Try
replaceAll("<[Hh]1 class=['\"]abc['\"]>.*?</[Hh]1>", "")
But note that since regex is not well-suited for this task, there might be unwanted results when it comes to complex HTML input.
For the input
<H1 class="abc">Hey</H1>
<H1 class="xyz">Hello</H1>
the output is
<H1 class="xyz">Hello</H1>
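A slightly more tolerant variant of the regex answers above can be sketched as follows. The class name RemoveH1ByClass and the method removeAbcHeadings are made up for illustration. The sketch assumes the class attribute holds exactly "abc" (not a list such as "abc def") and that h1 elements are not nested; anything more complex is better handled by a parser.

```java
import java.util.regex.Pattern;

class RemoveH1ByClass {
    // CASE_INSENSITIVE matches both <h1> and <H1>; DOTALL lets '.' span line breaks
    // inside the element. (['"]) captures the opening quote and \1 requires the
    // same character as the closing quote.
    private static final Pattern H1_ABC = Pattern.compile(
            "<h1\\b[^>]*\\sclass\\s*=\\s*(['\"])abc\\1[^>]*>.*?</h1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: removes every h1 element whose class attribute is exactly "abc".
    static String removeAbcHeadings(String html) {
        return H1_ABC.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<H1 class=\"abc\">Hey</H1>\n<H1 class=\"xyz\">Hello</H1>";
        // The line break between the two headings is left in place, hence the trim().
        System.out.println(removeAbcHeadings(html).trim());
    }
}
```

Unlike the patterns above, this one also accepts other attributes before or after class, at the cost of being harder to read.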
It's never a good idea to parse HTML using regex, see RegEx match open tags except XHTML self-contained tags
See Which HTML Parser is the best? for alternatives.
For example, using JSoup you could write something like this (untested):
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".abc");
elements.remove();
I have a wysiwyg editor that I can't modify that sometimes returns <p></p> which obviously looks like an empty field to the person using the wysiwyg.
So I need to add some validation on my back-end, which uses Java.
should be rejected
<p></p>
<p> </p>
<div><p> </p></div>
should be accepted
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
Basically, as long as any element contains some content, we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach it. Thanks.
You may want to look at the jsoup library. It's pretty fast.
It takes HTML and can return the text from it (see the example from their website below).
Extract attributes, text, and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."
I would advise you to do it on the client side, because this is a natural job for the browser. You need to hook into the "send" or "save" action of your wysiwyg editor; a lot of them have this ability.
The JavaScript would be:
function stripIfEmpty(html)
{
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
var contentText = tmp.textContent || tmp.innerText || "";
if(contentText.trim().length === 0){
return "";
}else{
return html;
}
}
If you need to do this on the backend, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed.
If you want to do it on the backend and can accept a fuzzy rather than strict check, you can use a regex like replaceAll("\\<[^>]*>","") and then trim and check whether the string is empty.
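The fuzzy backend check described above could look roughly like this in Java. The names EmptyHtmlCheck and isEffectivelyEmpty are invented for this sketch, and treating &nbsp; as whitespace is an added assumption that the original answer does not mention.

```java
class EmptyHtmlCheck {
    // Hypothetical helper: true when the markup contains no visible text.
    static boolean isEffectivelyEmpty(String html) {
        if (html == null) {
            return true;
        }
        String text = html
                .replaceAll("<[^>]*>", "")   // drop everything that looks like a tag
                .replace("&nbsp;", " ")      // assumption: editors often emit &nbsp; for blanks
                .replace('\u00A0', ' ');     // same for a literal non-breaking space
        return text.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEffectivelyEmpty("<p></p>"));                // true
        System.out.println(isEffectivelyEmpty("<div><p> </p></div>"));    // true
        System.out.println(isEffectivelyEmpty("<div><p>a</p></div>"));    // false
    }
}
```

Because this is only a validation step and the stripped text is never stored, the usual problems of regex-based tag stripping matter less here than they would for sanitizing.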
You can use regular expressions (built-in to Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> tag whose content is a single run of word characters (letters, digits or underscores), optionally surrounded by whitespace.
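To see what that pattern accepts, here is a small sketch that compares it with a more lenient alternative. The class ParagraphContentCheck and the second pattern are not from the answer above; they are only an illustration, and neither pattern copes with nested tags such as <p><b>a</b></p>.

```java
import java.util.regex.Pattern;

class ParagraphContentCheck {
    // The pattern from the answer: exactly one run of word characters inside <p>.
    static final Pattern SINGLE_WORD = Pattern.compile("<p>\\s*\\w+\\s*</p>");

    // Illustrative alternative: any text without nested tags that has
    // at least one non-whitespace character.
    static final Pattern ANY_TEXT = Pattern.compile("<p>[^<]*[^<\\s][^<]*</p>");

    static boolean hasSingleWord(String html) {
        return SINGLE_WORD.matcher(html).find();
    }

    static boolean hasAnyText(String html) {
        return ANY_TEXT.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSingleWord("<p>a</p>"));            // true
        System.out.println(hasSingleWord("<p> </p>"));            // false
        System.out.println(hasSingleWord("<p>hello world</p>"));  // false: two words
        System.out.println(hasAnyText("<p>hello world</p>"));     // true
    }
}
```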
I'm trying to build a sitemap by parsing HTML bodies for hrefs that don't contain # (those with hashes are just sub-chapter links within some content pages).
My regexp now: <a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>
I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!
Done it. Just inserted the # too in the [^\"] block. :D
<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
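For reference, this is roughly how the adjusted pattern behaves when used with Pattern and Matcher. The class SitemapLinks, the method extractHrefs and the sample markup are made up for this sketch; the regex itself is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SitemapLinks {
    // Group 1 is the href value; [^\"#]* rejects any href that contains '#'.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>");

    // Hypothetical helper: collects the hash-free href values in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page1.html\">One</a> "
                + "<a href=\"page2.html#sec\">Two</a> "
                + "<a class=\"x\" href=\"#top\">Top</a>";
        System.out.println(extractHrefs(html)); // [page1.html]
    }
}
```

Note that the pattern only handles double-quoted href values; single-quoted or unquoted attributes are skipped silently.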
You should not use regex to parse HTML.
It is better to use an HTML parser such as jsoup (http://jsoup.org), and then:
Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]");
for (Element each: links) {
if (each.attr("href").startsWith("#")) continue;
...
}
So much more painless than using regex, eh!
I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text.
for ex: hello world ----> hello world
Is there a way to extract the text using the Java standard library?
Maybe something more efficient than replacing open/close tags with an empty string via regex?
thanks,
Don't use regex to parse HTML but a dedicated parser like HtmlCleaner.
Using a regex will usually work on the first test, and then become more and more complex until it ends up being impossible to adapt.
I will also say it - don't use regex with HTML. ;-)
You can give JTidy a shot.
Don't use regular expression to parse HTML, use for instance jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the news section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which could be applied like this:
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
parserDelegator.parse(in, harvester, true);
Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you define the appropriate callback method:
@Override
public void handleStartTag(HTML.Tag tag,
MutableAttributeSet mutableAttributeSet, int pos) {
// called for every start tag; only <a> and <area> tags are handled here
if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {
// reading the href attribute of the tag
String address = (String) mutableAttributeSet
.getAttribute(Attribute.HREF);
/* ... */
}
}
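Since the question was about extracting the inner text with the standard library only, here is a minimal sketch that uses the same JDK parser with a handleText callback instead of handleStartTag. The class SwingTextExtractor and the method extractText are hypothetical names, and joining the text chunks with single spaces is a simplification chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTextExtractor {
    // Hypothetical helper: returns the text content of an HTML fragment.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                text.append(data).append(' ');
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse the separators added above into single spaces.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>hello <b>world</b></p>"));
    }
}
```

This parser is old and targets a dated HTML DTD, so its results on modern markup may differ from what a dedicated library produces.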
You can use HTMLParser, which is open source.
I have this HTML input:
<font size="5"><p>some text</p>
<p> another text</p></font>
I'd like to use regex to remove the HTML tags so that the output is:
some text
another text
Can anyone suggest how to do this with regex?
Since you asked, here's a quick and dirty solution:
String stripped = input.replaceAll("<[^>]*>", "");
(Ideone.com demo)
Using regexps to deal with HTML is a pretty bad idea though. The above hack won't deal with stuff like
<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>
etc.
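To make the failure modes concrete, this short sketch runs the same replaceAll call over the two problem inputs. The class name TagStripPitfalls is invented for the example.

```java
class TagStripPitfalls {
    // The quick and dirty approach from above.
    static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Works for simple markup.
        System.out.println(strip("<p>some text</p>"));                 // some text

        // A '>' inside an attribute value ends the match too early.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>")); // ">Hello

        // A '<' in script code is mistaken for the start of a tag.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>")); // if (a ');
    }
}
```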
A better approach would be to use for instance Jsoup. To remove all tags from a string, you can for instance do Jsoup.parse(html).text().
Use a HTML parser. Here's a Jsoup example.
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);
Result:
some text another text
Or if you want to preserve newlines:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
Result:
some text
another text
Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a very good sign in that respect, but if you know the HTML structure in detail beforehand, it'll still be doable.
See also:
Pros and cons of leading HTML parsers in Java
You can go with an HTML parser called Jericho HTML Parser.
You can download it from here: http://jericho.htmlparser.net/docs/index.html
Jericho HTML Parser is a java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.
The presence of badly formatted HTML does not interfere with the parsing
Starting from aioobe's code, I tried something more daring:
String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);
The code to strip every HTML tag would look like this:
public class HtmlSanitizer {
private static String pattern;
private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};
static {
StringBuffer tags = new StringBuffer();
for (int i=0;i<tagsTab.length;i++) {
tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
if (i<tagsTab.length-1) {
tags.append('|');
}
}
pattern = "</?("+tags.toString()+"){1}.*?/?>";
}
public static String sanitize(String input) {
return input.replaceAll(pattern, "");
}
public final static void main(String[] args) {
System.out.println(HtmlSanitizer.pattern);
System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
}
}
I wrote this in order to be Java 1.4 compliant, for some sad reasons, so feel free to use for each and StringBuilder...
Advantages:
You can generate lists of tags you want to strip, which means you can keep those you want
You avoid stripping stuff that isn't an HTML tag
You keep the whitespaces
Drawbacks:
You have to list all HTML tags you want to strip from your string. Which can be a lot, for example if you want to strip everything.
If you see any other drawbacks, I would really be glad to know them.
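One further drawback that can be demonstrated: because the tag name is followed directly by .*?, a listed name also matches as a prefix of a longer name, so "p" strips <pre> as well. The sketch below compares the pattern with a variant that adds a word boundary. The class TagNameBoundary, both method names and the stricter pattern are illustrative assumptions, not part of the answer above.

```java
class TagNameBoundary {
    // Pattern style used above: "p" also matches the start of "pre".
    static String stripLoose(String html) {
        return html.replaceAll("</?(font|p){1}.*?/?>", "");
    }

    // Illustrative variant: \b requires the tag name to end right after the listed name.
    static String stripStrict(String html) {
        return html.replaceAll("</?(font|p)\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String input = "<font size=\"5\"><p>some text</p><pre>code</pre></font>";
        System.out.println(stripLoose(input));  // some textcode
        System.out.println(stripStrict(input)); // some text<pre>code</pre>
    }
}
```

The same prefix issue applies to the generated pattern in HtmlSanitizer whenever the tag list is reduced to a subset.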
If you use Jericho, then you just have to use something like this:
public String extractAllText(String htmlText){
Source source = new Source(htmlText);
return source.getTextExtractor().toString();
}
Of course you can do the same even with an Element:
for (Element link : links) {
System.out.println(link.getTextExtractor().toString());
}