Libs for HTML sanitizing - Java

I'm looking for an HTML sanitizer which I can call via an API to sanitize strings that I get from my webapp. Are there some useful, easy-to-use libraries available? Does anyone know one or two?
I don't need something big; it just has to be able to find unclosed tags and close them.

https://github.com/OWASP/java-html-sanitizer is now marked ready for production use.
A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.
You can use prepackaged policies
Sanitizers.FORMATTING.and(Sanitizers.LINKS)
or the tests show how you can configure your own easily:
new HtmlPolicyBuilder()
    .allowElements("a")
    .allowUrlProtocols("https")
    .allowAttributes("href").onElements("a")
    .requireRelNofollowOnLinks()
or write custom policies to do things like changing h1s to divs with a certain class:
new HtmlPolicyBuilder()
    .allowElements("h1", "p")
    .allowElements(
        new ElementPolicy() {
            public String apply(String elementName, List<String> attrs) {
                attrs.add("class");
                attrs.add("header-" + elementName);
                return "div";
            }
        }, "h1")

JTidy may help you.

The HTML Parser JSoup also supports sanitisation by policy: http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
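For instance, a minimal sketch of jsoup's whitelist-based cleaning (note that recent jsoup releases rename Whitelist to Safelist):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class JsoupCleanExample {
    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a>";
        // basic() allows simple text formatting and links, adds rel="nofollow",
        // and the output is re-serialized, so unclosed tags come back balanced
        String safe = Jsoup.clean(unsafe, Whitelist.basic());
        System.out.println(safe); // <p><a href="http://example.com/" rel="nofollow">Link</a></p>
    }
}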

Apart from JTidy you can also take a look at:
NekoHTML
TagSoup
Getting text in HTML document

http://roberto.open-lab.com/2009/11/05/a-java-html-sanitizer-also-against-xss/

Related

GWT Safe HTML Framework: When to use, and why?

In reading JavaDocs and various GWT articles, I've occasionally run into the following Safe* classes:
SafeHtml
SafeHtmlBuilder
It looks like SafeHtml is somehow used when creating a new Widget or Composite, and helps ensure that the Widget/Composite doesn't execute any scripts on the client-side. Is this the case, or am I way off-base? Can someone provide a code example of SafeHtml being used properly in action?
If so, then what's the point of SafeHtmlBuilder? Do you use it inside of a Widget to somehow "build up" safe HTML?
The simplest way to view SafeHtml is as a String where any HTML markup has been appropriately escaped. This protects against Cross-Site Scripting (XSS) attacks as it ensures, for example, if someone enters their name in a form as <SCRIPT>alert('Fail')</SCRIPT> this is the text that gets displayed when your page is rendered rather than the JavaScript being run.
So instead of having something like:
String name = getValueOfName();
HTML widget = new HTML(name);
You should use:
String name = getValueOfName();
HTML widget = new HTML(SafeHtmlUtils.fromString(name));
SafeHtmlBuilder is like a StringBuilder except that it automatically escapes HTML markup in the Strings you add. So to extend the above example:
String name = getValueOfName();
SafeHtmlBuilder shb = new SafeHtmlBuilder();
shb.appendEscaped("Name: ").appendEscaped(name);
HTML widget = new HTML(shb.toSafeHtml());
There is a good guide to SafeHtml in the GWT documentation that is worth a read.
SafeHtmlBuilder is to SafeHtml what StringBuilder is to String.
As for the Safe* APIs, use them whenever you deal with HTML (or CSS for SafeStyles, or URLs for SafeUri and UriUtils); more precisely, whenever you build HTML/CSS/URLs from parts that will be fed to the browser for parsing, with no exception.
Actually, we were recently discussing whether to deprecate Element.setInnerHTML and other similar APIs (HasHTML) in favor of Element.setInnerSafeHtml and the like (HasSafeHtml).
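A minimal sketch of putting those types together: SafeHtmlTemplates escapes each parameter according to its HTML context, and UriUtils produces a SafeUri from an untrusted URL string. The userUrl/userName inputs here are hypothetical:

import com.google.gwt.core.client.GWT;
import com.google.gwt.safehtml.client.SafeHtmlTemplates;
import com.google.gwt.safehtml.shared.SafeHtml;
import com.google.gwt.safehtml.shared.SafeUri;
import com.google.gwt.safehtml.shared.UriUtils;
import com.google.gwt.user.client.ui.HTML;

public class UserLink {

    interface Templates extends SafeHtmlTemplates {
        @Template("<a href=\"{0}\">{1}</a>")
        SafeHtml link(SafeUri href, String text);
    }

    private static final Templates TEMPLATES = GWT.create(Templates.class);

    public static HTML makeLink(String userUrl, String userName) {
        SafeUri uri = UriUtils.fromString(userUrl);       // neutralizes e.g. javascript: URLs
        return new HTML(TEMPLATES.link(uri, userName));   // the text parameter is HTML-escaped
    }
}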

High performance HTML parsing library [duplicate]

I'm working on an app which scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need data contained in a number of div tags which use a specific CSS class. Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML. This works, but I can't help but feel there is a better solution out there.
Is there any nice way where I could give a class a line of HTML and have some nice methods like:
boolean usesClass(String CSSClassname);
String getText();
String getLink();
Another library that might be useful for HTML processing is jsoup.
Jsoup tries to clean malformed HTML and allows HTML parsing in Java using a jQuery-like tag selector syntax.
http://jsoup.org/
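For the concrete task in the question, a short sketch of selecting divs by class with jsoup (the URL and class name are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupScrapeExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
        for (Element div : doc.select("div.classname")) {   // CSS/jQuery-style selector
            System.out.println(div.text());                 // text content of the div
            for (Element link : div.select("a[href]")) {
                System.out.println(link.attr("abs:href"));  // absolute URL of each link inside
            }
        }
    }
}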
The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you get the XML code (XHTML) there are plenty of tools to handle it. You could get it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out the text content inside them.
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();
        TagNode[] divElements = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }
        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
Several years ago I used JTidy for the same purpose:
http://jtidy.sourceforge.net/
"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
More information on JTidy can be found on the JTidy SourceForge project page."
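A minimal JTidy sketch along those lines (the file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class JTidyExample {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);   // emit XHTML so downstream XML tools can handle it
        tidy.setQuiet(true);   // suppress the usual warning chatter

        Document dom = tidy.parseDOM(new FileInputStream("input.html"),
                                     new FileOutputStream("cleaned.xhtml"));
        // dom can now be queried like any other org.w3c.dom.Document
        System.out.println(dom.getDocumentElement().getNodeName());
    }
}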
You might be interested in TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well-formed XHTML.
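A minimal sketch of TagSoup used as a SAX XMLReader (the class name in the HTML and the input string are made up):

import java.io.StringReader;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupExample {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new Parser(); // TagSoup's SAX-compatible HTML parser

        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("div".equals(localName) && "classname".equals(atts.getValue("class"))) {
                    System.out.println("Found a matching div");
                }
            }
        });

        // malformed input: neither the div nor the b element is ever closed
        reader.parse(new InputSource(new StringReader("<div class=classname><b>bold text")));
    }
}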
The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);
Jericho: http://jericho.htmlparser.net/docs/index.html
Easy to use, supports not-well-formed HTML, and has a lot of examples.
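A small sketch of what Jericho usage looks like (the URL and class name are placeholders):

import java.net.URL;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class JerichoExample {
    public static void main(String[] args) throws Exception {
        Source source = new Source(new URL("http://www.stackoverflow.com"));
        for (Element div : source.getAllElements(HTMLElementName.DIV)) {
            if ("classname".equals(div.getAttributeValue("class"))) {
                System.out.println(div.getTextExtractor().toString()); // text inside the div
            }
        }
    }
}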
HTMLUnit might be of help. It does a lot more stuff too.
http://htmlunit.sourceforge.net/
Let's not forget Jerry, it's jQuery in Java: a fast and concise Java library that simplifies HTML document parsing, traversing and manipulating; includes usage of CSS3 selectors.
Example:
Jerry doc = jerry(html);
doc.$("div#jodd p.neat").css("color", "red").addClass("ohmy");
Example:
doc.form("#myform", new JerryFormHandler() {
public void onForm(Jerry form, Map<String, String[]> parameters) {
// process form and parameters
}
});
Of course, these are just some quick examples to give a feeling for how it all looks.
The nu.validator project is an excellent, high performance HTML parser that doesn't cut corners correctness-wise.
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser. Low-level functionality is provided for applications that wish to perform their own IO and support document.write() with scripting. The parser core compiles on Google Web Toolkit and can be automatically translated into C++. (The C++ translation capability is currently used for porting the parser for use in Gecko.)
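A minimal sketch of its DOM facade, assuming the nu.validator.htmlparser artifact is on the classpath:

import java.io.StringReader;

import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ValidatorNuExample {
    public static void main(String[] args) throws Exception {
        // HtmlDocumentBuilder is a javax.xml.parsers.DocumentBuilder subclass,
        // so the result is an ordinary DOM Document built with the HTML5 algorithm
        HtmlDocumentBuilder builder = new HtmlDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader("<p>broken <b>markup")));
        System.out.println(doc.getDocumentElement().getNodeName()); // "html"
    }
}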
You can also use XWiki HTML Cleaner:
It uses HTMLCleaner and extends it to generate valid XHTML 1.1 content.
If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.

Are there any java libraries for validating user supplied HTML, on the server side?

I have a service which takes user-supplied rich text (it can contain HTML tags) and saves it into the database. That data gets used by some other application. But sometimes the user-supplied data has missing HTML tags or wrong closing tags. I want to validate whether the user-supplied data is valid HTML, and depending on that warn the user.
Are there any java libraries to do HTML validation?
You can try JTidy, but it's too slow for simple HTML cleaning.
If you just want to process HTML you can try NekoHTML; it's lightweight and fast.
You can try JTidy.
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer.
You can use jsoup. Here is an example based on the project README:
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
...
String markup = "<body><head>...";
boolean valid = Jsoup.isValid(markup, Whitelist.relaxed());
The second argument to isValid is a Whitelist object describing which tags and attributes are allowed (e.g. Whitelist.basic() or Whitelist.relaxed()).
Plus, you can easily add this library using Gradle.
Validator.nu, which implements the HTML5 spec, is my pick.
There's a great thing called NekoHTML which is just a thin wrapper over the Apache Xerces parser that turns on error-recovery/correction. It doesn't validate so much as error-correct, so you can process the result as XML, i.e. run it through XPaths or XSLTs. It has worked flawlessly for me for several months on completely arbitrary HTML from 3rd-party sites.
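A minimal NekoHTML sketch along those lines (the input string is made up):

import java.io.StringReader;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoHtmlExample {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser(); // Xerces parser with NekoHTML's error correction
        parser.parse(new InputSource(new StringReader("<p>missing closing tags <b>bold")));

        Document doc = parser.getDocument(); // balanced tree; query it with DOM, XPath or XSLT
        System.out.println(doc.getDocumentElement().getNodeName()); // NekoHTML upper-cases names: HTML
    }
}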

How do I (should I?) use Apache POI HWPFDocument?

I'm thinking about including Apache POI in my application. The main goal is to output RTF documents, but DOC would be nice too. However, the documentation is not very detailed about writing an HWPFDocument, and everything I found on the web isn't helpful at all.
I can read DOC files; that works without any problem. But I really can't see how to write a document. Maybe someone can give me a short code example?
Thanks a lot!
If you want to do RTF: RTF files are plain text and are supported in all versions of Word.
You can use iText for simple stuff:
http://itextdocs.lowagie.com/tutorial/rtf/index.php
or you can export them the hard way:
//-- save as example.doc -------------
{\rtf1\ansi\ansicpg1252\deff0\deflang1033
{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.21.2500;}
\viewkind4\uc1\pard\f0\fs20 Hello World\par
}
Well, it has been a long time since I last used POI. I read that HWPFDocument is now orphaned (see the Apache POI website). I would recommend using the WordML specification released by Microsoft instead.
http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
I have used this method before. The easiest way is to create a WordML template and just replace the placeholder values using XPath.
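A rough sketch of that template approach using only the JDK's XML APIs; the file names and the ${NAME} placeholder convention are made up for illustration:

import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class WordMlTemplateExample {
    public static void main(String[] args) throws Exception {
        // template.xml is a WordML document saved from Word that contains
        // the literal placeholder text ${NAME} somewhere in its body
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File("template.xml"));

        // WordML keeps visible text in w:t elements; find the one holding the placeholder
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node textNode = (Node) xpath.evaluate(
            "//*[local-name()='t'][contains(., '${NAME}')]", doc, XPathConstants.NODE);
        if (textNode != null) {
            textNode.setTextContent(textNode.getTextContent().replace("${NAME}", "John Doe"));
        }

        // write the filled-in document back out; Word opens the result directly
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new DOMSource(doc), new StreamResult(new File("output.xml")));
    }
}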

