I have some input containing HTML like <br> <b> <i> etc. I need a way to escape only the "bad" HTML that exposes my site to XSS etc.
After hours of Googling I found GWT, which looks kinda promising.
What is the recommended way to escape bad HTML?
Edit:
Let me clear things up.
I am using a JavaScript text editor which outputs HTML. Wouldn't it be much easier if I used something like BBCode?
OWASP AntiSamy is a project for just that. If you need users to be able to submit structured text, look at Markdown (IMHO a lot better than BBCode).
Google Caja is a tool for making third-party HTML, CSS and JavaScript safe to embed in your website.
Play Framework 2 already offers a solution.
The #Html() function filters bad HTML, which is really nice.
I really love play2
You might want to just escape all HTML. If you want users to be able to use basic HTML tags like <b> or <i>, you could replace them with [b] and [i] (if your forum or whatever you're creating can use BBCode), then replace all "<" and ">" with "&lt;" and "&gt;".
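A minimal sketch of the "escape everything" approach in plain Java, with no library. The class and method names (HtmlEscaper, escape) are made up for illustration. Besides "<" and ">", it also escapes "&" and both quote characters, which is an assumption on my part about what you want once the text can end up inside attribute values.

```java
// Hypothetical helper: escapes the characters that let text be read as markup.
final class HtmlEscaper {

    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input only; any user-supplied string works the same way.
        System.out.println(escape("<script>alert('x')</script> & <b>bold</b>"));
    }
}
```

This handles one character at a time and never looks at the input twice, so an "&" that is already part of the output is not escaped again.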
I want to know whether it is possible to retrieve HTML tags and plain text such as
<p>This is text </p> or <div> or This is text
by using XmlPullParser? I read here that it is not recommended. So is there any alternative way, or some simple code, that allows you to retrieve HTML and plain text like I described above? I'm still a beginner in Android. Thank you for your help.
I think your best option (which I have also used) is JSOUP.
JSOUP provides a very convenient API for extracting and manipulating data, using DOM, CSS, and jquery-like methods. JSOUP allows you to scrape and parse HTML from a URL, file, or string and many more.
jSoup: https://jsoup.org/
Here is a nice tutorial (not mine):
http://www.androidbegin.com/tutorial/android-basic-jsoup-tutorial/
JSOUP is a great parser and is one of the most commonly used ones.
Another thing that might be helpful for you is an HTML organizer. A common problem when writing parsers is errors caused by malformed HTML files. This happens more often than you would expect, so an HTML organizer can reduce the number of errors.
A good organizer I used is: Tidy
I need to get everything between
onmouseout="this.style.backgroundColor='#fff'">
and the following <
in this case:
onmouseout="this.style.backgroundColor='#fff'">example<
I would like to get the word example.
Here is a more complicated example of where it should work as well:
onmouseout="this.style.backgroundColor='#fff'">going to drink?<br></span><span title="Juist!" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">Exactly!</span></span></div></div>
So here I need two of them back (and not joined).
Could someone help? I suck at regex.
Someone edited my tag to javascript.
I need a solution to use in Java; I just get a file as plain text, so JavaScript or HTML solutions are not really helpful.
Regex with HTML? Well, if you have to parse only a few lines, then OK. But in general it is better to use an HTML parser (because HTML is not a regular language).
This is pure gold: https://stackoverflow.com/a/1732454/434171
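For the "only a few lines" case, here is a rough Java sketch using java.util.regex. It takes the onmouseout attribute from the question as a literal marker and captures everything up to the next "<". The class and method names (SpanTextExtractor, extract) are invented for this example, and it assumes the marker text never varies (same quotes, same spacing, same colour).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the exact marker quoted in the question.
final class SpanTextExtractor {

    private static final String MARKER =
            "onmouseout=\"this.style.backgroundColor='#fff'\">";

    // Pattern.quote() treats the marker as literal text; [^<]* stops at the next '<'.
    private static final Pattern TEXT_AFTER_MARKER =
            Pattern.compile(Pattern.quote(MARKER) + "([^<]*)<");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TEXT_AFTER_MARKER.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // one entry per match, never joined
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = MARKER + "going to drink?<br></span><span title=\"Juist!\" "
                + "onmouseover=\"this.style.backgroundColor='#ebeff9'\" "
                + MARKER + "Exactly!</span></span></div></div>";
        for (String s : extract(sample)) {
            System.out.println(s);
        }
    }
}
```

If the attribute is ever written differently, this silently finds nothing, which is one more reason to prefer a parser for anything bigger.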
Background: I'm writing a Java program to go through HTML files and replace all the content in tags that are not <script> or <style> with Lorem Ipsum. I originally did this with a regex just removing everything between a > and a <, which actually worked quite well (blasphemous I know), but I'm trying to turn this into a tool others may find useful so I wouldn't dare threaten the sanctity of the universe any more by trying to use regex on html.
I'm trying to use HtmlCleaner, a Java library that attracted me because it has no other dependencies. However, trying to implement it I've been unable to deal with html like this:
<div>
This text is in the div <span>but this is also in a span.</span>
</div>
The problem is simple. When the TagNodeVisitor reaches the div, if I replace its contents with the right amount of lipsum, it will eliminate the span tag. But if I drill down to only TagNodes with no other children, I would miss the first bit of text.
HtmlCleaner has a ContentNode object, but that object has no replace method. Anything I can think of to deal with this seems like it must be far too complicated. Is anyone familiar with a way to deal with this, with HtmlCleaner or some other parsing library you're more familiar with?
You can do pretty much anything you want with jsoup setters.
Would that suit you?
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
HtmlCleaner's ContentNode has a method getContent() that returns a java.lang.StringBuilder. This is mutable and can be changed to whatever value you want.
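The underlying idea, changing text nodes in place instead of replacing an element's whole content, can be sketched without any third-party library. Note that this swaps in the JDK's built-in XML DOM parser; it is not HtmlCleaner or jsoup, and it only accepts well-formed markup, which real-world HTML often is not. The names TextNodeFiller, fill and LIPSUM are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: replaces text nodes only, so child elements survive.
final class TextNodeFiller {

    private static final String LIPSUM = "lorem ipsum dolor sit amet ";

    static String fill(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            replaceText(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process markup", e);
        }
    }

    private static void replaceText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.trim().isEmpty()) {
                node.setNodeValue(filler(text.length())); // same length as the original text
            }
            return;
        }
        String name = node.getNodeName();
        if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
            return; // leave script and style content untouched
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            replaceText(children.item(i));
        }
    }

    private static String filler(int length) {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < length) {
            sb.append(LIPSUM);
        }
        return sb.substring(0, length);
    }

    public static void main(String[] args) {
        System.out.println(fill(
                "<div>This text is in the div <span>but this is also in a span.</span></div>"));
    }
}
```

Because the walk visits the div's own text node and the span's text node separately, neither the leading text nor the span is lost.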
look at the Google map
Is there any parser to get the link (www.google.com/map) from the <a> tag?
Or is the best way just to write a custom one?
jQuery, for instance:
var href = $('a.more-link').attr('href');
There are many third-party solutions, but I am not sure which exist for Java; maybe HTML Agility Pack exists in a version for Java.
But another solution would be to use regex
/<a\s+[^<]*?href\s*=\s*(?:(['"])(.+?)\1.*?|(.+?))>/
Fixed the regex to handle problems suggested in comments.
I looked up some real HTML parsers for Java, in case you find you need more than the regex approach:
http://htmlparser.sourceforge.net/
http://jericho.htmlparser.net/docs/index.html
http://jsoup.org/
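If you do go the regex route from Java, the pattern above could be used roughly like this. This is only a sketch: the class and method names (HrefExtractor, hrefs) are invented, the case-insensitive flag is my addition, and the sample URL is just modelled on the question. Group 2 holds a quoted href value and group 3 an unquoted one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper wrapping the regex from the answer above.
final class HrefExtractor {

    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\s+[^<]*?href\\s*=\\s*(?:(['\"])(.+?)\\1.*?|(.+?))>",
            Pattern.CASE_INSENSITIVE);

    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            // group(2): value inside quotes; group(3): value without quotes
            result.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input only.
        String html = "<a href=\"http://www.google.com/map\">look at the Google map</a>";
        System.out.println(hrefs(html));
    }
}
```

It works on simple anchors like the one in the question; for messy real-world pages one of the parsers listed above is the safer choice.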
My Java program needs to rewrite URLs in HTML (just in time). I am looking for the right tool and wonder whether ANTLR would do the job for me.
For example:
<html><body> <img src="foo.jpg" /> </body></html>
should be rewritten as:
<html><body> <img src="http://foo.com/foo.jpg" /> </body></html>
I want to read/write from/to a stream (byte by byte).
As khmarbaise said, first check whether regular expressions can do it. But there are cases in which they can't [*], and then I think ANTLR might really be a legitimate choice.
[*] For the mathematical background on this, see http://en.wikipedia.org/wiki/Formal_grammar#The_Chomsky_hierarchy
Update
Now that you updated your question, I see what you really want to do: For modifying a complete HTML file, I'd use a parser like NekoHTML, or something similar: http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
Then you can use these to extract the URL. Then
parse only the URL itself, e.g. with regexes, Java's URL class (or sometimes better: URI), or maybe ANTLR
modify the parsed URL
and write out the HTML again, using NekoHTML/...
Do not use regular expressions to parse the entire HTML file! You could use ANTLR for that in theory, but it would be very hard to make that work reliably.
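For the "modify the parsed URL" step, java.net.URI can already turn a relative src into an absolute one. A small sketch, assuming the base is http://foo.com/ as in the question's example; the class and method names (UrlRewriter, toAbsolute) are made up, and extracting the src value from the HTML is left to the parser.

```java
import java.net.URI;

// Hypothetical helper: resolves an attribute value against a base URL.
final class UrlRewriter {

    // Note the trailing slash in the base: without any path, URI.resolve
    // can produce a malformed result.
    static String toAbsolute(String base, String src) {
        // Throws IllegalArgumentException if either string is not a valid URI.
        return URI.create(base).resolve(src).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("http://foo.com/", "foo.jpg"));
        System.out.println(toAbsolute("http://foo.com/", "http://bar.com/a.png"));
    }
}
```

URLs that are already absolute come back unchanged, so the same call can be applied to every src without checking first.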
What about regular expressions?