get part with regex - java

I need to get everything between
onmouseout="this.style.backgroundColor='#fff'">
and the following <
in this case:
onmouseout="this.style.backgroundColor='#fff'">example<
I would like to get the word example.
Here is a more complicated example of where it should work as well:
onmouseout="this.style.backgroundColor='#fff'">going to drink?<br></span><span title="Juist!" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">Exactly!</span></span></div></div>
So here I need two of them back (and not joined).
Could someone help? I suck at regex.
Someone edited my tag to javascript.
I need a solution to use in Java; I just get a file as plain text. So JavaScript or HTML solutions are not really helpful.

Regex with HTML? Well, if you only have to parse a few lines, then OK. But in general it is better to use an HTML parser (because HTML is not a regular language).
This is pure gold: https://stackoverflow.com/a/1732454/434171
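If a regex is still what you want for this one narrow case, here is a minimal sketch using only java.util.regex. The class and method names (BetweenMarkerExtractor, extract) are made up for illustration, and it assumes the marker appears in the file exactly as written in the question. Pattern.quote escapes the marker so it is matched literally, and the capture group collects everything up to the next <.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class BetweenMarkerExtractor {
    // Literal text that precedes each wanted fragment (copied from the question).
    private static final String MARKER =
            "onmouseout=\"this.style.backgroundColor='#fff'\">";

    // Pattern.quote escapes regex metacharacters in the marker;
    // group 1 captures every character up to (not including) the next '<'.
    private static final Pattern FRAGMENT =
            Pattern.compile(Pattern.quote(MARKER) + "([^<]*)");

    // Returns each fragment separately, in document order.
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = FRAGMENT.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = MARKER + "example<";
        System.out.println(extract(sample)); // prints [example]
    }
}
```

On the longer example from the question this should return the two fragments as separate list entries, because the onmouseover attribute in between does not match the marker.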

Related

Safe html in java

I have some input containing HTML like <br> <b> <i> etc. I need a way to escape only the "bad" HTML that exposes my site to XSS etc.
After hours of Googling I found GWT, which looks kind of promising.
What is the recommended way to escape bad HTML?
Edit:
Let me clear things up.
I am using a JavaScript text editor which outputs HTML. Wouldn't it be much easier if I used something like BBCode?
OWASP AntiSamy is a project for just that. If you need users to be able to submit structured text, look at markdown (imho a lot better than BBCode).
Google caja is a tool for making third party HTML, CSS and JavaScript safe to embed in your website.
Playframework 2 already offers a solution.
the #Html() function filters bad html, which is really nice.
I really love play2
You might want to just escape all HTML. If you want users to be able to use basic HTML tags like <b> or <i>, then you could just replace them with [b] and [i] (if your forum/whatever you're creating can use BBCode), then just replace all "<" and ">" with "&lt;" and "&gt;".
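A rough sketch of that last idea in plain Java, with no libraries. The names HtmlEscaper, escape and bbcodeThenEscape are hypothetical, and only <b> and <i> are converted. This is an illustration of the escaping step, not a vetted XSS filter; for real user input a dedicated sanitizer such as the ones mentioned above is the safer route.

```java
// Hypothetical helper, not part of any library.
final class HtmlEscaper {
    // Replaces the characters that are special in HTML text and attributes.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Turns <b>, </b>, <i>, </i> into [b], [/b], [i], [/i] first,
    // then escapes whatever markup is left.
    static String bbcodeThenEscape(String input) {
        String withBbcode = input.replaceAll("(?i)<(/?)(b|i)>", "[$1$2]");
        return escape(withBbcode);
    }

    public static void main(String[] args) {
        System.out.println(bbcodeThenEscape("<b>bold</b><script>alert(1)</script>"));
    }
}
```

Escaping & first matters in a chained-replace version; the character-by-character loop here avoids that ordering problem altogether.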

Pretty print ("indentation-only") HTML documents in Java (no JTidy)

We're generating HTML files with Apache's Velocity template engine. The generated HTML is kind of ugly and not correctly indented.
In my case I have the HTML stored in a String, which I want to manipulate so that it looks pretty-printed.
I've already given JTidy a try, but it changes the HTML source code when I pipe the raw HTML through it. Sometimes it adds or removes HTML tags.
My question:
Is there a Java library or something else out there which (only!) pretty-prints my HTML code without adding or removing tags from my HTML document? It should only do the indentation, so that it looks pretty-printed! Nothing more, nothing less. Any ideas? :-)
Also code suggestions, hints or tips are welcome.
Best regards
Maybe a little too late, but I found a solution to this with Jsoup.
You can get the "pretty" version of the HTML by using only the parser and, if needed, avoid the generation of extra HTML elements by using a "custom parser".
I got the answer from this Jsoup question.
Here it is:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
// The XML parser keeps the input as-is instead of adding html/head/body wrappers.
public static String formatHTML(String html) throws Exception {
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    return doc.toString();
}
I hope this helps.
Regards
Find any SAX parser example in Java: indent++ for opening tags, indent-- for closing tags, and write the content with the current indentation.
Why don't you write a simple Java parser to pretty-print the HTML yourself? Here is a sketch:
Track open and close tags, and
have a counter to figure out the current indentation level.
Perhaps use a stack to push and pop the indentation level.
Just iterate through the HTML string and push the current indentation level on the stack when you see a tag.
If you see a nested tag, then increment the indentation level and keep going.
When you see a closing tag, pop the stack to go back to the previous indent level.
I wanted to give you a rough idea here; you can use this as a starting point. I have written many Perl-based pretty printers. You could use Perl to script a parser fairly quickly.
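To make the counter idea concrete, here is a deliberately naive sketch in plain Java. NaiveHtmlIndenter and indent are made-up names, and it uses a depth counter rather than a stack. It never adds or removes tags, but it assumes every opening tag has a matching closing tag, so void elements such as <br> will push the indentation off, and it trims whitespace around text, which is wrong inside <pre>. Treat it as a starting point only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NaiveHtmlIndenter {
    // A token is either one tag (<...>) or a run of text up to the next '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static String indent(String html) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // current indentation level
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group().trim();
            if (token.isEmpty()) {
                continue; // only whitespace between two tags
            }
            boolean closing = token.startsWith("</");
            // Comments, doctype, processing instructions and self-closed
            // tags do not open a new level.
            boolean opening = token.startsWith("<") && !closing
                    && !token.startsWith("<!") && !token.startsWith("<?")
                    && !token.endsWith("/>");
            if (closing && depth > 0) {
                depth--; // closing tag lines up with its opening tag
            }
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(token).append('\n');
            if (opening) {
                depth++; // children go one level deeper
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(indent("<div><p>Hi</p></div>"));
    }
}
```

Every token ends up on its own line, which is more aggressive than most pretty printers; keeping short inline content on one line would need extra rules.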

HTML type string parsing question!

look at the Google map
Is there any parser to get the link (www.google.com/map) from the <a> tag?
Or is the best way just to write a custom one?
jQuery, for instance:
var href = $('a.more-link').attr('href');
There are many third-party solutions, but I am not sure which exist for Java; maybe HTML Agility Pack exists in a version for Java.
But another solution would be to use a regex:
/<a\s+[^<]*?href\s*=\s*(?:(['"])(.+?)\1.*?|(.+?))>/
Fixed the regex to handle problems suggested in comments.
I looked up some real HTML parsers for Java, in case you find you need more than the regex approach:
http://htmlparser.sourceforge.net/
http://jericho.htmlparser.net/docs/index.html
http://jsoup.org/
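For the regex route in Java, the pattern above can be dropped into java.util.regex more or less unchanged (backslashes and the double quote need escaping in the string literal). A small sketch; HrefExtractor and firstHref are hypothetical names. Group 2 holds a quoted href value and group 3 an unquoted one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class HrefExtractor {
    // Same pattern as in the answer, written as a Java string literal.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\s+[^<]*?href\\s*=\\s*(?:(['\"])(.+?)\\1.*?|(.+?))>");

    // Returns the href of the first <a> tag, or null if there is none.
    static String firstHref(String html) {
        Matcher m = ANCHOR_HREF.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Group 2: value inside quotes; group 3: unquoted value.
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.google.com/map\">look at the Google map</a>";
        System.out.println(firstHref(html)); // prints www.google.com/map
    }
}
```

Like any regex over HTML this is fragile (attributes spanning lines, uppercase tags and so on), so one of the parsers listed above is the better choice for anything beyond simple input.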

url rewriting with antlr

My Java program needs to rewrite URLs in HTML (just in time). I am looking for the right tool and wonder whether ANTLR would do the job for me.
For example:
<html><body> <img src="foo.jpg" /> </body></html>
should be rewritten as:
<html><body> <img src="http://foo.com/foo.jpg" /> </body></html>
I want to read/write from/to a stream (byte by byte).
As khmarbaise said, first make sure whether regular expressions can do it. But there are cases in which they can't [*], and then I think ANTLR might really be a legitimate choice.
[*] For the mathematical background on this, see http://en.wikipedia.org/wiki/Formal_grammar#The_Chomsky_hierarchy
Update
Now that you updated your question, I see what you really want to do: For modifying a complete HTML file, I'd use a parser like NekoHTML, or something similar: http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
Then you can use these to extract the URL. Then
parse only the URL itself, e.g. with regexes, Java's URL class (or sometimes better: URI), or maybe ANTLR
modify the parsed URL
and write out the HTML again, using NekoHTML/...
Do not use regular expressions to parse the entire HTML file! You could use ANTLR for that in theory, but it would be very hard to make that work reliably.
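For the "modify the parsed URL" step, java.net.URI can already resolve a relative src against a base. A minimal sketch, assuming the base is http://foo.com/ as in the question; UrlRewriter and toAbsolute are made-up names, and extracting the attribute value from the HTML is left to the parser.

```java
import java.net.URI;

// Hypothetical helper, not part of any library.
final class UrlRewriter {
    // Resolves a possibly relative URL against the base URL.
    // Assumes baseUrl ends with a slash; an already absolute src is returned unchanged.
    static String toAbsolute(String baseUrl, String src) {
        return URI.create(baseUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("http://foo.com/", "foo.jpg"));
        // prints http://foo.com/foo.jpg
    }
}
```

URI.create throws IllegalArgumentException on malformed input, so values taken from real-world HTML probably need a try/catch around this call.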
What about regular expressions?

Java (Android) regular expression to strip out HTML paragraph

I have an Android application which grabs some data from an external XML source. I've stripped out some HTML from one of the XML elements, but it's in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the middle paragraph text, how can I do this? Would a regular expression be the best way? I don't really want to start including external HTML parsing libraries.
RegEx match open tags except XHTML self-contained tags
So, I'll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I'd try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse it using SAX or something, paying attention to the 2nd paragraph element.
If it's simple, just do a regex.
If you are getting XML from an external source that you own, I would parse it there.
just doing a split: http://developer.android.com/reference/java/lang/String.html#split(java.lang.String)
on "</p><p>" and taking the second entry in the returned array would actually do it pretty quickly
The regex would probably look something like: .*?>(.*?)<.*
And you access the grouped content by calling group(1) on the Matcher object.
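Spelled out in Java, without any external library, it could look like the sketch below. Note that it uses a different, paragraph-specific pattern and calls find() repeatedly rather than matching the whole input with the regex above. ParagraphExtractor and paragraphText are hypothetical names, and it assumes the <p> elements are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ParagraphExtractor {
    // Matches one <p ...>...</p> element; DOTALL lets '.' span line breaks.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text of the paragraph at the given zero-based index,
    // or null if there are not that many paragraphs.
    static String paragraphText(String html, int index) {
        Matcher m = PARAGRAPH.matcher(html);
        int seen = 0;
        while (m.find()) {
            if (seen++ == index) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Some text...</p>\n"
                + "<p>Some more text</p>\n"
                + "<p>Some final text</p>";
        System.out.println(paragraphText(html, 1)); // prints Some more text
    }
}
```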
If you are going to parse an XML file downloaded from a website, then this has nothing to do with Android.
