Correct unescaped text in HTML - java

My question is not a duplicate of the question presented above.
I have this text (from database) :
<p>I love Java & .NET ! <strong>5 > 3</strong></p>
As you can see, the & and > are NOT escaped.
In Java, is there a way to turn this string into :
<p>I love Java &amp; .NET ! <strong>5 &gt; 3</strong></p>
As you can see, I want to keep all the HTML tags exactly as they are, but escape the text that is invalid for XML (the result must be valid input for the Docx4J XHTMLImporter).
Thank you !

You can use escape characters to solve your problem.
For the & sign you can use: &amp;
And for > you can use: &gt;
A full list of escape characters can be found here.

I used Jsoup and its parse function to clean my string :
String unscappedHtml = " ";
if (StringUtils.isNotBlank(unscappedText)) {
// Parse leniently, switch the output to XML syntax, and unwrap <a> elements (their content is kept).
Document doc = Jsoup.parse(unscappedText);
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
doc.select("a").unwrap();
unscappedHtml = doc.body().html();
}
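If adding jsoup is not an option, the same idea (leave tags alone, escape only the text between them) can be roughly sketched with java.util.regex. This is a minimal sketch, not a parser: the class and method names are made up for illustration, and it assumes that anything shaped like `<name ...>` or `</name>` is a tag, so text such as `a<b and c>d` would be misread as markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextEscaper {
    // Assumption: anything shaped like <name ...> or </name> is a tag and is copied verbatim.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^<>]*>");
    // An ampersand that does not already start a named or numeric entity.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Escapes &, < and > in a piece of text that contains no tags.
    static String escapeText(String text) {
        String escaped = BARE_AMP.matcher(text).replaceAll("&amp;");
        return escaped.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Walks the input tag by tag and escapes only the text in between.
    static String escapeOutsideTags(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(html.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        out.append(escapeText(html.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeOutsideTags("<p>I love Java & .NET ! <strong>5 > 3</strong></p>"));
    }
}
```

Entities that are already present (such as `&amp;`) are left untouched, so running the method twice does not double-escape.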

Related

Java: match all strings that do not end in .htm?

I'm parsing an HTML file in Java using regular expressions, and I want to know how to match all href="" attributes whose value does not end in .htm or .html and, if one matches, capture the content between the quotes into a group.
These are the ones I've tried so far:
href\s*[=]\s*"(.+?)(?![.]htm[l]?)"
href\s*[=]\s*"(.*?)(?![.]htm[l]?)"
href\s*[=]\s*"(?![.]htm[l]?)"
I understand that with the first two, the entire string between quotes is being captured into the first group, including the .htm(l) if it is present.
Does anyone know how I can prevent this from happening?
You can just rearrange the expression, and move the negative look-ahead to before the capturing:
href\s*[=]\s*"(?!.+?[.]htm[l]?")(.+?)"
Here is a demo.
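A minimal Java sketch of the look-ahead-first idea follows. It is a variant rather than the exact expression above: the look-ahead uses `[^"]*` instead of `.+?`, because `.` also matches a quote and the look-ahead could otherwise run past the closing quote when several links share one line. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonHtmlLinks {
    // Negative look-ahead first: reject values ending in .htm or .html, then capture the value.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(?![^\"]*\\.html?\")([^\"]+)\"");

    // Returns every double-quoted href value that does not end in .htm or .html.
    static List<String> find(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page.html\">x</a> <a href=\"img.png\">y</a> <a href=\"doc.htm\">z</a>";
        System.out.println(find(html));
    }
}
```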
As a side answer, jsoup is a very good API when dealing with html.
Using jsoup:
Document doc = Jsoup.parse(html);
for(Element link : doc.select("a")) {
String linkHref = link.attr("href");
if(linkHref.endsWith(".htm") || linkHref.endsWith(".html")) {
// do something
}
}
Try this: .*\.(?!(htm|html)$)
That is any number of any characters (.*), followed by a dot (\.) that is not followed by htm or html at the end of the string ((?! ... )).

Jsoup having problems with special HTML symbols, &lsquo; &mdash; etc

I have some HTML (String) that I am putting through Jsoup just so I can add something to all href and src attributes; that works fine. However, I'm noticing that for some special HTML characters, Jsoup is converting them from, say, &ldquo; to the actual character “. I output the value before and after and I see that change.
Before:
THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#8482;
After:
THIS — IS A “TEST”. 5 &gt; 4. trademark: ?
What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.
FYI, my Jsoup code is doing:
Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();
Thanks for any help!
The code below produces output similar to the input markup. It changes the escape mode so that these characters are written as entities, and sets the charset to ASCII so that the TM sign is escaped for systems that don't support Unicode.
The output:
<p>THIS &mdash; IS A &ldquo;TEST&rdquor;&period; 5 &gt; 4&period; trademark&colon; &trade;</p>
The code:
Document doc = Jsoup.parse("" +
"<p>THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#8482;</p>");
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.extended);
settings.charset("ASCII");
String modifiedFileHtmlStr = doc.html();
System.out.println(modifiedFileHtmlStr);
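To see why a character such as ™ can turn into ? and what escaping to ASCII buys you, here is a small stdlib-only sketch. It does not use jsoup, and the class and method names are made up for illustration. Whether an encoding step is really the cause of the ? in the question depends on how the output was written, which the question does not show.

```java
import java.nio.charset.StandardCharsets;

class AsciiEntities {
    // Writes every non-ASCII code point as a decimal character reference (for example &#8482;).
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    // Encodes to US-ASCII and back; characters ASCII cannot represent are replaced with '?'.
    static String roundTripThroughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "trademark: \u2122";
        System.out.println(roundTripThroughAscii(text)); // the trademark sign is lost
        System.out.println(escapeNonAscii(text));        // the trademark sign survives as an entity
    }
}
```

Because the escaped form contains only ASCII characters, it survives any later conversion to a narrower charset.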

Jsoup Whitelist: Parsing non-english character

I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.
For example the HTML text is:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
It is printing:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
It is printing:
&Aacute; example link.
My question is, how can I get the text
Á example link.
using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).
Any information will be very helpful to me.
Thanks.
This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity-escaping feature, and there is currently no "don't escape" mode (check Entities.EscapeMode).
You can either (1) unescape these HTML entities afterwards, or (2) extend jsoup's source code by adding a new escape mode with an empty map.
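Option 1 could be sketched roughly as follows, applied to the string that Jsoup.clean returns. This is an assumption-laden illustration, not jsoup's API: the class name is hypothetical and the named-entity table is deliberately tiny, so a real implementation would need the full HTML entity list (or a library that ships one).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityUnescaper {
    // Hypothetical, deliberately small table; unknown names are left as they are.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "nbsp", "\u00A0", "Aacute", "\u00C1", "eacute", "\u00E9");

    // Decimal (&#233;), hexadecimal (&#xE9;) or named (&eacute;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#")) {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int cp = hex ? Integer.parseInt(body.substring(2), 16)
                             : Integer.parseInt(body.substring(1));
                // Invalid code points are left untouched instead of throwing.
                replacement = Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp)) : m.group();
            } else {
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&Aacute; example link."));
    }
}
```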

Android get text from html

I get a special html code:
&lt ;p &gt ;This is &lt ;a href=&quot ;http://www.test.hu&quot ;&gt ;a test link&lt ;/a&gt ; and this is &amp ;nbsp;a sample text with special char: &amp ;#233;va &lt ;/p&gt ;
(There isn't space before ; char, but if I don't insert space the stackoverflow format it)
It's not normal HTML code, but if I paste it into an empty HTML page, the browser shows it with normal tags:
<i><_p_>This is <_a_ href="http://www.test.hu">a test link<_/a_> and this is a sample text with special char: éva <_/p_>
</i>
This code will be shown in a browser:
This is a test link And this is a sample text with special char: éva
So I want to get this text, but I can't use Html.fromHtml because the component I use doesn't support Spanned. I wanted to try StringEscapeUtils, but I couldn't import it.
How can I replace special chars and remove tags?
I guess I am too late to answer Robertoq's question, but I am sure many others are still struggling with this issue; I was one of them.
Anyway, the easiest way I found is this:
In strings.xml, add your html code inside CDATA, and then in the activity retrieve the string and load it in WebView, here is the example:
in strings.xml:
<string name="st1"><![CDATA[<p>This is a test link and this is a sample text with special char: éva </p>]]>
</string>
you may wish to replace é with &eacute ; (note: there is no space between &eacute and the ; )
Now, in your activity, create WebView and load string st1 to it:
WebView mWebview = (WebView)findViewById(R.id.*WebViewControlID*);
mWebview.loadDataWithBaseURL(null, getString(R.string.st1), "text/html", "utf-8", null);
And hooray, it should work correctly. If you find this post useful, I would be grateful if you could mark it as answered, so that we help others struggling with this issue.
Write a parser, no different than you would in any other situation where you have to parse data.
Now, if you can get it as ordinary unescaped HTML, there are a variety of open source Java HTML parsers out there that you can use. If you are going to work with the escaped HTML as you have in your first example, you will have to write the parser yourself.
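For the specific input in the question, where the HTML has been escaped twice, a hand-written "parser" can be quite small. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, only the entities that appear in the question (plus decimal character references) are handled, and &nbsp; is turned into an ordinary space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedHtmlToText {
    private static final Pattern NUMERIC = Pattern.compile("&#([0-9]{1,7});");

    // Undoes exactly one level of escaping. &amp; is replaced last so that
    // "&amp;nbsp;" becomes "&nbsp;" in this pass and a space only in the next one.
    static String unescapeOnce(String s) {
        Matcher m = NUMERIC.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            String ch = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp)) : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(ch));
        }
        m.appendTail(sb);
        return sb.toString()
                .replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&nbsp;", " ")
                .replace("&amp;", "&");
    }

    // Unescape to real markup, drop the tags, then unescape what is left in the text.
    static String toPlainText(String doubleEscaped) {
        String html = unescapeOnce(doubleEscaped);
        String noTags = html.replaceAll("<[^>]*>", "");
        return unescapeOnce(noTags).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText(
                "&lt;p&gt;This is &lt;a href=&quot;http://www.test.hu&quot;&gt;a test link&lt;/a&gt;"
                + " and this is &amp;nbsp;a sample text with special char: &amp;#233;va &lt;/p&gt;"));
    }
}
```

The result is a plain String, so it can be handed to a component that does not accept Spanned.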

How to use regular expressions to parse HTML in Java?

Please can someone tell me a simple way to find href and src tags in an html file using regular expressions in Java?
And then, how do I get the URL associated with the tag?
Thanks for any suggestion.
Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.
Use an HTML Parser instead. See also What are the pros and cons of the leading Java HTML parsers?
The other answers are true. Java Regex API is not a proper tool to achieve your goal. Use efficient, secure and well tested high-level tools mentioned in the other answers.
If your question concerns rather Regex API than a real-life problem (learning purposes for example) - you can do it with the following code:
String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while(m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
And the output is:
<a href='link1'>
link1
<a href='link2'>
link2
Please note that the lazy/reluctant quantifier *? must be used in order to limit the group to a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
Don't use regular expressions; use NekoHTML or TagSoup, which act as a bridge providing a SAX or DOM approach (as in XML) to visiting an HTML document.
If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String data for anchor tags and print their href.
Since you're just using anchor tags, you should be OK with just a regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory :
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin."+ EnviromentController.getOperatingSystemName());
MozillaParser.init(parserLibrary,mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");
for (int i = 0; i < list.getLength(); i++) {
Node n = list.item(i);
NamedNodeMap m = n.getAttributes();
if (m != null) {
Node attrNode = m.getNamedItem("href");
if (attrNode != null)
System.out.println(attrNode.getNodeValue());
}
}
I searched the Regular Expression Library (http://regexlib.com/Search.aspx?k=href and http://regexlib.com/Search.aspx?k=src)
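The best I found was the following (note that it uses a conditional group, (?(html)...), which java.util.regex does not support, so it needs adapting for Java):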
((?<html>(href|src)\s*=\s*")|(?<css>url\())(?<url>.*?)(?(html)"|\))
Check out these links for more expressions:
http://regexlib.com/REDetails.aspx?regexp_id=2261
http://regexlib.com/REDetails.aspx?regexp_id=758
http://regexlib.com/REDetails.aspx?regexp_id=774
http://regexlib.com/REDetails.aspx?regexp_id=1437
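Since java.util.regex has no conditional groups, a Java version has to be written differently. The sketch below is one possible adaptation, not a port of the library entry: it covers only quoted href and src attributes (the CSS url() branch is dropped), uses a backreference to require the same closing quote, and its class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlAttributes {
    // Group 1: attribute name, group 2: opening quote, group 3: the URL, \2: the same quote again.
    private static final Pattern ATTR = Pattern.compile(
            "\\b(href|src)\\s*=\\s*([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    // Returns the values of all quoted href and src attributes, in document order.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            urls.add(m.group(3));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("<a href=\"a.html\">x</a><img src='b.png'>"));
    }
}
```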
Regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.
HTML parsers, on the other hand, can parse HTML, that's why they are called HTML parsers.
You should use your favorite HTML parser instead.
Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).
If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.
Try something like this:
/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
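The expression above is written in the /pattern/i notation used by other languages. In Java, the slashes are dropped and the i flag becomes Pattern.CASE_INSENSITIVE. A minimal sketch, with a hypothetical class name and sample input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefs {
    // Same expression as above; only the quoting and the flag are adapted for Java.
    private static final Pattern A_HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the href value of every <a ...> tag the expression recognises.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<A class=\"x\" HREF=\"http://example.com/a\">link</A> <a href=/b>b</a>"));
    }
}
```

Note that the character class `[^'"> ]` stops at the first space, so a URL containing an unencoded space would be cut short.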
