Jsoup Whitelist: Parsing non-english character - java

I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.
For example the HTML text is:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
It is printing:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
It is printing:
&Aacute; example link.
My question is, how can I get the text
Á example link.
using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).
Any information will be very helpful to me.
Thanks.

This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity escaping feature, and there is no "don't escape" mode at the moment (check Entities.EscapeMode).
You can either 1. unescape these HTML entities afterwards, or 2. extend jsoup's source code by adding a new escape mode with an empty map.
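Option 1 could be sketched with the standard library alone. This is only a minimal illustration: the class name EntityUnescaper and its tiny entity table are made up for this example, and a real table (for instance the one behind StringEscapeUtils.unescapeHtml in Apache Commons Lang) covers far more names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityUnescaper {
    // Hypothetical, deliberately tiny entity table; extend it as needed.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("Aacute", "\u00C1");
        NAMED.put("eacute", "\u00E9");
    }

    // Matches &name; as well as decimal (&#233;) and hex (&#xE9;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    public static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character(s), or the original text if the entity is unknown.
    private static String decode(String body, String original) {
        if (body.charAt(0) != '#') {
            String named = NAMED.get(body);
            return named != null ? named : original;
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }
}
```

The idea would be to call EntityUnescaper.unescape(Jsoup.clean(html, Whitelist.none())). Note that this also turns &lt; back into <, so the result should be treated as plain text and not written back into an HTML page.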

Related

Avoid removal of spaces and newline while parsing html using jsoup

I have a sample code as below.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "This is a sample on parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();
I get the output as
This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup
But I want the output as
This is a sample on parsing html body using jsoup
This is a sample on parsing html body using jsoup
How do I parse it so that I get this output? Or is there another way to do so in Java?
You can disable the pretty printing of your document to get the output like you want it. But you also have to change the .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
The HTML specification requires that multiple whitespace characters are collapsed into a single whitespace. Therefore, when parsing the sample, the parser correctly eliminates the superfluous whitespace characters.
I don't think you can change how the parser works. You could add a preprocessing step where you replace runs of whitespace with non-breaking spaces (&nbsp;), which will not collapse. The side effect, of course, is that those spaces would be non-breaking (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).
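The preprocessing step could look roughly like the following. It is a naive sketch: the class and method names are invented for this example, it only handles runs of plain spaces (newlines would need their own replacement), and it would also touch runs of spaces inside tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpaceProtector {
    // Two or more consecutive plain spaces.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Replaces every space in such a run with &nbsp; so it cannot collapse.
    public static String protectSpaceRuns(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder nbsp = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                nbsp.append("&nbsp;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(nbsp.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The protected string would then be passed to Jsoup.parse(...) in place of the original sample.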

Escape special characters of html string in java

I have a html content as a string.
String attachment = "<div style=\"color:black;font-style:normal;font-size:10pt;font-family:verdana;\"><div><span style=\"background-color: rgb(255,255,255);\">This is special "'; </span></div></div>";
If I try to add this as multipart form data, I get an exception. The cause turns out to be the special characters inside the HTML, namely " and '. So I tried escaping the entire string using
org.apache.commons.lang.StringEscapeUtils.escapeJava(attachment);
After doing this the exception disappeared and it worked fine. But the double quotes used for attributes such as style are also escaped by this method, which is not desired.
Instead of <div style="color:black;
it was sent as <div style=\"color:black;
So far I have realized that I need to escape only the text inside the HTML content and not the entire string. I could extract the text content using jsoup or something else and then form the HTML again.
But is there a generic easy solution to do this?
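One way to picture "escape only the text, not the tags" is a small scanner that copies everything between < and > verbatim and escapes quotes elsewhere. This is a rough sketch under strong assumptions: the class name is hypothetical, the input is assumed to be well-formed, and a > inside an attribute value, a comment, or a script would confuse it. Whether HTML entities are the right escape for the multipart problem is also an assumption.

```java
public class TextOnlyEscaper {
    // Escapes " and ' in text nodes only; tags and their attributes are copied as-is.
    public static String escapeTextNodes(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                out.append(c);
                if (c == '>') {
                    inTag = false; // end of the tag, back to text
                }
            } else if (c == '<') {
                inTag = true;      // start of a tag
                out.append(c);
            } else if (c == '"') {
                out.append("&quot;");
            } else if (c == '\'') {
                out.append("&#39;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For anything beyond simple fragments, a real parser such as jsoup is likely the safer route, since it already escapes text nodes when it serializes a document.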

Jsoup removes quotes and apostrophes

I have something like
Whitelist whitelist = new Whitelist();
whitelist.addTags("p", "i", "b", "em", "strong", "u");
String content = Jsoup.clean(data.html(), whitelist);
in my code. But the Jsoup library removes " and '. How do I prevent that?
e.g. = <p>It's a sunny day.</p>
result = It? s a sunny day.
You are using data.html(). Here is what the API documentation of the Element class says about it:
Retrieves the element's inner HTML. E.g. on a <div> with one empty <p>, would return <p></p>. (Whereas Node.outerHtml() would return <div><p></p></div>.)
So you should use the outerHtml() method instead:
String content = Jsoup.clean(data.outerHtml(), whitelist);
Here is another link with useful examples. The example contains both methods, so you can see the difference: Jsoup Attribute text and HTML example
As for the other issue (the quote being turned into a question mark), I think it's a matter of encoding and character set, as it is not happening on my PC. Check the encoding of the source HTML file and try to parse it in Jsoup with the matching character set from the start.
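The encoding explanation can be reproduced without jsoup. The sketch below assumes (this is a guess about the question's data) that the apostrophe is the typographic one, U+2019, which does not exist in ISO-8859-1; the class name is made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "It\u2019s a sunny day."; // typographic apostrophe (assumed)
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // It?s a sunny day.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

If the page really contains such characters, reading and writing it as UTF-8 throughout should keep them intact.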

Extract text between html tags parsed from xml

Can anyone help me in extracting text from within the html tags to plain text?
I have parsed an XML document and get some output as the body, which contains HTML tags. Now I want to remove the tags and use the text.
thanks in advance!!!!
You can use an HTML parser like jsoup.
For example
HTML is
<div style="height:240px;"><br>test: example<br>test1:example1</div>
You can get the html using
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
div.html();
Try an HTML parser.
If the HTML is escaped, i.e. &lt; instead of <, you might have to decode it first.
Considering your requirements you might try Jericho HTML Parser
Take a look at TextExtractor class:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
If all you want to do is remove HTML tags from a string, you can do this:
String output = input.replaceAll("(?s)\\<.*?\\>", " ");
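As a usage sketch, the same regular expression can be wrapped in a helper that also collapses the leftover whitespace. The class and method names are made up for this example, and a regex like this is only suitable for simple fragments, not for arbitrary HTML (comments, scripts, or a > inside an attribute will trip it up).

```java
public class TagStripper {
    // Replaces every <...> with a space, then collapses runs of whitespace.
    public static String stripTags(String input) {
        String noTags = input.replaceAll("(?s)\\<.*?\\>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
        System.out.println(stripTags(html)); // test: example test1:example1
    }
}
```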

Android get text from html

I get a special html code:
&lt ;p &gt ;This is &lt ;a href=&quot ;http://www.test.hu&quot ;&gt ;a test link&lt ;/a&gt ; and this is &amp ;nbsp;a sample text with special char: &amp ;#233;va &lt ;/p&gt ;
(There is no space before the ; character in the real text, but if I don't insert one, Stack Overflow renders the entities.)
It's not normal HTML code, but if I paste it into an empty HTML page, the browser shows it with normal tags:
<p>This is <a href="http://www.test.hu">a test link</a> and this is a sample text with special char: éva </p>
This code will be shown in a browser:
This is a test link And this is a sample text with special char: éva
So I want to get this text, but I can't use Html.fromHtml, because the component that I use doesn't support Spanned. I wanted to try StringEscapeUtils, but I couldn't import it.
How can I replace special chars and remove tags?
I guess I am too late to answer Robertoq's question, but I am sure many others are still struggling with this issue; I was one of them.
Anyway, the easiest way I found is this:
In strings.xml, add your html code inside CDATA, and then in the activity retrieve the string and load it in WebView, here is the example:
in strings.xml:
<string name="st1"><![CDATA[<p>This is a test link and this is a sample text with special char: éva </p>]]>
</string>
you may wish to replace é with &eacute ; (note: there is no space between &eacute and the ; )
Now, in your activity, create WebView and load string st1 to it:
WebView mWebview = (WebView) findViewById(R.id.WebViewControlID); // WebViewControlID is a placeholder for your WebView's id
mWebview.loadDataWithBaseURL(null, getString(R.string.st1), "text/html", "utf-8", null);
And hurrah, it should work correctly. If you find this post useful, I would be grateful if you could mark it as answered, so we can help others struggling with this issue.
Write a parser, no different than you would in any other situation where you have to parse data.
Now, if you can get it as ordinary unescaped HTML, there are a variety of open source Java HTML parsers out there that you can use. If you are going to work with the escaped HTML as you have in your first example, you will have to write the parser yourself.
