Android get text from html - java

I receive HTML that has been escaped in an unusual way:
&lt;p&gt;This is &lt;a href=&quot;http://www.test.hu&quot;&gt;a test link&lt;/a&gt; and this is &amp;nbsp;a sample text with special char: &amp;#233;va &lt;/p&gt;
It's not ordinary HTML, but if I paste it into an empty HTML page, the browser shows it with normal tags:
<p>This is <a href="http://www.test.hu">a test link</a> and this is a sample text with special char: éva </p>
A browser then renders that markup as:
This is a test link and this is a sample text with special char: éva
So I want to get this text, but I can't use Html.fromHtml, because the component I use doesn't support Spanned. I wanted to try StringEscapeUtils, but I couldn't import it.
How can I replace the special characters and remove the tags?

I guess I am too late to answer Robertoq's question, but I am sure many others are still struggling with this issue; I was one of them.
Anyway, the easiest way I found is this:
In strings.xml, put your HTML inside a CDATA section, then retrieve the string in the activity and load it into a WebView. Here is an example.
In strings.xml:
<string name="st1"><![CDATA[<p>This is a test link and this is a sample text with special char: éva </p>]]></string>
You may wish to replace é with &eacute;.
Now, in your activity, create a WebView and load the string st1 into it:
WebView mWebview = (WebView)findViewById(R.id.*WebViewControlID*);
mWebview.loadDataWithBaseURL(null, getString(R.string.st1), "text/html", "utf-8", null);
And hooray, it should work correctly. If you find this post useful, I would be grateful if you could mark it as answered, so that we can help others struggling with this issue.

Write a parser, no different than you would in any other situation where you have to parse data.
Now, if you can get it as ordinary unescaped HTML, there are a variety of open source Java HTML parsers out there that you can use. If you are going to work with the escaped HTML as you have in your first example, you will have to write the parser yourself.
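For the doubly escaped input in the question, a hand-written converter does not have to be large. Below is a minimal sketch in plain Java (no Android classes, so no Spanned is involved). The class name EscapedHtmlText and its methods are made up for this illustration, only a handful of named entities are handled, and the tag stripping is a crude regex rather than a real HTML parser, so treat it as a starting point under those assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EscapedHtmlText {
    // Matches decimal (&#233;), hex (&#xE9;) and named (&amp;) entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z]+);");

    // Only a few named entities are known here (assumption: extend as needed).
    private static String named(String name) {
        switch (name) {
            case "lt": return "<";
            case "gt": return ">";
            case "amp": return "&";
            case "quot": return "\"";
            case "apos": return "'";
            case "nbsp": return "\u00A0";
            default: return null;
        }
    }

    // Decodes one level of entity escaping; unknown entities are left untouched.
    static String decodeEntities(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            String body = m.group(1);
            String replacement = null;
            if (body.charAt(0) == '#') {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int cp = hex ? Integer.parseInt(body.substring(2), 16)
                             : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(cp)) {
                    replacement = new String(Character.toChars(cp));
                }
            } else {
                replacement = named(body);
            }
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    // Escaped HTML -> real HTML -> tags removed -> remaining entities decoded.
    static String toPlainText(String escapedHtml) {
        String html = decodeEntities(escapedHtml);
        String withoutTags = html.replaceAll("<[^>]*>", "");
        String text = decodeEntities(withoutTags);
        // Turn non-breaking spaces into plain spaces and collapse whitespace runs.
        return text.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "&lt;p&gt;This is &lt;a href=&quot;http://www.test.hu&quot;&gt;a test link"
                + "&lt;/a&gt; and this is &amp;nbsp;a sample text with special char: &amp;#233;va &lt;/p&gt;";
        System.out.println(toPlainText(input));
    }
}
```

The entities are decoded twice on purpose: the first pass turns the escaped markup into real tags, and the second pass handles the entities that were themselves escaped (such as &amp;#233;) once the tags are gone.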

Related

Android app not rendering html tags very well

This is what I am getting from the web service as a string:
<p><strong>This is instructor&#39;s reply to the guest&#39;s MDB message</strong></p>
This is how I am setting it on the TextView:
String Reply = parent.getString(TAG_ReplyMessage);
TextView ReplyTextView= (TextView) findViewById(R.id.reply_txt);
ReplyTextView.setText(Html.fromHtml(Reply));
But the app shows the text with the <p> and <strong> tags visible.
It should render those tags, not display them as HTML.
Any help would be highly appreciated.
I had the same issue in my project and this worked for me; maybe it will help you:
String reply = parent.getString(TAG_ReplyMessage);
TextView ReplyTextView = findViewById(R.id.reply_txt);
String htmltext = Html.fromHtml(reply).toString();
ReplyTextView.setText(Html.fromHtml(htmltext));
Android does not support all HTML tags.
Here is a list of supported HTML tags, albeit a bit old:
http://www.grokkingandroid.com/android-quick-tip-formatting-text-with-html-fromhtml/
Or you could look through the sources and see which are supported:
http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/5.1.1_r1/android/text/Html.java#Html
According to those lists, <p> is not supported. So it can't parse anything within <p></p>.
In &amp;#39;, remove the amp; part. The HTML code for a single quote (') is just &#39;.
You can write those tags directly between <![CDATA[...]]> in strings.xml.
Example:
<string name="type"><![CDATA[Type:<B> %s</B>]]></string>
In Java:
textView.setText(Html.fromHtml(getString(R.string.type, "Sports")));
Output:
Type: Sports
Update:
If the string itself looks like this:
String data = "<p><strong>This is instructor&#39;s reply to the guest&#39;s MDB message</strong></p>";
Then, you can try as below:
textView.setText(Html.fromHtml(Html.fromHtml(data).toString()));

Escape special characters of html string in java

I have HTML content as a string.
String attachment = "<div style=\"color:black;font-style:normal;font-size:10pt;font-family:verdana;\"><div><span style=\"background-color: rgb(255,255,255);\">This is special "'; </span></div></div>";
If I try to add this as multipart form data I get an exception. The cause turns out to be the special characters inside the HTML, namely " and '. So I tried escaping the entire string using
org.apache.commons.lang.StringEscapeUtils.escapeJava(attachment);
After doing this the exception disappeared and it was working fine. But the double quotes used for the attributes, like style are also escaped using this method, which is not desired.
Instead of <div style="color:black;
it was sent as <div style=\"color:black;
So far I have realized that I need to escape only the text inside the HTML content and not the entire string. I could extract the text content using jsoup or something else and then build the HTML again.
But is there a generic easy solution to do this?

Jsoup Whitelist: Parsing non-english character

I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.
For example the HTML text is:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
It is printing:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
It is printing:
&Aacute; example link.
My question is, how can I get the text
Á example link.
using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).
Any information will be very helpful to me.
Thanks.
Not possible in the current version (1.6.1): jsoup prints Á as &Aacute; because of its entity escaping feature, and there is no "don't escape" mode at the moment (check Entities.EscapeMode).
You can either (1) unescape these HTML entities afterwards, or (2) extend jsoup's source code by adding a new escape mode with an empty map.

How to strip all html tags and extract content using java?

I have a requirement to strip all HTML tags from a string and extract only the content. I will have HTML content as input, for example:
<html><body><input type='text' value='Hello World' size='50' /> <div> This is a basic example </div><br/><span align='center'>Hello Sam!!!</span></body></html>
I need the output as below :
Hello World. This is a basic example.
Hello Sam!!!
I have tried to use HtmlCleaner and even JSoup. First of all, I could not find a complete sample application for either of them. I was able to extract
This is a basic example.
Hello Sam!!!
using HTMLCleaner but could not extract the textbox value as it’s an attribute. Please help.
Here's an example, using JSoup, that shows how to extract attribute values from elements.

How do I use ColdFusion to replace text in HTML without replacing HTML tags?

I have a html source as a String variable.
And a word as another variable that will be highlighted in that html source.
I need a regular expression which does not highlight tags, but only text within the tags.
For example I have a html source like
<cfset html = "<span>Text goes here, forr example it container also **span** </span>" />
<cfset wordToReplace = "span" />
<cfset html = ReReplace(html ,"[^(<#wordToReplace#\b[^>]*>)]","replaced","ALL")>
and what I want to get is
<span>Text goes here, forr example it container also **replaced** </span>
But I get an error. Any tips?
I need a regular expression which does
not highlight tags, but only text
within the tags.
You won't find one. Not one that is fully reliable against all legal/wild HTML.
The simple reason is that Regular Expressions match Regular languages, and HTML is not even remotely a Regular language.
Even if you're very careful, you run the risk of replacing stuff you didn't want to, and not replacing stuff you did want to, simply due to how complicated HTML syntax can be.
The correct way to parse HTML is using a purpose-built HTML DOM parser.
Annoyingly CF doesn't have one built in, though if your HTML is XHTML, then you can use XmlParse and XmlSearch to allow you to do an xpath search for only text (not tags) that match your text... something like //*[contains(text(), 'span')] should do (more details here).
If you've not got XHTML then you'll need to look at using a HTML DOM parser for Java - Google turns up plenty, (I've not tried any yet so can't give any specific recommendations).
What you have to do is use a lookahead to make sure that your text isn't contained within a tag. Granted, this could probably be written better, but it will get you the results you want. It will even handle the case where the tag has attributes.
<cfset html = "<span class='me'>Text goes here, forr example it container also **span** </span>" />
<cfset wordToReplace = "span" />
<cfset html = ReReplace(html ,"(?!/?<)(#wordToReplace#)(?![^.*>]*>)","replaced","ALL")>
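Since ColdFusion runs on the JVM, the same lookahead idea can also be sketched in Java. The snippet below is an illustration only: the class and method names are invented, and the pattern is a simplified variant of the one above (the word must not be followed by a ">" before any other "<" or ">" appears). As the other answer points out, a regex like this is not reliable against arbitrary HTML, for example when an attribute value contains ">".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ReplaceOutsideTags {
    static String replaceOutsideTags(String html, String word, String replacement) {
        // Pattern.quote treats the word literally; the negative lookahead
        // rejects matches that sit inside a tag such as <span ...> or </span>.
        Pattern p = Pattern.compile(Pattern.quote(word) + "(?![^<>]*>)");
        return p.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<span class='me'>Text goes here, forr example it container also **span** </span>";
        System.out.println(replaceOutsideTags(html, "span", "replaced"));
    }
}
```

With the sample string from this answer, only the **span** in the text is replaced; the opening and closing span tags are left alone.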
