I'm trying to parse an HTML file using Jsoup. The HTML contains a special character that I want to remove (€). This is how it appears originally:
<span class="price-value">
49,99 €
</span>
However, Netbeans shows this when printing that element:
49.99 ?
Therefore, I cannot do this:
price.replace( "€", "" ).replace( ",", "." ).trim();
Neither this:
price.replace( "\\?", "" ).replace( ",", "." ).trim();
What can I do about it?
Netbeans shows this when printing that element
Almost certainly this is because your NetBeans console hasn't been configured to support Unicode chars, which is why you've been misled. For a solution to that, see: How to change default encoding in NetBeans 8.0
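So the document is fine and the first replace() call would have worked; there's no need to change anything else. (The second attempt would not match in any case: String.replace takes literal text, not a regular expression, and the string really contains €, not ?.)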
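To illustrate why a console can show ? even though the string is intact, here is a minimal sketch using only the standard library. The class and method names (ConsoleEncodingDemo, roundTrip) are made up for this example, and it assumes the console behaves like a plain encode/decode round trip in its configured charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jsoup or NetBeans.
final class ConsoleEncodingDemo {

    // Encodes the text in the given charset and decodes it again,
    // roughly what happens when text is written to a console using that charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String price = "49,99 \u20AC"; // \u20AC is the euro sign

        // US-ASCII cannot represent the euro sign, so it is replaced by '?'.
        System.out.println(roundTrip(price, StandardCharsets.US_ASCII));

        // UTF-8 can represent it, so the round trip is lossless.
        System.out.println(roundTrip(price, StandardCharsets.UTF_8).equals(price));
    }
}
```

The original string is never modified here; only the encoded copy loses the character, which is why replace("€", "") still works on the parsed text.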
Here's a minimal working example of the original document getting parsed correctly, the Euro symbol replaced, and 49.99 returned.
Document doc = Jsoup.parse("<html><body><span class=\"price-value\">49,99 €</span></body></html>");
Element span = doc.select("span").get(0);
System.out.println( span.text().replace("€", "").replace(",", ".").trim() );
Modified from here:
To match individual characters, you can simply include them in a character class, either as literals or via the \u20AC syntax.
The unicode for the Euro is \u20AC.
Note: I'm not sure why it would be displayed as a ?, but that might be just because it's not ASCII, and might be missing in the font.
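As a rough sketch of the \u20AC approach, the snippet below strips the euro sign using its Unicode escape, so the result does not depend on how the source file itself is encoded. PriceCleaner and normalize are hypothetical names, and the input is assumed to use an ordinary space (trim() does not remove a non-breaking space).

```java
import java.math.BigDecimal;

// Hypothetical helper class for illustration only.
final class PriceCleaner {

    // Removes the euro sign, switches the decimal comma to a dot,
    // and trims surrounding whitespace (including newlines).
    static String normalize(String rawPrice) {
        return rawPrice
                .replace("\u20AC", "") // euro sign written as a Unicode escape
                .replace(',', '.')
                .trim();
    }

    public static void main(String[] args) {
        String raw = "\n49,99 \u20AC\n"; // text as it might come out of the span
        String cleaned = normalize(raw);
        System.out.println(cleaned);

        // The cleaned text can then be parsed as a number.
        BigDecimal value = new BigDecimal(cleaned);
        System.out.println(value.scale());
    }
}
```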
Use this ->
<span class="price-value">
49,99 &euro;
</span>
It is the representation of € sign in HTML
Related
I have some HTML (String) that I am putting through Jsoup just so I can add something to all href and src attributes, and that works fine. However, I'm noticing that for some special HTML characters, Jsoup is converting them from, say, &ldquo; to the actual character “. I output the value before and after and I see that change.
Before:
THIS — IS A “TEST”. 5 > 4. trademark:
After:
THIS — IS A “TEST”. 5 > 4. trademark: ?
What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.
FYI, my Jsoup code is doing:
Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();
Thanks for any help!
The code below will produce output similar to the input markup. It changes the escaping mode for specific characters and sets the output charset to ASCII so that the TM sign is escaped for systems which don't support Unicode.
The output:
<p>THIS — IS A “TEST”. 5 > 4. trademark: </p>
The code:
Document doc = Jsoup.parse("" +
"<p>THIS — IS A “TEST”. 5 > 4. trademark: </p>");
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.extended);
settings.charset("ASCII");
String modifiedFileHtmlStr = doc.html();
System.out.println(modifiedFileHtmlStr);
I am trying to clean HTML text and extract plain text from it using Jsoup. The HTML might contain non-English characters.
For example the HTML text is:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
It is printing:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
It is printing:
&Aacute; example link.
My question is, how can I get the text
Á example link.
using Whitelist and the clean() method? I want to use Whitelist since I might need to use Whitelist#addTags(String... tags).
Any information will be very helpful to me.
Thanks.
Not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because of its entity-escaping feature, and there is no "don't escape" mode at the moment (check Entities.EscapeMode).
You can either (1) unescape these HTML entities afterwards, or (2) extend jsoup's source code by adding a new escape mode with an empty map.
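Option 1 could look roughly like the sketch below, which post-processes the cleaned string with plain Java. EntityUnescaper is a hypothetical helper, not a jsoup class, and its table of named entities is deliberately tiny; a real one would need the full HTML entity list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes a few named entities plus numeric references.
final class EntityUnescaper {

    // Assumed, incomplete table of named entities for the example.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("Aacute", "\u00C1");
        NAMED.put("eacute", "\u00E9");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("quot", "\"");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
    }

    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);");

    // Turns digits into a character; returns the original text if the value is invalid.
    private static String decodeCodePoint(String digits, int radix, String fallback) {
        try {
            return new String(Character.toChars(Integer.parseInt(digits, radix)));
        } catch (IllegalArgumentException e) { // also covers NumberFormatException
            return fallback;
        }
    }

    // Single pass: each entity is decoded once, unknown names are left untouched.
    static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = decodeCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = decodeCodePoint(body.substring(1), 10, m.group());
            } else {
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&Aacute; example link.").length());
    }
}
```

Note that decoding &lt; and &gt; after cleaning turns them back into literal angle brackets, so the result should be treated as plain text, not as safe HTML.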
I have something like
Whitelist whitelist = new Whitelist();
whitelist.addTags("p", "i", "b", "em", "strong", "u");
String content = Jsoup.clean(data.html(), whitelist);
in my code. But the Jsoup library removes " and '. How do I prevent that?
e.g. = <p>It's a sunny day.</p>
result = It? s a sunny day.
You are using data.html(). Here is what the API documentation of the Element class says about it: Element API
Retrieves the element's inner HTML. E.g. on a <div> with one empty <p>, would return <p></p>. (Whereas Node.outerHtml() would return <div><p></p></div>.)
So you should be using the outerHtml() method instead:
String content = Jsoup.clean(data.outerHtml(), whitelist);
Here is another link with useful examples. The example contains both methods, so you can see the difference: Jsoup Attribute text and HTML example
As for the other issue (the quote being turned into a question mark), I think it's a matter of encoding and character set, as it is not happening on my PC. Check the encoding of the source HTML file and try to parse it in Jsoup with the matching character set from the start.
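The effect of a mismatched character set can be reproduced without Jsoup. The sketch below (CharsetMismatchDemo is a made-up name) decodes the same bytes once with the wrong charset and once with the right one; it assumes the source file is UTF-8, which may not be true for your file. With Jsoup itself, the overload Jsoup.parse(File, String charsetName) lets you state the charset explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class CharsetMismatchDemo {

    // Decodes raw bytes using the charset the caller believes the file is in.
    static String decode(byte[] fileBytes, Charset assumedCharset) {
        return new String(fileBytes, assumedCharset);
    }

    public static void main(String[] args) {
        String original = "It\u2019s a sunny day."; // \u2019 is a typographic apostrophe
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8); // stand-in for the file

        String wrong = decode(fileBytes, StandardCharsets.ISO_8859_1);
        String right = decode(fileBytes, StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original)); // false: the apostrophe is garbled
        System.out.println(right.equals(original)); // true
    }
}
```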
I get a special html code:
< ;p > ;This is < ;a href=" ;http://www.test.hu" ;> ;a test link< ;/a> ; and this is & ;nbsp;a sample text with special char: & ;#233;va < ;/p> ;
(There is no space before the ; character in the real input, but if I don't insert one, Stack Overflow formats it.)
It's not normal HTML code, but if I paste it into an empty HTML page, the browser shows it with normal tags:
<i><_p_>This is <_a_ href="http://www.test.hu">a test link<_/a_> and this is a sample text with special char: éva <_/p_>
</i>
This code will be shown in a browser:
This is a test link And this is a sample text with special char: éva
So I want to get this text, but I can't use Html.fromHtml, because the component that I use doesn't support Spanned. I wanted to try StringEscapeUtils, but I couldn't import it.
How can I replace special chars and remove tags?
I guess I am too late to answer Robertoq's question, but I am sure many other people are still struggling with this issue; I was one of them.
Anyway, the easiest way I found is this:
In strings.xml, add your html code inside CDATA, and then in the activity retrieve the string and load it in WebView, here is the example:
in strings.xml:
<string name="st1"><![CDATA[<p>This is a test link and this is a sample text with special char: éva </p>]]>
</string>
you may wish to replace é with é ; (note: there is no space between é and the ; )
Now, in your activity, create WebView and load string st1 to it:
WebView mWebview = (WebView)findViewById(R.id.*WebViewControlID*);
mWebview.loadDataWithBaseURL(null, getString(R.string.st1), "text/html", "utf-8", null);
And hooray, it should work correctly. If you find this post useful, I will be grateful if you can mark it as answered, so we can help others struggling with this issue.
Write a parser, no different than you would in any other situation where you have to parse data.
Now, if you can get it as ordinary unescaped HTML, there are a variety of open source Java HTML parsers out there that you can use. If you are going to work with the escaped HTML as you have in your first example, you will have to write the parser yourself.
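For the simple input in the question, the "parser" can be very small. The sketch below is one possible starting point, assuming the entities have already been decoded into ordinary markup; TagStripper is a hypothetical name. It drops everything between < and > and is not a general HTML parser (it ignores comments, scripts, and > inside attribute values).

```java
// Hypothetical helper for illustration only.
final class TagStripper {

    // Copies characters that are outside of <...> and skips the tags themselves.
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder(html.length());
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;          // a tag starts here
            } else if (c == '>' && insideTag) {
                insideTag = false;         // the tag ends here
            } else if (!insideTag) {
                text.append(c);            // ordinary text, including a stray '>'
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<p>This is <a href=\"http://www.test.hu\">a test link</a> and this is a sample text</p>";
        System.out.println(stripTags(html));
    }
}
```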
My java webapp fetches content from a textarea, and e-mails the same.
The problem I'm facing is that the newline character in the textarea message is not preserved when reading the same using
request.getParameter("message");
Any clues how it can be tackled?
TIA.
EDIT:
The content in the textarea is:
abcd
abcd
CODE:
String message = request.getParameter("message");
System.out.println("index loc for message "+message+" using \\r\\n : "+message.indexOf("\r\n"));
System.out.println("index loc for message "+message+" using \\n : "+message.indexOf("\n"));
System.out.println("index loc for message "+message+" using \\r : "+message.indexOf("\r"));
System.out.println("index loc for message "+message+" using \\n\\r : "+message.indexOf("\n\r"));
OUTPUT:
index loc for message asdfasdf using \r\n : -1
index loc for message asdfasdf using \n : -1
index loc for message asdfasdf using \r : -1
index loc for message asdfasdf using \n\r : -1
That completely depends on how you're redisplaying it.
It sounds like you're redisplaying it in HTML. "Raw" newlines are not part of HTML markup. Right-click the page and choose View Page Source in your web browser: you'll see line breaks all over the place, usually before and/or after HTML tags.
In order to visually present linebreaks in the HTML presentation, you should actually be using <br> tags. You can replace newlines by <br> strings as below:
message = message.replace("\n", "<br>");
Be aware that this opens an XSS hole if the message is a user-controlled variable, because you have to present it unescaped in JSP (i.e. without <c:out>) in order to get <br> to work. You thus need to make sure that the message variable is sanitized beforehand.
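One way to do that sanitizing, sketched with plain Java: escape the HTML special characters first, and only then turn newlines into <br>. NewlineToBr and its methods are hypothetical names, and the escaping covers only the five characters listed; it is an assumption that this is sufficient for text placed in an element body.

```java
// Hypothetical helper for illustration only.
final class NewlineToBr {

    // Escapes the characters that have special meaning in HTML.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes first, then converts every kind of line break to <br>,
    // so the only markup in the result is the <br> tags added here.
    static String toHtmlWithBreaks(String message) {
        return escapeHtml(message)
                .replace("\r\n", "\n")
                .replace("\r", "\n")
                .replace("\n", "<br>");
    }

    public static void main(String[] args) {
        System.out.println(toHtmlWithBreaks("abcd\r\nabcd"));
    }
}
```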
Alternatively, you can set the CSS white-space property to pre on the element where you're redisplaying the message. If you'd like to wrap lines inside the context of a block element, then set pre-wrap. Or if you'd like to collapse spaces and tabs as well, then set pre-line.
<div id="message"><c:out value="${message}" /></div>
#message {
white-space: pre-line;
}
This will display the text preformatted (as a textarea by default does).
Two possible problems:
The text in the textarea is word wrapped and doesn't really have any newlines.
The String you get with getParameter() contains newlines (\n) but no carriage returns (\r) as expected by many email programs.
As a first step, I'd try dumping the retrieved String in a way you can check for this. You could write to a file and use od or a hex editor to look at the file, for example.
If it turns out you're simply missing CRs, you could do some simple regexp-based replacement on the string to fix that.
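Both steps can be sketched in a few lines of plain Java: one method makes the line breaks visible for debugging, the other rewrites any line ending to CRLF. LineEndings and its method names are made up for this example.

```java
// Hypothetical helper for illustration only.
final class LineEndings {

    // Makes CR and LF visible as the two-character sequences \r and \n.
    static String showLineBreaks(String text) {
        return text.replace("\r", "\\r").replace("\n", "\\n");
    }

    // Rewrites CRLF, lone CR, and lone LF to CRLF.
    // "\r\n" is listed first so an existing CRLF is matched as one unit.
    static String toCrLf(String text) {
        return text.replaceAll("\\r\\n|\\r|\\n", "\r\n");
    }

    public static void main(String[] args) {
        String message = "abcd\nabcd"; // stand-in for request.getParameter("message")
        System.out.println(showLineBreaks(message));
        System.out.println(showLineBreaks(toCrLf(message)));
    }
}
```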
Looking at the ASCII codes, I found that the newline is not sent as the usual \n, but as \r\n.
Regards.
You need to encodeURIComponent() before submitting the form and decodeURIComponent() on the server side.