Sanitize HTML string - java

I have an HTML string like:
<p dir="ltr"><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>bold</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>all</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>in</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u>one</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></p>
I want to sanitize the html like <b><i><u> bold all in one </b></i></u>
I tried this method: webText = webText.replaceAll("(</?(?:b|i|u)>)\\1+", "$1").replaceAll("</(b|i|u)><\\1>", "");
But it has no effect; the HTML remains cluttered. How can I fix this? Is there another regex or JSON approach?

Regex may help here, but in general it does not serve well as an HTML parser once things get complex. Jsoup is a great HTML library, and I can really recommend it.
Unfortunately, your HTML is still valid HTML, so the solution is tricky.
It's best to start with the Jsoup documentation, especially the part on its selector syntax.
Here's something to get started:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

final String html = ... // your html from above
// Parse the html string into a document
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
/*
 * Select all elements which ...
 *
 * (a) have text (the regex requires at least two characters)
 * (b) have no child elements of their own
 *
 * Iterate over those found and print them.
 */
for (Element element : doc.select("*:matches(^..+?$):not(:has(*))")) {
    System.out.println(element);
}
Result:
<u>bold</u>
<u>all</u>
<u>in</u>
<u>one</u>
If you need literally <b><i><u> bold all in one </b></i></u>:
final String html = ... // your html from above
// As above
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// All text of the document
String text = doc.text();
// Create an element and its children
Element element = new Element(Tag.valueOf("b"), "");
element.appendElement("i").appendElement("u").text(text);
System.out.println(element);
Result:
<b><i><u>bold all in one</u></i></b>
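If adding Jsoup is not an option, the same flattening can be approximated with a plain regex. The following is only a minimal sketch: the class name FlattenFormatting and the method flatten are made up for illustration, and it assumes the fragment passed in is the content inside the paragraph and contains no tags other than attribute-free b, i and u.

```java
import java.util.regex.Pattern;

class FlattenFormatting {
    // Matches opening and closing <b>, <i> and <u> tags without attributes.
    private static final Pattern FORMAT_TAGS =
            Pattern.compile("</?[biu]>", Pattern.CASE_INSENSITIVE);

    /** Strips every b/i/u tag from the fragment and wraps the remaining text once. */
    static String flatten(String fragment) {
        String content = FORMAT_TAGS.matcher(fragment).replaceAll("");
        return "<b><i><u>" + content + "</u></i></b>";
    }

    public static void main(String[] args) {
        // Shortened stand-in for the nested markup from the question.
        String inner = "<b><i><u><b><i><u>bold</u></i></b></u></i>"
                + "<i><u> </u></i><i><u>all</u></i></b>";
        System.out.println(flatten(inner)); // prints <b><i><u>bold all</u></i></b>
    }
}
```

Unlike the parser-based approach, this breaks as soon as the tags carry attributes or other elements appear in the fragment, so it only suits input as uniform as the example.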

On Android, you could try the method below to remove unwanted HTML tags (it relies on android.text.Html and strips all markup, leaving plain text):
public String stripHtml(String html) {
    return Html.fromHtml(html).toString();
}

Related

Jsoup don't parse xml correctly, missing tags

I want to parse an XML text, but Jsoup seems to delete the <col> tags.
This is what happens:
Original:
<rowh> <col>DTC Code</col> <col>Description</col> </rowh>
Result:
<rowh> DTC Code Description
</rowh>
This is the code I am using to see the content.
Document jDoc = Jsoup.parse(contentXML);
Log.d("Original", contentXML);
Log.d("Document", jDoc.outerHtml());
I need to count how many <col> tags are inside each <rowh> tag but it always returns 0. I am using Jsoup version 1.11.2
Maybe this helps:
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rowh><col>DTC Code</col><col>Description</col></rowh>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Elements e = doc.select("rowh");
String text = e.text();
Log.i("TAG1", text);
Output:
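As an alternative that does not depend on Jsoup at all, the <col> elements per <rowh> can also be counted with the DOM parser that ships with the JDK. This is a swapped-in technique, not what the answer above does, and it is only a sketch: ColCounter and countCols are hypothetical names, and the input is assumed to be well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ColCounter {
    /** Returns the number of <col> descendants of each <rowh>, in document order. */
    static int[] countCols(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList rows = doc.getElementsByTagName("rowh");
            int[] counts = new int[rows.getLength()];
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                counts[i] = row.getElementsByTagName("col").getLength();
            }
            return counts;
        } catch (Exception e) {
            // Parser configuration, I/O and SAX errors all end up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rowh> <col>DTC Code</col> <col>Description</col> </rowh>";
        System.out.println(Arrays.toString(countCols(xml))); // prints [2]
    }
}
```

A strict XML parser rejects malformed input instead of repairing it, which is the main trade-off compared with Jsoup's lenient parsing.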

JSOUP Element.html("<th>test</th>") ignore th tags

I am working on an HTML templating engine based on Jsoup.
Jsoup ignores th and td tags if the element is not inside a table.
To deal with this, I changed the parser to:
final Document docToWrite = Jsoup.parse(docToRead.outerHtml(),"", Parser.xmlParser());
But I haven't found a way to fill an Element with HTML that contains a td or a th:
element.html("<th>test</th>");
returns only test, because Jsoup cleans the HTML by removing tags it considers misplaced.
How can I solve this?
Thank you
If your element is a 'th', then calling:
element.html("<th>test</th>") // th.innerHTML = "<th>test</th>"
would produce dirty HTML:
<th><th>test</th></th>
which Jsoup correctly cleans up to:
<th>test</th> // th.innerHTML == "test"
To fill an element so that its innerHTML == "<th>test</th>", the element has to be a <tr> tag.
// Given
String s = "<th>test</th>";
assert element.tagName().equals("tr");
// When
element.html(s);
// Then
assert element.html().equals(s);

Java - Obtain text within script tag using Jsoup

I am using the Jsoup library to read a URL. This url has text within a few <script> tags. Is it possible for me to obtain the text within each <script> tag? Please note that I am not asking to parse a Javascript file as I am already aware JSoup does not allow that. The actual source code of the URL has text within a script tag, I need that.
doc = Jsoup.connect("http://www.example.com").timeout(10000).get();
Element div = doc.select("script").first();
for (Element element : div.children()) {
System.out.println(element.toString());
}
This is what one of the script tags looks like in the source code:
<script type="text/javascript">
(function() {
...
})();
</script>
Alternatively, you could use the Element#html() method that returns the inner html of an element.
Since 1.11.1: Use efficient Element#selectFirst() method to find the script element.
Document doc = Jsoup.connect("http://www.example.com").timeout(10000).get();
Element scriptElement = doc.selectFirst("script");
// Don't forget to check scriptElement is not null...
String jsCode = scriptElement.html();
Up to Jsoup 1.10.3: Combine Element#select() and Elements#first() calls to find the script element.
Document doc = Jsoup.connect("http://www.example.com").timeout(10000).get();
Element scriptElement = doc.select("script").first();
// Don't forget to check scriptElement is not null...
String jsCode = scriptElement.html();
Yes. You can use Element#getElementsByTag() to get all the script tags. The content of each script tag is represented by a DataNode.
Document doc = Jsoup.connect("http://stackoverflow.com/questions/16780517/java-obtain-text-within-script-tag-using-jsoup").timeout(10000).get();
Elements scriptElements = doc.getElementsByTag("script");
for (Element element : scriptElements) {
    for (DataNode node : element.dataNodes()) {
        System.out.println(node.getWholeData());
    }
    System.out.println("-------------------");
}
Document doc = Jsoup.parse(html);
Elements scripts = doc.getElementsByTag("script");
for (Element script : scripts) {
System.out.println(script.data());
}
For your case, the solution would look like this.
Document doc = Jsoup.connect("http://www.example.com").timeout(10000).get();
Elements scripts = doc.select("script");
for (Element script : scripts) {
String type = script.attr("type");
if (type.contentEquals("text/javascript")) {
String scriptData = script.data(); // your text from the script
break;
}
}

jsoup clean includes unwanted carriage return

This is currently vexing me.
Jsoup is including an extra line break in the returned string if the string includes <br />
eg.
String html ="TEST<br />TEST";
Jsoup.clean(html, org.jsoup.safety.Whitelist.basic());
returns
TEST\n<br />TEST
Any advice on how to avoid the inclusion of the troublesome \n?
Have you tried .text() or .ownText() from the Element class?
//If you want the whole page
String url = "http://www.yourwebsite.com";
Document doc = Jsoup.connect(url).get();
System.out.println(doc.text());
//If you want some specific part of the page
Elements elems = doc.select("query");
for (Element element : elems) {
System.out.println(element.text() + "\n");
System.out.println(element.ownText() + "\n\n");
}
If each element were <p>Hello <b>there</b> now!</p>:
The method text() would return Hello there now!
The method ownText() would return Hello now!
Just to make it easier to understand: text() returns all the text within the tag you selected, including its descendants. ownText() returns only the text owned by the tag itself, not the text of its children.
As for the query in doc.select("query"): it is a CSS-style selector, so you can search for any pattern you want.
Cleaner cleaner = new Cleaner(WHITE_LIST);
Document clean = cleaner.clean(body);
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
clean.outputSettings(outputSettings);
return clean.body().html();
outputSettings.prettyPrint(false);

jsoup tag extraction problem

<div style="height:240px;"><br>test: example<br>test1:example1</div>
Elements size = doc.select("div:contains(test:)");
How can I extract the values example and example1 from this HTML tag using Jsoup?
Since this HTML is not semantic enough for your final purpose (a <br> cannot have children and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction/tokenizing.
The best you can do is get the HTML content of the <div> using Jsoup and then extract it further using the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String[] parts = div.html().split("<br\\s*/?>"); // Matches both <br> and <br />, since the serialized form depends on the Jsoup version.
for (String part : parts) {
int colon = part.indexOf(':');
if (colon > -1) {
System.out.println(part.substring(colon + 1).trim());
}
}
This results in
example
example1
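The java.util.Scanner route mentioned above could look roughly like the following. It is a sketch under assumptions: LabelValueExtractor and values are hypothetical names, and the input is taken to be the inner HTML of the <div> as a plain string (for example the result of div.html()), so the block itself needs nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LabelValueExtractor {
    /** Returns the text after the first colon of every <br>-separated segment. */
    static List<String> values(String innerHtml) {
        List<String> result = new ArrayList<>();
        // Treat <br>, <br/> and <br /> (plus surrounding whitespace) as the token delimiter.
        try (Scanner scanner = new Scanner(innerHtml).useDelimiter("\\s*<br\\s*/?>\\s*")) {
            while (scanner.hasNext()) {
                String part = scanner.next();
                int colon = part.indexOf(':');
                if (colon > -1) {
                    result.add(part.substring(colon + 1).trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String innerHtml = "<br>test: example<br>test1:example1";
        System.out.println(values(innerHtml)); // prints [example, example1]
    }
}
```

Using the <br> variants as the Scanner delimiter avoids depending on how a particular Jsoup version serializes the tag.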
If I was the HTML author, I would have used a definition list for this. E.g.
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus easier to parse:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
Elements dds = document.select("#mydl dd");
for (Element dd : dds) {
    System.out.println(dd.text());
}
