Jsoup.clean without adding html entities

I'm cleaning some text from unwanted HTML tags (such as <script>) by using
String clean = Jsoup.clean(someInput, Whitelist.basicWithImages());
The problem is that it replaces, for instance, å with &aring; (which causes trouble for me since it's not "pure XML").
For example
Jsoup.clean("hello å <script></script> world", Whitelist.basicWithImages())
yields
"hello &aring; world"
but I would like
"hello å world"
Is there a simple way to achieve this? (I.e. simpler than converting &aring; back to å in the result.)

You can configure jsoup's escape mode: using EscapeMode.xhtml gives you output without named entities such as &aring;.
Here's a complete snippet that accepts str as input and cleans it using Whitelist.simpleText():
// Parse str into a Document
Document doc = Jsoup.parse(str);
// Clean the document.
doc = new Cleaner(Whitelist.simpleText()).clean(doc);
// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);
// Get back the string of the body.
str = doc.body().html();

There are already feature requests for this on the jsoup website. You can extend the source code yourself by adding a new empty Map and a new escaping type. If you don't want to do that, you can use StringEscapeUtils from Apache Commons Lang.
public static String getTextOnlyFromHtmlText(String htmlText) {
    Document doc = Jsoup.parse(htmlText);
    doc.outputSettings().charset("UTF-8");
    htmlText = Jsoup.clean(doc.body().html(), Whitelist.simpleText());
    // Turn entities such as &aring; back into characters (this also unescapes &lt;, &gt; and &amp;)
    htmlText = StringEscapeUtils.unescapeHtml(htmlText);
    return htmlText;
}

The answer from @bmoc works fine, but you could use a shorter solution:
// Clean html
Jsoup.clean(someInput, "yourBaseUriOrEmpty", Whitelist.simpleText(), new OutputSettings().escapeMode(EscapeMode.xhtml))

A simpler way to do this is
// clean the html
String output = Jsoup.clean(html, Whitelist.basicWithImages());
// Parse string into a document
Document doc = Jsoup.parse(output);
// Adjust escape mode
doc.outputSettings().escapeMode(EscapeMode.xhtml);
// Get back the string
System.out.println(doc.body().html());
I have tested this and it works

The accepted answer uses Jsoup.parse, which, after a quick glance at the source, seems more heavyweight than what Jsoup.clean does.
I copied the source code of Jsoup.clean(...) and added a line to set the escape mode. This should avoid some unnecessary steps done by the parse method, because it only has to handle a fragment rather than parse a whole HTML document.
private String clean(String html, Whitelist whitelist) {
    Document dirty = Jsoup.parseBodyFragment(html, "");
    Cleaner cleaner = new Cleaner(whitelist);
    Document clean = cleaner.clean(dirty);
    clean.outputSettings().escapeMode(EscapeMode.xhtml);
    return clean.body().html();
}

Simple way:
EscapeMode em = EscapeMode.xhtml;
em.getMap().clear();
doc.outputSettings().escapeMode(em);
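This will remove ALL HTML entities, including these: &apos;, &quot;, &amp;, &lt; and &gt;. EscapeMode.xhtml on its own still allows these entities. Note that em is the EscapeMode.xhtml constant itself, so clearing its map affects every document that uses that escape mode, not just this one.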

Parse the HTML as a Document, then use a Cleaner to clean the document and generate another one, get the outputSettings of the document and set the appropriate charset and the escape mode to xhtml, then transform the document to a String. Not tested, but should work.

Related

get <img> value from a string in java

I'm parsing data from a JSON file. Now, I have data like this:
String Content = <p><img class="alignleft size-full wp-image-56999" alt="abdullah" src="http://www.some.com/wp-content/uploads/2013/12/imageName.jpg" width="348" height="239" />Text</p>
<p>Text</p> <p>Text</p><p>The post Some Text appeared first on Some Webiste</p>
Now, I want to divide this string in two pieces. I want to get this URL from src.
http://www.some.com/wp-content/uploads/2013/12/imageName.jpg
and store it in a variable. Also, I want to remove the last line, "The Post appeared...", and store the text in another variable.
So, the questions are:
Is it possible to get that?
If possible, how can I achieve that ?

In Java
Get a Document object:
Document originalDoc = new SAXReader().read(new StringReader("<div>data</div>"));
Then you can parse it (read this tutorial):
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
In JavaScript
To get the attribute:
var url = document.getElementsByTagName('img')[0].getAttribute('src');
If you have a string and want a document object, use jQuery:
var stringValue = '<div>data</div>';
var myObject = $(stringValue);

Use String.substring(firstIndex, lastIndex) to get the link from the src attribute.
Learn to use an HTML parser like jsoup; it will be useful in the near future.

If it's a well-structured string, you can parse it with any DOM parser and extract the data from it.
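The DOM-parser suggestion could be sketched roughly as follows, using only the JDK's built-in XML parser. This assumes the fragment is well-formed XML (the sample in the question is, since its img tag is self-closed; HTML-only entities such as &nbsp; or unclosed tags would make it fail). The class name ImgSrcExtractor, its method names, and the synthetic <root> wrapper are illustrative choices, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImgSrcExtractor {

    // Parses a well-formed fragment by wrapping it in a synthetic <root> element.
    static Document parseFragment(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    // Returns the src attribute of the first <img>, or null if there is none.
    static String firstImgSrc(String fragment) {
        NodeList imgs = parseFragment(fragment).getElementsByTagName("img");
        return imgs.getLength() == 0 ? null : ((Element) imgs.item(0)).getAttribute("src");
    }

    // Returns the text of every <p> except the last one, joined by newlines.
    static String textWithoutLastParagraph(String fragment) {
        NodeList paragraphs = parseFragment(fragment).getElementsByTagName("p");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < paragraphs.getLength() - 1; i++) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(paragraphs.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String content = "<p><img alt=\"abdullah\" "
                + "src=\"http://www.some.com/wp-content/uploads/2013/12/imageName.jpg\" "
                + "width=\"348\" height=\"239\" />Text</p>"
                + "<p>Text</p><p>The post Some Text appeared first on Some Website</p>";
        System.out.println(firstImgSrc(content));
        System.out.println(textWithoutLastParagraph(content));
    }
}
```

For real-world HTML that is not guaranteed to be well-formed, an HTML parser such as jsoup is likely the more robust choice.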

java xml parsing between tags

What I'm trying to do is parse XML in Java, and I only want a snippet of the text from each tag.
XML example:
<data>\nSome Text :\n\MY Spectre around me night and day. Some More: Like a wild beast
guards my way.</data>
<data>\nSome Text :\n\Cruelty has a human heart. Some More: And Jealousy a human face
</data>
So far I have this:
NodeList ageList = firstItemElement.getElementsByTagName("data");
Element ageElement =(Element)ageList.item(0);
NodeList textAgeList = ageElement.getChildNodes();
out.write("Data : " + ((Node)textAgeList.item(0)).getNodeValue().trim());
I'm trying to get just the "Some More:....." part; I don't want the whole tag.
I'm also trying to get rid of all the \n.

If you're not restricted to the standard DOM API, you could try to use jOOX, which wraps standard DOM. Your example would then translate to:
// Use jOOX's jquery-like API to find elements and their text content
for (String string : $(firstItemElement).find("data").texts()) {
    // Use standard String methods to replace content
    System.out.println(string.replace("\\n", ""));
}

I would take all of the element text and use regular expressions to capture the relevant parts.
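A minimal sketch of that regular-expression approach, assuming the wanted part always follows the literal label "Some More:" as in the question's sample. The class name SomeMoreExtractor and its method are hypothetical; the input is the element text you already obtained from the DOM.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SomeMoreExtractor {

    // Captures everything after the "Some More:" label; DOTALL lets "." span line breaks.
    private static final Pattern SOME_MORE = Pattern.compile("Some More:\\s*(.*)", Pattern.DOTALL);

    // Returns the text after "Some More:", or null if the label is absent.
    static String extract(String elementText) {
        Matcher matcher = SOME_MORE.matcher(elementText);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(1)
                .replace("\\n", " ")      // literal backslash-n sequences in the data
                .replaceAll("\\s+", " ")  // real line breaks and runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String text = "\\nSome Text :\\n\\MY Spectre around me night and day. "
                + "Some More: Like a wild beast\nguards my way.";
        System.out.println(extract(text));
    }
}
```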

How to change the width and height of an html file using java

I want to change width="xyz", where xyz can be any value, to width="300". I read up on regular expressions, and this is what I am using:
String holder = "width=\"340\"";
String replacer="width=\"[0-9]*\"";
theWeb.replaceAll(replacer,holder);
where theWeb is the string. But the value is not getting replaced. Any help would be appreciated.

Your regex is correct. What you might be forgetting is that Java strings are immutable: string methods never modify the string they are called on, they only return a new string with the transformation applied. Try this instead:
String replacement = "width=\"340\"";
String regex = "width=\"[0-9]*\"";
String newWeb = theWeb.replaceAll(regex, replacement); // newWeb holds the new text
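To make the difference concrete, here is a small self-contained sketch. The class name WidthReplacer and the sample markup are made up for illustration; the regex is the one from the question.

```java
class WidthReplacer {

    // Returns a copy of html in which every width="<digits>" attribute is set to newWidth.
    static String setWidth(String html, int newWidth) {
        return html.replaceAll("width=\"[0-9]*\"", "width=\"" + newWidth + "\"");
    }

    public static void main(String[] args) {
        String theWeb = "<img width=\"120\" height=\"80\">";

        // The result is discarded here, so theWeb still holds the old value.
        theWeb.replaceAll("width=\"[0-9]*\"", "width=\"340\"");
        System.out.println(theWeb);

        // Assigning the returned string is what makes the change visible.
        theWeb = setWidth(theWeb, 340);
        System.out.println(theWeb);
    }
}
```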

It is better to use jsoup for manipulating and extracting data from HTML.
See this link for more details:
http://jsoup.org/

Java: Carriage returns populating a var in js code?

I am not sure if this is possible, but I'm trying to find a front-end solution to this situation:
I am setting a JavaScript variable to a dynamic tag, populated by backend Java code:
var myString = '#myDynamicContent#';
However, in some situations the content contains a carriage return, which breaks the code:
var mystring = '<div>
Carriage Return happened above and below.
</div>';
Is there any way I can resolve this problem on the front end? Or is it too late in the script to do something about it, because the dynamic tag will run before any JavaScript runs (thus the script is broken by that point)?

I'm sure my JS could be cleaned up (just thought this was a fun problem), but you could search out the comment in the JS.
Let's say your JS looks like this (notice I added a tag to the comment so we know we're going after the correct one, and there is a div just for testing):
<script id="testScript">
/*<captureMe><div>
Carriage Return happened above and below.
</div>
*/
var foo = 'bar';
</script>
<div id='test'>What do I see:</div>
Just use this to grab the comment:
var something = $("#testScript").html();
var newSomething = '';
newSomething = something.substr(something.indexOf("/*<captureMe>")+13);
newSomething = newSomething.substr(0, newSomething.indexOf("*/"));
$('#test').append('<br>'+newSomething); // just proving we captured the output, will not render returns or newline as expected by HTML
Technically, it works :), scripting-scripting...
Charbs

JavaScript supports strings that can span multiple lines by putting a backslash (\) at the end of the line, for example:
var myString = 'foo\
bar';
So you should be able to do a Java replace when you write in your server-side variable:
var myString = '#myDynamicContent.replaceAll("\\n", "\\\\n")#';

Replace the \n and/or \r with \\n and/or \\r respectively ... but it has to be done in the server-side language (in your case Java); it can't be done in JavaScript.
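A server-side escape along those lines could look like the sketch below. It is a minimal, hypothetical helper (JsStringEscaper is not an existing library class) that handles only backslashes, quotes, and line breaks; how you hook it into the #myDynamicContent# substitution depends on your backend, and a production escaper would need to cover more cases.

```java
class JsStringEscaper {

    // Escapes a value so it can sit between quotes in a JavaScript string literal.
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break; // backslash
                case '\'': out.append("\\'");  break; // single quote
                case '"':  out.append("\\\""); break; // double quote
                case '\n': out.append("\\n");  break; // line feed
                case '\r': out.append("\\r");  break; // carriage return
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String content = "<div>\r\nCarriage Return happened above and below.\r\n</div>";
        // The generated JavaScript now stays on a single line.
        System.out.println("var myString = '" + escape(content) + "';");
    }
}
```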

Building off of @Charbs' answer, you could avoid the JavaScript comments if you give your script tag a different MIME type, so the browser won't try to evaluate it as JavaScript:
<script id="testScript" type="text/notjs" style="display:none">#myDynamicContent#</script>
And then just grab it like this (using jQuery):
var myString = $('#testScript').text();

To me it looks like you're doing token replacement instead of using a template engine. If you like token replacement, you might like Snippetory too, as it leads to similar code while offering a number of additional features. Using
var myString = '{v:myDynamicContent enc="string"}'
would create
var mystring = '<div>\r\n Carriage Return happened above and below.\r\n </div>'
And thus solve your problem. But you would have to change your code behind, too.

Extracting an "encompassing" string based on a term within the string

I have a java function to extract a string out of the HTML Page source for any website...The function basically accepts the site name, along with a term to search for. Now, this search term is always contained within javascript tags. What I need to do is to pull the entire javascript (within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.

You could use a regular expression along the lines of
String someHTML = //get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>",Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
// group(1) is only available after a successful find()
String result = myMatcher.find() ? myMatcher.group(1) : null;

I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser, that's usually the better solution. That said, for limited scopes, I often use regEx. It's a mean beast though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^<]+
That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = //get your HTML from wherever
String lKeyword = "rtTop";
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)"+lKeyword +"(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern ,Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
myMatcher.find();
String lPreKeyword = myMatcher.group(3);
String lPostKeyword = myMatcher.group(5);
String result = lPreKeyword + lKeyword + lPostKeyword;
An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
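For comparison, here is a compact, self-contained variant of the same idea. It is a sketch under a few assumptions: tag names are lower case, the script body never contains the text "</script>", and the keyword is matched literally via Pattern.quote. ScriptFinder and findScriptContaining are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptFinder {

    // Returns the body of the first <script> element containing keyword, or null if none does.
    static String findScriptContaining(String html, String keyword) {
        // (?:(?!</script>).)* consumes characters without ever crossing a closing script tag,
        // so a match cannot start in one script element and end in another.
        String notClosing = "(?:(?!</script>).)*";
        Pattern pattern = Pattern.compile(
                "<script\\b[^>]*>(" + notClosing + Pattern.quote(keyword) + notClosing + ")</script>",
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script>"
                + "<script type=\"text/javascript\">\n"
                + "//Roundtrip\n"
                + "rtTop = Number(new Date());\n"
                + "</script>";
        System.out.println(findScriptContaining(html, "rtTop"));
    }
}
```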

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
