How to Strip Out the Text From an HTML String in Java - java

I want to analyze the structure of HTML pages. I have each page as a string, and I want to strip out the text and keep only the HTML structure. I don't want to use a DOM parser, and I need something robust that works on regular HTML, not only XHTML. I know regular expressions are good enough to strip HTML tags from a string, but can they be used to strip out the text and keep only the HTML tags?
Do you know any other option/framework I could use?

I doubt that there is an easy way to do this using regex.
Jericho is a pretty neat HTML parser with a small footprint and a single jar without additional external libraries.

Do you know any other option/framework I could use?
You might want to look at JSoup. Seems to be designed to solve exactly this type of problem.
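If JSoup fits, a minimal sketch of the idea might look like the following (the sample HTML and variable names here are just for illustration): remove every text node so only the markup remains.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

String html = "<div><p>Some text <b>bold</b></p></div>";
Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);
// Drop every text node; the element structure is left intact.
for (Element el : doc.getAllElements()) {
    for (TextNode tn : el.textNodes()) {
        tn.remove();
    }
}
System.out.println(doc.body().html()); // <div><p><b></b></p></div>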

If you've stripped out tags before, you know the basic gist is to strip out everything between < and >. Stripping out text is very similar, except you're stripping out everything between > and <. So yes, regular expressions would serve you very well in stripping out the text and leaving just the tags. They could also be used to strip out tag attributes as well if you didn't want to deal with them.
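For instance, a minimal sketch of that idea (assuming the text itself never contains a bare '<' or '>', and where htmlString holds the page source):
String tagsOnly = htmlString.replaceAll(">[^<]*<", "><");
The [^<] character class keeps each match from running past the next tag; text before the first '<' or after the last '>' would still need a separate pass.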

This might give you a decent start. I don't have much experience with HTML so I don't know if there is anything else to parse out of the string besides < tags >.
public static void main(String[] args) {
    String html = "<body> text text text text </body>";
    StringBuilder htmlTags = new StringBuilder();
    for (int i = 0; i < html.length(); i++) {
        // A '<' marks the start of a tag: copy characters until the matching '>'.
        if (tagStart(Character.toString(html.charAt(i)))) {
            int j = i;
            while (j < html.length()) {
                htmlTags.append(html.charAt(j));
                if (tagStop(Character.toString(html.charAt(j)))) {
                    break;
                }
                j++;
            }
            i = j; // skip past the tag we just copied
        }
    }
    System.out.println(htmlTags);
}

private static boolean tagStart(String check) {
    return check.equals("<");
}

private static boolean tagStop(String check) {
    return check.equals(">");
}

Something along the lines of:
pageSource.replaceAll(">.*?<", "><");
Should get you started.

Related

Determine file extension for image urls

Is there a reliable and fast way to determine the file extension of an image URL? There are a few options I see, but none of them work consistently for images in the format below:
https://cdn-image.blay.com/sites/default/files/styles/1600x1000/public/images/12.jpg?itok=e-zA1T
I have tried:
new MimetypesFileTypeMap().getContentType(url)
Results in the generic "application/octet-stream" in which case I use the below two:
Files.getFileExtension
FilenameUtils.getExtension
I would like to avoid regex when possible, so is there another utility that properly handles links that have query args (.jpeg?blahblah)? I would also like to avoid downloading the image or connecting to the URL in any way, as this needs to be a performant call.
If you can trust that the URLs are not malformed, how about this:
FilenameUtils.getExtension(URI.create(url).getPath())
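For instance, with the URL from the question (URI.getPath() drops the "?itok=..." query part before the extension is taken):
import java.net.URI;
import org.apache.commons.io.FilenameUtils;

String url = "https://cdn-image.blay.com/sites/default/files/styles/1600x1000/public/images/12.jpg?itok=e-zA1T";
String ext = FilenameUtils.getExtension(URI.create(url).getPath()); // "jpg"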
Can't you just look at the file extension in the URL? That would be something like:
public static String getFileExtension(String url) {
    int phpChar = url.length();
    for (int i = 0; i < url.length(); i++) {
        if (url.charAt(i) == '?') {
            phpChar = i;
            break;
        }
    }
    int character = phpChar - 1;
    while (url.charAt(character) != '.') character -= 1;
    return url.substring(character + 1, phpChar);
}
Maybe not the most elegant solution, but it works, even with the php ? in the url.

Java Library to truncate html strings?

I need to truncate an HTML string that was already sanitized by my app before storing in the DB and contains only links, images and formatting tags. But when presenting it to users, it needs to be truncated to give an overview of the content.
So I need to abbreviate html strings in java such that
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
when truncated does not return something like this
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/><a href="htt
but instead returns
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/>
Your requirements are a bit vague, even after reading all the comments. Given your example and explanations, I assume your requirements are the following:
The input is a string consisting of (x)html tags. Your example doesn't contain this, but I assume the input can contain text between the tags.
In the context of your problem, we do not care about nesting. So the input is really only text intermingled with tags, where opening, closing and self-closing tags are all considered equivalent.
Tags can contain quoted values.
You want to truncate your string such that the string is not truncated in the middle of a tag. So in the truncated string every '<' character must have a corresponding '>' character.
I'll give you two solutions, a simple one which may not be correct, depending on what the input looks like exactly, and a more complex one which is correct.
First solution
For the first solution, we first find the last '>' character before the truncate size (this corresponds to the last tag which was completely closed). After this character may come text which does not belong to any tag, so we then search for the first '<' character after the last closed tag. In code:
public static String truncate1(String input, int size)
{
    if (input.length() < size) return input;
    int pos = input.lastIndexOf('>', size);
    int pos2 = input.indexOf('<', pos);
    if (pos2 < 0 || pos2 >= size) {
        return input.substring(0, size);
    }
    else {
        return input.substring(0, pos2);
    }
}
Of course this solution does not consider the quoted value strings: the '<' and '>' characters might occur inside a string, in which case they should be ignored. I mention the solution anyway because you mention your input is sanitized, so possibly you can ensure that the quoted strings never contain '<' and '>' characters.
Second solution
To consider the quoted strings, we cannot rely on standard Java classes anymore, but we have to scan the input ourselves and remember if we are currently inside a tag and inside a string or not. If we encounter a '<' character outside of a string, we remember its position, so that when we reach the truncate point we know the position of the last opened tag. If that tag wasn't closed, we truncate before the beginning of that tag. In code:
public static String truncate2(String input, int size)
{
    if (input.length() < size) return input;
    int lastTagStart = 0;
    boolean inString = false;
    boolean inTag = false;
    for (int pos = 0; pos < size; pos++) {
        switch (input.charAt(pos)) {
        case '<':
            if (!inString && !inTag) {
                lastTagStart = pos;
                inTag = true;
            }
            break;
        case '>':
            if (!inString) inTag = false;
            break;
        case '\"':
            if (inTag) inString = !inString;
            break;
        }
    }
    if (!inTag) lastTagStart = size;
    return input.substring(0, lastTagStart);
}
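For illustration, calling truncate2 on the example from the question with a cut-off of, say, 75 should keep only the complete tags and drop the partially written <a href=...> tag:
String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n"
        + "<br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
System.out.println(truncate2(html, 75));
// <img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
// <br/>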
A robust way of doing it is to use the HotSAX code, which parses HTML and lets you interface with the parser using the traditional low-level SAX XML API. (Note it is not an XML parser: it parses poorly formed HTML and only chooses to let you interface with it through a standard XML API.)
Here on github I have created a working quick-and-dirty example project which has a main class that parses your truncated example string:
XMLReader parser = XMLReaderFactory.createXMLReader("hotsax.html.sax.SaxParser");
final StringBuilder builder = new StringBuilder();
ContentHandler handler = new DoNothingContentHandler() {
    StringBuilder wholeTag = new StringBuilder();
    boolean hasText = false;
    boolean hasElements = false;
    String lastStart = "";

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String text = (new String(ch, start, length)).trim();
        wholeTag.append(text);
        hasText = true;
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
        if (!hasText && !hasElements && lastStart.equals(localName)) {
            builder.append("<" + localName + "/>");
        } else {
            wholeTag.append("</" + localName + ">");
            builder.append(wholeTag.toString());
        }
        wholeTag = new StringBuilder();
        hasText = false;
        hasElements = false;
    }

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
        wholeTag.append("<" + localName);
        for (int i = 0; i < atts.getLength(); i++) {
            wholeTag.append(" " + atts.getQName(i) + "='" + atts.getValue(i) + "'");
            hasElements = true;
        }
        wholeTag.append(">");
        lastStart = localName;
        hasText = false;
    }
};
parser.setContentHandler(handler);
//parser.parse(new InputSource( new StringReader( "<div>this is the <em>end</em> my <br> friend some link" ) ));
parser.parse(new InputSource(new StringReader("<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/><a href=\"htt")));
System.out.println(builder.toString());
It outputs:
<img src='http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg'></img><br/>
It is adding an </img> tag, but that's harmless for HTML, and it would be possible to tweak the code to exactly match the input in the output if you felt that necessary.
HotSAX is actually generated code, produced by running yacc/flex compiler tools over the HtmlParser.y and StyleLexer.flex files which define the low-level grammar of HTML. So you benefit from the work of the person who created that grammar; all you need to do is write some fairly trivial code and test cases to reassemble the parsed fragments as shown above. That's much better than trying to write your own regular expressions, or worse a hand-coded string scanner, to interpret the string, as that is very fragile.
After understanding what you want, here is the simplest solution I could come up with.
Just work backwards from the end of your substring until you find '>'. That is the end mark of the last tag, so in the majority of cases you can be sure you only have complete tags.
But what if the '>' is inside text?
Well, to be sure about this, just keep searching until you find '<' and check that it is part of a tag (do you know the tag strings? Since you only have links, images and formatting, you can easily check this). If you find another '>' before finding a '<' that starts a tag, that is the new end of your string.
Easy to do, correct and should work for you.
If you are not certain whether strings / attributes can contain '<' or '>', you need to check for the appearance of " and =" to know whether you are inside a string or not (remember you can cut off an attribute value). But I think this is over-engineering. I have never found an attribute with '<' or '>' in it, and within text it is usually escaped as &lt; and the like.
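A rough sketch of the first step described above (a hypothetical helper, assuming '<' and '>' never appear inside attribute values; the text-handling refinement discussed above is not included):
static String truncateAtLastClosedTag(String html, int maxLength) {
    if (html.length() <= maxLength) return html;
    // Walk back from the cut-off point to the '>' that closes the last complete tag.
    int end = html.lastIndexOf('>', maxLength - 1);
    return html.substring(0, end + 1); // "" if no tag closes before the limit
}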
I don't know the context of the problem the OP needs to solve, but I am not sure if it makes a lot of sense to truncate html code by the length of its source code instead of the length of its visual representation (which can become arbitrarily complex, of course).
Maybe a combined solution could be useful, so you don't penalize html code with a lot of markup or long links, but also set a clear total limit which cannot be exceeded. Like others already wrote, the usage of a dedicated HTML parser like JSoup allows the processing of non well-formed or even invalid HTML.
The solution is loosely based on JSoup's Cleaner. It traverses the parsed dom tree of the source code and tries to recreate a destination tree while continuously checking, if a limit has been reached.
import org.jsoup.nodes.*;
import org.jsoup.parser.*;
import org.jsoup.select.*;

String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />" +
        "<br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
//String html = "<b>foo</b>bar<p class=\"baz\">Some <img />Long Text</p><a href='#'>hello</a>";

Document srcDoc = Parser.parseBodyFragment(html, "");
srcDoc.outputSettings().prettyPrint(false);
Document dstDoc = Document.createShell(srcDoc.baseUri());
dstDoc.outputSettings().prettyPrint(false);
Element dst = dstDoc.body();

NodeVisitor v = new NodeVisitor() {
    private static final int MAX_HTML_LEN = 85;
    private static final int MAX_TEXT_LEN = 40;
    Element cur = dst;
    boolean stop = false;
    int resTextLength = 0;

    @Override
    public void head(Node node, int depth) {
        // ignore "body" element
        if (depth > 0) {
            if (node instanceof Element) {
                Element curElement = (Element) node;
                cur = cur.appendElement(curElement.tagName());
                cur.attributes().addAll(curElement.attributes());
                String resHtml = dst.html();
                if (resHtml.length() > MAX_HTML_LEN) {
                    cur.remove();
                    throw new IllegalStateException("html too long");
                }
            } else if (node instanceof TextNode) {
                String curText = ((TextNode) node).getWholeText();
                String resHtml = dst.html();
                if (curText.length() + resHtml.length() > MAX_HTML_LEN) {
                    cur.appendText(curText.substring(0, MAX_HTML_LEN - resHtml.length()));
                    throw new IllegalStateException("html too long");
                } else if (curText.length() + resTextLength > MAX_TEXT_LEN) {
                    cur.appendText(curText.substring(0, MAX_TEXT_LEN - resTextLength));
                    throw new IllegalStateException("text too long");
                } else {
                    resTextLength += curText.length();
                    cur.appendText(curText);
                }
            }
        }
    }

    @Override
    public void tail(Node node, int depth) {
        if (depth > 0 && node instanceof Element) {
            cur = cur.parent();
        }
    }
};

try {
    NodeTraversor t = new NodeTraversor(v);
    t.traverse(srcDoc.body());
} catch (IllegalStateException ex) {
    System.out.println(ex.getMessage());
}

System.out.println(" in='" + srcDoc.body().html() + "'");
System.out.println("out='" + dst.html() + "'");
For the given example with max length of 85, the result is:
html too long
in='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br>'
out='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br>'
It also correctly truncates within nested elements, for a max html length of 16 the result is:
html too long
in='<i>f<b>oo</b>b</i>ar'
out='<i>f<b>o</b></i>'
For a maximum text length of 2, the result of a long link would be:
text too long
in='<b>foo</b>bar'
out='<b>fo</b>'
You can achieve this with the JSoup library, an HTML parser.
You can download it from the link below.
Download JSOUP
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class HTMLParser
{
    public static void main(String[] args)
    {
        String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
        Document doc = Jsoup.parse(html);
        doc.select("a").remove();
        System.out.println(doc.body().children());
    }
}
Well, whatever you want to do, there are two libraries out there, JSoup and HtmlParser, which I tend to use. Please check them out. Also, I barely see XHTML in the wild anymore. It's more about HTML5 (which does not have an XHTML counterpart) nowadays.
[Update]
I mention JSoup and HtmlParser since they are fault tolerant in the way a browser is. Please check whether they suit you, since they are very good at dealing with malformed and damaged HTML text. Create a DOM out of your HTML and write it back to a string and you should get rid of the damaged tags; you can also filter the DOM yourself and remove even more content if you have to.
PS: I guess the XML decade is finally (and gladly) over. Today JSON is going to be overused.
A third option I would consider is not to work with strings in the first place.
If I remember correctly, there are DOM tree representations that stay close to the underlying string representation and are therefore character-exact. I wrote one myself, but I think JSoup has such a mode. Since there are a lot of parsers out there, you should be able to find one that actually does this.
With such a parser you can easily see from which string position to which a tag runs. Those parsers keep the document as a String and only store range information (start and stop positions within the document), avoiding duplicating that information for nested nodes.
Therefore you can find the outermost node for a given position, know exactly where it starts and ends, and easily decide whether this tag (including all its children) can be presented within your snippet. That way you get the chance to print complete text nodes and the like, without the risk of presenting only partial tag information or headline text.
If you do not find a parser that suits you for this, you can ask me for advice.

Java Regex or XML parser?

I want to remove any tags such as
<p>hello <namespace:tag : a>hello</namespace:tag></p>
to become
<p> hello hello </p>
What is the best way to do this? If it is regex, then for some reason this is not working; can anyone help?
(<|</)[:]{1,2}[^</>]>
edit:
added
Definitely use an XML parser. Regex should not be used to parse *ML
You should not use regex for these purposes; use a parser like lxml or BeautifulSoup:
>>> import lxml.html as lxht
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>'
>>> lxht.fromstring(myString).text_content()
'hello hello'
Here is a reason why you should not parse html/xml with regex.
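The snippet above is Python; a rough Java equivalent using JSoup (whitespace in the output may differ slightly) would be:
import org.jsoup.Jsoup;

String myString = "<p>hello <namespace:tag : a>hello</namespace:tag></p>";
String text = Jsoup.parse(myString).text();
System.out.println(text); // hello hello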
If you're just trying to pull the plain text out of some simple XML, the best (fastest, smallest memory footprint) would be to just run a for loop over the data:
// Roughly, in Java:
boolean inMarkup = false;
StringBuilder text = new StringBuilder();
for (int i = 0; i < data.length(); i++) {  // 'data' is whatever string you're reading from
    char c = data.charAt(i);
    if (c == '<') inMarkup = true;          // entering markup
    else if (c == '>') inMarkup = false;    // leaving markup
    else if (!inMarkup) text.append(c);     // keep only the plain text
}
Note: This will break if you encounter things like CDATA, JavaScript, or CSS in your parsing.
So, to sum up... if it's simple, do something like the above and not a regular expression. If it isn't that simple, listen to the other guys and use an advanced parser.
This is a solution I personally used for a similar problem in Java. The library used for this is Jsoup: http://jsoup.org/.
In my particular case I had to unwrap tags that had an attribute with a particular value in them. You see that reflected in this code; it's not the exact solution to this problem, but it could put you on your way.
public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
        Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }
    Document doc = Jsoup.parse(html);
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
        if (StringUtils.isBlank(attribute)) {
            element.unwrap();
        } else {
            String attr = element.attr(attribute);
            if (!StringUtils.isBlank(attr)) {
                String newData = attr.replaceAll(matchRegEx, "");
                if (StringUtils.isBlank(newData)) {
                    element.unwrap();
                }
            }
        }
    }
    return doc.html();
}

removing html tags using a for loop in Java [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Removing HTML from a Java String
I am having a problem removing HTML tags from a text file in Java. I know it would be easy to use something like
str=str.toString().replaceAll("\\<.*?>","");
However, I want to know if I could split the string, go through it, and replace everything starting from < to > with "".
I tried
String [] str= "<tag>with some string </tag>";
String s = "";
for (i = 0; i < str.length; i++)
{
    if (str[i].toString() == "<")
    {
        str[i] = "";
    }
    else if (str[i].toString() == ">")
    {
        s = s + str[i+1];
    }
}
When I try printing the new string s, it just prints out white space.
Thanks for the help.
You need some flag variable denoting that you are inside a tag, and a third case for when you are not in a tag, so that the rest of the content gets added to the string. For example:
char[] str = "<tag>with some string </tag>".toCharArray();
String s = "";
boolean inTag = false;
for (int i = 0; i < str.length; i++)
{
    if (str[i] == '<')
    {
        inTag = true;
    }
    else if (str[i] == '>')
    {
        inTag = false;
    }
    else
    {
        if (!inTag)
            s = s + str[i];
    }
}
The code you supplied has a few errors. But anyway, you may do it with String#split:
String[] strArr = str.split("\\<.*?>");
This will eliminate the tags.
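For example, rejoining the pieces gives the text with the tags removed:
String str = "<tag>with some string </tag>";
String textOnly = String.join("", str.split("\\<.*?>"));
System.out.println(textOnly); // "with some string "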
In order to remove the HTML tags from a text file, just look into this topic previously discussed in this forum.

Best way to encode text data for XML in Java?

Very similar to this question, except for Java.
What is the recommended way of encoding strings for XML output in Java? The strings might contain characters like "&", "<", etc.
As others have mentioned, using an XML library is the easiest way. If you do want to escape yourself, you could look into StringEscapeUtils from the Apache Commons Lang library.
Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.
Just use.
<![CDATA[ your text here ]]>
This will allow any characters except the ending
]]>
So you can include characters that would be illegal such as & and >. For example.
<element><![CDATA[ characters such as & and > are allowed ]]></element>
However, attributes will need to be escaped as CDATA blocks can not be used for them.
This question is eight years old and still has no fully correct answer! No, you should not have to import an entire third party API to do this simple task. Bad advice.
The following method will:
correctly handle characters outside the basic multilingual plane
escape characters required in XML
escape any non-ASCII characters, which is optional but common
replace illegal characters in XML 1.0 with the Unicode substitution character. There is no best option here - removing them is just as valid.
I've tried to optimise for the most common case, while still ensuring you could pipe /dev/random through this and get a valid string in XML.
public static String encodeXML(CharSequence s) {
    StringBuilder sb = new StringBuilder();
    int len = s.length();
    for (int i = 0; i < len; i++) {
        int c = s.charAt(i);
        if (c >= 0xd800 && c <= 0xdbff && i + 1 < len) {
            c = ((c - 0xd7c0) << 10) | (s.charAt(++i) & 0x3ff);    // UTF16 decode
        }
        if (c < 0x80) {      // ASCII range: test most common case first
            if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
                // Illegal XML character, even encoded. Skip or substitute
                sb.append('\uFFFD');   // Unicode replacement character
            } else {
                switch (c) {
                  case '&':  sb.append("&amp;"); break;
                  case '>':  sb.append("&gt;"); break;
                  case '<':  sb.append("&lt;"); break;
                  // Uncomment next two if encoding for an XML attribute
                  // case '\'': sb.append("&apos;"); break;
                  // case '\"': sb.append("&quot;"); break;
                  // Uncomment next three if you prefer, but not required
                  // case '\n': sb.append("&#10;"); break;
                  // case '\r': sb.append("&#13;"); break;
                  // case '\t': sb.append("&#9;"); break;
                  default:   sb.append((char) c);
                }
            }
        } else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
            // Illegal XML character, even encoded. Skip or substitute
            sb.append('\uFFFD');   // Unicode replacement character
        } else {
            sb.append("&#x");
            sb.append(Integer.toHexString(c));
            sb.append(';');
        }
    }
    return sb.toString();
}
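A quick usage example of the method above:
String escaped = encodeXML("if a < b & b > c then é");
// -> if a &lt; b &amp; b &gt; c then &#xe9;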
Edit: for those who continue to insist it foolish to write your own code for this when there are perfectly good Java APIs to deal with XML, you might like to know that the StAX API included with Oracle Java 8 (I haven't tested others) fails to encode CDATA content correctly: it doesn't escape ]]> sequences in the content. A third party library, even one that's part of the Java core, is not always the best option.
This has worked well for me to provide an escaped version of a text string:
public class XMLHelper {

    /**
     * Returns the string where all non-ascii and <, &, > are encoded as numeric entities. I.e. "<A & B >"
     * .... (insert result here). The result is safe to include anywhere in a text field in an XML-string. If there were
     * no characters to protect, the original string is returned.
     *
     * @param originalUnprotectedString
     *            original string which may contain characters either reserved in XML or with different representation
     *            in different encodings (like 8859-1 and UTF-8)
     * @return
     */
    public static String protectSpecialCharacters(String originalUnprotectedString) {
        if (originalUnprotectedString == null) {
            return null;
        }
        boolean anyCharactersProtected = false;
        StringBuffer stringBuffer = new StringBuffer();
        for (int i = 0; i < originalUnprotectedString.length(); i++) {
            char ch = originalUnprotectedString.charAt(i);
            boolean controlCharacter = ch < 32;
            boolean unicodeButNotAscii = ch > 126;
            boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';
            if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
                stringBuffer.append("&#" + (int) ch + ";");
                anyCharactersProtected = true;
            } else {
                stringBuffer.append(ch);
            }
        }
        if (anyCharactersProtected == false) {
            return originalUnprotectedString;
        }
        return stringBuffer.toString();
    }
}
Try this:
String xmlEscapeText(String t) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < t.length(); i++) {
        char c = t.charAt(i);
        switch (c) {
            case '<':  sb.append("&lt;");   break;
            case '>':  sb.append("&gt;");   break;
            case '\"': sb.append("&quot;"); break;
            case '&':  sb.append("&amp;");  break;
            case '\'': sb.append("&apos;"); break;
            default:
                if (c > 0x7e) {
                    sb.append("&#" + ((int) c) + ";");
                } else {
                    sb.append(c);
                }
        }
    }
    return sb.toString();
}
StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.
To escape control characters with Apache commons-lang, use
NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))
public String escapeXml(String s) {
    return s.replaceAll("&", "&amp;").replaceAll(">", "&gt;").replaceAll("<", "&lt;").replaceAll("\"", "&quot;").replaceAll("'", "&apos;");
}
For those looking for the quickest-to-write solution: use methods from apache commons-lang:
StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() is now deprecated, but was used commonly in the past
Remember to include dependency:
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version> <!--check current version! -->
</dependency>
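A minimal usage sketch (assuming commons-lang3 is on the classpath):
import org.apache.commons.lang3.StringEscapeUtils;

String escaped = StringEscapeUtils.escapeXml10("1 < 2 & 4 > 3");
// -> 1 &lt; 2 &amp; 4 &gt; 3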
While idealism says use an XML library, IMHO if you have a basic idea of XML then common sense and performance says template it all the way. It's arguably more readable too. Though using the escaping routines of a library is probably a good idea.
Consider this: XML was meant to be written by humans.
Use libraries for generating XML when having your XML as an "object" better models your problem. For example, if pluggable modules participate in the process of building this XML.
Edit: as for how to actually escape XML in templates, use of CDATA or escapeXml(string) from JSTL are two good solutions, escapeXml(string) can be used like this:
<%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>
<item>${fn:escapeXml(value)}</item>
The behavior of StringEscapeUtils.escapeXml() has changed from Commons Lang 2.5 to 3.0.
It now no longer escapes Unicode characters greater than 0x7f.
This is a good thing; the old method was a bit too eager, escaping characters as entities when they could just be inserted into a UTF-8 document.
The new escapers to be included in Google Guava 11.0 also seem promising:
http://code.google.com/p/guava-libraries/issues/detail?id=799
While I agree with Jon Skeet in principle, sometimes I don't have the option to use an external XML library. And I find it peculiar that the two functions to escape/unescape a simple value (attribute or tag content, not a full document) are not available in the standard XML libraries included with Java.
As a result and based on the different answers I have seen posted here and elsewhere, here is the solution I've ended up creating (nothing worked as a simple copy/paste):
public final static String ESCAPE_CHARS = "<>&\"\'";
public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
        "&lt;"
      , "&gt;"
      , "&amp;"
      , "&quot;"
      , "&apos;"
}));

private static String UNICODE_NULL = "" + ((char) 0x00); // null
private static String UNICODE_LOW  = "" + ((char) 0x20); // space
private static String UNICODE_HIGH = "" + ((char) 0x7f);

// should only be used for the content of an attribute or tag
public static String toEscaped(String content) {
    String result = content;
    if ((content != null) && (content.length() > 0)) {
        boolean modified = false;
        StringBuilder stringBuilder = new StringBuilder(content.length());
        for (int i = 0, count = content.length(); i < count; ++i) {
            String character = content.substring(i, i + 1);
            int pos = ESCAPE_CHARS.indexOf(character);
            if (pos > -1) {
                stringBuilder.append(ESCAPE_STRINGS.get(pos));
                modified = true;
            }
            else {
                if ((character.compareTo(UNICODE_LOW) > -1)
                        && (character.compareTo(UNICODE_HIGH) < 1)) {
                    stringBuilder.append(character);
                }
                else {
                    // Per URL reference below, the Unicode null character is always restricted from XML
                    // URL: https://en.wikipedia.org/wiki/Valid_characters_in_XML
                    if (character.compareTo(UNICODE_NULL) != 0) {
                        stringBuilder.append("&#" + ((int) character.charAt(0)) + ";");
                    }
                    modified = true;
                }
            }
        }
        if (modified) {
            result = stringBuilder.toString();
        }
    }
    return result;
}
The above accommodates several different things:
avoids using char based logic until it absolutely has to - improves unicode compatibility
attempts to be as efficient as possible given the probability is the second "if" condition is likely the most used pathway
is a pure function; i.e. is thread-safe
optimizes nicely with the garbage collector by only returning the contents of the StringBuilder if something actually changed - otherwise, the original string is returned
At some point, I will write the inversion of this function, toUnescaped(). I just don't have time to do that today. When I do, I will come update this answer with the code. :)
Note: Your question is about escaping, not encoding. Escaping is using &lt;, etc. to allow the parser to distinguish between "this is an XML command" and "this is some text". Encoding is the stuff you specify in the XML header (UTF-8, ISO-8859-1, etc).
First of all, like everyone else said, use an XML library. XML looks simple, but the encoding+escaping stuff is dark voodoo (which you'll notice as soon as you encounter umlauts and Japanese and other weird stuff like "fullwidth digits": &#xFF11; is 1). Keeping XML human-readable is a Sisyphean task.
I suggest never trying to be clever about text encoding and escaping in XML. But don't let that stop you from trying; just remember when it bites you (and it will).
That said, if you use only UTF-8, to make things more readable you can consider this strategy:
If the text does contain '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text doesn't contain these three characters, don't wrap it.
I'm using this in an SQL editor and it allows the developers to cut&paste SQL from a third party SQL tool into the XML without worrying about escaping. This works because the SQL can't contain umlauts in our case, so I'm safe.
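A small sketch of that strategy (a hypothetical helper; note it does not handle text that itself contains "]]>", which would have to be split across two CDATA sections):
static String wrapForXml(String text) {
    if (text.contains("<") || text.contains(">") || text.contains("&")) {
        return "<![CDATA[" + text + "]]>";
    }
    return text; // nothing to escape, keep it readable
}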
If you are looking for a library to get the job done, try:
Guava 26.0 documented here
return XmlEscapers.xmlContentEscaper().escape(text);
Note: There is also an xmlAttributeEscaper()
Apache Commons Text 1.4 documented here
StringEscapeUtils.escapeXml11(text)
Note: There is also an escapeXml10() method
To escape XML characters, the easiest way is to use the Apache Commons Lang project, JAR downloadable from: http://commons.apache.org/lang/
The class is this: org.apache.commons.lang3.StringEscapeUtils;
It has a method named "escapeXml", that will return an appropriately escaped String.
You could use the Enterprise Security API (ESAPI) library, which provides methods like encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.
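For instance (assuming a default ESAPI configuration is available on the classpath):
import org.owasp.esapi.ESAPI;

String safe = ESAPI.encoder().encodeForXML("a < b & \"c\"");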
Use JAXP and forget about text handling it will be done for you automatically.
Here's an easy solution and it's great for encoding accented characters too!
String in = "Hi Lârry & Môe!";
StringBuilder out = new StringBuilder();
for(int i = 0; i < in.length(); i++) {
char c = in.charAt(i);
if(c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
out.append("&#" + (int) c + ";");
} else {
out.append(c);
}
}
System.out.printf("%s%n", out);
Outputs
Hi Lârry & Môe!
Try to encode the XML using Apache XML serializer
// Serialize DOM
OutputFormat format = new OutputFormat(doc);
// as a String
StringWriter stringOut = new StringWriter();
XMLSerializer serial = new XMLSerializer(stringOut, format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());
Just replace
& with &amp;
And for other characters:
> with &gt;
< with &lt;
" with &quot;
' with &apos;
Here's what I found after searching everywhere looking for a solution:
Get the Jsoup library:
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
Then:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser
String xml = '''<?xml version = "1.0"?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">
<SOAP-ENV:Body xmlns:m = "http://www.example.org/quotations">
<m:GetQuotation>
<m:QuotationsName> MiscroSoft#G>>gle.com </m:QuotationsName>
</m:GetQuotation>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>'''
Document doc = Jsoup.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)
println doc.toString()
Hope this helps someone
I have created my wrapper here; hope it helps a lot. Click here. You can modify it depending on your requirements.
