I am creating an XML file using Java and then reading the data back from it. The data I add as a text node contains <p> in several places, but as soon as I try to read it, the string terminates on encountering <. What am I doing wrong? Would using an escape sequence help?
You need to use entities for < and > as they are reserved characters. Use &lt; and &gt; to replace the angle brackets.
XML has other reserved characters as well, such as &, which is written as &amp;.
You could use StringEscapeUtils.escapeXml() from Apache Commons Lang to do escaping. However you shouldn't need to deal with escaping if you're using any library to read and write your XML. If you're constructing the XML entirely with your own strings then you should reconsider that approach.
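To illustrate the point about letting a library do the escaping, here is a minimal sketch that uses only the DOM and Transformer classes shipped with the JDK. The class name TextNodeRoundTrip and the element name note are made up for the example; the original poster's code is not shown, so this is not a reconstruction of it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: writes a string as a text node and reads it back.
class TextNodeRoundTrip {

    // Builds <note>text</note>; the serializer escapes reserved characters.
    static String write(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement("note"); // element name is an assumption
            root.setTextContent(text);                // raw text, no manual escaping
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked XML exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML and returns the text of the root element, unescaped.
    static String read(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Intro <p>first paragraph</p> & more";
        String xml = write(original);
        System.out.println(xml);                        // the < and & appear as entities
        System.out.println(read(xml).equals(original)); // the text survives the round trip
    }
}
```

The text is passed to setTextContent unescaped, and getTextContent returns it unescaped, so the application code never handles entities itself.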
I want to escape special characters in XML input.
I tried StringEscapeUtils.escapeXml10(xmlString), but it ends up escaping the XML tags (< and >) as well.
For example:
<Company>Test & Test</Company>
should be normalized to
<Company>Test&amp;Test</Company>
not
&lt;Company&gt;Test&amp;Test&lt;/Company&gt;
You're basically asking how to automatically convert invalid XML to valid XML. That's not a tractable problem, in the general case (imagine for example that you had an embedded < in the actual data).
The correct solution to this problem is to identify why you're starting with invalid XML, and fix that issue.
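If fixing the source is not possible, a narrow heuristic can at least repair bare ampersands, which is the case in the question. The sketch below is an assumption-laden workaround, not a general fix: the class and method names are invented, and it does nothing about an embedded < in the data, which is exactly the case that makes the general problem intractable.

```java
import java.util.regex.Pattern;

// Hypothetical helper: escapes only ampersands that do not already
// start a named, decimal, or hexadecimal reference.
class BareAmpersands {

    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // The tags are left alone; only the stray '&' is rewritten.
        System.out.println(escapeBareAmpersands("<Company>Test & Test</Company>"));
        // Text that is already escaped is not escaped a second time.
        System.out.println(escapeBareAmpersands("<Company>Test &amp; Test</Company>"));
    }
}
```

Because the pattern cannot tell an intended reference from text that merely looks like one (for example a literal "&copy;" in the data), the result should still be validated by a real parser.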
I'm using StringBuilder in Java to build an HTML page.
I want to know how to escape all quotes (") without placing a "\" every time.
For example, every time I append a string like this:
StringBuilder a = new StringBuilder();
a.append("<div id = \"Name\" ...>");
I want to write directly:
a.append("<div id = "Name" ..>");
Thanks.
Short answer: There is no way around this in Java.
Long answer: Java does not have multiple ways to enclose Strings. You always do it with double quotes, so if you want to have double quotes in your String you have to escape them.
But if they really annoy you, you can apply some trickery:
1) Put your Strings in a text file and read them from there.
2) Use a different character instead of the quote character and use replace to put in the proper quotes. Of course your replacement character must not appear anywhere else in the string.
3) Write the code in question in a different programming language like Groovy, which has different ways to delimit Strings.
4) Since you seem to generate HTML: use a proper templating engine, which really is option 1 on steroids.
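The substitute-character trick from the list above can be sketched as follows. The helper name q and the choice of the single quote as the stand-in character are assumptions for the example; any character that never occurs in the markup would do.

```java
// Hypothetical helper for the substitute-character trick.
class QuoteSubstitution {

    // Turns every single quote into a double quote. This only works if the
    // text never needs a literal single quote (an assumption of this sketch).
    static String q(String s) {
        return s.replace('\'', '"');
    }

    public static void main(String[] args) {
        StringBuilder a = new StringBuilder();
        a.append(q("<div id='Name' class='box'>"));
        a.append(q("</div>"));
        System.out.println(a); // the attributes now use double quotes
    }
}
```

On Java 15 and later, text blocks (delimited by three double quotes) also accept unescaped double quotes inside them, which may make the substitution unnecessary.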
When building an HTML template, the easiest solution is to use a text file.
You can do this as
a simple text file where you replace() tags with code you want to alter
use a properties file for the sections of text to inline.
use a library which has a fluent API for generating HTML
use Velocity to perform the substitution for you.
use one of the other many web page formats like JSP.
However, there is no way to avoid escaping " in Java code. The only alternative is to use another character like ” (Alt-Graphic-B), which you replace at the end.
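The first option in the list above, a text file whose tags are filled in with replace(), might look like this minimal sketch. The {{name}} placeholder syntax and the class and method names are assumptions; the demo writes its own temporary file so that it runs on its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;

// Hypothetical helper: fills {{name}} placeholders in a template file.
class TinyTemplate {

    // Reads the whole template; quotes in the file need no Java escaping.
    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Replaces each {{key}} with its value. Values are inserted verbatim,
    // so untrusted values should be HTML-escaped before they get here.
    static String render(String template, Map<String, String> values) {
        String out = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a template that would normally ship with the application.
        Path file = Files.createTempFile("page", ".html");
        Files.write(file, "<h1>{{title}}</h1>".getBytes(StandardCharsets.UTF_8));

        String html = render(load(file), Collections.singletonMap("title", "Hello"));
        System.out.println(html);
        Files.delete(file);
    }
}
```

A templating engine does the same job with loops, conditionals, and automatic escaping, which is why it is usually the better choice once the page grows.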
You can't, which is only one of the reasons it's a bad idea to fill a StringBuilder with HTML code by hand.
Other languages support this, but it is not possible in Java.
With CoffeeScript, for example, you can write:
html = """
<div id="Name" > ... </div>
"""
There's no proper way to do it, but you might be able to put a rarely used substitute character (a tilde or something) in your String and then call .replace() on it.
Ideally, you should be loading the data from a file if you want the raw string.
I am loading XML into a Marklogic database using Java and Marklogic's XCC API. Before I do so, I use Apache Commons to escape the elements' contents (StringEscapeUtils.escapeXml). Upon loading the contents, though, I error out due to a curly brace character in the contents. escapeXml doesn't handle the curly brace. My questions are:
1) Is that a Marklogic specific issue (maybe with XCC) or is it an issue with XML in general?
2) Are there other characters that could also cause a problem (i.e. not escaped by the escapeXml routine)?
3) Is there a different routine that could be used to avoid this and any future undesired characters?
You should not escape contents when using XCC; it escapes them itself, so you would be double-escaping. However, curly braces are generally not something XML complains about. Perhaps you are using the Invoke methods instead of the Insert methods
(i.e. XCC would then try to interpret your content as XQuery).
Could you provide a sample of your content and a code snippet?
I read elements with CDATA sections from an RSS feed, which I need to convert to valid XML. The content in the CDATA section is mostly valid XHTML, but sometimes characters like the ampersand appear in attributes (URLs).
I can use .replaceAll("&", "&amp;") to solve this, but thinking a bit ahead, other invalid characters may show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS...?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output.... so, what are you doing to have broken output?
Rolf
We have a Java application that pulls data from SAP, parses it and renders it to the users.
The data is pulled using JCO connector.
Recently we were thrown an exception:
org.xml.sax.SAXParseException: Character reference "�" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions here are :
Is there any existing (open source) utility that does this job of replacing illegal characters in XML?
Or if I had to write such a utility, how should I handle them?
Why is the above exception thrown?
Thank You.
From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
While replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string returned by the toXML() call, other characters can be harder to replace ('<' and '>', for example).
regards
Guillaume
It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character code, �, you might be able to get away with a replace all on it with the empty string:
String goodXml = badXml.replaceAll("", "");
I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.
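The whitespace replacement suggested above could be sketched like this. It assumes the offending input contains numeric character references to code points that XML 1.0 does not allow (the Char production permits only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF), which is why the parser rejects them even when written as references. The class and method names are invented, and nothing here is specific to JCO.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parse filter for invalid numeric character references.
class InvalidCharRefs {

    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every reference to a disallowed code point with one space
    // and leaves valid references untouched.
    static String replaceInvalidRefs(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, treat as invalid
            }
            String replacement = isXml10Char(cp) ? m.group() : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceInvalidRefs("<v>a&#0;b&#65;c&#x1F;d</v>"));
    }
}
```

The filter works on the raw string, so it would also rewrite matching text inside CDATA sections, where such a sequence is literal text rather than a reference; whether that matters depends on the data.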
I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.
You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on Wikipedia.