I am loading XML into a Marklogic database using Java and Marklogic's XCC API. Before I do so, I use Apache Commons to escape the elements' contents (StringEscapeUtils.escapeXml). Upon loading the contents, though, I error out due to a curly brace character in the contents. escapeXml doesn't handle the curly brace. My questions are:
1) Is that a Marklogic specific issue (maybe with XCC) or is it an issue with XML in general?
2) Are there other characters that could also cause a problem (i.e. not escaped by the escapeXml routine)?
3) Is there a different routine that could be used to avoid this and any future undesired characters?
You should not escape content when using XCC; it escapes content itself, so you would be double-escaping. However, curly braces are generally not something XML complains about. Perhaps you are using the Invoke methods instead of the Insert methods
(i.e. XCC would then try to interpret your content as XQuery).
Could you provide a sample of your content and a code snippet?
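To check the point that curly braces are not special in XML itself, here is a minimal sketch that parses braces in element content with the JDK's built-in parser. It does not touch XCC or MarkLogic; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of XCC: shows that '{' and '}' need no escaping in XML.
final class CurlyBraceCheck {

    /** Parses the XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Braces pass through the parser unchanged.
        System.out.println(rootText("<note>if (x) { return y; }</note>"));
    }
}
```

If this parses but the load still fails, that would point at how the content is handed to the server (for example, being evaluated as a query) rather than at the XML.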
I read elements with CDATA sections from an RSS feed, which I need to convert to valid XML. The content in the CDATA section is mostly valid XHTML, but sometimes characters like ampersands appear in attributes (URLs).
I can use .replaceAll("&", "&amp;") to solve this, but thinking a bit ahead, other invalid characters may show up in attributes or text.
The CMS to which I'm importing the element, won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the jdom library to manipulate the xml after the import.
Edit: I've checked out apache's StringEscapeUtils, but this is escaping the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM, it will automatically and correctly escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS...?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output.... so, what are you doing to have broken output?
Rolf
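For the narrower problem raised in the question (stray ampersands inside otherwise usable markup), a blanket replaceAll on "&" would also rewrite entities that are already escaped. The sketch below escapes only ampersands that do not start something shaped like an entity or character reference. It is a heuristic, not a parser, and the class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper: escapes '&' only when it does not begin an entity-like token.
final class BareAmpersands {

    // '&' NOT followed by a name, decimal reference, or hex reference ending in ';'.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // The URL's '&' is escaped; the existing '&amp;' is left alone.
        System.out.println(escapeBareAmpersands("<a href=\"x?a=1&b=2\">Q&amp;A</a>"));
    }
}
```

Note that this keeps anything that merely looks like a named entity (such as &nbsp;), which an XML parser may still reject if the entity is not declared.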
I've been using Apache's StringEscapeUtils for HTML entities, but if you want to escape HTML attribute values, is there a standard way to do this? I guess that using the escapeHtml function won't cut it, since otherwise why would the Owasp Encoder interface have two different methods to cope with this?
Does anyone know what is involved in escaping HTML attributes vs. entities and what to do about attribute encoding in the case that you don't have the Owasp library to hand?
It looks like this is Rule #2 of the Owasp XSS Prevention Cheat Sheet. Note the bit where it says:
Properly quoted attributes can only be escaped with the corresponding
quote
Therefore, I guess that as long as the attributes are correctly bounded with double or single quotes and you escape these (i.e. double quote (") becomes &quot; and single quote (') becomes &#x27; (or &#39;)), you should be OK. Note that Apache's StringEscapeUtils.escapeHtml will be insufficient for this task, since it does not escape the single quote ('); you should use String's replaceAll method to do this.
Otherwise, if the attribute is written: <div attr=some_value> then you need to follow the recommendation on that page and..
escape all characters with ASCII values less than 256 with the &#xHH;
format (or a named entity if available) to prevent switching out of
the attribute
Not sure if there is a non-Owasp standard implementation of this, though. However, I guess it's good practice not to write attributes in this manner anyway!
Note that this is only valid when you are putting in standard attribute values; if the attribute is an href or some JavaScript handler, then it's a different story. For examples of possible XSS scripting attacks that can occur from unsafe code inside event handler attributes see: http://ha.ckers.org/xss.html.
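For the quoted-attribute case described above, a hand-rolled escaper might look like the following. This is a minimal sketch, not the Owasp implementation: the class and method names are hypothetical, and it assumes the value will be placed inside a quoted, non-URL, non-script attribute.

```java
// Hypothetical helper for values that end up inside a QUOTED attribute.
final class AttributeEscaper {

    /** Escapes the characters that could terminate or corrupt a quoted attribute value. */
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break; // numeric form works in HTML and XML
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String title = "Tom's \"special\" <offer>";
        System.out.println("<div title=\"" + escapeAttribute(title) + "\">");
    }
}
```

It does nothing for unquoted attributes, href values, or event handlers, which need the stricter treatment the cheat sheet describes.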
I get some malformed xml text input like:
"<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"
I want to clean the input so as to get:
"<Tag>something</Tag> 8 &gt; 3, 2 &lt; 3, ... <Tag>something</Tag>"
That is, escape special symbols like < and > and yet keep the valid tags ("<Tag>something</Tag>"; note, with the same case).
Do you know of any Java library to do this? Probably an XML/HTML parser? (Though I don't really need a parser, simply a "clean" procedure.)
JTidy is "HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML"
But it can also be used with XML. Check the documentation. It's incredibly smart; it will probably work for you.
I don't know of any library that would do that. Your input is malformed XML, and no proper XML parser would accept it. More important, it is not always possible to distinguish an actual tag from something that looks-like-a-tag-but-is-really-text. Therefore any heuristic-based attempt that you make to solve the problem will be fragile; i.e. it could occasionally produce malformed XML.
The best approach is to address the problem before you assemble the XML.
If you generate the XML by (for example) unparsing a DOM, then the unparser will take care of the escaping for you.
If you are generating the XML by templating or string bashing, then you need to call something like StringEscapeUtils.escapeXml on the relevant text chunks ... before the XML tags get incorporated.
If you leave the problem until after the "XML" has been assembled, it cannot be properly fixed.
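As a rough illustration of the "let the unparser do the escaping" route, the sketch below builds a DOM with the JDK's own APIs and serializes it with a Transformer. The class and method names are hypothetical, and the JDK classes stand in for whichever DOM library is actually in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo: raw text goes into the tree, the serializer escapes it on output.
final class DomEscapingDemo {

    static String serialize(String tagName, String rawText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement(tagName);
            root.setTextContent(rawText); // no manual escaping here
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("Tag", "8 > 3 & 2 < 3"));
    }
}
```

The text is stored unescaped in the tree, and the markup characters are only turned into entities when the document is written out, so there is no opportunity to double-escape.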
The best solution is to fix the program generating your text input. The easiest such fix would involve an escape utility like the other answers suggested. If that's not an option, I'd use a regular expression like
</?[a-zA-Z]+ */?>
to match the expected tags, and then split the string up into tags (which you want to pass through unchanged) and text between tags (against which you want to apply an escape method.)
I wouldn't count on an XML parser to be able to do it for you because what you're dealing with isn't valid XML. It is possible for the existing lack of escaping to produce ambiguities, so you might not be able to do a perfect job either.
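The split-and-escape idea above can be sketched as follows, using the same regular expression. This is a heuristic with the limits already mentioned: anything that happens to look like a simple tag is passed through, and it assumes the text between tags contains no entities that are already escaped (they would be escaped a second time). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps simple tags as-is and escapes everything between them.
final class TagPreservingEscaper {

    // Matches attribute-less tags such as <Tag>, </Tag> and <Tag/>.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z]+ */?>");

    private static String escapeText(String text) {
        // '&' must be replaced first so the entities added below are not re-escaped.
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(input.substring(last, m.start()))); // text before the tag
            out.append(m.group());                                    // the tag, unchanged
            last = m.end();
        }
        out.append(escapeText(input.substring(last)));                // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("<Tag>something</Tag> 8 > 3, 2 < 3, ... <Tag>something</Tag>"));
    }
}
```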
Check out Guava's XmlEscaper. It is in pre-release for version 11 but the code is available.
Apache Commons Lang contains a class named StringEscapeUtils which does exactly what you want! The method you'd want to use is escapeXml, I presume.
I am creating an XML file using Java and am then reading the data from it too. The data I am adding as a text node contains <p> in several places, but as soon as I try to read it, the string terminates on encountering <. What am I doing wrong? Would using an escape sequence be helpful?
You need to use entities for < and > as they are reserved characters. Use &lt; and &gt; to replace the angle brackets.
XML has other reserved characters like & also.
You could use StringEscapeUtils.escapeXml() from Apache Commons Lang to do escaping. However you shouldn't need to deal with escaping if you're using any library to read and write your XML. If you're constructing the XML entirely with your own strings then you should reconsider that approach.
We have a Java application that pulls data from SAP, parses it and renders it to the users.
The data is pulled using the JCO connector.
Recently we got an exception:
org.xml.sax.SAXParseException: Character reference "�" is an invalid XML character.
So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.
My questions here are :
Is there any existing (open source) utility that does this job of replacing illegal characters in XML?
Or if I had to write such a utility, how should I handle them?
Why is the above exception thrown?
Thank You.
From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.
While replacing '&' with '&amp;' can be done by a simple String.replaceAll(...) on the string returned by the toXML() call, other characters can be harder to replace ('<' and '>', for example).
regards
Guillaume
It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.
Alternatively, looking at the character code, �, you might be able to get away with a replace all on it with the empty string:
String goodXml = badXml.replaceAll("", "");
I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.
If I was in your situation, I'd either come up with a bespoke encoding, replacing the characters which are invalid in XML, and handling them as special cases in your parsing, or if possible, replace them with whitespace.
I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.
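If pre-processing before the parser is the route taken, one option is to drop everything outside the XML 1.0 character range, both raw characters and numeric character references that point at such characters. The sketch below is plain JDK code with hypothetical names; it knows nothing about JCO, and it works on the raw string, so it would also alter look-alike references inside CDATA sections or comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parse filter based on the XML 1.0 "Char" production.
final class XmlCharFilter {

    // Numeric character references; longer digit runs are deliberately left untouched.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]{1,8})|([0-9]{1,10}));");

    static boolean isValidXml10(long cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes raw characters that XML 1.0 does not allow. */
    static String stripInvalidChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().filter(XmlCharFilter::isValidXml10).forEach(out::appendCodePoint);
        return out.toString();
    }

    /** Removes numeric character references that resolve to disallowed characters. */
    static String stripInvalidReferences(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        Matcher m = CHAR_REF.matcher(xml);
        int last = 0;
        while (m.find()) {
            long cp = m.group(1) != null
                    ? Long.parseLong(m.group(1), 16)
                    : Long.parseLong(m.group(2));
            out.append(xml, last, m.start());
            if (isValidXml10(cp)) {
                out.append(m.group()); // keep references to legal characters
            }
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }
}
```

Calling stripInvalidReferences(stripInvalidChars(rawXml)) before parsing silently discards data, so replacing with a space or logging what was removed may be a better fit, as suggested above.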
You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:
http://commons.apache.org/lang/api-2.4/index.html
To read about how XML character references work, search for "numeric character references" on wikipedia.