Specifying DTD to be used by DocumentBuilders for XML parsing? - java

I am currently writing a tool, using Java 1.6, that brings together a number of XML files. All of the files validate to the DocBook 4.5 DTD (I have checked this using xmllint and specifying the DocBook 4.5 DTD as the --dtdvalid parameter), but not all of them include the DOCTYPE declaration.
I load each XML file into the DOM to perform the required manipulation like so:
private Document fileToDocument( File input ) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setIgnoringElementContentWhitespace(false);
factory.setIgnoringComments(false);
factory.setValidating(false);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse( input );
}
For the most part this has worked quite well, I can use he returned object to navigate the tree and perform the required manipulations and then write the document back out. Where I am encountering problems is with files which:
Do not include the DOCTYPE declaration, and
Do include entities defined in the DTD (for example — / —).
Where this is the case an exception is thrown from the builder.parse(...) call with the message:
[Fatal Error] :5:15: The entity "mdash" was referenced, but not declared.
Fair enough, it isn't declared. What I would ideally do in this instance is set the DocumentBuilderFactory to always use the DocBook 4.5 DTD regardless of whether one is specified in the file.
I did try validation using the DocBook 4.5 schema but found that this produced a number of unrelated errors with the XML. It seems like the schema might not be functionally equivalent to the DTD, at least for this version of the DocBook specification.
The other option I can think of is to read the file in, try and detect whether a doctype was set or not, and then set one if none was found prior to actually parsing the XML into the DOM.
So, my question is, is there a smarter way that I have not seen to tell the parser to use a specific DTD or ensure that parsing proceeds despite the entities not resolving (not just the &emdash; example but any entities in the XML - there are a large number of potentials)?

Could using an EntityResolver2 and implementing EntityResolver2.getExternalSubset() help?
... This method can also be used with documents that have no DOCTYPE declaration. When the root element is encountered, but no DOCTYPE declaration has been seen, this method is invoked. If it returns a value for the external subset, that root element is declared to be the root element, giving the effect of splicing a DOCTYPE declaration at the end the prolog of a document that could not otherwise be valid. ...

Related

JAXP parse error on valid XML

I am trying to run some XPath Queries on XML in Java and apparently the recommended way to do so is to construct a document first.
Here is the standard JAXP code sample that I was using:
import org.w3c.dom.Document;
import javax.xml.parsers.*;
final DocumentBuilder xmlParser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
final Document doc = xmlParser.parse(xmlFile);
I also tried the Saxon API, but got the same errors:
import net.sf.saxon.s9api.*;
final DocumentBuilder documentBuilder = new Processor(false).newDocumentBuilder();
final XdmNode xdm = documentBuilder.build(new File("out/data/blog.xml"));
Here is a minimal reconstructed example XML which the DocumentBuilder in JDK 1.8 can't parse:
<?xml version="1.1" encoding="UTF-8" ?>
<xml>
<![CDATA[Some example text with [funny highlight]]]>
</xml>
According to the spec, the square bracket ] just before the end of CDATA marker ]]> is perfectly legal, but the parser just exits with a stack trace and the message org.xml.sax.SAXParseException; XML document structures must start and end within the same entity..
On my original data file which contains a lot of CDATA sections, the message is instead org.xml.sax.SAXParseException; The element type "item" must be terminated by the matching end-tag "</item>". In both cases ´com.sun.org.apache.xerces´ shows up in the stacktrace a lot.
Form both observations it seems as if the parser just didn't end the CDATA section at ]]>.
EDIT: As it turned out, the example will pass when the <?xml ... ?> declaration is omitted. I hadn't checked that before posting here and added it just now.
Short answer: add Apache Xerces to the build path, it will automatically be loaded instead of the parser from the JDK and the XML will be parsed just fine! Copy-paste Gradle Dependency:
implementation "xerces:xercesImpl:2.11.0"
Some background: Apache Xerces is indeed the same parser which is also used in the JDK, but even though Xerces 2.11 dates from 2013 the JDK comes with a much older version. That really sucks!
As the Saxon team puts it:
Saxonica recommends use of the Xerces parser from Apache in preference to the version bundled in the JDK, which is known to have some serious bugs.
In case you wonder how simply putting Xerces on the classpath makes the problem disappear: even though the JDK and Saxon DocumentBuilders construct entirely different document types, they both use the same Standard Java Interfaces to call the parser and also the same mechanism to find and load the parser (or rather, the parser factory). In short, a java.util.ServiceLoader is called and looks into all the JARs in the classpath for properties files in META-INF/services and this is how the xercesJar announces that it does provide an XML parser. And good for us, the JDK's own implementation is superseded by anything found there.
After making this bad experience with JDK XML classes, I am even more motivated to refactor projects to use Saxon for XPath processing instead of the implementation of XPath in the JDK. The other reason is the technical advantage of XDM over DOM (same link as above).

Veracode XML External Entity Reference (XXE)

I've got the next finding in my veracode report:
Improper Restriction of XML External Entity Reference ('XXE') (CWE ID 611)
referring the next code bellow
...
DocumentBuilderFactory dbf=null;
DocumentBuilder db = null;
try {
dbf=DocumentBuilderFactory.newInstance();
dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
dbf.setExpandEntityReferences(false);
dbf.setXIncludeAware(false);
dbf.setValidating(false);
dbf.newDocumentBuilder();
InputStream stream = new ByteArrayInputStream(datosXml.getBytes());
Document doc = db.parse(stream, "");
...
I've been researching but I haven't found out a reason for this finding or a way of making it disappear.
Could you tell me how to do it?
Have you seen the OWASP guide about XXE?
You are not disabling the 3 features you should disable. Most importantly the first one:
dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
Background:
The XXE attack is constructed around XML language capabilities to define arbitrary entities using the external Data Type Definition (DTD) and the ability to read or execute files.
Below is an example of XML file containing DTD declaration that when processed may return output of local “/etc/passwd” file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test [
<!ELEMENT test ANY >
<!ENTITY xxe SYSTEM "file:///etc/passwd" >]>
Mitigation:
To avoid exploitation of XEE vulnerability the best approach is to disable the ability to load entities from external source.
Now the way to disable the DTDs will defer depending upon the language used (Java,C++, .NET) and the XML parser being used (DocumentBuilderFactory, SAXParserFactory, TransformerFactory to name a few considering the java language).
Below two official references provides the best information on how to achieve the same.
https://rules.sonarsource.com/java/RSPEC-2755
https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/XML_External_Entity_Prevention_Cheat_Sheet.md

Java XML Parsing: Avoid entity reference resolution

I am currently parsing XHTML documents with a DOM parser, like:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(MY_ENTITY_RESOLVER);
db.setErrorHandler(MY_ERROR_HANDLER);
...
final Document doc = db.parse(inputSource);
And my problem is that when my document contains an entity reference like, for example:
<p>€</p>
My parser creates a Text node for that content containing "€" instead of "€". This is, it is resolving the entity in the way it is supposed to do it (the XHTML 1.0 Strict DTD links to the ENTITIES Latin1 DTD, which in turn establishes the equivalence of "€" with "€").
The problem is, I don't want the parser to do such thing. I would like to keep the "€" text unmodified.
I've already tried with:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setExpandEntityReferences(false);
But:
I don't like this because I fear this might make some parser implementations not navigate from the XHTML 1.0 Strict DTD to the ENTITIES Latin1 DTD and therefore not consider "€" as a declared entity.
When I do this, it weirdly creates two nodes: a "pound" Entity node, and a Text node with the "€" symbol after it.
Any ideas? Is it possible to configure this in a DOM Parser without resorting to preprocessing the XHTML and substituting all "&" symbols for something other?...
Solutions could be for a DOM parser or also a SAX one, I wouldn't mind using SAX parsing and then creating my DOM using a transformation...
Also, I cannot switch to a non standard XML parsing libray. No jdom, no jsoup, no HtmlCleaner, etc.
Thanks a lot.
The approach I took was to replace any entities with a unique marker that is treated as plain text by Xerces. Once converted into a Document object, the markers are replaced with Entity Reference objects.
See the convertStringToDocument() function in http://sourceforge.net/p/commonclasses/code/14/tree/trunk/src/com/redhat/ecs/commonutils/XMLUtilities.java

Parse file containing XML Fragments in Java

I inherited an "XML" license file containing no root element, but rather two XML fragments (<XmlCreated> and <Product>) so when I try to parse the file, I (expectantly) get an error about a document that is not-well-formed.
I need to get both the XmlCreated and Product tags.
Sample XML file:
<?xml version="1.0"?>
<XmlCreated>May 11 2009</XmlCreated>
<!-- License Key file Attributes -->
<Product image ="LicenseKeyFile">
<!-- MyCompany -->
<Manufacturer ID="7f">
<SerialNumber>21072832521007</SerialNumber>
<ChassisId>72060034465DE1C3</ChassisId>
<RtspMaxUsers>500</RtspMaxUsers>
<MaxChannels>8</MaxChannels>
</Manufacturer>
</Product>
Here is the current code that I use to attempt to load the XML. It does not work, but I've used it before as a starting point for well-formed XML.
public static void main(String[] args) {
try {
File file = new File("C:\\path\\LicenseFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
} catch (Exception e) {
e.printStackTrace();
}
}
At the db.parse(file) line, I get the following Exception:
[Fatal Error] LicenseFile.xml:6:2: The markup in the document following the root element must be well-formed.
org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at com.mycompany.licensesigning.LicenseSigner.main(LicenseSigner.java:20)
How would I go about parsing this frustrating file?
If you know this document is always going to be non-well formed... make it so. Add a new dummy <root> tag after the <?xml...>and </root> after the last of the data.
You're going to need to create two separate Document objects by breaking the file up into smaller pieces and parsing those pieces individually (or alternatively reconstructing them into a larger document by adding a tag which encloses both of them).
If you can rely on the structure of the file it should be easy to read the file into a string and then search for substrings like <Product and </Product> and then use those markers to create a string you can pass into a document builder.
How about implementing a simple wrapper around InputStream that wraps the input from the file with a root-level tag, and using that as the input to DocumentBuilder.parse()?
If the expected input is small enough to load into memory, read into a string, wrap it with a dummy start/end tag and then use:
DocumentBuilder.parse(new InputSource(new StringReader(string)))
I'd probably create a SequenceInputStream where you sandwich the real stream with two ByteArrayInputStreams that return some dummy root start tag, and end tag.
Then i'd use use the parse method that takes a stream rather than a file name.
I agree with Jim Garrison to some extent, use an InputStream or StreamReader and wrap the input in the required tags, its a simple and easy method. Main problem i can forsee is you'll have to have some checks for valid and invalid formatting (if you want to be able to use the method for both valid and invalid data), if the formatting is invalid (because of root level tags missing) wrap the input with the tags, if its valid then don't wrap the input. If the input is invalid for some other reason, you can also alter the input to correct the formatting issues.
Also, its probably better to store the ipnut in a collection of strings (of some sort) rather than a string itself, this will mean that you wont have as much of a limit to your input size. Make each string one line from the file. You should end up with a logical and easy to follow structure which mwill make it easier to allow for corrections of other formatting issues in the future.
Hardest part about that is figuring out what has caused the invalid formatting. In your case just check for root level tags, if the tags exist and are formatted correctly, dont wrap, If not, wrap.

XML to be validated against multiple xsd schemas

I'm writing the xsd and the code to validate, so I have great control here.
I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate:
<foo>
<name>Harold</name>
<bar>Alpha</bar>
<baz>Mercury</baz>
<!-- ... more general info that applies to all foos ... -->
<bar-config>
<!-- the content here is specific to the bar named "Alpha" -->
</bar-config>
<baz-config>
<!-- the content here is specific to the baz named "Mercury" -->
</baz>
</foo>
In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config.
The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be.
Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor.
Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas.
Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.
I finally figured this out.
First of all, in the foo schema, the bar-config and baz-config elements have a type which includes an any element, like this:
<sequence>
<any minOccurs="0" maxOccurs="1"
processContents="lax" namespace="##any" />
</sequence>
In the xml, then, you must specify the proper namespace using the xmlns attribute on the child element of bar-config or baz-config, like this:
<bar-config>
<config xmlns="http://www.example.org/bar/Alpha">
... config xml here ...
</config>
</bar-config>
Then, your XML schema file for bar Alpha will have a target namespace of http://www.example.org/bar/Alpha and will define the root element config.
If your XML file has namespace declarations and schema locations for both of the schema files, this is sufficient for the editor to do all of the validating (at least good enough for Eclipse).
So far, we have satisfied the requirement that the xml author may write the xml in such a way that it is validated in the editor.
Now, we need the consumer to be able to validate. In my case, I'm using Java.
If by some chance, you know the schema files that you will need to use to validate ahead of time, then you simply create a single Schema object and validate as usual, like this:
Schema schema = factory().newSchema(new Source[] {
new StreamSource(stream("foo.xsd")),
new StreamSource(stream("Alpha.xsd")),
new StreamSource(stream("Mercury.xsd")),
});
In this case, however, we don't know which xsd files to use until we have parsed the main document. So, the general procedure is to:
Validate the xml using only the main (foo) schema
Determine the schema to use to validate the portion of the document
Find the node that is the root of the portion to validate using a separate schema
Import that node into a brand new document
Validate the brand new document using the other schema file
Caveat: it appears that the document must be built namespace-aware in order for this to work.
Here's some code (this was ripped from various places of my code, so there might be some errors introduced by the copy-and-paste):
// Contains the filename of the xml file
String filename;
// Load the xml data using a namespace-aware builder (the method
// 'stream' simply opens an input stream on a file)
Document document;
DocumentBuilderFactory docBuilderFactory =
DocumentBuilderFactory.newInstance();
docBuilderFactory.setNamespaceAware(true);
document = docBuilderFactory.newDocumentBuilder().parse(stream(filename));
// Create the schema factory
SchemaFactory sFactory = SchemaFactory.newInstance(
XMLConstants.W3C_XML_SCHEMA_NS_URI);
// Load the main schema
Schema schema = sFactory.newSchema(
new StreamSource(stream("foo.xsd")));
// Validate using main schema
schema.newValidator().validate(new DOMSource(document));
// Get the node that is the root for the portion you want to validate
// using another schema
Node node= getSpecialNode(document);
// Build a Document from that node
Document subDocument = docBuilderFactory.newDocumentBuilder().newDocument();
subDocument.appendChild(subDocument.importNode(node, true));
// Determine the schema to use using your own logic
Schema subSchema = parseAndDetermineSchema(document);
// Validate using other schema
subSchema.newValidator().validate(new DOMSource(subDocument));
Take a look at NVDL (Namespace-based Validation Dispatching Language) - http://www.nvdl.org/
It is designed to do what you want to do (validate parts of an XML document that have their own namespaces and schemas).
There is a tutorial here - http://www.dpawson.co.uk/nvdl/ - and a Java implementation here - http://jnvdl.sourceforge.net/
Hope that helps!
Kevin
You need to define a target namespace for each separately-validated portions of the instance document. Then you define a master schema that uses <xsd:include> to reference the schema documents for these components.
The limitation with this approach is that you can't let the individual components define the schemas that should be used to validate them. But it's a bad idea in general to let a document tell you how to validate it (ie, validation should something that your application controls).
You can also use a "resource resolver" to allow "xml authors" to specify their own schema file, at least to some extent, ex: https://stackoverflow.com/a/41225329/32453 at the end of the day, you want a fully compliant xml file that can be validatable with normal tools, anyway :)

Categories