Validate and parse xml using woodstox with local dtd - java

I have seen multiple questions that relate to parsing xmls using woodstox and JAXB to unmarshal using the XMLStreamReader and validating against schemas.Reading though them hasn't helped. What I need is to validate an incoming xml with a local DTD and parse the entire contents into an object representation. The incoming xml can have a DOCTYPE which includes a DTD. This needs to be skipped and a local DTD needs to be used instead. The implementation should be very quick. Expected < 1ms to do the validation and parsing. I could manage to parse alone using the following in 5ms. Incorporating validation doesn't work with setting the schema (commented lines of code)
xmlif = XMLInputFactory2.newInstance();
xmlif.setProperty(XMLInputFactory2.SUPPORT_DTD, false);
JAXBContext ucontext;
ucontext = JAXBContext.newInstance(XMLOuterElementClass.class);
unmarshaller = ucontext.createUnmarshaller();
/*SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.XML_DTD_NS_URI);
Schema schema = sf.newSchema(new File("c:/resources/schma.dtd"));
unmarshaller.setSchema(schema);*/
XMLStreamReader xsr = xmlif
.createXMLStreamReader(new StringReader(xml));
//xsr = new StreamReaderDelegate(xsr);
long start = System.currentTimeMillis();
try {
while (xsr.hasNext()) {
if (xsr.isStartElement()
&& xsr.getLocalName() == "XMLOuterElementClass") {
break;
}
xsr.next();
}
JAXBElement<XMLOuterElementClass> jb = unmarshaller.unmarshal(xsr,
XMLOuterElementClass.class);
System.out.println("Total time taken in ms :" + (end - start));
} finally {
xsr.close();
}

There are multiple ways to do it; and the best way to get an answer with more depth is to ask this on Woodstox user list (see http://xircles.codehaus.org/projects/woodstox/lists).
But one thing to note is that JAXB knows nothing about Stax2 (Woodstox/Aalto extension over basic Stax), so you need to access it via Stax2 API, not JAXB. So, to enable "external" validation, you need to call:
xmlStreamReader2.validateAgainst(schemaFromDTD);
and you can do this right after constructing stream reader (needs to cast to XMLStreamReader2, or at least to Validatable).
Note that you can validate when reading OR writing, both work similarly (in latter case you enable it via XMLStreamWriter).
Another possibility is to define XMLResolver property (see XMLInputFactory.RESOLVER).
It gets called when trying to read an external dtd, that is, when DOCTYPE contains reference to an external file. Custom XMLResolver can then redirect this read to use some other source.
Note that the first approach (one you started with) is likely more efficient as it only needs to read and parse Schema once, assuming you read it once and reuse afterwards.
Validation itself should be fast, and if parsing takes 4 milliseconds, should not take more than 1 millisecond; especially if you include JAXB processing in 4 milliseconds (that's technically data-binding, above lower level parsing).

Related

how to append the xml string to existing xml file using xstream

I am using xstream to marshal / unmarshal between java object to / from xml, one question is, is there a right solution to solve my problem (using xstream or other advanced method, instead of pure java API).
The existing XML file can grow quite big (more than 200 mb, for example), and I like to append the new xml to this existing XML file, but without unmarshal the existing XML file first, simply append it to the end (before the root element).
Please advice, thanks.
You could load the first XML as org.w3c.dom.Document and import the second one as org.w3c.dom.Element:
Element nodeToImport = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse( secondXmlFile ).getDocumentElement();
dom.importNode( nodeToImport, true );
...
There is a similar example here, but with a new node around the root one: Add an element around root element of given XML file that is stored in org.w3c.dom.Document

Handling character entities in Java using JDOM : how to?

I've to convert a xml file to a sgml file.
I'm using Java 1.6.0.31. and jDOM 2.0.5
I do not own the sgml's DTD.
The DTDs declare lots of character entities ( like γ , ω... but i'm not allowed to use the γ entity form)
I do own the xml ( I mean I'm able to edit the xsd and do whatever I want with this part)
The XML's xsd do not declare these entities, but I'm using a xml editor that allow inserting these entities
My problem is when I try to convert a xml containing these entitites I get a "&entities; referenced but not declared" exception message.
The code is :
File sourceFile = new File(path);
if (sourceFile.exists()) {
DocumentBuilderFactory factory DocumentBuilderFactory.newInstance();
factory.setExpandEntityReferences(false);
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setIgnoringElementContentWhitespace(false);
DOMBuilder builder = new DOMBuilder();
this.xmlDocument = builder.build(factory.newDocumentBuilder().parse(sourceFile));
The factory.newDocumentBuilder().parse() is the exception thrower (Obviously).
I've been looking for answers, but I'm not good enough with JDOM to decide what I should do, so my question is : What is the safest thing to do to allow entities resolve in this case ?
Should I create a custom EntityResolver that will do the job ?
Should I force the inputed xml to have γ format entities then replace the numeric value by "full-text" value ?
Thanks for your help !
EDIT: Replacing & so you can see the code, not the entities :/
Jeez again,
I've ended doing something very ugly :
I've inserted all the ENTITIES I needed in the document internal subset using a filecontent.replaceFirst("<!DOCTYPE X \\[", "<!DOCTYPE X [" + getEntityFile());
and
function getEntityFile() {
return FileUtils.readFileToString(f);
}
where f is the DTD file containing all the char entities I'am allowed to use ( copied from the SGML DTD).. So I can avoid the "entites referenced but not declared". Then these entities were replaced ( Yeah, I haven't found a way to not replace internal entities using jDOM2 => If someone have an idea, I'll bring the beer)
And at the end, when I'm outputing the SGML file, I replace the value by there entities reference ...
I'm ashamed, but for now, it work ...

Specifying DTD to be used by DocumentBuilders for XML parsing?

I am currently writing a tool, using Java 1.6, that brings together a number of XML files. All of the files validate to the DocBook 4.5 DTD (I have checked this using xmllint and specifying the DocBook 4.5 DTD as the --dtdvalid parameter), but not all of them include the DOCTYPE declaration.
I load each XML file into the DOM to perform the required manipulation like so:
private Document fileToDocument( File input ) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setIgnoringElementContentWhitespace(false);
factory.setIgnoringComments(false);
factory.setValidating(false);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse( input );
}
For the most part this has worked quite well, I can use he returned object to navigate the tree and perform the required manipulations and then write the document back out. Where I am encountering problems is with files which:
Do not include the DOCTYPE declaration, and
Do include entities defined in the DTD (for example — / —).
Where this is the case an exception is thrown from the builder.parse(...) call with the message:
[Fatal Error] :5:15: The entity "mdash" was referenced, but not declared.
Fair enough, it isn't declared. What I would ideally do in this instance is set the DocumentBuilderFactory to always use the DocBook 4.5 DTD regardless of whether one is specified in the file.
I did try validation using the DocBook 4.5 schema but found that this produced a number of unrelated errors with the XML. It seems like the schema might not be functionally equivalent to the DTD, at least for this version of the DocBook specification.
The other option I can think of is to read the file in, try and detect whether a doctype was set or not, and then set one if none was found prior to actually parsing the XML into the DOM.
So, my question is, is there a smarter way that I have not seen to tell the parser to use a specific DTD or ensure that parsing proceeds despite the entities not resolving (not just the &emdash; example but any entities in the XML - there are a large number of potentials)?
Could using an EntityResolver2 and implementing EntityResolver2.getExternalSubset() help?
... This method can also be used with documents that have no DOCTYPE declaration. When the root element is encountered, but no DOCTYPE declaration has been seen, this method is invoked. If it returns a value for the external subset, that root element is declared to be the root element, giving the effect of splicing a DOCTYPE declaration at the end the prolog of a document that could not otherwise be valid. ...

java.net.ConnectException : Validating Xml against XSD : local machine

I need to validate an XML against a local XSD and I do not have a internet connection on the target machine (on which this process runs). The code look like below :
SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
File schemaLocation = new File(xsd);
Schema schema = factory.newSchema(schemaLocation);
Validator validator = schema.newValidator();
Source source = new StreamSource(new BufferedInputStream(new FileInputStream(new File(xml))));
validator.validate(source);
I always get a java.net.ConnectException when validate() is called.
Can you please let me know what is not being done correctly ?
Many Thanks.
Abhishek
Agreed with Mads' comment - there are likely many references here that will attempt outgoing connections to the Internet, and you will need to download local copies for them. However, I'd advise against changing references within the XML or schema files, etc. - but instead, provide an EntityResolver to return the contents of your local copies instead of connecting out to the Internet. (I previously wrote a little bit about this at http://blogger.ziesemer.com/2009/01/xml-and-xslt-tips-and-tricks-for-java.html#InputValidation.)
However, in your case, since you're using a Validator instead of Validator.setResourceResolver(...) - and pass-in a LSResourceResolver, before calling validate.

XML to be validated against multiple xsd schemas

I'm writing the xsd and the code to validate, so I have great control here.
I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate:
<foo>
<name>Harold</name>
<bar>Alpha</bar>
<baz>Mercury</baz>
<!-- ... more general info that applies to all foos ... -->
<bar-config>
<!-- the content here is specific to the bar named "Alpha" -->
</bar-config>
<baz-config>
<!-- the content here is specific to the baz named "Mercury" -->
</baz>
</foo>
In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config.
The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be.
Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor.
Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas.
Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.
I finally figured this out.
First of all, in the foo schema, the bar-config and baz-config elements have a type which includes an any element, like this:
<sequence>
<any minOccurs="0" maxOccurs="1"
processContents="lax" namespace="##any" />
</sequence>
In the xml, then, you must specify the proper namespace using the xmlns attribute on the child element of bar-config or baz-config, like this:
<bar-config>
<config xmlns="http://www.example.org/bar/Alpha">
... config xml here ...
</config>
</bar-config>
Then, your XML schema file for bar Alpha will have a target namespace of http://www.example.org/bar/Alpha and will define the root element config.
If your XML file has namespace declarations and schema locations for both of the schema files, this is sufficient for the editor to do all of the validating (at least good enough for Eclipse).
So far, we have satisfied the requirement that the xml author may write the xml in such a way that it is validated in the editor.
Now, we need the consumer to be able to validate. In my case, I'm using Java.
If by some chance, you know the schema files that you will need to use to validate ahead of time, then you simply create a single Schema object and validate as usual, like this:
Schema schema = factory().newSchema(new Source[] {
new StreamSource(stream("foo.xsd")),
new StreamSource(stream("Alpha.xsd")),
new StreamSource(stream("Mercury.xsd")),
});
In this case, however, we don't know which xsd files to use until we have parsed the main document. So, the general procedure is to:
Validate the xml using only the main (foo) schema
Determine the schema to use to validate the portion of the document
Find the node that is the root of the portion to validate using a separate schema
Import that node into a brand new document
Validate the brand new document using the other schema file
Caveat: it appears that the document must be built namespace-aware in order for this to work.
Here's some code (this was ripped from various places of my code, so there might be some errors introduced by the copy-and-paste):
// Contains the filename of the xml file
String filename;
// Load the xml data using a namespace-aware builder (the method
// 'stream' simply opens an input stream on a file)
Document document;
DocumentBuilderFactory docBuilderFactory =
DocumentBuilderFactory.newInstance();
docBuilderFactory.setNamespaceAware(true);
document = docBuilderFactory.newDocumentBuilder().parse(stream(filename));
// Create the schema factory
SchemaFactory sFactory = SchemaFactory.newInstance(
XMLConstants.W3C_XML_SCHEMA_NS_URI);
// Load the main schema
Schema schema = sFactory.newSchema(
new StreamSource(stream("foo.xsd")));
// Validate using main schema
schema.newValidator().validate(new DOMSource(document));
// Get the node that is the root for the portion you want to validate
// using another schema
Node node= getSpecialNode(document);
// Build a Document from that node
Document subDocument = docBuilderFactory.newDocumentBuilder().newDocument();
subDocument.appendChild(subDocument.importNode(node, true));
// Determine the schema to use using your own logic
Schema subSchema = parseAndDetermineSchema(document);
// Validate using other schema
subSchema.newValidator().validate(new DOMSource(subDocument));
Take a look at NVDL (Namespace-based Validation Dispatching Language) - http://www.nvdl.org/
It is designed to do what you want to do (validate parts of an XML document that have their own namespaces and schemas).
There is a tutorial here - http://www.dpawson.co.uk/nvdl/ - and a Java implementation here - http://jnvdl.sourceforge.net/
Hope that helps!
Kevin
You need to define a target namespace for each separately-validated portions of the instance document. Then you define a master schema that uses <xsd:include> to reference the schema documents for these components.
The limitation with this approach is that you can't let the individual components define the schemas that should be used to validate them. But it's a bad idea in general to let a document tell you how to validate it (ie, validation should something that your application controls).
You can also use a "resource resolver" to allow "xml authors" to specify their own schema file, at least to some extent, ex: https://stackoverflow.com/a/41225329/32453 at the end of the day, you want a fully compliant xml file that can be validatable with normal tools, anyway :)

Categories