XML to be validated against multiple xsd schemas - java

I'm writing the xsd and the code to validate, so I have great control here.
I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate:
<foo>
<name>Harold</name>
<bar>Alpha</bar>
<baz>Mercury</baz>
<!-- ... more general info that applies to all foos ... -->
<bar-config>
<!-- the content here is specific to the bar named "Alpha" -->
</bar-config>
<baz-config>
<!-- the content here is specific to the baz named "Mercury" -->
</baz-config>
</foo>
In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config.
The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be.
Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor.
Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas.
Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.

I finally figured this out.
First of all, in the foo schema, the bar-config and baz-config elements have a type which includes an any element, like this:
<sequence>
<any minOccurs="0" maxOccurs="1"
processContents="lax" namespace="##any" />
</sequence>
In the xml, then, you must specify the proper namespace using the xmlns attribute on the child element of bar-config or baz-config, like this:
<bar-config>
<config xmlns="http://www.example.org/bar/Alpha">
... config xml here ...
</config>
</bar-config>
Then, your XML schema file for bar Alpha will have a target namespace of http://www.example.org/bar/Alpha and will define the root element config.
If your XML file has namespace declarations and schema locations for both of the schema files, this is sufficient for the editor to do all of the validating (at least good enough for Eclipse).
So far, we have satisfied the requirement that the xml author may write the xml in such a way that it is validated in the editor.
Now, we need the consumer to be able to validate. In my case, I'm using Java.
If by some chance, you know the schema files that you will need to use to validate ahead of time, then you simply create a single Schema object and validate as usual, like this:
Schema schema = factory().newSchema(new Source[] {
new StreamSource(stream("foo.xsd")),
new StreamSource(stream("Alpha.xsd")),
new StreamSource(stream("Mercury.xsd")),
});
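As a self-contained illustration of the known-ahead-of-time case, here is a minimal sketch that compiles one Schema from in-memory strings. The two schema strings are heavily trimmed, hypothetical stand-ins for foo.xsd and Alpha.xsd; the size element, the class name, and the helper names are invented for the example. Because the wildcard is lax, the config element is only checked when a schema for its namespace is part of the compiled Schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class CombinedSchemaDemo {
    // Hypothetical stand-in for foo.xsd: bar-config accepts one lax wildcard child.
    static final String FOO_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='foo'><xs:complexType><xs:sequence>"
      + "<xs:element name='bar' type='xs:string'/>"
      + "<xs:element name='bar-config'><xs:complexType><xs:sequence>"
      + "<xs:any minOccurs='0' maxOccurs='1' processContents='lax' namespace='##any'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Hypothetical stand-in for Alpha.xsd: defines the root element config.
    static final String ALPHA_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='http://www.example.org/bar/Alpha'"
      + " elementFormDefault='qualified'>"
      + "<xs:element name='config'><xs:complexType><xs:sequence>"
      + "<xs:element name='size' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Compiles a single Schema object from any number of schema documents. */
    static Schema schemaOf(String... xsds) throws SAXException {
        Source[] sources = new Source[xsds.length];
        for (int i = 0; i < xsds.length; i++) {
            sources[i] = new StreamSource(new StringReader(xsds[i]));
        }
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(sources);
    }

    /** Returns true if the document validates; the default error handler throws on errors. */
    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    /** Builds a sample foo document whose Alpha config carries the given size. */
    static String fooWithSize(String size) {
        return "<foo><bar>Alpha</bar><bar-config>"
             + "<config xmlns='http://www.example.org/bar/Alpha'>"
             + "<size>" + size + "</size></config>"
             + "</bar-config></foo>";
    }

    public static void main(String[] args) throws SAXException {
        Schema fooOnly = schemaOf(FOO_XSD);
        Schema combined = schemaOf(FOO_XSD, ALPHA_XSD);
        System.out.println("combined, size=3:   " + isValid(combined, fooWithSize("3")));
        System.out.println("combined, size=big: " + isValid(combined, fooWithSize("big")));
        System.out.println("foo only, size=big: " + isValid(fooOnly, fooWithSize("big")));
    }
}
```

With only the foo schema loaded, a bad config slips through, which is why the second validation pass is needed when the sub-schema is not known up front.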
In this case, however, we don't know which xsd files to use until we have parsed the main document. So, the general procedure is to:
1. Validate the xml using only the main (foo) schema
2. Determine the schema to use to validate the portion of the document
3. Find the node that is the root of the portion to validate using a separate schema
4. Import that node into a brand new document
5. Validate the brand new document using the other schema file
Caveat: it appears that the document must be built namespace-aware in order for this to work.
Here's some code (this was ripped from various places of my code, so there might be some errors introduced by the copy-and-paste):
// Contains the filename of the xml file
String filename;
// Load the xml data using a namespace-aware builder (the method
// 'stream' simply opens an input stream on a file)
Document document;
DocumentBuilderFactory docBuilderFactory =
DocumentBuilderFactory.newInstance();
docBuilderFactory.setNamespaceAware(true);
document = docBuilderFactory.newDocumentBuilder().parse(stream(filename));
// Create the schema factory
SchemaFactory sFactory = SchemaFactory.newInstance(
XMLConstants.W3C_XML_SCHEMA_NS_URI);
// Load the main schema
Schema schema = sFactory.newSchema(
new StreamSource(stream("foo.xsd")));
// Validate using main schema
schema.newValidator().validate(new DOMSource(document));
// Get the node that is the root for the portion you want to validate
// using another schema
Node node = getSpecialNode(document);
// Build a Document from that node
Document subDocument = docBuilderFactory.newDocumentBuilder().newDocument();
subDocument.appendChild(subDocument.importNode(node, true));
// Determine the schema to use using your own logic
Schema subSchema = parseAndDetermineSchema(document);
// Validate using other schema
subSchema.newValidator().validate(new DOMSource(subDocument));
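For readers who want something they can run as-is, the same procedure can be sketched end to end with in-memory strings. Everything specific here is an assumption made for illustration: the trimmed schemas, the size element, the BAR_SCHEMAS registry that maps a bar value to its schema, and the class and method names. It needs Java 9 or later for Map.of.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class TwoStageValidation {
    // Hypothetical, trimmed main schema with a lax wildcard inside bar-config.
    static final String FOO_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='foo'><xs:complexType><xs:sequence>"
      + "<xs:element name='bar' type='xs:string'/>"
      + "<xs:element name='bar-config'><xs:complexType><xs:sequence>"
      + "<xs:any minOccurs='0' processContents='lax' namespace='##any'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Hypothetical schema for the bar named "Alpha".
    static final String ALPHA_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='http://www.example.org/bar/Alpha'"
      + " elementFormDefault='qualified'>"
      + "<xs:element name='config'><xs:complexType><xs:sequence>"
      + "<xs:element name='size' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Hypothetical registry: bar value -> schema text.
    static final Map<String, String> BAR_SCHEMAS = Map.of("Alpha", ALPHA_XSD);

    static Schema compile(String xsd) throws SAXException {
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
    }

    /** Throws SAXException if either validation pass fails. */
    static void validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for schema validation of a DOM
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Step 1: validate the whole document against the main schema only.
        compile(FOO_XSD).newValidator().validate(new DOMSource(doc));

        // Step 2: pick the sub-schema from the value of <bar>.
        String bar = doc.getElementsByTagName("bar").item(0).getTextContent().trim();
        String xsd = BAR_SCHEMAS.get(bar);
        if (xsd == null) {
            throw new IllegalArgumentException("No schema registered for bar '" + bar + "'");
        }

        // Step 3: find the first element child of <bar-config>.
        Node child = doc.getElementsByTagName("bar-config").item(0).getFirstChild();
        while (child != null && child.getNodeType() != Node.ELEMENT_NODE) {
            child = child.getNextSibling();
        }
        if (child == null) {
            return; // empty config: nothing further to validate
        }

        // Steps 4 and 5: copy the subtree into its own document and validate it.
        Document sub = builder.newDocument();
        sub.appendChild(sub.importNode(child, true));
        compile(xsd).newValidator().validate(new DOMSource(sub));
    }

    static String sample(String size) {
        return "<foo><bar>Alpha</bar><bar-config>"
             + "<config xmlns='http://www.example.org/bar/Alpha'>"
             + "<size>" + size + "</size></config></bar-config></foo>";
    }

    public static void main(String[] args) throws Exception {
        validate(sample("3"));
        System.out.println("sample document passed both validation passes");
    }
}
```

In a real application the compiled Schema objects would normally be cached rather than rebuilt on every call.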

Take a look at NVDL (Namespace-based Validation Dispatching Language) - http://www.nvdl.org/
It is designed to do what you want to do (validate parts of an XML document that have their own namespaces and schemas).
There is a tutorial here - http://www.dpawson.co.uk/nvdl/ - and a Java implementation here - http://jnvdl.sourceforge.net/
Hope that helps!
Kevin

You need to define a target namespace for each separately validated portion of the instance document. Then you define a master schema that uses <xsd:import> to reference the schema documents for these components (<xsd:include> only works for schema documents that share the master schema's target namespace).
The limitation with this approach is that you can't let the individual components define the schemas that should be used to validate them. But it's a bad idea in general to let a document tell you how to validate it (i.e., validation should be something that your application controls).

You can also use a "resource resolver" to allow XML authors to specify their own schema file, at least to some extent; for example, see https://stackoverflow.com/a/41225329/32453. At the end of the day, you want a fully compliant XML file that can be validated with normal tools anyway :)

Related

Multiple XSD schemas in one xjc/JAXB generated XML file?

I've got a set of XSD schema files I'm generating Java classes from:
http://xmlgw.companieshouse.gov.uk/v1-0/schema/Egov_ch.xsd
http://xmlgw.companieshouse.gov.uk/v1-0/schema/forms/ReturnofAllotmentShares-v3-0.xsd
http://xmlgw.companieshouse.gov.uk/v1-0/schema/forms/FormSubmission-v2-11.xsd
These come together to form a single XML document I need to submit to the Companies House XML Gateway. Right now, the XML files generated from the Java files generated by xjc from these schemas do not include schemaLocation or other schema-related information. By following this answer I'm able to add a top-level schemaLocation to the <GovTalkMessage> element, but not to the others (e.g. <FormSubmission> and <ReturnofAllotmentShares>).
Is there a way I can do this, or to set it at Java-class generation time?
EDIT: Error returned when I remove one of the schemaLocation attributes (if I include it, it returns successfully)
<GovTalkErrors>
<Error>
<RaisedBy>CH_XML_Gateway</RaisedBy>
<Number>100</Number>
<Type>fatal</Type>
<Text>XML failed schema validation: Invalid XML: Unknown element 'ReturnofAllotmentShares' line 36 column 86</Text>
<Location></Location>
</Error>
</GovTalkErrors>
My understanding is that without the schemaLocation attribute, the parser doesn't know what specific document we're referring to here. Each document (maybe 20 of them) is its own XML schema.

Handling character entities in Java using JDOM : how to?

I have to convert an XML file to an SGML file.
I'm using Java 1.6.0.31 and jDOM 2.0.5.
I do not own the SGML's DTD.
The DTDs declare lots of character entities (like &gamma;, &omega;, ...), and I'm not allowed to use the numeric &#947; form.
I do own the XML (I mean I'm able to edit the xsd and do whatever I want with this part).
The XML's xsd does not declare these entities, but I'm using an XML editor that allows inserting them.
My problem is that when I try to convert an XML file containing these entities, I get a "&entities; referenced but not declared" exception.
The code is :
File sourceFile = new File(path);
if (sourceFile.exists()) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setExpandEntityReferences(false);
    factory.setValidating(false);
    factory.setIgnoringComments(true);
    factory.setIgnoringElementContentWhitespace(false);
    DOMBuilder builder = new DOMBuilder();
    this.xmlDocument = builder.build(factory.newDocumentBuilder().parse(sourceFile));
}
The factory.newDocumentBuilder().parse() call is what throws the exception (obviously).
I've been looking for answers, but I'm not good enough with JDOM to decide what I should do, so my question is: what is the safest way to let the entities resolve in this case?
Should I create a custom EntityResolver that will do the job?
Should I force the input XML to use &#947;-style numeric references and then replace the numeric values with the "full-text" entity names?
Thanks for your help!
EDIT: Replacing & so you can see the code, not the entities :/
Jeez again,
I ended up doing something very ugly:
I inserted all the ENTITIES I needed into the document's internal subset using filecontent.replaceFirst("<!DOCTYPE X \\[", "<!DOCTYPE X [" + getEntityFile());
and
private String getEntityFile() throws IOException {
    return FileUtils.readFileToString(f);
}
where f is the DTD file containing all the character entities I'm allowed to use (copied from the SGML DTD). That avoids the "entities referenced but not declared" error. The entities are then expanded during parsing (I haven't found a way to keep internal entities unexpanded with jDOM2; if someone has an idea, I'll bring the beer).
Finally, when I output the SGML file, I replace the values with their entity references...
I'm ashamed, but for now, it works...
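The injection trick can be sketched with the JDK's own DOM parser instead of jDOM (a deliberate swap so the example has no dependencies). The two entity declarations stand in for the contents of the entity file, and the class and method names are hypothetical. One detail worth noting: replaceFirst treats "$" and "\" in the replacement string specially, so the sketch wraps it in Matcher.quoteReplacement in case the entity file contains either character.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityInjection {
    // Hypothetical stand-in for the entity file copied from the SGML DTD.
    static final String ENTITY_DECLS =
        "<!ENTITY gamma \"&#947;\">\n"
      + "<!ENTITY omega \"&#969;\">\n";

    /** Inserts the declarations at the start of the internal subset of DOCTYPE X. */
    static String injectEntities(String xml, String decls) {
        return xml.replaceFirst("<!DOCTYPE X \\[",
                Matcher.quoteReplacement("<!DOCTYPE X [" + decls));
    }

    /** Parses the document and returns the text of the root element, entities expanded. */
    static String parseRootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE X []><X>&gamma; and &omega;</X>";
        // Without the injection, parsing fails because gamma and omega are undeclared.
        System.out.println(parseRootText(injectEntities(xml, ENTITY_DECLS)));
    }
}
```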

Validate and parse xml using woodstox with local dtd

I have seen multiple questions about parsing XML with Woodstox, unmarshalling with JAXB from an XMLStreamReader, and validating against schemas. Reading through them hasn't helped. What I need is to validate an incoming XML document against a local DTD and parse the entire contents into an object representation. The incoming XML can have a DOCTYPE that references a DTD. This needs to be skipped, and a local DTD needs to be used instead. The implementation should be very quick: validation and parsing are expected to take under 1 ms. I managed to do the parsing alone in 5 ms using the code below. Adding validation by setting the schema (the commented-out lines) doesn't work.
xmlif = XMLInputFactory2.newInstance();
xmlif.setProperty(XMLInputFactory2.SUPPORT_DTD, false);
JAXBContext ucontext;
ucontext = JAXBContext.newInstance(XMLOuterElementClass.class);
unmarshaller = ucontext.createUnmarshaller();
/*SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.XML_DTD_NS_URI);
Schema schema = sf.newSchema(new File("c:/resources/schma.dtd"));
unmarshaller.setSchema(schema);*/
XMLStreamReader xsr = xmlif
        .createXMLStreamReader(new StringReader(xml));
//xsr = new StreamReaderDelegate(xsr);
long start = System.currentTimeMillis();
try {
    while (xsr.hasNext()) {
        // Compare strings with equals(); == only compares references
        if (xsr.isStartElement()
                && "XMLOuterElementClass".equals(xsr.getLocalName())) {
            break;
        }
        xsr.next();
    }
    JAXBElement<XMLOuterElementClass> jb = unmarshaller.unmarshal(xsr,
            XMLOuterElementClass.class);
    long end = System.currentTimeMillis();
    System.out.println("Total time taken in ms :" + (end - start));
} finally {
    xsr.close();
}
There are multiple ways to do it; and the best way to get an answer with more depth is to ask this on Woodstox user list (see http://xircles.codehaus.org/projects/woodstox/lists).
But one thing to note is that JAXB knows nothing about Stax2 (Woodstox/Aalto extension over basic Stax), so you need to access it via Stax2 API, not JAXB. So, to enable "external" validation, you need to call:
xmlStreamReader2.validateAgainst(schemaFromDTD);
and you can do this right after constructing stream reader (needs to cast to XMLStreamReader2, or at least to Validatable).
Note that you can validate when reading OR writing, both work similarly (in latter case you enable it via XMLStreamWriter).
Another possibility is to define XMLResolver property (see XMLInputFactory.RESOLVER).
It gets called when trying to read an external dtd, that is, when DOCTYPE contains reference to an external file. Custom XMLResolver can then redirect this read to use some other source.
Note that the first approach (one you started with) is likely more efficient as it only needs to read and parse Schema once, assuming you read it once and reuse afterwards.
Validation itself should be fast, and if parsing takes 4 milliseconds, should not take more than 1 millisecond; especially if you include JAXB processing in 4 milliseconds (that's technically data-binding, above lower level parsing).
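The redirect idea (ignore the DTD location named in the DOCTYPE and substitute a local one) can be illustrated without Woodstox. The sketch below deliberately swaps in the JDK's DOM parser with a SAX EntityResolver, which plays the same role as the StAX XMLResolver described above; it is not the Woodstox API and says nothing about speed. The DTD text, element names, and class name are hypothetical, and a document with no DOCTYPE at all is not handled.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class LocalDtdValidation {
    // Hypothetical local DTD, held in memory for the example.
    static final String LOCAL_DTD =
        "<!ELEMENT order (item+)>\n"
      + "<!ELEMENT item (#PCDATA)>\n";

    /** Makes validation errors fatal instead of merely printing them. */
    static final class Strict implements ErrorHandler {
        public void warning(SAXParseException e) { /* ignored in this sketch */ }
        public void error(SAXParseException e) throws SAXParseException { throw e; }
        public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
    }

    /** Parses and DTD-validates xml, always substituting LOCAL_DTD for the external subset. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Whatever system or public ID the DOCTYPE names, hand back the local DTD.
        builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader(LOCAL_DTD)));
        builder.setErrorHandler(new Strict());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Prefixes a body with a DOCTYPE that points at a remote DTD that is never fetched. */
    static String withDoctype(String body) {
        return "<!DOCTYPE order SYSTEM 'http://example.invalid/remote.dtd'>" + body;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(withDoctype("<order><item>a</item></order>"));
        System.out.println("valid, root = " + doc.getDocumentElement().getTagName());
    }
}
```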

What is meant by 'parsed data' in the xml 1.1 spec?

I am re-wording my question because the 'parsed entity' thing has nothing to do with the problem at hand.
XML 1.1 versus 1.0
Is an XML 1.1 library supposed to escape illegal characters before serializing/deserializing them, or is it supposed to forbid them outright? Which is the correct way to set text on an XML element?
if Element e = new Element("foo")
Should I do this:
e.setText(sanitized_text_illegal_characters_removed_or_escaped) ?
or
e.setText(any_text)
A parsed entity is something you don't really need to worry about unless you're writing an XML parser. It's things like &lt; and &amp;. You can define your own in the document DTD, but it's a rarely used feature. An external parsed entity is one whose contents reside in another file or network resource or somewhere like that.
As to your main question:
Which is the correct way to set Text on an xml element?
if Element e = new Element("foo")
Should I do this:
e.setText(string_of_sanitized_data_with_illegal_characters_escaped) ?
or
e.setText(any_text)
You should set the text as you would like it to come out the other end, when the document is deserialized. This normally means you should not escape the data, and the XML library will do this for you.
e.g.:
You insert the text "bed & breakfast".
The XML library converts this to "bed &amp; breakfast" or "<![CDATA[bed & breakfast]]>" or some other representation, it doesn't really matter.
You send the document somewhere else.
The other parser reads the document and converts the text back.
The end software retrieves the string "bed & breakfast".
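The round trip in the steps above can be checked with a small sketch. It uses the JDK's DOM and Transformer APIs rather than the Element/setText library from the question (an assumption made so the example has no dependencies); the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class EscapeRoundTrip {
    /** Puts the raw, unescaped text into <foo> and lets the serializer do the escaping. */
    static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element e = doc.createElement("foo");
        e.setTextContent(text); // no manual escaping here
        doc.appendChild(e);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Parses the serialized form and returns the text the receiving side would see. */
    static String parseText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String wire = serialize("bed & breakfast");
        System.out.println(wire);            // the ampersand is escaped on the wire
        System.out.println(parseText(wire)); // and restored after parsing
    }
}
```

Escaping the text yourself before handing it to the library would result in double escaping, which is the usual symptom of getting this wrong.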
If you're writing XML programmatically, then you almost certainly don't want to use parsed entities.
There are two kinds of parsed entities: internal and external. An internal parsed entity is defined by a DTD declaration like this:
<!ENTITY me "Mike">
or
<!ENTITY me "<name>Mike</name>">
An external parsed entity is defined by a DTD declaration like this:
<!ENTITY me SYSTEM "me.xml">
Whether the entity is internal or external, it can be referenced by an entity reference like this:
&me;
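which can appear within the content of an element. References to internal entities can also appear in attribute values, as long as the replacement text contains no "<"; references to external entities cannot.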
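To see an internal parsed entity being expanded, here is a minimal sketch using the JDK's DOM parser; the document strings and the class name are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParsedEntityDemo {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Entity whose replacement text is plain character data.
        Document plain = parse(
            "<!DOCTYPE doc [<!ENTITY me \"Mike\">]><doc>Hello, &me;!</doc>");
        System.out.println(plain.getDocumentElement().getTextContent()); // Hello, Mike!

        // Entity whose replacement text contains markup: it is parsed into an element.
        Document markup = parse(
            "<!DOCTYPE doc [<!ENTITY me \"<name>Mike</name>\">]><doc>&me;</doc>");
        System.out.println(markup.getElementsByTagName("name").getLength());
    }
}
```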

Adding source validation to a StructuredTextViewer

I added a nice XML source viewer to my application. Now, I have an XSD schema that defines the XML document. Any idea where to start on adding some source validation that relies on this schema?
Thanks!
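To check that your XML is well-formed, just run it through a parser obtained from a DocumentBuilderFactory. To additionally validate it against an .xsd schema referenced in the XML, make the factory namespace-aware, set its JAXP schemaLanguage attribute ("http://java.sun.com/xml/jaxp/properties/schemaLanguage") to "http://www.w3.org/2001/XMLSchema" (without that attribute, validation is against a DTD), and call: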
factory.setValidating( true );
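If the xsd schema is not referenced within the XML that you are validating, you can supply it yourself like this (JAXP_SCHEMA_SOURCE is a constant you define as "http://java.sun.com/xml/jaxp/properties/schemaSource"):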
factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource) );
For more information, read the article from Oracle here:
http://download.oracle.com/javaee/1.4/tutorial/doc/JAXPDOM8.html
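Putting those pieces together, a minimal sketch might look like the following. The note schema, the temp-file helper, the class name, and the Strict error handler are all assumptions for the example; an error handler is included because the default one only prints validation errors instead of failing.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class JaxpSchemaValidation {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    // Hypothetical one-element schema used only for this example.
    static final String NOTE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note' type='xs:string'/></xs:schema>";

    /** Makes validation errors fatal instead of merely printing them. */
    static final class Strict implements ErrorHandler {
        public void warning(SAXParseException e) { /* ignored in this sketch */ }
        public void error(SAXParseException e) throws SAXParseException { throw e; }
        public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
    }

    /** Writes the example schema to a temporary file so it can be passed as a File. */
    static File writeTempSchema() throws IOException {
        Path p = Files.createTempFile("note", ".xsd");
        Files.write(p, NOTE_XSD.getBytes(StandardCharsets.UTF_8));
        p.toFile().deleteOnExit();
        return p.toFile();
    }

    /** Parses xml and validates it against the given schema file. */
    static Document parse(String xml, File xsd) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setAttribute(JAXP_SCHEMA_SOURCE, xsd);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new Strict());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        File xsd = writeTempSchema();
        parse("<note>remember the milk</note>", xsd);
        System.out.println("document is valid");
    }
}
```

For a viewer that validates repeatedly, compiling the schema once with javax.xml.validation.SchemaFactory and reusing it is a common alternative to setting these attributes on every factory.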
