Given a schema document (XSD format) such as the MODS 3.5 schema (US Library of Congress, LoC), and an XML document known to be valid according to that schema, such as the metadata for the Antitrust & Competition Policy Blog archives 2007 (HTML view) from the LoC Law Blawgs Web Archive, is there a Java API that would allow a Java program to query the XML document for the XML Schema data types that its elements are instances of?
It may seem as though I have XML schemas and UML models confused. I think of an XML schema as something like a UML model (M1), and of an XML document as user data (M0) representing instances of UML model elements. If it is similarly possible to query an XML element for the XML Schema data type or element definition that it derives from or conforms to in the parse tree, I think that could make a nice feature for a sequencer for ModeShape.
The idea is essentially this: in a ModeShape JCR repository, the JCR nodes representing the elements of a sequenced XML document could each reference a JCR node representing the element's XML Schema data type. The type's JCR node would be defined when the schema used by the document is sequenced, for example by the ModeShape XSD sequencer.
I'm simply not certain whether there is a Java API for determining the XML Schema element declaration that a valid XML document element conforms to in the parse tree, when the document is validated against an XML schema. I'm under the impression that such a computation is possible. Is there already an API for that?
Alternately, there is UML...
The answer is yes.
In terms of standards, validating an XML document against a schema produces a PSVI (post-schema-validation infoset), which decorates nodes in the parse tree with information about the types they were validated against.
In terms of concrete implementation, if you use the JAXP Validation API you can either generate a DOM augmented with TypeInfo that tells you the type of each node, or you can use a SAX-based validation pipeline in which type information is made available through a TypeInfoProvider.
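As a rough illustration of the SAX-based pipeline, here is a minimal sketch that validates a document through a ValidatorHandler and asks its TypeInfoProvider for the type of each element as it starts. The class name SaxTypeReport and the method elementTypes are made up for this example, and the sketch assumes the JDK's built-in schema validator; other implementations may report type names differently, particularly for anonymous types.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
public class SaxTypeReport {

    /** Validates xml against xsd and maps each element's local name to its schema type name. */
    public static Map<String, String> elementTypes(String xsd, String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
            ValidatorHandler validator = schema.newValidatorHandler();
            TypeInfoProvider types = validator.getTypeInfoProvider();
            Map<String, String> result = new LinkedHashMap<>();

            // The validator forwards each SAX event to this handler after validating it.
            // The TypeInfoProvider may only be queried from inside these callbacks.
            validator.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    TypeInfo type = types.getElementTypeInfo();
                    result.put(localName, type == null ? null : type.getTypeName());
                }
            });

            // The parser must be namespace aware for schema validation to work.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(validator);
            reader.parse(new InputSource(new StringReader(xml)));
            return result;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parsing or validation failed", e);
        }
    }
}
```

Because the map is keyed by local name, a repeated element name overwrites the earlier entry; a real sequencer would more likely record the type against the node it is building at that moment.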
You can also do this using schema-aware XSLT and XQuery; after a validation operation, nodes are augmented with a "type annotation", which you can interrogate using the "instance of" test. If you use Saxon, you can use the extension functions saxon:type() or saxon:type-annotation() to explore further:
http://www.saxonica.com/documentation/#!functions/saxon/type
http://www.saxonica.com/documentation/#!functions/saxon/type-annotation
A limitation of the XSLT/XQuery approach is that it only works if validation succeeds. The DOM/SAX interfaces also provide information in cases where validation fails.
Related
Recently I've encountered a service that returns its results in XML, in roughly the following fashion:
<event>
<event-header>
...
</event-header>
<event-body>
...
</event-body>
</event>
Notice that the document does not have a namespace definition. As a result, there is no "official" schema that I can use.
I have written a schema definition that I can use to generate classes for interacting with the equivalent elements in the document. From observation I can tell that the document format does not change (field order remains the same, and fields are neither introduced nor removed). But the question stands: can I still deserialize the provided document using my schema? As far as I know, schemas must define a namespace, and in theory the documents above and below
<event xmlns="http://saltyjuice.lt/dragas/event-service/1.0/event-schema.xsd">
<event-header>
...
</event-header>
<event-body>
...
</event-body>
</event>
are not equivalent.
For reference, I'm using StAX with Woodstox 6 as the implementation.
You can have a schema for a no-namespace document; I don't know why you thought otherwise. It's not ideal, because a namespace can guide people to the right schema, but it's allowed. Anyway, even with a namespace, it's quite possible to have several schemas for the same namespace (usually versions and variants).
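To make that concrete, here is a small sketch using the JAXP validation API. The schema is hypothetical: it simply omits the targetNamespace attribute, and it types the two children as xs:anyType because the question does not show what they contain. The class name NoNamespaceSchemaDemo is likewise made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
public class NoNamespaceSchemaDemo {

    // No targetNamespace attribute, so the declared elements are in no namespace.
    // xs:anyType is a placeholder for the content the question elides.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='event'><xs:complexType><xs:sequence>"
        + "<xs:element name='event-header' type='xs:anyType'/>"
        + "<xs:element name='event-body' type='xs:anyType'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if xml is valid against the no-namespace schema above. */
    public static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            // With no ErrorHandler set, the first validation error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this schema, isValid("<event><event-header/><event-body/></event>") should return true, while the same document with a default xmlns declaration on the root should return false, since its elements are then in a namespace the schema does not describe. That matches the observation in the question that the two documents are not equivalent.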
I recently got an old XML-over-HTTP API. It has a few response types, and none of the responses have namespace or type attributes. They all have the same root node and then a different set of child nodes.
Is there a way in Java to unmarshal such XML? It would be like using child nodes as discriminator fields. Two sample responses are given below.
<Response>
<A1/>
<A2/>
</Response>
<Response>
<B1/>
<B2/>
</Response>
The best approach really depends on what you want to do. If you just want to unmarshal the data, you could define a model using JAXB, for instance, which includes all the potential child elements. Then when you unmarshal an instance document, only the child elements actually present in the document would have values.
If you instead want to have separate models for the different response variations your best approach would be to use a BufferedInputStream and call mark() at the start, then read enough of the document with a pull parser such as XMLStreamReader to determine the actual response type. Then you can reset() the stream to the start of the document and start over using JAXB with the appropriate data model.
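The peek-and-rewind step can be sketched roughly as follows. The class ResponseSniffer and its method are hypothetical names, and the sketch assumes that every response has at least one child element (as in the two samples) and that the parser will not read more than readLimit bytes before reaching it. The JAXB step is left out because it depends on your model classes.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
public class ResponseSniffer {

    /**
     * Returns the local name of the first child element of the root, then
     * rewinds the stream so the caller can parse the document from the start.
     * readLimit must cover everything the parser buffers before it gets there.
     */
    public static String firstChildName(BufferedInputStream in, int readLimit) {
        in.mark(readLimit);
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
            int startElements = 0;
            while (reader.hasNext()) {
                // The first START_ELEMENT is the root, the second is its first child.
                if (reader.next() == XMLStreamConstants.START_ELEMENT && ++startElements == 2) {
                    String name = reader.getLocalName();
                    reader.close(); // frees the reader, does not close the stream
                    in.reset();     // back to the marked position
                    return name;
                }
            }
            throw new IllegalStateException("root element has no child elements");
        } catch (XMLStreamException e) {
            throw new IllegalStateException("response is not well-formed XML", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the first sample response this should yield "A1", and for the second "B1"; the caller can then pick the matching model class and unmarshal from the same stream. A generous readLimit (tens of kilobytes) is probably wise, since parsers tend to fill an internal buffer well beyond the point they have actually scanned.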
I have a couple of questions about JAXB:
What options are there for parsing? Can I implement / plugin my own parser easily?
What about validity? Suppose I have a relaxed parser that is somewhat relaxed regarding the schema. Can I still create an (invalid) object-structure?
Does JAXB provide special means to do e.g. validation on the objects? I'd like to parse to an "invalid" object structure, have some algorithm repair it, then validate (in Java).
Does JAXB provide other means to do fancy things on the objects (e.g. visitor pattern).
What about the memory footprint? Is the object representation (disregarding the parsing) feasible for XML files of 10-100MB?
Good tutorials covering this kind of questions are appreciated, Google revealed only coarse overviews.
Below are my answers to your questions:
What options are there for parsing? Can I implement / plugin my own
parser easily?
JAXB (JSR-222) implementations can unmarshal from many different input types: InputStream, InputSource, Node, XMLStreamReader, XMLEventReader, File, Source. If your XML representation matches any of these then you're all set.
What about validity? Suppose I have a relaxed parser that is somewhat
relaxed regarding the schema. Can I still create an (invalid)
object-structure?
JAXB implementations require that the XML be well formed, but do not require that it be valid against an XML schema. JAXB is designed to handle a wide range of documents. If you want to ensure "validity" then you can set an XML schema (see JAXB and Marshal/Unmarshal Schema Validation).
Does JAXB provide special means to do e.g. validation on the objects?
I'd like to parse to an "invalid" object structure, have some
algorithm repair it, then validate (in Java).
You can use the javax.xml.validation APIs to do validation on an object model. For a full example see:
http://blog.bdoughan.com/2010/11/validate-jaxb-object-model-with-xml.html
Does JAXB provide other means to do fancy things on the objects (e.g.
visitor pattern).
JAXB models are POJOs so you can design them as you wish. You may be interested in the following classes:
http://docs.oracle.com/javase/6/docs/api/javax/xml/bind/Marshaller.Listener.html
http://docs.oracle.com/javase/6/docs/api/javax/xml/bind/Unmarshaller.Listener.html
What about the memory footprint? Is the object representation
(disregarding the parsing) feasible for XML files of 10-100MB?
Yes, JAXB can be used to process documents of that size. If you are concerned about size, you can use an XMLStreamReader to parse the XML file and then unmarshal objects from the XMLStreamReader in chunks.
We have an XML document which needs to be validated against an XSD. The XML is generated by XStream, and we are using the JAXP APIs to validate it against the respective XSD. Unfortunately, our test cases currently fail because the generated XML has elements in a different order than the XSD specifies.
Is it possible to ignore the order of elements in generated XML while validating it against XSD?
Thanks for the help in advance.
What you are asking for is a way to say "validate some of the XSD and ignore other parts". I don't think that can be done.
One possible solution would be to modify the schema so that instead of using a <sequence> for those elements (which requires that the elements be in a particular order) you can use <all>, which allows the elements to be in any order.
The point of a schema is to impose certain structure and requirements on an XML document. You can't just say "eh, I don't like that particular part of the schema, ignore it" as then the document doesn't conform to the schema anymore.
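The difference between <sequence> and <all> mentioned above can be seen in a small sketch. The person/first/last vocabulary and the class name SequenceVsAll are invented for the illustration; the same content model is compiled once with each compositor and the same documents are validated against both.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
public class SequenceVsAll {

    /** Builds a made-up schema whose content model uses the given compositor ("sequence" or "all"). */
    public static String schemaWith(String compositor) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
             + "<xs:element name='person'><xs:complexType><xs:" + compositor + ">"
             + "<xs:element name='first' type='xs:string'/>"
             + "<xs:element name='last' type='xs:string'/>"
             + "</xs:" + compositor + "></xs:complexType></xs:element>"
             + "</xs:schema>";
    }

    /** Returns true if xml validates against xsd. */
    public static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
        try {
            // With no ErrorHandler set, the first validation error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A document such as <person><last>b</last><first>a</first></person> should be rejected by schemaWith("sequence") and accepted by schemaWith("all"), while the in-order document is accepted by both. Keep in mind that in XSD 1.0 the children of <all> may each occur at most once, so this only helps when the reordered elements are not repeated.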
Let's say I have a doc.xml and corresponding doc.xsd. I use xpath to retrieve some nodes, so I get a list of org.w3c.dom.Node. How can I get type of each node from schema, eg. xs:integer, xs:string etc ?
One solution would be to query the schema with the XPath expression "//NodeName[@type]", using node.getNodeName() as NodeName, but that's not perfect. I can't be sure that the schema is well structured: what if NodeName exists in many places in the schema and has not been extracted as a separate type?
So generally I am looking for a reliable solution to get the node type for ANY valid xml & xsd.
You should consider using JAXB. It will create Java classes for you based on the schema types. Your XML docs are then read into those classes, which are typed according to how you defined your XSD. Therefore xsd:int maps to Java int (or the Integer wrapper class, I can't recall), etc.
Cast your DOM Elements to TypeInfo: from there, you can access the type information you're looking for.
Unfortunately, types as defined in an XML Schema (XSD) or Document Type Definition (DTD) are not directly tied to the XML documents they validate. The elements and attributes in an XML document do not inherently have a type; they are just text. Think of an XSD as a script that validates an XML document rather than a set of type annotations for elements and attributes.
The XML specification does not define types as you are thinking of them here. Even Document Type Definitions (DTDs), which can be embedded inside XML documents, are more about the structure of the document than the type of the data contained in elements and attributes.
The type system described in XML Schema is an optional layer of validation that can be applied to XML documents. Since this validation is optional, the standard XML APIs do not provide a way to bind the validation rules in an XSD to the actual attributes and elements.
I think it would be possible for an XML API to provide a mechanism to bind an XSD to a specific XML document, but I am not aware of an XML parser that does this. One reason why this is not so easy is that the type system defined in XML Schema is much richer than what most mainstream programming languages support. In your example you may only be interested in xs:integer, xs:string and the like, but in XML Schema you can create types that specify ranges, patterns and other things that are just not possible with data types in most programming languages. Representing this complex type system in Java or any other programming language would require a fairly complex API. The question then becomes: is it really worth it? I would say probably not.
As per David D's answer, but slightly cleaner: call getSchemaTypeInfo() on an element or attribute.
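A minimal sketch of that approach, assuming the document is parsed by a namespace-aware DocumentBuilder that has the schema attached, so that validation happens during parsing and the DOM nodes carry type information. The class DomTypeLookup and the method typeOf are names made up for this example, and the sketch assumes the XPath expression selects an element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.TypeInfo;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
public class DomTypeLookup {

    /**
     * Parses xml with validation against xsd and returns the schema type of
     * the element selected by xpath, as "{namespace}name" or just "name"
     * when the type is in no namespace.
     */
    public static String typeOf(String xsd, String xml, String xpath) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for schema validation
            factory.setSchema(schema);       // validate while building the DOM
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Assumes the expression matches an element; no null check in this sketch.
            Element element = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODE);

            TypeInfo type = element.getSchemaTypeInfo();
            String ns = type.getTypeNamespace();
            return ns == null ? type.getTypeName() : "{" + ns + "}" + type.getTypeName();
        } catch (SAXException | IOException | ParserConfigurationException
                 | XPathExpressionException e) {
            throw new IllegalStateException("lookup failed", e);
        }
    }
}
```

For an element declared with type xs:integer, the type name should come back as integer in the http://www.w3.org/2001/XMLSchema namespace. Anonymous types have no name in the schema, so what getTypeName() returns for them is implementation specific. Attributes work the same way through Attr.getSchemaTypeInfo().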