I have a couple of questions about JAXB:
What options are there for parsing? Can I implement / plugin my own parser easily?
What about validity? Suppose I have a relaxed parser that is somewhat relaxed regarding the schema. Can I still create an (invalid) object-structure?
Does JAXB provide special means to do e.g. validation on the objects? I'd like to parse to an "invalid" object structure, have some algorithm repair it, then validate (in Java).
Does JAXB provide other means to do fancy things on the objects (e.g. visitor pattern).
What about the memory footprint? Is the object representation (disregarding the parsing) feasible for XML files of 10-100MB?
Good tutorials covering this kind of questions are appreciated, Google revealed only coarse overviews.
Below are my answers to your questions:
What options are there for parsing? Can I implement / plugin my own
parser easily?
JAXB (JSR-222) implementations can unmarshal from many different input types: InputStream, InputSource',Node,XMLStreamReader,XMLEventReader,File,Source`. If your XML representation matches any of these then you're all set.
What about validity? Suppose I have a relaxed parser that is somewhat
relaxed regarding the schema. Can I still create an (invalid)
object-structure?
JAXB implementations requires that the XML be well formed, but does not require it be valid against an XML schema. It is designed to handle a wide range of documents. If you want to ensure "validity" then you can set an XML schema (see JAXB and Marshal/Unmarshal Schema Validation).
Does JAXB provide special means to do e.g. validation on the objects?
I'd like to parse to an "invalid" object structure, have some
algorithm repair it, then validate (in Java).
You can use the javax.xml.validation APIs to do validation on an object model. For a full example see:
http://blog.bdoughan.com/2010/11/validate-jaxb-object-model-with-xml.html
Does JAXB provide other means to do fancy things on the objects (e.g.
visitor pattern).
JAXB models are POJOs so you can design them as you wish. You may be interested in the following classes:
http://docs.oracle.com/javase/6/docs/api/javax/xml/bind/Marshaller.Listener.html
http://docs.oracle.com/javase/6/docs/api/javax/xml/bind/Unmarshaller.Listener.html
What about the memory footprint? Is the object representation
(disregarding the parsing) feasible for XML files of 10-100MB?
Yes JAXB can be used to process documents of that size. If you are concerned about size, you can use an XMLStreamReader to parse the XML file and then unmarshal objects from the XMLStreamReader in chunks.
Related
Recently I've encountered a service that returns its results in XML, in sort of following fashion
<event>
<event-header>
...
</event-header>
<event-body>
...
</event-body>
</event>
Notice that the document does not have a namespace definition. As a result, there is no "official" schema that I can use.
I have written a schema definition that I can use to generate classes that are usable in code to interact with equivalent elements in the document. From observation I can tell that the document format does not change (field order remains the same, fields are not introduced or go away). But question stands, can I still deserialize the provided document using my schema? As far as I know, schemas must define a namespace, and in theory the documents above and below
<event xmlns="http://saltyjuice.lt/dragas/event-service/1.0/event-schema.xsd">
<event-header>
...
</event-header>
<event-body>
...
</event-body>
</event>
are not equivalent.
For reference, I'm using stax and woodstox 6 as implementation.
You can have a schema for a no-namespace document, I don't know why you thought otherwise. It's not ideal, because a namespace can guide people to the right schema. But it's allowed. Anyway, even with a namespace, it's quite possible to have several schemas for the same namespace (usually, versions and variants).
Given a schema document (XSD foramt) such as the MODS 3.5 schema (US Library of Congress, LoC), and a document (XML) known to be valid according to that schema, such as the metadata for the Antitrust & Competition Policy Blog archives 2007 (HTML view) from the LoC Law Blawgs Web Archive, is there a Java API such that would allow a Java program to query the XML document for the XML schema data types that elements of the document are instances of?
It may seem as though I have may XML schemas and UML models confused. I'm thinking of an XML schema as it representing something like a UML model (M1), and an XML document then, like user data (M0) representing instances of UML model elements. If it may be possible, similarly, to query an XML element, to determine the XML schema data type or element definition that the element either derives from or is conformant to in the parse tree, I've thought it could make for a nice feature for a sequencer for ModeShape.
I think, the idea is essentially: That it may be possible to reference the JCR nodes representing XML elements of a sequenced XML document, in a ModeShape JCR repository, to reference each element to a JCR node representing an XML schema data type, such the type's representative JCR node would be defined in the sequencing of the schema used by the document, such as would have been sequenced by the ModeShape XSD sequencer.
I'm simply not certain if there may be an API, in Java, for determining the XML schema element than a valid XML document element -- when the XML document is validated according to an XML schema -- such that the element is conformant to in the parse tree. I'm of an impression that it would be possible to perform such a computation. Simply, I wonder, might there already be an API for that?
Alternately, there is UML...
The answer is yes.
In terms of standards, validating an XML document against a schema produces a PSVI, (post schema validation infoset), and the PSVI decorates nodes in the parse tree with information about what types they were validated against.
In terms of concrete implementation, if you use the JAXP Validation API you can either generate a DOM augmented with TypeInfo that tells you the type of each node, or you can use a SAX-based validation pipeline in which type information is notified to a TypeInfoProvider.
You can also do this using schema-aware XSLT and XQuery; after a validation operation, nodes are augmented with a "type annotation", which you can interrogate using the "instance of" test. If you use Saxon, you can use the extension functions saxon:type() or saxon:type-annotation() to explore further:
http://www.saxonica.com/documentation/#!functions/saxon/type
http://www.saxonica.com/documentation/#!functions/saxon/type-annotation
A limitation of the XSLT/XQuery approach is that it only works if validation succeeds. The DOM/SAX interfaces also provide information in cases where validation fails.
Until now, I've been handling extensions by defining a placeholder element that has "name" and "value" attributes as shown in the below example
<root>
<typed-content>
...
</typed-content>
<extension name="var1" value="val1"/>
<extension name="var2" value="val2"/>
....
</root>
I am now planning to switch to using xsd:any. I'd appreciate if you can help me choose th best approach
What is the value add of xsd:any over my previous approach if I specify processContents="strict"
Can a EAI/ESB tool/library execute XPATH expressions against the arbitrary elements I return
I see various binding tools treating this separately while generating the binding code. Is this this the same case if I include a namespace="http://mynamespace" and provide the schema for the "http://mynamespace" during code gen time?
Is this WS-I compliant?
Are there any gotchas that I am missing?
Thank you
Using <xsd:any processContents="strict"> gives people the ability to add extensions to their XML instance documents without changing the original schema. This is the critical benefit it gives you.
Yes. tools than manipulate the instances don't care what the schema looks like, it's the instance documents they look at. To them, it doesn't really matter if you use <xsd:any> or not.
Binding tools generally don't handle <xsd:any> very elegantly. This is understandable, since they have no information about what it could contain, so they'll usually give you an untyped placeholder. It's up the the application code to handle that at runtime. JAXB is particular (the RI, at least) makes a bit of a fist of it, but it's workable.
Yes. It's perfectly good XML Schema practice, and all valid XML Schema are supported by WS-I
<xsd:any> makes life a bit harder on the programmer, due to the untyped nature of the bindings, but if you need to support arbitrary extension points, this is the way to do it. However, if your extensions are well-defined, and do not change, then it may not be worth the irritation factor.
Regarding point 3
Binding tools generally don't handle
very elegantly. This is
understandable, since they have no
information about what it could
contain, so they'll usually give you
an untyped placeholder. It's up the
the application code to handle that at
runtime. JAXB is particular (the RI,
at least) makes a bit of a fist of it,
but it's workable.
This corresponds to the #XmlAnyElement annotation in JAXB. The behaviour is as follows:
#XmlAnyElement - Keep All as DOM Nodes
If you annotate a property with this annotation the corresponding portion of the XML document will be kept as DOM nodes.
#XMLAnyElement(lax=true) - Convert Known Elements to Domain Objects
By setting lax=true, if JAXB has a root type corresponding to that QName then it will convert that chunk to a domain object.
http://bdoughan.blogspot.com/2010/08/using-xmlanyelement-to-build-generic.html
Let's say I have a doc.xml and corresponding doc.xsd. I use xpath to retrieve some nodes, so I get a list of org.w3c.dom.Node. How can I get type of each node from schema, eg. xs:integer, xs:string etc ?
Some solution would be to parse schema with xpath query "//NodeName[#type]" using node.getNodeName() as NodeName, but that's not perfect. I can't be sure that schema is elegant - what if NodeName exists in many places in schema and has not been extracted as a separate type?
So generally I am looking for a reliable solution to get the node type for ANY valid xml & xsd.
You should consider using JAXB. It will create Java classes for you based on the schema type. Then your XML docs are read into those classes, which are typed according to how you defined your XSD. Therefore xsd:int maps to java int(or Integer wrapper class, I can't recall), etc.
Cast your DOM Elements to TypeInfo: from there, you can access the type information you're looking for.
Unfortunately types as defined in an XML Schema (XSD) or Document Type Definition (DTD) are not directly tied to XML document they validate. The elements and attributes in an XML document do not inherently have a type they are just text. Think of an XSD as a script that validates an XML document rather than a set of type annotations for elements and attributes.
The XML specification does not define types as you are thinking of them here. Even Document Type Definitions (DTD) which can be embedded inside XML documents more about the structure of the document not the type of the data contained in elements and attributes.
The type system described in XML Schema is an optional layer of validation that can be applied to XML documents. Since this validation optional the standard XML APIs do not provide a way to bind the validation rules in an XSD to the actual attributes and elements.
I think it would be possible for an XML API to provide a mechanism to bind an XSD to a specific XML document, but I am not aware of an XML parser that does this. One reason why this is not so easy is that the type system that is defined in XML Schema is much richer than is supported in most mainstream programming languages. In your example you may only be interested in xs:integer, xs:string and the like but in XML Schema you can create types that specify ranges, patterns and other things that are just not possible with data types in most programming languages. To represent this complex type system in Java or any programming language would have to be done through a fairly complex API. The the question becomes it is really worth it? I would say probably not.
As per David Ds answer, slightly cleaner, call getSchemaTypeInfo() on an element or attribute
If you have a Java object and an XML schema (XSD), what is the best way to take that object and convert it into an xml file in line with the schema. The object and the schema do not know about each other (in that the java classes weren't created from the schema).
For example, in the class, there may be an integer field 'totalCountValue' which would correspond to an element called 'countTotal' in the xsd file. Is there a way of creating a mapping that will say "if the object contains an int totalCountValue, create an element called 'countTotal' and put it into the XML".
Similarly, there may be a field in the object that should be ignored, or a list in the object that should correspond to multiple XML elements.
I looked at XStream, but didn't see any (obvious) way of doing it. Are there other XML libraries that can simplify this task?
I believe this can be achieved via JAXB using it's annotations. I've usually found it much easier to generate Objects from JAXB ( as defined in your schema) using XJC than to map an existing Java object to match my schema. YMMV.
I'm doing Object do XML serialization with XStream. What don't you find "obvious" with this serializer? Once you get the hang of it its very simple.
In the example you provided you could have something like this:
...
XStream xstream = new XStream(new DomDriver());
xstream.alias("myclass", MyClass.class);
xstream.aliasField("countTotal", MyClass.class, "totalCountValue");
String xml = xstream.toXML(this);
...
for this sample class:
class MyClass {
private int totalCountValue;
public MyClass() {
}
}
If you find some serializer more simple or "cool" than this please share it with us. I'm also looking for a change...
Check the XStream mini tutorial here
I use a java library called JiBx to do this work. You need to write a mapping file (in XML) to describe how you want the XML Schema elements to map to the java objects. There are a couple of generator tools to help with automating the process. Plus it's really fast.
I tried out most of the libraries suggested in order to see which one is most suited to my needs. I also tried out a library that was not mentioned here, but suggested by a colleague, which was a StAX implementation called Woodstox.
Admittedly my testing was not complete for all of these libraries, but for the purpose mentioned in the question, I found Woodstox to be the best. It is fastest for marshalling (in my testing, beating XStream by around 30~40%). It is also fairly easy to use and control.
The drawback to this approach is that the XML created (since it is defined by me) needs to be run through a validator to ensure that it is correct with the schema.
You can use a library from Apache Commons called Betwixt. It can map a bean to XML and then back again if you need to round trip.
Take a look at JDOM.
I would say JAXB or Castor. I have found Castor to be easier to use and more reliable, but JAXB is the standard