Java XML validation without knowing the schema in advance

I can receive one of 82 XML structures, each of which contains a root that is not in a namespace and also carries several xmlns attributes: the first defines a URN for the schema of the object, and the rest (which define namespaces) contain the URNs for the common objects.
Schema-aware parsing in Java assumes you know the schema before you start parsing, but I do not know it until either I have loaded the XML without validation and extracted the root (at which point I can load it again with the right schema), or I find some way to get at the xmlns attributes on the root and select the right schema. (I know how to map the URN to the correct schema, and all the schemas are held as resources on my classpath.)
It seems a shame to load the XML twice; is there a way to do this in a single pass?
As an example, I have a possible document which looks like:
<?xml version="1.0" encoding="UTF-8"?>
<BusinessCard xmlns="urn:oasis:names:specification:ubl:schema:xsd:BusinessCard-2"
xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
</BusinessCard>
(There is obviously content inside the BusinessCard object, but I left it out as it is of no relevance here.)
The schema for this is in the resource "xsd/main/UBL-BusinessCard-2.2.xsd".
I have tried using an EntityResolver, but it does not get called before the parser complains that it cannot find the declaration of BusinessCard.

I'm not sure why you say the root isn't in a namespace, when the xmlns="urn:oasis:names:... declaration makes it clear that it is.
One way to do this is to load a single composite schema that contains all the different component schemas, and validate against that. If the union of the schemas is a valid schema (i.e. no conflicting type definitions) then this might be the best approach, especially if you are validating thousands of documents and most of the component schemas are going to be used in each run.
On the other hand, if you're only using a small number of the component schemas in a given run, then this would be expensive.
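As a rough sketch, JAXP lets you build one Schema from many sources at once; the resource list below is a placeholder for your 82 classpath XSDs, with only the BusinessCard entry taken from the question:

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class CompositeSchema {
    // Builds a single composite Schema from all component XSDs on the classpath.
    public static Schema load() throws org.xml.sax.SAXException {
        String[] resources = {
            "xsd/main/UBL-BusinessCard-2.2.xsd",
            // ... the other component schema resources ...
        };
        StreamSource[] sources = new StreamSource[resources.length];
        for (int i = 0; i < resources.length; i++) {
            sources[i] = new StreamSource(
                CompositeSchema.class.getClassLoader().getResourceAsStream(resources[i]),
                resources[i]); // the systemId helps resolve relative imports
        }
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(sources);
    }
}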
One approach would be to detect the namespace using an abortive parse of the document: write a SAX handler that captures the first namespace declaration and then aborts the parse by throwing an exception. You could also do this with a streaming XSLT 3.0 transformation.
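A minimal sketch of that abortive parse (only the document prolog ends up being read twice, since the parse stops at the root element):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// A handler that throws as soon as it sees the root element, smuggling the
// namespace URI out in the exception.
final class RootNamespaceDetector extends DefaultHandler {

    static final class StopParsing extends SAXException {
        final String namespaceUri;
        StopParsing(String namespaceUri) {
            super("root element reached");
            this.namespaceUri = namespaceUri;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        throw new StopParsing(uri);
    }

    static String detect(java.io.InputStream xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to get the URI, not just the QName
        try {
            factory.newSAXParser().parse(xml, new RootNamespaceDetector());
        } catch (StopParsing stop) {
            return stop.namespaceUri;
        }
        return null; // document had no elements
    }
}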
Even smarter would be to write a little SAX pipeline that does some buffering. Capture the first startElement event, extract the namespace, load the schema, create a validator, feed it the SAX events that you've already consumed (the first startElement), then feed the rest of the SAX events from your preprocessor straight through to the validator.
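A sketch of that buffering pipeline, assuming your own URN-to-resource mapping (schemaResourceFor below is hypothetical, and only the BusinessCard resource name comes from the question):

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Events seen before the first startElement are buffered; the root namespace
// selects the schema, the buffered events are replayed into a JAXP
// ValidatorHandler, and everything after flows straight through.
final class LazyValidatingHandler extends DefaultHandler {
    private ValidatorHandler validator; // created once the root namespace is known
    private final java.util.List<String[]> prefixMappings = new java.util.ArrayList<>();

    // Hypothetical URN-to-classpath-resource mapping; only the BusinessCard
    // entry is taken from the question.
    private static String schemaResourceFor(String namespaceUri) {
        return "xsd/main/UBL-BusinessCard-2.2.xsd";
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        if (validator == null) {
            prefixMappings.add(new String[] {prefix, uri}); // buffer until the schema is chosen
        } else {
            validator.startPrefixMapping(prefix, uri);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (validator == null) {
            StreamSource schema = new StreamSource(getClass().getClassLoader()
                    .getResourceAsStream(schemaResourceFor(uri)));
            validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(schema).newValidatorHandler();
            validator.startDocument(); // replay the events consumed so far
            for (String[] m : prefixMappings) {
                validator.startPrefixMapping(m[0], m[1]);
            }
        }
        validator.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (validator != null) validator.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        validator.endElement(uri, localName, qName);
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        if (validator != null) validator.endPrefixMapping(prefix);
    }

    @Override
    public void endDocument() throws SAXException {
        if (validator != null) validator.endDocument();
    }
}

Parse with a namespace-aware SAXParser and this handler, and validation errors surface as SAXExceptions from a single pass over the document.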

Related

XJC generating 1000+ classes for FpML schema

I have XML based on the FpML schema.
I used the xjc command-line tool to generate the corresponding POJO classes.
Then I am using JAXB to unmarshal the XML into Java objects.
I converted the XML to objects as an intermediate step because it makes it easy to read the values of some fields.
But the problem is that the FpML schema generated ~1200 classes, so I am not sure this is the correct approach, as the jar size will also increase.
My problem statement: convert one XML document based on one schema to another XML document based on another schema; both involve FpML. While populating the second XML I need to validate a few fields against a database.
Please give me suggestions.
Data binding technologies such as JAXB work fine for simple cases, but when the schema is large, complex, or frequently changing, they become quite unwieldy, as you have discovered.
This is a task for XSLT. Use schema-aware XSLT if possible, because it makes your stylesheet easier to debug.
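As a rough illustration, running a stylesheet from Java via JAXP looks like this (fpml-to-other.xsl and the file names are placeholders; note that the JDK's built-in processor only supports XSLT 1.0, so schema-aware XSLT 3.0 needs a processor such as Saxon-EE):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FpmlTransform {
    public static void main(String[] args) throws Exception {
        // Compile the (hypothetical) stylesheet and run it over the input document.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("fpml-to-other.xsl"));
        transformer.transform(new StreamSource("input-fpml.xml"),
                              new StreamResult("output.xml"));
    }
}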

Should Java classes created by JAXB have logic code

I use JAXB to load an XML configuration file into a Java object (ConfigurationDTO). Is it good practice to add some logic to this Java object (ConfigurationDTO), or should I create a different Java object with that logic (i.e. Configuration)? By logic I mean some checks/constraints that the configuration file should satisfy. Should the Java class ConfigurationDTO contain only getters?
The question is why you need those constraints. Are you going to use your object for more than marshalling/unmarshalling? If so, that is a bad idea. The rule of thumb is not to spread DTO objects across all levels of an application. If you follow this rule you won't need additional constraints in your DTOs.
The JAXB standard provides the ability to validate an object at marshal and unmarshal time. This means that if your XML schema requires a non-empty field but the corresponding Java object has a null value, then marshalling will fail, and vice versa.
Here is a quote from the JAXB documentation:
Validation is the process of verifying that an XML document meets all the constraints expressed in the schema. JAXB 1.0 provided validation at unmarshal time and also enabled on-demand validation on a JAXB content tree. JAXB 2.0 only allows validation at unmarshal and marshal time. A web service processing model is to be lax in reading in data and strict on writing it out. To meet that model, validation was added to marshal time so users could confirm that they did not invalidate an XML document when modifying the document in JAXB form.
Such an approach has its own drawbacks (if you spread the DTOs across the application you lose control over them), but the advantages are more valuable.
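As a sketch of wiring that up, you hand a Schema to the Unmarshaller (ConfigurationDTO comes from the question; the XSD and XML file names are placeholders):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class ValidatingLoad {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new File("configuration.xsd")); // placeholder schema file
        Unmarshaller unmarshaller = JAXBContext
                .newInstance(ConfigurationDTO.class)
                .createUnmarshaller();
        unmarshaller.setSchema(schema); // invalid documents now fail at unmarshal time
        ConfigurationDTO config = (ConfigurationDTO)
                unmarshaller.unmarshal(new File("configuration.xml"));
    }
}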

How to search and update in an XML using JAXB in Java

I'm able to create Java objects from an XML schema, and creating a new XML document is also working.
Now, using my Java object, how can I search for a particular tag and update it back into the XML?
You can use an instance of Marshaller to write the object back to XML. If you want to apply changes back to an existing DOM then you can use an instance of Binder. Binder is useful when there is unmapped content in your document that you wish to preserve.
For More Information
http://blog.bdoughan.com/2010/09/jaxb-xml-infoset-preservation.html
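A minimal sketch of the Binder approach (Customer and its setter are hypothetical JAXB-generated classes; the file names are placeholders):

import java.io.File;
import javax.xml.bind.Binder;
import javax.xml.bind.JAXBContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class UpdateInPlace {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // JAXB requires a namespace-aware DOM
        Document doc = dbf.newDocumentBuilder().parse(new File("input.xml"));

        Binder<Node> binder = JAXBContext.newInstance(Customer.class).createBinder();
        Customer customer = (Customer) binder.unmarshal(doc);

        customer.setName("New Name"); // update the mapped object
        binder.updateXML(customer);   // push the change back into the same DOM

        // Serialize the DOM, with any unmapped content still intact.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(new File("output.xml")));
    }
}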

How To Search Domain Objects And The Physical Files They Point To Using Solr Or Searchable

I have a digital library system where I store metadata and the path to the physical file in the database. The files may be anything: plain text, Word, PDF, MP3, JPEG, MP4...
How can I provide full-text search over both my domain objects and the physical files (or some text extraction of the files)?
Is my only choice to store the document text in the domain object? I need to be able to retrieve a list of domain objects regardless of whether the search results come from the domain object or the physical document. There is of course a possible connection via the file path, and I actually drop each document in a folder named by a GUID, so the connection is there.
I need to do this in Grails, ideally using the Solr or Searchable plugin, but a Java solution would help.
You don't need to store the content in the domain object; just associate the content with the domain object when creating the index entry. I used Apache POI to extract my content, but there are higher-level services like Apache Tika.
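For instance, Tika's facade reduces the extraction step to a one-liner (the file name is a placeholder):

import java.io.File;
import org.apache.tika.Tika;

public class TextExtraction {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format and returns the extracted plain text.
        String plainText = new Tika().parseToString(new File("document.pdf"));
        System.out.println(plainText);
    }
}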
You could code it up in Java using Lucene directly, but I would suggest Solr instead.
The Grails Searchable plugin is based on Compass, which in turn is based on Lucene.
Have a look at this article, which covers use cases like yours, based on Spring, Hibernate, Hibernate Search, and JSF. It comes with a comprehensive, well-documented sample application.
The article focuses on the separation-of-concerns paradigm and modularity, by the way. Thus, the full-text-search concepts involved ought to suit Grails, or other Java-based, applications just as well.
The main domain class is de.metagear.library.model.Media (there is an associated MetaData domain class, too). You'll be able to mix Hibernate and GORM classes; however, you'll need to use different APIs then.
The Media class contains a property plainText:
@Column(name = "plain_text", nullable = false)
@Field(index = Index.TOKENIZED, store = Store.YES)
@Lob
private String plainText;
That property holds the extracted text (e.g., from PDFs). I'm not sure whether it needs to be saved to the database or not (probably not, but it shouldn't do much harm otherwise). Nevertheless, it's not used for full-text search (see below); for full-text search, only the Lucene indexes are used.
Before a Media is created, the text content of the corresponding original document (possibly a binary one) is extracted. The de.metagear.library.model.factory.MediaFactory.getInstance(..) method extracts the text, stores it in a new Media object, and returns that Media.
In the sample, it simply stores the original document in a property of the domain object, but at that point you could also save the document to a file and store a reference (the GUID you mentioned) in a property of the Media.
The domain class is saved by the de.metagear.library.dao.MediaCrudDaoImpl class, which is a Spring bean. Internally, it's using an injected EntityManagerFactory that, in /applicationContext.xml, has been defined to use Hibernate under the hood.
Indexing occurs automatically because of the Hibernate annotations in the domain class.
As for performing the full-text search itself, that's accomplished by the de.metagear.library.dao.MediaSearchDaoImpl.getSearchResults(..) method, which queries not the database but only the Lucene indexes.
The sample application contains a powerful query terms pre-processor that can combine AND, OR, and NOT operators on different indexes while preserving the comprehensive Lucene expression syntax.
By setting a custom org.hibernate.transform.ResultTransformer at this place, objects of any type (including domain classes, of course) can be returned.
I haven't looked into the Grails Searchable plugin yet, and thus cannot tell whether it's viable in terms of robustness, maintainability, ease of use, and, last but not least, extensibility with custom or third-party content extractors, parsers, and analyzers. Probably it is, as well.
After all, there's some basic knowledge of the Spring and (maybe) Hibernate frameworks involved in my approach. These are just the frameworks that Grails and GORM are based on, but I think this might be a decision point for you.
At least, looking at the above concepts ought to be informative and empower you to advance when evaluating different frameworks and approaches.
Thanks.

JAXB: Can I make XmlAttribute's parameter "required=true" the default?

I have
@XmlAttribute(required=true)
in hundreds of places in a project.
Can I make this the default, so that I then only need to specify
@XmlAttribute(required=false)
when needed?
No, that behaviour is hard-wired. However, the required attribute is really a lightweight alternative to a proper XML schema. If you need better control over document validation, then I suggest you define an XML Schema for your documents, and inject the schema into the JAXBContext. The documents will then be checked on marshalling and unmarshalling, and you won't have to rely on the annotations for validation.
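As a sketch, the mechanism is setting the schema on the Marshaller (and, symmetrically, the Unmarshaller) obtained from the JAXBContext; MyRoot and the file names here are placeholders:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class SchemaCheckedWrite {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new File("my-schema.xsd")); // placeholder schema file
        Marshaller marshaller = JAXBContext.newInstance(MyRoot.class).createMarshaller();
        marshaller.setSchema(schema); // marshalling an invalid object tree now fails
        marshaller.marshal(new MyRoot(), new File("out.xml"));
    }
}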
