Handling character entities in Java using JDOM : how to? - java

I have to convert an XML file to an SGML file.
I'm using Java 1.6.0.31 and JDOM 2.0.5.
I do not own the SGML DTD.
The DTD declares lots of character entities (like &gamma;, &omega;, ...), and I'm not allowed to use the numeric &#947; form instead.
I do own the XML (I mean I'm able to edit the XSD and do whatever I want with this part).
The XSD for the XML does not declare these entities, but I'm using an XML editor that allows inserting them.
My problem is that when I try to convert an XML file containing these entities, I get a "&entities; referenced but not declared" exception message.
The code is:
File sourceFile = new File(path);
if (sourceFile.exists()) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setExpandEntityReferences(false);
    factory.setValidating(false);
    factory.setIgnoringComments(true);
    factory.setIgnoringElementContentWhitespace(false);
    // JDOM's DOMBuilder converts the parsed W3C DOM tree into a JDOM Document
    DOMBuilder builder = new DOMBuilder();
    this.xmlDocument = builder.build(factory.newDocumentBuilder().parse(sourceFile));
}
The factory.newDocumentBuilder().parse() call is what throws the exception (obviously).
I've been looking for answers, but I'm not familiar enough with JDOM to decide what I should do, so my question is: what is the safest way to get these entities resolved in this case?
Should I create a custom EntityResolver that will do the job?
Should I force the input XML to use the numeric &#947; form and then replace each numeric value with its named ("full-text") entity?
Thanks for your help!
EDIT: Replacing & so you can see the code, not the entities :/

Jeez again,
I ended up doing something very ugly:
I inserted all the ENTITY declarations I needed into the document's internal subset using filecontent.replaceFirst("<!DOCTYPE X \\[", "<!DOCTYPE X [" + getEntityFile());
and
private String getEntityFile() throws IOException {
    return FileUtils.readFileToString(f);
}
where f is the DTD file containing all the character entities I'm allowed to use (copied from the SGML DTD). That way I avoid the "entities referenced but not declared" error. The entities then get expanded during parsing (yeah, I haven't found a way to keep internal entities unexpanded using JDOM 2; if someone has an idea, I'll bring the beer).
And at the end, when I'm outputting the SGML file, I replace the values with their entity references...
I'm ashamed, but for now, it works...
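The workaround above can be sketched with the JDK's own DOM parser alone. This is only an illustration under assumptions: the class and method names are hypothetical, the single gamma declaration stands in for the real entity file, and plain string splicing is used instead of replaceFirst so that any '$' or '\' in the declarations is not treated as regex replacement syntax.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of JDOM or the original post.
class EntityInjectionSketch {

    // Splice ENTITY declarations into an existing internal subset ("<!DOCTYPE root [").
    static String injectEntities(String xml, String rootName, String declarations) {
        String marker = "<!DOCTYPE " + rootName + " [";
        int start = xml.indexOf(marker);
        if (start < 0) {
            throw new IllegalArgumentException("No internal subset found for root " + rootName);
        }
        int insertAt = start + marker.length();
        return xml.substring(0, insertAt) + "\n" + declarations + "\n" + xml.substring(insertAt);
    }

    // Parse with the JDK DOM parser; declared entities are expanded into plain text.
    static String parseText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    // On output, map the expanded character back to its named reference
    // (only gamma is handled here; a real mapping would cover every allowed entity).
    static String toEntityReferences(String text) {
        return text.replace("\u03B3", "&gamma;");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE X [\n]>\n<X>angle &gamma;</X>";
        String declarations = "<!ENTITY gamma \"&#947;\">";
        String text = parseText(injectEntities(xml, "X", declarations));
        System.out.println(toEntityReferences(text));
    }
}
```

The reverse mapping step is the fragile part: it rewrites every occurrence of the character, including ones that were typed literally rather than through an entity.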

Related

How match JAXB elements in CIM/RDF?

While trying to load a model from a CIM/XML file according to IEC 61970 (Common Information Model, for power system models), I ran into a problem.
In JAXB, graphs between elements are expressed with @XmlIDREF and @XmlID, and the two values must be equal for a match. But in CIM/RDF, references to a resource by ID, e.g. rdf:resource="#_37C0E103000D40CD812C47572C31C0AD", contain the "#" character. Consequently JAXB is unable to match "GeographicalRegion" with "SubGeographicalRegion.Region" when the "#" character is present in the rdf:resource attribute.
Here is an example:
<cim:GeographicalRegion rdf:ID="_37C0E103000D40CD812C47572C31C0AD">
    <cim:IdentifiedObject.name>GeoRegion</cim:IdentifiedObject.name>
    <cim:IdentifiedObject.localName>OpenCIM3bus</cim:IdentifiedObject.localName>
</cim:GeographicalRegion>
<cim:SubGeographicalRegion rdf:ID="_ID_SubGeographicalRegion">
    <cim:IdentifiedObject.name>SubRegion</cim:IdentifiedObject.name>
    <cim:IdentifiedObject.localName>SubRegion</cim:IdentifiedObject.localName>
    <cim:SubGeographicalRegion.Region rdf:resource="#_37C0E103000D40CD812C47572C31C0AD"/>
</cim:SubGeographicalRegion>
I realize you're asking for a solution using JAXB, but I would urge you to consider an RDF-based solution as it is more flexible and robust. You're basically trying to reinvent what RDF parsers already have built in. RDF/XML is a difficult format to parse, it doesn't make much sense to try and hack your own parsing together - especially since files that have very different XML structures can express exactly the same information: this only becomes apparent when looking at the level of the RDF. You may find that your JAXB parser workaround works on one CIM/RDF file but completely fails on another.
So, here's an example of how to process your file using the Sesame RDF API. No inferencing is involved, this just parses the file and puts it in an in-memory RDF model, which you can then manipulate and query from any angle.
Assuming the root element of your CIM file looks something like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:cim="http://example.org/cim/">
(only a guess of course, but I need prefixes for a proper example)
Then you can do the following, using Sesame's Rio RDF/XML parser:
String baseURI = "http://example.org/my/file";
FileInputStream in = new FileInputStream("/path/to/my/cim.rdf");
Model model = Rio.parse(in, baseURI, RDFFormat.RDFXML);
This creates an in-memory RDF model of your document. You can then simply filter-query over that. For example, to print out the properties of all resources that have _37C0E103000D40CD812C47572C31C0AD as their SubGeographicalRegion.Region:
String CIM_NS = "http://example.org/cim/";
ValueFactory vf = ValueFactoryImpl.getInstance();
URI subRegion = vf.createURI(CIM_NS, "SubGeographicalRegion.Region");
URI res = vf.createURI("http://example.org/my/file#_37C0E103000D40CD812C47572C31C0AD");
// All subjects whose SubGeographicalRegion.Region points at the given resource
Set<Resource> subs = model.filter(null, subRegion, res).subjects();
for (Resource sub: subs) {
    System.out.println("resource: " + sub + " has the following properties: ");
    for (URI prop: model.filter(sub, null, null).predicates()) {
        System.out.println(prop + ": " + model.filter(sub, prop, null).objectValue());
    }
}
Of course at this point you can also choose to convert the model to some other syntax format for further handling by your application - as you see fit. The point is that the difference between the identifiers with the leading # and without has been resolved for you by the RDF/XML parser.
This is of course personal opinion only, since I don't know the details of your use case, but I think you'll find that this is quite quick and flexible. I should also point out that although the above solution keeps the entire model in memory, you can easily adapt this to a more streaming (and therefore less memory-intensive) approach if you find your files are too big.
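To see why the leading # stops being a problem once identifiers are treated as URIs, here is a minimal sketch using only java.net.URI from the JDK (not the Sesame API). The class and method names are hypothetical and the base URI is the same made-up one used in the example above. Both rdf:ID="x" and rdf:resource="#x" end up as the base URI plus the fragment x.

```java
import java.net.URI;

// Hypothetical illustration class; an RDF/XML parser does this resolution internally.
class RdfIdResolutionSketch {

    // rdf:ID="x" names the resource <base#x>.
    static URI fromRdfId(URI base, String rdfId) {
        return base.resolve("#" + rdfId);
    }

    // rdf:resource="#x" is a relative URI reference, resolved against the base.
    static URI fromRdfResource(URI base, String reference) {
        return base.resolve(reference);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/my/file");
        URI declared = fromRdfId(base, "_37C0E103000D40CD812C47572C31C0AD");
        URI referenced = fromRdfResource(base, "#_37C0E103000D40CD812C47572C31C0AD");
        // Both sides now carry the same absolute URI, so they can be matched directly.
        System.out.println(declared + " equals " + referenced + ": " + declared.equals(referenced));
    }
}
```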

Validate and parse xml using woodstox with local dtd

I have seen multiple questions about parsing XML with Woodstox, unmarshalling with JAXB through an XMLStreamReader, and validating against schemas. Reading through them hasn't helped. What I need is to validate incoming XML against a local DTD and parse the entire contents into an object representation. The incoming XML can have a DOCTYPE that references a DTD; this needs to be skipped and a local DTD used instead. The implementation should be very quick: validation plus parsing is expected to take < 1 ms. So far I have managed parsing alone, which takes 5 ms with the code below. Adding validation by setting the schema (the commented-out lines) doesn't work.
xmlif = XMLInputFactory2.newInstance();
// Ignore any DTD referenced by the incoming document's DOCTYPE
xmlif.setProperty(XMLInputFactory2.SUPPORT_DTD, false);
JAXBContext ucontext;
ucontext = JAXBContext.newInstance(XMLOuterElementClass.class);
unmarshaller = ucontext.createUnmarshaller();
/*SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.XML_DTD_NS_URI);
Schema schema = sf.newSchema(new File("c:/resources/schma.dtd"));
unmarshaller.setSchema(schema);*/
XMLStreamReader xsr = xmlif
        .createXMLStreamReader(new StringReader(xml));
//xsr = new StreamReaderDelegate(xsr);
long start = System.currentTimeMillis();
try {
    // Advance to the start tag of the element to unmarshal
    while (xsr.hasNext()) {
        if (xsr.isStartElement()
                && "XMLOuterElementClass".equals(xsr.getLocalName())) {
            break;
        }
        xsr.next();
    }
    JAXBElement<XMLOuterElementClass> jb = unmarshaller.unmarshal(xsr,
            XMLOuterElementClass.class);
    long end = System.currentTimeMillis();
    System.out.println("Total time taken in ms :" + (end - start));
} finally {
    xsr.close();
}
There are multiple ways to do it; and the best way to get an answer with more depth is to ask this on Woodstox user list (see http://xircles.codehaus.org/projects/woodstox/lists).
But one thing to note is that JAXB knows nothing about Stax2 (Woodstox/Aalto extension over basic Stax), so you need to access it via Stax2 API, not JAXB. So, to enable "external" validation, you need to call:
xmlStreamReader2.validateAgainst(schemaFromDTD);
and you can do this right after constructing stream reader (needs to cast to XMLStreamReader2, or at least to Validatable).
Note that you can validate when reading OR writing, both work similarly (in latter case you enable it via XMLStreamWriter).
Another possibility is to define XMLResolver property (see XMLInputFactory.RESOLVER).
It gets called when trying to read an external dtd, that is, when DOCTYPE contains reference to an external file. Custom XMLResolver can then redirect this read to use some other source.
Note that the first approach (one you started with) is likely more efficient as it only needs to read and parse Schema once, assuming you read it once and reuse afterwards.
Validation itself should be fast, and if parsing takes 4 milliseconds, should not take more than 1 millisecond; especially if you include JAXB processing in 4 milliseconds (that's technically data-binding, above lower level parsing).
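The XMLResolver route mentioned above can be sketched with the standard StAX API alone. Treat this as an assumption-laden illustration: the class name, the remote system identifier and the one-line DTD are made up, and with the JDK's bundled StAX implementation this only redirects the DTD read (so that entities from the local DTD are available); it does not validate. Validation itself would still go through the Stax2 validateAgainst call described above.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: serve a local DTD whenever the DOCTYPE points at an external one.
class LocalDtdResolverSketch {

    static String readRootText(String xml, final String localDtd) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.RESOLVER, new XMLResolver() {
            @Override
            public Object resolveEntity(String publicId, String systemId,
                                        String baseUri, String namespace) {
                // Ignore the identifiers from the document and hand back the local copy.
                return new ByteArrayInputStream(localDtd.getBytes(StandardCharsets.UTF_8));
            }
        });
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            // Skip the prolog (including the DTD event) up to the root start tag.
            while (reader.next() != XMLStreamConstants.START_ELEMENT) {
                // nothing to do
            }
            return reader.getElementText();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc SYSTEM \"http://example.invalid/remote.dtd\">"
                + "<doc>&greeting;</doc>";
        System.out.println(readRootText(xml, "<!ENTITY greeting \"hello\">"));
    }
}
```

Because the resolver returns the local bytes unconditionally, the remote system identifier is never fetched.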

What is meant by 'parsed data' in the xml 1.1 spec?

I am re-wording my question because the 'parsed entity' thing has nothing to do with the problem at hand.
XML 1.1 versus 1.0
Is an XML 1.1 library supposed to escape illegal characters before serializing/deserializing them, or is it supposed to forbid them outright? Which is the correct way to set text on an XML element?
Given Element e = new Element("foo")
Should I do this:
e.setText(sanitized_text_illegal_characters_removed_or_escaped) ?
or
e.setText(any_text)
A parsed entity is something you don't really need to worry about unless you're writing an XML parser. It's things like &lt; and &amp;. You can define your own in the document DTD, but it's a rarely used feature. An external parsed entity is one whose contents reside in another file or network resource or somewhere like that.
As to your main question:
Which is the correct way to set Text on an xml element?
if Element e = new Element("foo")
Should I do this:
e.setText(string_of_sanitized_data_with_illegal_characters_escaped) ?
or
e.setText(any_text)
You should set the text as you would like it to come out the other end, when the document is deserialized. This normally means you should not escape the data, and the XML library will do this for you.
e.g.:
You insert the text "bed & breakfast".
The XML library converts this to "bed &amp; breakfast" or "<![CDATA[bed & breakfast]]>" or some other representation, it doesn't really matter.
You send the document somewhere else.
The other parser reads the document and converts the text back.
The end software retrieves the string "bed & breakfast".
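The round trip in the steps above can be checked with a small sketch. It uses the JDK's W3C DOM and Transformer rather than the Element class from the question, which is an assumption made so the example needs no extra library; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical sketch: the caller sets raw text, the library escapes and unescapes.
class EscapingRoundTripSketch {

    // Build <foo>rawText</foo> and serialize it; no manual escaping by the caller.
    static String serialize(String rawText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element e = doc.createElement("foo");
        e.setTextContent(rawText);
        doc.appendChild(e);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Parse the serialized form and return the text the receiving side would see.
    static String parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String wire = serialize("bed & breakfast");
        System.out.println(wire);        // the escaped form that travels between systems
        System.out.println(parse(wire)); // the original string, restored by the parser
    }
}
```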
If you're writing XML programmatically, then you almost certainly don't want to use parsed entities.
There are two kinds of parsed entities: internal and external. An internal parsed entity is defined by a DTD declaration like this:
<!ENTITY me "Mike">
or
<!ENTITY me "<name>Mike</name>">
An external parsed entity is defined by a DTD declaration like this:
<!ENTITY me SYSTEM "me.xml">
Whether the entity is internal or external, it can be referenced by an entity reference like this:
&me;
which can appear within the content of an element or attribute.

Specifying DTD to be used by DocumentBuilders for XML parsing?

I am currently writing a tool, using Java 1.6, that brings together a number of XML files. All of the files validate to the DocBook 4.5 DTD (I have checked this using xmllint and specifying the DocBook 4.5 DTD as the --dtdvalid parameter), but not all of them include the DOCTYPE declaration.
I load each XML file into the DOM to perform the required manipulation like so:
private Document fileToDocument( File input ) throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setIgnoringElementContentWhitespace(false);
    factory.setIgnoringComments(false);
    factory.setValidating(false);
    factory.setExpandEntityReferences(false);
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse( input );
}
For the most part this has worked quite well: I can use the returned object to navigate the tree, perform the required manipulations, and then write the document back out. Where I am encountering problems is with files which:
Do not include the DOCTYPE declaration, and
Do include entities defined in the DTD (for example &mdash;, the em dash).
Where this is the case an exception is thrown from the builder.parse(...) call with the message:
[Fatal Error] :5:15: The entity "mdash" was referenced, but not declared.
Fair enough, it isn't declared. What I would ideally do in this instance is set the DocumentBuilderFactory to always use the DocBook 4.5 DTD regardless of whether one is specified in the file.
I did try validation using the DocBook 4.5 schema but found that this produced a number of unrelated errors with the XML. It seems like the schema might not be functionally equivalent to the DTD, at least for this version of the DocBook specification.
The other option I can think of is to read the file in, try and detect whether a doctype was set or not, and then set one if none was found prior to actually parsing the XML into the DOM.
So, my question is: is there a smarter way that I have not seen to tell the parser to use a specific DTD, or to ensure that parsing proceeds despite the entities not resolving (not just the &mdash; example but any entity in the XML; there are a large number of possible ones)?
Could using an EntityResolver2 and implementing EntityResolver2.getExternalSubset() help?
... This method can also be used with documents that have no DOCTYPE declaration. When the root element is encountered, but no DOCTYPE declaration has been seen, this method is invoked. If it returns a value for the external subset, that root element is declared to be the root element, giving the effect of splicing a DOCTYPE declaration at the end of the prolog of a document that could not otherwise be valid. ...
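As a point of comparison, the fallback described in the question (add a DOCTYPE when none is present, then parse) can be sketched with a plain SAX EntityResolver instead of EntityResolver2. The class and method names are hypothetical, the root element name has to be supplied by the caller, and in the usage shown the one-line DTD text is only a stand-in for a local copy of the real DocBook 4.5 DTD. The public and system identifiers are the standard DocBook XML V4.5 ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical sketch: force a DocBook DOCTYPE and serve the DTD from a local copy.
class DefaultDoctypeSketch {
    static final String PUBLIC_ID = "-//OASIS//DTD DocBook XML V4.5//EN";
    static final String SYSTEM_ID = "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd";

    // Insert a DOCTYPE after the XML declaration (if any) when the document has none.
    // The contains() check is naive: it would also match "<!DOCTYPE" inside a comment.
    static String ensureDoctype(String xml, String rootName) {
        if (xml.contains("<!DOCTYPE")) {
            return xml;
        }
        String doctype = "<!DOCTYPE " + rootName + " PUBLIC \"" + PUBLIC_ID
                + "\" \"" + SYSTEM_ID + "\">\n";
        int insertAt = xml.startsWith("<?xml") ? xml.indexOf("?>") + 2 : 0;
        return xml.substring(0, insertAt) + doctype + xml.substring(insertAt);
    }

    static Document parse(String xml, String rootName, final String localDtdText) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                if (PUBLIC_ID.equals(publicId) || SYSTEM_ID.equals(systemId)) {
                    return new InputSource(new StringReader(localDtdText));
                }
                return null; // fall back to the parser's default lookup
            }
        });
        return builder.parse(new InputSource(new StringReader(ensureDoctype(xml, rootName))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<para>a&mdash;b</para>", "para", "<!ENTITY mdash \"&#8212;\">");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Files that already carry the DocBook DOCTYPE go through the same resolver, so they are served from the local copy as well.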

XML to be validated against multiple xsd schemas

I'm writing the xsd and the code to validate, so I have great control here.
I would like to have an upload facility that adds stuff to my application based on an xml file. One part of the xml file should be validated against different schemas based on one of the values in the other part of it. Here's an example to illustrate:
<foo>
    <name>Harold</name>
    <bar>Alpha</bar>
    <baz>Mercury</baz>
    <!-- ... more general info that applies to all foos ... -->
    <bar-config>
        <!-- the content here is specific to the bar named "Alpha" -->
    </bar-config>
    <baz-config>
        <!-- the content here is specific to the baz named "Mercury" -->
    </baz-config>
</foo>
In this case, there is some controlled vocabulary for the content of <bar>, and I can handle that part just fine. Then, based on the bar value, the appropriate xml schema should be used to validate the content of bar-config. Similarly for baz and baz-config.
The code doing the parsing/validation is written in Java. Not sure how language-dependent the solution will be.
Ideally, the solution would permit the xml author to declare the appropriate schema locations and what-not so that s/he could get the xml validated on the fly in a sufficiently smart editor.
Also, the possible values for <bar> and <baz> are orthogonal, so I don't want to do this by extension for every possible bar/baz combo. What I mean is, if there are 24 possible bar values/schemas and 8 possible baz values/schemas, I want to be able to write 1 + 24 + 8 = 33 total schemas, instead of 1 * 24 * 8 = 192 total schemas.
Also, I'd prefer to NOT break out the bar-config and baz-config into separate xml files if possible. I realize that might make all the problems much easier, as each xml file would have a single schema, but I'm trying to see if there is a good single-xml-file solution.
I finally figured this out.
First of all, in the foo schema, the bar-config and baz-config elements have a type which includes an any element, like this:
<sequence>
    <any minOccurs="0" maxOccurs="1"
         processContents="lax" namespace="##any" />
</sequence>
In the xml, then, you must specify the proper namespace using the xmlns attribute on the child element of bar-config or baz-config, like this:
<bar-config>
    <config xmlns="http://www.example.org/bar/Alpha">
        ... config xml here ...
    </config>
</bar-config>
Then, your XML schema file for bar Alpha will have a target namespace of http://www.example.org/bar/Alpha and will define the root element config.
If your XML file has namespace declarations and schema locations for both of the schema files, this is sufficient for the editor to do all of the validating (at least good enough for Eclipse).
So far, we have satisfied the requirement that the xml author may write the xml in such a way that it is validated in the editor.
Now, we need the consumer to be able to validate. In my case, I'm using Java.
If by some chance, you know the schema files that you will need to use to validate ahead of time, then you simply create a single Schema object and validate as usual, like this:
Schema schema = factory().newSchema(new Source[] {
    new StreamSource(stream("foo.xsd")),
    new StreamSource(stream("Alpha.xsd")),
    new StreamSource(stream("Mercury.xsd")),
});
In this case, however, we don't know which xsd files to use until we have parsed the main document. So, the general procedure is to:
Validate the xml using only the main (foo) schema
Determine the schema to use to validate the portion of the document
Find the node that is the root of the portion to validate using a separate schema
Import that node into a brand new document
Validate the brand new document using the other schema file
Caveat: it appears that the document must be built namespace-aware in order for this to work.
Here's some code (this was ripped from various places of my code, so there might be some errors introduced by the copy-and-paste):
// Contains the filename of the xml file
String filename;

// Load the xml data using a namespace-aware builder (the method
// 'stream' simply opens an input stream on a file)
Document document;
DocumentBuilderFactory docBuilderFactory =
    DocumentBuilderFactory.newInstance();
docBuilderFactory.setNamespaceAware(true);
document = docBuilderFactory.newDocumentBuilder().parse(stream(filename));

// Create the schema factory
SchemaFactory sFactory = SchemaFactory.newInstance(
    XMLConstants.W3C_XML_SCHEMA_NS_URI);

// Load the main schema
Schema schema = sFactory.newSchema(
    new StreamSource(stream("foo.xsd")));

// Validate using main schema
schema.newValidator().validate(new DOMSource(document));

// Get the node that is the root for the portion you want to validate
// using another schema
Node node = getSpecialNode(document);

// Build a Document from that node
Document subDocument = docBuilderFactory.newDocumentBuilder().newDocument();
subDocument.appendChild(subDocument.importNode(node, true));

// Determine the schema to use using your own logic
Schema subSchema = parseAndDetermineSchema(document);

// Validate using other schema
subSchema.newValidator().validate(new DOMSource(subDocument));
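Since the snippet above relies on helpers that are not shown (stream, getSpecialNode, parseAndDetermineSchema), here is a self-contained sketch of the same two-stage procedure. Everything specific in it is an assumption made for illustration: the class name, the trimmed-down foo schema with only bar and bar-config, the Alpha schema with a single integer level element, and the hard-coded choice of the Alpha schema.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of "validate the whole, then validate the part".
class TwoStageValidationSketch {
    static final String FOO_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='foo'><xs:complexType><xs:sequence>"
        + "<xs:element name='bar' type='xs:string'/>"
        + "<xs:element name='bar-config'><xs:complexType><xs:sequence>"
        + "<xs:any minOccurs='0' maxOccurs='1' processContents='lax' namespace='##any'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static final String ALPHA_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='http://www.example.org/bar/Alpha' elementFormDefault='qualified'>"
        + "<xs:element name='config'><xs:complexType><xs:sequence>"
        + "<xs:element name='level' type='xs:int'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static Schema schema(String xsd) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return factory.newSchema(new StreamSource(new StringReader(xsd)));
    }

    static Node firstElementChild(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return n;
            }
        }
        return null;
    }

    // Throws org.xml.sax.SAXException if either stage finds the document invalid.
    static void validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for schema validation of DOM trees
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        // Stage 1: the main schema accepts any single child inside bar-config (lax).
        schema(FOO_XSD).newValidator().validate(new DOMSource(document));

        // Stage 2: pick the sub-schema from the <bar> value (only "Alpha" is known here).
        String bar = document.getElementsByTagName("bar").item(0).getTextContent();
        if (!"Alpha".equals(bar)) {
            throw new IllegalArgumentException("No schema registered for bar " + bar);
        }
        Node config = firstElementChild(document.getElementsByTagName("bar-config").item(0));
        if (config == null) {
            return; // nothing to validate in the second stage
        }
        Document subDocument = builder.newDocument();
        subDocument.appendChild(subDocument.importNode(config, true));
        schema(ALPHA_XSD).newValidator().validate(new DOMSource(subDocument));
    }
}
```

A document whose config content breaks the Alpha schema (for example a non-numeric level) passes the first stage and is rejected only by the second.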
Take a look at NVDL (Namespace-based Validation Dispatching Language) - http://www.nvdl.org/
It is designed to do what you want to do (validate parts of an XML document that have their own namespaces and schemas).
There is a tutorial here - http://www.dpawson.co.uk/nvdl/ - and a Java implementation here - http://jnvdl.sourceforge.net/
Hope that helps!
Kevin
You need to define a target namespace for each separately-validated portion of the instance document. Then you define a master schema that references the schema documents for these components, using <xsd:import> (rather than <xsd:include>, which only works within a single target namespace).
The limitation with this approach is that you can't let the individual components define the schemas that should be used to validate them. But it's a bad idea in general to let a document tell you how to validate it (i.e., validation should be something that your application controls).
You can also use a "resource resolver" to allow XML authors to specify their own schema file, at least to some extent, e.g. https://stackoverflow.com/a/41225329/32453. At the end of the day, you want a fully compliant XML file that can be validated with normal tools anyway :)
