Merging multiple XML files in Java


I've been looking for the best way to do this, but I can't find a clear answer on how it should be done.
I have an ArrayList of Files in my Java code, representing a list of XML files that should be merged and written to a new XML file. The list is not of fixed length; I estimate it will hold between 2 and 10 files. All these files have a very similar document structure, but some attributes should be summed while merging. For example:
File1
<events>
<commandEvents date="2013-07-16">
<commandEvent count="1" commandId="update"/>
<commandEvent count="1" commandId="debug"/>
<commandEvent count="3" commandId="resume"/>
</commandEvents>
</events>
File 2
<events>
<commandEvents date="2013-07-16">
<commandEvent count="2" commandId="resume"/>
</commandEvents>
<commandEvents date="2013-07-15">
<commandEvent count="2" commandId="resume"/>
<commandEvent count="1" commandId="update"/>
</commandEvents>
</events>
Result
<events>
<commandEvents date="2013-07-16">
<commandEvent count="1" commandId="update"/>
<commandEvent count="1" commandId="debug"/>
<commandEvent count="5" commandId="resume"/>
</commandEvents>
<commandEvents date="2013-07-15">
<commandEvent count="2" commandId="resume"/>
<commandEvent count="1" commandId="update"/>
</commandEvents>
</events>
To clarify, the merging should occur on commandEvents[@date]/commandEvent[@commandId]. The commandEvent elements have some more attributes, but these are the same for each element so I've omitted them here. Not all dates will be available in each document.
I first found some answers to go the XSLT route, but I'm quite confused about the XSLT syntax to do this.
I'm not entirely sure how large these files may get, but I would be highly surprised if they exceeded 1 MB, so a Java DOM library such as JDOM or XOM might work as well. However, I'd have to load all these files at the same time or iterate in pairs.
What is regarded as the best way to do this? And if XSLT is regarded as the best solution, would it be possible to give me some tips for this?

Here's a simple merge, in which all children of the root node in one document get appended to the root node of a second document:
public static void mergeSecondLevel(Document from, Document to) {
Element fromRoot = from.getDocumentElement();
Element toRoot = to.getDocumentElement();
Node child = null;
while ((child = fromRoot.getFirstChild()) != null) {
to.adoptNode(child);
toRoot.appendChild(child);
}
}
If you're trying to do some sort of processing on the nodes before merging them (you say some attributes should be summed), then this won't be sufficient. There's a linked post that covers using XPath to retrieve nodes, but even then you're going to have to write logic to ensure the correct updates.
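For the summing case in the question, one possible approach is to skip node-by-node merging and instead accumulate the counts in a map keyed by date and commandId, then build a fresh result document. The following is only a sketch using the JDK's DOM API; the class name EventCountMerger and its methods are made up for illustration, and it assumes every commandEvent has an integer count attribute. It writes only the count and commandId attributes, so any other attributes would have to be copied separately.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of any library.
final class EventCountMerger {

    /** Returns date -> (commandId -> summed count), keeping first-seen order. */
    static Map<String, Map<String, Integer>> sumCounts(List<Document> docs) {
        Map<String, Map<String, Integer>> totals = new LinkedHashMap<>();
        for (Document doc : docs) {
            NodeList groups = doc.getElementsByTagName("commandEvents");
            for (int i = 0; i < groups.getLength(); i++) {
                Element group = (Element) groups.item(i);
                Map<String, Integer> perDate = totals.computeIfAbsent(
                        group.getAttribute("date"), d -> new LinkedHashMap<>());
                NodeList events = group.getElementsByTagName("commandEvent");
                for (int j = 0; j < events.getLength(); j++) {
                    Element event = (Element) events.item(j);
                    // Add this file's count to whatever has been seen so far
                    perDate.merge(event.getAttribute("commandId"),
                            Integer.parseInt(event.getAttribute("count")), Integer::sum);
                }
            }
        }
        return totals;
    }

    /** Builds a new <events> document from the summed totals. */
    static Document toDocument(Map<String, Map<String, Integer>> totals, DocumentBuilder builder) {
        Document out = builder.newDocument();
        Element root = out.createElement("events");
        out.appendChild(root);
        totals.forEach((date, perDate) -> {
            Element group = out.createElement("commandEvents");
            group.setAttribute("date", date);
            root.appendChild(group);
            perDate.forEach((commandId, count) -> {
                Element event = out.createElement("commandEvent");
                event.setAttribute("count", Integer.toString(count));
                event.setAttribute("commandId", commandId);
                group.appendChild(event);
            });
        });
        return out;
    }

    /** Parses an XML string; real code would parse each File instead. */
    static Document parse(DocumentBuilder builder, String xml) throws Exception {
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document a = parse(builder, "<events><commandEvents date=\"2013-07-16\">"
                + "<commandEvent count=\"3\" commandId=\"resume\"/></commandEvents></events>");
        Document b = parse(builder, "<events><commandEvents date=\"2013-07-16\">"
                + "<commandEvent count=\"2\" commandId=\"resume\"/></commandEvents></events>");
        System.out.println(sumCounts(List.of(a, b)));
    }
}
```

Because the totals live in a plain map, the files can be parsed one at a time and discarded, so nothing forces all 2 to 10 documents to be held in memory together.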

Check out XmlCombiner, a Java library that implements XML merging and lets you add a filter in which you can specify the logic for summing the values of the 'count' attribute.
Here is the code for the initialization of the library:
import org.atteo.xmlcombiner.XmlCombiner;
// create combiner specifying the attributes which are used as a keys
XmlCombiner combiner = new XmlCombiner(Lists.newArrayList("date", "commandId"));
// add the filter
combiner.setFilter(filter);
// combine files
combiner.combine(firstFile);
combiner.combine(secondFile);
// store the result
combiner.buildDocument(resultFile);
And here is the code for the filter itself:
XmlCombiner.Filter filter = new XmlCombiner.Filter() {
@Override
public void postProcess(Element recessive, Element dominant, Element result) {
if (recessive == null || dominant == null) {
return;
}
Attr recessiveNode = recessive.getAttributeNode("count");
Attr dominantNode = dominant.getAttributeNode("count");
if (recessiveNode == null || dominantNode == null) {
return;
}
int recessiveValue = Integer.parseInt(recessiveNode.getValue());
int dominantValue = Integer.parseInt(dominantNode.getValue());
result.setAttribute("count", Integer.toString(recessiveValue + dominantValue));
}
};
Disclaimer: I am the author of the XmlCombiner.

Related

Problem in converting Java objects to XML based on a given XSD file

I am generating a custom XML export in DSpace 5.2. The item that is to be exported as an XML file has an array of metadata values. The values must appear in the XML file as the given XSD file defines their hierarchy. I add the values based on the XSD order into the XML, but some XML tags are in an order different from the insertion order.
More details
The approach I am using is, at first, move the array of metadata values into a map. The keys of the map are the metadata field names. Then, based on the XSD, I get an appropriate value from the map and generate an XML element like this:
import org.dspace.content.Metadatum;
import org.w3c.dom.Element;
import org.w3c.dom.Document;
public class DSpaceXML implements Serializable {
// A member variable
private Document doc;
// A DSpace built-in function used to export an item to an XML
public final void addItem(Item item) throws Exception {
// Initialize this.doc
Element rootElement = doc.createElement("root");
Element higherElement = doc.createElement("higher-element");
Element lowerElement = doc.createElement("lower-element");
insertMetadataAsChildOfElement(higherElement, "child-of-higher", "dc.childOfHigher");
rootElement.appendChild(higherElement);
insertMetadataAsChildOfElement(lowerElement, "child-of-lower", "dc.childOfLower");
rootElement.appendChild(lowerElement);
// stuff to generate other elements of the XML
}
private void insertMetadataAsChildOfElement(Element parentElement, String childElementName,
String key) {
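List<Metadatum> metadatumList = (List<Metadatum>) metadataMap.get(key);
if (metadatumList == null) {
return; // no values stored under this key
}
// Append one child element per metadata value
for (Metadatum metadatum : metadatumList) {
parentElement.appendChild(createElement(childElementName, metadatum.value));
}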
}
private Element createElement(String name, String value) {
Element el = doc.createElement(name);
el.appendChild(doc.createTextNode(value));
return el;
}
}
I expect an XML like this:
<root>
<higher-element>
<child-of-higher>Value1</child-of-higher>
</higher-element>
<lower-element>
<child-of-lower>Value2</child-of-lower>
</lower-element>
<another-element-1/>
....
<another-element-n/>
</root>
What I get is like this (<lower-element> is before <higher-element>):
<root>
<lower-element>
<child-of-lower>Value2</child-of-lower>
</lower-element>
<another-element-1/>
....
<another-element-k/>
<higher-element>
<child-of-higher>Value1</child-of-higher>
</higher-element>
<another-element-k-plus-1/>
....
<another-element-n/>
</root>
I cannot figure out why this happens while rootElement.appendChild(higherElement) is called before rootElement.appendChild(lowerElement). Also, I would appreciate if someone let me know if my approach is the best one for generating an XML from an XSD.

I figured out that I had a bug in my code. Because I check a lot of metadata values, many lines after rootElement.appendChild(lowerElement) I had a second rootElement.appendChild(higherElement). Appending a node that is already in the tree moves it to the end, so the second call overrode the earlier order of the XML elements. As a result, <higher-element> appeared after <lower-element>. As for the second part of my question, I would still be happy to hear about best practices for generating XML based on an XSD, given the limitations of DSpace 5.
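The behaviour behind this bug can be reproduced in isolation: in the W3C DOM, appendChild on a node that is already in the tree first removes it from its old position. This is a small sketch with the JDK's DOM implementation; the class name AppendChildMoveDemo is made up for illustration, and the element names are taken from the question.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of DSpace.
final class AppendChildMoveDemo {

    /** Returns the root's child element names, comma-separated, in document order. */
    static String childOrder() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);

        Element higher = doc.createElement("higher-element");
        Element lower = doc.createElement("lower-element");
        root.appendChild(higher);
        root.appendChild(lower);
        // Appending the same node again does not duplicate it; it moves it to the end
        root.appendChild(higher);

        StringBuilder names = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childOrder());
    }
}
```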

Leave entities as-is when parsing XML with Woodstox

I'm using Woodstox to process an XML that contains some entities (most notably &gt;) in the value of one of the nodes. To use an extreme example, it's something like this:
<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>
I have tried a lot of different configuration options for both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES...) and WstxOutputFactory, but no matter what I try, the output is always something like this:
<parent>nbsp; &lt; nbsp; > & " ' nbsp;</parent>
(&gt; gets converted to >, &lt; stays the same, &nbsp; loses the &...)
I'm reading the XML with an XMLEventReader created with
XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));
after configuring the WstxInputFactory.
Is there any way to configure Woodstox to just ignore all entities and output the text exactly as it was in the input String?

First of all, you need to include actual code: "output is always something like this" makes no sense without explaining exactly how you are outputting the parsed content. You may be printing events, using some library, or perhaps using a Woodstox stream or event writer.
Second: in XML there is a difference between the small number of pre-defined entities (lt, gt, apos, quot, amp) and arbitrary user-defined entities, which is what nbsp would be here. The former you can use as-is, since they are already defined; the latter only exist if you define them in a DTD.
Handling of the two groups is different, too: the former will always be expanded no matter what, as required by the XML specification. The latter will be resolved (unless resolution is disabled) and then expanded; if they are not defined, an exception will be thrown.
You can also specify a custom resolver, as mentioned by the other answer, but this will only be used for custom entities (here, &nbsp;).
Finally, it helps to explain what you are trying to achieve, not just what you are doing. That makes it easier to suggest a suitable approach than a specific "how do I do X" question, where X may not be the right way to go about it.
And as to configuration of Woodstox, maybe this blog entry:
https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173
will help (as well as 2 others in the series) -- it covers existing configuration settings.
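The point about the five pre-defined entities always being expanded can be checked with a few lines of plain StAX. This sketch uses whatever javax.xml.stream implementation XMLInputFactory.newInstance() finds (the JDK's built-in one unless Woodstox is on the classpath); the class name PredefinedEntityDemo is made up for illustration. It deliberately leaves out &nbsp;, which would need a DTD declaration or a resolver.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class.
final class PredefinedEntityDemo {

    /** Concatenates all character data the parser reports for the given XML. */
    static String textOf(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // Text may arrive in several chunks, so append every Characters event
            if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // The parser hands back the expanded characters, not the entity references
        System.out.println(textOf("<parent>&lt; &gt; &amp; &quot; &apos;</parent>"));
    }
}
```

By the time the application sees the text, the original spelling (&lt; versus a literal character) is gone, so re-escaping on output is the writer's decision rather than something the reader can preserve.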

The five basic XML entities (quot, amp, apos, lt, gt) will always be processed. As far as I know, there is no way to get their source form with SAX.
For the other entities you can process them manually. You can capture the events until the end of the element and concatenate the values:
XMLInputFactory factory = WstxInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
XMLEventReader xmlr = factory.createXMLEventReader(
this.getClass().getResourceAsStream(xmlFileName));
String value = "";
while (xmlr.hasNext()) {
XMLEvent event = xmlr.nextEvent();
if (event.isCharacters()) {
value += event.asCharacters().getData();
}
if (event.isEntityReference()) {
value += "&" + ((EntityReference) event).getName() + ";";
}
if (event.isEndElement()) {
// Assign it to the right variable
System.out.println(value);
value = "";
}
}
For your example input:
<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>
The output will be:
&nbsp; < &nbsp; > & " ' &nbsp;
Otherwise, if you want to convert all the entities, you could use a custom XMLResolver for undeclared entities:
public class NaiveHtmlEntityResolver implements XMLResolver {
private static final Map<String, String> ENTITIES = new HashMap<>();
static {
ENTITIES.put("nbsp", " ");
ENTITIES.put("apos", "'");
ENTITIES.put("quot", "\"");
// and so on
}
@Override
public Object resolveEntity(String publicID,
String systemID,
String baseURI,
String namespace) throws XMLStreamException {
if (publicID == null && systemID == null) {
return ENTITIES.get(namespace);
}
return null;
}
}
And then tell Woodstox to use it for the undeclared entities:
factory.setProperty(WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER, new NaiveHtmlEntityResolver());

Can't make getElementById() method working correctly in a java serializer program

I'm writing an XML serializer with JAXP. I'm receiving pseudo random data from a JAR and I'm building the DOM tree. I have to check if I already inserted the same Element into the tree; in order to perform this control I'm trying to use the method:
Element e = myDocument.getElementById(ao.getId());
if (e == null) {
// element is not a duplicate
access.appendChild(authorizationObject);
}else{
// element already in the tree
}
So, in every Element I create before adding them to the tree I set:
ao = a.getAuthorizationObject();
authorizationObject = myDocument.createElement("authorizationobject");
authorizationObject.setAttribute("id", ao.getId());
authorizationObject.setIdAttribute("id", true);
It can happen that the object ao gives me the same element twice (or more).
The problem is that the program always enters the if branch.
You can find all the program's code here and the DTD here for your reference.
What am I doing wrong?
Thanks in advance for all your reply.

You have forgotten to append the authorizationObject to the access Element. Your code should be as follows
authorizationObject = myDocument.createElement("authorizationobject");
authorizationObject.setAttribute("id", ao.getId());
authorizationObject.setIdAttribute("id", true);
System.out.println("AO.ID = " + ao.getId());
access.appendChild(authorizationObject);
// then only this Element will be appended to the document
if (myDocument.getElementById(ao.getId()) == null ) {
I see that you do eventually append the authorization object to the document, but that should happen before the document.getElementById() call.
Hope this helps!
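The underlying behaviour can be shown in isolation. With the JDK's default DOM implementation, as far as I can tell, getElementById only returns an element once it is attached to the tree rooted at the document, even if setIdAttribute has already been called. In this sketch the class name ElementByIdDemo and the id value "ao-1" are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class.
final class ElementByIdDemo {

    /** Returns { found before append, found after append } for the same id. */
    static boolean[] lookupBeforeAndAfterAppend() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element access = doc.createElement("access");
        doc.appendChild(access); // the parent itself must be part of the document

        Element ao = doc.createElement("authorizationobject");
        ao.setAttribute("id", "ao-1");
        ao.setIdAttribute("id", true); // mark "id" as an ID-typed attribute

        boolean before = doc.getElementById("ao-1") != null; // element still detached
        access.appendChild(ao);
        boolean after = doc.getElementById("ao-1") != null;  // element now in the tree
        return new boolean[] {before, after};
    }

    public static void main(String[] args) throws Exception {
        boolean[] found = lookupBeforeAndAfterAppend();
        System.out.println("before append: " + found[0] + ", after append: " + found[1]);
    }
}
```

The same applies one level up: if the access element is not yet attached to the document, elements appended to it will not be found either.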

Get annotations from ObjectPropertyAssertion OWLAPI

I'm using the OWL API for OWL 2.0 and there is one thing I can't seem to figure out. I have an OWL/XML file and I would like to retrieve the annotations for my object property assertions. Here are snippets from my OWL/XML and Java code:
OWL:
<ObjectPropertyAssertion>
<Annotation>
<AnnotationProperty abbreviatedIRI="rdfs:comment"/>
<Literal datatypeIRI="http://www.w3.org/2001/XMLSchema#string">Bob likes sushi</Literal>
</Annotation>
<ObjectProperty IRI="#Likes"/>
<NamedIndividual IRI="#UserBob"/>
<NamedIndividual IRI="#FoodSushi"/>
</ObjectPropertyAssertion>
Java:
OWLIndividual bob = manager.getOWLDataFactory().getOWLNamedIndividual(IRI.create(base + "#UserBob"));
OWLObjectProperty likes = manager.getOWLDataFactory().getOWLObjectProperty(IRI.create(base + "#Likes"));
OWLIndividual sushi = factory.getOWLNamedIndividual(IRI.create(base + "#FoodSushi"));
OWLObjectPropertyAssertionAxiom ax = factory.getOWLObjectPropertyAssertionAxiom(likes, bob, sushi);
for(OWLAnnotation a: ax.getAnnotations()){
System.out.println(a.getValue());
}
Problem is, nothing gets returned even though the OWL states there is one rdfs:comment. It has been troublesome to find any documentations on how to retrieve this information. Adding axioms with comments or whatever is not an issue.

To retrieve the annotations, you need to walk over the axioms of interest. The data factory's get...() methods create new objects (here, an axiom without annotations) rather than looking up the axiom stored in the ontology, so, as noted in the comments, you cannot retrieve your annotated axiom that way. Here is the code, adapted from the OWL-API guide:
//Get rdfs:comment
final OWLAnnotationProperty comment = factory.getRDFSComment();
//Create a walker
OWLOntologyWalker walker =
new OWLOntologyWalker(Collections.singleton(ontology));
//Define what's going to be visited
OWLOntologyWalkerVisitor<Object> visitor =
new OWLOntologyWalkerVisitor<Object>(walker) {
//In your case you visit the annotations made with rdfs:comment
//over the object property assertions
@Override
public Object visit(OWLObjectPropertyAssertionAxiom axiom) {
//Print them
System.out.println(axiom.getAnnotations(comment));
return null;
}
};
//Walks over the structure - triggers the walk
walker.walkStructure(visitor);

Can I put all namespace definitions on the top-level element with JAXB

Using handcrafted code my xml was like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#"
xmlns:ext="http://musicbrainz.org/ns/ext-1.0#">
<artist-list offset="0" count="8">
<artist type="Person" id="00ed154e-8679-42f0-8f42-e59bd7e185af"
ext:score="100">
Now I'm using JAXB, which is much better. Although the XML is perfectly valid, I need to force it to put xmlns:ext="http://musicbrainz.org/ns/ext#-1.0" on the metadata element rather than the artist element, for compatibility with client code that I have no control over.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#">
<artist-list offset="0" count="4">
<artist type="Person" id="00ed154e-8679-42f0-8f42-e59bd7e185af"
ext:score="100" xmlns:ext="http://musicbrainz.org/ns/ext#-1.0">
Can this be done, please?
EDIT: Worked around it with String.replace, because I only have to deal with one specific case:
String xml = sw.toString();
//Remove extension namespace definition (including the leading space and closing quote)
xml=xml.replace(" xmlns:ext=\"http://musicbrainz.org/ns/ext#-1.0\"","");
//Add it to the top instead
xml=xml.replace("<metadata xmlns=\"http://musicbrainz.org/ns/mmd-1.0#\">",
"<metadata xmlns=\"http://musicbrainz.org/ns/mmd-1.0#\" xmlns:ext=\"http://musicbrainz.org/ns/ext-1.0#\">");
//Now write out to the proper output stream
out.write(xml);

I don't think there's a way to do it using JAXB, but here's a quick post-processor using Dom4J:
public static void moveNameSpacesToRoot(Document document) {
final Element rootElement = document.getRootElement();
moveNameSpacesToRootElement(rootElement, rootElement);
}
@SuppressWarnings("unchecked")
private static void moveNameSpacesToRootElement(
Element thisElement, Element rootElement) {
if (!thisElement.equals(rootElement)) {
Namespace namespace = thisElement.getNamespace();
if (!namespace.equals(Namespace.NO_NAMESPACE)) {
Namespace existingRootNamespace =
rootElement.getNamespaceForPrefix(namespace.getPrefix());
if (existingRootNamespace == null) {
rootElement.add(namespace);
}
thisElement.remove(namespace);
}
}
for (Element child : (List<Element>) thisElement.elements()) {
moveNameSpacesToRootElement(child, rootElement);
}
}
Oh, I just realized that you need attributes, not elements. However, the change is trivial, so I'll leave that for you.

There is at least no documented feature in JAXB to control on which element the namespace prefix declaration should be placed. You should however be aware that the two XML snippets are semantically identical (it does not matter if the namespace prefix is declared on the same node or on any ancestor), so you should opt to fix the broken client code or get someone with control of the client code to fix it.
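The equivalence claimed above can be demonstrated with a namespace-aware parser: wherever the xmlns:ext declaration sits, the ext:score attribute resolves to the same namespace URI and local name. This is a sketch using the JDK's DOM parser; the class name NamespacePlacementDemo is made up for illustration, and both sample documents use the same URI (http://musicbrainz.org/ns/ext-1.0#) for the prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class.
final class NamespacePlacementDemo {
    static final String MMD = "http://musicbrainz.org/ns/mmd-1.0#";
    static final String EXT = "http://musicbrainz.org/ns/ext-1.0#";

    /** Reads the ext:score attribute of the first artist by namespace URI, not by prefix. */
    static String score(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // off by default in JAXP
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Element artist = (Element) doc.getElementsByTagNameNS(MMD, "artist").item(0);
        return artist.getAttributeNS(EXT, "score");
    }

    public static void main(String[] args) throws Exception {
        String onRoot = "<metadata xmlns=\"" + MMD + "\" xmlns:ext=\"" + EXT + "\">"
                + "<artist ext:score=\"100\"/></metadata>";
        String onArtist = "<metadata xmlns=\"" + MMD + "\">"
                + "<artist ext:score=\"100\" xmlns:ext=\"" + EXT + "\"/></metadata>";
        // Both documents yield the same value for the namespaced attribute
        System.out.println(score(onRoot) + " / " + score(onArtist));
    }
}
```

A client that reads the attribute this way is unaffected by where JAXB places the declaration; only a client that matches on raw text or prefixes would notice the difference.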
