I'd like to take an XML file, heavily structured and about half a gigabyte in size, and create from it another XML file containing only selected elements of the original one.
1) How can I do that?
2) Can it be done with a DOM parser? What is the size limit of the DOM parser?
Thanks!
If you have a very large source XML (like your 0.5 GB file), and wish to extract information from it, possibly creating a new XML, you might consider using an event-based parser which does not require loading the entire XML in memory. The simplest of these implementations is the SAX parser, which requires that you write an event listener which will capture events like document-start, element-start, element-end, etc, where you can inspect the data you are reading (the name of the element, the attributes, etc.) and decide if you are going to ignore it or do something with the data.
Search for a SAX tutorial using JAXP and you should find several examples. Another strategy which you might want to consider, depending on what you want to do is StAX.
Here is a simple example using SAX to read data from an XML file and extract some information based on search criteria. It's a very simple example I use to teach SAX processing, and I think it might help you understand how it works. The search criteria are hardwired and consist of names of movie directors to search for in a giant XML file with a movie selection generated from IMDB data.
XML Source example ("source.xml" ~300MB file)
<Movies>
...
<Movie>
<Imdb>tt1527186</Imdb>
<Title>Melancholia</Title>
<Director>Lars von Trier</Director>
<Year>2011</Year>
<Duration>136</Duration>
</Movie>
<Movie>
<Imdb>tt0060390</Imdb>
<Title>Fahrenheit 451</Title>
<Director>François Truffaut</Director>
<Year>1966</Year>
<Duration>112</Duration>
</Movie>
<Movie>
<Imdb>tt0062622</Imdb>
<Title>2001: A Space Odyssey</Title>
<Director>Stanley Kubrick</Director>
<Year>1968</Year>
<Duration>160</Duration>
</Movie>
...
</Movies>
Here is an example of an event handler. It selects the Movie elements by matching strings. I extended DefaultHandler and implemented startElement() (called when an opening tag is found), characters() (called when a block of characters is read), endElement() (called when an end tag is found) and endDocument() (called once, when the document has finished). Since the data that is read is not retained in memory, you have to save the data you are interested in yourself. I used some boolean flags and instance variables to save the current tag, current data, etc.
class ExtractMovieSaxHandler extends DefaultHandler {
// These are some parameters for the search which will select
// the subtrees (they will receive data when we set up the parser)
private String tagToMatch;
private String tagContents; // OR match
private boolean strict = false; // if strict matches will be exact
/**
* Sets criteria to select and copy Movie elements from source XML.
*
* @param tagToMatch Must contain text only
* @param tagContents Text contents of the tag
* @param strict If true, match must be exact
*/
public void setSearchCriteria(String tagToMatch, String tagContents, boolean strict) {
this.tagToMatch = tagToMatch;
this.tagContents = tagContents;
this.strict = strict;
}
// These are the temporary values we store as we parse the file
private String currentElement;
private StringBuilder contents = null; // if not null we are in Movie tag
private String currentData;
List<String> result = new ArrayList<String>(); // store resulting nodes here
private boolean skip = false;
...
These methods are the implementation of the ContentHandler. The first one detects an element was found (start tag). We save the name of the tag (child of Movie) in a variable, because it might be one we use in the search:
...
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
// Store the current element that started now
currentElement = qName;
// If this is a Movie tag, save the contents because we might need it
if (qName.equals("Movie")) {
contents = new StringBuilder();
}
}
...
This one is called every time a block of characters is read. We check whether those characters occur inside an element that interests us. If they do, we compare the contents with the search criteria and flag the current Movie to be skipped if they do not match.
...
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
// if we discovered that we don't need this data, we skip it
if (skip || currentElement == null) {
return;
}
// If we are inside the tag we want to search, save the contents
currentData = new String(ch, start, length);
if (currentElement.equals(tagToMatch)) {
boolean discard = true;
if (strict) {
if (currentData.equals(tagContents)) { // exact match
discard = false;
}
} else {
if (currentData.toLowerCase().indexOf(tagContents.toLowerCase()) >= 0) { // matches occurrence of substring
discard = false;
}
}
if (discard) {
skip = true;
}
}
}
...
This is called when an end tag is found. We can now append it to the document we are building in memory if we wish.
...
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
// Rebuild the XML if it's a node we didn't skip
if (qName.equals("Movie")) {
if (!skip) {
result.add(contents.insert(0, "<Movie>").append("</Movie>").toString());
}
// reset the variables so we can check the next node
contents = null;
skip = false;
} else if (contents != null && !skip) {
contents.append("<").append(qName).append(">")
.append(currentData)
.append("</").append(qName).append(">");
}
currentElement = null;
}
...
Finally, this one is called when the document ends. I also used it to print the result at the end.
...
@Override
public void endDocument() throws SAXException {
StringBuilder resultFile = new StringBuilder();
resultFile.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
resultFile.append("<Movies>");
for (String childNode : result) {
resultFile.append(childNode.toString());
}
resultFile.append("</Movies>");
System.out.println("=== Resulting XML containing Movies where " + tagToMatch + " is one of " + tagContents + " ===");
System.out.println(resultFile.toString());
}
}
Here is a small Java application which loads that file, and uses an event handler to extract the data.
public class SAXReaderExample {
public static final String PATH = "src/main/resources"; // this is where I put the XML file
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
// Obtain XML Reader
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader reader = sp.getXMLReader();
// Instantiate SAX handler
ExtractMovieSaxHandler handler = new ExtractMovieSaxHandler();
// set search criteria
handler.setSearchCriteria("Director", "Kubrick", false);
// Register handler with XML reader
reader.setContentHandler(handler);
// Parse the XML
reader.parse(new InputSource(new FileInputStream(new File(PATH, "source.xml"))));
}
}
Here is the resulting file, after processing:
<?xml version="1.0" encoding="UTF-8"?>
<Movies>
<Movie>
<Imdb>tt0062622</Imdb>
<Title>2001: A Space Odyssey</Title>
<Director>Stanley Kubrick</Director>
<Year>1968</Year>
<Duration>160</Duration>
</Movie>
<Movie>
<Imdb>tt0066921</Imdb>
<Title>A Clockwork Orange</Title>
<Director>Stanley Kubrick</Director>
<Year>1972</Year>
<Duration>136</Duration>
</Movie>
<Movie>
<Imdb>tt0081505</Imdb>
<Title>The Shining</Title>
<Director>Stanley Kubrick</Director>
<Year>1980</Year>
<Duration>144</Duration>
</Movie>
...
</Movies>
Your scenario might be different, but this example shows a general solution which you can probably adapt to your problem. You can find more information in tutorials about SAX and JAXP.
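The StAX strategy mentioned earlier can handle the same selection while writing matches straight to the output, so only one Movie is buffered at a time instead of the whole result list. The following is a minimal sketch under assumptions, not a definitive implementation: the class name StaxMovieFilter and the method filter are hypothetical, the Movies/Movie/Director layout is taken from the sample above, and the match is a case-insensitive substring test on Director only.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies matching Movie elements from one stream to another
public class StaxMovieFilter {

    public static void filter(InputStream in, OutputStream out, String needle)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();
        String wanted = needle.toLowerCase(Locale.ROOT);

        List<XMLEvent> movie = null;     // events of the current Movie, null outside a Movie
        StringBuilder director = null;   // text of the current Director, null outside it
        boolean keep = false;            // true once the current Movie matched

        writer.add(events.createStartElement("", "", "Movies"));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (name.equals("Movie")) {
                    movie = new ArrayList<>();
                    keep = false;
                } else if (name.equals("Director")) {
                    director = new StringBuilder();
                }
            }
            // Buffer everything that belongs to the current Movie, tags included
            if (movie != null) movie.add(event);
            // Text may arrive in several events, so append rather than assign
            if (event.isCharacters() && director != null) {
                director.append(event.asCharacters().getData());
            }
            if (event.isEndElement()) {
                String name = event.asEndElement().getName().getLocalPart();
                if (name.equals("Director") && director != null) {
                    keep |= director.toString().toLowerCase(Locale.ROOT).contains(wanted);
                    director = null;
                } else if (name.equals("Movie") && movie != null) {
                    if (keep) {
                        for (XMLEvent buffered : movie) writer.add(buffered);
                    }
                    movie = null;
                }
            }
        }
        writer.add(events.createEndElement("", "", "Movies"));
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

To use it on real files, pass a FileInputStream for the source and a FileOutputStream for the result, for example filter(in, out, "Kubrick"). The sketch does not write an XML declaration; the output is plain UTF-8.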
500 MB is well within the limits of what can be achieved using XSLT. It depends a little on how much effort you want to spend developing an optimal solution: which is more expensive, your time or the machine's time?
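As a rough illustration of the XSLT route, here is a minimal sketch using the javax.xml.transform API that ships with the JDK. The class name XsltMovieFilter, the stylesheet and the director parameter are my own assumptions, and the Movies/Movie/Director layout is borrowed from the SAX answer above. As far as I know, the JDK's built-in XSLT 1.0 processor holds the whole input in memory, so a file of this size would probably need a generous heap.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: selects Movie elements with an XSLT 1.0 stylesheet
public class XsltMovieFilter {

    // Copies only the Movie elements whose Director contains the $director parameter
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:param name='director'/>"
        + "<xsl:template match='/Movies'>"
        + "<Movies><xsl:copy-of select='Movie[contains(Director, $director)]'/></Movies>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String filter(String xml, String director) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setParameter("director", director);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

For files, replace the string-based source and result with new StreamSource(file) and new StreamResult(file). Note that contains() is case-sensitive in XSLT 1.0.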
Related
I want to read an XML file in Java and then update certain elements in that file with new values. My file is > 200mb and performance is important, so the DOM model cannot be used.
I feel that a StaX Parser is the solution, but there is no decent literature on using Java StaX to read and then write XML back to the same file.
(For reference I have been using the java tutorial and this helpful tutorial to get what I have so far)
I am using Java 7, but there doesn't seem to be any updates to the XML parsing API since...a long time ago. So this probably isn't relevant.
Currently I have this:
public static String readValueFromXML(final File xmlFile, final String value) throws FileNotFoundException, XMLStreamException
{
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(new FileReader(xmlFile));
String found = "";
boolean read = false;
while (reader.hasNext())
{
XMLEvent event = reader.nextEvent();
if (event.isStartElement() &&
event.asStartElement().getName().getLocalPart().equals(value))
{
read = true;
}
if (event.isCharacters() && read)
{
found = event.asCharacters().getData();
break;
}
}
return found;
}
which will read the XMLFile and return the value of the selected element. However, I have another method updateXMLFile(final File xmlFile, final String value) which I want to use in conjunction with this.
So my question is threefold:
1) Is there a StAX implementation for editing XML?
2) Will XPath be any help? Can it be used without converting my file to a Document?
3) (More generally) Why doesn't Java have a better XML API?
There are two things you may want to look at. The first is to use JAXB to bind the XML to POJOs which you can then have your way with and serialize the structure back to XML when needed.
The second is a JDBC driver for XML, there are several available for a fee, not sure if there are any open source ones or not. In my experience JAXB is the better choice. If the XML file is too large to handle efficiently with JAXB I think you need to look at using a database as a replacement for the XML file.
This is my approach, which reads events from the file using StaX and writes them to another file. The values are updated as the loop passes over the correctly named elements.
public void read(String key, String value)
{
try (FileReader fReader = new FileReader(inputFile); FileWriter fWriter = new FileWriter(outputFile))
{
XMLEventFactory factory = XMLEventFactory.newInstance();
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(fReader);
XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(fWriter);
// Declared outside the loop so the flag survives from the start tag to its text event
boolean update = false;
while (reader.hasNext())
{
XMLEvent event = reader.nextEvent();
if (event.isStartElement() && event.asStartElement().getName().getLocalPart().equals(key))
{
update = true;
}
else if (event.isCharacters() && update)
{
Characters characters = factory.createCharacters(value);
event = characters;
update = false;
}
writer.add(event);
}
}
catch (XMLStreamException | FactoryConfigurationError | IOException e)
{
e.printStackTrace();
}
}
I'm reading in an XML configuration file that I don't control the format of, and the data I need is in the last element. Unfortunately, that element is a base64 encoded serialised Java class (yes, I know) that is 31200 characters in length.
Some experimenting seems to show that not only can the Java XML/XPath libraries not see the value in this element (they silently set the value to a blank string), if I just read the file into a string and print it out to console, everything (even a closing element on the next line) gets printed, but not this one element.
Finally, if I manually go into the file and break the line into rows, Java can see the line, although this obviously breaks XML parsing and deserialisation. It also isn't practical as I want to make a tool that will work across many such files.
Is there some line length limit in Java that stops this working? Can I get around it with a third party library?
EDIT: here's the XML-related code:
FileInputStream fstream = new FileInputStream("path/to/xml/file.xml");
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document d = db.parse(fstream);
String s = XPathFactory.newInstance().newXPath().compile("//el1").evaluate(d);
For reading a large XML file, you can use a SAX parser.
In addition, the text delivered to the characters() callback of the SAX handler should be accumulated in a StringBuffer (or StringBuilder) rather than assigned to a String.
You can check out the SAX parser here.
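To make the accumulation point concrete, here is a minimal sketch of a handler that appends every characters() chunk to a StringBuilder. The names LongElementReader, TextCollector and readElementText are hypothetical, and I am assuming the element of interest contains text only. It shows the technique; it is not a diagnosis of why the original XPath call returned a blank string.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reads the full text of one (possibly very long) element
public class LongElementReader {

    /** Collects the complete text of the first element with the given name. */
    static class TextCollector extends DefaultHandler {
        private final String elementName;
        private final StringBuilder text = new StringBuilder();
        private boolean inside = false;
        private boolean done = false;

        TextCollector(String elementName) {
            this.elementName = elementName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (!done && qName.equals(elementName)) inside = true;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one element's text in several chunks: append them all
            if (inside) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (inside && qName.equals(elementName)) {
                inside = false;
                done = true;
            }
        }

        String result() {
            return done ? text.toString() : null;
        }
    }

    /** Returns the element's text, or null if no such element was found. */
    public static String readElementText(InputSource source, String elementName)
            throws ParserConfigurationException, SAXException, IOException {
        TextCollector collector = new TextCollector(elementName);
        SAXParserFactory.newInstance().newSAXParser().parse(source, collector);
        return collector.result();
    }
}
```

With the file from the question this might be called as readElementText(new InputSource(new FileInputStream("path/to/xml/file.xml")), "el1").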
I wondered if it might be possible to do some pre-processing to the XML as you read it in.
I've been having a play to see if I could break down the long element into a list of sub-elements. Then this could be parsed and the sub-elements could be built back into a string. My testing threw up the fact that my initial guess of 4500 characters per sub element was still a bit high for my XML parsing to cope with, so I just arbitrarily picked 1000 and it seems to cope with that.
Anyway, this might help, it might not, but here's what I came up with:
private static final String ELEMENT_TO_BREAK_UP_OPEN = "<element>";
private static final String ELEMENT_TO_BREAK_UP_CLOSE = "</element>";
private static final String SUB_ELEMENT_OPEN = "<subelement>";
private static final String SUB_ELEMENT_CLOSE = "</subelement>";
private static final int SUB_ELEMENT_SIZE_LIMIT = 1000;
public static void main(final String[] args) {
try {
/* The XML currently looks like this:
*
* <root>
* <element> ... Super long input with 30000+ characters ... </element>
* </root>
*
*/
final File file = new File("src\\main\\java\\longxml\\test.xml");
final BufferedReader reader = new BufferedReader(new FileReader(file));
final StringBuffer buffer = new StringBuffer();
String line = reader.readLine();
while( line != null ) {
if( line.contains(ELEMENT_TO_BREAK_UP_OPEN) ) {
buffer.append(ELEMENT_TO_BREAK_UP_OPEN);
String substring = line.substring(ELEMENT_TO_BREAK_UP_OPEN.length(), (line.length() - ELEMENT_TO_BREAK_UP_CLOSE.length()) );
while( substring.length() > SUB_ELEMENT_SIZE_LIMIT ) {
buffer.append(SUB_ELEMENT_OPEN);
buffer.append( substring.substring(0, SUB_ELEMENT_SIZE_LIMIT) );
buffer.append(SUB_ELEMENT_CLOSE);
substring = substring.substring(SUB_ELEMENT_SIZE_LIMIT);
}
if( substring.length() > 0 ) {
buffer.append(SUB_ELEMENT_OPEN);
buffer.append(substring);
buffer.append(SUB_ELEMENT_CLOSE);
}
buffer.append(ELEMENT_TO_BREAK_UP_CLOSE);
}
else {
buffer.append(line);
}
line = reader.readLine();
}
reader.close();
/* The XML now looks something like this:
*
* <root>
* <element>
* <subelement> ... First Part of Data ... </subelement>
* <subelement> ... Second Part of Data ... </subelement>
* ... Multiple Other subelements of Data ..
* <subelement> ... Final Part of Data ... </subelement>
* </element>
* </root>
*/
//This parses the xml with the new subElements in
final InputSource src = new InputSource(new StringReader(buffer.toString()));
final Node document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(src).getFirstChild();
//This gives us the first child (element), then its children (subelements)
final NodeList childNodes = document.getFirstChild().getChildNodes();
//Then concatenate them back into a big string.
final StringBuilder finalElementValue = new StringBuilder();
for( int i = 0; i < childNodes.getLength(); i++ ) {
final Node node = childNodes.item(i);
finalElementValue.append( node.getFirstChild().getNodeValue() );
}
//At this point do whatever you need to do. Decode, Deserialize, etc...
System.out.println(finalElementValue.toString());
}
catch (final Exception e) {
e.printStackTrace();
}
}
There are a few issues with this in terms of its general application:
It does rely on the element you want to break up being uniquely identifiable. (But I'm guessing the logic to find the element can be improved quite a bit)
It relies on knowing the format of the XML and hoping that doesn't change. (Only in the latter parsing section, you could potentially parse it better with xPath once it has been broken into subelements)
Having said all of that, you do end up with a parsable XML string, which you can build your encoded string from, so this might help you on your way to a solution.
I have some XML that I am reading; here it is.
<application>
<client>website</client>
<register>
<name>
<first>Tommy</first>
<second>Jay</second>
</name>
<address>
<firstLine>line1</firstLine>
<secondLine>line2</secondLine>
<city>city1</city>
<county>county1</county>
<postcode>YY12 9UY</postcode>
</address>
</register>
</application>
Anyway when I read it with the xmlStreamReader as below
public XMLElementALT getNextElement()
{
element = new XMLElementALT();
int event;
try
{
event = reader.next();
}
catch (XMLStreamException ex)
{
return null;
}
if (event == XMLStreamConstants.START_ELEMENT)
{
element.setTag(reader.getLocalName());
}
else if (event == XMLStreamConstants.CHARACTERS)
{
element.setAttribute(reader.getText());
}
else if (event == XMLStreamConstants.END_ELEMENT)
{
element.setEndTag(reader.getLocalName());
}
else if (event == XMLStreamConstants.END_DOCUMENT)
{
element.setFinished();
}
return element;
}
This all goes well! However, the problem I have is that after reading the tag, the next event I get is XMLStreamConstants.CHARACTERS, which reports the attribute ("\n ") that is the whitespace between the tag and the next tag. How can I remove this? I want the next event to be XMLStreamConstants.START_ELEMENT. I know I could put my XML all on one line, but I like to have the gaps when I input it so that I can see the structure. I also have an XSD to validate against, and this validates the XML successfully; is there something I can do in the XSD to make it remove the spaces?
Thanks
You can ignore CHARACTERS events that contain only whitespace, either within your getNextElement method or by using a filter when you create the reader:
XMLInputFactory factory = XMLInputFactory.newFactory();
XMLStreamReader rawReader = factory.createXMLStreamReader(...);
XMLStreamReader filteredReader = factory.createFilteredReader(rawReader,
new StreamFilter() {
public boolean accept(XMLStreamReader r) {
return !r.isWhiteSpace();
}
});
The isWhiteSpace method returns true if the current event is a CHARACTERS event consisting entirely of whitespace. It returns false if it's not a CHARACTERS event, or if it is CHARACTERS but not all white space.
However, it is important to note that an XMLStreamReader is not guaranteed to return all the text content of an element in one single CHARACTERS event, it is allowed to give you several separate blocks of characters which you must concatenate together yourself.
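One way to deal with split text is to let the reader do the concatenation. The sketch below is an illustration under assumptions rather than the only approach: the class ElementTextReader and the method firstText are hypothetical names, and it assumes the element holds text only. It turns on the standard IS_COALESCING property and reads the value with getElementText().

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: returns the whole text of an element in one piece
public class ElementTextReader {

    /** Returns the text of the first element named tag, or null if it is not found. */
    public static String firstText(String xml, String tag) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Ask the parser to merge adjacent character data into a single event
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(tag)) {
                    // Concatenates all text up to the matching end tag;
                    // throws XMLStreamException if a child element is encountered
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

For the XML in the question, firstText(xml, "postcode") would return the postcode text regardless of how many CHARACTERS events the parser produced for it.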
I am getting XML with the following tags. I read the XML file in Java using a SAX parser and save the values to a database, but there seem to be spaces after the p tag, as shown below.
<Inclusions><![CDATA[<p> </p><ul> <li>Small group walking tour</li> <li>Entrance fees</li> <li>Professional guide </li> <li>Guaranteed to skip the long lines</li> <li>Headsets to hear the guide clearly</li> </ul>
<p></p>]]></Inclusions>
But when we insert the read string into the database (PostgreSQL 8), it prints bad characters like the ones below for those spaces.
\011\011\011\011\011\011\011\011\011\011\011\011 Small
group walking tour Entrance fees Professional guide
Guaranteed to skip the long lines Headsets to hear
the guide clearly \012\011\011\011\011\011
I want to know why it is printing bad characters (\011\011) like that.
What is the best way to remove spaces inside XML tags with Java? (Or how can I prevent those bad characters?)
I have checked samples, and most of them are in Python.
This is how the XML is read with SAX in my program:
Method 1
// ResultHandler is the class that used to read the XML.
ResultHandler handler = new ResultHandler();
// Use the default parser
SAXParserFactory factory = SAXParserFactory.newInstance();
// Retrieve the XML file
FileInputStream in = new FileInputStream(new File(inputFile)); // input file is XML.
// Parse the XML input
SAXParser saxParser = factory.newSAXParser();
saxParser.parse( in , handler);
This is the ResultHandler class used by Method 1 to read the XML with the SAX parser:
import org.apache.log4j.Logger;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
// other imports
class ResultHandler extends DefaultHandler {
public void startDocument ()
{
logger.debug("Start document");
}
public void endDocument ()
{
logger.debug("End document");
}
public void startElement(String namespaceURI, String localName, String qName, Attributes attribs)
throws SAXException {
strValue = "";
// add logic with start of tag.
}
public void characters(char[] ch, int start, int length)
throws SAXException {
//logger.debug("characters");
strValue += new String(ch, start, length);
//logger.debug("strValue-->"+strValue);
}
public void endElement(String namespaceURI, String localName, String qName)
throws SAXException {
// add logic to end of tag.
}
}
So I need to know how to set setIgnoringElementContentWhitespace(true), or something similar, with the SAX parser.
You can try to set for your DocumentBuilderFactory
setIgnoringElementContentWhitespace(true)
because of this:
Due to reliance on the content model this setting requires the parser
to be in validating mode
you also need to set
setValidating(true)
Alternatively, str = str.replaceAll("\\s+", ""); might work as well, although it removes all whitespace, including the spaces between words.
I'm also looking for an exact answer, but I think this will help you.
These are characters in C/Modula-3 octal notation; their meanings are listed in this link.
It says
\011 is for Horizontal tab (ASCII HT)
\012 is for Line feed (ASCII NL, newline)
You can replace multiple spaces with one space as follows
str = str.replaceAll("\\s+", " ");
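To tie the octal codes to the cleanup step, here is a small self-contained sketch. The class name WhitespaceCleaner and the sample string are made up for illustration; the string imitates the tabs (\011) and line feeds (\012) seen in the database output above.

```java
// Hypothetical helper: normalizes whitespace in text read from XML
public class WhitespaceCleaner {

    /** Collapses every run of whitespace (tabs, newlines, spaces) to one space and trims the ends. */
    public static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // \011 is a horizontal tab and \012 is a line feed, written as Java octal escapes
        String raw = "\011\011Small group walking tour\012\011\011Entrance fees\012";
        System.out.println("[" + normalize(raw) + "]");
        // prints [Small group walking tour Entrance fees]
    }
}
```

Calling normalize() on the accumulated text in endElement(), just before the database insert, would be one place to apply it.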
xml looks like so:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
I'm using stax to process one "<statement>" at a time and I got that working. I need to get that entire statement node as a string so I can create "123.xml" and "456.xml" or maybe even load it into a database table indexed by account.
using this approach: http://www.devx.com/Java/Article/30298/1954
I'm looking to do something like this:
String statementXml = staxXmlReader.getNodeByName("statement");
//load statementXml into database
I had a similar task and although the original question is older than a year, I couldn't find a satisfying answer. The most interesting answer up to now was Blaise Doughan's answer, but I couldn't get it running on the XML I am expecting (maybe some parameters for the underlying parser could change that?). Here is the XML, very simplified:
<many-many-tags>
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
</many-many-tags>
My solution:
public static String readElementBody(XMLEventReader eventReader)
throws XMLStreamException {
StringWriter buf = new StringWriter(1024);
int depth = 0;
while (eventReader.hasNext()) {
// peek event
XMLEvent xmlEvent = eventReader.peek();
if (xmlEvent.isStartElement()) {
++depth;
}
else if (xmlEvent.isEndElement()) {
--depth;
// reached END_ELEMENT tag?
// break loop, leave event in stream
if (depth < 0)
break;
}
// consume event
xmlEvent = eventReader.nextEvent();
// print out event
xmlEvent.writeAsEncodedUnicode(buf);
}
return buf.getBuffer().toString();
}
Usage example:
XMLEventReader eventReader = ...;
while (eventReader.hasNext()) {
XMLEvent xmlEvent = eventReader.nextEvent();
if (xmlEvent.isStartElement()) {
StartElement elem = xmlEvent.asStartElement();
String name = elem.getName().getLocalPart();
if ("description".equals(name)) {
String xmlFragment = readElementBody(eventReader);
// do something with it...
System.out.println("'" + xmlFragment + "'");
}
}
else if (xmlEvent.isEndElement()) {
// ...
}
}
Note that the extracted XML fragment will contain the complete extracted body content, including white space and comments. Filtering those on demand, or making the buffer size parametrizable have been left out for code brevity:
'
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
'
You can use StAX for this. You just need to advance the XMLStreamReader to the start element for statement. Check the account attribute to get the file name. Then use the javax.xml.transform APIs to transform the StAXSource to a StreamResult wrapping a File. This will advance the XMLStreamReader and then just repeat this process.
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
xsr.nextTag(); // Advance to statements element
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
File file = new File("out" + xsr.getAttributeValue(null, "account") + ".xml");
t.transform(new StAXSource(xsr), new StreamResult(file));
}
}
}
StAX is a low-level access API, and it has neither lookups nor methods that access content recursively. But what are you actually trying to do? And why are you considering StAX?
Beyond using a tree model (DOM, XOM, JDOM, Dom4j), which would work well with XPath, the best choice when dealing with data is usually a data-binding library like JAXB. With it you can pass a StAX or SAX reader and ask it to bind XML data into Java beans, so that instead of messing with XML you process Java objects. This is often more convenient, and it is usually quite performant.
The only trick with larger files is that you do not want to bind the whole thing at once, but rather bind each sub-tree (in your case, one 'statement' at a time).
This is most easily done by iterating a StAX XMLStreamReader, then using JAXB to bind.
I've been googling and this seems painfully difficult.
given my xml I think it might just be simpler to:
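// Assumed value; the original sketch leaves STMT_END_TAG undefined
private static final String STMT_END_TAG = "</statement>";

// Collects lines until a closing statement tag, then hands the chunk to parse().
// Note: the first chunk still starts with the <statements> opening tag.
private static void splitStatements(BufferedReader reader) throws IOException {
    StringBuilder buffer = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
        buffer.append(line).append('\n');
        if (line.trim().equals(STMT_END_TAG)) {
            parse(buffer.toString());
            buffer.setLength(0);
        }
    }
}

private static void parse(String statement) {
    // saxParser.parse(new InputSource(new StringReader(statement)), handler);
    // do stuff
    // save string
}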
Why not just use xpath for this?
You could have a fairly simple xpath to get all 'statement' nodes.
Like so:
//statement
EDIT #1: If possible, take a look at dom4j. You could read the String and get all 'statement' nodes fairly simply.
EDIT #2: Using dom4j, this is how you would do it:
(from their cookbook)
String text = "your xml here";
Document document = DocumentHelper.parseText(text);
public void bar(Document document) {
List list = document.selectNodes( "//statement" );
// loop through node data
}
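If adding dom4j is not an option, the same //statement query can be run with the XPath API bundled with the JDK. This is only a sketch under assumptions: the class StatementAccounts and the method accounts are hypothetical names, and the account attribute comes from the XML in the question. Like the dom4j version, it builds the whole document in memory, so it only suits files that fit comfortably in the heap.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the account attribute of every statement element
public class StatementAccounts {

    /** Returns the account attribute of each statement element, in document order. */
    public static List<String> accounts(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Evaluating against an InputSource parses the XML into a DOM tree internally
        NodeList nodes = (NodeList) xpath.evaluate(
                "//statement", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(((Element) nodes.item(i)).getAttribute("account"));
        }
        return result;
    }
}
```

Each Element in the NodeList could then be serialized with a Transformer if the full statement text is needed.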
I had a similar problem and found a solution.
I used the solution proposed by @t0r0X, but it does not work well with the current implementation in Java 11: xmlEvent.writeAsEncodedUnicode creates an invalid string representation of the start element (in the StartElementEvent class) in the resulting XML fragment. I had to modify it, and it now seems to work well, which I could immediately verify by parsing the fragment with DOM and unmarshalling it with JAXB into specific data containers.
In my case I had the huge structure
<Orders>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
...
</Orders>
in a file of several hundred megabytes (many repeating "SyncOrder" structures), so using DOM would lead to large memory consumption and slow evaluation. Therefore I used StAX to split the huge XML into smaller XML pieces, which I analyzed with DOM, using the JAXB elements generated from the XSD definition of the SyncOrder element (I had this infrastructure from the web service, which uses the same structure, but that is not important).
The code below shows where the XML fragment is created and can be used; I used it directly in further processing...
private static <T> List<T> unmarshallMultipleSyncOrderXmlData(
InputStream aOrdersXmlContainingSyncOrderItems,
Function<SyncOrderType, T> aConversionFunction) throws XMLStreamException, ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory locDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
locDocumentBuilderFactory.setNamespaceAware(true);
DocumentBuilder locDocBuilder = locDocumentBuilderFactory.newDocumentBuilder();
List<T> locResult = new ArrayList<>();
XMLInputFactory locFactory = XMLInputFactory.newFactory();
XMLEventReader locReader = locFactory.createXMLEventReader(aOrdersXmlContainingSyncOrderItems);
boolean locIsInSyncOrder = false;
QName locSyncOrderElementQName = null;
StringWriter locXmlTextBuffer = new StringWriter();
int locDepth = 0;
while (locReader.hasNext()) {
XMLEvent locEvent = locReader.nextEvent();
if (locEvent.isStartElement()) {
if (locDepth == 0 && Objects.equals(locEvent.asStartElement().getName().getLocalPart(), "Orders")) {
locDepth++;
} else {
if (locDepth <= 0)
throw new IllegalStateException("An invalid XML stream has been passed into the function. "
+ "Expected the element 'Orders' as the root element of the document, but found '"
+ locEvent.asStartElement().getName().getLocalPart() + "'.");
locDepth++;
if (locSyncOrderElementQName == null) {
/* First element after the "Orders" has passed, so we retrieve
* the name of the element with the namespace prefix: */
locSyncOrderElementQName = locEvent.asStartElement().getName();
}
if(Objects.equals(locEvent.asStartElement().getName(), locSyncOrderElementQName)) {
locIsInSyncOrder = true;
}
}
} else if (locEvent.isEndElement()) {
locDepth--;
if(locDepth == 1 && Objects.equals(locEvent.asEndElement().getName(), locSyncOrderElementQName)) {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
/* At this point locXmlTextBuffer.toString() returns the complete XML fragment
* containing a valid SyncOrder element. I continue with further processing,
* which immediately checks that the produced fragment is valid and passes the values
* to the communication object. The fragment has no XML declaration, so it is encoded as UTF-8: */
Document locDocument = locDocBuilder.parse(new ByteArrayInputStream(locXmlTextBuffer.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8)));
SyncOrderType locItem = unmarshallSyncOrderDomNodeToCo(locDocument);
locResult.add(aConversionFunction.apply(locItem));
locXmlTextBuffer = new StringWriter();
locIsInSyncOrder = false;
}
}
if (locIsInSyncOrder) {
if (locEvent.isStartElement()) {
/* here replaced the standard implementation of startElement's method writeAsEncodedUnicode: */
locXmlTextBuffer.write(startElementToString(locEvent.asStartElement()));
} else {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
}
}
}
return locResult;
}
private static String startElementToString(StartElement aStartElement) {
StringBuilder locStartElementBuffer = new StringBuilder();
// open element
locStartElementBuffer.append("<");
String locNameAsString = null;
if ("".equals(aStartElement.getName().getNamespaceURI())) {
locNameAsString = aStartElement.getName().getLocalPart();
} else if (aStartElement.getName().getPrefix() != null
&& !"".equals(aStartElement.getName().getPrefix())) {
locNameAsString = aStartElement.getName().getPrefix()
+ ":" + aStartElement.getName().getLocalPart();
} else {
locNameAsString = aStartElement.getName().getLocalPart();
}
locStartElementBuffer.append(locNameAsString);
// add any attributes
Iterator<Attribute> locAttributeIterator = aStartElement.getAttributes();
Attribute attr;
while (locAttributeIterator.hasNext()) {
attr = locAttributeIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(attr));
}
// add any namespaces
Iterator<Namespace> locNamespaceIterator = aStartElement.getNamespaces();
Namespace locNamespace;
while (locNamespaceIterator.hasNext()) {
locNamespace = locNamespaceIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(locNamespace));
}
// close start tag
locStartElementBuffer.append(">");
// return StartElement as a String
return locStartElementBuffer.toString();
}
private static String attributeToString(Attribute aAttr) {
if( aAttr.getName().getPrefix() != null && aAttr.getName().getPrefix().length() > 0 )
return aAttr.getName().getPrefix() + ":" + aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
else
return aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
}
public static SyncOrderType unmarshallSyncOrderDomNodeToCo(
Node aSyncOrderItemNode) {
Source locSource = new DOMSource(aSyncOrderItemNode);
Object locUnmarshalledObject = getMarshallerAndUnmarshaller().unmarshal(locSource);
SyncOrderType locCo = ((JAXBElement<SyncOrderType>) locUnmarshalledObject).getValue();
return locCo;
}