Java Plist XML Parsing - java

I'm parsing a (not well-formed) Apple plist file with Java.
My code looks like this:
InputStream in = new FileInputStream( "foo" );
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader( in );
while (parser.hasNext()){
XMLEvent event = parser.nextEvent();
//code to navigate the nodes
}
The parts I'm parsing look like this:
<dict>
<key>foo</key><integer>123</integer>
<key>bar</key><string>Boom &amp; Shroom</string>
</dict>
My problem is that nodes containing an ampersand are not parsed the way I expect, because the ampersand introduces an entity.
What can I do to get the value of the node as one complete String instead of broken parts?
Thank you in advance.

You should be able to solve your problem by setting the IS_COALESCING property on the XMLInputFactory (I also prefer XMLStreamReader over XMLEventReader, but ymmv):
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
InputStream in = // ...
XMLStreamReader xmlReader = factory.createXMLStreamReader(in, "UTF-8");
Incidentally, to the best of my knowledge none of the JDK parsers will handle "not well formed" XML without choking. Your XML is, in fact, well-formed: it uses an entity rather than a raw ampersand.
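To make the effect of IS_COALESCING concrete, here is a minimal, self-contained sketch. The class name CoalescingDemo and the helper firstElementText are made up for this illustration, and the XML is an inline string instead of the file from the question. With coalescing enabled, the text around the expanded &amp; entity should arrive as a single CHARACTERS event.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CoalescingDemo {

    /**
     * Returns the text of the first element called elementName, "" if that
     * element has no text, or null if no such element exists.
     * (Hypothetical helper, written only for this example.)
     */
    public static String firstElementText(String xml, String elementName) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to merge adjacent character data, including expanded entities.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // The event right after the start tag is either the text or the end tag.
                        return reader.next() == XMLStreamConstants.CHARACTERS ? reader.getText() : "";
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dict><key>bar</key><string>Boom &amp; Shroom</string></dict>";
        System.out.println(firstElementText(xml, "string"));
    }
}
```

Without the property, a parser is free to report the text in several pieces, which matches the "broken parts" described in the question.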

There is a predefined method getElementText(), which is buggy in jdk1.6.0_15, but works ok with jdk1.6.0_19. A complete program to easily parse the plist file is this:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
public class Parser {
public static void main(String[] args) throws XMLStreamException, IOException {
InputStream in = new FileInputStream("foo.xml");
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader(in);
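// consume START_DOCUMENT outside the assert, so the call also happens when assertions are disabled
XMLEvent startDocument = parser.nextEvent();
assert startDocument.isStartDocument();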
XMLEvent event = parser.nextTag();
assert event.isStartElement();
final String name1 = event.asStartElement().getName().getLocalPart();
if (name1.equals("dict")) {
while ((event = parser.nextTag()).isStartElement()) {
final String name2 = event.asStartElement().getName().getLocalPart();
if (name2.equals("key")) {
String key = parser.getElementText();
System.out.println("key: " + key);
} else if (name2.equals("integer")) {
String number = parser.getElementText();
System.out.println("integer: " + number);
} else if (name2.equals("string")) {
String str = parser.getElementText();
System.out.println("string: " + str);
}
}
}
XMLEvent endDocument = parser.nextEvent();
assert endDocument.isEndDocument();
}
}

The dd-plist library enables your Java application to handle property lists of various formats.
Read / write property lists from / to files, streams or byte arrays
Convert between property list formats
Property list contents are provided as objects from the NeXTSTEP environment (NSDictionary, NSArray, NSString, etc.)
Serialize native java data structures to property list objects
Deserialize from property list objects to native java data structures
<dependency>
<groupId>com.googlecode.plist</groupId>
<artifactId>dd-plist</artifactId>
<version>1.26</version>
</dependency>
dd-plist

Related

XML to csv in java using DOM

I want to convert an XML file to CSV (a comma-separated file), and I am using the DOM parser in Java for that.
The output of the code below is: AAA123456
The desired output is: AAA,123,456
This is what I have developed so far. I would like each node's value to become one CSV field.
public class Main {
static public final String SEPARATOR = ",";
private static String decodeDetailOutputRecordXML(String str) throws ParserConfigurationException, IOException, SAXException {
str = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
Document doc =DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(str.getBytes()));
DocumentTraversal traversal = (DocumentTraversal) doc;
NodeIterator iterator = traversal.createNodeIterator(doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
System.out.println(n.getTextContent());
}
return "";
}
public static void main(String[] args) throws Exception {
decodeDetailOutputRecordXML(null);
return;
}
}
This answer demonstrates how to use the DOM API to convert the XML format under consideration to CSV. The example code below uses the DOM API directly and OpenCSV to write the CSV file.
The Example XML
<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
<c>Somedata0</c>
<d>Somedata1</d>
<e>Somedata2</e>
</b>
<b>
<c>Xdata0</c>
<d>Xdata1</d>
<e>Xdata2</e>
</b>
</a>
The routine that converts the XML to CSV
package org.test;
import java.io.FileInputStream;
import java.io.FileWriter;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import com.opencsv.CSVWriter;
public class XMLToCSVTest {
public static void main(String[] args) throws Exception{
String inputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdata.xml";
String outputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdataOut.csv";
/*
* We assume that we know the structure and the column names of the CSV file
*/
String[] csvHeaders=new String[] {"c","d","e"};
/*
* Using Xerces DOM parser directly, same can also be achieved through JAXP
*/
DOMParser parser=new DOMParser();
try(FileInputStream fis=new FileInputStream(inputFilePath);
CSVWriter writer=new CSVWriter(new FileWriter(outputFilePath));){
/*
* Write the CSV headers
*/
writer.writeNext(csvHeaders);
InputSource source=new InputSource(fis);
parser.parse(source);
Element documentElement=parser.getDocument().getDocumentElement();
/*
* We assume that we know the structure of the XML completely and we also assume the data is actually there, that is
* no elements are missing being optional.
*/
NodeList elementBList=documentElement.getElementsByTagName("b");
for(int i=0;i<elementBList.getLength();i++) {
Element elementB=(Element)elementBList.item(i);
Element elementC=(Element)elementB.getElementsByTagName("c").item(0);
Element elementD=(Element)elementB.getElementsByTagName("d").item(0);
Element elementE=(Element)elementB.getElementsByTagName("e").item(0);
String[] line=new String[] {elementC.getFirstChild().getNodeValue(),
elementD.getFirstChild().getNodeValue(),
elementE.getFirstChild().getNodeValue()};
writer.writeNext(line);
}//for closing
writer.flush();
}catch(Exception e) {e.printStackTrace();}
}//main closing
}//class closing
The CSV output
"c","d","e"
"Somedata0","Somedata1","Somedata2"
"Xdata0","Xdata1","Xdata2"
NOTE: The above is one way to convert XML to CSV with the DOM API directly. While the direct DOM API gives a lot of flexibility, it is also slightly complicated to use. XML is hierarchical data, so it can be difficult to express as CSV, which is a flat structure, without either some loss of fidelity or a more complicated CSV layout; a case in point is multiple occurrences of a specific child element (multi-valued elements in general). The CSV output could also be written by hand as part of the routine, but that would be tedious and error-prone, which is why OpenCSV is used.
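For the small document from the question, the same idea can be sketched with only the JDK's JAXP DOM parser, without Xerces classes or OpenCSV. This is a minimal sketch under stated assumptions: the class DomToCsv and the method toCsvLines are invented for this example, each row element is assumed to contain only simple child elements, and no CSV quoting is done, so values containing commas or quotes would need a real CSV writer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomToCsv {
    public static final String SEPARATOR = ",";

    /**
     * Builds one CSV line per element named rowTag, with one field per child element.
     * (Hypothetical helper; assumes flat rows and values that need no quoting.)
     */
    public static List<String> toCsvLines(String xml, String rowTag) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> lines = new ArrayList<>();
            NodeList rows = doc.getElementsByTagName(rowTag);
            for (int i = 0; i < rows.getLength(); i++) {
                List<String> fields = new ArrayList<>();
                NodeList children = rows.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip whitespace-only text nodes between the child elements.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.add(child.getTextContent().trim());
                    }
                }
                lines.add(String.join(SEPARATOR, fields));
            }
            return lines;
        } catch (Exception e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not convert XML to CSV", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
        toCsvLines(xml, "b").forEach(System.out::println);
    }
}
```

The key difference from the code in the question is that getTextContent() is called on each child of a row, not on the row itself, so the values stay separate and can be joined with the separator.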

XML file reading in Java

Is it necessary to know the structure and tags of an XML file completely before reading it in Java?
areaElement.getElementsByTagName("checked").item(0).getTextContent()
I don't know the field name "checked" before I read the file. Is there any way to list all the tags in the XML file, basically the file structure?
I wrote this DOM parser myself. It uses recursion to parse your XML without knowing a single tag in advance. It gives you each node's text content, if any, in document order. You can uncomment the commented-out section in the following code to get the node names as well. Hope it helps.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class RecDOMP {
public static void main(String[] args) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
// replace following path with your input xml path
Document doc = db.parse(new FileInputStream(new File ("D:\\ambuj\\ATT\\apip\\APIP_New.xml")));
// replace following path with your output text file path
File OutputDOM = new File("D:\\ambuj\\ATT\\apip\\outapip1.txt");
FileOutputStream fostream = new FileOutputStream(OutputDOM);
OutputStreamWriter oswriter = new OutputStreamWriter (fostream);
BufferedWriter bwriter = new BufferedWriter(oswriter);
// if file doesnt exists, then create it
if (!OutputDOM.exists()) {
OutputDOM.createNewFile();}
visitRecursively(doc,bwriter);
bwriter.close(); oswriter.close(); fostream.close();
System.out.println("Done");
}
public static void visitRecursively(Node node, BufferedWriter bw) throws IOException{
// get all child nodes
NodeList list = node.getChildNodes();
for (int i=0; i<list.getLength(); i++) {
// get child node
Node childNode = list.item(i);
if (childNode.getNodeType() == Node.TEXT_NODE)
{
//System.out.println("Found Node: " + childNode.getNodeName()
// + " - with value: " + childNode.getNodeValue()+" Node type:"+childNode.getNodeType());
String nodeValue= childNode.getNodeValue();
nodeValue=nodeValue.replace("\n","").replaceAll("\\s","");
if (!nodeValue.isEmpty())
{
System.out.println(nodeValue);
bw.write(nodeValue);
bw.newLine();
}
}
visitRecursively(childNode,bw);
}
}
}
You should definitely check out libraries for this, like dom4j (http://dom4j.sourceforge.net/). They can parse the whole XML document and let you not only list things like elements but do XPath queries and other such cool stuff on them.
There is a performance hit, especially in large XML documents, so you will want to check on the performance hit for your use case before committing to a library. This is especially true if you only need a small bit out of the XML document (and you kind of know what you are looking for already).
The answer to your question is no, it is not necessary to know any element names in advance. For example, you can walk the tree to discover the element names. But it all depends what you are actually trying to do.
For the vast majority of applications, incidentally, the Java DOM is one of the worst ways to solve the problem. But I won't comment further without knowing your project requirements.
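As a small illustration of "walking the tree to discover the element names", here is a minimal sketch using the JDK's DOM parser. The class ElementNameLister and its method elementNames are hypothetical names chosen for this example; the XML is passed as a string to keep the sketch self-contained.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ElementNameLister {

    /** Returns the distinct element names of the document, in order of first appearance. */
    public static Set<String> elementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Set<String> names = new LinkedHashSet<>();
            collect(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Depth-first walk: record element names and recurse into all children. */
    private static void collect(Node node, Set<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, names);
        }
    }

    public static void main(String[] args) {
        String xml = "<area><checked>true</checked><name>x</name><name>y</name></area>";
        System.out.println(elementNames(xml));
    }
}
```

Once the names are known, calls such as getElementsByTagName can be driven by the discovered names instead of hard-coded ones.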

How to preserve whitespace in attributes when using XMLStreamWriter?

When using javax.xml.stream.XMLStreamWriter, is there any way to preserve the whitespace within attributes? I understand that the XMLStreamReader will perform Attribute-Value Normalization, converting \r\n\t in the XML into spaces, so it's up to the writer to emit entity references (e.g. &#10;) to preserve the whitespace. Is there any way to tell the writer to use entity references for whitespace? Can I add entity references to attributes myself?
The following JUnit3 test passes. When I encode "Hello,\r\n\tworld", I want to get the same thing back out. But instead, the decoded value is "Hello,  world" (two spaces).
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.*;
import junit.framework.TestCase;
public class XmlStreamTest extends TestCase {
public void testAttribute() throws XMLStreamException {
StringWriter stringWriter = new StringWriter();
XMLStreamWriter xmlStreamWriter = XMLOutputFactory.newFactory().createXMLStreamWriter(stringWriter);
xmlStreamWriter.writeStartDocument();
xmlStreamWriter.writeStartElement("root");
xmlStreamWriter.writeAttribute("a", "Hello,\r\n\tWorld! ");
xmlStreamWriter.writeEndElement();
xmlStreamWriter.writeEndDocument();
xmlStreamWriter.close();
assertEquals("<?xml version=\"1.0\" ?><root a=\"Hello,\r\n\tWorld! \"></root>", stringWriter.toString());
StringReader stringReader = new StringReader(stringWriter.toString());
XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(stringReader);
assertEquals(XMLStreamConstants.START_DOCUMENT, xmlStreamReader.getEventType());
assertEquals(XMLStreamConstants.START_ELEMENT, xmlStreamReader.next());
// This is not what I want! I want the value to be the same as I originally gave!
assertEquals("Hello,  World! ", xmlStreamReader.getAttributeValue(null, "a"));
assertEquals(XMLStreamConstants.END_ELEMENT, xmlStreamReader.next());
assertEquals(XMLStreamConstants.END_DOCUMENT, xmlStreamReader.next());
}
}
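The reading half of this can be checked in isolation. The following minimal sketch (the class AttributeWhitespaceDemo and the helper readAttribute are invented for this illustration) feeds the reader an attribute whose whitespace is written as numeric character references and one whose whitespace is literal. It only demonstrates what XMLStreamReader returns; it makes no claim about how to get XMLStreamWriter to produce such references.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class AttributeWhitespaceDemo {

    /** Returns the named attribute of the first element in the document, or null. */
    public static String readAttribute(String xml, String attributeName) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getAttributeValue(null, attributeName);
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Whitespace written as character references: CR, LF and TAB.
        String escaped = "<root a=\"Hello,&#13;&#10;&#9;World! \"></root>";
        // The same whitespace written literally, as in the test above.
        String literal = "<root a=\"Hello,\r\n\tWorld! \"></root>";
        System.out.println(readAttribute(escaped, "a").equals("Hello,\r\n\tWorld! "));
        System.out.println(readAttribute(literal, "a").equals("Hello,  World! "));
    }
}
```

Attribute-value normalization replaces literal whitespace characters with spaces, but characters that come from character references are taken as they are, which is why the escaped form round-trips.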

Reading Java Properties file without escaping values

My application needs to use a .properties file for configuration.
In the properties files, users are allowed to specify paths.
Problem
Properties files need values to be escaped, eg
dir = c:\\mydir
Needed
I need some way to accept a properties file where the values are not escaped, so that the users can specify:
dir = c:\mydir
Why not simply extend the Properties class so that it doubles every backslash before loading? A nice feature of this is that the rest of your program can still use the original Properties class.
public class PropertiesEx extends Properties {
public void load(FileInputStream fis) throws IOException {
Scanner in = new Scanner(fis);
ByteArrayOutputStream out = new ByteArrayOutputStream();
while(in.hasNext()) {
out.write(in.nextLine().replace("\\","\\\\").getBytes());
out.write("\n".getBytes());
}
InputStream is = new ByteArrayInputStream(out.toByteArray());
super.load(is);
}
}
Using the new class is as simple as:
PropertiesEx p = new PropertiesEx();
p.load(new FileInputStream("C:\\temp\\demo.properties"));
p.list(System.out);
The escaping code could be improved, but the general principle is there.
Two options:
Use the XML properties format instead
Write your own parser for a modified .properties format without escapes
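As an illustration of the first option, here is a minimal sketch using Properties.loadFromXML. The class XmlPropertiesDemo, the constant SAMPLE and the helper loadXml are made up for this example, and the XML is held in a string instead of a file. In the XML format a backslash has no special meaning, so the path can be written exactly as the user would type it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class XmlPropertiesDemo {

    /** Sample XML properties document; in the file itself the value reads c:\mydir. */
    public static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <entry key=\"dir\">c:\\mydir</entry>\n"
            + "</properties>\n";

    /** Loads properties from an XML properties document given as a string. */
    public static Properties loadXml(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(loadXml(SAMPLE).getProperty("dir"));
    }
}
```

The trade-off is that users must now escape XML special characters such as & and < instead of backslashes, which may or may not be easier for them.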
You can "preprocess" the file before loading the properties, for example:
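public InputStream preprocessPropertiesFile(String myFile) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (Scanner in = new Scanner(new FileReader(myFile))) {
        while (in.hasNextLine()) {
            // double every backslash so that Properties.load() does not treat it as an escape
            out.write(in.nextLine().replace("\\", "\\\\").getBytes());
            // keep the line breaks; without them all entries would run together
            out.write('\n');
        }
    }
    return new ByteArrayInputStream(out.toByteArray());
}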
And your code could look this way
Properties properties = new Properties();
properties.load(preprocessPropertiesFile("path/myfile.properties"));
Doing this, your .properties file can look the way you need, and the property values will still be ready to use.
*I know there should be better ways to manipulate files, but I hope this helps.
The right way would be to provide your users with a property file editor (or a plugin for their favorite text editor) which allows them entering the text as pure text, and would save the file in the property file format.
If you don't want this, you are effectively defining a new format for the same (or a subset of the) content model as the property files have.
Go the whole way and actually specify your format, and then think about a way to either
transform the format to the canonical one, and then use this for loading the files, or
parse this format and populate a Properties object from it.
Both of these approaches will only work directly if you actually can control your property object's creation, otherwise you will have to store the transformed format with your application.
So, let's see how we can define this. The content model of normal property files is simple:
A map of string keys to string values, both allowing arbitrary Java strings.
The escaping which you want to avoid serves just to allow arbitrary Java strings, and not just a subset of these.
An often sufficient subset would be:
A map of string keys (not containing any whitespace, : or =) to string values (not containing any leading or trailing white space or line breaks).
In your example dir = c:\mydir, the key would be dir and the value c:\mydir.
If we want our keys and values to contain any Unicode character (other than the forbidden ones mentioned), we should use UTF-8 (or UTF-16) as the storage encoding - since we have no way to escape characters outside of the storage encoding. Otherwise, US-ASCII or ISO-8859-1 (as normal property files) or any other encoding supported by Java would be enough, but make sure to include this in your specification of the content model (and make sure to read it this way).
Since we restricted our content model so that all "dangerous" characters are out of the way, we can now define the file format simply as this:
<simplepropertyfile> ::= (<line> <line break> )*
<line> ::= <comment> | <empty> | <key-value>
<comment> ::= <space>* "#" < any text excluding line breaks >
<key-value> ::= <space>* <key> <space>* "=" <space>* <value> <space>*
<empty> ::= <space>*
<key> ::= < any text excluding ':', '=' and whitespace >
<value> ::= < any text starting and ending not with whitespace,
not including line breaks >
<space> ::= < any whitespace, but not a line break >
<line break> ::= < one of "\n", "\r", and "\r\n" >
Every \ occurring in either key or value now is a real backslash, not anything which escapes something else.
Thus, for transforming it into the original format, we simply need to double it, like Grekz proposed, for example in a filtering reader:
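public class DoubleBackslashFilter extends FilterReader {
    private boolean bufferedBackslash = false;
    public DoubleBackslashFilter(Reader org) {
        super(org);
    }
    @Override
    public int read() throws IOException {
        if (bufferedBackslash) {
            bufferedBackslash = false;
            return '\\';
        }
        int c = super.read();
        if (c == '\\')
            bufferedBackslash = true;
        return c;
    }
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0)
            return 0;
        int read = 0;
        if (bufferedBackslash) {
            buf[off] = '\\';
            read++;
            off++;
            len--;
            bufferedBackslash = false;
        }
        if (len > 1) {
            // read at most half of the free space, so the doubled backslashes always fit
            int step = super.read(buf, off, len / 2);
            if (step < 0)
                return read > 0 ? read : -1; // end of stream
            for (int i = 0; i < step; i++) {
                if (buf[off + i] == '\\') {
                    // shift everything from here on one char to the right.
                    System.arraycopy(buf, off + i, buf, off + i + 1, step - i);
                    // adjust parameters
                    step++; i++;
                }
            }
            read += step;
        } else if (read == 0) {
            // room for a single char only: use read(), which buffers the second backslash
            int c = read();
            if (c < 0)
                return -1;
            buf[off] = (char) c;
            read = 1;
        }
        return read;
    }
}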
Then we would pass this Reader to our Properties object (or save the contents to a new file).
Instead, we could simply parse this format ourselves.
public Properties parse(Reader in) throws IOException {
BufferedReader r = new BufferedReader(in);
Properties prop = new Properties();
Pattern keyValPattern = Pattern.compile("\\s*=\\s*");
String line;
while((line = r.readLine()) != null) {
line = line.trim(); // remove leading and trailing space
if(line.equals("") || line.startsWith("#")) {
continue; // ignore empty and comment lines
}
String[] kv = keyValPattern.split(line, 2);
// the pattern also grabs space around the separator.
if(kv.length < 2) {
// no key-value separator. TODO: Throw exception or simply ignore this line?
continue;
}
prop.setProperty(kv[0], kv[1]);
}
r.close();
return prop;
}
Again, using Properties.store() after this, we can export it in the original format.
Based on @Ian Harrigan's answer, here is a complete solution for reading and writing Netbeans properties files (and other properties files with unescaped backslashes) as ASCII text files:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
/**
* This class allows to handle Netbeans properties file.
* It is based on the work of : http://stackoverflow.com/questions/6233532/reading-java-properties-file-without-escaping-values.
* It overrides both load methods in order to load a netbeans property file, taking into account the \ that
* were escaped by java properties original load methods.
* @author stephane
*/
public class NetbeansProperties extends Properties {
@Override
public synchronized void load(Reader reader) throws IOException {
BufferedReader bfr = new BufferedReader( reader );
ByteArrayOutputStream out = new ByteArrayOutputStream();
String readLine = null;
while( (readLine = bfr.readLine()) != null ) {
out.write(readLine.replace("\\","\\\\").getBytes());
out.write("\n".getBytes());
}//while
InputStream is = new ByteArrayInputStream(out.toByteArray());
super.load(is);
}//met
@Override
public void load(InputStream is) throws IOException {
load( new InputStreamReader( is ) );
}//met
@Override
public void store(Writer writer, String comments) throws IOException {
PrintWriter out = new PrintWriter( writer );
if( comments != null ) {
out.print( '#' );
out.println( comments );
}//if
List<String> listOrderedKey = new ArrayList<String>();
listOrderedKey.addAll( this.stringPropertyNames() );
Collections.sort(listOrderedKey );
for( String key : listOrderedKey ) {
String newValue = this.getProperty(key);
out.println( key+"="+newValue );
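}//for
// flush the PrintWriter, otherwise buffered output may never reach the underlying writer
out.flush();
}//met
@Override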
public void store(OutputStream out, String comments) throws IOException {
store( new OutputStreamWriter(out), comments );
}//met
}//class
You could try using guava's Splitter: split on '=' and build a map from resulting Iterable.
The disadvantage of this solution is that it does not support comments.
@pdeva: one more solution
//Reads entire file in a String
//available in java1.5
Scanner scan = new Scanner(new File("C:/workspace/Test/src/myfile.properties"));
scan.useDelimiter("\\Z");
String content = scan.next();
//Use apache StringEscapeUtils.escapeJava() method to escape java characters
ByteArrayInputStream bi=new ByteArrayInputStream(StringEscapeUtils.escapeJava(content).getBytes());
//load properties file
Properties properties = new Properties();
properties.load(bi);
It's not an exact answer to your question, but a different solution that may suit your needs. In Java, you can use / as a path separator and it will work on Windows, Linux, and OS X alike. This is especially useful for relative paths.
In your example, you could use:
dir = c:/mydir

stax - get xml node as string

xml looks like so:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
I'm using stax to process one "<statement>" at a time and I got that working. I need to get that entire statement node as a string so I can create "123.xml" and "456.xml" or maybe even load it into a database table indexed by account.
using this approach: http://www.devx.com/Java/Article/30298/1954
I'm looking to do something like this:
String statementXml = staxXmlReader.getNodeByName("statement");
//load statementXml into database
I had a similar task, and although the original question is more than a year old, I couldn't find a satisfying answer. The most interesting answer so far was Blaise Doughan's, but I couldn't get it running on the XML I am expecting (maybe some parameters for the underlying parser could change that?). Here is the XML, very simplified:
<many-many-tags>
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
</many-many-tags>
My solution:
public static String readElementBody(XMLEventReader eventReader)
throws XMLStreamException {
StringWriter buf = new StringWriter(1024);
int depth = 0;
while (eventReader.hasNext()) {
// peek event
XMLEvent xmlEvent = eventReader.peek();
if (xmlEvent.isStartElement()) {
++depth;
}
else if (xmlEvent.isEndElement()) {
--depth;
// reached END_ELEMENT tag?
// break loop, leave event in stream
if (depth < 0)
break;
}
// consume event
xmlEvent = eventReader.nextEvent();
// print out event
xmlEvent.writeAsEncodedUnicode(buf);
}
return buf.getBuffer().toString();
}
Usage example:
XMLEventReader eventReader = ...;
while (eventReader.hasNext()) {
XMLEvent xmlEvent = eventReader.nextEvent();
if (xmlEvent.isStartElement()) {
StartElement elem = xmlEvent.asStartElement();
String name = elem.getName().getLocalPart();
if ("description".equals(name)) {
String xmlFragment = readElementBody(eventReader);
// do something with it...
System.out.println("'" + xmlFragment + "'");
}
}
else if (xmlEvent.isEndElement()) {
// ...
}
}
Note that the extracted XML fragment will contain the complete extracted body content, including white space and comments. Filtering those on demand, or making the buffer size parametrizable have been left out for code brevity:
'
<description>
...
<p>Lorem ipsum...</p>
Devils inside...
...
</description>
'
You can use StAX for this. You just need to advance the XMLStreamReader to the start element for statement. Check the account attribute to get the file name. Then use the javax.xml.transform APIs to transform the StAXSource to a StreamResult wrapping a File. This will advance the XMLStreamReader and then just repeat this process.
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
xsr.nextTag(); // Advance to statements element
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
File file = new File("out" + xsr.getAttributeValue(null, "account") + ".xml");
t.transform(new StAXSource(xsr), new StreamResult(file));
}
}
}
StAX is a low-level access API; it has neither lookups nor methods that access content recursively. But what are you actually trying to do? And why are you considering StAX?
Beyond using a tree model (DOM, XOM, JDOM, dom4j), which would work well with XPath, the best choice when dealing with data is usually a data-binding library like JAXB. With it you can pass in a StAX or SAX reader and ask it to bind the XML data into Java beans, so that you process Java objects instead of messing with XML. This is often more convenient, and it usually performs quite well.
The only trick with larger files is that you do not want to bind the whole thing at once, but rather bind each sub-tree (in your case, one 'statement' at a time).
This is most easily done by iterating with a StAX XMLStreamReader and then using JAXB to bind each sub-tree.
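I've been googling and this seems painfully difficult.
Given my XML, I think it might just be simpler to:
// inputFile and STMT_END_TAG (e.g. "</statement>") are placeholders
StringBuilder buffer = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
    String line;
    while ((line = reader.readLine()) != null) {
        buffer.append(line).append('\n');
        if (line.trim().equals(STMT_END_TAG)) {
            parse(buffer.toString());
            buffer.setLength(0); // start collecting the next statement
        }
    }
}
private void parse(String statement) {
    // saxParser.parse(new InputSource(new StringReader(statement)), handler);
    // do stuff
    // save string
}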
Why not just use xpath for this?
You could have a fairly simple xpath to get all 'statement' nodes.
Like so:
//statement
EDIT #1: If possible, take a look at dom4j. You could read the String and get all 'statement' nodes fairly simply.
EDIT #2: Using dom4j, this is how you would do it:
(from their cookbook)
String text = "your xml here";
Document document = DocumentHelper.parseText(text);
public void bar(Document document) {
List list = document.selectNodes( "//statement" );
// loop through node data
}
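The same //statement query can also be sketched with the XPath and Transformer classes that ship with the JDK, in place of dom4j, which additionally shows how to turn each matched node back into a string. The class XPathStatements and the method statementsAsStrings are invented for this illustration. Note that this loads the whole document into memory, so it only suits inputs that fit comfortably in a DOM tree.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathStatements {

    /** Returns every 'statement' element of the document, serialized as its own XML string. */
    public static List<String> statementsAsStrings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//statement", doc, XPathConstants.NODESET);

            // An identity transformer serializes a DOM node to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
                result.add(out.toString());
            }
            return result;
        } catch (Exception e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not extract statements", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<statements>"
                + "<statement account=\"123\"><amount>5</amount></statement>"
                + "<statement account=\"456\"><amount>7</amount></statement>"
                + "</statements>";
        statementsAsStrings(xml).forEach(System.out::println);
    }
}
```

Each returned string could then be written to a file named after the account attribute or stored in a database column, as described in the question.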
I had a similar problem and found a solution.
I used the solution proposed by @t0r0X, but it does not work well with the current implementation in Java 11: the method xmlEvent.writeAsEncodedUnicode creates an invalid string representation of the start element (in the StartElementEvent class) in the resulting XML fragment. I had to modify it, and it now seems to work well, which I could immediately verify by parsing the fragment with DOM and unmarshalling it with JAXB into specific data containers.
In my case I had the huge structure
<Orders>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
<ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
.....
</ns2:SyncOrder>
...
</Orders>
in a file of several hundred megabytes (a lot of repeating "SyncOrder" structures), so using DOM on the whole file would lead to large memory consumption and slow evaluation. Therefore I used StAX to split the huge XML into smaller XML pieces, which I analyzed with DOM, using the JAXB elements generated from the XSD definition of the SyncOrder element (I had this infrastructure from the web service, which uses the same structure, but that is not important).
The code below shows where the XML fragment is created and could be used; I used it directly in further processing...
private static <T> List<T> unmarshallMultipleSyncOrderXmlData(
InputStream aOrdersXmlContainingSyncOrderItems,
Function<SyncOrderType, T> aConversionFunction) throws XMLStreamException, ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory locDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
locDocumentBuilderFactory.setNamespaceAware(true);
DocumentBuilder locDocBuilder = locDocumentBuilderFactory.newDocumentBuilder();
List<T> locResult = new ArrayList<>();
XMLInputFactory locFactory = XMLInputFactory.newFactory();
XMLEventReader locReader = locFactory.createXMLEventReader(aOrdersXmlContainingSyncOrderItems);
boolean locIsInSyncOrder = false;
QName locSyncOrderElementQName = null;
StringWriter locXmlTextBuffer = new StringWriter();
int locDepth = 0;
while (locReader.hasNext()) {
XMLEvent locEvent = locReader.nextEvent();
if (locEvent.isStartElement()) {
if (locDepth == 0 && Objects.equals(locEvent.asStartElement().getName().getLocalPart(), "Orders")) {
locDepth++;
} else {
if (locDepth <= 0)
throw new IllegalStateException("An invalid XML stream has been passed into the function. "
+ "Expecting the element 'Orders' as the root element of the document, but found '"
+ locEvent.asStartElement().getName().getLocalPart() + "'.");
locDepth++;
if (locSyncOrderElementQName == null) {
/* First element after the "Orders" has passed, so we retrieve
* the name of the element with the namespace prefix: */
locSyncOrderElementQName = locEvent.asStartElement().getName();
}
if(Objects.equals(locEvent.asStartElement().getName(), locSyncOrderElementQName)) {
locIsInSyncOrder = true;
}
}
} else if (locEvent.isEndElement()) {
locDepth--;
if(locDepth == 1 && Objects.equals(locEvent.asEndElement().getName(), locSyncOrderElementQName)) {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
/* at this moment the call of locXmlTextBuffer.toString() gets the complete fragment
* of XML containing the valid SyncOrder element, but I have continued to other processing,
* which immediately validates that the produced XML fragment is valid and passes the values
* to the communication object: */
Document locDocument = locDocBuilder.parse(new ByteArrayInputStream(locXmlTextBuffer.toString().getBytes()));
SyncOrderType locItem = unmarshallSyncOrderDomNodeToCo(locDocument);
locResult.add(aConversionFunction.apply(locItem));
locXmlTextBuffer = new StringWriter();
locIsInSyncOrder = false;
}
}
if (locIsInSyncOrder) {
if (locEvent.isStartElement()) {
/* here replaced the standard implementation of startElement's method writeAsEncodedUnicode: */
locXmlTextBuffer.write(startElementToString(locEvent.asStartElement()));
} else {
locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
}
}
}
return locResult;
}
private static String startElementToString(StartElement aStartElement) {
StringBuilder locStartElementBuffer = new StringBuilder();
// open element
locStartElementBuffer.append("<");
String locNameAsString = null;
if ("".equals(aStartElement.getName().getNamespaceURI())) {
locNameAsString = aStartElement.getName().getLocalPart();
} else if (aStartElement.getName().getPrefix() != null
&& !"".equals(aStartElement.getName().getPrefix())) {
locNameAsString = aStartElement.getName().getPrefix()
+ ":" + aStartElement.getName().getLocalPart();
} else {
locNameAsString = aStartElement.getName().getLocalPart();
}
locStartElementBuffer.append(locNameAsString);
// add any attributes
Iterator<Attribute> locAttributeIterator = aStartElement.getAttributes();
Attribute attr;
while (locAttributeIterator.hasNext()) {
attr = locAttributeIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(attr));
}
// add any namespaces
Iterator<Namespace> locNamespaceIterator = aStartElement.getNamespaces();
Namespace locNamespace;
while (locNamespaceIterator.hasNext()) {
locNamespace = locNamespaceIterator.next();
locStartElementBuffer.append(" ");
locStartElementBuffer.append(attributeToString(locNamespace));
}
// close start tag
locStartElementBuffer.append(">");
// return StartElement as a String
return locStartElementBuffer.toString();
}
private static String attributeToString(Attribute aAttr) {
if( aAttr.getName().getPrefix() != null && aAttr.getName().getPrefix().length() > 0 )
return aAttr.getName().getPrefix() + ":" + aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
else
return aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
}
public static SyncOrderType unmarshallSyncOrderDomNodeToCo(
Node aSyncOrderItemNode) {
Source locSource = new DOMSource(aSyncOrderItemNode);
Object locUnmarshalledObject = getMarshallerAndUnmarshaller().unmarshal(locSource);
SyncOrderType locCo = ((JAXBElement<SyncOrderType>) locUnmarshalledObject).getValue();
return locCo;
}
