Split XML using SAX parser - java

I have following xml file.
<Engineers>
<Engineer>
<Name>JOHN</Name>
<Position>STL</Position>
<Team>SS</Team>
</Engineer>
<Engineer>
<Name>UDAY</Name>
<Position>TL</Position>
<Team>SG</Team>
</Engineer>
<Engineer>
<Name>INDRA</Name>
<Position>Director</Position>
<Team>PP</Team>
</Engineer>
</Engineers>
I need to split this xml into smaller xml strings when Xpath is given as Engineers/Enginner.
Smaller xml strings are as follows
<Engineers>
<Engineer>
<Name>INDRA</Name>
<Position>Director</Position>
<Team>PP</Team>
</Engineer>
</Engineers>
<Engineers>
<Engineer>
<Name>JOHN</Name>
<Position>STL</Position>
<Team>SS</Team>
</Engineer>
</Engineers>
I have implemented following so far using SAX which we can get the elements inside XML but not as what I want.How can I proceed??
public class ReadSAX
{
public static void main( String[] args )
{
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
public void startElement(String uri, String localName,
String qName, Attributes attributes)
throws SAXException {
System.out.println("Start Element :" + qName);
public void endElement(String uri, String localName,
String qName)
throws SAXException {
System.out.println("End Element :" + qName);
}
public void characters(char ch[], int start, int length)
throws SAXException {
System.out.println(new String(ch, start, length));
}
};
File file = new File("c:\\file.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
} catch (Exception e) {
e.printStackTrace();
}
}
}

Why use such a low-level coding approach?
In XSLT 2.0 it's simply
<xsl:template match="/">
<xsl:for-each select="Engineers/Engineer">
<xsl:result-document select="{position()}.xml">
<Engineers>
<xsl:copy-of select="."/>
</Engineers>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
and if that takes too much memory, get a streaming XSLT 3.0 processor which will solve the problem.

I think what you need to do is to use VTD-XML's cut and paste ability... this paper, entitled performance analysis of java apis for xml processing, will tell you more on vtd-xml..
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
import com.ximpleware.*;
import java.io.*;
public class splitXML {
public static void main(String[] args) throws VTDException, IOException {
VTDGen vg = new VTDGen();
if (!vg.parseFile("d:\\xml\\input.xml", false)){
System.out.println("error");
return;
}
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/engineers/engineer");
int i=0,n=0;
FileOutputStream fos =null;
byte[] stag="<engineers>".getBytes();
byte[] etag="</engineers>".getBytes();
while((i=ap.evalXPath())!=-1){
fos.write(stag);
fos = new FileOutputStream("d:\\xml\\output"+(++n)+".xml");
long l = vn.getElementFragment();
fos.write(vn.getXML().getBytes(), (int)l, (int)(l>>32));
fos.write(etag);
fos.close();
}
}
}

Related

Java Replace words within xml

I have the following xml
<some tag>
<some_nested_tag attr="Hello"> Text </some_nested_tag>
Hello world Hello Programming
</some tag>
From the above xml, I want to replace the occurances of the word "Hello" which are part of the tag content but not part of tag attribute.
I want the following output (Replacing Hello by HI):
<some tag>
<some_nested_tag attr="Hello"> Text </some_nested_tag>
HI world HI Programming
</some tag>
I tried java regex and also some of the DOM parser tutorials, but without any luck. I am posting here for help as I have limited time available to fix this in my project. Help would be appreciated.
That can be done by using a negative lookbehind.
Try this regex:
(?<!attr=")Hello
It will match Hello that is not preceded by attr=.
So you could try this:
str = str.replaceAll("(?<!attr=")Hello", "Hi");
It can also be done by negative lookahead:
Hello(?!([^<]+)?>)
string.replaceAll("(?i)\\shello\\s", " HI ");
Regex Explanation:
\sHello\s
Options: Case insensitive
Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
Match the character string “Hello” literally (case insensitive) «Hello»
Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
hi
Insert the character string “ HI ” literally « HI »
Regex101 Demo
XSLT is a language for transforming XML documents into other XML documents. You can match all the text nodes containing 'Hello' and replace the content of those particular nodes.
A small example of using XSLT in Java:
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;
public class TestMain {
public static void main(String[] args) throws IOException, URISyntaxException, TransformerException {
TransformerFactory factory = TransformerFactory.newInstance();
Source xslt = new StreamSource(new File("transform.xslt"));
Transformer transformer = factory.newTransformer(xslt);
Source text = new StreamSource(new File("input.xml"));
transformer.transform(text, new StreamResult(new File("output.xml")));
}
}
There was a good question on replacing string using XSLT - you can find an example of XSLT template there:
XSLT string replace
Here is a fully functional example using SAX parser. It is adapted to your case with minimal changes from this example
The actual replacement takes place in MyCopyHandler#endElement() and MyCopyHandler#startElement() and the XML element text content is collected in MyCopyHandler#characters(). Note the buffer maintenance too - it is important in handling mixed element content (text and child elements)
I know XSLT solution is also possible, but it is not that portable.
public class XMLReplace {
/**
* #param args
* #throws SAXException
* #throws ParserConfigurationException
*/
public static void main(String[] args) throws Exception {
final String str = "<root> Hello <nested attr='Hello'> Text </nested> Hello world Hello Programming </root>";
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser parser = spf.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new MyErrorHandler());
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintWriter out = new PrintWriter(baos);
MyCopyHandler duper = new MyCopyHandler(out);
reader.setContentHandler(duper);
InputSource is = new InputSource(new StringReader(str));
reader.parse(is);
out.close();
System.out.println(baos);
}
}
class MyCopyHandler implements ContentHandler {
private boolean namespaceBegin = false;
private String currentNamespace;
private String currentNamespaceUri;
private Locator locator;
private final PrintWriter out;
private final StringBuilder buffer = new StringBuilder();
public MyCopyHandler(PrintWriter out) {
this.out = out;
}
public void setDocumentLocator(Locator locator) {
this.locator = locator;
}
public void startDocument() {
}
public void endDocument() {
}
public void startPrefixMapping(String prefix, String uri) {
namespaceBegin = true;
currentNamespace = prefix;
currentNamespaceUri = uri;
}
public void endPrefixMapping(String prefix) {
}
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) {
// Flush buffer - needed in case of mixed content (text + elements)
out.print(buffer.toString().replaceAll("Hello", "HI"));
// Prepare to collect element text content
this.buffer.setLength(0);
out.print("<" + qName);
if (namespaceBegin) {
out.print(" xmlns:" + currentNamespace + "=\"" + currentNamespaceUri + "\"");
namespaceBegin = false;
}
for (int i = 0; i < atts.getLength(); i++) {
out.print(" " + atts.getQName(i) + "=\"" + atts.getValue(i) + "\"");
}
out.print(">");
}
public void endElement(String namespaceURI, String localName, String qName) {
// Process text content
out.print(buffer.toString().replaceAll("Hello", "HI"));
out.print("</" + qName + ">");
// Reset buffer
buffer.setLength(0);
}
public void characters(char[] ch, int start, int length) {
// Store chunk of text - parser is allowed to provide text content in chunks for performance reasons
buffer.append(Arrays.copyOfRange(ch, start, start + length));
}
public void ignorableWhitespace(char[] ch, int start, int length) {
for (int i = start; i < start + length; i++)
out.print(ch[i]);
}
public void processingInstruction(String target, String data) {
out.print("<?" + target + " " + data + "?>");
}
public void skippedEntity(String name) {
out.print("&" + name + ";");
}
}
class MyErrorHandler implements ErrorHandler {
public void warning(SAXParseException e) throws SAXException {
show("Warning", e);
throw (e);
}
public void error(SAXParseException e) throws SAXException {
show("Error", e);
throw (e);
}
public void fatalError(SAXParseException e) throws SAXException {
show("Fatal Error", e);
throw (e);
}
private void show(String type, SAXParseException e) {
System.out.println(type + ": " + e.getMessage());
System.out.println("Line " + e.getLineNumber() + " Column " + e.getColumnNumber());
System.out.println("System ID: " + e.getSystemId());
}
}

Reading XML for getting entities

I'm using SAX (Simple API for XML) to parse an XML document. My purpose is to parse the document so that i can separate entities from the the XML and create an ER Diagram from these entities (which i will create manually after i get all the entities the file have).
Although i'm on very initial stage of coding every thing i have discussed above, but i' just stuck at this particular problem right now.
here is my code:
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class Parser extends DefaultHandler {
public void getXml() {
try {
SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxParserFactory.newSAXParser();
final MySet openingTagList = new MySet();
final MySet closingTagList = new MySet();
DefaultHandler defaultHandler = new DefaultHandler() {
public void startDocument() throws SAXException {
System.out.println("Starting Parsing...\n");
}
public void endDocument() throws SAXException {
System.out.print("\n\nDone Parsing!");
}
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
if (!openingTagList.contains(qName)) {
openingTagList.add(qName);
System.out.print("<" + qName + ">");
}
}
public void characters(char ch[], int start, int length)
throws SAXException {
for (int i = start; i < (start + length); i++) {
System.out.print(ch[i]);
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (!closingTagList.contains(qName)) {
closingTagList.add(qName);
System.out.print("</" + qName + ">");
}
}
};
saxParser.parse("student.xml", defaultHandler);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String args[]) {
Parser readXml = new Parser();
readXml.getXml();
}
}
What i'm trying to achieve is when the startElement method detects that the tag was already traversed it should skip the tag as well all the other entities inside the tag, but i'm confused about how to implement that part.
Note: Purpose is to read the tags, i don't care about the records in between them. MySet is just an abstraction which contains method like contains (if the set has the passed data) etc nothing much.
Any help would be appropriated. Thanks
Due to the nature of xml it's not possible to know which tags will appear later in the file. So there is no 'skip the next x bytes'-trick.
Just ask for reasonable sized files - maybe there is a possibility to split the data.
In my opinion reading a xml file with more than 1 gb is no fun - regardless of the used library.

Getting Parent Child Hierarchy in Sax XML parser

I'm using SAX (Simple API for XML) to parse an XML document. I'm getting output for all the tags the file have, but i want it to show the tags in parent child hierarchy.
For Example:
This is my output
<dblp>
<www>
<author>
</author><title>
</title><url>
</url><year>
</year></www><inproceedings>
<month>
</month><pages>
</pages><booktitle>
</booktitle><note>
</note><cdrom>
</cdrom></inproceedings><article>
<journal>
</journal><volume>
</volume></article><ee>
</ee><book>
<publisher>
</publisher><isbn>
</isbn></book><incollection>
<crossref>
</crossref></incollection><editor>
</editor><series>
</series></dblp>
But i want it to display the output like this (it displays the children with extra spacing (that's how i want it to be))
<dblp>
<www>
<author>
</author>
<title>
</title>
<url>
</url>
<year>
</year>
</www>
<inproceedings>
<month>
</month>
<pages>
</pages>
<booktitle>
</booktitle>
<note>
</note>
<cdrom>
</cdrom>
</inproceedings>
<article>
<journal>
</journal>
<volume>
</volume>
</article>
<ee>
</ee>
<book>
<publisher>
</publisher>
<isbn>
</isbn>
</book>
<incollection>
<crossref>
</crossref>
</incollection>
<editor>
</editor>
<series>
</series>
</dblp>
But i can't figure out how can i detect that parser is parsing a parent tag or a children.
here is my code:
package com.teamincredibles.sax;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class Parser extends DefaultHandler {
public void getXml() {
try {
SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxParserFactory.newSAXParser();
final MySet openingTagList = new MySet();
final MySet closingTagList = new MySet();
DefaultHandler defaultHandler = new DefaultHandler() {
public void startDocument() throws SAXException {
System.out.println("Starting Parsing...\n");
}
public void endDocument() throws SAXException {
System.out.print("\n\nDone Parsing!");
}
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
if (!openingTagList.contains(qName)) {
openingTagList.add(qName);
System.out.print("<" + qName + ">\n");
}
}
public void characters(char ch[], int start, int length)
throws SAXException {
/*for(int i=start; i<(start+length);i++){
System.out.print(ch[i]);
}*/
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
if (!closingTagList.contains(qName)) {
closingTagList.add(qName);
System.out.print("</" + qName + ">");
}
}
};
saxParser.parse("xml/sample.xml", defaultHandler);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String args[]) {
Parser readXml = new Parser();
readXml.getXml();
}
}
You can consider a StAX implementation:
package be.duo.stax;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
public class StaxExample {
public void getXml() {
InputStream is = null;
try {
is = new FileInputStream("c:\\dev\\sample.xml");
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader reader = inputFactory.createXMLStreamReader(is);
parse(reader, 0);
} catch(Exception ex) {
System.out.println(ex.getMessage());
} finally {
if(is != null) {
try {
is.close();
} catch(IOException ioe) {
System.out.println(ioe.getMessage());
}
}
}
}
private void parse(XMLStreamReader reader, int depth) throws XMLStreamException {
while(true) {
if(reader.hasNext()) {
switch(reader.next()) {
case XMLStreamConstants.START_ELEMENT:
writeBeginTag(reader.getLocalName(), depth);
parse(reader, depth+1);
break;
case XMLStreamConstants.END_ELEMENT:
writeEndTag(reader.getLocalName(), depth-1);
return;
}
}
}
}
private void writeBeginTag(String tag, int depth) {
for(int i = 0; i < depth; i++) {
System.out.print(" ");
}
System.out.println("<" + tag + ">");
}
private void writeEndTag(String tag, int depth) {
for(int i = 0; i < depth; i++) {
System.out.print(" ");
}
System.out.println("</" + tag + ">");
}
public static void main(String[] args) {
StaxExample app = new StaxExample();
app.getXml();
}
}
There is an idiom for StAX with a loop like this for every tag in the XML:
private MyTagObject parseMyTag(XMLStreamReader reader, String myTag) throws XMLStreamException {
MyTagObject myTagObject = new MyTagObject();
while (true) {
switch (reader.next()) {
case XMLStreamConstants.START_ELEMENT:
String localName = reader.getLocalName();
if(localName.equals("myOtherTag1")) {
myTagObject.setMyOtherTag1(parseMyOtherTag1(reader, localName));
} else if(localName.equals("myOtherTag2")) {
myTagObject.setMyOtherTag2(parseMyOtherTag2(reader, localName));
}
// and so on
break;
case XMLStreamConstants.END_ELEMENT:
if(reader.getLocalName().equals(myTag) {
return myTagObject;
}
break;
}
}
well what have you tried? you should use a transformer found here: How to pretty print XML from Java?
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
//initialize StreamResult with File object to save to file
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
System.out.println(xmlString);
Almost any useful SAX application needs to maintain a stack. When startElement is called, you push information to the stack, when endElement is called, you pop the stack. Exactly what you put on the stack depends on the application; it's often the element name. For your application, you don't actually need a full stack, you only need to know its depth. You could get by with maintaining this using depth++ in startElement and depth-- in endElement(). Then you just output depth spaces before the element name.

How to read a big XML file in JAVA and Spilt it into small XML files Based on Tag?

I am new to JAVA programming, now I in need of JAVA program to read a big XML file that containing .. tags. Sample input as follows.
Input.xml
<row>
<Name>Filename1</Name>
</row>
<row>
<Name>Filename2</Name>
</row>
<row>
<Name>Filename3</Name>
</row>
<row>
<Name>Filename4</Name>
</row>
<row>
<Name>Filename5</Name>
</row>
<row>
<Name>Filename6</Name>
</row>
.
.
I need output as first <row> </row> as a single .xml file with filename as filename1.xml
and second <row>..</row> as filename2.xml and so.
Can anyone tell the steps how to do it in simple way with Java, it will be very useful if you give any sample codes ?
You could do the following with StAX because you said your xml is large
Code for Your Use Case
The following code uses StAX APIs to break up the document as outlined in your question:
import java.io.*;
import java.util.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
public class Demo {
public static void main(String[] args) throws Exception {
Demo demo = new Demo();
demo.split("src/forum7408938/input.xml", "nickname");
//demo.split("src/forum7408938/input.xml", null);
}
private void split(String xmlResource, String condition) throws Exception {
XMLEventFactory xef = XMLEventFactory.newFactory();
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to statements element
StartDocument startDocument = xef.createStartDocument();
EndDocument endDocument = xef.createEndDocument();
XMLOutputFactory xof = XMLOutputFactory.newFactory();
while(xer.hasNext() && !xer.peek().isEndDocument()) {
boolean metCondition;
XMLEvent xmlEvent = xer.nextTag();
if(!xmlEvent.isStartElement()) {
break;
}
// Be able to split XML file into n parts with x split elements(from
// the dummy XML example staff is the split element).
StartElement breakStartElement = xmlEvent.asStartElement();
List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();
// BOUNTY CRITERIA
// I'd like to be able to specify condition that must be in the
// split element i.e. I want only staff which have nickname, I want
// to discard those without nicknames. But be able to also split
// without conditions while running split without conditions.
if(null == condition) {
cachedXMLEvents.add(breakStartElement);
metCondition = true;
} else {
cachedXMLEvents.add(breakStartElement);
xmlEvent = xer.nextEvent();
metCondition = false;
while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
cachedXMLEvents.add(xmlEvent);
if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
metCondition = true;
break;
}
xmlEvent = xer.nextEvent();
}
}
if(metCondition) {
// Create a file for the fragment, the name is derived from the value of the id attribute
FileWriter fileWriter = null;
fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");
// A StAX XMLEventWriter will be used to write the XML fragment
XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
xew.add(startDocument);
// BOUNTY CRITERIA
// The content of the spitted files should be wrapped in the
// root element from the original file(like in the dummy example
// company)
xew.add(rootStartElement);
// Write the XMLEvents that were cached while when we were
// checking the fragment to see if it matched our criteria.
for(XMLEvent cachedEvent : cachedXMLEvents) {
xew.add(cachedEvent);
}
// Write the XMLEvents that we still need to parse from this
// fragment
xmlEvent = xer.nextEvent();
while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
xew.add(xmlEvent);
xmlEvent = xer.nextEvent();
}
xew.add(xmlEvent);
// Close everything we opened
xew.add(xef.createEndElement(rootStartElement.getName(), null));
xew.add(endDocument);
fileWriter.close();
}
}
}
}
I can suggest using SAXParser and extending the DefaultHandler class' methods.
You can use a few booleans to keep a track of which tag you are in.
DefaultHandler will let you know when you are in a particular tag by the startElement() method. Then, you will be given the contents of the tag by the characters() method and finally you will be notified of the end of a tag by the endElement() method.
As soon as you are notified of the end of a <row>, you can get the contents of the tag you just saved and create a file out of it.
Looking at your example, you just need a couple of boolean values -- boolean inRow and boolean inName so this should not be a hard task =)
Example from Mykong (I am leaving out the actual code, you must do it on your own. It is fairly trivial):
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class ReadXMLFile {
public static void main(String argv[]) {
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
boolean bfname = false;
boolean blname = false;
boolean bnname = false;
boolean bsalary = false;
public void startElement(String uri, String localName,String qName,
Attributes attributes) throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equalsIgnoreCase("FIRSTNAME")) {
bfname = true;
}
if (qName.equalsIgnoreCase("LASTNAME")) {
blname = true;
}
if (qName.equalsIgnoreCase("NICKNAME")) {
bnname = true;
}
if (qName.equalsIgnoreCase("SALARY")) {
bsalary = true;
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
System.out.println("End Element :" + qName);
}
public void characters(char ch[], int start, int length) throws SAXException {
if (bfname) {
System.out.println("First Name : " + new String(ch, start, length));
bfname = false;
}
if (blname) {
System.out.println("Last Name : " + new String(ch, start, length));
blname = false;
}
if (bnname) {
System.out.println("Nick Name : " + new String(ch, start, length));
bnname = false;
}
if (bsalary) {
System.out.println("Salary : " + new String(ch, start, length));
bsalary = false;
}
}
};
saxParser.parse("c:\\file.xml", handler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
The best approach is JAXB MArshal and unmarshaller to read and create xml fils.
Here is example
Assuming that your file have element that contains those rows:
<root>
<row><Name>Filename1</Name></row>
<row><Name>Filename2</Name></row>
<row><Name>Filename3</Name></row>
<row><Name>Filename4</Name></row>
<row><Name>Filename5</Name></row>
<row><Name>Filename6</Name></row>
</root>
This code will do the trick:
package com.example;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class Main {
public static String readXmlFromFile(String fileName) throws Exception {
BufferedReader reader = new BufferedReader(new FileReader(fileName));
String line = null;
StringBuilder stringBuilder = new StringBuilder();
String lineSeparator = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(lineSeparator);
}
return stringBuilder.toString();
}
public static List<String> divideXmlByTag(String xml, String tag) throws Exception {
List<String> list = new ArrayList<String>();
Document document = loadXmlDocument(xml);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
NodeList rowList = document.getElementsByTagName(tag);
for(int i=0; i<rowList.getLength(); i++) {
Node rowNode = rowList.item(i);
if (rowNode.getNodeType() == Node.ELEMENT_NODE) {
DOMSource source = new DOMSource(rowNode);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
StreamResult streamResult = new StreamResult(baos);
transformer.transform(source, streamResult);
list.add(baos.toString());
}
}
return list;
}
private static Document loadXmlDocument(String xml) throws SAXException, IOException, ParserConfigurationException {
return loadXmlDocument(new ByteArrayInputStream(xml.getBytes()));
}
private static Document loadXmlDocument(InputStream inputStream) throws SAXException, IOException, ParserConfigurationException {
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = null;
documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document document = documentBuilder.parse(inputStream);
inputStream.close();
return document;
}
public static void main(String[] args) throws Exception {
String xmlString = readXmlFromFile("d:/test.xml");
System.out.println("original xml:\n" + xmlString + "\n");
System.out.println("divided xml:\n");
List<String> dividedXmls = divideXmlByTag(xmlString, "row");
for (String xmlPart : dividedXmls) {
System.out.println(xmlPart + "\n");
}
}
}
You only need to write this xml parts to separates files.
Since the user requested one more solution posting other way.
use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.
Advance the XMLStreamReader to the local root element of the sub-fragment.
You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
Repeat step 1 for the next fragment.
Code Example
For the following XML, output each "statement" section into a file named after the "account attributes value":
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
xsr.nextTag(); // Advance to statements element
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
t.transform(new StAXSource(xsr), new StreamResult(file));
}
}
}
If you're new to Java then the people recommending SAX and StAX parsing are throwing you in at the deep end! This is pretty low-level stuff, highly efficient, but not designed for beginners. You said the file is "big" and they've all assumed that to mean "very big", but in my experience an unquantified "big" can mean anything from 1Mb to 20Gb, so designing a solution based on that description is somewhat premature.
It's much easier to do this with XSLT 2.0 than with Java. All it takes is a stylesheet like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="row">
<xsl:result-document href="{FileName}">
<xsl:copy-of select="."/>
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>
And if it has to be within a Java application, you can easily invoke the transformation from Java using an API.
Try out this,
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
public class Test{
static public void main(String[] arg) throws Exception{
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("foo.xml");
TransformerFactory tranFactory = TransformerFactory.newInstance();
Transformer aTransformer = tranFactory.newTransformer();
NodeList list = doc.getFirstChild().getChildNodes();
for (int i=0; i<list.getLength(); i++){
Node element = list.item(i).cloneNode(true);
if(element.hasChildNodes()){
Source src = new DOMSource(element);
FileOutputStream fs=new FileOutputStream("k" + i + ".xml");
Result dest = new StreamResult(fs);
aTransformer.transform(src, dest);
fs.close();
}
}
}
}
Source: Related Answer

How do I parse my simple XML file with Java and SAX?

I am trying to parse the file below. I want to print the id and name of each passenger. Can you give me code to parse it ?
<?xml version="1.0" encoding="utf-8"?>
<root xmlns:android="www.google.com">
<passenger id = "001">
<name>Tom Cruise</name>
</passenger>
<passenger id = "002">
<name>Tom Hanks</name>
</passenger>
</root>
UPDATE
This is what i had tried. Code, problems etc mentioned here -
Error in output of a simple SAX parser
Here is a working example to start with, though I suggest you to use StAX instead, you will see that SAX is not very convenient
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class SAX2 {
public static void main(String[] args) throws Exception {
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.parse(new File("test.xml"), new DefaultHandler() {
#Override
public void startElement(String uri, String localName,
String qName, Attributes atts) throws SAXException {
if (qName.equals("passenger")) {
System.out.println("id = " + atts.getValue(0));
}
}
#Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
#Override
public void characters(char[] ch, int start, int length)
throws SAXException {
String text = new String(ch, start, length);
if (!text.trim().isEmpty()) {
System.out.println("name " + text);
}
}
});
}
}
output
id = 001
name Tom Cruise
id = 002
name Tom Hanks
Create a DocumentBuilderFactory.
Obtain a DocumentBuilder from the factory.
Use one of the parse() methods of the builder to create a Document.
Once you have a Document, you can get the passenger Elements with Document's getElementsByTagName() method.
I'm sure you'll be able to work out the rest.
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
InputStream xmlInput = new FileInputStream("theFile.xml");
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new SaxHandler();
saxParser.parse(xmlInput, handler);
} catch (Throwable err) {
err.printStackTrace ();
}
String str = "<?xml version=\"1.0\" encoding=\"utf-8\"?> " +
"<root xmlns:android=\"www.google.com\">" +
"<passenger id = \"001\">" +
"<name>Tom Cruise</name>" +
"</passenger>" +
"<passenger id = \"002\">" +
"<name>Tom Hanks</name>" +
"</passenger>" +
"</root>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(str));
final Document document = db.parse(is);
System.out.println("node Name " + document.getChildNodes().item(0).getChildNodes().item(1).getNodeName());

Categories