Java 8 - Split huge XML file using Stax gives unexpected results


When splitting a huge XML file I found a very nice solution using StAX and Transformer.transform(). It works, but some tags get lost. Why is that?
An XML file containing four <car> elements, each with a <name> child, gives the following result. On every even occurrence the enclosing <car> tag is omitted.
Element: <?xml version="1.0" encoding="UTF-8"?><car><name>car1</name></car>
Element: <?xml version="1.0" encoding="UTF-8"?><name>car2</name>
Element: <?xml version="1.0" encoding="UTF-8"?><car><name>car3</name></car>
Element: <?xml version="1.0" encoding="UTF-8"?><name>car4</name>
How can I get the right elements? Does transform(source, result) interfere with reading the input stream?
This is my code (I have seen the same pattern in many places). It makes no difference whether I use a StringReader or a FileReader.
I expected this: loop { advance to the next start tag; get access to that element }
What I see is: first the whole element, then only part of the next element, and so on.
String testCars = "<root><car><name>car1</name></car><car><name>car2</name></car><car><name>car3</name></car><car><name>car4</name></car></root>";
String element = "car";
try {
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(testCars));
streamReader.nextTag();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
while(streamReader.nextTag() == XMLStreamConstants.START_ELEMENT) {
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
t.transform(new StAXSource(streamReader), result);
System.out.println("Element: " + writer.toString());
}
} catch (Exception e) { ... }

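Thanks to Andreas, this is the solution. With the JDK's built-in transformer, transform() already leaves the reader positioned just after the element it copied, so the extra nextTag() in the loop condition skipped over the following <car> start tag. The loop has to check the current event before advancing: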
String testCars = "<root><car><name>car1</name></car><other><something>Unknown</something></other><car><name>car2</name></car></root>";
XMLInputFactory factory = XMLInputFactory.newInstance();
try {
XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(testCars));
streamReader.nextTag();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
streamReader.nextTag();
while ( streamReader.isStartElement() ||
( ! streamReader.hasNext() && streamReader.nextTag() == XMLStreamConstants.START_ELEMENT)) {
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
t.transform(new StAXSource(streamReader), result);
System.out.println( "XmlElement: " + writer.toString());
}
} catch (Exception e) { ... }
Input is:
<root>
<car>
<name>car1</name>
</car>
<other>
<something>Unknown</something>
</other>
<car>
<name>car2</name>
</car>
</root>
Output is:
XmlElement: <?xml version="1.0" encoding="UTF-8"?><car><name>car1</name></car>
XmlElement: <?xml version="1.0" encoding="UTF-8"?><other><something>Unknown</something></other>
XmlElement: <?xml version="1.0" encoding="UTF-8"?><car><name>car2</name></car>
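The loop above depends on where transform() happens to leave the reader. As an alternative sketch that avoids that dependency, the events of each child element can be copied by hand with XMLEventReader and XMLEventWriter while tracking the nesting depth. ChildSplitter and splitChildren are hypothetical names introduced only for this illustration, and the sketch assumes every direct child of the root should become its own fragment.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the original post.
class ChildSplitter {

    // Returns one XML fragment per direct child of the root element.
    static List<String> splitChildren(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        List<String> fragments = new ArrayList<>();
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        int depth = 0; // 0 = outside the root, 1 = inside the root, 2+ = inside a child

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                if (depth == 2) {
                    // A direct child of the root starts: open a new fragment.
                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);
                }
            }
            if (writer != null) {
                writer.add(event); // copy the event into the current fragment
            }
            if (event.isEndElement()) {
                if (depth == 2) {
                    // The child ends: finish the fragment and store it.
                    writer.flush();
                    writer.close();
                    fragments.add(buffer.toString());
                    writer = null;
                }
                depth--;
            }
        }
        reader.close();
        return fragments;
    }
}
```

Whitespace between the child elements is skipped because no writer is open at depth 1. For a really huge file each fragment would be written to its own file or stream instead of being collected in a list.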

Related

Java - Writing to XML file indents everything except the first element

Using Java, I open an .xml file and try to append a new node to it from a Swing application. Every new node is written correctly EXCEPT its first element, which always ends up at the far left of the file with no indentation.
schedule.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Schedule>
<Lesson>
<Title>Artificial Intelligence</Title>
<Lecture>
<Day>Thursday</Day>
</Lecture>
<Professor>John Doe</Professor>
</Lesson>
<Lesson>
<Title>Constraint Satisfaction Problems</Title>
<Lecture>
<Day>Monday</Day>
</Lecture>
</Lesson>
</Schedule>
My attempt to write to the file :
try {
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document document = documentBuilder.parse("schedule.xml");
Element root = document.getDocumentElement();
Element newLesson = document.createElement("Lesson");
Element newTitle = document.createElement("Title");
newTitle.appendChild(document.createTextNode("myLesson"));
newLesson.appendChild(newTitle);
Element newLecture = document.createElement("Lecture");
newLesson.appendChild(newLecture);
Element newDay = document.createElement("Day");
newDay.appendChild(document.createTextNode("myDay"));
newLecture.appendChild(newDay);
Element newProfessor = document.createElement("Professor");
newProfessor.appendChild(document.createTextNode("myProfessor"));
newLesson.appendChild(newProfessor);
root.appendChild(newLesson);
DOMSource source = new DOMSource(document);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
StreamResult result = new StreamResult("schedule.xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "8");
transformer.transform(source, result);
} catch (Exception e) {
e.printStackTrace();
}
Output
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Schedule>
<Lesson>
<Title>Artificial Intelligence</Title>
<Lecture>
<Day>Thursday</Day>
</Lecture>
<Professor>John Doe</Professor>
</Lesson>
<Lesson>
<Title>Constraint Satisfaction Problems</Title>
<Lecture>
<Day>Monday</Day>
</Lecture>
</Lesson>
<Lesson>
<Title>myLesson</Title>
<Lecture>
<Day>myDay</Day>
</Lecture>
<Professor>myProfessor</Professor>
</Lesson>
</Schedule>

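Solution: I used a function for whitespace trimming that removes the whitespace-only text nodes from the parsed document before it is transformed, so that the transformer indents every element the same way.
Function: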
private static void removeEmptyText(Node node){
Node child = node.getFirstChild();
while(child!=null){
Node sibling = child.getNextSibling();
if(child.getNodeType()==Node.TEXT_NODE){
if(child.getTextContent().trim().isEmpty())
node.removeChild(child);
}else
removeEmptyText(child);
child = sibling;
}
}
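A minimal sketch of where that function would be called, assuming the document is stripped right after parsing and before any new nodes are appended or the result is transformed. WhitespaceStripDemo and parseAndStrip are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical demo class; removeEmptyText is the function from the answer.
class WhitespaceStripDemo {

    // Recursively removes text nodes that contain only whitespace.
    static void removeEmptyText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node sibling = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getTextContent().trim().isEmpty()) {
                    node.removeChild(child);
                }
            } else {
                removeEmptyText(child);
            }
            child = sibling;
        }
    }

    // Parses the XML text and strips the indentation text nodes left by the old formatting.
    static Document parseAndStrip(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        removeEmptyText(document.getDocumentElement());
        return document;
    }
}
```

After parseAndStrip the tree contains only elements and meaningful text, so a Transformer with OutputKeys.INDENT set to "yes" lays out old and new nodes alike.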

Extract XML element as string including attribute namespace using StAX

Given the following XML string
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="http://a" xmlns:b="http://b">
<a:element b:attribute="value">
<subelement/>
</a:element>
</root>
I'd like to extract the element a:element as an XML string while preserving the used namespaces using StAX. So I would expect
<?xml version="1.0" encoding="UTF-8"?>
<a:element xmlns:a="http://a" xmlns:b="http://b" b:attribute="value">
<subelement/>
</a:element>
Following answers like https://stackoverflow.com/a/5170415/2391901 and https://stackoverflow.com/a/4353531/2391901, I already have the following code:
final ByteArrayInputStream inputStream = new ByteArrayInputStream(inputString.getBytes(StandardCharsets.UTF_8));
final XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
final XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(inputStream);
xmlStreamReader.nextTag();
xmlStreamReader.nextTag();
final TransformerFactory transformerFactory = TransformerFactory.newInstance();
final Transformer transformer = transformerFactory.newTransformer();
final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
transformer.transform(new StAXSource(xmlStreamReader), new StreamResult(outputStream));
final String outputString = outputStream.toString(StandardCharsets.UTF_8.name());
However, the result does not contain the namespace http://b of the attribute b:attribute (using either the default StAX parser of Java 8 or the StAX parser of Aalto XML):
<?xml version="1.0" encoding="UTF-8"?>
<a:element xmlns:a="http://a" b:attribute="value">
<subelement/>
</a:element>
How do I get the expected result using StAX?

It would be cleaner to use an xslt transform to do this. You're already using an identity transformer to perform output - just set it up to copy the target element instead of everything:
public static void main(String[] args) throws TransformerException {
String inputString =
"<root xmlns:a='http://a' xmlns:b='http://b'>" +
" <a:element b:attribute='value'>" +
" <subelement/>" +
" </a:element>" +
"</root>";
String xslt =
"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:a='http://a'>" +
" <xsl:template match='/root'>" +
" <xsl:copy-of select='a:element'/>" +
" </xsl:template>" +
"</xsl:stylesheet>";
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(new StreamSource(new StringReader(xslt)));
transformer.transform(new StreamSource(new StringReader(inputString)), new StreamResult(System.out));
}
The stax subtree transform that you're using relies on some iffy behaviour of the transformer that ships with the jdk. It didn't work when I tried it with the Saxon transformer (which complained about the trailing </root>).

Trying to read the full XML file as a String in Java

I am trying to read the whole XML file in Java. Below is my XML file-
<?xml version="1.0" encoding="UTF-8"?>
<app hash='nv', name='Tech', package = '1.0', version='13', filesize='200', create_date='01-03-1987', upate_date='07-09-2013' >
<url>
<name>RJ</name>
<score>10</score>
</url>
<url>
<name>ABC</name>
<score>20</score>
</url>
</app>
Below is the code I am using to read the full XML file shown above and then get the hash, name, package etc. values from it.
public static void main(String[] args) {
try {
File fXmlFile = new File("C:\\ResourceFile\\app.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
System.out.println(doc);
} catch (Exception e) {
}
}
As soon as I run the above program, I always get the following exception:
[Fatal Error] app.xml:2:22: Element type "app" must be followed by either attribute specifications, ">" or "/>".
Any idea why it is happening?

If you don't want to parse it as XML and only want to show it as a String, you can read it with a BufferedReader and readLine(), collect the lines in a StringBuilder, and then display the result (see "How to read a file").
Example:
public String readFile(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        String currentLine;
        // readLine() strips the line terminator, so add one back after each line
        while ((currentLine = br.readLine()) != null) {
            sb.append(currentLine).append('\n');
        }
    }
    return sb.toString();
}
EDIT: In Java 8 you can simply use
String xml = Files.lines(Paths.get(path)).collect(Collectors.joining("\n"));

There is a syntax error in your XML. The attributes of an element must be separated by whitespace, not by commas. It should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<app hash='nv' name='Tech' package='1.0' version='13' filesize='200' create_date='01-03-1987' upate_date='07-09-2013' >
<url>
<name>RJ</name>
<score>10</score>
</url>
<url>
<name>ABC</name>
<score>20</score>
</url>
</app>
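Once the commas are gone, the attribute values the question asks about can be read from the root element. The following is a minimal sketch under that assumption; AppAttributesDemo and parseRoot are hypothetical names, and the XML is inlined as a string (with fewer attributes) instead of being read from the file in the question. Note that printing the Document object itself does not print the XML text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; only the XML content comes from the question.
class AppAttributesDemo {

    // Parses the XML text and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<app hash='nv' name='Tech' package='1.0' version='13'>"
                + "<url><name>RJ</name><score>10</score></url>"
                + "<url><name>ABC</name><score>20</score></url>"
                + "</app>";
        Element app = parseRoot(xml);
        // getAttribute returns an empty string when the attribute is absent
        System.out.println("hash    = " + app.getAttribute("hash"));
        System.out.println("name    = " + app.getAttribute("name"));
        System.out.println("package = " + app.getAttribute("package"));
        System.out.println("urls    = " + app.getElementsByTagName("url").getLength());
    }
}
```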

try{
InputStream is = getAssets().open("HeadWork_JackWell.xml");
DocumentBuilderFactory dFactory= DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder= dFactory.newDocumentBuilder();
Document doc= dBuilder.parse(is);
try {
StringWriter sw = new StringWriter();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(doc), new StreamResult(sw));
String s=sw.toString();
System.out.println(s);
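} catch (Exception ex) {
throw new RuntimeException("Error converting to String", ex);
}
} catch (Exception e) {
// closes the outer try that was left open in the original snippet
throw new RuntimeException("Error reading or parsing the XML", e);
}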

How to unformat xml file

I have a method that returns a String containing formatted XML. The method reads the XML from a file on the server into the String.
Essentially, what the method currently does is:
private ServletConfig config;
InputStream xmlIn = null ;
xmlIn = config.getServletContext().getResourceAsStream(filename + ".xml") ;
String xml = IOUtils.toString(xmlIn);
IOUtils.closeQuietly(xmlIn);
return xml;
What I need to do is add a new input argument, and based on that value, continue returning the formatted xml, or return unformatted xml.
By formatted XML I mean something like:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
And by unformatted XML I mean something like:
<xml><root><elements><elem1/><elem2/></elements></root></xml>
or:
<xml>
<root>
<elements>
<elem1/>
<elem2/>
</elements>
</root>
</xml>
Is there a simple way to do this?

Strip all newline characters with String xml = IOUtils.toString(xmlIn).replace("\n", ""). Or strip "\t" instead to keep the separate lines but drop the tab indentation.

If you are sure that the formatted XML looks like this:
<xml>
    <root>
        <elements>
            <elem1/>
            <elem2/>
        </elements>
    </root>
</xml>
you can match ^(\s*)< and replace group 1 with "". This way, the text content inside the XML is not changed.
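The two regex-based suggestions above might be sketched as follows. XmlUnformat, stripIndent and collapse are hypothetical names, and the sketch assumes simple markup like the sample: no CDATA sections, comments, or mixed content whose whitespace matters.

```java
// Hypothetical helper illustrating the regex-based suggestions.
class XmlUnformat {

    // Removes leading spaces and tabs in front of a tag but keeps the line breaks.
    static String stripIndent(String xml) {
        // (?m) makes ^ match at the start of every line
        return xml.replaceAll("(?m)^[ \\t]+<", "<");
    }

    // Removes all whitespace between adjacent tags, producing a single line.
    static String collapse(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }

    public static void main(String[] args) {
        String formatted = "<xml>\n  <root>\n    <elem1/>\n  </root>\n</xml>";
        System.out.println(stripIndent(formatted));
        System.out.println(collapse(formatted));
    }
}
```

A boolean argument on the method from the question could then choose between returning the String as read and returning stripIndent(xml) or collapse(xml). Both helpers work on raw text, so a parser-based approach is safer when the content is not under your control.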

Use an identity transformer whose indent property is set from a parameter, like so:
public static String getStringFromDocument(Document dom, boolean indented) {
String signedContent = null;
try {
StringWriter sw = new StringWriter();
DOMSource domSource = new DOMSource(dom);
// newInstance() avoids depending on a vendor-specific TransformerFactoryImpl class
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");
trans.transform(domSource, new StreamResult(sw));
sw.flush();
signedContent = sw.toString();
} catch (TransformerException e) {
e.printStackTrace();
}
return signedContent;
}
This works for me.
The key lies in this line:
trans.setOutputProperty(OutputKeys.INDENT, indented ? "yes" : "no");

Try something like the following:
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
new StreamSource(new StringReader(
"<xsl:stylesheet version=\"1.0\"" +
" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
"<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>" +
" <xsl:strip-space elements=\"*\"/>" +
" <xsl:template match=\"@*|node()\">" +
" <xsl:copy>" +
" <xsl:apply-templates select=\"@*|node()\"/>" +
" </xsl:copy>" +
" </xsl:template>" +
"</xsl:stylesheet>"
))
);
Source source = new StreamSource(new StringReader("xml string here"));
StreamResult result = new StreamResult(System.out);
transformer.transform(source, result);
Instead of source being StreamSource in the second instance, it can also be DOMSource if you have an in-memory Document, if you want to modify the DOM before saving.
DOMSource source = new DOMSource(document);
To read an XML file into a Document object:
File file = new File("c:\\MyXMLFile.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(file);
doc.getDocumentElement().normalize();
Enjoy :)

If you fancy trying your hand with JAXB then the marshaller has a handy property for setting whether to format (use new lines and indent) the output or not.
JAXBContext jc = JAXBContext.newInstance(packageName);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
m.marshal(element, outputStream);
Quite an overhead to get to that stage though... perhaps a good option if you already have a solid xsd

You can:
1) remove all runs of consecutive whitespace (but not single whitespace characters) and then replace every >(whitespace)< with ><
This is applicable only if the useful content does not contain runs of significant whitespace.
2) read it in some dom tree and serialize it using some nonpretty serialization
SAXReader reader = new SAXReader();
Reader r = new StringReader(data);
Document document = reader.read(r);
OutputFormat format = OutputFormat.createCompactFormat();
StringWriter sw = new StringWriter();
XMLWriter writer = new XMLWriter(sw, format);
writer.write(document);
// read the result from the StringWriter, not from the XMLWriter
String string = sw.toString();
3) use Canonicalization (but you must somehow explain to it that those whitespaces you want to remove are insignificant)

Kotlin.
An indentation will usually come after new line and formatted as one space or more. Hence, to make everything in the same column, we will replace all of the new lines, following one or more spaces:
xmlTag = xmlTag.replace("(\n +)".toRegex(), " ")

Java:XML Parser

I have a response XML something like this -
<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>
I want to extract the whole content from <Fromhere> to </Fromhere> in a string. Is it possible to do that through any string function or through XML parser?
Please advise.

You could try an XPath approach to keep the XML parsing simple:
InputStream response = new ByteArrayInputStream(("<Response> <aa> "
+ "<Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> "
+ "</aa> </Response>").getBytes()); /* Or whatever. */
DocumentBuilder builder = DocumentBuilderFactory
.newInstance().newDocumentBuilder();
Document doc = builder.parse(response);
XPath xpath = XPathFactory.newInstance().newXPath();
// Element names are case-sensitive: the element is <Fromhere>, not <FromHere>.
// string() yields the concatenated text content, not the markup.
XPathExpression expr = xpath.compile("string(/Response/aa/Fromhere)");
String result = (String)expr.evaluate(doc, XPathConstants.STRING);
Note that I haven't tried this code. It may need tweaking.
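Since string() returns only text content, getting the markup between <Fromhere> and </Fromhere> needs one more step. One way, sketched here as an assumption and not as the answerer's method, is to select the element as a node and serialize it with an identity Transformer. FragmentExtractor and extract are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answers.
class FragmentExtractor {

    // Returns the markup of the first node matching the XPath expression, or null if none matches.
    static String extract(String xml, String xpathExpression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpression, doc, XPathConstants.NODE);
        if (node == null) {
            return null;
        }
        // An identity transformer serializes the selected subtree as-is.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }
}
```

Calling extract(responseXml, "/Response/aa/Fromhere") would return the element including its own start and end tags; responseXml stands for the response text from the question.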

Through an XML parser. Using string functions to parse XML is a bad idea...
Beside the Sun tutorials pointed out above, you can check the DZone Refcardz on Java and XML, I found it was a good, terse explanation how to do it.
But well, there is probably plenty of Web resources on the topic, including on this very site.

You can apply an XSLT stylesheet to extract the desired content.
This stylesheet should fit your example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/Response/aa/Fromhere/*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Apply it with something like the following (exception handling not included):
String xml = "<Response> <aa> <Fromhere> <a1>Content</a1> <a2>Content</a2> </Fromhere> </aa> </Response>";
Source xsl = new StreamSource(new FileReader("/path/to/file.xsl"));
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(xsl);
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StringWriter out = new StringWriter();
transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
System.out.println(out.toString());
This should work with any version of Java starting with 1.4.

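This should work:
import java.util.regex.*;
// DOTALL lets '.' match line breaks in case the response spans several lines
Pattern p = Pattern.compile("<Fromhere>.*</Fromhere>", Pattern.DOTALL);
Matcher m = p.matcher(responseString);
// group() may only be called after a successful match attempt
String whatYouWant = m.find() ? m.group() : null;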
It would be a little more verbose to use Scanner, but that could work too.
Whether this is a good idea is for someone more experienced than I.

One option is to use a StreamFilter:
class MyFilter implements StreamFilter {
private boolean on;
@Override
public boolean accept(XMLStreamReader reader) {
final String element = "Fromhere";
if (reader.isStartElement() && element.equals(reader.getLocalName())) {
on = true;
} else if (reader.isEndElement()
&& element.equals(reader.getLocalName())) {
on = false;
return true;
}
return on;
}
}
Combined with a Transformer, you can use this to safely parse logically-equivalent markup like this:
<Response>
<!-- <Fromhere></Fromhere> -->
<aa>
<Fromhere>
<a1>Content</a1> <a2>Content</a2>
</Fromhere>
</aa>
</Response>
Demo:
StringWriter writer = new StringWriter();
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader reader = inputFactory
.createXMLStreamReader(new StringReader(xmlString));
reader = inputFactory.createFilteredReader(reader, new MyFilter());
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new StAXSource(reader), new StreamResult(writer));
System.out.println(writer.toString());
This is a programmatic variation on Massimiliano Fliri's approach.
