How to preserve whitespace in attributes when using XMLStreamWriter?

When using javax.xml.stream.XMLStreamWriter, is there any way to preserve the whitespace within attributes? I understand that the XMLStreamReader will perform Attribute-Value Normalization, converting \r\n\t in the XML into spaces, so it's up to the writer to emit entity references (e.g. &#13;) to preserve the whitespace. Is there any way to tell the writer to use entity references for whitespace? Can I add entity references to attributes myself?
The following JUnit3 test passes. When I encode "Hello,\r\n\tWorld! ", I want to get the same thing back out. But instead, the decoded value is "Hello,  World! " (two spaces after the comma).
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.*;
import junit.framework.TestCase;

public class XmlStreamTest extends TestCase {
    public void testAttribute() throws XMLStreamException {
        StringWriter stringWriter = new StringWriter();
        XMLStreamWriter xmlStreamWriter = XMLOutputFactory.newFactory().createXMLStreamWriter(stringWriter);
        xmlStreamWriter.writeStartDocument();
        xmlStreamWriter.writeStartElement("root");
        xmlStreamWriter.writeAttribute("a", "Hello,\r\n\tWorld! ");
        xmlStreamWriter.writeEndElement();
        xmlStreamWriter.writeEndDocument();
        xmlStreamWriter.close();
        assertEquals("<?xml version=\"1.0\" ?><root a=\"Hello,\r\n\tWorld! \"></root>", stringWriter.toString());
        StringReader stringReader = new StringReader(stringWriter.toString());
        XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(stringReader);
        assertEquals(XMLStreamConstants.START_DOCUMENT, xmlStreamReader.getEventType());
        assertEquals(XMLStreamConstants.START_ELEMENT, xmlStreamReader.next());
        // This is not what I want! I want the value to be the same as I originally gave!
        // The reader turns "\r\n" into one space and "\t" into another, hence two spaces.
        assertEquals("Hello,  World! ", xmlStreamReader.getAttributeValue(null, "a"));
        assertEquals(XMLStreamConstants.END_ELEMENT, xmlStreamReader.next());
        assertEquals(XMLStreamConstants.END_DOCUMENT, xmlStreamReader.next());
    }
}
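One possible workaround, sketched here under assumptions rather than offered as a definitive answer: as far as I can tell, the standard javax.xml.stream API defines no property for this, and writeAttribute escapes "&" itself, so a pre-built "&#13;" would not survive it. The sketch below therefore builds the attribute text by hand, writing CR, LF and TAB as numeric character references, and then checks that a stock XMLStreamReader returns the original value. The class AttributeWhitespaceSketch and its methods are hypothetical names, not part of any library.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of javax.xml.stream.
class AttributeWhitespaceSketch {

    // Escapes markup characters and writes CR, LF and TAB as numeric character references.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\r': sb.append("&#13;");  break;
                case '\n': sb.append("&#10;");  break;
                case '\t': sb.append("&#9;");   break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Parses the document and returns attribute "a" of the root element.
    static String readAttributeA(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // advance from START_DOCUMENT to the root START_ELEMENT
                return reader.getAttributeValue(null, "a");
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String original = "Hello,\r\n\tWorld! ";
        // The attribute is serialized by hand because writeAttribute would escape the '&'.
        String xml = "<root a=\"" + escapeAttribute(original) + "\"></root>";
        System.out.println(xml);
        System.out.println(original.equals(readAttributeA(xml)));
    }
}
```

The trade-off is that the element carrying such attributes is no longer written through XMLStreamWriter, so this only suits cases where hand-serializing that part of the document is acceptable.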

Related

XML to csv in java using DOM

I want to convert an XML file to CSV (a comma-separated file), and I am using the DOM parser in Java for that.
The output of the code below is: AAA123456
The desired output is: AAA,123,456
This is what I have developed so far. I would like to separate the values node by node to get CSV.
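import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.SAXException;

public class Main {
    static public final String SEPARATOR = ",";

    private static String decodeDetailOutputRecordXML(String str) throws ParserConfigurationException, IOException, SAXException {
        str = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(str.getBytes()));
        DocumentTraversal traversal = (DocumentTraversal) doc;
        NodeIterator iterator = traversal.createNodeIterator(doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
        for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
            // Prints the concatenated text of each element, with no separator
            System.out.println(n.getTextContent());
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        decodeDetailOutputRecordXML(null);
    }
}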

This answer demonstrates how to use the DOM API to convert the XML format under consideration to CSV. The example code below uses the DOM API directly and OpenCSV to write the CSV file.
The Example XML
<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
<c>Somedata0</c>
<d>Somedata1</d>
<e>Somedata2</e>
</b>
<b>
<c>Xdata0</c>
<d>Xdata1</d>
<e>Xdata2</e>
</b>
</a>
The routine that converts the XML to CSV
package org.test;
import java.io.FileInputStream;
import java.io.FileWriter;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import com.opencsv.CSVWriter;
public class XMLToCSVTest {
public static void main(String[] args) throws Exception{
String inputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdata.xml";
String outputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdataOut.csv";
/*
* We assume that we know the structure and the column names of the CSV file
*/
String[] csvHeaders=new String[] {"c","d","e"};
/*
* Using Xerces DOM parser directly, same can also be achieved through JAXP
*/
DOMParser parser=new DOMParser();
try(FileInputStream fis=new FileInputStream(inputFilePath);
CSVWriter writer=new CSVWriter(new FileWriter(outputFilePath));){
/*
* Write the CSV headers
*/
writer.writeNext(csvHeaders);
InputSource source=new InputSource(fis);
parser.parse(source);
Element documentElement=parser.getDocument().getDocumentElement();
/*
* We assume that we know the structure of the XML completely and we also assume the data is actually there, that is
* no elements are missing being optional.
*/
NodeList elementBList=documentElement.getElementsByTagName("b");
for(int i=0;i<elementBList.getLength();i++) {
Element elementB=(Element)elementBList.item(i);
Element elementC=(Element)elementB.getElementsByTagName("c").item(0);
Element elementD=(Element)elementB.getElementsByTagName("d").item(0);
Element elementE=(Element)elementB.getElementsByTagName("e").item(0);
String[] line=new String[] {elementC.getFirstChild().getNodeValue(),
elementD.getFirstChild().getNodeValue(),
elementE.getFirstChild().getNodeValue()};
writer.writeNext(line);
}//for closing
writer.flush();
}catch(Exception e) {e.printStackTrace();}
}//main closing
}//class closing
The CSV output
"c","d","e"
"Somedata0","Somedata1","Somedata2"
"Xdata0","Xdata1","Xdata2"
NOTE: The above is one way to convert XML to CSV using the DOM API directly. While the direct DOM API gives a lot of flexibility, it is also somewhat complicated to use. Because XML is hierarchical and CSV is flat, some XML is hard to express as CSV without either losing fidelity or complicating the CSV structure; a case in point is a child element that occurs multiple times (multi-valued data in general). The CSV output could also be written by hand as part of the routine, but that would be tedious and error-prone, which is why OpenCSV is used here.
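For comparison, here is a minimal sketch of the same idea using only JAXP and no third-party libraries. It assumes, as in the question, that each <b> element is one row and that each of its child elements is one cell. The class XmlRowsToCsvSketch and its method are hypothetical names, and the sketch does no CSV quoting, so values containing commas or quotes would still need a library such as OpenCSV.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class XmlRowsToCsvSketch {
    static final String SEPARATOR = ",";

    // Returns one CSV line per <b> element, joining the text of its child elements.
    static List<String> toCsvLines(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> lines = new ArrayList<>();
            NodeList rows = doc.getDocumentElement().getElementsByTagName("b");
            for (int i = 0; i < rows.getLength(); i++) {
                List<String> cells = new ArrayList<>();
                NodeList children = rows.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip whitespace-only text nodes between the child elements
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        cells.add(child.getTextContent().trim());
                    }
                }
                lines.add(String.join(SEPARATOR, cells));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not convert XML to CSV", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
        toCsvLines(xml).forEach(System.out::println); // prints AAA,123,456
    }
}
```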

javax.xml.transform.Transformer line endings no longer respect system property "line.separator"

A comment on the question "How to control line endings that javax.xml.transform.Transformer creates?" suggests setting the system property "line.separator". This worked for me (and is acceptable for my task at hand) in Java 8 (Oracle JDK 1.8.0_171), but not in Java 11 (openjdk 11.0.1).
From ticket XALANJ-2137 I made an (uneducated, as I don't even know which javax.xml implementation I am using) guess to try setOutputProperty("{http://xml.apache.org/xslt}line-separator", ..) or maybe setOutputProperty("{http://xml.apache.org/xalan}line-separator", ..), but neither works.
How can I control the transformer's line breaks in Java 11?
Here's some demo code which prints "... #13 #10 ..." under Windows with Java 11, where it should print "... #10 ..." only.
package test.xml;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.stream.Collectors;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class TestXmlTransformerLineSeparator {
public static void main(String[] args) throws Exception {
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><foo/></root>";
final String lineSep = "\n";
String oldLineSep = System.setProperty("line.separator", lineSep);
try {
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
transformer.setOutputProperty("{http://xml.apache.org/xalan}line-separator", lineSep);
transformer.setOutputProperty("{http://xml.apache.org/xslt}line-separator", lineSep);
StreamSource source = new StreamSource(new StringReader(xml));
StringWriter writer = new StringWriter();
StreamResult target = new StreamResult(writer);
transformer.transform(source, target);
System.out.println(writer.toString().chars().mapToObj(c -> c <= ' ' ? "#" + c : "" + Character.valueOf((char) c))
.collect(Collectors.joining(" ")));
System.out.println(writer);
} finally {
System.setProperty("line.separator", oldLineSep);
}
}
}

As far as I can tell, the only way that you can control the line separator that the default Java implementation of Transformer interface uses in Java 11 is to set the line.separator property on the Java command line. For the simple example program here, you could do that by creating a text file named javaArgs reading
-Dline.separator="\n"
and executing the program with the command line
java @javaArgs TestXmlTransformerLineSeparator
The @ syntax that was introduced in Java 9 is useful here because the @-file is parsed in a way that will convert the "\n" into the LF line separator. It's possible to accomplish the same thing without an @-file, but the only ways I know of require more complicated OS-dependent syntax to define a variable that contains the line separator you want and having the java command line expand the variable.
If the line separator that you want is CRLF, then the javaArgs file would instead read
-Dline.separator="\r\n"
Within a larger program changing the line.separator variable for the entire application may well be unacceptable. To avoid setting the line.separator for an entire application, it would be possible to launch a separate Java process with the command line just discussed, but the overhead of launching the process and communicating with the separate process to transfer the data that the Transformer is supposed to write to a stream would probably make that an undesirable solution.
So realistically, a better solution would probably be to implement a FilterWriter that filters the output stream to convert the line separator to the line separator that you want. This solution does not change the line separator used within the transformer itself and might be considered post-processing the result of the transformer, so in a sense it is not an answer to your specific question, but I think it does give the desired result without a lot of overhead. Here is an example that uses a FilterWriter to remove all CR characters (that is, carriage returns) from the output writer.
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.stream.Collectors;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class TransformWithFilter {
private static class RemoveCRFilterWriter extends FilterWriter {
RemoveCRFilterWriter(Writer wrappedWriter) {
super(wrappedWriter);
}
@Override
public void write(int c) throws IOException {
if (c != (int)('\r')) {
super.write(c);
}
}
@Override
public void write(char[] cbuf, int offset, int length) throws IOException {
int localOffset = offset;
for (int i = localOffset; i < offset + length; ++i) {
if (cbuf[i] == '\r') {
if (i > localOffset) {
super.write(cbuf, localOffset, i - localOffset);
}
localOffset = i + 1;
}
}
if (localOffset < offset + length) {
super.write(cbuf, localOffset, offset + length - localOffset);
}
}
@Override
public void write(String str, int offset, int length) throws IOException {
int localOffset = offset;
for (int i = localOffset; i < offset + length; ++i) {
if (str.charAt(i) == '\r') {
if (i > localOffset) {
super.write(str, localOffset, i - localOffset);
}
localOffset = i + 1;
}
}
if (localOffset < offset + length) {
super.write(str, localOffset, offset + length - localOffset);
}
}
}
public static void main(String[] args) throws Exception {
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><foo/></root>";
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
StreamSource source = new StreamSource(new StringReader(xml));
StringWriter stringWriter = new StringWriter();
FilterWriter writer = new RemoveCRFilterWriter(stringWriter);
StreamResult target = new StreamResult(writer);
transformer.transform(source, target);
System.out.println(stringWriter.toString().chars().mapToObj(c -> c <= ' ' ? "#" + c : "" + Character.valueOf((char) c))
.collect(Collectors.joining(" ")));
System.out.println(stringWriter);
}
}
Another practical solution to the problem of serializing XML is to obtain a DOM representation of the XML either by using the Transformer to get a DOMResult or by directly parsing into a DOM and writing out the DOM with an LSSerializer, which provides explicit support for setting the line separator. Since that moves away from using the Transformer and there are other examples of it on Stack Overflow, I will not discuss it further here.
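For readers who want a starting point for that LSSerializer route anyway, a minimal sketch could look like the following. It assumes the JDK's default DOM implementation, where the DOMImplementation object can be cast to DOMImplementationLS; the class name LsSerializerLineSeparatorSketch and the serialize method are hypothetical. Note that writeToString produces a string whose XML declaration names UTF-16, so writing to a stream through an LSOutput may be preferable in real code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class LsSerializerLineSeparatorSketch {

    // Parses the XML into a DOM and serializes it with the requested line separator.
    static String serialize(String xml, String newLine) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumption: the DOM implementation in use also implements DOMImplementationLS
            DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
            LSSerializer serializer = ls.createLSSerializer();
            serializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
            serializer.setNewLine(newLine);
            return serializer.writeToString(doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><foo/></root>";
        String out = serialize(xml, "\n");
        System.out.println(out.indexOf('\r') < 0 ? "no CR in output" : "CR found in output");
        System.out.println(out);
    }
}
```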
What might be useful, though, is reviewing what changed in Java 11 and why I think there isn't another way to control the line separator used by Java's default implementation of the Transformer. Java's default implementation of the Transformer interface uses the ToXMLStream class that inherits from com.sun.org.apache.xml.internal.serializer.ToStream and is implemented in the same package. Reviewing the commit history of OpenJDK, I found that src/java.xml/share/classes/com/sun/org/apache/xml/internal/serializer/ToStream.java was changed from reading the line.separator property as currently defined in the system properties to instead reading System.lineSeparator(), which corresponds to the line separator at initialization of the Java virtual machine. This commit was first released in Java 11, so the code in the question should behave the same as it did in Java 8 up to and including Java 10.
If you spend some time reading ToStream.java as it existed after the commit that changed how the line separator is read, especially focusing on lines 135 to 140 and 508 to 514, you will notice that the serializer implementation does support using other line separators, and in fact, the output property identified as
{http://xml.apache.org/xalan}line-separator
is supposed to be a way to control which line separator is used.
Why doesn't the example in the question work, then? Answer: In the current Java default implementation of the Transformer interface, only a specific few of the properties that the user sets are transferred to the serializer. These are primarily the properties that are defined in the XSLT specification, but the special indent-amount property is also transferred. The line separator output property, though, is not one of the properties that is transferred to the serializer.
Output properties that are explicitly set on the Transformer itself using setOutputProperty are transferred to the serializer by the setOutputProperties method defined on lines 1029-1128 of com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl. If you instead define an explicit XSLT transform and use its <xsl:output> tag to set the output properties, the properties that are transferred to the serializer are filtered first of all by the parseContents method defined on lines 139-312 of com.sun.org.apache.xalan.internal.xsltc.compiler.Output and filtered again in the transferOutputSettings method defined on lines 671-715 of com.sun.org.apache.xalan.internal.xsltc.runtime.AbstractTranslet.
So to summarize, it appears that there is no output property that you can set on the default Java implementation of the Transformer interface to control the line separators that it uses. There may well be other providers of Transformer implementations that do provide control of the line separator, but I have no experience with any implementation of the Transformer interface in Java 11 other than the default implementation that is provided with the OpenJDK release.

XML file reading in Java

Is it necessary to know the structure and tags of an XML file completely before reading it in Java?
areaElement.getElementsByTagName("checked").item(0).getTextContent()
I don't know the field name "checked" before I read the file. Is there any way to list all the tags in the XML file, basically the file structure?

I wrote this DOM parser myself. It uses recursion to parse your XML without knowing a single tag name in advance, and it writes out each node's text content, if any, in document order. You can uncomment the commented-out section in the following code to print the node name as well. Hope it helps.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class RecDOMP {
public static void main(String[] args) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
// replace following path with your input xml path
Document doc = db.parse(new FileInputStream(new File ("D:\\ambuj\\ATT\\apip\\APIP_New.xml")));
// replace following path with your output xml path
File OutputDOM = new File("D:\\ambuj\\ATT\\apip\\outapip1.txt");
FileOutputStream fostream = new FileOutputStream(OutputDOM);
OutputStreamWriter oswriter = new OutputStreamWriter (fostream);
BufferedWriter bwriter = new BufferedWriter(oswriter);
// if file doesnt exists, then create it
if (!OutputDOM.exists()) {
OutputDOM.createNewFile();}
visitRecursively(doc,bwriter);
bwriter.close(); oswriter.close(); fostream.close();
System.out.println("Done");
}
public static void visitRecursively(Node node, BufferedWriter bw) throws IOException{
// get all child nodes
NodeList list = node.getChildNodes();
for (int i=0; i<list.getLength(); i++) {
// get child node
Node childNode = list.item(i);
if (childNode.getNodeType() == Node.TEXT_NODE)
{
//System.out.println("Found Node: " + childNode.getNodeName()
// + " - with value: " + childNode.getNodeValue()+" Node type:"+childNode.getNodeType());
String nodeValue= childNode.getNodeValue();
nodeValue=nodeValue.replace("\n","").replaceAll("\\s","");
if (!nodeValue.isEmpty())
{
System.out.println(nodeValue);
bw.write(nodeValue);
bw.newLine();
}
}
visitRecursively(childNode,bw);
}
}
}

You should definitely check out libraries for this, like dom4j (http://dom4j.sourceforge.net/). They can parse the whole XML document and let you not only list things like elements but do XPath queries and other such cool stuff on them.
There is a performance hit, especially in large XML documents, so you will want to check on the performance hit for your use case before committing to a library. This is especially true if you only need a small bit out of the XML document (and you kind of know what you are looking for already).

The answer to your question is no, it is not necessary to know any element names in advance. For example, you can walk the tree to discover the element names. But it all depends what you are actually trying to do.
For the vast majority of applications, incidentally, the Java DOM is one of the worst ways to solve the problem. But I won't comment further without knowing your project requirements.
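To illustrate the "walk the tree" suggestion, here is a minimal sketch that lists every element of a document as a slash-separated path, without knowing any tag name beforehand. The class ElementPathLister and its methods are hypothetical names, and the sample XML in main is made up around the "checked" element from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class ElementPathLister {

    // Returns the path of every element in document order, e.g. "/areas/area/checked".
    static List<String> listElementPaths(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> paths = new ArrayList<>();
            walk(doc.getDocumentElement(), "", paths);
            return paths;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Depth-first walk that records element nodes and ignores text, comments, etc.
    private static void walk(Node node, String parentPath, List<String> paths) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        String path = parentPath + "/" + node.getNodeName();
        paths.add(path);
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, path, paths);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document
        String xml = "<areas><area><checked>true</checked><name>x</name></area></areas>";
        listElementPaths(xml).forEach(System.out::println);
    }
}
```

Repeated elements produce repeated paths; collecting the result into a LinkedHashSet instead would give the distinct structure only.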

remove whitespaces inside XML tag with java

I am receiving XML with the following tags. I read the XML file in Java with a SAX parser and save the values to a database, but there seems to be whitespace after the p tag, as shown below.
<Inclusions><![CDATA[<p> </p><ul> <li>Small group walking tour</li> <li>Entrance fees</li> <li>Professional guide </li> <li>Guaranteed to skip the long lines</li> <li>Headsets to hear the guide clearly</li> </ul>
<p></p>]]></Inclusions>
But when we insert the string we read into the database (PostgreSQL 8), it shows bad characters like the ones below in place of that whitespace.
\011\011\011\011\011\011\011\011\011\011\011\011 Small
group walking tour Entrance fees Professional guide
Guaranteed to skip the long lines Headsets to hear
the guide clearly \012\011\011\011\011\011
I want to know why it is printing bad characters (\011\011) like that.
What is the best way to remove whitespace inside XML tags with Java? (Or how can I prevent those bad characters?)
I have checked samples, but most of them are in Python.
This is how the XML is read with SAX in my program:
Method 1
// ResultHandler is the class that used to read the XML.
ResultHandler handler = new ResultHandler();
// Use the default parser
SAXParserFactory factory = SAXParserFactory.newInstance();
// Retrieve the XML file
FileInputStream in = new FileInputStream(new File(inputFile)); // input file is XML.
// Parse the XML input
SAXParser saxParser = factory.newSAXParser();
saxParser.parse( in , handler);
This is the ResultHandler class that Method 1 uses to read the XML with the SAX parser:
import org.apache.log4j.Logger;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
// other imports
class ResultHandler extends DefaultHandler {
    private static final Logger logger = Logger.getLogger(ResultHandler.class);
    // Text collected for the element currently being read
    private String strValue = "";

    public void startDocument() {
        logger.debug("Start document");
    }

    public void endDocument() {
        logger.debug("End document");
    }

    public void startElement(String namespaceURI, String localName, String qName, Attributes attribs)
            throws SAXException {
        strValue = "";
        // add logic with start of tag.
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        //logger.debug("characters");
        strValue += new String(ch, start, length);
        //logger.debug("strValue-->"+strValue);
    }

    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        // add logic to end of tag.
    }
}
So I need to know how to set setIgnoringElementContentWhitespace(true), or something similar, with the SAX parser.

You can try setting the following on your DocumentBuilderFactory:
setIgnoringElementContentWhitespace(true)
because of this:
Due to reliance on the content model this setting requires the parser
to be in validating mode
you also need to set
setValidating(true)
Alternatively, str = str.replaceAll("\\s+", " "); might work as well. (Replacing with an empty string instead would also remove the spaces between words.)

I am also still looking for an exact answer, but I think this will help you.
The characters you see are written in C/Modula-3 style octal notation:
\011 is for Horizontal tab (ASCII HT)
\012 is for Line feed (ASCII NL, newline)
You can replace runs of whitespace with a single space as follows:
str = str.replaceAll("\\s+", " ");
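Putting the pieces together, a minimal sketch of a SAX handler that collapses tabs and newlines while reading might look like this. It is loosely modeled on the ResultHandler from the question; the class WhitespaceCollapsingHandler and its methods are hypothetical names, and it only remembers the text of the most recently closed element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration, loosely modeled on the question's ResultHandler.
class WhitespaceCollapsingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String lastValue = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lastValue = collapse(text.toString());
    }

    // Replaces each run of tabs, newlines and spaces with one space and trims the ends.
    static String collapse(String value) {
        return value.replaceAll("\\s+", " ").trim();
    }

    // Parses the XML and returns the collapsed text of the last element that was closed.
    static String parseAndCollapse(String xml) {
        try {
            WhitespaceCollapsingHandler handler = new WhitespaceCollapsingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.lastValue;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Inclusions><![CDATA[<p>\t\t</p><ul>\n\t<li>Small group walking tour</li>\n</ul>]]></Inclusions>";
        System.out.println(parseAndCollapse(xml));
    }
}
```

Collapsing in endElement rather than in characters matters because the parser may deliver one element's text in several chunks.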

Java Plist XML Parsing

I'm parsing a (not well formed) Apple Plist File with java.
My Code looks like this:
InputStream in = new FileInputStream( "foo" );
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader( in );
while (parser.hasNext()){
XMLEvent event = parser.nextEvent();
//code to navigate the nodes
}
The parts I'm parsing look like this:
<dict>
<key>foo</key><integer>123</integer>
<key>bar</key><string>Boom &amp; Shroom</string>
</dict>
My problem is that nodes containing an ampersand are not parsed the way they should be, because the ampersand represents an entity.
What can I do to get the value of the node as one complete String instead of broken parts?
Thank you in advance.

You should be able to solve your problem by setting the IS_COALESCING property on the XMLInputFactory (I also prefer XMLStreamReader over XMLEventReader, but ymmv):
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
InputStream in = // ...
XMLStreamReader xmlReader = factory.createXMLStreamReader(in, "UTF-8");
Incidentally, to the best of my knowledge none of the JDK parsers will handle "not well formed" XML without choking. Your XML is, in fact, well-formed: it uses an entity rather than a raw ampersand.
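A minimal, self-contained sketch of the coalescing approach follows, assuming the JDK's default StAX implementation. The class CoalescingSketch and its method are hypothetical names, and the input is the <string> element from the question.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
class CoalescingSketch {

    // Returns the text of the first CHARACTERS event inside the root element.
    static String readFirstText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to merge adjacent character data into one event
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // move to the root START_ELEMENT
                reader.next();    // move to the CHARACTERS event that follows it
                return reader.getText();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFirstText("<string>Boom &amp; Shroom</string>"));
    }
}
```

Without the IS_COALESCING property, a parser is free to report the text before the entity, the entity itself, and the text after it as separate events, which matches the "broken parts" described in the question.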

There is a predefined method getElementText(), which is buggy in jdk1.6.0_15, but works ok with jdk1.6.0_19. A complete program to easily parse the plist file is this:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
public class Parser {
    public static void main(String[] args) throws XMLStreamException, IOException {
        InputStream in = new FileInputStream("foo.xml");
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader parser = factory.createXMLEventReader(in);
        // Consume the event outside the assert so it still runs when assertions are disabled
        XMLEvent startDocument = parser.nextEvent();
        assert startDocument.isStartDocument();
        XMLEvent event = parser.nextTag();
        assert event.isStartElement();
        final String name1 = event.asStartElement().getName().getLocalPart();
        if (name1.equals("dict")) {
            while ((event = parser.nextTag()).isStartElement()) {
                final String name2 = event.asStartElement().getName().getLocalPart();
                if (name2.equals("key")) {
                    String key = parser.getElementText();
                    System.out.println("key: " + key);
                } else if (name2.equals("integer")) {
                    String number = parser.getElementText();
                    System.out.println("integer: " + number);
                } else if (name2.equals("string")) {
                    String str = parser.getElementText();
                    System.out.println("string: " + str);
                }
            }
        }
        XMLEvent endDocument = parser.nextEvent();
        assert endDocument.isEndDocument();
    }
}

The dd-plist library enables your Java application to handle property lists of various formats:
Read / write property lists from / to files, streams or byte arrays
Convert between property list formats
Property list contents are provided as objects from the NeXTSTEP environment (NSDictionary, NSArray, NSString, etc.)
Serialize native java data structures to property list objects
Deserialize from property list objects to native java data structures
<dependency>
<groupId>com.googlecode.plist</groupId>
<artifactId>dd-plist</artifactId>
<version>1.26</version>
</dependency>
dd-plist
