XML to csv in java using DOM


I want to convert an XML file to a CSV (comma-separated) file, and I am using the DOM parser in Java for that.
The output of the code below is: AAA123456
The desired output is: AAA,123,456
This is what I have developed so far. I hope to separate the values by node name as CSV.
public class Main {
static public final String SEPARATOR = ",";
private static String decodeDetailOutputRecordXML(String str) throws ParserConfigurationException, IOException, SAXException {
str = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
Document doc =DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(str.getBytes()));
DocumentTraversal traversal = (DocumentTraversal) doc;
NodeIterator iterator = traversal.createNodeIterator(doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
System.out.println(n.getTextContent());
}
return "";
}
public static void main(String[] args) throws Exception {
decodeDetailOutputRecordXML(null);
return;
}
}

This answer demonstrates how to use the DOM API to convert the XML format under consideration to CSV. The example code below uses the DOM API directly and OpenCSV to write the CSV file.
The Example XML
<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
<c>Somedata0</c>
<d>Somedata1</d>
<e>Somedata2</e>
</b>
<b>
<c>Xdata0</c>
<d>Xdata1</d>
<e>Xdata2</e>
</b>
</a>
The routine that converts the XML to CSV
package org.test;
import java.io.FileInputStream;
import java.io.FileWriter;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import com.opencsv.CSVWriter;
public class XMLToCSVTest {
public static void main(String[] args) throws Exception{
String inputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdata.xml";
String outputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdataOut.csv";
/*
* We assume that we know the structure and the column names of the CSV file
*/
String[] csvHeaders=new String[] {"c","d","e"};
/*
* Using Xerces DOM parser directly, same can also be achieved through JAXP
*/
DOMParser parser=new DOMParser();
try(FileInputStream fis=new FileInputStream(inputFilePath);
CSVWriter writer=new CSVWriter(new FileWriter(outputFilePath));){
/*
* Write the CSV headers
*/
writer.writeNext(csvHeaders);
InputSource source=new InputSource(fis);
parser.parse(source);
Element documentElement=parser.getDocument().getDocumentElement();
/*
* We assume that we know the structure of the XML completely and we also assume the data is actually there, that is
* no elements are missing being optional.
*/
NodeList elementBList=documentElement.getElementsByTagName("b");
for(int i=0;i<elementBList.getLength();i++) {
Element elementB=(Element)elementBList.item(i);
Element elementC=(Element)elementB.getElementsByTagName("c").item(0);
Element elementD=(Element)elementB.getElementsByTagName("d").item(0);
Element elementE=(Element)elementB.getElementsByTagName("e").item(0);
String[] line=new String[] {elementC.getFirstChild().getNodeValue(),
elementD.getFirstChild().getNodeValue(),
elementE.getFirstChild().getNodeValue()};
writer.writeNext(line);
}//for closing
writer.flush();
}catch(Exception e) {e.printStackTrace();}
}//main closing
}//class closing
The CSV output
"c","d","e"
"Somedata0","Somedata1","Somedata2"
"Xdata0","Xdata1","Xdata2"
NOTE: The above is one way to convert XML to CSV using the DOM API directly. While the DOM API gives a lot of flexibility, it is also slightly complicated to use. XML is hierarchical data, so it can be difficult to express as CSV, which is a flat data structure, without either some loss of fidelity or a more complicated CSV structure. A case in point is multiple occurrences of a specific child element (in general, multi-valued data). The CSV output could also be written by hand as part of the routine, but that would be tedious and error-prone, which is why OpenCSV is used here.
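For comparison, the same conversion can be sketched with JAXP (DocumentBuilderFactory) and hand-written CSV output, so that only the JDK is needed. This is a minimal sketch, not a replacement for OpenCSV: the names JaxpXmlToCsvSketch, toCsv and escape are made up for this illustration, and it assumes that every row element (here "b") contains only simple child elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, named for this illustration only.
class JaxpXmlToCsvSketch {

    // Quote a field only when it contains a comma, a quote, or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // One CSV line per <rowTag> element, one field per child element in document order.
    static String toCsv(String xml, String rowTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder csv = new StringBuilder();
        NodeList rows = doc.getDocumentElement().getElementsByTagName(rowTag);
        for (int i = 0; i < rows.getLength(); i++) {
            boolean first = true;
            NodeList children = rows.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace-only text nodes between elements
                }
                if (!first) {
                    csv.append(',');
                }
                csv.append(escape(child.getTextContent()));
                first = false;
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) throws Exception {
        // The XML from the question; prints AAA,123,456
        System.out.print(toCsv("<a><b><c>AAA</c><d>123</d><e>456</e></b></a>", "b"));
    }
}
```

Because the fields are taken from whatever child elements are present, a missing optional element shifts the columns; a fixed header list, as in the answer above, is the safer choice when the structure is known.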

Related

bidi string can't be read from Word (Apache POI)

I'm writing a bidi String to an MS Word file using Apache POI after wrapping it with the sequence
aString = "\u202E" + aString + "\u202C";
The text renders correctly in the file, and reads fine when I retrieve the string again. But if I modify the file in any way, reading that string suddenly returns true for isBlank().
Thank you in advance for any suggestions/help!

When Microsoft Word stores bidirectional text in its Office Open XML *.docx format, it sometimes uses special text run elements, w:bdo (bidirectional override). Apache POI does not read those elements so far. So if an XWPFParagraph contains such elements, paragraph.getText() will return an empty string.
One could use org.apache.xmlbeans.XmlCursor to really get all the text from all XWPFParagraphs, like so:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlCursor;
public class ReadWordParagraphs {
static String getAllTextFromParagraph(XWPFParagraph paragraph) {
XmlCursor cursor = paragraph.getCTP().newCursor();
return cursor.getTextValue();
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordDocument.docx"));
for (XWPFParagraph paragraph : document.getParagraphs()) {
System.out.println(paragraph.getText()); // will not return text in w:bdo elements
System.out.println(getAllTextFromParagraph(paragraph)); // will return all text content of paragraph
}
}
}

Reading equations & formula from Word (Docx) to html and save database using java

I have a Word (.docx) file that contains equations.
I want to read the data from the .docx file and save it to my database,
and, when needed, get the data from the database and show it on my HTML page.
I used Apache POI to read data from the .docx file, but it can't extract the equations.
Please help me!

Word *.docx files are ZIP archives containing XML files in the Office Open XML format. The formulas contained in Word *.docx documents are Office MathML (OMML).
Unfortunately, this XML format is not well known outside Microsoft Office, so it is not directly usable in HTML, for example. Fortunately, it is XML, and as such it can be transformed using XSLT. So we can transform the OMML into MathML, for example, which is usable in a wider range of use cases.
A transformation via XSLT is mainly based on an XSL definition of the transformation. Unfortunately, creating one is not easy either. Fortunately, Microsoft has already done that: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%\. If you don't find it, search the web for it.
Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.
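To illustrate only the XSLT step in isolation, here is a minimal sketch that uses the JDK's javax.xml.transform API with a toy stylesheet. The stylesheet, the greeting element and the class name XsltSketch are made up for this illustration and have nothing to do with OMML2MML.XSL; the mechanics (build a Transformer from a stylesheet, then transform a source into a result) are the same.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class, named for this illustration only.
class XsltSketch {

    // Toy stylesheet (an assumption for this example): turns <greeting> into <p>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/greeting'><p><xsl:value-of select='.'/></p></xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        // Compile the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // Leave out the <?xml ...?> declaration, as the answer does for MathML.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<greeting>Hello</greeting>"));
    }
}
```

In the real case the stylesheet would be loaded from the OMML2MML.XSL file and the source would be a DOMSource wrapping the formula node.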
My example stores the formulas found as MathML in an ArrayList of strings. You should also be able to store these strings in your database.
The example needs the full ooxml-schemas-1.3.jar, as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not shipped with the smaller poi-ooxml-schemas jar.
Java code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadFormulas {
static File stylesheet = new File("OMML2MML.XSL");
static TransformerFactory tFactory = TransformerFactory.newInstance();
static StreamSource stylesource = new StreamSource(stylesheet);
static String getMathML(CTOMath ctomath) throws Exception {
Transformer transformer = tFactory.newTransformer(stylesource);
Node node = ctomath.getDomNode();
DOMSource source = new DOMSource(node);
StringWriter stringwriter = new StringWriter();
StreamResult result = new StreamResult(stringwriter);
transformer.setOutputProperty("omit-xml-declaration", "yes");
transformer.transform(source, result);
String mathML = stringwriter.toString();
stringwriter.close();
//The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
//We don't need those, since we want to use the MathML in HTML, not in XML.
//Ideally we would change OMML2MML.XSL so that it does not emit them.
//But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
mathML = mathML.replaceAll("xmlns:mml", "xmlns");
mathML = mathML.replaceAll("mml:", "");
return mathML;
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
//storing the found MathML in an ArrayList of strings
List<String> mathMLList = new ArrayList<String>();
//getting the formulas out of all body elements
for (IBodyElement ibodyelement : document.getBodyElements()) {
if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
for (CTOMath ctomath : ctomathpara.getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
}
} else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
XWPFTable table = (XWPFTable)ibodyelement;
for (XWPFTableRow row : table.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
for (CTOMath ctomath : ctomathpara.getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
}
}
}
}
}
}
document.close();
//creating a sample HTML file
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("result.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
writer.write("<!DOCTYPE html>\n");
writer.write("<html lang=\"en\">");
writer.write("<head>");
writer.write("<meta charset=\"utf-8\"/>");
//using MathJax for helping all browsers to interpret MathML
writer.write("<script type=\"text/javascript\"");
writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
writer.write(">");
writer.write("</script>");
writer.write("</head>");
writer.write("<body>");
writer.write("<p>The following formulas were found in the Word document: </p>");
int i = 1;
for (String mathML : mathMLList) {
writer.write("<p>Formula" + i++ + ":</p>");
writer.write(mathML);
writer.write("<p/>");
}
writer.write("</body>");
writer.write("</html>");
writer.close();
Desktop.getDesktop().browse(new File("result.html").toURI());
}
}
I just tested this code with Apache POI 5.0.0 and it works. For Apache POI 5.0.0 you need poi-ooxml-full-5.0.0.jar. Please read https://poi.apache.org/help/faq.html#faq-N10025 to see which ooxml libraries are needed for which Apache POI version.

Adding to @Axel Richter's answer: I found it really hard to find the required set of dependencies, so here they are.
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>ooxml-schemas</artifactId>
<version>1.4</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.15</version>
</dependency>
And with Office 2019 I guess they don't provide OMML2MML.XSL, so here's a link to it: https://github.com/Versal/word2markdown/blob/master/libs/omml2mml.xsl

XML file reading in Java

Is it necessary to know the structure and tags of an XML file completely before reading it in Java?
areaElement.getElementsByTagName("checked").item(0).getTextContent()
I don't know the field name "checked" before I read the file. Is there any way to list all the tags in the XML file, basically the file structure?

I wrote this DOM parser myself. It uses recursion to parse your XML without knowing a single tag in advance. It gives you each node's text content, if it exists, in document order. You can uncomment the commented-out section in the code below to also get the node name. I hope it helps.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class RecDOMP {
public static void main(String[] args) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
// replace following path with your input xml path
Document doc = db.parse(new FileInputStream(new File ("D:\\ambuj\\ATT\\apip\\APIP_New.xml")));
// replace following path with your output xml path
File OutputDOM = new File("D:\\ambuj\\ATT\\apip\\outapip1.txt");
FileOutputStream fostream = new FileOutputStream(OutputDOM);
OutputStreamWriter oswriter = new OutputStreamWriter (fostream);
BufferedWriter bwriter = new BufferedWriter(oswriter);
// if file doesnt exists, then create it
if (!OutputDOM.exists()) {
OutputDOM.createNewFile();}
visitRecursively(doc,bwriter);
bwriter.close(); oswriter.close(); fostream.close();
System.out.println("Done");
}
public static void visitRecursively(Node node, BufferedWriter bw) throws IOException{
// get all child nodes
NodeList list = node.getChildNodes();
for (int i=0; i<list.getLength(); i++) {
// get child node
Node childNode = list.item(i);
if (childNode.getNodeType() == Node.TEXT_NODE)
{
//System.out.println("Found Node: " + childNode.getNodeName()
// + " - with value: " + childNode.getNodeValue()+" Node type:"+childNode.getNodeType());
String nodeValue= childNode.getNodeValue();
nodeValue=nodeValue.replace("\n","").replaceAll("\\s","");
if (!nodeValue.isEmpty())
{
System.out.println(nodeValue);
bw.write(nodeValue);
bw.newLine();
}
}
visitRecursively(childNode,bw);
}
}
}

You should definitely check out libraries for this, like dom4j (http://dom4j.sourceforge.net/). They can parse the whole XML document and let you not only list things like elements but do XPath queries and other such cool stuff on them.
There is a performance hit, especially in large XML documents, so you will want to check on the performance hit for your use case before committing to a library. This is especially true if you only need a small bit out of the XML document (and you kind of know what you are looking for already).

The answer to your question is no, it is not necessary to know any element names in advance. For example, you can walk the tree to discover the element names. But it all depends what you are actually trying to do.
For the vast majority of applications, incidentally, the Java DOM is one of the worst ways to solve the problem. But I won't comment further without knowing your project requirements.
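As a small illustration of "walking the tree to discover the element names", here is a sketch that collects the distinct element paths of a document using only the JDK's DOM API. The class and method names (ElementNameLister, elementPaths, collect) and the sample XML are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, named for this illustration only.
class ElementNameLister {

    // Recursively records the path of every element below (and including) node.
    static void collect(Node node, String parentPath, Set<String> paths) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // ignore text, comments, and so on
        }
        String path = parentPath + "/" + node.getNodeName();
        paths.add(path); // the set keeps each distinct path once, in first-seen order
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), path, paths);
        }
    }

    static Set<String> elementPaths(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Set<String> paths = new LinkedHashSet<>();
        collect(doc.getDocumentElement(), "", paths);
        return paths;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<area><checked>true</checked><name>x</name><name>y</name></area>";
        // prints [/area, /area/checked, /area/name]
        System.out.println(elementPaths(xml));
    }
}
```

Once the paths are known, a name such as "checked" can be passed to getElementsByTagName as in the question.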

How to preserve whitespace in attributes when using XMLStreamWriter?

When using javax.xml.stream.XMLStreamWriter, is there any way to preserve the whitespace within attributes? I understand that the XMLStreamReader will perform Attribute-Value Normalization, converting \r\n\t in the XML into spaces, so it's up to the writer to emit entity references (e.g. &#xA;) to preserve the whitespace. Is there any way to tell the writer to use entity references for whitespace? Can I add entity references to attributes myself?
The following JUnit 3 test passes. When I encode "Hello,\r\n\tWorld! ", I want to get the same thing back out. But instead, the decoded value is "Hello,  World! " (two spaces between the words).
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.*;
import junit.framework.TestCase;
public class XmlStreamTest extends TestCase {
public void testAttribute() throws XMLStreamException {
StringWriter stringWriter = new StringWriter();
XMLStreamWriter xmlStreamWriter = XMLOutputFactory.newFactory().createXMLStreamWriter(stringWriter);
xmlStreamWriter.writeStartDocument();
xmlStreamWriter.writeStartElement("root");
xmlStreamWriter.writeAttribute("a", "Hello,\r\n\tWorld! ");
xmlStreamWriter.writeEndElement();
xmlStreamWriter.writeEndDocument();
xmlStreamWriter.close();
assertEquals("<?xml version=\"1.0\" ?><root a=\"Hello,\r\n\tWorld! \"></root>", stringWriter.toString());
StringReader stringReader = new StringReader(stringWriter.toString());
XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(stringReader);
assertEquals(XMLStreamConstants.START_DOCUMENT, xmlStreamReader.getEventType());
assertEquals(XMLStreamConstants.START_ELEMENT, xmlStreamReader.next());
// This is not what I want! I want the value to be the same as I originally gave!
assertEquals("Hello,  World! ", xmlStreamReader.getAttributeValue(null, "a"));
assertEquals(XMLStreamConstants.END_ELEMENT, xmlStreamReader.next());
assertEquals(XMLStreamConstants.END_DOCUMENT, xmlStreamReader.next());
}
}
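The reader side of this behaviour can be checked in isolation. The sketch below parses the same attribute twice, once with literal whitespace and once with numeric character references, to show that only the literal form is normalized. It does not show how to make XMLStreamWriter emit such references; whether a writer can be configured to do so depends on the StAX implementation. The class and method names (AttributeWhitespaceSketch, readAttribute) are made up for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class, named for this illustration only.
class AttributeWhitespaceSketch {

    // Returns the value of the named attribute on the first element of the document.
    static String readAttribute(String xml, String name) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getAttributeValue(null, name);
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Literal CR, LF and TAB inside the attribute: normalized to spaces by the parser.
        String literal = "<root a=\"Hello,\r\n\tWorld! \"/>";
        // The same characters written as character references: passed through unchanged.
        String escaped = "<root a=\"Hello,&#13;&#10;&#9;World! \"/>";

        System.out.println("Hello,  World! ".equals(readAttribute(literal, "a")));
        System.out.println("Hello,\r\n\tWorld! ".equals(readAttribute(escaped, "a")));
    }
}
```

So a document that is meant to round-trip such values has to contain the character references in its serialized form.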

Java Plist XML Parsing

I'm parsing a (not well formed) Apple Plist File with java.
My Code looks like this:
InputStream in = new FileInputStream( "foo" );
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader( in );
while (parser.hasNext()){
XMLEvent event = parser.nextEvent();
//code to navigate the nodes
}
The parts I'm parsing look like this:
<dict>
<key>foo</key><integer>123</integer>
<key>bar</key><string>Boom &amp; Shroom</string>
</dict>
My problem is that nodes containing an ampersand are not parsed as they should be, because the ampersand starts an entity.
What can I do to get the value of the node as one complete String instead of broken parts?
Thank you in advance.

You should be able to solve your problem by setting the IS_COALESCING property on the XMLInputFactory (I also prefer XMLStreamReader over XMLEventReader, but ymmv):
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
InputStream in = // ...
XMLStreamReader xmlReader = factory.createXMLStreamReader(in, "UTF-8");
Incidentally, to the best of my knowledge none of the JDK parsers will handle "not well formed" XML without choking. Your XML is, in fact, well-formed: it uses an entity rather than a raw ampersand.
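Here is a minimal sketch of the effect of IS_COALESCING, using the XMLEventReader from the question and an in-memory string instead of a file. The class and method names (CoalescingSketch, textEvents) are made up for this example. With coalescing switched on, the text of the string element arrives as a single Characters event; how the text is split without it depends on the parser implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class, named for this illustration only.
class CoalescingSketch {

    // Collects the data of every Characters event the reader reports.
    static List<String> textEvents(String xml, boolean coalescing) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        List<String> chunks = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    chunks.add(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<string>Boom &amp; Shroom</string>";
        // With coalescing: one chunk containing the whole text.
        System.out.println(textEvents(xml, true));
        // Without coalescing: possibly several chunks, depending on the parser.
        System.out.println(textEvents(xml, false));
    }
}
```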

There is a predefined method getElementText(), which is buggy in jdk1.6.0_15, but works ok with jdk1.6.0_19. A complete program to easily parse the plist file is this:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
public class Parser {
public static void main(String[] args) throws XMLStreamException, IOException {
InputStream in = new FileInputStream("foo.xml");
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader(in);
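// Consume the event outside the assert so it is also read when assertions are disabled.
XMLEvent startDocument = parser.nextEvent();
assert startDocument.isStartDocument();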
XMLEvent event = parser.nextTag();
assert event.isStartElement();
final String name1 = event.asStartElement().getName().getLocalPart();
if (name1.equals("dict")) {
while ((event = parser.nextTag()).isStartElement()) {
final String name2 = event.asStartElement().getName().getLocalPart();
if (name2.equals("key")) {
String key = parser.getElementText();
System.out.println("key: " + key);
} else if (name2.equals("integer")) {
String number = parser.getElementText();
System.out.println("integer: " + number);
} else if (name2.equals("string")) {
String str = parser.getElementText();
System.out.println("string: " + str);
}
}
}
// Same here: read the event first, then assert on it.
XMLEvent endDocument = parser.nextEvent();
assert endDocument.isEndDocument();
}
}

The dd-plist library enables your Java application to handle property lists of various formats. Its features:
Read / write property lists from / to files, streams or byte arrays
Convert between property list formats
Property list contents are provided as objects from the NeXTSTEP environment (NSDictionary, NSArray, NSString, etc.)
Serialize native java data structures to property list objects
Deserialize from property list objects to native java data structures
<dependency>
<groupId>com.googlecode.plist</groupId>
<artifactId>dd-plist</artifactId>
<version>1.26</version>
</dependency>
dd-plist
