XML file reading in Java

Is it necessary to know the structure and tags of an XML file completely before reading it in Java?
areaElement.getElementsByTagName("checked").item(0).getTextContent()
I don't know the field name "checked" before I read the file. Is there any way to list all the tags in the XML file, basically the file structure?

I wrote this recursive DOM parser myself; it parses your XML without knowing a single tag name in advance. It outputs the text content of each node, where present, in document order. To print node names as well, uncomment the commented-out section in the code below. Hope it helps.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class RecDOMP {
public static void main(String[] args) throws Exception{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
// replace following path with your input xml path
Document doc = db.parse(new FileInputStream(new File ("D:\\ambuj\\ATT\\apip\\APIP_New.xml")));
// replace following path with your output xml path
File OutputDOM = new File("D:\\ambuj\\ATT\\apip\\outapip1.txt");
FileOutputStream fostream = new FileOutputStream(OutputDOM);
OutputStreamWriter oswriter = new OutputStreamWriter (fostream);
BufferedWriter bwriter = new BufferedWriter(oswriter);
// if file doesnt exists, then create it
if (!OutputDOM.exists()) {
OutputDOM.createNewFile();}
visitRecursively(doc,bwriter);
bwriter.close(); oswriter.close(); fostream.close();
System.out.println("Done");
}
public static void visitRecursively(Node node, BufferedWriter bw) throws IOException{
// get all child nodes
NodeList list = node.getChildNodes();
for (int i=0; i<list.getLength(); i++) {
// get child node
Node childNode = list.item(i);
if (childNode.getNodeType() == Node.TEXT_NODE)
{
//System.out.println("Found Node: " + childNode.getNodeName()
// + " - with value: " + childNode.getNodeValue()+" Node type:"+childNode.getNodeType());
String nodeValue= childNode.getNodeValue();
nodeValue=nodeValue.replace("\n","").replaceAll("\\s","");
if (!nodeValue.isEmpty())
{
System.out.println(nodeValue);
bw.write(nodeValue);
bw.newLine();
}
}
visitRecursively(childNode,bw);
}
}
}

You should definitely check out libraries for this, like dom4j (http://dom4j.sourceforge.net/). They can parse the whole XML document and let you not only list things like elements but do XPath queries and other such cool stuff on them.
Parsing the whole document comes with a performance hit, especially for large XML documents, so measure that cost for your use case before committing to a library. This is especially true if you only need a small bit out of the XML document (and you kind of know what you are looking for already).

The answer to your question is no, it is not necessary to know any element names in advance. For example, you can walk the tree to discover the element names. But it all depends what you are actually trying to do.
For the vast majority of applications, incidentally, the Java DOM is one of the worst ways to solve the problem. But I won't comment further without knowing your project requirements.
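Walking the tree to discover element names, as described above, can be sketched with the JDK's own DOM API. This is only a minimal illustration under assumptions: the class name ElementNameLister, the helper listElementNames, and the sample XML (apart from the "checked" tag from the question) are made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementNameLister {

    // Returns the distinct element names, in order of first appearance.
    public static Set<String> listElementNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // "*" matches every element in the document, whatever its name.
        NodeList all = doc.getElementsByTagName("*");
        Set<String> names = new LinkedHashSet<>();
        for (int i = 0; i < all.getLength(); i++) {
            names.add(all.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input; only "checked" comes from the question.
        String xml = "<area><name>Hall</name><checked>true</checked></area>";
        System.out.println(listElementNames(xml)); // prints [area, name, checked]
    }
}
```

Once the names are known, a call such as getElementsByTagName("checked") can be made with a name read from this set instead of a hard-coded one.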

Related

XML to csv in java using DOM

I want to convert an XML file to CSV (a comma-separated file), and I am using the DOM parser in Java to do it.
The output of the code below is: AAA123456
The desired output is: AAA,123,456
This is what I have developed so far. I hope to separate the values by node to produce the CSV.
public class Main {
static public final String SEPARATOR = ",";
private static String decodeDetailOutputRecordXML(String str) throws ParserConfigurationException, IOException, SAXException {
str = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
Document doc =DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(str.getBytes()));
DocumentTraversal traversal = (DocumentTraversal) doc;
NodeIterator iterator = traversal.createNodeIterator(doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
for (Node n = iterator.nextNode(); n != null; n = iterator.nextNode()) {
out.println(n.getTextContent());
}
return "";
}
public static void main(String[] args) throws Exception {
decodeDetailOutputRecordXML(null);
return;
}
}
This answer demonstrates how to use the DOM API to convert the XML format under consideration to CSV. The example code below uses the DOM API directly and OpenCSV to write the CSV file.
The Example XML
<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
<c>Somedata0</c>
<d>Somedata1</d>
<e>Somedata2</e>
</b>
<b>
<c>Xdata0</c>
<d>Xdata1</d>
<e>Xdata2</e>
</b>
</a>
The routine that converts the XML to CSV
package org.test;
import java.io.FileInputStream;
import java.io.FileWriter;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import com.opencsv.CSVWriter;
public class XMLToCSVTest {
public static void main(String[] args) throws Exception{
String inputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdata.xml";
String outputFilePath="D:\\workspaces\\mtplatform\\TechTest\\testfiles\\testdataOut.csv";
/*
* We assume that we know the structure and the column names of the CSV file
*/
String[] csvHeaders=new String[] {"c","d","e"};
/*
* Using Xerces DOM parser directly, same can also be achieved through JAXP
*/
DOMParser parser=new DOMParser();
try(FileInputStream fis=new FileInputStream(inputFilePath);
CSVWriter writer=new CSVWriter(new FileWriter(outputFilePath));){
/*
* Write the CSV headers
*/
writer.writeNext(csvHeaders);
InputSource source=new InputSource(fis);
parser.parse(source);
Element documentElement=parser.getDocument().getDocumentElement();
/*
* We assume that we know the structure of the XML completely and we also assume the data is actually there, that is
* no elements are missing being optional.
*/
NodeList elementBList=documentElement.getElementsByTagName("b");
for(int i=0;i<elementBList.getLength();i++) {
Element elementB=(Element)elementBList.item(i);
Element elementC=(Element)elementB.getElementsByTagName("c").item(0);
Element elementD=(Element)elementB.getElementsByTagName("d").item(0);
Element elementE=(Element)elementB.getElementsByTagName("e").item(0);
String[] line=new String[] {elementC.getFirstChild().getNodeValue(),
elementD.getFirstChild().getNodeValue(),
elementE.getFirstChild().getNodeValue()};
writer.writeNext(line);
}//for closing
writer.flush();
}catch(Exception e) {e.printStackTrace();}
}//main closing
}//class closing
The CSV output
"c","d","e"
"Somedata0","Somedata1","Somedata2"
"Xdata0","Xdata1","Xdata2"
NOTE: The above is one way to convert XML to CSV using the DOM API directly. While the DOM API gives a lot of flexibility, it is also slightly complicated to use. XML is hierarchical data and can be difficult to express as CSV, which is a flat data structure, without either some loss of fidelity or a more complicated CSV structure; a case in point is multiple occurrences of a specific child element (multi-valued data in general). The CSV output could also be written by hand as part of the routine, but that would be tedious and error-prone, which is why OpenCSV is used here.
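For comparison, here is a minimal sketch of the hand-written variant mentioned in the note, using only JAXP and no OpenCSV. The class name XmlRowsToCsv and the helper toCsv are hypothetical. The sketch assumes flat rows and does no quoting or escaping, so values containing commas, quotes, or line breaks would produce invalid CSV; that is exactly the tedious part a CSV library handles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlRowsToCsv {

    // Emits one CSV line per <rowTag> element, one cell per child element.
    public static String toCsv(String xml, String rowTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder csv = new StringBuilder();
        NodeList rows = doc.getElementsByTagName(rowTag);
        for (int i = 0; i < rows.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = rows.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace text nodes between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    cells.add(child.getTextContent().trim());
                }
            }
            csv.append(String.join(",", cells)).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) throws Exception {
        // The sample string from the question.
        String xml = "<a><b><c>AAA</c><d>123</d><e>456</e></b></a>";
        System.out.print(toCsv(xml, "b")); // prints AAA,123,456
    }
}
```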

Using POI To read/write a doc with the full POIFSFileSystem

Like everybody else, it seems, I want to replace some items with others in a Word doc.
The catch is that the doc contains headers and footers, which are part of the POIFSFileSystem (I know this because reading the FS and writing the doc back, without any changes, loses this information, whereas reading the FS and writing it back as a new file does not).
Currently I do this:
POIFSFileSystem pfs = new POIFSFileSystem(fis);
HWPFDocument document = new HWPFDocument(pfs);
Range r1 = document.getRange();
…
document.write();
ByteArrayOutputStream bos = new ByteArrayOutputStream(50000);
pfs.writeFilesystem(bos);
pfs.close();
However this fails, with this error:
Opened read-only or via an InputStream, a Writeable File is required
If I don't rewrite the document, it works fine, but my changes are lost.
Conversely, if I save only the document and not the filesystem, I lose the header/footer.
So the question is: how can I update the document while "saving as" the entire filesystem, or is there a way to force the document to contain everything from the filesystem?
HWPF has always been in the scratchpad because the DOC binary file format is the most horrible of all the Horrible formats. So it is not really finished and will be buggy in many cases.
But in your particular case, I cannot reproduce your observations. Using Apache POI 4.0.1, the HWPFDocument created from a *.doc file contains the header story, which also contains the footer stories. So the following works for me:
Source:
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;
public class ReadAndWriteDOCWithHeaderFooter {
public static void main(String[] args) throws Exception {
HWPFDocument document = new HWPFDocument(new FileInputStream("TemplateDOCWithHeaderFooter.doc"));
Range bodyRange = document.getRange();
System.out.println(bodyRange);
for (int p = 0; p < bodyRange.numParagraphs(); p++) {
System.out.println(bodyRange.getParagraph(p).text());
if (bodyRange.getParagraph(p).text().contains("<<NAME>>"))
bodyRange.getParagraph(p).replaceText("<<NAME>>", "Axel Richter");
if (bodyRange.getParagraph(p).text().contains("<<DATE>>"))
bodyRange.getParagraph(p).replaceText("<<DATE>>", "12/21/1964");
if (bodyRange.getParagraph(p).text().contains("<<AMOUNT>>"))
bodyRange.getParagraph(p).replaceText("<<AMOUNT>>", "1,234.56");
System.out.println(bodyRange.getParagraph(p).text());
}
System.out.println("==============================================================================");
Range overallRange = document.getOverallRange();
System.out.println(overallRange);
for (int p = 0; p < overallRange.numParagraphs(); p++) {
System.out.println(overallRange.getParagraph(p).text()); // contains all inclusive header and footer
}
FileOutputStream out = new FileOutputStream("ResultDOCWithHeaderFooter.doc");
document.write(out);
out.close();
document.close();
}
}
Result:
So please check again and tell us exactly what is not working for you. Because we need to reproduce the problem, please provide a minimal, complete, and verifiable example, as I have done with my code.

Reading equations & formula from Word (Docx) to html and save database using java

I have a Word (docx) file that contains equations like those in the images below.
I want to read the data from the docx file and save it to my database,
so that I can later fetch the data from the database and show it on my HTML page.
I used Apache POI to read the data from the docx file, but it cannot extract the equations.
Please help me!
Word *.docx files are ZIP archives containing XML files in the Office Open XML format. The formulas contained in Word *.docx documents are Office MathML (OMML).
Unfortunately, this XML format is not well known outside Microsoft Office, so it is not directly usable in HTML, for example. Fortunately, it is XML, and as such it can be transformed with XSLT. So we can transform OMML into MathML, for example, which is usable in a wider range of use cases.
An XSLT transformation is driven mainly by an XSL definition of the transformation. Unfortunately, creating one is not easy either. Fortunately, Microsoft has already done that: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%\. If you cannot find it, search the web for it.
Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.
My example stores the formulas it finds as MathML in an ArrayList of strings. You should also be able to store these strings in your database.
The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not shipped with the smaller poi-ooxml-schemas jar.
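The XSLT step described above can be tried on its own with only the JDK. The following is a minimal sketch under assumptions: the stylesheet and the input are toys invented for illustration (the element names frac, num, and den and the class XsltSketch are hypothetical; they are not OMML and are not taken from OMML2MML.XSL). It only shows the mechanics of applying a stylesheet to a DOM node and capturing the result as a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XsltSketch {

    // Toy stylesheet (hypothetical, NOT OMML2MML.XSL): maps a made-up
    // <frac><num/><den/></frac> structure to MathML-like element names.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='frac'><mfrac><xsl:apply-templates/></mfrac></xsl:template>"
      + "<xsl:template match='num|den'><mn><xsl:value-of select='.'/></mn></xsl:template>"
      + "</xsl:stylesheet>";

    public static String transform(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // XSLT processors expect namespace-aware DOM trees
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<frac><num>1</num><den>2</den></frac>"));
    }
}
```

With the real OMML2MML.XSL, the stylesheet would be loaded from a file and the DOM node would come from the Word document instead of a string.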
Word document:
Java code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadFormulas {
static File stylesheet = new File("OMML2MML.XSL");
static TransformerFactory tFactory = TransformerFactory.newInstance();
static StreamSource stylesource = new StreamSource(stylesheet);
static String getMathML(CTOMath ctomath) throws Exception {
Transformer transformer = tFactory.newTransformer(stylesource);
Node node = ctomath.getDomNode();
DOMSource source = new DOMSource(node);
StringWriter stringwriter = new StringWriter();
StreamResult result = new StreamResult(stringwriter);
transformer.setOutputProperty("omit-xml-declaration", "yes");
transformer.transform(source, result);
String mathML = stringwriter.toString();
stringwriter.close();
//The stock OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
//We don't need them since we want to use the MathML in HTML, not in XML.
//So ideally we should change OMML2MML.XSL so that it does not emit them.
//But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
mathML = mathML.replaceAll("xmlns:mml", "xmlns");
mathML = mathML.replaceAll("mml:", "");
return mathML;
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));
//storing the found MathML in an ArrayList of strings
List<String> mathMLList = new ArrayList<String>();
//getting the formulas out of all body elements
for (IBodyElement ibodyelement : document.getBodyElements()) {
if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
for (CTOMath ctomath : ctomathpara.getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
}
} else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
XWPFTable table = (XWPFTable)ibodyelement;
for (XWPFTableRow row : table.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
for (CTOMath ctomath : ctomathpara.getOMathList()) {
mathMLList.add(getMathML(ctomath));
}
}
}
}
}
}
}
document.close();
//creating a sample HTML file
String encoding = "UTF-8";
FileOutputStream fos = new FileOutputStream("result.html");
OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
writer.write("<!DOCTYPE html>\n");
writer.write("<html lang=\"en\">");
writer.write("<head>");
writer.write("<meta charset=\"utf-8\"/>");
//using MathJax for helping all browsers to interpret MathML
writer.write("<script type=\"text/javascript\"");
writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
writer.write(">");
writer.write("</script>");
writer.write("</head>");
writer.write("<body>");
writer.write("<p>Following formulas was found in Word document: </p>");
int i = 1;
for (String mathML : mathMLList) {
writer.write("<p>Formula" + i++ + ":</p>");
writer.write(mathML);
writer.write("<p/>");
}
writer.write("</body>");
writer.write("</html>");
writer.close();
Desktop.getDesktop().browse(new File("result.html").toURI());
}
}
Result:
Just tested this code using apache poi 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for apache poi 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 for what ooxml libraries are needed for what apache poi version.
Adding to @Axel Richter's answer: I found it really hard to find the required set of dependencies:
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>ooxml-schemas</artifactId>
<version>1.4</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.15</version>
</dependency>
And with Office 2019, I guess they no longer provide OMML2MML.XSL, so here is a link to it: https://github.com/Versal/word2markdown/blob/master/libs/omml2mml.xsl

Replace XML text node with an element node

I am using Groovy, so a Java implementation would also be fine.
I have
"""<TextFlow fontFamily="Arial" fontSize="20"><span>before</span>Less than 7 days<span>after</span></TextFlow>"""
I would like to wrap the first-level text node in a tag. So I would like to get
"""<TextFlow fontFamily="Arial" fontSize="20"><span>before</span><span>Less than 7 days</span><span>after</span></TextFlow>"""
I have looked into XmlSlurper, which doesn't deal with text nodes. I have also looked into XmlParser, which can handle text nodes, but I am not sure how to replace a text node with an XML element. Please advise.
This worked for me, hope it'd help someone else
@Grab('org.jdom:jdom2:2.0.5')
@Grab('jaxen:jaxen:1.1.4')
@GrabExclude('jdom:jdom')
import org.jdom2.*
import org.jdom2.input.*
import org.jdom2.xpath.*
import org.jdom2.output.*
def xml = """<TextFlow fontFamily="Arial" fontSize="20"><span>before</span>Less than 7 days<span>after</span></TextFlow>"""
Document doc = new SAXBuilder().build(new StringReader(xml))
def urls = XPathFactory.instance().compile('//TextFlow/text()').evaluate(doc)
for(def c in urls) {
int pos = c.parent.content.indexOf(c)
Element span = new Element("span")
span.text = c.text
c.parent.setContent(pos, span)
}
new XMLOutputter().with {
format = Format.getRawFormat()
format.setLineSeparator(LineSeparator.NONE)
// XmlOutputter can write to OutputStream or Writer, which is sufficient for most cases
output(doc, System.out)
}
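Since the question says a Java implementation would also be fine, here is a minimal sketch of the same idea with the JDK's built-in DOM API instead of JDOM. It is an illustration under assumptions, not a drop-in replacement: the class name WrapTextNodes and the helper wrapTopLevelText are hypothetical, and only non-blank text nodes that are direct children of the root element are wrapped.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WrapTextNodes {

    public static String wrapTopLevelText(String xml, String wrapperTag) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Copy the children first: NodeList is live and changes while we replace nodes.
        NodeList live = root.getChildNodes();
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            children.add(live.item(i));
        }

        for (Node child : children) {
            if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty()) {
                Element wrapper = doc.createElement(wrapperTag);
                wrapper.setTextContent(child.getNodeValue());
                root.replaceChild(wrapper, child); // swap the text node for the new element
            }
        }

        // Serialize the modified tree back to a string without an XML declaration.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<TextFlow fontFamily=\"Arial\" fontSize=\"20\">"
                + "<span>before</span>Less than 7 days<span>after</span></TextFlow>";
        System.out.println(wrapTopLevelText(xml, "span"));
    }
}
```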

Java Plist XML Parsing

I'm parsing a (not well-formed) Apple plist file with Java.
My code looks like this:
InputStream in = new FileInputStream( "foo" );
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader( in );
while (parser.hasNext()){
XMLEvent event = parser.nextEvent();
//code to navigate the nodes
}
The parts I'm parsing look like this:
<dict>
<key>foo</key><integer>123</integer>
<key>bar</key><string>Boom &amp; Shroom</string>
</dict>
My problem is that nodes containing an ampersand are not parsed as they should be, because the ampersand introduces an entity.
What can I do to get the value of the node as one complete String instead of broken parts?
Thank you in advance.
You should be able to solve your problem by setting the IS_COALESCING property on the XMLInputFactory (I also prefer XMLStreamReader over XMLEventReader, but ymmv):
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
InputStream in = // ...
XMLStreamReader xmlReader = factory.createXMLStreamReader(in, "UTF-8");
Incidentally, to the best of my knowledge none of the JDK parsers will handle "not well formed" XML without choking. Your XML is, in fact, well-formed: it uses an entity rather than a raw ampersand.
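To see the effect of IS_COALESCING in isolation, here is a minimal, self-contained sketch. The class name CoalescingSketch and the helper firstText are hypothetical, and the input is read from a string instead of a file. With coalescing enabled, the text around the entity reference should arrive as a single CHARACTERS event.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class CoalescingSketch {

    // Returns the text of the first element with the given name, or null if none is found.
    public static String firstText(String xml, String elementName) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent character data (including resolved entities) into one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // The event after the start tag is CHARACTERS unless the element is empty.
                    return reader.next() == XMLStreamConstants.CHARACTERS
                            ? reader.getText()
                            : "";
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dict><key>bar</key><string>Boom &amp; Shroom</string></dict>";
        System.out.println(firstText(xml, "string")); // prints Boom & Shroom
    }
}
```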
There is a predefined method getElementText(), which is buggy in jdk1.6.0_15, but works ok with jdk1.6.0_19. A complete program to easily parse the plist file is this:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
public class Parser {
public static void main(String[] args) throws XMLStreamException, IOException {
InputStream in = new FileInputStream("foo.xml");
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader parser = factory.createXMLEventReader(in);
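// Consume StartDocument outside the assert so the call also runs when assertions are disabled
XMLEvent startDocument = parser.nextEvent();
assert startDocument.isStartDocument();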
XMLEvent event = parser.nextTag();
assert event.isStartElement();
final String name1 = event.asStartElement().getName().getLocalPart();
if (name1.equals("dict")) {
while ((event = parser.nextTag()).isStartElement()) {
final String name2 = event.asStartElement().getName().getLocalPart();
if (name2.equals("key")) {
String key = parser.getElementText();
System.out.println("key: " + key);
} else if (name2.equals("integer")) {
String number = parser.getElementText();
System.out.println("integer: " + number);
} else if (name2.equals("string")) {
String str = parser.getElementText();
System.out.println("string: " + str);
}
}
}
assert parser.nextEvent().isEndDocument();
}
}
The dd-plist library enables your Java application to handle property lists of various formats:
Read / write property lists from / to files, streams or byte arrays
Convert between property list formats
Property list contents are provided as objects from the NeXTSTEP environment (NSDictionary, NSArray, NSString, etc.)
Serialize native java data structures to property list objects
Deserialize from property list objects to native java data structures
<dependency>
<groupId>com.googlecode.plist</groupId>
<artifactId>dd-plist</artifactId>
<version>1.26</version>
</dependency>
dd-plist
